The Rust compiler has a few assumptions that it makes about the behavior of all code. Violations of those assumptions are referred to as Undefined Behavior. Since Rust is a safe-by-default language, programmers usually do not have to worry about those rules (the compiler and libraries ensure that safe code always satisfies the assumptions), but authors of unsafe code are themselves responsible for upholding these requirements.
Those assumptions are listed in the Rust reference. The one that seems to be most surprising to many people is the clause which says that Rust code may not produce “[…] an invalid value, even in private fields and locals”. The reference goes on to explain that “producing a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation”. In other words, even just constructing, for example, an invalid bool is Undefined Behavior, no matter whether that bool is ever actually “used” by the program. The purpose of this post is to explain why that rule is so strict.
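To make the rule concrete, here is a minimal sketch of how code can avoid ever producing an invalid bool when starting from an arbitrary byte. The helper name `bool_from_byte` is made up for this illustration and is not a standard library function; the point is only that the check happens before a bool exists at all.

```rust
/// Converts a byte to a `bool`, rejecting any bit pattern other than 0 or 1.
/// (`bool_from_byte` is a hypothetical helper, not part of the standard library.)
fn bool_from_byte(byte: u8) -> Option<bool> {
    match byte {
        0 => Some(false),
        1 => Some(true),
        // Any other byte is not a valid `bool`. Transmuting it would already be
        // UB when the value is produced, so we refuse instead of constructing it.
        _ => None,
    }
}
```

With this shape, a byte such as 3 is turned into `None` and no invalid value is ever constructed.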
Is it correct to say that this requirement extends to "Composite Values", like structs, that have invariants that must be maintained between the fields? For example, take a struct that represents a date and has 3 fields for year, month, and day, where the invariant is that it always represents a valid date. By invariants, I mean that other code, including "unsafe", should be able to depend upon it representing a valid date. So, any code, when updating such a struct, must do so in a way that prevents any other code, including concurrent threads, from seeing an invalid "value"? If "unsafe" code were updating the private fields, it would have to ensure that at the end of the function AND BEFORE CALLING ANY OTHER FUNCTION THAT TAKES THE COMPOSITE VALUE AS AN INPUT, the invariants were met? Is that correct? Or is that taking the requirement too far?
I think here you are mixing up Rust's two kinds of invariants. Only violating the validity invariant is UB. The validity invariant is fixed by the language spec; the user has no influence here. In contrast, what you are describing is a safety invariant.
The compiler doesn't care when code violates safety invariants; it doesn't even know what safety invariants are. You only get actual, Miri-detectable, "language UB" once the code does something that is specified as UB in the reference.
When libraries specify assumptions they make about user code, violations of those assumptions do not necessarily lead to language-level UB, but they could. We could call this "library UB", and it basically means you are leaving the stability guarantee provided by the library and may encounter undocumented behavior (which may or may not be language UB now, and that could change in the future as well with library upgrades).
For example, it is not UB to create a non-UTF-8 &str. But it could be UB to call a &str-taking method on such an ill-formed str (depending on what that method does, it may crucially rely on UTF-8). The ill-formed str violates the safety invariant but satisfies the validity invariant (the latter is the same as the validity invariant of &[u8]).
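As a hedged illustration of where the safety invariant of str is normally established, the sketch below goes through the checked constructor `std::str::from_utf8`, which refuses ill-formed bytes. The wrapper name `checked_text` is an assumption made for this example. The unchecked counterpart, `std::str::from_utf8_unchecked`, is an unsafe function precisely because the caller takes over responsibility for that invariant.

```rust
/// Hypothetical helper: hands out a `&str` only if the bytes are valid UTF-8,
/// so the safety invariant of `str` holds for every value it returns.
fn checked_text(bytes: &[u8]) -> Option<&str> {
    // `from_utf8` validates the bytes and returns an error for ill-formed input.
    std::str::from_utf8(bytes).ok()
}
```

Code that only ever obtains its &str values this way never has to think about the distinction between the two invariants.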
This almost sounds like a loophole in the proposition that "safe code can never produce UB". It sounds like I can create a struct in safe code that violates the safety invariants of that type, even though the individual components are valid, which could eventually, somewhere in some code, lead to UB without ever involving unsafe code. Or are you saying that the only possible way to get UB here is that some unsafe code would have to operate on this struct while assuming that its invariants are met, which violates the requirement that "unsafe code cannot rely on safe code to maintain invariants"? Somehow, that seems not quite right, because "unsafe" code must rely upon the safe code within the privacy boundaries of the module to maintain invariants. Is that not the case?
EDIT 1:
If not in safe code, what about unsafe code that creates a struct (or enum) value where the invariants of the struct do not hold? If I understand correctly, this isn't "Language Level UB", so unsafe code is not required to ensure the invariants are met, but this can lead to UB elsewhere (even in safe code). Doesn't that mean that unsafe code must maintain composite invariants at unsafe boundaries, or else it is UB? Or, in this case, is it only UB when the invalid composite value is used?
EDIT 2:
This gets me thinking about how "unsafe" is used and whether some additional requirements are warranted. Unsafe today should rely upon the privacy boundary of the containing module to ensure that no values that the unsafe code depends upon are modifiable directly, without going through a method/function defined within the module, which modifies those values in a way that maintains the necessary invariants for the unsafe code within the module. This means that "unsafe" code is relying upon "safe" (though private) code to maintain invariants. Should it be that structs must declare which fields are used in invariants of the unsafe code within the module and it should be prohibited to mutate those fields, even within the module, without using "unsafe" AND that "unsafe" code that uses/references values of structs that are not declared as part of the invariants would be a compile-time error? This would prevent "unsafe" code relying upon anything other than a specifically declared contract where upholding the contract always requires "unsafe" because a violation of that contract could lead to UB?
@gbutler just because something is not immediate "language UB" doesn't mean it is right. There is a strict set of rules that the language and the compiler mandate (and that Miri checks), and then there are other invariants on top of that which unsafe code authors use to ensure that the first set of rules is indeed always upheld. The latter are basically part of the "proof" that the former are upheld; Miri cannot detect when those user-defined invariants are violated and neither can the language spec say anything meaningful about them. It still feels like you are confusing those two related but clearly distinct concerns.
That is indeed the case.
If your library with a "date" struct lets safe code produce an invalid date struct, then your library is unsound. That would be like making Vec::set_len a safe function.
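A minimal sketch of what a sound "date" library could look like, assuming a simplified calendar without leap years. All names here (`date`, `Date`, `DAYS_IN_MONTH`, `days_in_month`, `ymd`) are invented for the illustration. The fields are private, the only constructor validates its input, and the unsafe block relies on the resulting safety invariant.

```rust
mod date {
    /// Days per month, ignoring leap years to keep the sketch short.
    const DAYS_IN_MONTH: [u8; 12] = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];

    /// Safety invariant: 1 <= month <= 12 and 1 <= day <= DAYS_IN_MONTH[month - 1].
    /// The fields are private, so only code in this module can break it.
    pub struct Date {
        year: u16,
        month: u8,
        day: u8,
    }

    impl Date {
        /// The only way for code outside this module to obtain a `Date`.
        pub fn new(year: u16, month: u8, day: u8) -> Option<Date> {
            // Month 0 fails at `checked_sub`, months above 12 fail at `get`.
            let index = usize::from(month).checked_sub(1)?;
            let max_day = *DAYS_IN_MONTH.get(index)?;
            if day >= 1 && day <= max_day {
                Some(Date { year, month, day })
            } else {
                None
            }
        }

        pub fn ymd(&self) -> (u16, u8, u8) {
            (self.year, self.month, self.day)
        }

        pub fn days_in_month(&self) -> u8 {
            // SAFETY: the safety invariant guarantees 1 <= month <= 12,
            // so `month - 1` is a valid index into the 12-element table.
            unsafe { *DAYS_IN_MONTH.get_unchecked(usize::from(self.month) - 1) }
        }
    }
}
```

If `month` were a public field, safe code could set it to 13 and `days_in_month` would read out of bounds. That is the sense in which exposing such a field would make the library unsound, even though storing 13 in a u8 is not itself UB.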
But none of that is related to let b: bool = transmute(3u8) being insta-UB, I think. Now we are discussing how to prove a library sound, whereas the topic of my post is to define what exactly is and is not Undefined Behavior (and thus define soundness in the first place).
Ah, now we are moving towards the proposal of "unsafe fields" which comes up semi-regularly:
LLVM is not allowed to "speculatively execute" (I assume you mean something like loop-invariant code motion) things that could introduce new UB. This is in fact the reason why LLVM, for many operations, says that they are not UB but merely return poison/undef. (For example: out-of-bounds getelementptr inbounds; overflowing add/sub nsw/nuw.) A truly UB-risking operation is hard to move around.
But Rust is a surface language so it does not have such concerns; it can always declare full UB and later refine this to something less invasive on the MIR/LLVM level.
"Clearly" the solution for LICM is to hoist the if b { 42 } else { 23 } to the first iteration, effectively duplicating the loop condition (into "enter loop" and "continue loop").
(Presumably this doesn't really scale to more complicated situations and other optimizations)
I have long been curious whether programs like this exhibit UB:
fn set_to_zero(s: &mut str) {
    unsafe {
        let bytes: &mut [u8] = s.as_bytes_mut();
        for b in bytes {
            *b = 0;
        }
    }
}

fn main() {
    let mut s = "🍔".to_string();
    set_to_zero(&mut s);
    println!("{:?}", s);
}
The string s is valid UTF-8 before set_to_zero is called, and it is valid UTF-8 after it returns.
However, in the middle of the for loop, the memory belonging to s is not valid UTF-8.
But, this occurs only while bytes holds an exclusive borrow of the value, making s inaccessible.
In this case, the invalid data is not just unused but cannot be used. Does this count as “producing an invalid value”? If so, it’s effectively impossible to modify a str byte-by-byte if it contains multi-byte characters. This would make operations like reversing a mutable str in-place impossible to implement in Rust.
P.S. To be clear, I think it’s fine if the answer is “no, this is not possible in Rust.” It would be a bit limiting, but there are workarounds. Mostly, I‘m just curious whether the validity rules could/should be written to allow use cases like this.
(One workaround would be for functions like set_to_zero or reverse_in_place to take &mut [u8] instead, and require callers to transform/forget/consume the original string owner before calling the function and recreate it afterward.)
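A sketch of that workaround, using only safe code. The function names `zero_bytes` and `zeroed` are placeholders for this example. The String is consumed with `into_bytes`, the bytes are modified while no str exists, and `String::from_utf8` re-validates the result.

```rust
/// Operates on plain bytes, so no `str` invariant is involved while it runs.
fn zero_bytes(bytes: &mut [u8]) {
    for b in bytes {
        *b = 0;
    }
}

/// Hypothetical wrapper: consume the owner, mutate the bytes, then recreate it.
fn zeroed(s: String) -> String {
    let mut bytes = s.into_bytes();
    zero_bytes(&mut bytes);
    // `from_utf8` re-checks the bytes. NUL bytes are valid UTF-8, so this succeeds.
    String::from_utf8(bytes).expect("all-zero bytes are valid UTF-8")
}
```

The cost is an extra validation pass when the string is rebuilt, in exchange for not needing any unsafe block.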
The utf-8-ness of str is not/no longer a validity invariant. As a safety invariant, you are free to violate it in a region in your program where this is not observed by code outside your control. So, set_to_zero as above should now be legal.
Thanks! I had missed that change. (I see the reference is updated to reflect this in Rust 1.45.0, which will be released tomorrow. If only I'd seen this thread one day later, I wouldn't have had to write up that long question!)
And you can rephrase it to be about e.g. &mut bool, to avoid the str/UTF-8 situation. Is it legal to turn that into &mut u8, put a 3 into it, and then put a 0 into it? We do not know yet. My personal stance is "yes, this is allowed", mostly because disallowing it in a precise (Miri-checkable) way is really hard and might incur more complication than it is worth.
I have more questions about what is considered UB. The reference lists things that cause UB, but it doesn't contain the following:
dereferencing unaligned references
The reference says dereferencing an unaligned raw pointer is UB. Does this also apply to references? If not, why?
reading uninitialized memory
The reference says it is UB to produce an integer (i*/u*), floating point value (f*), or raw pointer obtained from uninitialized memory, or uninitialized memory in a str.
However, I doubt that these are the only types that must be initialized. The reference says that the list is not exhaustive. Is this something that should be added to the list? Or is it one of the things we don't know yet?
I'm still inexperienced with unsafe Rust, just trying to understand things.
IIRC an unaligned reference is simply an invalid value and therefore it was UB to produce it in the first place. Which is why you need raw pointers to do any unaligned pointing. And why there's this whole ongoing debate about adding something like a &raw reference syntax so you don't accidentally create a temporary reference and UB yourself as part of creating the raw pointer you actually wanted.
@Ixrec already answered this: it is impossible to cause UB by dereferencing an unaligned or dangling reference, because even constructing that reference is already UB.
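For illustration, here is a sketch of reading a possibly unaligned field through a raw pointer without ever creating a reference to it. The `Packed` struct and `read_value` function are hypothetical. The sketch uses the `std::ptr::addr_of!` macro, which later Rust versions provide for exactly this purpose, together with `read_unaligned`.

```rust
#[allow(dead_code)]
#[repr(C, packed)]
struct Packed {
    tag: u8,
    // Sits at offset 1, so it is generally not 4-byte aligned.
    value: u32,
}

fn read_value(p: &Packed) -> u32 {
    // `addr_of!` yields a raw pointer directly; no intermediate `&u32`
    // (which would be an invalid, unaligned reference) is ever created.
    let raw: *const u32 = std::ptr::addr_of!(p.value);
    // SAFETY: `raw` points to an initialized `u32` inside `*p`, and
    // `read_unaligned` does not require the pointer to be aligned.
    unsafe { raw.read_unaligned() }
}
```

The reference `&Packed` itself is fine here, because a packed struct has alignment 1.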
This is an interesting one. First of all let me note that some reads of uninitialized memory are fine. For example:
use std::{mem::MaybeUninit, ptr};
let uninit_mem = MaybeUninit::<u32>::uninit();
let ptr_to_uninit = &uninit_mem as *const _ as *const MaybeUninit<u8>;
// Reading at type MaybeUninit<u8> is fine even though the byte is uninitialized.
let _val: MaybeUninit<u8> = unsafe { ptr::read(ptr_to_uninit) };
In other words, there is no general principle in Rust that says that reading uninit memory is disallowed. (This is a common confusion and indeed even the reference was wrong here for a long time.)
Whether or not reading uninit memory is allowed depends on the type; this is part of the "thou shalt not construct invalid values" system. As you noted, for integers and floats we state it explicitly. For bool, char and enum I consider this rule to be implied by saying things like "the tag needs to match an existing variant of the enum"; uninitialized memory cannot match any tag. Thus these may not be uninitialized either. Structs, arrays and similar types inherit this property from their fields (but the padding between the fields, if any, may indeed be uninitialized). That leaves only unions, and those, too, may be uninitialized under the current rules (but also see this discussion). That's why the above read of an uninitialized MaybeUninit<u8> is allowed.
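In practice this means a value of a type like u32 should only be produced once the memory has actually been written. A minimal sketch, where the function name `init_then_read` is invented for the example:

```rust
use std::mem::MaybeUninit;

fn init_then_read() -> u32 {
    // Holding uninitialized memory is fine as long as it stays behind `MaybeUninit`.
    let mut slot = MaybeUninit::<u32>::uninit();
    // Initialize the slot before any `u32` value is produced from it.
    slot.write(7);
    // SAFETY: `slot` was fully initialized by the `write` call above.
    unsafe { slot.assume_init() }
}
```

Calling `assume_init` without the preceding `write` would produce a u32 from uninitialized memory, which is the case the reference lists as UB.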
I find that the "literally anything could happen" assertion about UB is naturally (and incorrectly) hand-waved away with "but why would it do anything other than the obvious?" unless there is a reasonable counterexample.
An incr of 80 is still a reasonable value for an integer, though. Perhaps a better example would be that the compiler can use the 0/1 value of bool as part of a small jump table, and unconditionally jumping 3 * X instructions forward is probably as UB as it gets. This can even be observed in "dead" code which is speculatively executed.