I think it would make more sense to define which types can be read from arbitrary memory (even uninitialized) safely. That basically just boils down to all bit patterns being a valid value for the type.
This would also reduce the number of unsafe code footguns in low level systems programming in general.
I don't know this for sure, but I strongly suspect LLVM has deeply-baked-in assumptions to the effect that uninitialized memory is never read, except maybe via unsigned char, and even then it might just be a special case for memcpy.
Does LLVM even have the concept of uninitialized memory? I mean, at most it’s just a deliberate annotation. There is no way for code to determine whether memory returned from a function like malloc() is initialized or not, that meaning is attached on another level of abstraction entirely.
The principal example here is padding bytes. You can initialize memory by assigning a struct value, and still face the reality that reading it back byte by byte is UB by Rust rules. It’s not a sane definition.
I am not at all familiar with the innards of LLVM, but I would be very, very surprised if it didn't.
That's just not true. The term the C standard uses for uninitialized memory is indeterminate value, and it consistently assumes throughout the normative text that reading an indeterminate value via any type other than unsigned char provokes undefined behavior. There is a technicality in play, because the normative definition of "indeterminate value" is "an unspecified value or a trap representation", which can be understood to imply that an indeterminate value can be read via any type that has no trap representations -- but that is almost certainly a drafting error, and both LLVM (Clang) and GCC will trigger undefined behavior upon reading an indeterminate value via any type other than unsigned char.
Note that memcpy and friends are specified to behave as-if they access memory via unsigned char *.
So in C, padding bytes are special-cased in several places with the intention that you can always interact with a struct safely (byte-by-byte or otherwise) as long as all of its value fields have been initialized. Specifically, padding bytes have unspecified, but never indeterminate, values. That would be a sensible way to deal with them in Rust if it doesn't already work that way.
... I don't know how to put that any more clearly. What do you want to call it when the compiler starts pruning entire control flow paths because they can never be executed without executing a construct whose behavior is undefined?
Sure:
int foo(void);
int bar(void)
{
int x;
if (foo())
x = 3;
return x;
}
compiles, using GCC 6.3 for x86-64, to
bar:
subq $8, %rsp
call foo@PLT
movl $3, %eax
addq $8, %rsp
ret
See how the return value is unconditionally 3? This optimization would not be allowed if reading an uninitialized variable produced a merely "unspecified" value. If you tag the declaration of foo with __attribute__((pure)) (this means "I solemnly swear that foo has no side effects"), it won't even bother calling foo. And, due to a long-standing pass-ordering problem, you don't get a warning about the use of an uninitialized variable! (The lack of a warning is GCC bug 18501.)
Clang 3.8 does the same thing (the generated assembly language is not quite the same but that's only because it does the stack alignment a little differently), but at least it does warn about the uninitialized variable.
This is not a memory access. Uninitialized locals and uninitialized memory (indirectly accessed through a pointer) have very different properties. In particular, there is no way for compiler to discern what's on the other side of the pointer (apart from the declared type), so it can't make any optimization that take that into account.
I understand why you think that, but it's not true anymore. This small modification of the program I posted earlier...
#include <stdlib.h>
extern int foo(void);
int bar(void)
{
int *x = malloc(sizeof(int));
if (!x) abort();
if (foo())
*x = 3;
int r = *x;
free(x);
return r;
}
I suspect clang/LLVM managed to do this by "lowering" the malloc allocation to a local variable, thus converting the program into the previous program, but if there were a distinction between uninitialized locals and uninitialized memory, that transformation would be invalid.
To get this to a somewhat more high-level point, LLVM has undef and poison values that both have very special behavior, and in particular, neither of them corresponds to any particular bit pattern. The sad truth is that we cannot have both “everything is a bit pattern” and keep all the fancy optimizations compilers do. Rust will have to have something similar. miri has Undef, which ironically is more like LLVM’s poison than LLVM’s undef (but that’s good, LLVM’s undef is pretty crazy).
C decided to treat unsigned char* specially to make it possible to implement memcpy. Rust could do the same with u8, but that somewhat feels wrong – why u8 and not the other integer types? And this is why @nikomatsakis brought up this union type, which is supposed to express "either T or Undef":
I might be missing something, but if you can't tell statically whether or not something is initialized, how does a union like that help? If you read init from uninitialized memory, it's still UB, and you have no way of knowing whether it is.
Perhaps I should clarify what I’m talking about.
Let’s go with a very specific example.
[#inline(never)]
fn memcpy64(dest: &mut [u64], src: &[u64]) {
for i in 0..src.len() {
dest[i] = src[i];
}
}
As far as I can tell, when this function compiles, the resulting machine code is always correct, even if the memory that src points to hasn’t been touched since the motherboard first powered on. Here, the function must never be inline because there is no other way to express the constraint that compiler doesn’t make assumptions about src, but it shows that “uninitialized memory” is a dubious concept. Making those reads “UB by definition” doesn’t seem well motivated. There is no actual “uninitialized memory” in a running computer.
The idea is that you wouldn't read the init field. You have a memcpy somewhat like this:
fn memcpy64(dest: &mut [Uninit<u8>], src: &[Uninit<u8>]) {
for i in 0..src.len() {
dest[i] = src[i];
}
}
I have to say though that this does feel pretty hacky. Not sure I am happy with this as the work-around to access Undef.
The machine code being correct doesn't mean anything except that the compiler is not "smart" enough to notice all instances of UB.
That's true. However, unitialized memory is still a very useful concept in describing the behavior of programs, in particular if we want to consider out programs at a higher level than assembly languages do. I suggest you do some Googling about LLVM's poison and undef, there should be numerous examples out there explaining why the LLVM developers considered it necessary to add these concepts even though they do not "exist" in a running computer. A brief search brought up this note and this first of a series of blog posts, for example.
in a running computer, pointers and usize are also "the same" type. If we constrain ourselves by what things are on a running computer, there are many valuable optimizations that we cannot do. For example, we could never move a variable access across a function call -- after all, the unknown function we call could "guess" the address of our local variables in memory, and access them. In a running computer, nothing stops them. But we don't want to permit functions to do that, precisely because we do want to perform optimizations. To this end, when we specify the behavior of programs, we imagine our computer to have "extra bits" that we use to keep track of things like whether something is a pointer or an integer, or whether memory is initialized. I also recently wrote a blog post motivating this kind of "extra state": https://www.ralfj.de/blog/2017/06/06/MIR-semantics.html
"Really" high-level languages like Java have a much easier time with all of this because they don't allow programmers to ever observe things like uninitalized memory. "Really" low-level languages like Assembly do not permit interesting optimizations, so they can get away with a very bare-bone kind of representation. Languages like C and Rust, on the other hand, want to eat the cake and have it, too -- they want to do optimizations based on high-level intuition while at the same time permitting the programmer to observe low-level details (like the fact that padding bytes exist). Properly doing this is a hard and unsolved problem.
The nomicon gives an exhaustive list of UBs in Rust, and it’s pretty short. “Reading uninitialized memory” is the odd man out. Every single one of the remaining sources of UB is motivated by pure technical necessity. But “reading uninitialized memory” is not UB by necessity, that one’s an arbitrary choice. It’s perfectly possible, even simple, to let Rustc emit code with no undefined memory.
Think about it. It’s already impossible to read or reference undefined locals on Rust level, except for that one ugly intrinsic. If the intrinsic didn’t exist, Rust would be one of those “really” high-level languages already.
On the other hand, the decision whether or not the heap allocator returns “uninitialized” memory is completely arbitrary metadata, and while it’s certainly unsafe to access that memory (interpreting it as certain types is UB), there is nothing that would make it inherently broken.
Edit: To clarify, I’m not arguing against the concept of uninitialized (heap) memory. I’m arguing that the blanket statement “reads of uninitialized memory are UB” doesn’t make sense. It should be unsafe, not UB. The result of std::mem::uninitialized() is the obvious exception, and should be considered as a separate category.
I don't think that's the case. Rust permits you to do all sorts of nefarious things with pointers, things that involve observing the offset of fields and how structs are laid out in memory. Have a look at the code of Rc::from_raw. Another example is HashMap, which allocates one large chunk of memory and then splits it into three pieces, namely, an array of hashes, keys, and values. This is only possible if you take a low-level view of memory, where everything is ultimately just a bunch of bytes.
Oh, I see. I misunderstood you then, sorry.
What is your stanza on reading uninitialized memory at type bool? That type is an enum, which typically guarantees that the value is either 0 or 1. If you now pass an uninitialized bool somewhere, is it okay for that to be UB?
The point that mem::unitiailized and malloc yield "the same kind of" unitialized memory has already been made above in this thread. Existing compiler transformations will happily turn a malloc into a local variable if they can prove that it is not used after the function returns. I also feel there's little gained from distinguishing the two.
Definitely UB. There is no limit to the harm you can make by breaking type invariants. This is not strictly limited to fundamental types or uninitialized memory either. Transmute from an initialized byte will do the same thing, and plenty library types have internal invariants that will cause UB if broken.
Rust doesn't use malloc() (not necessarily, anyway), and LLVM doesn't either AFAICT. It's just an external function that clang in particular has special optimizations for. Doesn't mean it's a good idea for Rust to treat heap the same way.
I mean, what are the odds that your code just happens to not require the heap allocations you are making, and you never noticed?
Less UBs in tricky unsafe code is IMO a gain well worth the distinction.
What I meant is, it doesn't allow any way to create undefined values you can legally read (apart from said mem::uninitialized(), and putting aside the ambiguity of heap allocation). Of course you can always use unsafe code, to create pointers to places on stack you shouldn't see, but that's UB by virtue of breaking the aliasing rules I think.
The idea of treating heap memory and stack memory differently w.r.t. UB is an interesting one. It goes completely against my intuition, but it took me a while to understand why. I believe the reason why stack and heap memory should be the same in this regard is that all memory should be interchangeable. I don't mean that the optimization discussed earlier (demoting mallocs) should be legal, I mean that I want to choose how I allocate memory solely based on considerations like allocation lifetime and performance and availability (e.g. whether I even have a heap), not by language lawyering about UB.
Especially when I'm doing low-level, unsafe work where I treat something as a bag of bits and do dangerous things with it, I don't want to care where in memory those bits reside. I have enough trouble getting the actual bit bashing right. I don't want to track down a misoptimization caused by moving a temporary buffer from a Vec to a stack-allocated array (or the other way around).
The virtue of treating heap and stack alike also shows elsewhere: Your proposed memcpy64 is only well-defined (under the proposed "reading uninit heap memory is okay") if it copies from a heap allocation into another heap allocation. IIUC, the inline(never) is supposed to force the compiler to conservatively assume the heap-heap case, but that's not how this works. Language semantics must be defined independently of "what the compiler can see" and if you say "reading uninitialized bits from the stack is UB", then calling memcpy64 with stack buffers as arguments is UB, period. (And this will inevitably be exploited by a compiler author trying to squeeze out a few percent out of a benchmark suite.) So a rule that distinguishes stack and heap memory makes it actually harder to write a correct memcpy, because it's easy to write something that's only UB for stack memory (i.e., where it's even harder to observe the UB).
Of course, one could invent arbitrary semantics that are aimed at preventing certain reasoning steps by the compiler. For example, one solution would be to draw a line at function boundaries (perhaps only those that are inline(never)) and say that pointer arguments are to be treated "as if" they were referring to heap memory. Besides being very inelegant and throwing the whole object model into disarray, this will inevitably have fallout for the compiler's ability to analyze and optimize. So I am not a fan of this approach either (at least in principle, I won't rule out that there's some twist that is relatively simple and brings much advantage.)
PS:
That can easily happen in overly general code spread over multiple functions. But aside from the specifics, it doesn't sound very appealing to rule out optimizations just because they sound like good code shouldn't need them.