Role of UB / uninitialized memory

arielb1 · June 14, 2017, 10:36am

That's safety vs. definedness again. The language should not "guess" your types' semantics - if you don't use any "magic" attributes, the language should only care about whether your types are "syntactically" valid - e.g. you should be able to use an RcBox to store 2 random words (as long as you don't call any Rc functions in a way that would crash).

le-jzr · June 14, 2017, 1:23pm

That's not what I said; you are attacking a strawman. All memory is the same, but pointers have different sources, and they don't necessarily lower to memory accesses. Specifically, if you reference a local, LLVM is allowed to optimize the memory access away completely in certain cases. In order to do so, it must know statically which value is in the memory, which basically means that the memory is allocated and accessed in the same function (that's why inlining matters in my example).

The problem is that if LLVM knows that the memory is not written to before read, it can do anything it wants with the access, but this is only doable with memory allocated with alloca, and the only way to achieve this in Rust is std::mem::uninitialized.

Heap allocation (not heap memory) is different because it's an external function with no special LLVM semantics. Defining heap allocations to be uninitialized is an explicit special case in Rust's definition (lawyering by the language, as you put it). In fact, all memory coming from the heap allocator is just subdivided regions of initialized memory provided by the OS. There are no LLVM optimizations taking place, and Rust-specific optimizations don't benefit from uninitialized reads being UB, since any such optimizable code would almost certainly be invalid by definition, not to mention that most uses of the heap go through safe abstractions anyway.

hanna-kruppe · June 14, 2017, 2:09pm

If we're only talking about accesses, then heap vs. locals (or, if you prefer to talk about that, the means of memory allocation) is not inherently relevant. The compiler is, and absolutely should be, allowed to remove (or introduce, or shuffle around) memory accesses no matter the source. More importantly, it's absolutely possible for LLVM to deduce that a load through a pointer yields uninitialized bytes even without seeing where that memory is allocated. Consider this function for example:

unsafe fn make_and_read_undef(p: *mut u8) -> u8 {
    let u = mem::uninitialized();
    *p = u;
    *p
}

Even leaving aside any knowledge of uninitialized memory, it ought to be possible to optimize that code into this:

unsafe fn make_and_read_undef(p: *mut u8) -> u8 {
    let u = mem::uninitialized();
    *p = u;
    u // <- changed
}

... which has the effect of making the memory access *p yield uninitialized bytes, regardless of how and where it was allocated. So as long as "all memory is the same" and there is some source of uninitialized memory, any memory can be made uninitialized, unless we start randomly restricting useful optimizations such as store forwarding.

Again, whether something is UB or not is a semantic question, and so if we want to make certain operations defined or not, we need to lay out the criteria in terms of Rust semantics. What some compiler does or doesn't know, and what it does with that knowledge, is only relevant when testing whether said compiler is faithful to the semantics. These two concerns only intersect when it comes to judging how practical some proposed semantics are.

I tried to guess at what semantics you are proposing by reading it as "reading uninit stack memory is UB, reading uninit heap memory is fine" but apparently that was wrong. I apologize and would like to learn what semantics you are proposing. I believe that would be easier for me if you explained these semantics separately from how they can or cannot be implemented on LLVM.

This is not entirely true: LLVM knows about the symbol malloc (among others) and will treat it as the C function of the same name. So Rust code using the system allocator will also be affected by this, assuming some interprocedural analysis or inlining (which we'd like to have, at the very least to avoid additional overhead from wrapper functions compared to calling malloc directly; zero cost abstractions and all that jazz).

RalfJung · June 14, 2017, 5:17pm

To add to what @hanna-kruppe said, the C standard explicitly says that memory returned by malloc is uninitialized: "The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate." The motivation for this is to abstract away from details like allocating memory from the OS, and describe what all C programs can rely on when calling malloc.

RalfJung · June 14, 2017, 5:22pm

(Where "it" is "Rust".)

That's not entirely true: if you write something like let x = Struct { f1, f2, f3 };, the padding between the fields may well remain uninitialized. Still, it should be legal to pass &x to memcpy to access mem::size_of::<Struct>() bytes.
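To make the padding point concrete, here is a minimal sketch. The struct `Padded` and the helper `padding_bytes` are hypothetical names invented for illustration; they do not come from the thread. It only counts the bytes that belong to no field and does not read them.

```rust
use std::mem::size_of;

// Hypothetical struct for illustration. `repr(C)` fixes the field order so
// the layout is predictable.
#[repr(C)]
pub struct Padded {
    pub tag: u8,    // 1 byte, followed by padding up to the alignment of `u32`
    pub value: u32, // 4 bytes
}

/// Number of bytes in `Padded` that belong to no field.
/// A struct literal such as `Padded { tag, value }` writes the fields only;
/// nothing is required to write these remaining bytes.
pub fn padding_bytes() -> usize {
    size_of::<Padded>() - (size_of::<u8>() + size_of::<u32>())
}
```

On a typical target where `u32` is 4-byte aligned, `size_of::<Padded>()` is 8 and `padding_bytes()` is 3, so a `memcpy` of the whole struct copies three bytes that no field assignment ever wrote.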

le-jzr · June 14, 2017, 8:10pm

Right, I kind of internalized what @zackw said

I can't see why anyone would want to disagree with this, so I stashed it away to "resolved issues" in my head.

le-jzr · June 14, 2017, 8:23pm

I’ll try to break up my argument into small, independent pieces, so that it’s more understandable (and easier to determine which parts are not agreeable). The problem I’m having explaining my rationale is that so many concepts and layers of abstraction are involved.

First off, let’s settle the general assumption: Undefined behavior is bad. If it’s not intuitive, it’s even worse. Things should only be classified as undefined behavior if doing them necessarily breaks assumptions the programmer makes about the code. More to the point, giving the compiler freedom to do more optimizations is not by itself a sufficient argument to make something UB. There are numerous examples where C code doesn’t work as expected, because of things that are counterintuitively defined to be UB. Notorious examples in C are signed integer overflow and strict aliasing rules. They break correct-looking programs, silently. Any disagreements?

RalfJung · June 14, 2017, 8:45pm

Oh, I missed that comment. And I don't think it is true. I already quoted malloc above as yielding indeterminate values; the padding bytes of a struct will usually stay indeterminate. In fact, a quick search for "padding" in the standard brings up "The contents of ‘‘holes’’ used as padding for purposes of alignment within structure objects are indeterminate."

I think there is general agreement in the unsafe code guidelines team that there are some optimizations we want to do on safe code, and that we are willing to make things UB for that purpose. See https://github.com/nikomatsakis/rust-memory-model/tree/master/optimizations for some examples. I agree that UB is bad, but slow code is also bad. This is a trade-off. Making fewer things UB is one way to improve the situation here; another one that we are looking into is making UB testable. That would be a big thing, and put us in a totally different spot than where C sits. If we have a way to run your test suite in "UB checking mode", so that you can be sure it doesn't do anything the compiler doesn't want you to do -- that makes it much less of a problem for the UB rules to be subtle. Still, we want them to be as "un-subtle" as reasonable while still permitting sufficient optimization. After all, if people use C or assembly over Rust for reasons of performance, we haven't gained much in terms of overall safety. (Also notice that the UB rules will only ever affect people who use unsafe, so most Rust programmers should not have to care.)

Also, is some forum moderator reading this? I think most everything since https://internals.rust-lang.org/t/canvas-unsafe-code-in-the-wild/4990/23 is interesting discussion and very valuable feedback, but it's not on-topic in this thread any more, so maybe we could have it split into a new thread?

le-jzr · June 14, 2017, 9:08pm

Whether or not it's currently true in Rust (or C, which is even less relevant), the question here is whether it should be. Half the text I wrote in this thread tries to explain that these definitions of indeterminate/undefined/uninitialized memory (all conflated into one package in Rust) don't affect anything except for arbitrarily restricting what code is legal by definition...

...by which I also mean that this particular UB doesn't make any reasonable optimizations possible.

Determining whether a piece of memory is uninitialized, or whether uninitialized memory is read, is undecidable in general. Any realistic checking would necessarily flag lots of existing code as possible UB. But if there is a possibility to make it somehow work, I'm all for it.

The people who use unsafe are also exactly the population whose mistakes break security assumptions in all surrounding code. I wouldn't say it's a niche concern.

That would be very appreciated.

notriddle · June 14, 2017, 9:44pm

"Making UB testable" means (at least, as I understand it) being able to instrument a Rust program to detect if Undefined Behavior is invoked at runtime. Static analysis to detect UB already exists for Rust; the "testable UB" planning is centered around runtime checks for when static analysis is too conservative.

RalfJung · June 15, 2017, 1:12am

Well, one practical reason is that LLVM uses these semantics, so if we want to say something like "freshly allocated memory (heap or stack) is essentially a non-deterministically initialized bag of bits", we'd have to actually zero-initialize everything to make LLVM play by these rules. I will concede that this is not a great reason to change our semantics, but it has to be taken into account. What you seem to be asking for, however, is a concrete useful optimization that would be disallowed by these semantics. I would be seriously surprised if there isn't something LLVM does here that is actually useful, but that's pretty far outside of my expertise. Maybe @eddyb or @arielb1 have an idea.

I will assume here that we use the same rule for heap and stack, for multiple reasons. Some have been outlined above by @hanna-kruppe, I would also add that my personal view of the stack is that it's just an optimization: Really, we could allocate every local variable on the heap individually, and deallocate them explicitly; doing this with a stack discipline is just a lot more efficient.

Right, that's why I said "run your test suite". This is not a static analysis, it is a dynamic check. Much like address sanitizer, only we could actually make it check for everything. We could even make this the definition of what it means to be UB. I wrote some blog posts about this: How to specify program behavior and Exploring MIR semantics through miri.

toc · June 15, 2017, 5:57am

My understanding is that reading from an allocated (but as yet unwritten to) page will typically produce zeros, while reading after that page has been touched will produce some consistent garbage. So uninitialized memory would definitely be a meaningful concept at runtime.

le-jzr · June 15, 2017, 6:47pm

That's not true. Apart from the padding bytes, which would indeed need to be initialized, and mem::uninitialized, which is just plain evil, Rust doesn't allow anyone to touch or point to uninitialized stack memory, not even unsafe code. It's statically prevented by the compiler.

Heap allocations are a non-issue because even if LLVM itself has special handling for the malloc function, and even if LTO makes those malloc calls bubble up all the way from the allocator crate to the consumer code (both very possible), all you have to do is to alias the function under a different symbol name. Then it's just an extern function that returns a pointer, no special considerations involved.

Ah, right! That would be indeed very helpful.

Please don't. It's bad enough as it is, that Rust isn't fully specified separately from implementation. It makes it harder to understand the language, and writing non-trivial code with borrows is already basically "make random changes until it compiles".

I never, I repeat, never, suggested that heap and stack should be treated differently. I said previously that @hanna-kruppe is attacking strawmen and I stand by that sentiment.

It could make it more clear to separate the two conflated concepts. For the purposes of discussion, let's call them "undefined memory" and "not-initialized memory" (to disambiguate from "uninitialized").

"undefined memory" is the concept that leaks from the LLVM level, it's the memory that is considered uninitialized by LLVM, and reading it is UB as per LLVM's rules. On Rust level, this maps exactly to padding bytes and std::mem::uninitialized(), and nothing more, assuming LLVM is disavowed of its knowledge about what is heap allocation (as explained earlier).

"not-initialized memory", on the other hand, is the rust concept of uninitialized memory (Uninitialized Memory - The Rustonomicon).

"undefined memory" is "not-initialized memory" by definition, but the opposite doesn't hold. "not-initialized memory" which is not "undefined memory" is unsafe to read, but is not necessarily UB.

Now that the language is established, it should be clear that it's possible for heap allocations to be "not-initialized memory", while not being "undefined memory". For the programmer, this is no more of a concern than the current definition of uninitialized memory. The argument that you would somehow need to care about the distinction is completely unfounded... if you can write unsafe code today without freaking out about uninitialized memory, you can do it just the same without freaking out about "undefined memory". In fact, you can hold on to your current intuition. Anything that's legal today would be legal with my suggestion. It would just make more code legal, including some cases that inexperienced programmers intuitively expect to be correct, but that are in fact UB by the current definition.

le-jzr · June 15, 2017, 6:54pm

The OS actually has to zero out the physical page you receive. Not doing so would be a security problem (privileged application allocates page to store confidential data, page is deallocated or swapped to disk without being cleared, another application maps the freed physical frame and reads confidential data).

RalfJung · June 15, 2017, 7:42pm

However, the in-process memory allocator will happily re-use allocations. So whatever malloc returns may not be zeroed.
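As a hedged illustration of the difference, Rust's `std::alloc` exposes both flavours: `alloc_zeroed` guarantees zeroed bytes, while plain `alloc` makes no promise about contents. The helper names `read_zeroed` and `fill_then_read` are hypothetical; the sketch writes every byte before reading it in the non-zeroed case.

```rust
use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};

/// Allocates `len` bytes with `alloc_zeroed` and returns a copy of them.
/// Every byte is guaranteed to be zero, so reading immediately is fine.
pub fn read_zeroed(len: usize) -> Vec<u8> {
    assert!(len > 0, "this sketch does not handle zero-sized allocations");
    let layout = Layout::array::<u8>(len).expect("layout overflow");
    unsafe {
        let p = alloc_zeroed(layout);
        assert!(!p.is_null(), "allocation failed");
        let copy = std::slice::from_raw_parts(p, len).to_vec();
        dealloc(p, layout);
        copy
    }
}

/// Allocates `len` bytes with plain `alloc`, whose contents are not
/// initialized, and writes every byte before reading any of them.
pub fn fill_then_read(len: usize, byte: u8) -> Vec<u8> {
    assert!(len > 0, "this sketch does not handle zero-sized allocations");
    let layout = Layout::array::<u8>(len).expect("layout overflow");
    unsafe {
        let p = alloc(layout);
        assert!(!p.is_null(), "allocation failed");
        std::ptr::write_bytes(p, byte, len); // initialize first
        let copy = std::slice::from_raw_parts(p, len).to_vec();
        dealloc(p, layout);
        copy
    }
}
```

Skipping the `write_bytes` call in `fill_then_read` is exactly the case under debate: the bytes might be leftovers from an earlier allocation in the same process, and under the current rules reading them is UB.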

le-jzr · June 15, 2017, 7:57pm

Yes. That's the "not-initialized memory" concept I used above.

stebalien · June 16, 2017, 6:24pm

Also see:

https://github.com/rust-lang/rfcs/pull/1222

Specifically this comment.

notriddle · June 16, 2017, 9:49pm

use std::ptr;

unsafe {
    let my_var: u8 = 1;
    let p: *const u8 = &my_var;
    // Whether I need to subtract or add depends on whether the stack grows up or down.
    let uninitialized_a = ptr::read(p.offset(1));
    let uninitialized_b = ptr::read(p.offset(-1));
}

le-jzr · June 16, 2017, 9:55pm

That’s always UB for unrelated reasons. It’s illegal to dereference a pointer that was offset out of range of the allocation it was originally pointing to.
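For contrast, a minimal sketch of neighbour reads that stay inside one allocation. The function `neighbours` is a hypothetical name; the pointer is derived from the whole array rather than from a reference to a single element, on the assumption that this is what keeps both offsets in range.

```rust
/// Returns the elements before and after the middle of a three-byte array.
pub fn neighbours(cells: &[u8; 3]) -> (u8, u8) {
    // Derive the pointer from the whole array so that it may be used to
    // reach every element of it.
    let base: *const u8 = cells.as_ptr();
    unsafe {
        let middle = base.add(1);
        // Both offsets land inside the same three-byte allocation.
        let before = std::ptr::read(middle.offset(-1));
        let after = std::ptr::read(middle.offset(1));
        (before, after)
    }
}
```

Here the offsets are fine because the array is the allocation; in the snippet above, the allocation is the single byte `my_var`, so any non-zero offset leaves it.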

le-jzr · June 16, 2017, 10:42pm

Yeah, well it would still be unsafe code. Security concerns are inherent in any kind of unsafety. But yeah, it might seem safer (in the non-Rust sense) and simpler to say "reading this is UB" than saying "exposing this data may be insecure".

To be honest, I can't even remember the motivation I had for advocating my position. Damn my short memory. At some point it became solely about showing that the UB designation is not strictly necessary, in contrast to all the other UB cases. But looking back, I lack clear motivation for why it should matter.

Padding bytes are the critical facet. If padding bytes weren't uninitialized (I don't think it was clearly established whether or not that's the case currently), I can't think of a motivation for Rust code to want the ability to read uninitialized memory. (It might want write-only reference types, though, to help with those IO buffers. Just a side thought.)
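As a side note on the write-only idea, here is a sketch using `MaybeUninit` and `Vec::spare_capacity_mut`. Both APIs postdate this thread and are not what was proposed here; the helper name `fill_buffer` is hypothetical. The point is only that a buffer can be filled without ever reading a byte that has not been written.

```rust
use std::mem::MaybeUninit;

/// Builds a `len`-byte buffer without reading any byte before writing it.
pub fn fill_buffer(len: usize, byte: u8) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(len);
    // The unused capacity is exposed as `MaybeUninit<u8>` slots, which can
    // be written to but do not hand out a plain `u8` to read.
    let spare: &mut [MaybeUninit<u8>] = buf.spare_capacity_mut();
    for slot in spare.iter_mut().take(len) {
        slot.write(byte);
    }
    // SAFETY: the first `len` slots were written above, and `len` does not
    // exceed the capacity requested from `with_capacity`.
    unsafe { buf.set_len(len) };
    buf
}
```

An IO routine could take the `&mut [MaybeUninit<u8>]` slice in place of the loop, which approximates a write-only view of the buffer.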

I'm unsure what the situation is at the FFI boundary, though. The definition of uninitialized memory in the nomicon is not very explicit on the matter.
