"What The Hardware Does" is not What Your Program Does: Uninitialized Memory

I have added the freeze idea but not in the way you asked.^^ I added it to mention a data structure that is incompatible with “unstable” uninitialized memory, and how freeze would fix that.

I would love to have a clear, concise answer to “why is it beneficial for optimizations to make the abstract machine state of uninitialized memory distinct from any initialized bit pattern”, but I do not. I know that LLVM propagates undef/poison through loads of things and that would indeed be much harder if it had to make sure that there exists a single “consistent” frozen value for any uninitialized byte, but I have no idea how much performance that really buys us.

I am not entirely surprised. DRAM does not even satisfy the digital contract that literally everything above assumes.

2 Likes

I thought you disliked that essay :slight_smile:

That’s all well and good, and I have more sympathy for this view than some on this forum, but given that for Rust no such abstract machine has been formally defined yet, platform details and potential optimisations are the only things that people can realistically talk about now. And let’s face it, most Rust code is going to be compiled by naïve translation to imperative code later transformed by optimisation passes, so the whole ‘abstract machine’ business is in practice a fancy specification of invariants that all optimisations must uphold.

So I think it’d be nice for a change to have a discussion about what kinds of optimisations we want to support, so that we can finally come up with some sensible semantics. In particular, I’d like a more thorough explanation for the x < 150 || x > 120 example, because I find the concept of a value that defies universal quantifiers quite disturbing. I don’t see a reason why I shouldn’t consider this example an outright compiler bug. What is the specific sequence of optimisation passes that produces such code? How are their transformations justified?

(By the way, in the C++ example, if I replace a heap allocation with a stack allocation — which is uninitialized all the same — the function is optimised to always return true instead. This only serves to strengthen my ‘compiler bug’ intuition.)

And one more thing. If we want to discourage talking about target platform details in favour of abstract machine semantics, then the same thing should apply to LLVM intrinsics. Not only because it ties Rust to a particular compiler implementation, which isn’t all that rigorously specified anyway, but also for the much more down-to-earth reason that relatively few people are familiar enough with LLVM internals to understand it. When someone tells me ‘mem::uninitialized is just undef’, I have no idea what is being said. And given the headaches arising from conflicting optimisations mentioned in the post, LLVM people don’t necessarily have the best idea either.

The point is that x is not “a (particular) random bit pattern”; in particular, it doesn’t need to behave consistently or as a constant value across uses. Therefore, it can yield false on both sides of the logical OR, first behaving as if it were a value of at least 150, then subsequently behaving as if it were a value of at most 120. In other words, the compiler is allowed to, at its convenience, decide to optimize a comparison expression involving an uninitialized value into a constant false, separately, in both subexpressions.
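
For reference, the example under discussion looks roughly like the following (a reconstruction from the function name and comparisons mentioned in this thread, not a verbatim copy; mem::uninitialized is kept as in the original post even though it is deprecated today):

```rust
fn always_returns_true(x: u8) -> bool {
    x < 150 || x > 120
}

fn main() {
    // UB: producing an uninitialized u8. The abstract machine does not
    // assign it any particular bit pattern.
    let x: u8 = unsafe { std::mem::uninitialized() };
    // With optimizations enabled, this assertion can fail, because each
    // comparison may independently be folded to `false`.
    assert!(always_returns_true(x));
}
```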

It is still definitely disturbing.

In C and C++, the answer is “because the standard says so”. AFAICT there’s no Rust standard, so I’m not sure what the correct answer in Rust is, but maybe we should stick to the definitions and allowed assumptions listed in the Nomicon.

1 Like

Forget about standards for a moment. This assembly is generated by a specific, deterministic (I assume) compiler consisting, like I said, of a frontend that generates naïve (but annotated) imperative code that does the ‘obvious’ thing, and optimisation passes that transform that code. The optimisation passes have to justify their transformations, i.e. prove that they maintain semantics of all code that was not UB in the first place. I want something of the form ‘if this comparison returned true, that would imply condition X, which is UB and therefore cannot happen, so we can assume it returns false and fold it to a constant’; in other words, something like this. Then we can perhaps consider if this optimisation is worth the trade-offs and whether to adopt semantics that allow it, and to what extent. Mere ‘because the standard says so’ is not nearly enough. Why does the standard say so? Where? Does it actually? Maybe in fact the optimisation is wrong, and only happens in this case to produce code that is technically compliant.

1 Like
  1. The compiler assumes that UB never happens in your program;
  2. If x is uninitialized, then (the standard states that) it is UB to evaluate x < 150.

Given the two statements above, if the compiler can prove that x is uninitialized, it concludes that your program will not attempt to evaluate x < 150. “Will not attempt to evaluate” means that the code performing this evaluation is unreachable, and thus can be eliminated.

1 Like

@matklad makes a very good point about how MADV_FREE can introduce non-determinism even at runtime. From the man page:

madvise(addr, len, MADV_FREE)

The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page. After a successful MADV_FREE operation, any stale data (i.e., dirty, unwritten pages) will be lost when the kernel frees the pages. However, subsequent writes to pages in the range will succeed and then kernel cannot free those dirtied pages, so that the caller can always see just written data. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references.

Since this is used by jemalloc internally, this means that uninitialized memory returned by the allocator will non-deterministically contain either zeroes or data from previously freed memory, and the OS could decide to change this at any point. A freeze operation as suggested in @RalfJung’s post would have to read and write back 1 byte in each page of the frozen data to ensure that the OS doesn’t change the data behind our backs.
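
As a rough sketch of what that would mean mechanically (hypothetical code, not a proposed API; a 4096-byte page size is assumed, real code would query it):

```rust
use std::ptr;

// Hypothetical sketch: pin down the current contents of a buffer against
// MADV_FREE-style reclamation by dirtying every page it touches. Reading a
// byte and writing the same byte back is a no-op at the byte level, but it
// marks the page dirty, which cancels the pending free operation.
const PAGE_SIZE: usize = 4096; // assumption; query the real value via sysconf

unsafe fn dirty_every_page(buf: *mut u8, len: usize) {
    let mut offset = 0;
    while offset < len {
        let p = buf.add(offset);
        let b = ptr::read_volatile(p);
        ptr::write_volatile(p, b);
        offset += PAGE_SIZE;
    }
    // Touch the last byte too, in case the buffer is not page-aligned and
    // the stride above skipped the final page.
    if len > 0 {
        let p = buf.add(len - 1);
        let b = ptr::read_volatile(p);
        ptr::write_volatile(p, b);
    }
}
```

Of course, this only addresses the OS-level non-determinism; a real freeze would also have to stop the compiler from treating the memory as undef.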

2 Likes

What happens in the Rust example is:

  1. The LLVM IR generated for mem::uninitialized() performs an alloca and then loads from it without storing a value. You can see this by building without optimizations.
  2. LLVM simplifies this to undef.
  3. always_returns_true is inlined into its caller, so the IR becomes the equivalent of undef < 150 || undef > 120.
  4. undef < 150 is replaced with false, an arbitrary boolean value (though interestingly, not undef itself). undef > 120 is also replaced with false. (To see what’s going on in the playground, choose the “LLVM IR” output option.)
  5. false || false becomes false.
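
Putting those steps together, the caller ends up computing the equivalent of the following (written back as Rust purely for illustration; this is not actual compiler output, and as noted below a different pass order or LLVM version could produce something else):

```rust
// The net effect of steps 1-5 on the caller, expressed as Rust.
fn always_returns_true_after_optimization(_x: u8) -> bool {
    // undef < 150    ->  false  (arbitrary choice by LLVM)
    // undef > 120    ->  false  (another, independent arbitrary choice)
    // false || false ->  false
    false
}
```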

This is all on LLVM’s end, so you will get the exact same result from Clang if you manage to get Clang to produce the same IR – but trivial differences in what IR the frontend generates, or a different LLVM version, may cause different results. After all, this specific outcome depends on things like the order of optimization passes (e.g. always_returns_true could have been optimized to true before inlining), and LLVM’s arbitrary choice of false as the result of comparisons with undef (true would have worked just as well).

As for how it’s justified –

For a C or C++ equivalent, the standard provisions that would apply are quite complicated and have changed over time, but it boils down to “executing the comparison is undefined behavior, therefore the program can do anything from that point on”.

For Rust, as has been mentioned, there is not yet an official definition of what operations produce undefined behavior, but when such a definition is created it will likely follow similar principles. It sort of has to, because Rust depends on LLVM’s optimizer which has those principles baked into it.

13 Likes

What I’d like to know is: why isn’t undef < 150 replaced with a trap instruction?

(I understand why the spec doesn’t mandate this. As far as the spec is concerned, undef when the compiler knows that you’re reading uninitialized memory is the same as the undefined value that results when you read a memory cell at the same time as another thread writes to it, even though in the latter case the compiler doesn’t know you’ve created an undefined value. I’m curious why, if the compiler knows that the value is undef, it doesn’t throw in a trap instruction when you branch on the result of comparing with it.)

2 Likes

I was assuming the question was “why this shouldn’t be considered a compiler bug”. The — arguably — surprising behavior of the compiler is not a bug if and only if it is permitted by the specification of the language. No more, no less. Why the specification is what it is is a different, equally valid question.

Alright, but reading the value of an uninitialized variable is UB, isn’t that the whole point? An optimizing compiler at this point goes like, “hm, this guy is reading the value of an uninitialized variable, so I have the license to evaluate the expression as anything I please”.

C99, §6.7.8.10:

If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

§3.17.2:

indeterminate value

either an unspecified value or a trap representation

§3.1.1:

access

〈execution-time action〉 to read or modify the value of an object

§6.2.6.1.5

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined.41) Such a representation is called a trap representation.

And if this weren’t enough, it specifically says in Annex J.2:

J.2 Undefined behavior

[…]

— The value of an object with automatic storage duration is used while it is indeterminate (6.2.4, 6.7.8, 6.8).

[…]

Unfortunately, I don’t have a copy of any recent C++ standard at hand.

2 Likes

Slotted into the existing framework of undef (of course, one could side-step the whole matter by eradicating the notion of undef, with more far-reaching consequences), such trapping would mean the optimizer would have to tread extraordinarily carefully around optimizing comparisons [1]:

  • For most operations, it is infeasible to prove they don’t ever get undef inputs.
  • Even if a given optimization pass can’t prove the operand is undef, that doesn’t mean a later pass won’t be able to prove it.
  • In consequence, any comparison (without a proof that neither operand could ever be undef) has to be treated as potentially trapping and thus as having side effects.

This restricts a variety of useful optimizations (even and especially those that don’t need UB to be justified). The biggest issue is that causing a comparison to be executed at all when it previously wouldn’t have been becomes off-limits, or predicated on the incredibly difficult task of proving that the inputs can’t ever be undef. This affects a bunch of important (even classical textbook) optimizations, so scrapping the concept of undef and all the optimizations it enables entirely looks pretty appealing by comparison.
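
For a concrete flavor of what is at stake, here is a made-up source-level example (names and code are mine, not from the thread): an optimizer would like to hoist the loop-invariant comparison out of the loop and evaluate it once, which speculatively executes it even when the slice is empty and the original program never evaluated it. If comparisons could trap on undef inputs, that hoist would only be legal after proving the operands can never be undef.

```rust
// Made-up example: loop-invariant code motion of a comparison.
fn count_small(xs: &[i32], bound: i32, flag: i32) -> usize {
    let mut n = 0;
    for &x in xs {
        // `flag < bound` does not depend on the loop, so an optimizer wants
        // to hoist it out and evaluate it once, even if `xs` is empty and
        // the comparison would never have run in the original program.
        if flag < bound && x < bound {
            n += 1;
        }
    }
    n
}

fn main() {
    assert_eq!(count_small(&[1, 5, 10], 6, 0), 2);
}
```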

In fact, as you can read about in section 2 (particularly 2.2 and 2.3) of Taming Undefined Behavior in LLVM, enabling “speculative” execution of UB-ful operations is (one of the) main motivation(s) for having these weird values in the first place.

[1] talking only about comparisons because that was your example; the same applies to most other arithmetic and logical operations if they’d similarly trap. I’ll also only talk about undef although poison is essentially the same for the issues at hand.

7 Likes

Oh, right, duh. That makes sense.

2 Likes

There’s a small note in https://llvm.org/docs/LangRef.html#undefined-values:

In fact, %A and %C need to have the same semantics or the core LLVM “replace all uses with” concept would not hold.

So as I understand it, it’s less that this is an explicit choice and more that it’s a consequence of being able to do constant propagation.

2 Likes

Heh, my own words used against me. :wink: Well played!

I am not a fan of indiscriminate use of the “anything is possible” shirt/slogan because it papers over why UB exists and why “anything is possible”. What do mysterious unicorns have to do with our compilers? I mean, the compiler won’t just generate unicorns into your program.^^

I am a fan of that article though.

It does not defy universal quantifiers. You just quantified over the wrong domain if you only considered 0..256.
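
To make the quantifier point concrete: over the domain of initialized u8 values the property does hold, and an exhaustive check passes (quick sketch, reusing the function from the example):

```rust
fn always_returns_true(x: u8) -> bool {
    x < 150 || x > 120
}

fn main() {
    // The property holds for every element of the domain 0..=255.
    for x in 0u8..=255 {
        assert!(always_returns_true(x));
    }
    // An uninitialized value is simply not an element of this domain,
    // which is why the quantifier does not cover it.
    println!("holds for all 256 initialized values");
}
```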

Of course, that is surprising. But a “naive memory”, where every location just stores a value in 0..256, is IMO not achievable for a language that wants to be even remotely competitive in terms of performance. The post here argues this for uninitialized memory; Stacked Borrows needs to track pointer provenance which also is incompatible with a naive memory.

To recover naive memory, we would have to sacrifice all potential reference-based optimizations (say goodbye to Stacked Borrows) as well as all the already existing optimizations that are based on “pointer arithmetic cannot cross object boundaries”, as well as all the undef and poison-based optimizations LLVM does. I am content with assuming that that is prohibitively expensive even without having done the experiment.

I don’t disagree. However, the optimizations do not exist either. You said “potential optimizations”; why is that any better than a “potential abstract machine”?

For the longest time, my impression has been that the optimizations came first and then someone tried to reverse-engineer an abstract machine from them. That does not work very well. I propose that this time we let the abstract machine come first.

Of course, this should be guided by desired optimizations, but the abstract machine defines which one of these we can really get.

Absolutely, that is an express goal of the UCG WG.

Fully agreed. All intrinsics should have a specification in terms of the Rust abstract machine.

Btw, mem::uninitialized is not implemented with an intrinsic any more.


Oh, so the following can actually happen when using jemalloc:

  • Allocate memory, deallocate it. MADV_FREE gets called.
  • Newly allocated memory reuses that storage. The new memory gets read.
  • The kernel decides to free the page.
  • Memory gets read again, giving a different result.
  • Memory gets written, now allocating a page again and persisting the result.

Is that correct?


Actually, an optimizing compiler will go “this guy is surely not reading the value of an uninitialized variable, so I will only consider the cases where that does not happen when checking if an optimization applies”. The compiler does not actively search for UB; it assumes it not to happen. The difference is that the compiler generally does not know if UB occurred, so it cannot warn you about it.

Not having to pick a consistent value for all uses of undef certainly helps. Indeed, if creating memory were non-deterministic, it would be effectful (non-determinism is a side effect), and constant propagation can only propagate “pure” computations. This is why freeze and constant propagation do not interact very well.

6 Likes

Definitely. Although I think in this particular case it’s way easier to justify the false with the behavior I described. Otherwise the “sensible” optimization would be to assume that x < 150 and x > 120 can’t both be false at the same time, and thus to collapse the logical OR expression into true.

I disagree. It is much simpler to do a pass over the entire IR and fold undef < 150 to false (this is a “peephole optimization”, very cheap to apply) than to do a range analysis that notices that this disjunction of two comparisons will always be true. Hence I am not surprised that the peephole optimization wins here.
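
To illustrate the difference in cost (a toy model, nothing like LLVM’s actual machinery): the peephole rule only has to look at one expression at a time, while the range-based rewrite has to reason about the disjunction as a whole.

```rust
// Toy IR and a peephole pass over it; all names and types are made up.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Undef,
    Const(bool),
    LtConst(Box<Expr>, u8), // e < k
    GtConst(Box<Expr>, u8), // e > k
    Or(Box<Expr>, Box<Expr>),
}

// Peephole: a comparison with an undef operand may be folded to an arbitrary
// boolean (we pick `false`, as LLVM happens to). Each fold is purely local;
// no reasoning about the surrounding `||` is needed.
fn peephole(e: Expr) -> Expr {
    match e {
        Expr::LtConst(x, _) | Expr::GtConst(x, _) if *x == Expr::Undef => Expr::Const(false),
        Expr::Or(a, b) => match (peephole(*a), peephole(*b)) {
            (Expr::Const(false), Expr::Const(false)) => Expr::Const(false),
            (a, b) => Expr::Or(Box::new(a), Box::new(b)),
        },
        other => other,
    }
}

fn main() {
    // undef < 150 || undef > 120
    let e = Expr::Or(
        Box::new(Expr::LtConst(Box::new(Expr::Undef), 150)),
        Box::new(Expr::GtConst(Box::new(Expr::Undef), 120)),
    );
    assert_eq!(peephole(e), Expr::Const(false));
}
```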

But anyway, we are discussing why LLVM makes certain optimization choices, but that’s off-topic for my post. I think we agree that what LLVM does is legal according to the spec, and that making it illegal would put an undue burden onto the optimizer? And explaining why it is legal requires talking about the abstract machine and considering “weird” memory that tracks initialization.

8 Likes

Pretty much. This is essentially a data race, which you can somewhat get around by using volatile loads. However, the “cute data structure” will definitely not work in such conditions.
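
For a concrete picture of that sequence, here is a sketch that triggers it directly with mmap/madvise instead of going through jemalloc (hypothetical code; Linux and the libc crate are assumed, and whether the two reads actually differ depends on when, or whether, the kernel reclaims the page):

```rust
use libc::{madvise, mmap, MADV_FREE, MAP_ANONYMOUS, MAP_FAILED, MAP_PRIVATE, PROT_READ, PROT_WRITE};
use std::ptr;

fn main() {
    unsafe {
        let len = 4096;
        // Map one anonymous page and dirty it ("allocate, write").
        let p = mmap(ptr::null_mut(), len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert_ne!(p, MAP_FAILED);
        let p = p as *mut u8;
        ptr::write_volatile(p, 42);

        // "Deallocate": tell the kernel it may reclaim the page lazily.
        madvise(p as *mut _, len, MADV_FREE);

        // Both reads are legal, but if the kernel reclaims the page between
        // them (under memory pressure), the second one observes a
        // zero-filled page while the first observed the old contents.
        let first = ptr::read_volatile(p);
        let second = ptr::read_volatile(p);
        println!("first read: {}, second read: {}", first, second);
    }
}
```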

1 Like

Wait, I didn’t say this was what actually happened :slight_smile: Only that it’d be easier to explain. At this point though, it doesn’t really matter much.

Yes, absolutely.

On Reddit someone linked to this Facebook presentation where they talk about actually running into this MADV_FREE issue in practice.

Wow. And here I was thinking I finally found a case where you need to consider the abstract machine as there is no other direct way to explain what is going on.^^

7 Likes

One thing I do not understand about that Facebook presentation is that they say, when the “removed” page is written to, “it is restored with its old content”. So they describe behavior that’s different from what I described above: they first read a 0 and later see a non-0 value. How can that be, given that new pages are 0-initialized? Where is that “old content” even being preserved, and why?

I think what happens is: after MADV_FREE, the page table points that range at a shared read-only page of zeros. When the page is written, the page fault handler allocates a distinct page of physical memory again, which happens to be the one it had before (or one that was never truly freed). That physical memory was never cleared.