`unordered` as a solution to "Bit-wise reasoning for atomic accesses"

Original thread: Bit-wise reasoning for atomic accesses

(Meta: It's annoying that the auto-lock for old threads blocks not only replying, but even "reply as linked topic".)

Somewhat recently, I discovered three things:

  1. I was partly mistaken about what optimizations LLVM could theoretically apply to Relaxed atomic loads.

    I was right that LLVM could legally coalesce multiple loads into one (assuming no potentially-aliasing stores or barriers in between), yet currently does not. IIRC, I looked at bytes and concluded that that optimization would probably be enough to fix the specific performance regression observed, although I don't remember the details.

    However, for the atomic load to be "zero-overhead" compared to a normal load, LLVM should also be able to remove unused loads entirely. And I recently discovered that doing so would be illegal for Relaxed, thanks to an edge case involving fences.

  2. LLVM has another ordering, unordered, which is even weaker than Relaxed (which it calls monotonic). It was designed for Java and is not currently exposed in Rust or C. Somehow I misread the spec and thought that unordered reads didn't respect happens-before but could return literally any value ever written to the memory address. In fact, they do respect happens-before; that makes sense, since even non-atomic loads respect happens-before. Derp. As such, unordered appears to be suitable for the bytes use case.

  3. For unordered loads, LLVM already performs the kinds of optimizations we want! See appendix.

In fact, @RalfJung suggested unordered in the original thread, but I didn't test it then because of 1 and 2 above.

Anyway, this is really encouraging. Exposing unordered as a new variant of Rust's atomic::Ordering seems like an easy step forward that, if I'm right, would let us fix the bytes UB immediately. It wouldn't solve all our problems in the area of "intentionally racy reads" – in particular, it wouldn't help with non-atomic SIMD loads – but nor does it seem likely to conflict with any solution to that problem.

One potential downside is that adding a new memory ordering could complicate the job of non-LLVM backends. Also, since C doesn't support unordered, it may be a problem for backends (whether LLVM-based or not) that compile to C – or to C-derived bytecodes, like WebAssembly. However:

  • unordered semantics are strictly weaker than Relaxed, so you can treat the new ordering the same way as Relaxed if you don't mind the slight loss of optimization potential.
  • It's not like C has some alternative way to get the same effect. The concern is only that Rust would be encouraging programmers to do something that C doesn't support, and Rust already does plenty of that; consider its approach to aliasing. (On the other hand, consider tail calls.)
  • The difference between unordered and Relaxed is mainly in high-level optimization semantics. So if you run optimizations before compiling to C/bytecode/etc., the loss will be lower or nonexistent. For C, you might not want to perform optimizations if you want the output to be human-readable, but oh well.
  • If you're compiling to C with GNU extensions, you can actually approximate the optimization semantics of an unordered load with an appropriately-marked asm block, or potentially with other hacks (__attribute__((pure))).

Thoughts?

Appendix: Testing LLVM optimizations on atomics

Playground link

With USE_UNORDERED = false:

playground::test_remove_unused_load:
	movl	(%rdi), %eax // The load is still here.
	retq

playground::test_coalesce_redundant_loads:
	movl	(%rdi), %ecx // Load #1.
	movl	(%rdi), %eax // Load #2.
	addl	%ecx, %eax // Add the two values.
	retq

With USE_UNORDERED = true:

playground::test_remove_unused_load:
	retq  // Do nothing, just return.

playground::test_coalesce_redundant_loads:
	movl	(%rdi), %eax // Only one load.
	addl	%eax, %eax // Add the result to itself.
	retq

Edit: improved appendix. Rust already supports unordered loads as an unstable intrinsic, no need to hack up LLVM IR.

13 Likes

Sounds good to me. Allowing new things like this is why the enum is #[non_exhaustive], and #[unstable] works on variants, so I'd say just make a PR and it can probably go in without much fuss.

2 Likes

I think it's reasonable to add a feature-gated variant to Ordering to be able to experiment with this some more.

We currently tell users that Rust's memory model is C11-like, and C11 does not have "unordered", so I'm not sure what the path of this feature towards stabilization would be. It would probably be unreasonable to block solving this problem on having a Rust memory model.

3 Likes

Good point. That said, LLVM does document its atomic orderings using much of the same terminology as C/C++11 ("happens-before", "synchronizes-with", etc.), so we don't need to totally uproot the foundations to accommodate unordered.

For reference, here is the LLVM language reference's definition of unordered:

unordered

The set of values that can be read is governed by the happens-before partial order. A value cannot be read unless some operation wrote it. This is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables. This ordering cannot be specified for read-modify-write operations; it is not strong enough to make them atomic in any interesting way.

But it's also governed by a general rule from elsewhere in the spec:

For each byte of a read R, Rbyte may see any write to the same byte, except:

  • If write1 happens before write2, and write2 happens before Rbyte, then Rbyte does not see write1.
  • If Rbyte happens before write3, then Rbyte does not see write3.

(Note: LLVM's version is written in terms of bytes, rather than the value as a whole, in order to accommodate mixed-width accesses. But C/C++11 doesn't allow mixed-width accesses to atomics, so we can ignore that for now.)

Let's try to translate those bullets back into C/C++11 terms:

Bullet 1

Here it is again, with the byte part omitted:

  • If write1 happens before write2, and write2 happens before R, then R does not see write1.

This is a subset of write-read coherence, defined by cppreference as:

Write-read coherence: if a side effect (a write) X on an atomic object M happens-before a value computation (a read) B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M

But they differ in the following case:

  • If write1 does not happen-before write2 but does precede it in the modification order (assuming both writes are non-unordered), and write2 happens before R, then can R see write1?

Write-read coherence prohibits it, but LLVM's definition does not. However, I think this is a mere oversight in the wording, because AFAICT the same issue would apply to all atomic reads, not just unordered ones. It would be good to get clarification on this.

Bullet 2

Again, LLVM's version with the byte part omitted:

  • If R happens before write3, then R does not see write3.

This is a subset of read-write coherence, defined by cppreference as:

Read-write coherence: if a value computation A of some atomic M (a read) happens-before an operation B on M (a write), then the value of A comes from a side-effect (a write) X that appears earlier than B in the modification order of M

They differ in the following case:

  • Can R see a write that comes later than write3 in the modification order? (Again, assuming that both writes are non-unordered.)

Here too, the problem seems to apply even when R is stronger than unordered, although in this case you might be able to argue that the term "modification order" implicitly prohibits such a thing. But if so, should that apply to unordered reads as well? Again, we should seek clarification.

How I'd like to document unordered

I'd like to find out that both of the edge cases I noted are oversights. In that case, unordered loads could be defined in Rust like so:

Unordered loads are like Relaxed loads but:

  • They are excluded from read-read coherence. That is, if the program performs two loads of the same variable, L1 and L2, such that L1 happens-before L2, L2 may observe a value that appears earlier in the modification order than the value observed by L1, if either L1 or L2 (or both) has Unordered ordering.

  • They are not subject to the same forward progress guarantees as other orderings. For example, the following loop may never terminate even if some other thread sets ready to true:

    fn wait_until_ready(ready: &AtomicBool) {
        while !ready.load(Unordered) {}
    }
    

(The second bullet comes from the fact that LLVM's reference documents the forward progress guarantee only under monotonic. This reflects an actual difference in the optimizations LLVM performs: it is willing to hoist atomic loads out of loops only if they are unordered.)

Unordered stores are trickier to define, to the point that it might be worth leaving them out entirely. Also, I'm not sure how theoretically useful they are. For loads, I know of two important compiler optimizations that can't be performed for relaxed atomic loads: removing unused loads, and hoisting loads out of loops. But for stores I don't know of any, though I haven't looked very hard yet. LLVM has no shortage of checks for isUnordered() scattered throughout the optimizer (for both loads and stores), but most of the optimizations seem like they would actually be valid for relaxed accesses as well.

Of course, it's also worth asking how practically useful unordered stores are. For those optimizations that might be valid for relaxed accesses but which LLVM currently only performs on unordered ones, how likely are they to improve the performance of real programs? I'd definitely like to include unordered stores in the initial unstable implementation, if only to help determine that.

6 Likes

I'd love to see this added.

FWIW, one more thing this unfortunately does not resolve is the entire "interaction with memory shared with untrusted parties" discussion. That requires some kind of "atomic volatile". I am bringing this up just to make sure we all agree that that is a separate issue that would still need solving even if we adopt this.


So, coming to the "unordered" proposal: My main problem with this is that while the C++ memory model is extensively studied, these extensions are not. Right now, as Rust uses the C++ memory model, we can rely on a whole lot of academic literature that provides us with a good foundation of things we know about that model. These are not just theoretical concerns; that academic work has uncovered several serious bugs in the definition of the memory model (the very first C++ model was entirely broken until Mark Batty fixed it; even then some issues remained, and the out-of-thin-air problem is unfixed to this day).

So, what is the situation as we go beyond C++?

"unordered"

There are some models coming out of work on Java that have "unordered", but the only one I know of that has both "unordered" (they call it "plain") as well as the other orderings and fences Rust already has is the Promising Semantics. I really like that paper, but at least the original model is incompatible with ARM or POWER (I forgot which) and there are also some optimizations one might want that it does not allow. It is also a big and subtle model so there are probably other issues. Oh, and it doesn't have non-atomic accesses at all; "plain" ("unordered") accesses are the weakest accesses they have.

The Promising Semantics are a great first step towards giving an operational specification to things like relaxed and unordered and solving the out-of-thin-air problem, but I do not think they are ready for adoption by a programming language.

LLVM

So, alternatively, we could say "we do whatever LLVM does".

The issue is, the LLVM memory model (to my knowledge) has seen no academic study at all. To be concrete, I am not aware of a single formally defined memory model with some reasonable theorems about it that follows the LLVM idea of making write-write races (two overlapping writes racing with each other, at least one of which is non-atomic) not UB but instead making them reset the affected memory to "undef"/"uninitialized".

I personally am not even convinced yet that it is possible to do the LLVM-style treatment of write-write races while also doing all the other optimizations. Remember that LLVM's model is axiomatic: it is specified as "give me a model that has all of the following properties" -- without any indication that such a model exists. This is the approach that leads to bugs like this when applied to things like pointer provenance; the bugs it causes in concurrency memory models will likely not be any less subtle.

I talked with some weak memory experts about LLVM's treatment of write-write races (I do not consider myself one) and we ran into some interesting questions about how to even define a data-race: in LLVM's world, where programs with data-races can go on and have well-defined semantics, the axiomatic C++ definition basically fails. It feels like you need an operational model to make progress here (and so far the Promising Semantics are the only operational model we have that handles relaxed accesses).

Conclusion

By adopting "unordered", we basically commit to doing whatever LLVM does in terms of concurrency. That is a huge risk. Prior experience from C++ and Java shows that naively defined concurrency memory models are full of flaws.

Moreover, we (in the UCG) are already regularly experiencing the pain caused by other parts of Rust that rely on LLVM semantics (such as integer-pointer casts or getelementptr inbounds). Unfortunately, LLVM has a history of not specifying the assumptions that optimizations and programs, respectively, may make (IR semantics are specified for the "happy case", but what exactly is and is not UB is frequently even less clear than in C/C++), and the developers are also not very willing to improve their docs in that regard. If we also rely on LLVM for the concurrency semantics, we add all of concurrency to the list of areas for which we will regularly be unable to give precise answers to reasonable questions.

On top of that, I think there is a significant risk that LLVM's model is unsound in subtle ways, meaning it does not actually permit all the optimizations that LLVM developers think it permits. For C++, the question "is that optimization correct" is at least mathematically well-defined and there is a bunch of literature proving the reorderings compilers do. Without having a formal model, we might not even be able to assign blame ("out of these 5 optimizations that together caused the bug, this is the one that is wrong"). And that is assuming that we even notice such a bug and are able to locate and minimize it, which is notoriously hard for concurrency issues.

That's why, the way things stand right now, I am opposed to leaving the scope of the C++ memory model.

Alternatives

From what I understand, while relaxed accesses can't be quite as aggressively optimized as @comex thought, they can still be optimized better than what LLVM is currently doing. So it might be worth trying to better exploit the freedom we already have, instead of changing the language.

18 Likes

I'm afraid after having read @RalfJung's comment I think we in fact need a mechanized formal model in Coq/Agda as well as a thesis before I would be OK with leaving the scope of C++'s memory model.

2 Likes

FYI the latest revision of the paper suggests that there is consensus on what the fixed semantics of relaxed ought to look like, but since that's a backward incompatible change that software and hardware vendors might need time to adapt to, it proposes to add a new memory order called load_store instead.

Note that my comment was about it going in unstable -- I think that's fine because it's a low-impact change to the compiler, and further conversations can happen around stabilization requirements.

I'd be OK with that provided some guards against it being used within the standard library and the compiler internally + disclaimers around having a path towards stabilization.

Your points are well taken.

To start with, I'm not proposing "we do whatever LLVM does". In my last post in the thread, I explored how "unordered" could be re-defined as an extension of the C++ memory model (while staying compatible with LLVM's unordered), and that's the basic direction I'd like to go in if this were to become a proper language feature. I think unordered should be a lot easier to formalize than, say, LLVM's approach to write-write races. :slight_smile:

...But then, it depends on the degree of formalization. A literal "mechanized formal model in Coq/Agda" is a lot more than we've required for, well, any other feature in the language, so it feels a little unfair to me that it would be a prerequisite.

Also, I disagree that making LLVM's optimizations of existing memory orderings more aggressive would be somehow 'easier' (and thus less in need of formalization) than defining a new memory ordering. I'd like to do both! But the two tasks involve similar kinds of mental reasoning, and come with risks of similar mistakes. Just as there's a risk of improperly defining unordered in the spec, there's a risk of adding LLVM optimizations that turn out to be unsound; indeed, I've already been mistaken in the past about what optimizations would be legal. In theory this could be prevented by a formal model, but LLVM has none, yet we still use it. A spec bug has the risk of complicating things indefinitely... but an implementation bug has the risk of breaking actual code now. I'm not sure which is worse.

But enough negativity. I might actually be interested in coming up with a Coq/Agda model, if only as a learning experience. Do you have any suggestions for existing C++ memory model formalizations that might serve as a good base to add unordered to?

Even if I succeed, though (no guarantees), I am not going to write an academic paper, let alone a thesis. If that's going to be the requirement, I guess there's not much point... :confused:

3 Likes

(I'm using "thesis" as a stand-in for "elaborate document explaining and detailing the findings" (associated with the mechanization).)

Fair. However I'm afraid I can't really give you feedback on your proposed definition. I find myself extremely skeptical of whole-program graph-based axiomatic models like C++11, and in fact I don't think I ever properly understood any of them. For example, I only understood release-acquire semantics when reading this paper which gives an operational semantics to a stronger variant of release-acquire, avoiding many problems caused by only having a whole-program semantics. (I strongly recommend this paper to anyone interested in weak memory concurrency. Yes, "strong release-acquire" does not exactly match C++ release-acquire but something stronger -- however this can easily be changed by a small tweak to the semantics as done here, and seeing something very much like release-acquire defined in a totally different style has been eye-opening for me.) Unfortunately, even modelling relaxed accesses operationally is very hard, and "unordered" probably makes this worse.

With the C++ model, it is impossible to do something like "given a program state, what changes in that state when you run these 5 instructions". There's no such thing as a "state" before or after every atomic instruction that gets altered. Instead you consider all possible executions of the entire program at once and then rule out most of them. (You may wonder, don't most of these executions have UB? Yes, and I am not aware of any formal work that handles the interaction between per-thread UB, data-race UB and the graph-based axiomatic semantics. Some people I know that work on weak memory are skeptical that this is even soundly possible. That just adds to my skepticism about axiomatic semantics.) I do not have an intuition for that way of thinking even for existing models, and trying to analyze changes to such a model is way beyond me I'm afraid. I hope some other concurrency experts can chime in and give you some feedback on your proposed model. :slight_smile:

Well, for "relaxed" we do have a few papers formally proving that a bunch of transformations are sound, in the context of at least a large fragment of C++ concurrency. So at least parts of that formalization have been done.

It is true that just because a transformation is sound for C++ "relaxed" does not mean it is sound for LLVM "relaxed", but given that the difference between the two is only about data races, I think if the transformation does not exploit data races being UB then the proof should carry over. It's at least way better than what we have for "unordered".

Wow, that's awesome. :smiley:

I am not doing weak memory concurrency research myself, so I don't have a great overview. Viktor's papers are a good start; there's other people working on this but I know his work well because we are at the same institute. :wink:

@Centril

Way too many users are already writing Rust code that does memory mapped I/O, inter-process communication, etc. and they are doing so with the tools that they have available, which aren't great (and often, just UB).

The moment we decided to stabilize atomic operations that lack a soundness proof we have already established that some level of risk is "ok", and we are just balancing the risk of maybe having to deprecate a particular Ordering in the future, with the problems that, to the best of our knowledge, that Ordering solves today.

Formal Rust models are already restricted in many ways, e.g., to programs that do not use concurrency, or that do not use certain orderings, like Relaxed (which has been proven to be unsound), so I don't think that adding a new Ordering carefully makes the situation catastrophically worse.

This doesn't mean that new ordering or intrinsics that interact with the memory model should be added recklessly, but rather that doing so should balance the risks they introduce with the value they add.

2 Likes

For atomics specifically, I think it does, and I argued as much above. We actually do have a lot of formal work for the large class of programs that uses all our concurrency features but stays within the boundaries of a single process.

This is not about a lack of a proof that the operation is implemented correctly by the compiler. This is about a lack of proper understanding of what the operation does. As prior experience shows, most of the time when specifying new concurrency semantics, the first attempt is not actually specifying what people think it does.

The bugs we are risking here are not compiler bugs. They are specification bugs. That's what worries me.

6 Likes

I understood that.

If we add such operations, code will be written that uses them, and that code would be at risk of not doing what we think it does (and maybe proven to have undefined behavior in the future). If we don't add such operations, such code that still must be written will just have undefined behavior today instead.

That's what worries me. I think that both options are bad, but I'm not sure which one is worse.

I'm with @RalfJung on this one. In this case we are not simply living with the existing feature-set; there is a desire to add a new and not-so-well-understood feature for which there are soundness and specifiability concerns (this feels familiar from https://github.com/rust-lang/rfcs/pull/2360), and it's not something we feel we know how to resolve in a specifiable and sound manner (async & noalias are sort of in this category, where there were some concerns but we felt we had paths forward). I don't feel comfortable with the risks here and would like to see an operational model. At some point "success, at all costs" has to end.

2 Likes

I see. Yes, that is getting to the core of the trade-off here: right now bytes has UB because we don't have "unordered".

OTOH, that is the only example of this that I am aware of. And once stable unordered atomics were available, many places that currently make do with Relaxed (because they are not quite as hot as bytes) would likely switch to unordered instead.

There's AFAIK no code that must be written with "unordered"; this is a correctness-performance trade-off for that code as well. And by adding "unordered" we are seemingly changing that trade-off for client code ("look, I can have performance and correctness"), while we are far from sure that this really helps in terms of correctness.

One worthwhile experiment here would be to add (unstable) "unordered" intrinsics and patch bytes to use those intrinsics instead of the current UB code, and then run the benchmarks that informed the decision to prefer UB over relaxed. The entire discussion here is moot if it doesn't actually help for the desired use-case.

7 Likes

Benchmarking bytes using those makes sense. The unordered intrinsics are already implemented on nightly (atomic_load_unordered in core::intrinsics), and @comex shows in the OP the differences in codegen for a synthetic example.

2 Likes

It also turns out that bytes might be entirely removing the feature that required the UB in the first place:

2 Likes