`unordered` as a solution to "Bit-wise reasoning for atomic accesses"

I think it's reasonable to add a feature-gated variant to Ordering to be able to experiment with this some more.

We currently tell users that Rust memory model is C11-like and C11 does not have "unordered", so I'm not sure what the path of this feature towards stabilization would be. It would probably be unreasonable to block solving this problem on having a Rust memory model.

3 Likes

Good point. That said, LLVM does document its atomic orderings using much of the same terminology as C/C++11 ("happens-before", "synchronizes-with", etc.), so we don't need to totally uproot the foundations to accommodate unordered.

For reference, here is the LLVM language reference's definition of unordered:

unordered

The set of values that can be read is governed by the happens-before partial order. A value cannot be read unless some operation wrote it. This is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables. This ordering cannot be specified for read-modify-write operations; it is not strong enough to make them atomic in any interesting way.

But it's also governed by a general rule from elsewhere in the spec:

For each byte of a read R, Rbyte may see any write to the same byte, except:

  • If write1 happens before write2, and write2 happens before Rbyte, then Rbyte does not see write1.
  • If Rbyte happens before write3, then Rbyte does not see write3.

(Note: LLVM's version is written in terms of bytes, rather than the value as a whole, in order to accommodate mixed-width accesses. But C/C++11 doesn't allow mixed-width accesses to atomics, so we can ignore that for now.)

Let's try to translate those bullets back into C/C++11 terms:

Bullet 1

Here it is again, with the byte part omitted:

  • If write1 happens before write2, and write2 happens before R, then Rbyte does not see write1.

This is a subset of write-read coherence, defined by cppreference as:

Write-read coherence: if a side effect (a write) X on an atomic object M happens-before a value computation (a read) B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M

But they differ in the following case:

  • If write1 does not happen-before write2 but does precede it in the modification order (assuming both writes are non-unordered), and write2 happens before R, then can R see write1?

Write-read coherence prohibits it, but LLVM's definition does not. However, I think this is a mere oversight in the wording, because AFAICT the same issue would apply to all atomic reads, not just unordered ones. It would be good to get clarification on this.

Bullet 2

Again, LLVM's version with the byte part omitted:

  • If R happens before write3, then R does not see write3.

This is a subset of read-write coherence, defined by cppreference as:

Read-write coherence: if a value computation A of some atomic M (a read) happens-before an operation B on M (a write), then the value of A comes from a side-effect (a write) X that appears earlier than B in the modification order of M

They differ in the following case:

  • Can R see a write that comes later than write3 in the modification order? (Again, assuming that both writes are non-unordered.)

Here too, the problem seems to apply even when R is stronger than unordered, although in this case you might be able argue that the term "modification order" implicitly prohibits such a thing. But if so, should that also apply to unordered reads as well? Again, we should seek clarification.

How I'd like to document unordered

I'd like to find out that both of the edge cases I noted are oversights. In that case, unordered loads could be defined in Rust like so:

Unordered loads are like Relaxed loads but:

  • They are excluded from read-read coherence. That is, if the program performs two loads of the same variable, L1 and L2, such that L1 happens-before L2, L2 may observe a value that appears earlier in the modification order than the value observed by L1, if either L1 or L2 (or both) has Unordered ordering.

  • They are not subject to the same forward progress guarantees as other orderings. For example, the following loop may never terminate even if some other thread sets ready to true:

    fn wait_until_ready(ready: &AtomicBool) {
        while !ready.load(Unordered) {}
    }
    

(The second bullet comes because LLVM's reference documents the progress guarantee under monotonic. This does reflect an actual difference in the optimizations LLVM performs: it's willing to hoist atomic loads out of loops only if they are unordered.)

Unordered stores are trickier to define, to the point that it might be worth leaving them out entirely. Also, I'm not sure how theoretically useful they are. For loads, I know of two important compiler optimizations that can't be performed for relaxed atomic loads: removing unused loads, and hoisting loads out of loops. But for stores I don't know of any, albeit I haven't looked very hard yet. LLVM has no shortage of checks for isUnordered() scattered throughout the optimizer (for both loads and stores), but most of the optimizations seem like they would actually be valid for relaxed accesses as well.

Of course, it's also worth asking how practically useful unordered stores are. For those optimizations that might be valid for relaxed accesses but which LLVM currently only performs on unordered ones, how likely are they to improve the performance of real programs? I'd definitely like to include unordered stores in the initial unstable implementation, if only to help determine that.

6 Likes

I'd love to see this added.

FWIW, one more thing this unfortunately does not resolve is the entire "interaction with memory shared with untrusted parties" discussion. That requires some kind of "atomic volatile". I am bringing this up just to make sure we all agree that that is a separate issue that would still need solving even if we adapt this.


So, coming to the "unordered" proposal: My main problem with this is that while the C++ memory model is extensively studied, these extensions are not. Right now, as Rust uses the C++ memory model, we can rely on a whole lot of academic literature that provides us with a good foundation of things we know about that model. These are not just theoretical concerns; that academic work has uncovered several serious bugs in the definition of the memory model (the very first C++ model was entirely broken until Mark Batty fixed it, and even then some issues remained and the out-of-thin-air problem remains unfixed until today).

So, what is the situation as we go beyond C++?

"unordered"

There are some models coming out of work on Java that have "unordered", but the only one I know of that has both "unordered" (they call it "plain") as well as the other orderings and fences Rust already has is the Promising Semantics. I really like that paper, but at least the original model is incompatible with ARM or POWER (I forgot which) and there are also some optimizations one might want that it does not allow. It is also a big and subtle model so there are probably other issues. Oh, and it doesn't have non-atomic accesses at all; "plain" ("unordered") accesses are the weakest accesses they have.

The Promising Semantics are a great first step to give an operational specification to things like relaxed and unordered and solving the out-of-thin-air problem, but I do not think they are ready for adaption by a programming language.

LLVM

So, alternatively, we could say "we do whatever LLVM does".

The issue is, the LLVM memory model (to my knowledge) has seen no academic study at all. To be concrete, I am not aware of a single formally defined memory model with some reasonable theorems about it that follows the LLVM idea of making write-write races (two overlapping writes racing with each other, at least one of which is non-atomic) not UB but instead making them reset the affected memory to "undef"/"uninitialized".

I personally am not even convinced yet that it is possible to do the LLVM-style treatment of write-write races while also doing all the other optimizations. Remember that LLVM's model is axiomatic: it is specified as "give me a model that has all of the following properties" -- without any indication that such a model exists. This is the approach that leads to bugs like this when applied to things like pointer provenance; the bugs it causes in concurrency memory models will likely not be any less subtle.

I talked with some weak memory experts about LLVM's treatment of write-write races (I do not consider myself one) and we ran into some interesting questions about how to even define a data-race: in LLVM's world, where programs with data-races can go on and have well-defined semantics, the axiomatic C++ definition basically fails. It feels like you need an operational model to make progress here (and so far the Promising Semantics are the only operational model we have that handles relaxed accesses).

Conclusion

By adapting "unordered", we basically commit to do whatever LLVM does in terms of concurrency. That is a huge risk. Prior experience from C++ and Java shows that naively defined concurrency memory models are full of flaws.

Moreover, we (in the UCG) are already regularly experiencing the pain caused by other parts of Rust that rely on LLVM semantics (such as integer-pointer casts or getelementptr inbounds). Unfortunately, LLVM has a history of not specifying the assumptions optimizations and programs may make, respectively (IR semantics are specified for the "happy case" but what exactly is and is not UB is frequently even less clear than in C/C++) and they are also not very willing to improve their docs in that regard. If we also rely on LLVM for the concurrency semantics, we add all of concurrency to the list of areas for which we will regularly be unable to give precise answers to reasonable questions.

On top of that, I think there is a significant risk that LLVM's model is unsound in subtle ways, meaning it does not actually permit all the optimizations that LLVM developers think it permits. For C++, the question "is that optimization correct" is at least mathematically well-defined and there is a bunch of literature proving the reorderings compilers do. Without having a formal model, we might not even be able to assign blame ("out of these 5 optimizations that together caused the bug, this is the one that is wrong"). And that is assuming that we even notice such a bug and are able to locate and minimize it, which is notoriously hard for concurrency issues.

That's why, the way things stand right now, I am opposed to leaving the scope of the C++ memory model.

Alternatives

From what I understand, while relaxed accesses can't be quite as aggressively optimized as @comex thought, they can still be optimized better than what LLVM is currently doing. So it might be worth to try and better exploit the freedom we have, instead of changing the language.

17 Likes

I'm afraid after having read @RalfJung's comment I think we in fact need a mechanized formal model in Coq/Agda as well as a thesis before I would be OK with leaving the scope of C++'s memory model.

2 Likes

FYI the latest revision of the paper suggests that there is consensus on what the fixed semantics of relaxed ought to look like, but since that's a backward incompatible change that software and hardware vendors might need time to adapt to, it proposes to add a new memory order called load_store instead.

Note that my comment was about it going in unstable -- I think that's fine because it's a low-impact change to the compiler, and further conversations can happen around stabilization requirements.

I'd be OK with that provided some guards against it being used within the standard library and the compiler internally + disclaimers around having a path towards stabilization.

Your points are well taken.

To start with, I'm not proposing "we do whatever LLVM does". In my last post in the thread, I explored how "unordered" could be re-defined as an extension of the C++ memory model (while staying compatible with LLVM's unordered), and that's the basic direction I'd like to go in if this were to become a proper language feature. I think unordered should be a lot easier to formalize than, say, LLVM's approach to write-write races. :slight_smile:

...But then, it depends on the degree of formalization. A literal "mechanized formal model in Coq/Agda" is a lot more than we've required for, well, any other feature in the language, so it feels a little unfair to me that it would be a prerequisite.

Also, I disagree that making LLVM's optimizations of existing memory orderings more aggressive would be somehow 'easier' (and thus less in need of formalization) than defining a new memory ordering. I'd like to do both! But the two tasks involve similar kinds of mental reasoning, and come with risks of similar mistakes. Just as there's a risk of improperly defining unordered in the spec, there's a risk of adding LLVM optimizations that turn out to be unsound; indeed, I've already been mistaken in the past about what optimizations would be legal. In theory this could be prevented by a formal model, but LLVM has none, yet we still use it. A spec bug has the risk of complicating things indefinitely... but an implementation bug has the risk of breaking actual code now. I'm not sure which is worse.

But enough negativity. I might actually be interested in coming up with a Coq/Agda model, if only as a learning experience. Do you have any suggestions for existing C++ memory model formalizations that might serve as a good base to add unordered to?

Even if I succeed, though (no guarantees), I am not going to write an academic paper, let alone a thesis. If that's going to be the requirement, I guess there's not much point... :confused:

3 Likes

(I'm using "thesis" as a stand-in for "elaborate document explaining and detailing the findings" (associated with the mechanization).)

Fair. However I'm afraid I can't really give you feedback on your proposed definition. I find myself extremely skeptical of whole-program graph-based axiomatic models like C++11, and in fact I don't think I ever properly understood any of them. For example, I only understood release-acquire semantics when reading this paper which gives an operational semantics to a stronger variant of release-acquire, avoiding many problems caused by only having a whole-program semantics. (I strongly recommend this paper to anyone interested in weak memory concurrency. Yes, "strong release-acquire" does not exactly match C++ release-acquire but something stronger -- however this can easily be changed by a small tweak to the semantics as done here, and seeing something very much like release-acquire defined in a totally different style has been eye-opening for me.) Unfortunately, even modelling relaxed accesses operationally is very hard, and "unordered" probably makes this worse.

With the C++ model, it is impossible to do something like "given a program state, what changes in that state when you run these 5 instructions". There's no such thing as a "state" before or after every atomic instruction that gets altered. Instead you consider all possible executions of the entire program at once and then rule out most of them. (You may wonder, don't most of these executions have UB? Yes, and I am not aware of any formal work that handles the interaction between per-thread UB, data-race UB and the graph-based axiomatic semantics. Some people I know that work on weak memory are skeptical that this is even soundly possible. That just adds to my skepticism about axiomatic semantics.) I do not have an intuition for that way of thinking even for existing models, and trying to analyze changes to such a model is way beyond me I'm afraid. I hope some other concurrency experts can chime in and give you some feedback on your proposed model. :slight_smile:

Well, for "relaxed" we do have a few papers formally proving that a bunch of transformations are sound, in the context of at least a large fragment of C++ concurrency. So at least parts of that formalization has been done.

It is true that just because a transformation is sound for C++ "relaxed" does not mean it is sound for LLVM "relaxed", but given that the difference between the two is only about data races, I think if the transformation does not exploit data races being UB then the proof should carry over. It's at least way better than what we have for "unordered".

Wow, that's awesome. :smiley:

I am not doing weak memory concurrency research myself, so I don't have a great overview. Viktor's papers are a good start; there's other people working on this but I know his work well because we are at the same institute. :wink:

@Centril

Way too many users are already writing Rust code that does memory mapped I/O, inter-process communication, etc. and they are doing so with the tools that they have available, which aren't great (and often, just UB).

The moment we decided to stabilize atomic operations that lack a soundness proof we have already established that some level of risk is "ok", and we are just balancing the risk of maybe having to deprecate a particular Ordering in the future, with the problems that, to the best of our knowledge, that Ordering solves today.

Formal Rust models are already restricted in many ways, e.g., to programs that do not use concurrency, or that do not use certain orderings, like Relaxed (which has been proven to be unsound), so I don't think that adding a new Ordering carefully makes the situation catastrophically worse.

This doesn't mean that new ordering or intrinsics that interact with the memory model should be added recklessly, but rather that doing so should balance the risks they introduce with the value they add.

2 Likes

For atomics specifically, I think it does, and I argued as much above. We actually do have a lot of formal work for the large class of programs that uses all our concurrency features but stays within the boundaries of a single process.

This is not about a lack of a proof that the operation is implemented correctly by the compiler. This is about a lack of proper understanding of what the operation does. As prior experience shows, most of the time when specifying new concurrency semantics, the first attempt is not actually specifying what people think it does.

The bugs we are risking here are not compiler bugs. They are specification bugs. That's what worries me.

5 Likes

I understood that.

If we add such operations, code will be written that uses them, and that code would be at risk of not doing what we think it does (and maybe proven to have undefined behavior in the future). If we don't add such operations, such code that still must be written will just have undefined behavior today instead.

That's what worries me. I think that both options are bad, but I'm not sure which one is worse.

I'm with @RalfJung on this one. In this case we are not simply living with the existing feature-set but there is a desire to add a new and not-so-well-undertood feature for which there are soundness and specifiability concerns (feels familiar from https://github.com/rust-lang/rfcs/pull/2360) and it's not something we at least feel we know we can resolve in a specifiable and sound manner (async & noalias is sorta in this category where there were some concerns but we felt we had paths forward). I don't feel comfortable with the risks here and would like to see an operational model. At some point "success, at all costs" has to end.

2 Likes

I see. Yes, that is getting to the core of the trade-off here: right now bytes has UB because we don't have "unordered".

OTOH, that is the only example of this that I am aware of. And when providing stable unordered atomics, likely many places that currently can use relaxed because they are not quite as hot as bytes is would switch to unordered instead.

There's AFAIK no code that must be written with "unordered"; this is a correctness-performance trade-off for that code as well. And by adding "unordered" we are seemingly changing that trade-off for client code ("look, I can have performance and correctness"), while we are far from sure that this really helps in terms of correctness.

One worthwhile experiment here would be to add (unstable) "unordered" intrinsics and patch bytes to use those intrinsics instead of the current UB code, and then run the benchmarks that informed the decision to prefer UB over relaxed. The entire discussion here is moot if it doesn't actually help for the desired use-case.

6 Likes

Benchmarking bytes using those makes sense. The unordered intrinsics are already implemented on nightly (https://doc.rust-lang.org/core/intrinsics/fn.atomic_load_unordered.html), and @comex shows in the OP the differences in codegen for a synthetic example.

1 Like

It also turns out that bytes might be entirely removing the feature that required the UB in the first place:

1 Like

Heh. That does remove some of the urgency, so to speak. Does anyone know of any other examples of Rust code that intentionally uses non-atomic loads instead of Relaxed ones [edit: in situations where a race is possible]?

Or even examples of code that does use Relaxed but could potentially go faster with Unordered. This includes loads that, after inlining, are:

  • Completely unused
    • Might be rare; I'm not really sure
  • Performed repeatedly in a loop, but would be okay to hoist out of the loop (i.e. the value is not expected to change during the loop, or the code doesn't care if it does)
    • Though hoisting is only possible if the compiler can prove there are no aliasing writes within the loop, which is often hard, especially with the noalias woes
  • Performed multiple times in succession, when one load would suffice
    • But this could also be done on Relaxed

None of the above low-likelihood examples appear to be worth the risk of adding a new memory ordering mode that lacks provable theoretical foundations.