`unordered` as a solution to "Bit-wise reasoning for atomic accesses"

I'd love to see this added.

FWIW, one more thing this unfortunately does not resolve is the entire "interaction with memory shared with untrusted parties" discussion. That requires some kind of "atomic volatile". I am bringing this up just to make sure we all agree that that is a separate issue that would still need solving even if we adapt this.


So, coming to the "unordered" proposal: My main problem with this is that while the C++ memory model is extensively studied, these extensions are not. Right now, as Rust uses the C++ memory model, we can rely on a whole lot of academic literature that provides us with a good foundation of things we know about that model. These are not just theoretical concerns; that academic work has uncovered several serious bugs in the definition of the memory model (the very first C++ model was entirely broken until Mark Batty fixed it, and even then some issues remained and the out-of-thin-air problem remains unfixed until today).

So, what is the situation as we go beyond C++?

"unordered"

There are some models coming out of work on Java that have "unordered", but the only one I know of that has both "unordered" (they call it "plain") as well as the other orderings and fences Rust already has is the Promising Semantics. I really like that paper, but at least the original model is incompatible with ARM or POWER (I forgot which) and there are also some optimizations one might want that it does not allow. It is also a big and subtle model so there are probably other issues. Oh, and it doesn't have non-atomic accesses at all; "plain" ("unordered") accesses are the weakest accesses they have.

The Promising Semantics are a great first step to give an operational specification to things like relaxed and unordered and solving the out-of-thin-air problem, but I do not think they are ready for adaption by a programming language.

LLVM

So, alternatively, we could say "we do whatever LLVM does".

The issue is, the LLVM memory model (to my knowledge) has seen no academic study at all. To be concrete, I am not aware of a single formally defined memory model with some reasonable theorems about it that follows the LLVM idea of making write-write races (two overlapping writes racing with each other, at least one of which is non-atomic) not UB but instead making them reset the affected memory to "undef"/"uninitialized".

I personally am not even convinced yet that it is possible to do the LLVM-style treatment of write-write races while also doing all the other optimizations. Remember that LLVM's model is axiomatic: it is specified as "give me a model that has all of the following properties" -- without any indication that such a model exists. This is the approach that leads to bugs like this when applied to things like pointer provenance; the bugs it causes in concurrency memory models will likely not be any less subtle.

I talked with some weak memory experts about LLVM's treatment of write-write races (I do not consider myself one) and we ran into some interesting questions about how to even define a data-race: in LLVM's world, where programs with data-races can go on and have well-defined semantics, the axiomatic C++ definition basically fails. It feels like you need an operational model to make progress here (and so far the Promising Semantics are the only operational model we have that handles relaxed accesses).

Conclusion

By adapting "unordered", we basically commit to do whatever LLVM does in terms of concurrency. That is a huge risk. Prior experience from C++ and Java shows that naively defined concurrency memory models are full of flaws.

Moreover, we (in the UCG) are already regularly experiencing the pain caused by other parts of Rust that rely on LLVM semantics (such as integer-pointer casts or getelementptr inbounds). Unfortunately, LLVM has a history of not specifying the assumptions optimizations and programs may make, respectively (IR semantics are specified for the "happy case" but what exactly is and is not UB is frequently even less clear than in C/C++) and they are also not very willing to improve their docs in that regard. If we also rely on LLVM for the concurrency semantics, we add all of concurrency to the list of areas for which we will regularly be unable to give precise answers to reasonable questions.

On top of that, I think there is a significant risk that LLVM's model is unsound in subtle ways, meaning it does not actually permit all the optimizations that LLVM developers think it permits. For C++, the question "is that optimization correct" is at least mathematically well-defined and there is a bunch of literature proving the reorderings compilers do. Without having a formal model, we might not even be able to assign blame ("out of these 5 optimizations that together caused the bug, this is the one that is wrong"). And that is assuming that we even notice such a bug and are able to locate and minimize it, which is notoriously hard for concurrency issues.

That's why, the way things stand right now, I am opposed to leaving the scope of the C++ memory model.

Alternatives

From what I understand, while relaxed accesses can't be quite as aggressively optimized as @comex thought, they can still be optimized better than what LLVM is currently doing. So it might be worth to try and better exploit the freedom we have, instead of changing the language.

17 Likes

I'm afraid after having read @RalfJung's comment I think we in fact need a mechanized formal model in Coq/Agda as well as a thesis before I would be OK with leaving the scope of C++'s memory model.

2 Likes

FYI the latest revision of the paper suggests that there is consensus on what the fixed semantics of relaxed ought to look like, but since that's a backward incompatible change that software and hardware vendors might need time to adapt to, it proposes to add a new memory order called load_store instead.

Note that my comment was about it going in unstable -- I think that's fine because it's a low-impact change to the compiler, and further conversations can happen around stabilization requirements.

I'd be OK with that provided some guards against it being used within the standard library and the compiler internally + disclaimers around having a path towards stabilization.

Your points are well taken.

To start with, I'm not proposing "we do whatever LLVM does". In my last post in the thread, I explored how "unordered" could be re-defined as an extension of the C++ memory model (while staying compatible with LLVM's unordered), and that's the basic direction I'd like to go in if this were to become a proper language feature. I think unordered should be a lot easier to formalize than, say, LLVM's approach to write-write races. :slight_smile:

...But then, it depends on the degree of formalization. A literal "mechanized formal model in Coq/Agda" is a lot more than we've required for, well, any other feature in the language, so it feels a little unfair to me that it would be a prerequisite.

Also, I disagree that making LLVM's optimizations of existing memory orderings more aggressive would be somehow 'easier' (and thus less in need of formalization) than defining a new memory ordering. I'd like to do both! But the two tasks involve similar kinds of mental reasoning, and come with risks of similar mistakes. Just as there's a risk of improperly defining unordered in the spec, there's a risk of adding LLVM optimizations that turn out to be unsound; indeed, I've already been mistaken in the past about what optimizations would be legal. In theory this could be prevented by a formal model, but LLVM has none, yet we still use it. A spec bug has the risk of complicating things indefinitely... but an implementation bug has the risk of breaking actual code now. I'm not sure which is worse.

But enough negativity. I might actually be interested in coming up with a Coq/Agda model, if only as a learning experience. Do you have any suggestions for existing C++ memory model formalizations that might serve as a good base to add unordered to?

Even if I succeed, though (no guarantees), I am not going to write an academic paper, let alone a thesis. If that's going to be the requirement, I guess there's not much point... :confused:

3 Likes

(I'm using "thesis" as a stand-in for "elaborate document explaining and detailing the findings" (associated with the mechanization).)

Fair. However I'm afraid I can't really give you feedback on your proposed definition. I find myself extremely skeptical of whole-program graph-based axiomatic models like C++11, and in fact I don't think I ever properly understood any of them. For example, I only understood release-acquire semantics when reading this paper which gives an operational semantics to a stronger variant of release-acquire, avoiding many problems caused by only having a whole-program semantics. (I strongly recommend this paper to anyone interested in weak memory concurrency. Yes, "strong release-acquire" does not exactly match C++ release-acquire but something stronger -- however this can easily be changed by a small tweak to the semantics as done here, and seeing something very much like release-acquire defined in a totally different style has been eye-opening for me.) Unfortunately, even modelling relaxed accesses operationally is very hard, and "unordered" probably makes this worse.

With the C++ model, it is impossible to do something like "given a program state, what changes in that state when you run these 5 instructions". There's no such thing as a "state" before or after every atomic instruction that gets altered. Instead you consider all possible executions of the entire program at once and then rule out most of them. (You may wonder, don't most of these executions have UB? Yes, and I am not aware of any formal work that handles the interaction between per-thread UB, data-race UB and the graph-based axiomatic semantics. Some people I know that work on weak memory are skeptical that this is even soundly possible. That just adds to my skepticism about axiomatic semantics.) I do not have an intuition for that way of thinking even for existing models, and trying to analyze changes to such a model is way beyond me I'm afraid. I hope some other concurrency experts can chime in and give you some feedback on your proposed model. :slight_smile:

Well, for "relaxed" we do have a few papers formally proving that a bunch of transformations are sound, in the context of at least a large fragment of C++ concurrency. So at least parts of that formalization has been done.

It is true that just because a transformation is sound for C++ "relaxed" does not mean it is sound for LLVM "relaxed", but given that the difference between the two is only about data races, I think if the transformation does not exploit data races being UB then the proof should carry over. It's at least way better than what we have for "unordered".

Wow, that's awesome. :smiley:

I am not doing weak memory concurrency research myself, so I don't have a great overview. Viktor's papers are a good start; there's other people working on this but I know his work well because we are at the same institute. :wink:

@Centril

Way too many users are already writing Rust code that does memory mapped I/O, inter-process communication, etc. and they are doing so with the tools that they have available, which aren't great (and often, just UB).

The moment we decided to stabilize atomic operations that lack a soundness proof we have already established that some level of risk is "ok", and we are just balancing the risk of maybe having to deprecate a particular Ordering in the future, with the problems that, to the best of our knowledge, that Ordering solves today.

Formal Rust models are already restricted in many ways, e.g., to programs that do not use concurrency, or that do not use certain orderings, like Relaxed (which has been proven to be unsound), so I don't think that adding a new Ordering carefully makes the situation catastrophically worse.

This doesn't mean that new ordering or intrinsics that interact with the memory model should be added recklessly, but rather that doing so should balance the risks they introduce with the value they add.

2 Likes

For atomics specifically, I think it does, and I argued as much above. We actually do have a lot of formal work for the large class of programs that uses all our concurrency features but stays within the boundaries of a single process.

This is not about a lack of a proof that the operation is implemented correctly by the compiler. This is about a lack of proper understanding of what the operation does. As prior experience shows, most of the time when specifying new concurrency semantics, the first attempt is not actually specifying what people think it does.

The bugs we are risking here are not compiler bugs. They are specification bugs. That's what worries me.

5 Likes

I understood that.

If we add such operations, code will be written that uses them, and that code would be at risk of not doing what we think it does (and maybe proven to have undefined behavior in the future). If we don't add such operations, such code that still must be written will just have undefined behavior today instead.

That's what worries me. I think that both options are bad, but I'm not sure which one is worse.

I'm with @RalfJung on this one. In this case we are not simply living with the existing feature-set but there is a desire to add a new and not-so-well-undertood feature for which there are soundness and specifiability concerns (feels familiar from https://github.com/rust-lang/rfcs/pull/2360) and it's not something we at least feel we know we can resolve in a specifiable and sound manner (async & noalias is sorta in this category where there were some concerns but we felt we had paths forward). I don't feel comfortable with the risks here and would like to see an operational model. At some point "success, at all costs" has to end.

2 Likes

I see. Yes, that is getting to the core of the trade-off here: right now bytes has UB because we don't have "unordered".

OTOH, that is the only example of this that I am aware of. And when providing stable unordered atomics, likely many places that currently can use relaxed because they are not quite as hot as bytes is would switch to unordered instead.

There's AFAIK no code that must be written with "unordered"; this is a correctness-performance trade-off for that code as well. And by adding "unordered" we are seemingly changing that trade-off for client code ("look, I can have performance and correctness"), while we are far from sure that this really helps in terms of correctness.

One worthwhile experiment here would be to add (unstable) "unordered" intrinsics and patch bytes to use those intrinsics instead of the current UB code, and then run the benchmarks that informed the decision to prefer UB over relaxed. The entire discussion here is moot if it doesn't actually help for the desired use-case.

6 Likes

Benchmarking bytes using those makes sense. The unordered intrinsics are already implemented on nightly (https://doc.rust-lang.org/core/intrinsics/fn.atomic_load_unordered.html), and @comex shows in the OP the differences in codegen for a synthetic example.

1 Like

It also turns out that bytes might be entirely removing the feature that required the UB in the first place:

1 Like

Heh. That does remove some of the urgency, so to speak. Does anyone know of any other examples of Rust code that intentionally uses non-atomic loads instead of Relaxed ones [edit: in situations where a race is possible]?

Or even examples of code that does use Relaxed but could potentially go faster with Unordered. This includes loads that, after inlining, are:

  • Completely unused
    • Might be rare; I'm not really sure
  • Performed repeatedly in a loop, but would be okay to hoist out of the loop (i.e. the value is not expected to change during the loop, or the code doesn't care if it does)
    • Though hoisting is only possible if the compiler can prove there are no aliasing writes within the loop, which is often hard, especially with the noalias woes
  • Performed multiple times in succession, when one load would suffice
    • But this could also be done on Relaxed

None of the above low-likelihood examples appear to be worth the risk of adding a new memory ordering mode that lacks provable theoretical foundations.

The most recent tokio blog post includes snippets of code that do an unsync_load.

loop {
    let head = self.head.load(Acquire);

    // safety: this is the **only** thread that updates this cell.
    let tail = self.tail.unsync_load();

    if tail.wrapping_sub(head) < self.buffer.len() as u32 {
        // Map the position to a slot index.
        let idx = tail as usize & self.mask;

        // Don't drop the previous value in `buffer[idx]` because
        // it is uninitialized memory.
        self.buffer[idx].as_mut_ptr().write(task);

        // Make the task available
        self.tail.store(tail.wrapping_add(1), Release);

        return;
    }

    // The local buffer is full. Push a batch of work to the global
    // queue.
    match self.push_overflow(task, head, tail, global) {
        Ok(_) => return,
        // Lost the race, try again
        Err(v) => task = v,
    }
}

From the article I don't think Self::push_overflow writes self.tail, but it does "include stronger atomic operation [than Acquire]".

Thanks, but I just edited my last post to clarify – I meant non-atomic loads that can race. In that case, if the comment is correct, there are no potentially racing stores so no UB.