The RFC that you pointed to essentially went in the opposite direction: it proposed to remove some of the `Acquire`/`Release`/`AcqRel` nuance. Personally, I like this nuance (again, it is very useful as a beginner lint and a review aid), so I would be unhappy about that.
`SeqCst` does add a guarantee on top of `Acquire`/`Release`; it is just that this guarantee is more obscure than most people think and rarely needed in practice.
To explain what `SeqCst` does, let's first step back and look at what guarantees `Relaxed` actually provides us. It is very common for popularizations to summarize the `Relaxed` atomic ordering as "no ordering guarantee", but that is actually incorrect.
At the hardware level, cache-coherent CPUs guarantee that all threads agree on a single order for writes to a given memory location. This means that e.g. this execution is forbidden:
```rust
// Thread 1
a.store(0, Hardware); // This fictitious ordering rigorously follows HW semantics
// Thread 2
a.store(1, Hardware);
// Thread 3
a.load(Hardware); // returns 0
a.load(Hardware); // returns 1, i.e. thread 1's write to a happened before
                  // thread 2's write from thread 3's perspective.
// Thread 4 (violates cache coherence w.r.t. thread 3)
a.load(Hardware); // returns 1
a.load(Hardware); // returns 0, i.e. thread 2's write to a happened before
                  // thread 1's write from thread 4's perspective.
```
What `Relaxed` atomic operations do is replicate the hardware guarantee of cache coherence at the language level. They do so by, among other things, forbidding compiler optimizers from reordering relaxed memory operations on a given variable with respect to each other. This is why LLVM calls this ordering `Monotonic`, which I personally agree is a more descriptive name.
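As a concrete illustration of this per-variable coherence guarantee, here is a small runnable sketch of my own (not from any proposal): one thread stores increasing values into a single atomic with `Relaxed`, and an observer can never see those values go backwards.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::thread;

/// Returns true if every observed value of `a` was >= the previous one.
/// Cache coherence (which `Relaxed` exposes at the language level)
/// guarantees this is always the case for a single memory location.
fn observations_are_monotonic() -> bool {
    let a = AtomicUsize::new(0);
    thread::scope(|s| {
        s.spawn(|| {
            for i in 1..=10_000 {
                a.store(i, Relaxed); // writes to `a` in increasing order
            }
        });
        let mut prev = 0;
        let mut monotonic = true;
        for _ in 0..10_000 {
            let v = a.load(Relaxed);
            monotonic &= v >= prev; // a value going backwards would violate coherence
            prev = v;
        }
        monotonic
    })
}
```

Note that nothing here synchronizes the observer with the writer; the guarantee is only about the order of values seen on `a` itself.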
Now, this ordering guarantee is all we need as long as we are only manipulating a single shared memory location. But no such guarantee exists when manipulating multiple memory locations. For example, this execution is allowed:
```rust
// Thread 1
a_1.store(0, Relaxed);
a_2.store(0, Relaxed);
a_1.store(1, Relaxed);
a_2.store(1, Relaxed);
// Thread 2
a_2.load(Relaxed); // returns 1
a_1.load(Relaxed); // returns 0, this is allowed by Relaxed.
```
In practice, it can happen as a result of hardware or compilers reordering the store to `a_1` after the store to `a_2`, or reordering the load from `a_1` before the load from `a_2`.
This is where the other atomic memory orderings come into play. In this particular case, as in all cases where there is only a single writer thread at a time, the reordering can be forbidden with an `Acquire`/`Release` pair if it is undesirable...
```rust
// Thread 1
a_1.store(0, Relaxed);
a_2.store(0, Relaxed);
a_1.store(1, Relaxed);
a_2.store(1, Release);
// Thread 2
a_2.load(Acquire); // returns 1
a_1.load(Relaxed); // returning 0 is not allowed here. If the write of 1 to
                   // a_2 happened, then all previous writes from thread 1
                   // also happened from this thread's perspective.
```
...however, if there are 2+ writer threads and 2+ atomic variables, `Acquire`/`Release` is no longer enough to guarantee that all threads agree on a single write order:
```rust
// Thread 1
a_1.store(0, Relaxed);
a_1.store(1, Release);
a_2.load(Acquire); // May return 0, i.e. a_2.store(1) has not happened
                   // yet from thread 1's perspective.
// Thread 2
a_2.store(0, Relaxed);
a_2.store(1, Release);
a_1.load(Acquire); // May return 0 as well, i.e. a_1.store(1) has not
                   // happened yet from thread 2's perspective.
```
The reason why this can happen is that hardware and compilers are allowed to reorder `Acquire` loads before `Release` stores. From the "acquire is like acquiring a mutex and release is like releasing a mutex" perspective, this is equivalent to moving extra reads and writes into a mutex-protected region, which is fine.
Now, for most synchronization protocols, this limitation of `Acquire`/`Release` does not matter, because synchronization transactions are made visible to other threads through a single atomic store or RMW operation that systematically targets the same atomic variable. As a result, cache coherence on that variable gives you all the total store ordering that you need "for free".
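Here is a sketch of such a single-variable protocol (my own illustrative example): a writer prepares some data and then announces it with one `Release` store to a single flag; any reader that observes the flag with `Acquire` is guaranteed to also see the data.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering::{Acquire, Relaxed, Release}};
use std::thread;

/// Classic message-passing: the whole transaction becomes visible through
/// a single Release store to `ready`, so Acquire/Release is all we need.
fn publish_and_read() -> usize {
    let data = AtomicUsize::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        s.spawn(|| {
            data.store(42, Relaxed);    // prepare the payload
            ready.store(true, Release); // single store publishes the transaction
        });
        while !ready.load(Acquire) {}   // spin until the flag is observed
        data.load(Relaxed)              // guaranteed to return 42
    })
}
```

Since only one atomic variable (`ready`) ever carries the synchronization, cache coherence on that variable is enough to give all threads a consistent view, and `SeqCst` would buy nothing here.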
But a few complicated synchronization protocols whose correctness is hard to prove require manipulating multiple atomic variables per transaction. In this case, `SeqCst` may be needed. What `SeqCst` guarantees is that, given a set of atomic variables `a_i`...

- If all stores to the `a_i` variables are `SeqCst`...
- ...then all `SeqCst` loads from the `a_i` variables agree on a single store order.
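To make this concrete, here is the earlier two-writer example rewritten with `SeqCst` everywhere (a runnable sketch of my own). Because all four operations now belong to a single total order that every thread agrees on, the outcome where both loads return 0 is forbidden:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use std::thread;

/// Store-buffering litmus test with SeqCst everywhere: since all threads
/// agree on a single order of the stores, at least one thread must
/// observe the other's store of 1, so (0, 0) can never be returned.
fn store_buffering() -> (usize, usize) {
    let a_1 = AtomicUsize::new(0);
    let a_2 = AtomicUsize::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| {
            a_1.store(1, SeqCst);
            a_2.load(SeqCst) // may only return 0 if thread 2's load returns 1
        });
        let t2 = s.spawn(|| {
            a_2.store(1, SeqCst);
            a_1.load(SeqCst) // may only return 0 if thread 1's load returns 1
        });
        (t1.join().unwrap(), t2.join().unwrap())
    })
}
```

With `Release`/`Acquire` instead of `SeqCst`, `(0, 0)` would be a legal outcome, as shown above.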
From a hardware/compiler semantics point of view, this is almost the same as using `Release` stores and `Acquire` loads, except that a `SeqCst` load from a variable `a_1` can't be reordered before a `SeqCst` store to a different variable `a_2` within a given thread. On most hardware, enforcing this property requires an expensive memory barrier instruction.
It has been claimed before that this guarantee is not implementable at all on some hardware like Power. Personally, I am not familiar enough with the Power memory model to judge this. All I know is the language-level semantics which `SeqCst` is intended to provide.
I hope this clarifies what extra guarantees `SeqCst` actually provides with respect to `Acquire`/`Release`, and why these guarantees are less frequently needed than most people think.
As for this proposal not matching the C++ memory model: all the nuances of `SeqCst` that I am proposing are strictly stronger than what C++'s `memory_order_seq_cst` provides, by virtue of eliminating an ambiguity about memory ordering without relaxing any other constraint.
Therefore, you can't introduce a synchronization bug by porting an algorithm that uses C++'s `memory_order_seq_cst` to these more specific orderings. At worst, your program will panic because you asked for an impossible memory ordering, which would have been a hidden bug in C++.
Conversely, any program using the proposed `SeqCst(Acq|Rel|AcqRel)` orderings which does not panic due to an incorrect memory ordering has the same meaning if reimplemented using `memory_order_seq_cst` in C++.