Generic `Atomic<T>`

Indeed, this is explicitly out of scope for the initial plan (link to ACP). API stability is an interesting question, but your intuition is correct that C-style enums with #[repr(Int)] should be fine for atomics. See the Swift atomic protocols (traits) linked above for what this might look like. I'd love to have this in the future.

The initial perma-unstable trait is AtomicPrimitive, but in the future AsAtomic may become the better name. I already have some sketches heading in that direction; this is just the first step, hopefully.

With a small caveat: on targets that have word-sized atomic bitand/bitor but no byte-sized atomics (e.g. RISC-V), we change the library implementation of AtomicBool to exploit the fact that it is effectively an i1 rather than an i8, so LLVM can lower the operations to native atomics instead of an LL/SC loop.

This is basically the solution to ABA issues as well: atomically swap a (*mut T, usize) pair and bump the counter on each swap[1].

If we support this, I think it'd be supported at least for any (T, U) where we guarantee the layout of (T, U). Currently that's no tuples, but there's discussion of maybe guaranteeing layout (but not ABI) equivalence between (T, T) and [T; 2], and I personally think that extending that to (T, U) where the types are layout equivalent[2] makes sense as well.
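
To make the counter scheme above concrete, here is a minimal sketch using only what exists on stable today: a 32-bit index plus a 32-bit version counter packed into an AtomicU64, standing in for the hypothetical Atomic<(*mut T, usize)>. The TaggedHead type and try_swap method are invented for this example.

use std::sync::atomic::{AtomicU64, Ordering};

// High 32 bits: version counter; low 32 bits: slot index.
struct TaggedHead {
  packed: AtomicU64,
}

impl TaggedHead {
  // Returns (counter, index) as last observed.
  fn load(&self) -> (u32, u32) {
    let v = self.packed.load(Ordering::Acquire);
    ((v >> 32) as u32, v as u32)
  }

  // Replace the observed (counter, index) with new_index, bumping the counter.
  // A stale observation (same index, different generation) no longer compares
  // equal, which is exactly what defeats ABA.
  fn try_swap(&self, observed: (u32, u32), new_index: u32) -> bool {
    let (counter, index) = observed;
    let old = ((counter as u64) << 32) | index as u64;
    let new = ((counter.wrapping_add(1) as u64) << 32) | new_index as u64;
    self.packed
      .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
      .is_ok()
  }
}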


  1. It's a probabilistic fix to a probabilistic issue, technically, since other threads could theoretically wrap the entire usize space before putting in the same-but-different pointer. Rust has already decided that aborting in such degenerate cases is the usual accepted response to maintain soundness. (E.g. std does so for reference count overflow or exhausting thread IDs. Hammering a single atomic isize::MAX times is a bit more achievable but still an erroneous edge case imo.) Swift's protocols call the raw atomic repr WordPair. ↩︎

  2. I don't know what the exact policy is on trait impls differing based on cfg outside std::os's extension traits, though. cfg(target_has_atomic) gets a minor pass in that it's true for any std-having target, up to "ptr" at least. But this runs into the fun question of isize being both C's intptr_t and size_t again, unfortunately. ↩︎

1 Like

FYI the tracking issue and step one of implementation have been created.

3 Likes

Regarding uninitialized bytes, see this unsafe-code-guidelines discussion:

Operational semantics for AtomicMaybeUninit::compare_exchange

My biased summary:

C++ has committed to being able to do "compare-exchange while ignoring padding bits". Which means LLVM theoretically should be able to do it soundly (and I think theory meets reality in this case, unlike certain others). Which means we should be able to offer it too.

In practice this looks like a compare-exchange loop: if the comparison fails only because of padding-bit differences, it retries with the actual padding bits read from memory. Hardware-wise, this should eventually succeed. However, the algorithm isn't kosher from an abstract memory model perspective. That's not a showstopper, though: for opsem, we can treat the entire loop as one opaque "compare-exchange ignoring specified bits" operation. Likewise, the stable primitive offered by libcore should be that opaque operation, matching C++'s approach. The loop should be an implementation detail, and there shouldn't be a stable guarantee that naively implementing the loop yourself works correctly.
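
As a rough sketch of what that loop could look like (the function name, the u32-based representation, and the ignore_mask parameter are all invented for illustration; the point above is precisely that only the opaque operation, not this loop, would be stably guaranteed):

use std::sync::atomic::{AtomicU32, Ordering};

// Compare-exchange on a 32-bit slot while ignoring the bits set in
// ignore_mask (e.g. padding bits). Not a real std API.
fn compare_exchange_ignoring(
  slot: &AtomicU32,
  mut expected: u32,
  new: u32,
  ignore_mask: u32,
) -> Result<u32, u32> {
  loop {
    match slot.compare_exchange_weak(expected, new, Ordering::SeqCst, Ordering::SeqCst) {
      Ok(prev) => return Ok(prev),
      Err(observed) => {
        if (observed & !ignore_mask) == (expected & !ignore_mask) {
          // Only ignored bits differ (or the failure was spurious): adopt the
          // observed ignored bits and retry.
          expected = (expected & !ignore_mask) | (observed & ignore_mask);
        } else {
          return Err(observed);
        }
      }
    }
  }
}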

3 Likes

By the way, I think we should go beyond C++ and offer a primitive that allows the ignored-bits mask to be dynamic. We should also say that, even if you don’t ignore a bit which turns out to be uninit in memory, invoking the compare-exchange primitive isn’t UB. The uninit bits just potentially (nondeterministically) cause the compare-exchange to fail. This guarantee can be limited to types that actually allow uninit bits in the positions that turn out to be uninit.

Why is the above useful? Because it would make it possible to soundly compare-exchange enum values, even if they have variant-dependent padding bits. A compare-exchange on an enum value would lower to a masked compare-exchange where the ignored bits are the padding bits of the expected variant. If the actual value in memory is also that variant, then the padding bits are ignored and everything works as expected. But if the actual value is a different variant, then it may have uninit bits in non-ignored positions. So for this approach to be sound, the compare-exchange can’t be automatic UB just because of the uninit bits. But it doesn’t matter whether those bits are considered matching or not; the compare-exchange should fail regardless due to the non-matching discriminant (whose bits wouldn’t be ignored).
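
To make that lowering concrete, here is a hypothetical example; the Slot enum and the ignore_mask helper are invented, and the byte offsets follow the documented repr(C, u8) layout (tag byte first, payload union at its natural alignment):

// A 4-byte enum whose padding depends on the variant: byte 0 is the tag, and
// the payload union starts at byte 2 because of the u16's alignment.
#[repr(C, u8)]
#[derive(Clone, Copy)]
enum Slot {
  Small(u8),  // payload in byte 2; bytes 1 and 3 are padding
  Large(u16), // payload in bytes 2..4; byte 1 is padding
}

// Per-byte "ignore" mask chosen from the *expected* value's variant
// (true = the CAS ignores that byte).
fn ignore_mask(expected: Slot) -> [bool; 4] {
  match expected {
    Slot::Small(_) => [false, true, false, true],
    Slot::Large(_) => [false, true, false, false],
  }
}

If memory actually holds a Small value while expected is Large, byte 3 is compared even though it may be uninit in memory; per the proposal that must not be UB, and the compare-exchange fails anyway because the tag bytes differ.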

8 Likes

AFAIK this isn't implemented yet; see here and here.

Also, I am not aware of any specification of this operation that satisfies liveness / forward progress.

I think we should figure this out for padding-free types before we try to handle types with padding. I'd also exclude unions initially just to avoid that source of potential hassle. Basically, only fieldless enums and structs without padding. Let's walk before we run, shall we?

4 Likes

You need the loop to work eventually to be able to use compare_exchange_weak.

1 Like

Not if compare_exchange_weak contains a loop itself.

Yes, that would result in slightly suboptimal code generation. But libstdc++, libc++, and Microsoft's STL all follow this approach for types with padding. (Though in Microsoft's case that's just because they implement compare_exchange_weak as compare_exchange_strong.)

Well, a few months ago libc++ did implement atomic_ref in a way that uses __builtin_clear_padding if it's available. But Clang still doesn't implement __builtin_clear_padding, so right now this only works properly with GCC. That will need to change before they can claim conformance. However, my guess is that there won't be any backend work needed for conformance; it won't be necessary to adjust the optimizer because right now the optimizer doesn't even perform basic optimizations on atomics. But I admit I'm speculating.

Maybe this is getting off-topic, but isn't this solved by defining the AM operation as "compare-exchange while ignoring a specified set of bits" as I suggested? If you assume such an operation exists, then forward progress analysis for programs using it should be comparable to the no-padding case.

The ability of the implementation to guarantee forward progress would need to rely on multiple implementation details, such as LLVM not optimizing cmpxchg on undef to always return false, and the OS or hardware not constantly changing the contents of uninitialized memory. (The latter is not violated by Linux's MADV_FREE because that modifies each page at most once.) Those implementation requirements may change in the future, but LLVM should always have some way to implement the operation, or else C++'s atomic_ref will break.

Fine by me.

1 Like

I would describe it as something like a CAS version of a SIMD masked load on a u8 vector (I think we only need byte-level precision) -- we never even load the other bytes so if they are uninit it doesn't matter. That avoids having to talk about freeze.

However, even then the implementation as a loop does not have the same liveness properties as the AM semantics: if one thread tries to do such a CAS while another thread constantly does atomic writes with random values to the same memory, then in the AM the first thread would be guaranteed to terminate. However, a loop-based implementation (even if we only introduce the loop in machine IR, entirely avoiding all optimization-related questions) could make the first thread loop forever.

IOW, this pseudocode would always terminate under AM semantics (assuming a fair scheduler), but the produced binary could fail to terminate:

static X: AtomicU32 = AtomicU32::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

thread::scope(|s| {
  s.spawn(|| {
    do_a_masked_cas(&X); // the hypothetical masked-CAS primitive
    set_flag(&FLAG);     // FLAG.store(true, Ordering::Release)
  });
  s.spawn(|| {
    while !get_flag(&FLAG) {  // FLAG.load(Ordering::Acquire)
      write_random(&X);       // relaxed atomic store of a random value
    }
  });
});

So no, this does not solve the liveness concerns.

Also, a static mask indicating which bytes to compare and which bytes to ignore would not suffice to permit CAS on enums with fields, as for them the padding mask depends on the enum discriminant.

Good point.

It's not an entirely novel issue though. Architectures that implement CAS based on a load-reserved/store-conditional loop have a similar problem. Store-conditional will generally fail if there have been any intervening writes to the memory location since the load-reserved, even writes of the same value that was already present. This means that if one thread stores the same value over and over, and another thread does a single compare_exchange with that value as the expected value, the second thread can theoretically be blocked indefinitely.

In both situations, you do have some thread making progress, but you can't guarantee that all threads will make progress.

Architectures that use LR/SC include RISC-V, POWER, ARM before v8.1, and LoongArch64, so it's a good chunk of Rust's target list.

Yeah, that's what I brought up in this post. For CAS on enums to work, the mask must be allowed to be dynamic, and comparisons against non-masked-out uninit bits must not be UB (though they can behave nondeterministically). Actually, the C++ spec seems to imply that such comparisons are not UB, though it's hard to tell because the example seems to have a mistake (the specific values used in the example have no uninit bits).

Ah, fair. I guess there is the possibility that hardware guarantees that LL/SC (I know this as "load-link/store-conditional", but both names seem to be in use) will eventually succeed, but I don't know whether actual hardware makes such guarantees.

This is even worse than enums; it's about unions. It's also kind of useless from a liveness perspective, as it explicitly says the CAS might never succeed (even if there are no concurrent writes)...

I don't think C++ is a good model to copy here. But at this point we're just re-hashing the exact discussion we already had in

I'm citing that passage as an analogue for the case where there are uninit bits that are not masked out. For enums, I'm assuming the compiler-generated CAS implementation would dynamically choose a mask based on the current variant of the 'expected' value. So you would only encounter non-masked-out uninit bits if the enum variant does not match, in which case the CAS should fail anyway, if only due to the discriminant being unequal. All that is needed is for the CAS not to be UB.

That's a separate issue from liveness when bits are masked out.

FWIW, in the version of the Arm Architecture Reference Manual for A-profile architecture (DDI 0487) I have to hand, there's a forward progress guarantee for LL/SC in AArch64 mode on pages B2-314 and B2-315, and the same for AArch32 mode on pages E2-9643 and E2-9644.

The revision I have is Issue K.a, dated 20 March 2024, and applies to all 64-bit ARM processors; I believe that if you dug out an older manual (e.g. for ARMv7-A), you'd find similar language, and I can't find anything in the R-profile supplement that contradicts this for the R-profile architecture.

You can find similar language in the v8 M-profile architecture reference manual, too, but not the v7 M-profile reference.

LoadExcl/StoreExcl loops are guaranteed to make forward progress only if, for any LoadExcl/StoreExcl loop within a single thread of execution, the software meets all of the following conditions: [...]

Oh wow, I hadn't actually expected this! Nice :smiley:

However...

However, it is permissible for the LoadExcl/StoreExcl loop not to make forward progress if a different thread is repeatedly doing any of the following in a tight loop:

I don't know all the acronyms involved here, but "Performing stores to a PA covered by the Exclusives monitor" sounds like that would apply to examples like my code above?

1 Like

Yeah, it would apply.

RISC-V has something similar, with similar limitations (emphasis added, and note that "hart" means "hardware thread"):

As a consequence of the eventuality guarantee, if some harts in an execution environment are executing constrained LR/SC loops, and no other harts or devices in the execution environment execute an unconditional store or AMO to that reservation set, then at least one hart will eventually exit its constrained LR/SC loop. By contrast, if other harts or devices continue to write to that reservation set, it is not guaranteed that any hart will exit its LR/SC loop.

1 Like

My understanding of the guarantee is that it's intended to allow an implementation to guarantee forward progress by only checking the monitor state at load exclusive time, not at any other time.

Thus, if you have stores to a physical address without a matching load-exclusive, you're not guaranteed forward progress: you need the load-exclusive/store-exclusive pair to guarantee progress. If your code implements write_random with an ordinary store, then progress is not guaranteed; if it notes the "atomicness" of X and uses an LDXR/STXR pair, then the guarantee holds.

1 Like

write_random has to do an atomic store to avoid UB, but I would expect that a relaxed atomic store becomes just a regular store in assembly.

So that sounds like there's already no liveness guarantee for compare-and-swap on RISC-V. The operation itself may loop infinitely under certain conditions.

On AArch64, you'd need a "dead" LDXR before doing the atomic store in write_random; the progress guarantee is phrased carefully so that the only time an implementation must check that it's not impeding progress is on LDXR (although it's always allowed to do better).

I assume the standard compilation scheme for a relaxed write does not involve such a LDXR, so the example above (of one thread hammering relaxed writes while the other thread does a single CAS) would then be an example of Rust-level liveness being violated on aarch64.

I have just encountered a much less esoteric, concrete reason to want Atomic<(X, Y)> where X and Y are both pointer-sized but not the same type: fat pointers. Atomic<&T> where T: Sized can be implemented on top of the existing AtomicPtr. Atomic<&[T]>, Atomic<&str>, and Atomic<&dyn Trait> can't, but as long as the extra bulk is no more than one additional pointer's worth of bits, they should be implementable on top of AtomicDoubleword.
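
As a sketch of the decomposition this would rest on (shown for slices; the double-word atomic itself is still hypothetical), the fat pointer splits cleanly into two word-sized halves:

// A &[u8] is (data pointer, length): two word-sized halves that a
// double-word atomic could store and compare-exchange together.
fn split(s: &[u8]) -> (*const u8, usize) {
  (s.as_ptr(), s.len())
}

// SAFETY: ptr/len must come from a live &'a [u8], e.g. as returned by split.
unsafe fn rejoin<'a>(ptr: *const u8, len: usize) -> &'a [u8] {
  unsafe { std::slice::from_raw_parts(ptr, len) }
}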

3 Likes

In addition to that: Waker. There are implementations that allow some kind of atomic swap, but they are a pain to use and reason about if you are writing unsafe low-level concurrency code.