Generic `Atomic<T>`

Indeed, this is explicitly out of scope for the initial plan (link to ACP). API stability is an interesting question, but your intuition is correct that C-style enums with #[repr(Int)] should be fine for atomics. See the Swift atomic protocols (traits) linked above for what this might look like. I'd love to have this in the future.

The initial perma-unstable trait is AtomicPrimitive, but in the future AsAtomic may become the better name. I already have some sketches heading in that direction; this is just the first step, hopefully.

With a small caveat: on targets that have word-sized atomic bitand/bitor but no byte-sized atomics (e.g. RISC-V), we change the library implementation of AtomicBool to exploit the fact that it is effectively an i1 rather than an i8, so LLVM can lower the operations to native atomics instead of an LL/SC loop.

This is basically the solution to ABA issues as well: atomically swap a (*mut T, usize) pair and bump the counter on each swap[1].

If we support this, I think it'd be supported at least for any (T, U) where we guarantee the layout of (T, U). Currently that's no tuples, but there's discussion of maybe guaranteeing layout (but not ABI) equivalence between (T, T) and [T; 2], and I personally think that extending that to (T, U) where the types are layout equivalent[2] makes sense as well.
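
To make the counter scheme above concrete, here is a minimal sketch using only what exists on stable today: a 32-bit index plus a 32-bit version counter packed into an AtomicU64, standing in for the hypothetical Atomic<(*mut T, usize)>. The TaggedHead type and try_swap method are invented for this example.

use std::sync::atomic::{AtomicU64, Ordering};

// High 32 bits: version counter; low 32 bits: slot index.
struct TaggedHead {
  packed: AtomicU64,
}

impl TaggedHead {
  // Returns (counter, index) as last observed.
  fn load(&self) -> (u32, u32) {
    let v = self.packed.load(Ordering::Acquire);
    ((v >> 32) as u32, v as u32)
  }

  // Replace the observed (counter, index) with new_index, bumping the counter.
  // A stale observation (same index, different generation) no longer compares
  // equal, which is exactly what defeats ABA.
  fn try_swap(&self, observed: (u32, u32), new_index: u32) -> bool {
    let (counter, index) = observed;
    let old = ((counter as u64) << 32) | index as u64;
    let new = ((counter.wrapping_add(1) as u64) << 32) | new_index as u64;
    self.packed
      .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
      .is_ok()
  }
}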


  1. It's a probabilistic fix to a probabilistic issue, technically, since other threads could theoretically wrap the entire usize space before putting in the same-but-different pointer. Rust has already decided that aborting in such degenerate cases is the usual accepted response to maintain soundness. (E.g. std does so for reference count overflow or exhausting thread IDs. Hammering a single atomic isize::MAX times is a bit more achievable but still an erroneous edge case imo.) Swift's protocols call the raw atomic repr WordPair. ↩︎

  2. I don't know what the exact policy is on trait impls differing based on cfg outside std::os's extension traits, though. cfg(target_has_atomic) gets a minor pass in that it's true for any std-having target, up to "ptr" at least. But this runs into the fun question of isize being both C's intptr_t and size_t again, unfortunately. ↩︎

1 Like

FYI the tracking issue and step one of implementation have been created.

3 Likes

Regarding uninitialized bytes, see this unsafe-code-guidelines discussion:

Operational semantics for AtomicMaybeUninit::compare_exchange

My biased summary:

C++ has committed to being able to do "compare-exchange while ignoring padding bits". Which means LLVM theoretically should be able to do it soundly (and I think theory meets reality in this case, unlike certain others). Which means we should be able to offer it too.

In practice this looks like a compare-exchange loop: if the comparison fails only because of padding-bit differences, it retries with the actual padding bits read from memory. Hardware-wise, this should eventually succeed. However, the algorithm isn't kosher from an abstract memory model perspective. That's not a showstopper, though: for opsem, we can treat the entire loop as one opaque "compare-exchange ignoring specified bits" operation. Likewise, the stable primitive offered by libcore should be that opaque operation, matching C++'s approach. The loop should be an implementation detail, and there shouldn't be a stable guarantee that naively implementing the loop yourself works correctly.
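
As a rough sketch of what that loop could look like (the function name, the u32-based representation, and the ignore_mask parameter are all invented for illustration; the point above is precisely that only the opaque operation, not this loop, would be stably guaranteed):

use std::sync::atomic::{AtomicU32, Ordering};

// Compare-exchange on a 32-bit slot while ignoring the bits set in
// ignore_mask (e.g. padding bits). Not a real std API.
fn compare_exchange_ignoring(
  slot: &AtomicU32,
  mut expected: u32,
  new: u32,
  ignore_mask: u32,
) -> Result<u32, u32> {
  loop {
    match slot.compare_exchange_weak(expected, new, Ordering::SeqCst, Ordering::SeqCst) {
      Ok(prev) => return Ok(prev),
      Err(observed) => {
        if (observed & !ignore_mask) == (expected & !ignore_mask) {
          // Only ignored bits differ (or the failure was spurious): adopt the
          // observed ignored bits and retry.
          expected = (expected & !ignore_mask) | (observed & ignore_mask);
        } else {
          return Err(observed);
        }
      }
    }
  }
}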

3 Likes

By the way, I think we should go beyond C++ and offer a primitive that allows the ignored-bits mask to be dynamic. We should also say that, even if you don’t ignore a bit which turns out to be uninit in memory, invoking the compare-exchange primitive isn’t UB. The uninit bits just potentially (nondeterministically) cause the compare-exchange to fail. This guarantee can be limited to types that actually allow uninit bits in the positions that turn out to be uninit.

Why is the above useful? Because it would make it possible to soundly compare-exchange enum values, even if they have variant-dependent padding bits. A compare-exchange on an enum value would lower to a masked compare-exchange where the ignored bits are the padding bits of the expected variant. If the actual value in memory is also that variant, then the padding bits are ignored and everything works as expected. But if the actual value is a different variant, then it may have uninit bits in non-ignored positions. So for this approach to be sound, the compare-exchange can’t be automatic UB just because of the uninit bits. But it doesn’t matter whether those bits are considered matching or not; the compare-exchange should fail regardless due to the non-matching discriminant (whose bits wouldn’t be ignored).
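
To make that lowering concrete, here is a hypothetical example; the Slot enum and the ignore_mask helper are invented, and the byte offsets follow the documented repr(C, u8) layout (tag byte first, payload union at its natural alignment):

// A 4-byte enum whose padding depends on the variant: byte 0 is the tag, and
// the payload union starts at byte 2 because of the u16's alignment.
#[repr(C, u8)]
#[derive(Clone, Copy)]
enum Slot {
  Small(u8),  // payload in byte 2; bytes 1 and 3 are padding
  Large(u16), // payload in bytes 2..4; byte 1 is padding
}

// Per-byte "ignore" mask chosen from the *expected* value's variant
// (true = the CAS ignores that byte).
fn ignore_mask(expected: Slot) -> [bool; 4] {
  match expected {
    Slot::Small(_) => [false, true, false, true],
    Slot::Large(_) => [false, true, false, false],
  }
}

If memory actually holds a Small value while expected is Large, byte 3 is compared even though it may be uninit in memory; per the proposal that must not be UB, and the compare-exchange fails anyway because the tag bytes differ.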

8 Likes

AFAIK this isn't implemented yet; see here and here.

Also, I am not aware of any specification of this operation that satisfies liveness / forward progress.

I think we should figure this out for padding-free types before we try to handle types with padding. I'd also exclude unions initially just to avoid that source of potential hassle. Basically, only fieldless enums and structs without padding. Let's walk before we run, shall we?

4 Likes

You need the loop to work eventually to be able to use compare_exchange_weak.

1 Like

Not if compare_exchange_weak contains a loop itself.

Yes, that would result in slightly suboptimal code generation. But libstdc++, libc++, and Microsoft's STL all follow this approach for types with padding. (Though in Microsoft's case that's just because they implement compare_exchange_weak as compare_exchange_strong.)

Well, a few months ago libc++ did implement atomic_ref in a way that uses __builtin_clear_padding if it's available. But Clang still doesn't implement __builtin_clear_padding, so right now this only works properly with GCC. That will need to change before they can claim conformance. However, my guess is that there won't be any backend work needed for conformance; it won't be necessary to adjust the optimizer because right now the optimizer doesn't even perform basic optimizations on atomics. But I admit I'm speculating.

Maybe this is getting off-topic, but isn't this solved by defining the AM operation as "compare-exchange while ignoring a specified set of bits" as I suggested? If you assume such an operation exists, then forward progress analysis for programs using it should be comparable to the no-padding case.

The ability of the implementation to guarantee forward progress would need to rely on multiple implementation details, such as LLVM not optimizing cmpxchg on undef to always return false, and the OS or hardware not constantly changing the contents of uninitialized memory. (The latter is not violated by Linux's MADV_FREE because that modifies each page at most once.) Those implementation requirements may change in the future, but LLVM should always have some way to implement the operation, or else C++'s atomic_ref will break.

Fine by me.

1 Like

I would describe it as something like a CAS version of a SIMD masked load on a u8 vector (I think we only need byte-level precision) -- we never even load the other bytes so if they are uninit it doesn't matter. That avoids having to talk about freeze.

However, even then the implementation as a loop does not have the same liveness properties as the AM semantics: if one thread tries to do such a CAS while another thread constantly does atomic writes with random values to the same memory, then in the AM the first thread would be guaranteed to terminate. However, a loop-based implementation (even if we only introduce the loop in machine IR, entirely avoiding all optimization-related questions) could make the first thread loop forever.

IOW, this pseudocode would always terminate under AM semantics (assuming a fair scheduler), but the produced binary could fail to terminate:

static X: AtomicU32 = AtomicU32::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

thread::scope(|s| {
  s.spawn(|| {
    do_a_masked_cas(&X); // the hypothetical masked-CAS primitive
    set_flag(&FLAG);     // FLAG.store(true, Ordering::Release)
  });
  s.spawn(|| {
    while !get_flag(&FLAG) {  // FLAG.load(Ordering::Acquire)
      write_random(&X);       // relaxed atomic store of a random value
    }
  });
});

So no, this does not solve the liveness concerns.

Also, a static mask indicating which bytes to compare and which bytes to ignore would not suffice to permit CAS on enums with fields, as for them the padding mask depends on the enum discriminant.

Good point.

It's not an entirely novel issue though. Architectures that implement CAS based on a load-reserved/store-conditional loop have a similar problem. Store-conditional will generally fail if there have been any intervening writes to the memory location since the load-reserved, even writes of the same value that was already present. This means that if one thread stores the same value over and over, and another thread does a single compare_exchange with that value as the expected value, the second thread can theoretically be blocked indefinitely.

In both situations, you do have some thread making progress, but you can't guarantee that all threads will make progress.

Architectures that use LR/SC include RISC-V, POWER, ARM before v8.1, and LoongArch64, so it's a good chunk of Rust's target list.

Yeah, that's what I brought up in this post. For CAS on enums to work, the mask must be allowed to be dynamic, and comparisons against non-masked-out uninit bits must not be UB (though they can behave nondeterministically). Actually, the C++ spec seems to imply that such comparisons are not UB, though it's hard to tell because the example seems to have a mistake (the specific values used in the example have no uninit bits).

Ah, fair. I guess there is the possibility that hardware guarantees that LL/SC (I know this as "load-link/store-conditional", but both names seem to be in use) will eventually succeed, but I don't know whether actual hardware makes such guarantees.

This is even worse than enums; it's about unions. It's also kind of useless from a liveness perspective, as it explicitly says the CAS might never succeed (even if there are no concurrent writes)...

I don't think C++ is a good model to copy here. But at this point we're just re-hashing the exact discussion we already had in

I'm citing that passage as an analogue for the case where there are uninit bits that are not masked out. For enums, I'm assuming the compiler-generated CAS implementation would dynamically choose a mask based on the current variant of the 'expected' value. So you would only encounter non-masked-out uninit bits if the enum variant does not match, in which case the CAS should fail anyway, if only due to the discriminant being unequal. All that is needed is for the CAS not to be UB.

That's a separate issue from liveness when bits are masked out.

FWIW, in the version of the Arm Architecture Reference Manual for A-profile architecture (DDI 0487) I have to hand, there's a forward progress guarantee for LL/SC in AArch64 mode on pages B2-314 and B2-315, and the same for AArch32 mode on pages E2-9643 and E2-9644.

The revision I have is Issue K.a, dated 20 March 2024, and applies to all 64-bit ARM processors; I believe that if you dug out an older manual (e.g. for ARMv7-A), you'd find similar language, and I can't find anything in the R-profile supplement that contradicts this for the R-profile architecture.

You can find similar language in the v8 M-profile architecture reference manual, too, but not the v7 M-profile reference.

LoadExcl/StoreExcl loops are guaranteed to make forward progress only if, for any LoadExcl/StoreExcl loop within a single thread of execution, the software meets all of the following conditions: [...]

Oh wow, I hadn't actually expected this! Nice :smiley:

However...

However, it is permissible for the LoadExcl/StoreExcl loop not to make forward progress if a different thread is repeatedly doing any of the following in a tight loop:

I don't know all the acronyms involved here, but "Performing stores to a PA covered by the Exclusives monitor" sounds like that would apply to examples like my code above?

1 Like

Yeah, it would apply.

RISC-V has something similar, with similar limitations (emphasis added, and note that "hart" means "hardware thread"):

As a consequence of the eventuality guarantee, if some harts in an execution environment are executing constrained LR/SC loops, and no other harts or devices in the execution environment execute an unconditional store or AMO to that reservation set, then at least one hart will eventually exit its constrained LR/SC loop. By contrast, if other harts or devices continue to write to that reservation set, it is not guaranteed that any hart will exit its LR/SC loop.

1 Like

My understanding of the guarantee is that it's intended to allow an implementation to guarantee forward progress by only checking the monitor state at load exclusive time, not at any other time.

Thus, if you have stores to a physical address without a matching load-exclusive, you're not guaranteed forward progress: you need the load-exclusive/store-exclusive pair to guarantee progress. If your code implements write_random with an ordinary store, then progress is not guaranteed; if it notes the "atomicness" of X and uses an LDXR/STXR pair, then the guarantee holds.

1 Like

write_random has to do an atomic store to avoid UB, but I would expect that a relaxed atomic store becomes just a regular store in assembly.

So that sounds like there's already no liveness guarantee for compare-and-swap on RISC-V. The operation itself may loop infinitely under certain conditions.

On AArch64, you'd need a "dead" LDXR before doing the atomic store in write_random; the progress guarantee is phrased carefully so that the only time an implementation must check that it's not impeding progress is on LDXR (although it's always allowed to do better).

I assume the standard compilation scheme for a relaxed write does not involve such a LDXR, so the example above (of one thread hammering relaxed writes while the other thread does a single CAS) would then be an example of Rust-level liveness being violated on aarch64.

I have just encountered a much less esoteric, concrete reason to want Atomic<(X, Y)> where X and Y are both pointer-sized but not the same type: fat pointers. Atomic<&T> where T: Sized can be implemented on top of the existing AtomicPtr. Atomic<&[T]>, Atomic<&str>, and Atomic<&dyn Trait> can't, but as long as the extra bulk is no more than one additional pointer's worth of bits, they should be implementable on top of AtomicDoubleword.
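
As a sketch of the decomposition this would rest on (shown for slices; the double-word atomic itself is still hypothetical), the fat pointer splits cleanly into two word-sized halves:

// A &[u8] is (data pointer, length): two word-sized halves that a
// double-word atomic could store and compare-exchange together.
fn split(s: &[u8]) -> (*const u8, usize) {
  (s.as_ptr(), s.len())
}

// SAFETY: ptr/len must come from a live &'a [u8], e.g. as returned by split.
unsafe fn rejoin<'a>(ptr: *const u8, len: usize) -> &'a [u8] {
  unsafe { std::slice::from_raw_parts(ptr, len) }
}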

3 Likes

In addition to that: Waker. There are implementations that allow some kind of atomic swap, but they are a pain to use and reason about if you are writing unsafe low-level concurrency code.