Bit-wise reasoning for atomic accesses


It would be nice to know what optimization exactly LLVM is failing to perform. Does LLVM just not coalesce redundant relaxed loads at all, or is there something more subtle going on?


@carllerche do you have a benchmark that shows the 30% perf regression that the comment talks about ?


The comment mentions 10%. It was a long time ago, but I would guess the benches in the repo would show it.


Yep, it doesn’t:

use std::sync::atomic::{Ordering::*, AtomicUsize};
pub fn load_twice(x: &AtomicUsize) -> (usize, usize) {
    let a = x.load(Relaxed);
    let b = x.load(Relaxed);
    (a, b)


	movq	(%rdi), %rax
	movq	(%rdi), %rdx

That sucks. It may be possible to partially work around this on rustc’s end by marking the load function with LLVM’s readonly attribute, which is equivalent to __attribute__((pure)) in C. The semantics are that the function can read through its arguments and global state, but must not modify global state and must be deterministic. This means the compiler can fold multiple calls with the same arguments into one if there were no intervening writes. And, based on this C equivalent, LLVM does do so, even if the function is marked always_inline (in which case I feared it would inline the calls before it had a chance to merge them).

In theory we’d like a stricter semantics, where the function can read through its arguments but not read global state. That way, LLVM could theoretically merge reads even if there were intervening writes to other locations, if it could prove the addresses didn’t alias. (This would only be valid for relaxed loads.) Unfortunately, the only stricter option is readnone, aka __attribute__((const)) in C, which requires that the function not read anything from memory, not even from its arguments.

Anyway, as for actually fixing the issue in LLVM, this bug report looks on point:

I went ahead and left a comment there.


Can’t you get this pattern of writes with a normal atomic?

// mut s_a: AtomicU32; /* initial value should not matter */
// mut s_b: AtomicPtr<AtomicU32> = AtomicPtr::null();

// mut x: u32

// Thread A

// these stores are non-atomic because we have &mut access
s_a = AtomicU32::new(1);
s_a = AtomicU32::new(0);
// but these store are atomic s_a, Ordering::Release);, Ordering::Relaxed);

// Thread B
let mut x = None;
let tmp = s_a.load(Ordering::Acquire);
if tmp != ptr::null() {
    x = Some((*tmp).load(Ordering::Relaxed));

I’m quite sure that here, x should either be None or Some(0). If removing the second non-atomic store (as a Thread A modification, ignoring anything thread B does) would be legal, then thread B could see Some(1) (or worse, Some(uninitialized)).


Good point! So the optimization can likely not break bytes. Interesting.


What we actually want for bytes is some sort of load_freeze, that returns a non-undef value. @RalfJung thinks that having per-bit undef for loads and combining it with LLVM’s freeze would work.


I’ve needed that in jemalloc-sys:

So maybe having a #[rustc_readonly] attribute would be useful.


How hard would it be to make experiments with LLVM the unordered ordering for this? I am not sure about how this is formally modeled, but it might actually be the “no-undef-on-read-write-races” that we need.


If it can be used from Rust, probably not terribly hard.


It would help if the Rust compiler had an equivalent to READ_ONCE and WRITE_ONCE as defined in the Linux kernel. Those force the compiler to perform a single load or store respectively, while not using a machine atomic. With those, you can safely make assumptions like “if READ_ONCE races with WRITE_ONCE or an atomic write, you’ll get either the old value or new value, never a mix of both”.


@josh The Linux kernel uses volatile accesses for that, and Rust has those.

The problem is that neither LLVM nor C actually specify volatile accesses to have the effect you described when it comes to data races. There is no interaction between the concurrency model and volatile accesses specified.


C doesn’t, LLVM doesn’t, but in practice every real architecture does, and thus the Linux kernel does. That’s the reason those primitives exist.

There has been discussion in the C standards committee of standardizing something similar to the Linux kernel’s memory model.


That just doesn’t matter. We are not talking about architecture semantics here, we are talking about programming language semantics. See this example by @alexcrichton for why arguing with architectures just does not make any sense in this discussion.

We need a commitment from LLVM, or else this will be a hack that “happens to work” but remains UB and could stop working any time LLVM adds a new optimization.


I’m talking about programming language semantics as well. My point is precisely that we need the facilities to implement READ_ONCE and WRITE_ONCE primitives we can rely on.

C doesn’t specify that interaction with respect to volatile accesses, but in practice that is the interaction on every existing compiler and target, so the Linux kernel counts on that.


I agree it’d be reasonable for LLVM to guarantee that a volatile read will never have undef/poison in it, it will always be frozen. These reads cannot be duplicated, after all.

Has anybody ever talked with the LLVM devs about this? Is that something they would be willing to guarantee?


I just found this, seems very related: there is no such thing as a “benign” data race. In particular this has an explicit section on “disjoint bit manipulation” (Cc @carllerche)


Note that the example in that section is valid for normal reads/writes, in which case a racing read that could observe the speculative store would be undefined behavior, but not if the read and write are both atomic (even if relaxed), in which case the read is guaranteed to observe some value that was previously written.