Should we change fetch_<logic op> for AtomicBool to exploit knowledge about it contents?

Currently, fetch_or generates CAS loop:

        movzx   eax, byte ptr [rdi]
.LBB0_1:
        mov     ecx, eax
        or      cl, sil
        lock            cmpxchg byte ptr [rdi], cl
        jne     .LBB0_1
        test    al, al
        setne   al
        ret

Should we replace it by argument specific operations like this?

pub fn fetch_or(a: &AtomicBool, val: bool, order: Ordering) -> bool {
    unsafe {
        let a =&*(a as *const _ as *const AtomicU8);
        if val {
            a.swap(1, order) != 0
        }
        else {
            a.fetch_add(0, order) != 0
        }
    }
}

This implementation generates assembly without loop:

        test    esi, esi
        je      .LBB1_1
        mov     al, 1
        xchg    byte ptr [rdi], al
        jmp     .LBB1_3
.LBB1_1:
        mfence
        movzx   eax, byte ptr [rdi]
.LBB1_3:
        test    al, al
        setne   al
        ret

and if val is known (which it is for most code, IMHO) it would also be branchless.

I just don't know if this is worthy optimization. AFAIK, mfence is slightly more powerful compared to lock cmpxchg so what do you think?

If you think that it is better to use my variant, I would open a PR.

Also, this seem to affect only x86 and x86_64, other architectures generate more similar code.

4 Likes

You say fetch_<logic op>, but this might be only about fetch_or and fetch_and, right?

IIUC, atomic_or is an intrinsic directly passed to LLVM, so it's the codegen backend making the choice, not us. But I could easily be wrong; I didn't check the intrinsic's implementation.

If we decide to use this, I would test if it is possible to improve other operations.

Yes, it uses LLVM intrinsics internally. However, LLVM cannot transform fetch_or to swap because it cannot be sure that memory of type i8 can be only 0 or 1. Rust standard library, on the other hand, knows that AtomicBool must contain correct bool value.

5 Likes

We do use !range metadata to tell LLVM that on loads. Maybe there's a way to tell LLVM this for atomics as well?

I'm generally opposed to putting different code in the library for these things, when the intrinsic should just work.

Well, unfortunately, LLVM don't accept !range on atomicrmv :frowning:

Ranges are only for loads, calls and invokes!

I know nothing about how rustc lowers atomics to IR, but doesn’t LLVM have i1 for representing bools?

That results in an "atomic memory access' size must be byte-sized" error.

5 Likes

At least for x86, it could also be implemented as a single CAS:

pub fn boolean_fetch_or(dst: &AtomicU8, new: u8) -> u8 {
    // if 0 (false), replace with new, otherwise already true
    match dst.compare_exchange(0, new, SeqCst, SeqCst) {
        Ok(x) | Err(x) => x
    }
}
example::boolean_fetch_or:
        xor     eax, eax
        lock            cmpxchg byte ptr [rdi], sil
        ret

As far as I can see, this generated assembly should be correct for AtomicBool::fetch_or with SeqCst-ordering. The Rust code doesn't directly generalize for other orderings, since the failure-ordering cannot be Release or AcqRel. ( Rust's compare_exchange only writes on success, whereas x86's cmpxchg always does. )

1 Like

My read of the x86 cmpxchg is that it writes only on success.

1 Like

From the linked page (and its source, Intel's documentation):

The pseudocoded operation also unambiguously writes one of the values back in each code path:

Rephrased, x86 lock cmpxchg always counts as both a sequentially consistent read and write for the purpose of memory consistency / synchronization. Whereas in Rust, while a successful compare_exchange is both a read and a write (using the success ordering), a failed compare_exchange is only a read (using the failure ordering).

(And this does matter, even with SeqCst, as SeqCst still only provides AcqRel synchronization, i.e. can't make a read synchronize with a later write, even if they're sequentially ordered. Using that Rust implementation would be incorrect as optimizations on the IRs before x86 could exploit the weaker ordering.)

Especially since Rust can't express the desired semantics, the resolution to this codegen issue is to teach LLVM and/or rustc_codegen_llvm to do the better thing with the intrinsics, not to change the implementation on the Rust side. That would be true anyway, but this makes it more so.

6 Likes

I stand corrected, I read the first paragraph:

"Compares the value in the AL, AX, EAX, or RAX register with the first operand (destination operand). If the two values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode."

And my read is the write did not always occur, but the pseudo code definitely shows there is always a write. I note that the second paragraph begins with "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically." I wonder how the lock changes the instruction execution of cmpxchg? A partial answer I found1 says there are differences in performance using lock or not and on a non-SMD device the lock prefix is not necessary for memory consistency. Learning new things every day :slight_smile: