Should we change fetch_<logic op> for AtomicBool to exploit knowledge about it contents?

AngelicosPhosphoros · December 28, 2023, 9:10pm

Currently, fetch_or generates CAS loop:

        movzx   eax, byte ptr [rdi]
.LBB0_1:
        mov     ecx, eax
        or      cl, sil
        lock            cmpxchg byte ptr [rdi], cl
        jne     .LBB0_1
        test    al, al
        setne   al
        ret

Should we replace it by argument specific operations like this?

pub fn fetch_or(a: &AtomicBool, val: bool, order: Ordering) -> bool {
    unsafe {
        let a =&*(a as *const _ as *const AtomicU8);
        if val {
            a.swap(1, order) != 0
        }
        else {
            a.fetch_add(0, order) != 0
        }
    }
}

This implementation generates assembly without loop:

        test    esi, esi
        je      .LBB1_1
        mov     al, 1
        xchg    byte ptr [rdi], al
        jmp     .LBB1_3
.LBB1_1:
        mfence
        movzx   eax, byte ptr [rdi]
.LBB1_3:
        test    al, al
        setne   al
        ret

and if val is known (which it is for most code, IMHO) it would also be branchless.

I just don't know if this is worthy optimization. AFAIK, mfence is slightly more powerful compared to lock cmpxchg so what do you think?

If you think that it is better to use my variant, I would open a PR.

Also, this seem to affect only x86 and x86_64, other architectures generate more similar code.

steffahn · December 28, 2023, 9:22pm

You say fetch_<logic op>, but this might be only about fetch_or and fetch_and, right?

CAD97 · December 28, 2023, 9:22pm

IIUC, atomic_or is an intrinsic directly passed to LLVM, so it's the codegen backend making the choice, not us. But I could easily be wrong; I didn't check the intrinsic's implementation.

AngelicosPhosphoros · December 28, 2023, 9:29pm

If we decide to use this, I would test if it is possible to improve other operations.

Yes, it uses LLVM intrinsics internally. However, LLVM cannot transform fetch_or to swap because it cannot be sure that memory of type i8 can be only 0 or 1. Rust standard library, on the other hand, knows that AtomicBool must contain correct bool value.

scottmcm · December 28, 2023, 9:42pm

We do use !range metadata to tell LLVM that on loads. Maybe there's a way to tell LLVM this for atomics as well?

I'm generally opposed to putting different code in the library for these things, when the intrinsic should just work.

AngelicosPhosphoros · December 28, 2023, 10:38pm

Well, unfortunately, LLVM don't accept !range on atomicrmv

Ranges are only for loads, calls and invokes!

jdahlstrom · December 29, 2023, 9:31am

I know nothing about how rustc lowers atomics to IR, but doesn’t LLVM have i1 for representing bools?

bjorn3 · December 29, 2023, 12:55pm

That results in an "atomic memory access' size must be byte-sized" error.

quaternic · December 30, 2023, 5:57pm

At least for x86, it could also be implemented as a single CAS:

pub fn boolean_fetch_or(dst: &AtomicU8, new: u8) -> u8 {
    // if 0 (false), replace with new, otherwise already true
    match dst.compare_exchange(0, new, SeqCst, SeqCst) {
        Ok(x) | Err(x) => x
    }
}

example::boolean_fetch_or:
        xor     eax, eax
        lock            cmpxchg byte ptr [rdi], sil
        ret

As far as I can see, this generated assembly should be correct for AtomicBool::fetch_or with SeqCst-ordering. The Rust code doesn't directly generalize for other orderings, since the failure-ordering cannot be Release or AcqRel. ( Rust's compare_exchange only writes on success, whereas x86's cmpxchg always does. )

winksaville · December 30, 2023, 6:14pm

My read of the x86 cmpxchg is that it writes only on success.

quaternic · December 30, 2023, 8:24pm

From the linked page (and its source, Intel's documentation):

The pseudocoded operation also unambiguously writes one of the values back in each code path:

CAD97 · December 30, 2023, 10:04pm

Rephrased, x86 lock cmpxchg always counts as both a sequentially consistent read and write for the purpose of memory consistency / synchronization. Whereas in Rust, while a successful compare_exchange is both a read and a write (using the success ordering), a failed compare_exchange is only a read (using the failure ordering).

(And this does matter, even with SeqCst, as SeqCst still only provides AcqRel synchronization, i.e. can't make a read synchronize with a later write, even if they're sequentially ordered. Using that Rust implementation would be incorrect as optimizations on the IRs before x86 could exploit the weaker ordering.)

Especially since Rust can't express the desired semantics, the resolution to this codegen issue is to teach LLVM and/or rustc_codegen_llvm to do the better thing with the intrinsics, not to change the implementation on the Rust side. That would be true anyway, but this makes it more so.

winksaville · December 31, 2023, 4:19pm

I stand corrected, I read the first paragraph:

"Compares the value in the AL, AX, EAX, or RAX register with the first operand (destination operand). If the two values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode."

And my read is the write did not always occur, but the pseudo code definitely shows there is always a write. I note that the second paragraph begins with "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically." I wonder how the lock changes the instruction execution of cmpxchg? A partial answer I found1 says there are differences in performance using lock or not and on a non-SMD device the lock prefix is not necessary for memory consistency. Learning new things every day

system · March 30, 2024, 4:19pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Could we have `fetch_not` for `AtomicBool`s? libs	4	990	September 22, 2022
Pre-pre-RFC: Exploring API design space for volatile atomics language design	44	1218	March 11, 2024
alloc::sync::Arc for core libs	6	1082	June 27, 2021
`unordered` as a solution to "Bit-wise reasoning for atomic accesses" Unsafe Code Guidelines	23	3435	October 15, 2019
New target support compiler	9	1101	March 25, 2019

Should we change fetch_<logic op> for AtomicBool to exploit knowledge about it contents?

Related topics