Currently, fetch_or generates a CAS loop:
movzx eax, byte ptr [rdi]
.LBB0_1:
mov ecx, eax
or cl, sil
lock cmpxchg byte ptr [rdi], cl
jne .LBB0_1
test al, al
setne al
ret
Should we replace it with argument-specific operations, like this?
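pub fn fetch_or(a: &AtomicBool, val: bool, order: Ordering) -> bool {
    unsafe {
        // View the AtomicBool as an AtomicU8 (both are one byte wide).
        let a = &*(a as *const _ as *const AtomicU8);
        if val {
            // OR with true always stores 1, so a plain swap is enough.
            a.swap(1, order) != 0
        } else {
            // OR with false changes nothing; fetch_add(0) just reads the value.
            a.fetch_add(0, order) != 0
        }
    }
}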
This implementation generates assembly without a loop:
test esi, esi
je .LBB1_1
mov al, 1
xchg byte ptr [rdi], al
jmp .LBB1_3
.LBB1_1:
mfence
movzx eax, byte ptr [rdi]
.LBB1_3:
test al, al
setne al
ret
And if val is known at compile time (which it is in most code, IMHO), it would also be branchless.
I just don't know if this is a worthwhile optimization.
AFAIK, mfence is a slightly stronger barrier than lock cmpxchg, so what do you think?
If you think that it is better to use my variant, I would open a PR.
Also, this seems to affect only x86 and x86_64; other architectures generate more similar code for the two variants.
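For anyone who wants to sanity-check the idea before a PR, here is a minimal sketch that compares the proposed variant against the current AtomicBool::fetch_or for every combination of initial value and argument. The names fetch_or_proposed and variants_agree are made up for this example and are not part of std; the check only covers single-threaded behavior, not memory ordering or performance.

```rust
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};

// Proposed variant from this issue, repeated so the sketch is self-contained.
// `fetch_or_proposed` is a hypothetical name, not a std API.
pub fn fetch_or_proposed(a: &AtomicBool, val: bool, order: Ordering) -> bool {
    // Assumption: AtomicBool and AtomicU8 are both one byte wide with the same
    // alignment, and a bool is always stored as 0 or 1.
    let a = unsafe { &*(a as *const AtomicBool as *const AtomicU8) };
    if val {
        // OR with true always stores 1, so a swap returns the same old value.
        a.swap(1, order) != 0
    } else {
        // OR with false leaves the value unchanged; fetch_add(0) only reads it.
        a.fetch_add(0, order) != 0
    }
}

// Returns true if both variants return the same previous value and leave the
// same final value for all four (initial, val) combinations.
pub fn variants_agree() -> bool {
    for initial in [false, true] {
        for val in [false, true] {
            let current = AtomicBool::new(initial);
            let proposed = AtomicBool::new(initial);

            let prev_current = current.fetch_or(val, Ordering::SeqCst);
            let prev_proposed = fetch_or_proposed(&proposed, val, Ordering::SeqCst);

            let same_prev = prev_current == prev_proposed;
            let same_final =
                current.load(Ordering::SeqCst) == proposed.load(Ordering::SeqCst);
            if !(same_prev && same_final) {
                return false;
            }
        }
    }
    true
}
```

Calling variants_agree() from a test or from main should return true if the two variants behave the same in the single-threaded case.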