AtomicU64 store disassembly shows a ton of lines of asm, why?

gopa · December 29, 2019, 6:00pm

I am doing AtomicU64 store relaxed - I guess that translates to the intrinsic atomic_store_relaxed() ? I dint try to dig into where that is in the rust source code (will do that next). But when I do the disassembly / objdump and trace the API corresponding to the AtomicStore, I see like a page or more of assembly.

I am testing on an x86_64 platform, so a relaxed 64 bit atomic store there should be just a regular store instruction - so wondering what makes it so humoungous when it finally gets compiled ?

CAD97 · December 29, 2019, 6:06pm

First check: are you compiling with optimizations (--release)?

Second: an example on godbolt would help people determine what the issue is.

sfackler · December 29, 2019, 9:42pm

x86_64 is a highly coherent architecture, so a relaxed atomic store is just a normal store, as you can see with godbolt: https://rust.godbolt.org/z/ZjHtCf.

CAD97 · December 29, 2019, 10:46pm

Without optimizations and turning off godbolt's assembly trimming, I get about 580 lines of assembly. I honestly believe that you forgot to turn on optimizations.

gopa · December 30, 2019, 1:54pm

Yes, my apologies for this stupidity, I was not aware of the --release option for turning on optim, I turned that on and did a objdump and the relaxed store looks to be just a normal store as expected. Sorry about this again, thanks to all who responded

jyn514 · December 31, 2019, 3:23am

Wow, I'd heard before that the codegen for LLVM IR was bloated but I had no idea it was this bad. I do think this shows (one of the reasons) why codegen is so slow - LLVM has to go through all 500 lines that rustc sends to it.

Certainly the 500 lines is only in debug mode, but 2 orders of magnitude still seems like an enormous difference even without optimizations.

xfix · December 31, 2019, 7:58am

Using store method of AtomicU64 gives me like 150 lines (I wonder what I'm doing differently), but the reason why the IR is so bloated is because you can use AtomicU64::store in multiple ways.

Let me clarify. The AtomicU64::store method redirects to atomic_store function, nothing special here.

pub fn store(&self, val: $int_type, order: Ordering) {
    unsafe { atomic_store(self.v.get(), val, order); }
}

This translates to rather straightforward assembly.

core::sync::atomic::AtomicU64::store:
        sub     rsp, 40
        mov     qword ptr [rsp + 32], rdi
        mov     rax, qword ptr [rsp + 32]
        mov     rdi, rax
        mov     qword ptr [rsp + 24], rsi
        mov     byte ptr [rsp + 23], dl
        call    qword ptr [rip + core::cell::UnsafeCell<T>::get@GOTPCREL]
        mov     qword ptr [rsp + 8], rax
        mov     rdi, qword ptr [rsp + 8]
        mov     rsi, qword ptr [rsp + 24]
        mov     al, byte ptr [rsp + 23]
        movzx   edx, al
        call    qword ptr [rip + core::sync::atomic::atomic_store@GOTPCREL]
        add     rsp, 40
        ret

core::cell::UnsafeCell<T>::get:
        mov     rax, rdi
        ret

The real deal is atomic_store however.

unsafe fn atomic_store<T>(dst: *mut T, val: T, order: Ordering) {
    match order {
        Release => intrinsics::atomic_store_rel(dst, val),
        Relaxed => intrinsics::atomic_store_relaxed(dst, val),
        SeqCst => intrinsics::atomic_store(dst, val),
        Acquire => panic!("there is no such thing as an acquire store"),
        AcqRel => panic!("there is no such thing as an acquire/release store"),
    }
}

There is no inlining, so Rust cannot inline the order and has to compile the code for all of this. In addition, there is panic, which means unwinding, which means even more code bloat to handle possible exceptions.

The generated assembly code has a jump table for all possible variants (Rust compiler told LLVM to generate one, so that's what it did). However, because optimizations are disabled the compiler did not bother optimizing branches within a jump table or merging similar code, so they are somewhat verbose.

core::sync::atomic::atomic_store:
        sub     rsp, 56
        mov     byte ptr [rsp + 31], dl
        mov     byte ptr [rsp + 39], 0
        mov     byte ptr [rsp + 39], 1
        movzx   eax, byte ptr [rsp + 31]
        mov     ecx, eax
        mov     qword ptr [rsp + 16], rsi
        mov     qword ptr [rsp + 8], rdi
        mov     qword ptr [rsp], rcx
        lea     rax, [rip + .LJTI1_0]
        mov     rcx, qword ptr [rsp]
        movsxd  rdx, dword ptr [rax + 4*rcx]
        add     rdx, rax
        jmp     rdx
.LBB1_1:
        mov     rdi, qword ptr [rsp + 40]
        call    _Unwind_Resume@PLT
        ud2
.LBB1_2:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
        mov     rax, qword ptr [rip + core::panicking::panic@GOTPCREL]
        mov     esi, 50
        call    rax
        jmp     .LBB1_14
        ud2
.LBB1_4:
        mov     byte ptr [rsp + 39], 0
        mov     rax, qword ptr [rsp + 8]
        mov     rcx, qword ptr [rsp + 16]
        mov     qword ptr [rax], rcx
        jmp     .LBB1_11
.LBB1_6:
        mov     byte ptr [rsp + 39], 0
        mov     rax, qword ptr [rsp + 8]
        mov     rcx, qword ptr [rsp + 16]
        mov     qword ptr [rax], rcx
        jmp     .LBB1_11
.LBB1_8:
        mov     byte ptr [rsp + 39], 0
        mov     rax, qword ptr [rsp + 16]
        mov     rcx, qword ptr [rsp + 8]
        xchg    qword ptr [rcx], rax
        jmp     .LBB1_11
.LBB1_10:
        lea     rdi, [rip + .L__unnamed_3]
        lea     rdx, [rip + .L__unnamed_4]
        mov     rax, qword ptr [rip + core::panicking::panic@GOTPCREL]
        mov     esi, 42
        call    rax
        jmp     .LBB1_14
.LBB1_11:
        add     rsp, 56
        ret
.LBB1_12:
        mov     byte ptr [rsp + 39], 0
        jmp     .LBB1_1
.LBB1_13:
        test    byte ptr [rsp + 39], 1
        jne     .LBB1_12
        jmp     .LBB1_1
.LBB1_14:
        ud2
        mov     qword ptr [rsp + 40], rax
        mov     dword ptr [rsp + 48], edx
        jmp     .LBB1_13
.LJTI1_0:
        .long   .LBB1_6-.LJTI1_0
        .long   .LBB1_4-.LJTI1_0
        .long   .LBB1_10-.LJTI1_0
        .long   .LBB1_2-.LJTI1_0
        .long   .LBB1_8-.LJTI1_0

.L__unnamed_1:
        .ascii  "there is no such thing as an acquire/release store"

.L__unnamed_5:
        .ascii  "src/libcore/sync/atomic.rs"

.L__unnamed_2:
        .quad   .L__unnamed_5
        .asciz  "\032\000\000\000\000\000\000\000]\b\000\000\023\000\000"

.L__unnamed_3:
        .ascii  "there is no such thing as an acquire store"

.L__unnamed_4:
        .quad   .L__unnamed_5
        .asciz  "\032\000\000\000\000\000\000\000\\\b\000\000\024\000\000"

When optimized, assuming unknown Ordering in release mode, the code looks much more readable.

        push    rax
        movzx   eax, dl
        lea     rcx, [rip + .LJTI0_0]
        movsxd  rax, dword ptr [rcx + 4*rax]
        add     rax, rcx
        jmp     rax
.LBB0_4:
        mov     qword ptr [rdi], rsi
        pop     rax
        ret
.LBB0_2:
        xchg    qword ptr [rdi], rsi
        pop     rax
        ret
.LBB0_3:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
        mov     esi, 42
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2
.LBB0_1:
        lea     rdi, [rip + .L__unnamed_3]
        lea     rdx, [rip + .L__unnamed_4]
        mov     esi, 50
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2
.LJTI0_0:
        .long   .LBB0_4-.LJTI0_0
        .long   .LBB0_4-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_1-.LJTI0_0
        .long   .LBB0_2-.LJTI0_0

.L__unnamed_3:
        .ascii  "there is no such thing as an acquire/release store"

.L__unnamed_5:
        .ascii  "src/libcore/sync/atomic.rs"

.L__unnamed_4:
        .quad   .L__unnamed_5
        .asciz  "\032\000\000\000\000\000\000\000]\b\000\000\023\000\000"

.L__unnamed_1:
        .ascii  "there is no such thing as an acquire store"

.L__unnamed_2:
        .quad   .L__unnamed_5
        .asciz  "\032\000\000\000\000\000\000\000\\\b\000\000\024\000\000"

How does that affect compilation other parts of Rust in development mode? It doesn't, really. The pattern used by atomics with a parameter explaining which function actually should be used isn't seen often.

That said, there are many issues with the quality of development code (it's bad), but atomics are an edge case of sorts.

jyn514 · December 31, 2019, 1:27pm

The pattern used by atomics with a parameter explaining which function actually should be used isn't seen often. ... atomics are an edge case of sorts

I wonder if making atomic_store a const fn would help, then the parameter would always be inlined (I think). Unfortunately this currently isn't possible because std::intrinsic functions aren't const, although I don't see why they couldn't be: Rust Playground

bjorn3 · December 31, 2019, 2:42pm

It is not possible to make it a const fn, as it has a side effect.
Const fn's are not guaranteed to be inlined. At mir level that can only happen (ignoring the mir inliner) when all arguments are constants, in which case they can be replaced by their result value.

lambda · December 31, 2019, 7:27pm

Using an enum for the ordering is a somewhat unfortunate design decision, because it means that you have to do all of this extra inlining, and it introduces the possibility of panic rather than statically enforcing the constraints.

You could do a better job by using a trait instead of an enum; that lets you implement the trait only for the orderings which make sense. Here's an example of what it could look like (compiler explorer):

pub trait Ordering {
    unsafe fn atomic_store<T>(dst: *mut T, val: T);
}

struct Relaxed;

impl Ordering for Relaxed {
    #[inline]
    unsafe fn atomic_store<T>(dst: *mut T, val: T) {
        intrinsics::atomic_store_relaxed(dst, val)
    }
}

impl AtomicUsize {
    #[inline]
    pub fn store<O: Ordering>(&self, val: usize) {
        unsafe {
            O::atomic_store(self.v.get(), val);
        }
    }
}

pub fn store_relaxed(v: &AtomicUsize, n: usize) {
    v.store::<Relaxed>(n);
}

I guess you'd need to make this two traits, LoadOrdering and StoreOrdering, so that you could implement them separately for the orderings which make sense.

However, that also seems like it may be overly complex. Why does it need to be parameterized at all? Why aren't the public methods on AtomicUsize etc. just store_relaxed and so on? That would be the simplest solution, and allow proper static checking.

Looking through core::sync::atomic, I do see that that would lead to a very large number of methods, since there are a lot of operations which are parameterized on ordering, some of them on two orderings, although they do generally all dispatch to one intrinsic per combination of orderings.

Having the ordering be parameters does help to cut down on the number of methods, but doing it as enums adds an extra level of indirection and dispatch that needs to be inlined out when compiling in release mode, and makes it possible to accidentally use an ordering that will cause a panic at runtime. Does anyone know why an enum was chosen over a trait, or over just exposing the methods directly, both of which would cut down on the amount of indirection and allow for better static checking?

CAD97 · December 31, 2019, 7:43pm

The TL;DR of it is that Rust's atomics, since we lack a memory model of our own, have the design of "it does what the C++ memory model and atomics do".

To that end, we expose the roughly same API as std C++ atomics. And dynamic ordering guarantees should be supported anyway, so when the community better understands formally what atomic orderings mean for Rust, we can potentially add a better API.

As one example of a potential change over C++ that we've discarded primarily because of equivalence with C++, @RalfJung has suggested making AcqRel work always, meaning Acq for reads (if present) and Rel for writes (if present), rather than manually having to choose Acq/Rel appropriately for the operation.

josh · December 31, 2019, 10:50pm

Given that the ordering parameter will almost certainly be a compile-time constant, and these functions should compile down to a single instruction, it might make sense to mark all the functions involved as inline(always). There's really no advantage to ever having them become out-of-line calls.

Lokathor · January 2, 2020, 8:14pm

Would that be a good PR, to simply tag the function as #[inline]?

josh · January 2, 2020, 8:36pm

I'm suggesting #[inline(always)], specifically. Might be worth testing, to find out if that makes an atomic operation compile down to a single instruction even in debug mode.

system · April 1, 2020, 8:38pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
`unordered` as a solution to "Bit-wise reasoning for atomic accesses" Unsafe Code Guidelines	23	3201	October 15, 2019
Need input regarding very weak memory models (GPUs, VMs...)	12	1560	April 5, 2020
Windows x86-64 Assembly Output question [SOLVED] compiler	5	505	September 20, 2023
Why does Rust generate 10x as much unoptimized assembly as GCC? compiler	23	4242	November 2, 2021
[x-post from users forum] Long compile times for a vector with 50K u64 values internals	2	934	March 11, 2020

AtomicU64 store disassembly shows a ton of lines of asm, why?

Related Topics