Copy elision & RVO optimization

I commonly try to appease the borrow checker by changing |a: &mut T| { ... } into |a: T| -> T { ... }. I always worried about the performance of this, but it never really bit me, so I assumed the compiler was desugaring both into the same code. I decided to look into it directly and found out that it actually doesn't, so I wanted to ask for opinions and thoughts on an optimization pass that would change the following f into f_opt.
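For concreteness, here is the pattern in minimal form (the names are made up for illustration): the two functions below are semantically equivalent, one mutating through a mutable reference and the other taking and returning the value.

```rust
// Two semantically equivalent ways to update a value: mutate through a
// mutable reference, or take ownership and return the updated value.
fn bump_in_place(a: &mut [u64; 4]) {
    a[0] += 1;
}

fn bump_by_value(mut a: [u64; 4]) -> [u64; 4] {
    a[0] += 1;
    a
}

fn main() {
    let mut x = [0u64; 4];
    bump_in_place(&mut x);
    let y = bump_by_value([0u64; 4]);
    // Same observable result; the question is whether the by-value version
    // compiles to the same code as the by-reference one.
    assert_eq!(x, y);
}
```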

#[link(name = "my_c_library")]
extern "C" {
    fn cfunc(x: u64) -> bool;
}

type A = [u64; 1024];

// Not only does the compiler not find tail call opts here,
// it can't optimize it into f_opt
fn f(nest: u32, mut a: A) -> A {
    if nest > 0 {
        // Note: I guess tail call opt should have kicked in here... 
        f(nest - 1, a)
    } else {
        a[500] = 30;
        a
    }
}


// Since A is Copy we can avoid the memcpy without changing the semantics
// if we in-place modify the argument.
#[inline(always)]
fn f_opt(nest: u32, mut a: A) -> A {
    fn f(nest: u32, a: &mut A) {
        if nest > 0 {
            f(nest - 1, a);
        } else {
            a[500] = 30;
        }
    }

    f(nest, &mut a);
    a
}
 

pub fn main() {
    let a = [0; 1024];
    let a = f(1000, a);
    // let a = f_opt(1000, a);
    unsafe { cfunc(a[500]); }
}

godbolt

To avoid dependence on godbolt (x86):

f_opt compiles down to simply

example::main:
        mov     edi, 30
        jmp     qword ptr [rip + cfunc@GOTPCREL]

while f compiles into the much more long-winded

example::f:
        push    rbp
        push    r14
        push    rbx
        mov     eax, 8192
        call    __rust_probestack
        sub     rsp, rax
        mov     rax, rdx
        mov     rbx, rdi
        test    esi, esi
        je      .LBB0_1
        mov     ebp, esi
        dec     ebp
        mov     r14, rsp
        mov     edx, 8192
        mov     rdi, r14
        mov     rsi, rax
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     rdi, rbx
        mov     esi, ebp
        mov     rdx, r14
        call    example::f
        jmp     .LBB0_2
.LBB0_1:
        mov     qword ptr [rax + 4000], 30
        mov     edx, 8192
        mov     rdi, rbx
        mov     rsi, rax
        call    qword ptr [rip + memcpy@GOTPCREL]
.LBB0_2:
        mov     rax, rbx
        add     rsp, 8192
        pop     rbx
        pop     r14
        pop     rbp
        ret

example::main:
        push    rbx
        mov     eax, 16384
        call    __rust_probestack
        sub     rsp, rax
        lea     rbx, [rsp + 8192]
        mov     edx, 8192
        mov     rdi, rbx
        xor     esi, esi
        call    qword ptr [rip + memset@GOTPCREL]
        mov     rdi, rsp
        mov     esi, 1000
        mov     rdx, rbx
        call    example::f
        mov     rdi, qword ptr [rsp + 4000]
        add     rsp, 16384
        pop     rbx
        jmp     qword ptr [rip + cfunc@GOTPCREL]

PS. My point is not the missed tail call optimization although it feels vaguely related to my main point. PS2. I would also appreciate some links to related/similar optimization passes that the compiler does.

Thanks


Interesting note: changing A to type A = (u64, u64); seems to produce the same output for f and f_opt, while type A = (u64, u64, u64); reintroduces the problem.

The more I think about this, the more I struggle to understand why the compiler would ever want to memcpy the moved arguments into the stack frame instead of transforming the function arguments into by-reference. I wonder if there are other compilers/literature doing a similar transform (although Rust is probably uniquely designed to be accommodating to this kind of thing; maybe not many other compilers have the luxury of doing such tricks "easily").

It's been a long time since I looked into similar problems, but one thing I recall is that rustc explicitly generates code that performs the memcpy, and produces function parameters not qualified with unnamed_addr. Back then, this was one thing that prevented the optimizer from eliminating the copies and performing the transformation you're looking for.

This is unlike C code for which clang passes the argument qualified as byval, which makes LLVM perform the copy on the stack itself and has a few properties that help the optimizer.

Unfortunately, switching to byval isn't (wasn't?) an option at all (I don't recall the details) and simply making the address insignificant would IIRC be a breaking change.

Is unnamed_addr also taken into account for MIR transformations or are you referring to purely the llvm tag? If the latter I am assuming that you mean that this transform should happen at the llvm level, is there something stopping a value moved from being marked as unnamed_addr?

Yeah, I've been talking strictly about the LLVM level there. I'm sorry I don't recall many details, but I think there was code that actually relied on address significance and broke. I would have to spend some time looking into things again to be more helpful :confused:

It's ok, thanks very much. I will dig through the code myself a bit and come back when I inevitably get stuck.

Here's a playground so things are a bit clearer. Here's what I learned:

  • It looks like LLVM is explicitly told to do memcpy
  • The memcpy happens in order to copy into the return value, not the argument.
  • The LLVM main function allocates the argument and return lvalues separately, therefore f has no choice but to duplicate the data. I was mistaken to think that the memcpy happens in order to move values into the callee's stack frame.
  • LLVM needs to do a lot of reverse engineering to infer that the argument and return memory can be unified. In particular, operations in f would need to be removed based on call-site information (particularly @llvm.lifetime.end), which sounds fairly convoluted. It seems to me that MIR can much more easily be transformed to unify the argument and return data.
  • The particular transform I would go for is the one indicated by the difference between f and f_opt, so we can reuse all the tricks done by the Inline pass instead of changing the call sites ad hoc. This transform would happen before the Inline pass. Hopefully this would also avoid copying even for cases like let (x, y) = f(x, y).
  • Some particulars about the design I came up with:
    • A new field of LocalDecl contains information on whether a local is mut-ref-able w.r.t. the current body. A mut-ref-able local is a) an argument of the body, b) possibly modified, and c) eventually moved into a return value that is constructed only of mut-ref-able locals and constants.
    • A function that has at least one mut-ref-able argument is rewritten in the same way that f is rewritten to f_opt.
    • Nested functions are transformed in the same way before invoking the Inline pass.

I will try to make this more concrete as an RFC but any preliminary comments are more than welcome.
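To make the let (x, y) = f(x, y) case concrete, here is a sketch of the call shape the transform is meant to help (swap_add is a made-up example): today the return tuple is built in a fresh slot and copied back into the bindings, while the proposed pass would let the callee write the caller's slots in place when that is provably safe.

```rust
// A call whose arguments and destination are the same locals. Semantically
// this is plain dataflow; the optimization question is whether the callee
// can construct the result directly in x and y's storage.
fn swap_add(x: u64, y: u64) -> (u64, u64) {
    (y, x + 1)
}

fn main() {
    let (x, y) = (1u64, 2u64);
    let (x, y) = swap_add(x, y);
    assert_eq!((x, y), (2, 2));
}
```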

Semantics of MIR assignments, around aliasing, ordering, and primitives. · Issue #68364 · rust-lang/rust · GitHub and https://github.com/rust-lang/rust/issues/71117 are very relevant to this. The answer to those questions determines if this optimization is possible, or if it can cause miscompilations for sound unsafe code. Also see Fix Dest Prop by JakobDegen · Pull Request #96451 · rust-lang/rust · GitHub, which should fix the destination propagation mir optimization, which I believe is relevant here, once it lands.


Just answering this one directly, assuming this is a function where we can use whatever ABI:

There are three ways to pass values at the ABI level. You can

  • pass values directly in registers (usually preferred)
  • pass values in memory at a known offset from the stack pointer
    • before the stack pointer = in the caller frame
    • after the stack pointer = in the callee frame
      • relies on the "red zone" of safe memory after the stack pointer
  • pass values in memory by passing an address/pointer by one of the other mechanisms

It's perfectly fine to replace fn(T) with fn(&'_ T), though not ideal when T is small enough to pass in registers and doesn't have its address taken (which will force it onto the stack anyway).

What's not okay in general is transforming fn(T) -> T into fn(&'_ mut T) without further consideration. If T: Copy, then the value in the caller's frame must not be modified, so the caller cannot pass the address of their existing value; they'll have to make a copy anyway. If T: !Copy, it's IIRC still an open question whether moving from a place deinitializes the place; if yes, allowing the callee to clobber the value is fine; if no, it's the same situation as with T: Copy. Another way of justifying the transform is if the dynamic borrowing rules mean moving from a place invalidates all outstanding provenance, so accessing the place is UB and it doesn't matter whether the place is deinitialized, because it's gone.
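The T: Copy case can be seen directly at the source level: the caller may keep using its original value after the call, so a naive rewrite that hands the callee a pointer into the caller's frame would be observable. A minimal demonstration:

```rust
// `a` is Copy, so passing it by value leaves the caller's copy intact.
// A callee that mutated through a pointer to the caller's storage would
// break the first assertion in main.
fn consume(mut a: [u64; 8]) -> [u64; 8] {
    a[0] = 99;
    a
}

fn main() {
    let a = [0u64; 8];
    let b = consume(a);
    assert_eq!(a[0], 0);  // caller's copy must be unchanged
    assert_eq!(b[0], 99); // callee saw and modified its own copy
}
```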

There's semantic benefit to having “move place” do the same thing whether T is Copy or not. However, I do personally think that we're more likely to say (if we haven't already; I haven't kept up with decisions here) that moving !Copy values makes the moved-from place UB to access, because this doesn't just justify allowing the callee to clobber that space, but also allows overlapping another stack local into the same physical address since the old AM object is no longer validly accessible. (Strangely enough, this might make implementing Copy a pessimisation, since then the object remains accessible... one of the few places a move place expression would actually be useful, to force the invalidation of a Copy place. The same can be accomplished with *&mut place to invalidate all outstanding borrows.)


Then we would come to “placement by return”: transforming fn(...) -> T into “fn(..., &lateinit mut T)”. MIR already sort of acts like this, as it assigns return values to the return slot (_0 IIRC) rather than returning a value the normal way.

This also causes us some problems, because ideally we'd like to support both x = f(x) and y = f(x) constructing the return value directly in place. The former requires a signature of fn(&mut T), whereas the latter requires a signature of “fn(&move T, &return T)”[1] because the two objects have different addresses.
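The y = f(x) shape can be emulated by hand today with an out parameter. This is only a sketch of the interface shape the hypothetical &return T would give, with MaybeUninit standing in for the “starts uninitialized, must be initialized before a normal return” contract (it does not itself eliminate the copy):

```rust
use std::mem::MaybeUninit;

// The caller provides uninitialized storage; the callee must fully
// initialize it before returning normally (the `&return T` contract).
fn fill(out: &mut MaybeUninit<[u64; 16]>, v: u64) {
    let mut a = [0u64; 16];
    a[5] = v;
    out.write(a);
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    fill(&mut slot, 7);
    // Sound only because `fill` upholds the initialization contract.
    let a = unsafe { slot.assume_init() };
    assert_eq!(a[5], 7);
}
```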

There's lots of interesting decisions and tradeoffs to be made here, as well as layering where one layer may have semantics allowing the transformation but in lowering to the next level forgets the permission to make the transformation.


    • &move ≈ the callee must move from the place before exiting (either normally or by unwind)
    • &return ≈ the place starts uninitialized; the callee must initialize the place before returning normally but leave it uninitialized in an unwind
    ↩︎

Doesn’t ptr::read already rely on that, though? And mem::replace in the other direction. I know language-level operations are different from optimizations, that we can’t automatically say “change a mem::replace-like operation into a ptr::read/write-like operation”, but it wouldn’t be adding any new concepts to the language implementation.
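For reference, the two primitives being contrasted, in a form that compiles today (the ptr::read half needs the usual care to avoid a double drop):

```rust
use std::mem;
use std::ptr;

fn main() {
    // mem::replace moves the old value out and installs a new one, so the
    // place stays initialized.
    let mut s = String::from("old");
    let prev = mem::replace(&mut s, String::from("new"));
    assert_eq!(prev, "old");
    assert_eq!(s, "new");

    // ptr::read duplicates ownership bytewise; forgetting the original
    // avoids dropping the same allocation twice.
    let boxed = Box::new(5u64);
    let dup = unsafe { ptr::read(&boxed) };
    mem::forget(boxed);
    assert_eq!(*dup, 5);
}
```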

Thank you very much @CAD97 for taking the time to analyze the issue; I learned a lot from your reply.

This is a good summary of my original motivation and concern, well put. The f/f_opt definitions in my original post indicate that the optimizer is able to reason about where it is safe to apply the transformation you mention and when it is not, as long as f_opt is inlined. All the interesting considerations you put forward in your post focus on expectations on the caller side but I feel like we can get the optimization I propose by only transforming the callee (f to f_opt) and letting the optimizer take care of the rest. The caveat to all this, which you probably had in mind, is my plan relating to

Ideally I would like to be able to do that with an fn(impl FnOnce(T) -> T, &mut T):

// The caller can't tell the difference between g and g_opt but the optimizer 
// can make much better code with g_opt.

fn g(t: T) -> T {
   g_nested(t)
}

#[inline]
fn g_opt(mut t: T) -> T {
   fn g(t: &mut T) {
     // It's unclear to me if there exists an implementation of `replace_with` that the optimizer likes.
     replace_with(g_nested_opt, t);
   }
   g(&mut t);
   t
}

which opens an adjacent can of worms that relate to panics, not to mention that the optimizer might be reaching its limit.
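For reference, a minimal sketch of what replace_with could look like. This is a hypothetical free function matching the call shape replace_with(f, t) used above, not a std API; crates like replace_with and take_mut exist precisely because of the panic problem just mentioned:

```rust
// Hypothetical helper: read the value out, transform it, write it back.
// If `f` panics between the read and the write, `*t` has been moved out
// of while still reachable, which is unsound without extra guards (e.g.
// aborting on panic, as the take_mut crate does).
fn replace_with<T>(f: impl FnOnce(T) -> T, t: &mut T) {
    unsafe {
        let old = std::ptr::read(t);
        let new = f(old); // a panic here would be unsound
        std::ptr::write(t, new);
    }
}

fn main() {
    let mut x = 41u64;
    replace_with(|v| v + 1, &mut x);
    assert_eq!(x, 42);
}
```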

This "trick" is probably related to what you are saying @jrose , correct?
