Can we pass `Copy` values by immutable reference?

Post inspired by

Consider the following code:

#[derive(Clone, Copy)]
pub struct Big([u32; 64]);

pub fn f(big: Big) {
    unsafe {
        g(big);
    }
}

extern "Rust" {
    fn g(big: Big);
}

The f function has the following asm:

big::f:
 push    rbx
 sub     rsp, 256
 mov     rsi, rdi
 mov     rbx, rsp
 mov     edx, 256
 mov     rdi, rbx
 call    qword, ptr, [rip, +, memcpy@GOTPCREL]  
 mov     rdi, rbx
 call    qword, ptr, [rip, +, g@GOTPCREL]
 add     rsp, 256
 pop     rbx
 ret

It copies the big parameter to its own stack before calling g. The memcpy is necessary, because g might modify its copy of big.


It seems it would be more efficient to say that functions, when they get passed a big Copy by-move parameter which doesn't fit into registers, aren't allowed to mutate its backing memory.

If we do that, then f can pass through the pointer to big it got from its caller, avoiding thee memcpy.

Am I missing something here?

8 Likes

@pcwalton has been exploring this recently: https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/Stack-to-stack.20copies.20in.20Rust

3 Likes

Yeah, that makes a lot of sense for an ABI.

It depends on whether we more commonly mutate the big parameters or not.

It would be better if LLVM could see that the function doesn't modify its argument and therefore we can eliminate the memcpy.

Here another possible observation is that the caller doesn't use the parameter after and therefore we can pass its ownership to the function. Whether this is generally true for any non-Copy type depends, if I remember correctly, on some UCG discussion.

That's likely the question on whether a move deinitializes the moved-from place (UCG#188, among some others).

Just in general, as well, LLVM is surprisingly loathe to elide memcpys, likely due to the fact that the address (non)overlap is considered an observable property optimizations mustn't impact.

1 Like

We do, arguably -- note the ptr parameter: https://rust.godbolt.org/z/GW4MTozW6

define void @f(ptr %big) unnamed_addr #0 !dbg !6 {
start:
  %_3 = alloca %Big, align 4
  call void @llvm.memcpy.p0.p0.i64(ptr align 4 %_3, ptr align 4 %big, i64 256, i1 false), !dbg !11
  call void @g(ptr %_3), !dbg !13
  ret void, !dbg !14
}

So the problem is eliding the copy, which people are working on for &impl Copy + Freeze stuff like this.

The tricky part is that if it really is modified down the line somewhere, there does need to be a copy taken somewhere. Right now we're just doing the "easy and obviously correct but suboptimal" version of that that's "well everyone copies it".

"Immutable" is the load-bearing word here though: if both f and g promise (via calling convention) to not modify the memory pointed to by big, no copy is necessary.

1 Like

Right, but

isn't immutable.

I could define that as

extern "Rust" fn g(mut big: Big) {
    big.0[0] = 10;
    do_something(big);
}

and if the Big wasn't ever copied, it'd affect the callers -- even the caller of f, if the pointer were just passed along all the way.

It can go either way, depending on calling convention:

  • g can modify big in place (current calling convention)
  • g might need to copy big to its own stack before modification (what's being proposed)
4 Likes

I have thought about this rough category of problems a lot, so going to use this as an opportunity to get all of my thoughts in one place.


The semantics of Rust are pretty clear that arguments are passed by-value. At least that's my opinion and afaik also the opinion of the rest of UCG. Specifically, this means the code below must print two different values:

fn callee(b: Big) {
    println!("In callee: {:p}", &b);
}

let b = Big::new();
println!("In caller: {:p}", &big);
callee(b);

If we were to "transparently" pass parameters by reference, we might print the same value both times, which would be a miscompilation.


What this means is that our ABI must have a copy in either the callee or the call site. Applied to the above example, either the callsite callee(b) must be responsible for making a copy, or we must make a copy at the entry of the callee function.

Where we make this copy has an effect on what we can optimize. If copy is made in the callee, optimizations in the caller cannot be used to remove the copy, which might lead to a missed optimization (and vice versa). We can come up with examples where either choice leads to a missed optimization.

Copying in the Callee

fn callee(b: Big) {
    println!("In callee: {:p}", &b);
}

callee(B::new());

Here we must make a copy in the callee because the call site might look like the first example above.

Copying in the Caller

fn callee(_: Big) {}

let b = Big::new();
println!("In caller: {:p}", &big);
callee(b);

Here we required to make a copy at the call site because the callee might look like the first example above.


The second article above is basically claiming that copying in the callee is better than copying at the call site. The article does acknowledge in the update that this can also be a pessimization in some cases, but still claims that copying in the callee is better. That claim appears to me to be entirely unsubstantiated - it might be true, but I have no idea.


The problem is even more interesting than this though. We can also come up with an example like this:

fn inner(_: Big) {}

fn outer(b: Big) {
    println!("In outer: {:p}", &b);
    inner(b);
}

outer(Big::new());

The ideal number of copies to make in this example is zero. There is no real need to ever copy. However, if we consistently adhere to either of the above ABIs, this is unachievable. We must always make a copy in outer, either for the inner(b) call site or for the argument.


Fortunately, ABIs are not that important in Rust anyway. The vast majority of function calls in Rust can be statically resolved and we can pass additional metadata about the function. The only cases where we can't do this are function pointers, dynamic dispatch, and linking strategies that do not allow us to pass metadata (dylibs, stable rlibs, etc.). None of these are terribly common.

What this means is that we should instead look to infer the right ABI on a per-function basis. Essentially, consider a functions like inner or outer above. The optimizer would begin by assuming that the callee (ie the function being optimized) is responsible for making the copy, and try to elide the copy. If it fails to do this, the responsibility to make the copy is punted to the call site, which will also get a chance to elide the copy.

By doing this, we can achieve zero copies in the most recent example above. The ABI for inner would be such that the callee has to make the copy (which is immediately elided because it's unnecessary) and the ABI for outer would be such that the call site has to make the copy (which is also elided due to being unnecessary).


The article mentions the possibility of multiple entry points for a function depending on whether the copy needs to happen or not. Unfortunately I don't know of any codegen backend that has support for functions with multiple entry points. I had the same idea a while back though, and fortunately there's a trick we can use to implement this: Tail calls. Specifically, imagine that we decide that by ABI, copies must always be made in the callee, and consider a function like this:

fn foo(b: Big) {
    println!("{:p}", &b);
}

Making the copy in the callee explicit, we'd codegen that function like this:

fn foo(b: &Big) {
    let real_b = *b;
    println!("{:p}", &real_b);
}

Alternatively, we could do this:

fn foo(b: &Big) {
    let real_b = *b;
    foo_inner(&real_b);
}

fn foo_inner(real_b: &Big) {
    println!("{:p}", real_b);
}

The foo_inner call can be codegened as a tail call, ie just a jmp. This means we do not have the cost of an extra function call. In effect, the combination of foo and foo_inner is a function with multiple entry points - one at foo and one at foo_inner.

The caller is now free to decide whether to call foo or foo_inner directly, if it decides that it does not care about the function call.


Interestingly, this "extra function with a tail call" strategy is actually nearly isomorphic to the "ABI inference" thing I suggested above. Specifically, imagine two minor changes: 1. We only generate the "wrapper function" if it is actually necessary, ie the copy can't be optimized out in the callee. 2. We slap #[inline(always)] on the wrapper function.

By always inlining the wrapper, we effectively move the obligation to perform the copy to the call site (which now has the copy in the inlined body). The call site is now free to optimize out the copy (the copy). This is exactly equivalent to what would have happened if we had inferred a "callee is responsible" ABI.

9 Likes

That doesn't address the more fundamental question imo what behaviour is desired here. Beyond artificial examples are there any real world use cases that require the current behaviour?

We do have stability guarantees in Rust: this isn't something we can break lightly, however, that shouldn't also lock us into C++ style "no breaking changes ever" mode either if there is a compelling enough reason for a breaking change.

I honestly don't see any reason this currently observable behaviour would be useful in real life since the parameter binding is immutable.

1 Like

I don't know for sure about this. For example, we're currently perfectly fine with different concrete functions having the same address, so maybe we're fine with saying "addresses of locals are irrelevant" (their order is already not guaranteed, so their equality not being guaranteed isn't that far away) too.

16 Likes

I agree with the two comments above. The claim that a move must change the physical address goes against my intuition of move semantics. Firstly because stack frames get reused all the time, but more importantly because the Rust abstract machine should have no requirement to move the value in physical memory on a target.

At the language level, it ought to be enough that the move makes it impossible to reuse the let b binding. By itself, this should be enough to allow the compiler to make optimizations such as reusing the physical memory backing let b for let c in:

let b = Big::new();
println!("In caller, b: {:p}", &b);
callee(b);

let c = Big::new();
println!("In caller, c: {:p}", &c);
callee(c);

And likewise, the same stack frame could be reused for callee in this very example. I don't believe that the capability to observe the physical address is evidence that a physical address change must take place.

6 Likes

I was assuming above that Big: Copy. If not, we get back slightly more wiggle room to allow some things, but not necessarily all the things we care about.

That doesn't address the more fundamental question imo what behaviour is desired here. Beyond artificial examples are there any real world use cases that require the current behaviour?

so maybe we're fine with saying "addresses of locals are irrelevant"

What we want is unfortunately not the most relevant question (and I really do wish it were). There are a couple things that we absolutely must support. For example, the Eq implementations on Arc and slices have fast-paths in which they check pointer equality. That means we can't just say that addresses can be whatever, because then we might get the wrong result here:

let a = [5];
let b = [10];
dbg!(&a[..] == &b[..]);

We have to guarantee that a and b have different addresses. We also need some amount of address uniqueness for int2ptr casts to make any sense.

The only way I know of to give these guarantees is to say that allocations (both for locals and on the heap) with overlapping liveness get disjoint address ranges. If anyone has a suggestion for an alternative, please let me know (possibly open a thread on Zulip in t-lang/unsafe-code-guidelines). I don't know one.

For example, we're currently perfectly fine with different concrete functions having the same address

For functions this is easy because functions aren't "real" allocations, they're ZSTs and in any case we can treat them as their own special thing (ie not real memory). For static deduplication we may want this too, but that is harder to achieve, and UCG/T-opsem will need to do some hoop jumping to make this work.

If your conclusion from all of this is that this sounds like a bit of a pain in the ass, then... yes, yes it is. Address are honestly a ******* nightmare, and I really wish we could just not worry about any of this. See also here for more hoop jumping that we'll probably have to do for very trivial seeming reasons (and I have more examples of other similar silliness)

Rust doesn't guarantee that values have unique addresses, e.g. ZSTs explicitly don't, so you can have two different fields in a struct that share the same address.

Unlike a similar C syntax, this is not a pointer comparison. & is not an address-of operator, but creation of a loan. Due to operator overloading this is a call to eq(), which compares contents of the slices, not the pointers.

It's pretty hard to actually compare addresses in Rust. You have to cast to raw pointers first, or use an explicit ptr_eq function. But even these aren't always reliable, e.g. Rust doesn't give guarantees about vtable addresses so, dyn pointer comparison is already unreliable: eq in std::ptr - Rust

3 Likes

The PartialEq implementation on Arc is specialized here to have a fast path in the case of pointer equality. I thought we had this optimization for slices as well, but apparently not. In any case, I don't think we can claim that such an optimization would be wrong. Your suggested semantics would also make this function in bytes not do the thing the authors expect it to do, and I expect it to break more ecosystem code (that does things like store a HashMap<*const T, U> or whatever).

e.g. ZSTs explicitly don't

The more precise way to state the claim I'm making is that allocations for different locals are non-overlapping. This condition is still satisfied for ZSTs because zero-sized allocations are always trivially non-overlapping with all others.

(Disclaimer: I am a pinged member of wg-UCG, though IIUC the most junior, and in a position to potentially be a member of T-opsem if/when it becomes a thing. The content of this post is entirely my own and does not implicate the thinking of any other member of wg-UCG.)

Wouldn't whatever logic we use for removing allocation (whether it be allocation planes or whatever) also function to allow overlapping stack allocation in this case from the same justification? (I really should try to sync talk to you at some point — I believe we can utilize the concept of allocation planes to specify the allocation API within the VM, with alloc creating a fresh allocated object deliberately aliasing the distinct larger allocated object the allocator is slicing up. There's, obviously, complexities, but I believe this idea of allocation planes to be an extremely powerful tool.)

In this specific example (slightly reworded),

#[derive(Copy, …)]
struct Data { … }

fn callee(x: Data) {
    println!("In callee: {:p}", &x);
}

let x = Data::new();
println!("In caller: {:p}", &x);
callee(x);

the copy into callee is known to be the last usage/observation of the value in the caller's scope. Thus, if any stack allocated objects are ever allowed to reuse addresses, the callee's copy should be allowed to reuse the caller's address. There's absolutely no semantic difference from inlining the function call as

let x = Data::new();
println!("In caller: {:p}", &x);
let x = x;
println!("In callee: {:p}", &x);

and in this spelling, it's IMHO quite difficult to argue that eliding the move/copy shouldn't be allowed, but should if we remove the impl Copy. Taking a sound Rust program and adding an impl Copy should never[1] be a required pessimization; this is clearly against the intent[2].

A notable but nonobvious part of the justification is that Rust provides absolutely zero guarantees about the ordering of stack allocated objects; there is no reason that a value in callee's stack frame cannot be an address previously used as part of the caller's stack frame.

If we add another observation of the caller's value's address after the call to callee, things get more interesting. Now it's possible that we have observed two separate allocated objects existing at the same address. The question is thus when do we guarantee two distinct allocated objects to have disjoint addresses?[3]

rambling

I feel it relatively possible that Rust won't have any guarantee that structurally identical values have disjoint unexposed addresses[4]. The various implications mean that I'm not going to predict this outcome in any way likely, but I see no fundamental blocker to relaxing address uniqueness in this way in some limited cases. In other words, given the example

#[derive(Copy, …)]
struct Data { … }

fn callee(inner_data: Data) {
    dbg!(ptr::from_ref(&inner_data).addr());
}

let data = Data::new();
dbg!(ptr::from_ref(&data).addr());
callee(data);
dbg!(ptr::from_ref(&data).addr());

I do not think it is a certainty that two distinct addresses are printed; I believe there's reasonable potential that printing a singular address should be a valid execution. To use the allocation planes justification, I believe non-exposing observation of an allocated object's address should not force the object to exist on the root/exposed plane.

The experimental Strict Provenance model (we're far into experimental theory if we're making arguments involving any sort of provenance or re/moving allocations; all existing optimization models just handle this by vibes) defines a pointer as three parts: an address-space, an address, and some provenance permissions. A non-exposing[5] .addr() allows observing the address, but if we say that two objects exist in different address-spaces (allocation planes), then seeing overlapping addresses does not mean the allocated objects are the same.

The key justification is that I'm restricting the overlap to structurally identical* values. The implementation of Rc::eq uses ptr::eq as an optimization to determine if the pointees are identical. It's already the case that ptr::eq does not mean that provenance is equal / that the pointers are interchangeable. This optimization is still valid if we allow pointers to compare equal if their contents are structurally identical[6]. Or, the rules could differ between stack allocated objects and heap allocated objects.

If the contents of the allocated objects are structurally distinct and live at any point in time, then their addresses must of course compare unequal, because they cannot exist in the same physical space. If the bytes of an allocated object are known to not be modified and be structurally identical to an unmodified byte region in another allocated object, and this is true for the entire duration that both objects are live, I see no fundamental reason why those allocated objects cannot be allowed to exist at the same address.


As soon as we introduce any justification for removable allocation (in the face of observable OOM), stating that allocated objects are non-overlapping becomes more difficult, by design.

The property required by the language to keep objects independent is that the {address-space, address} pair are disjoint. The address-space is a fundamentally unobservable property of pointers. So long as the address-space differs, it is possible for two allocated objects to have addresses that imply they would overlap were they in the same address-space.

So it becomes a game of controlling address-space (and provenance more generally). Does non-exposing .addr() force an allocated object to be in the root address-space? Does pointer comparison force the pointers' allocated objects to have disjoint addresses even if they're still allowed to exist in separate address-spaces?

There's a lot of implication I've not fully explored. But I believe we absolutely do want to allow stack objects to overlap, so we can eliminate stack copies that exist just to make addresses unique. I don't think formally justifying doing so if it's something we agree we want is any more theoretically difficult than justifying allocation removal, though it certainly requires a bit more human justification, because while the latter is something compilers already do, the former is a lot scarier implication of loss of control.


  1. Move semantics doing deinit invalidates this absolute, unfortunately. If moves deinit or even just invalidate outstanding provenance (count as mut-access), that provides better variable place pass-by-reference semantics (invisible &move) than possible for a Copy type. A copy is maximally ref-access, so if the address/value is potentially used after the function call, then there must exist a copy for callee to mutate the value without mutating the caller's copy. With mut-access semantics for a variable move, the caller is unable to read the value after making the call to callee (any pointers were invalidated), so callee is free to have clobbered the place. It's not currently possible to perform a move from behind a pointer, so it's not possible to observe (without already causing provenance UB) if move does deinit if a move does a mut-access retag, as borrowck prevents further access to the place except to reinitialize it. ↩︎

  2. If Rust had an opsem spec, then making changes to that spec would be much harder to justify, as unsafe code may be relying on that spec. Without a spec, though, all of the opsem rules around e.g. overlapping addresses is just vibe-based inferences from the stdlib documentation and expectations from other systems programming languages. Like adding noalias or noundef in more places resulting in more LLVM UB but justified from previous messaging of intent, it's currently acceptable for us to declare things that would be fine things to rely on in C or C++ as not guarantees provided by Rust, and start miscompiling programs relying on those properties, so long as it matches previous messaging of intent. (And ideally, provide an out for developers previously relying on it.) Safe references have always been intended to provide strong aliasing guarantees and optimizations based on them; allowing allocation addresses to overlap I believe falls under that umbrella. ↩︎

  3. A related but technically distinct guarantee question is that of address stability. We absolutely must maintain address stability while an exposed address provenance is valid. I think it completely unreasonable to not provide address stability while a reference is live (as it would require references to be relocatable). Pin documents that !Unpin values have stable addresses. It's technically possible that we don't need to guarantee Unpin addresses stable (in known-disjoint address-taking regions), and potentially beneficial not to, but I think the surprise of lacking address stability for only some types far outweighs the potential benefit. Weaker address uniqueness guarantees, though, I believe offer a significantly greater benefit, justifying the potential surprise. Developers are also already familiar with certain forms of address overlap, namely optimizing variables out entirely. ↩︎

  4. In order to justify this, I think it's at least sufficient that the structural equivalence is true for the entirety of one of the two allocated objects' live region. It might be that extending beyond this precondition causes too much fallout to be practical, but I do believe that this precondition is sufficient to control the fallout and desirable to be sufficient to remove address uniqueness, as doing so justifies the removal of stack copies that exist solely to provide a unique address. ↩︎

  5. If an .expose_addr() is done, then we must ensure that the allocation is done in the address-space accessible by external code with just the address. In practice, this means any exposed addresses exist on the "root" exposed allocation plane used by ptr::from_exposed_addr. This makes overlapping exposed addresses much harder to justify. ↩︎

  6. Allowing this certainly has some interesting implications for guaranteed_neq — being separate allocated objects of nonzero size is no longer sufficient to return Some(true); you would also need to justify that the objects could not be allowed to overlap (i.e. that they have structurally distinct contents at some point in time). ↩︎

2 Likes

the copy into callee is known to be the last usage/observation of the value in the caller's scope.

Note that this is actually only true if the compiler can see everything that happens to the &x in the first println!. If it loses track of that then there's a valid reference floating around until (as far as the compiler knows) the end of the scope.

it's IMHO quite difficult to argue that eliding the move/copy shouldn't be allowed

If Data is not Copy, the justification that we have for eliding the copy right now is to essentially say that locals are always deallocated when they are moved out of. In order for this to be reasonable, we need to ensure that there are never any valid pointers across moves.

let x = Data::new();
println!("In caller: {:p}", &x);
let x = x;
println!("In callee: {:p}", &x);

If the third line in this example is Copy, then we can't really do anything like that. Indeed, imagine that we had let mut x = x; (and then went and mutated the second x). Reusing the storage there would be dangerously wrong if we did not also prove that the &x we created in the first println! can't be used anymore. I don't expect us to want to include an analysis in the spec that can draw that conclusion, and so I do think that these addresses will need to be permanently disjoint.

I don't think formally justifying doing so if it's something we agree we want is any more theoretically difficult than justifying allocation removal

Well, UCG#328 does do that. But the problem is that removing unused allocations is something that we only ever care about if we can see the entire allocation. In cases where a pointer to the allocation escapes our view, we don't care about being semantically able to remove that allocation, because we can't actually do it anyway. However, when we're talking about removing copies, we only ever care about this second category of cases, since removing the copy is trivially correct anyway if we can see all of what happens to the new/old value. This makes it feel somewhat unlikely that these problems are going to be solved in the same way.

A major point of the argument I'm attempting to put forward is that we know that's not being done, though. We would like to overlap allocated objects elide stack copies when it's known the copy isn't mutated, and that the source isn't mutated while the copy's manufactured pass-by-reference is live.

Determining the lack of mutation involved in println! is trivial, even if we can't see through it enough to prove the reference doesn't escape — we only give it a shared reference.

Well, I'm also implicitly assuming our Data type is Freeze and doesn't contain any sort of UnsafeCell. UnsafeCell is currently never Copy, so the lack of shared mutability in a Copy value is currently sound, but this technically isn't a property guaranteed by just Copy.

Is a known lack of mutation sufficient to overlap allocated objects which are known to be structurally equal for the entire time both are live? Does doing so require changes to e.g. ptr::guaranteed_[n]eq? Is this relaxation sufficient to allow passing copies[1] by-ABI-reference-to-original instead of by-ABI-reference-to-temporary?

Is it sufficient to make nonregister Copy values pass-by-reference, with the callee required to make a copy as part of its prelude if it might modify it? This of course still observably overlaps objects, but significantly improves stack efficiency by eliminating the purely defensive stack copies.

It really would be unfortunate if we can't make Rust more stack efficient by default with Copy values because we need to do a bunch of defensive copies; if we're bound to end up with major style guides recommending passing Copy values larger than a scalar pair by reference instead of by value/copy for performance reasons, because the ABI semantics don't allow rustc to remove redundant copies.

This comes from a legitimate curiosity. Obviously, if there could be a live mut-capable provenance used at some point, that allocated object can't be overlapping with other allocated objects. But the entire point of Rust is that we can easily derive strong proofs about where that isn't the case.

This is of course veering off into likely impossibility, but the "function signature" of println! gives &x a lifetime which ends at the end of the expression. If violating that lifetime annotation were UB, we could use the end of that lifetime region to justify that the reference did not escape.

This is absolutely not going to happen, though, because we allow things like fn ptr::from_mut(&mut T) -> *mut T to extend the "dynamic" lifetime of a reference to whenever the pointer is last used, which may be after the parent reference's lexical lifetime has expired.

So, yes, in a somewhat annoyed tone of voice.

println!("{:p}", &x) here is just a useful thing here for a "sufficiently complicated" function. It may be that a future expansion of format_args! is transparent enough[2], but there'll always be a different "complex enough" function. Which could even just be something extern.

I am legitimately interested in the open middle, though — where the address is observed, but it is known that accesses are not done afterwards. (Say, perhaps if we allow something like MIR's explicit move x in surface Rust, to perform a move from a place of Copy type. Or more simply, the type just isn't Copy.) If the address is known not observed on either side of the move, eliding it is simple. But if we can't prove that the address hasn't been observed, just that further access to the original place is not done (e.g. by declaring it UB), is that sufficient to overlap the allocated objects?


  1. Important enabler: making a copy counts as a ref-access IIUC, so invalidates any existing mut-provenance. ↩︎

  2. You could theoretically use some amount of specialization to essentially turn format_args!("{:p}", &x) into format_args!("0x{:x}", ptr::from_ref(&x).addr()). There was a discussion around whether {:p} should be semantically considered to expose the address somewhat recently; the main conclusion was that the current formatting machinery is complicated/dynamic enough that it doesn't matter, but it's possible that future improvements to the formatting machinery, or alternative formatting providers like defmt, make formatting transparent enough that it could matter. ↩︎

5 Likes