Optimized Clone of Copy Types?

Hello everyone,

I have a question about the semantics of Copy and Clone.

Consider this trivial, rather silly program:

#[derive(Copy, Clone)]
struct Point {
    x: f32,
    y: f32,
}

fn euclidean_distance(p1: &Point, p2: &Point) -> f32 {
    let p1 = p1.clone(); // unnecessary clone
    let p2 = p2.clone(); // unnecessary clone
    ((p2.x - p1.x).powf(2.0) + (p2.y - p1.y).powf(2.0)).sqrt()
}

fn main() {
    let p1 = Point { x: 10.0, y: 1.0 };
    let p2 = Point { x: 1.0, y: 1.0 };

    println!("The distance is {}", euclidean_distance(&p1, &p2));
}

Silly though it is, I have similar code repeated over 150 times in a code base, because it is the output of a procedural macro. Here's why.

With the help of a macro, euclidean_distance is generated on any type defined by macro inputs. That type may or may not be Copy, but it will always be Clone.

The generated function always takes references for consistency. (The real function is much more complex, of course, and may not need to make copies.) As a result, when I need owned copies, I must clone() them.

Now that I'm starting to use clippy more, I discovered that it has a lint for that. I was surprised: wouldn't rustc optimize out Clone on Copy types?

Well, looking at the MIR of the example, the answer is no. The explicit call to clone is still in there, even in release mode. It even uses extra stack slots just to make the call!

Is this a missed optimization, or preserving a language semantic? In other words: does deriving Copy always imply that Clone is a trivial copy?

1 Like

This question would probably fit the users.rust-lang.org forum way better than this one.

I thought about that, but none of their categories seemed to fit a rather specific question like this. Do you have a suggestion where I should re-post it?

EDIT: I'll put it in "uncategorized", I guess.

EDIT 2: cross posted. Sorry for the noise!

I've been contemplating making a mir optimization to fix that, https://rust-lang.zulipchat.com/#narrow/stream/189540-t-compiler.2Fwg-mir-opt/topic/Rewrite.20.60Call(clone.2C.20_N).60.20to.20.60Copy(*_N).60.3F/near/209852212

4 Likes

The above macro-generated use case is compelling enough, over the long term and an increasing number of experienced Rust programmers, to justify that optimization.

1 Like

I’m not sure if I’m personally in favor of such an optimization. I like Clone not being magical. I mean, I guess if there are significant performance wins in somewhat realistic scenarios I’d perhaps rethink my opinion here but I’m not fully convinced those exist.

(By “significant” I don’t necessarily mean “large” but just “measurable and in a situation where performance isn’t totally irrelevant”.)

With specialization one could instead offer an alternative to clone() which specializes to copying whenever possible, and in a case where you’re unsure about performance implications of clone you could call that function/method instead. Also the generated (derived) Clone implementation could call that method. (It currently doesn’t use specialization yet.) Even though that would be a breaking change, too. Maybe a candidate for 2021 edition (or a later one)? (The idea being: observable changes to what code derive generates would be less “breaking” if they only come with a new edition. Although if changing clones to copies in generic code would be considered breaking then libraries doing this would need to go to the next sem-ver version, too.... I’m not sure what to think of this.)


I’m also a bit curious if someone could think up (ignoring possibly breaking changes for now) a way to structure the traits Clone and Copy, possibly with blanket implementations (and perhaps with some new extensions to the trait system), that actually enforces clone() to just copy in case the type implements Copy. One would need to be able to manually implement Clone for a generic type for just the cases where the parameter is Clone but not Copy.

Would be a bit overkill to think up something like this, if it’s just for Clone. But I can imagine different traits in new libraries benefiting from features that are powerful enough to enable these kinds of guarantees. For clone and copy, changing derived instances would do a lot already.

Is there even any types out there (in std of popular crates) that implement Copy but have a non-derived Clone implementation (for a good reason)?

1 Like

If it was possible to refer to trait methods at the const-generics level then you could have Copy require that Clone::<T>::clone == ptr::read::<T>:

trait Copy: Clone<const clone = ptr::read::<Self>> { }

This would require that the derived Clone impl actually makes clone the exact same function as ptr::read. Perhaps we could make it so that the type system can "see through" a function in this way if it's marked #[inline(always)] and consists of nothing but a call to another function which passes along the arguments in-order. In that case the derived Clone impl could look like:

impl Clone for MyCoolType {
    #[inline(always)]
    fn clone(&self) -> MyCoolType {
        unsafe {
            ptr::read(self)
        }
    }
}

There's also the issue that the types don't exactly match (unsafe vs non-unsafe, *const T vs &T). But they're close enough that we could allow the type system to still allow you to check for equality.

Edit:

Though, actually, a better way to support this would be to just allow people to declare functions/methods by assigning them a value:

impl Clone for MyCoolType {
    fn clone = unsafe { ptr::read::<MyCoolType> };
}

Wait a bit there. Before we all jump on the "missed optimization" bandwagon, let's consider that CPUs don't execute MIR. Therefore you shouldn't be looking at the MIR – you should be looking at the generated assembly. It can be quite a bit more optimized than the MIR – after all, the point of the very existence of the MIR is that it's higher-level than assembly or even LLVM IR.

And indeed, it looks like that at -C opt-level=3, the code you posted generates identical assembly, whether an explicit clone(), an explicit Copy (by dereferencing), or none of them (indirect access to the fields) is used:

This. Sure, implementing Clone differently for Copy types is bad practice, but it would be even worse if the compiler went ahead and just straight up ignored the code of a Clone impl completely, for reasons of "optimization".

7 Likes

I agree with the general principle here, but Copy and Clone are so special that there's actually a full RFC from back in May 2016 that this is explicitly allowed: https://rust-lang.github.io/rfcs/1521-copy-clone-semantics.html

And based on that, Vec::clone has been using specialization to memcpy once instead of calling <T as Clone>::clone N times since Dec 2016. (And, since then, other things like extend have been similarly optimized.)

Aside: I've also never seen a place where anyone has actually argued that their side-effectful-Clone-but-still-Copy was actually a good thing. Do any situations exist where the correct answer to getting impacted by this optimization would be anything but "you should remove Copy from the type"?

I certainly expect that it'd be rare, if ever, that this would make a material difference in runtime performance.

However, you know what does execute MIR? Const evaluation in rustc. Where being able to just be an Operand::Copy instead of a function call could easily help.

And, more generally, there are plausible compiler performance gains here. The compiler team is already looking at doing MIR-level inlining to improve compile times, and this would be like a really good, really cheap inliner -- saving LLVM from needing to do a bunch of work to inline the implementation and figure out whether it happens to be equivalent to a bitwise copy. And removing the Call also means removing the unwind edge, which should mean materially less code to codegen and have LLVM deal with.

Not to mention it'd make other MIR optimizations work better: For example dest_prop can't see through a Call, so this could make it work better. And turning those Calls into Gotos mean that SimplifyCfg can collapse all the basic blocks together, so anything working on a single BB will work better.

11 Likes

I feel like the burden of proof should not be on me in this case.

1 Like

(Zulip Archive link for the above discussion).

Is there a reason to separate this from the MIR inlining pass? IIRC derive(Copy, Clone) is already a special case, so this could just add something like #[mir_inline(always)] onto fn clone to avoid having to run inlining heuristics.

FYI, Pretty much all this has already been said on URLO.

Related, there's a PR to specialize Clone for Option<T> where T: Copy:

While I generally consider my question answered between these two threads, I feel I should add one last comment: a bit of context about my question.

In general, you are absolutely right. Before I asked, I already confirmed that, for all of the types currently generated by this macro, LLVM optimizes this to a single move (or sometimes, re-use of an input register with zero moves).

That is why the lint struck me -- a lint in the clippy::perf group, no less. My question was about the language -- hence my initial choice of the "language design" topic, and IRLO instead of URLO -- with longer-term notions in mind.

All the types I currently use this code on are simple, but it is easy for me to imagine more advanced uses (not yet written) that are more complicated. It makes me ask: will LLVM be as optimal for a #[repr(C)] struct with 16 fields, weird alignment, bit packing, and/or a fields bigger than a register?

I also hear, once in a while, about people hoping to use a non-LLVM backend "someday." This also led me to believe that the MIR is the "semantic truth" of the Rust code, and LLVM optimizations are "happy accidents" -- even if they are reliable in practice.

My question was, in effect: what is this lint actually telling me?

I think the answer is: "the language currently does not guarantee any optimizations on your clone call, even though LLVM can probably do them 90% of the time in release mode. If you want it guaranteed, remove it." The MIR seems to reflect that.

The RFC is interesting, and further optimization based on its ramifications are an exciting idea. In effect, it would eliminate the need for that lint. But I defer to the greater expertise in the thread as to whether it would be beneficial.

EDIT: I also tried to reply to @H2CO3, but somehow got @steffahn. Sorry about that.

3 Likes

Actually, going back to your given example, the lint is definitely correct (for generic types):

fn euclidean_distance(p1: &Point, p2: &Point) -> f32 {
    let p1 = p1.clone(); // unnecessary clone
    let p2 = p2.clone(); // unnecessary clone
    ((p2.x - p1.x).powf(2.0) + (p2.y - p1.y).powf(2.0)).sqrt()
}

The redundant_clone lint is telling you that you don't need to perform the clone at all, and could just use the borrowed value directly. If your type is not Copy and holds some allocation handle, you're needlessly cloning said allocation handle.

The lint it sounds like you're talking about is actually clone_on_copy, which does explicitly say why it's bad to call .clone() on a Copy type:

The only reason Copy types implement Clone is for generics, not for using the clone method on a concrete type.

(Or in other words, make a copy, not a clone, if you know that it is Copy. The reason is for clarity, not for performance.)


For the given example, clippy does in fact emit only clippy::clone_on_copy and not clippy::redundant_clone.

5 Likes

Oh, thanks for spotting this! This answers OPʼs question and my confusion about why Clippy considers clone() on copy types a performance problem: it doesnʼt.

2 Likes

Oh!

I have now removed the clone() causing the lint, and you're absolutely right. The behavior is unchanged.

Thank you!