Is there a place to report missed optimisation?

robinm · November 4, 2021, 4:55pm

Is there a way to report a miss-optimization? I tried to search for some kind of label on the rustc github but couldn’t find it (nor the corresponding category when creating an issue). I would expect such category to have very low priority (if at all). But if a miss-optimization is optimized at one point it would make it easy to record it, and prevent later (unwanted) regressions.

Both code should be optimized to the same assembly, and while I’m not expecting the first one to be written in practice, I think it can come up after constant propagation and inlining in real code.

pub fn square42_fp(v: Vec<u32>) -> Vec<u32> {
    v.into_iter().enumerate().map(|(idx, elem)| { if idx == 42 { elem * 2 } else { elem } }).collect()
}

pub fn square42_assign(mut v: Vec<u32>) -> Vec<u32> {
    if v.len() > 42 {
        v[42] *= 2;
    }
    v
}

godbolt

If the tool I had in mind existed, then square42_fp would have an annotation that says that its assembly should be (but isn’t yet) the same than square42_assign. In CI, both function would be compiled, and their assembly compared.

At one point square42_fp and square42_assign’s assemblies will match and thus the test will fail, reporting that the assembly of both function matches now. This would allow for a manual review (in case square42_assign was pessimized instead of optimized). At that point the author of the commit would be able to change the annotation to say that both function assembly should match.

If at one point the assembly diverge again the test will fail, making it possible for a reviewer to analyze it, and possibly catch a regression.

I think that comparing the assembly of two function is better than comparing the assembly of one function to a reference, because I expect less false positive to come up.

Since Rust makes the strong promise that high-level construction should be as efficient as a lower-level equivalent one, I hope that such tool exists.

scottmcm · November 4, 2021, 5:56pm

The compiler has optimization tests as part of its CI. Here's an example:

github.com

rust-lang/rust/blob/master/src/test/codegen/slice-as_chunks.rs#L9-L20

    
      
          // CHECK-LABEL: @chunks4
          #[no_mangle]
          pub fn chunks4(x: &[u8]) -> &[[u8; 4]] {
              // CHECK-NEXT: start:
              // CHECK-NEXT: lshr i64 %x.1, 2
              // CHECK-NOT: shl
              // CHECK-NOT: mul
              // CHECK-NOT: udiv
              // CHECK-NOT: urem
              // CHECK: ret
              x.as_chunks().0
          }

I don't know that there's an external tool for this kind of thing, though.

(And the level of difference between those two examples is such that I would be very surprised if the former would optimize down the same as the latter.)

jonasbb · November 4, 2021, 6:13pm

The two square functions do not look identical to me, so I am not sure if that counts as an allowable optimization.

square42_fp creates a new Vec as return value. This creates an additional allocation and deallocation. I am unsure if that alone counts as an observable side effect, but the capacity of the Vec can also be read and this might be different between the two functions.

CAD97 · November 4, 2021, 6:23pm

It's worth noting that this is a very tricky high level optimization to apply automatically.

The reason that vec.into_iter().collect::<Vec<_>>() doesn't allocate is a specialization of (I believe) for<T> Vec::<T> as FromIterator::<T>::from_iter::<vec::IntoIterator<T>>. Without this specialization, it's very difficult to actually justify the allocation slot reuse. (E.g. capacity could likely be an observable difference the optimizer has to not change.)

A specialization to cover this would need to specialize from_iter for something along the vague lines of all iterator combinators which (type recursively) do not change the length of the iteration and which terminates in a vec::IntoIter. The specialization could then theoretically run the iterator and collect into the moved-from head, and also fix up drop (on panic) to drop the whole Vec exactly once (except the element which panicked and is moved from).

I believe even the simple vec.into_iter().map(identity).collect() will not reuse the Vec allocation.

robinm · November 4, 2021, 6:33pm

I would argue that the capacity is not something that can be observed from outside of the function, since there is no guaranty that the output Vec is related in any way to the input one.

For the allocation, I’m not sure that it counts as observable effect. I think, but I can be wrong, that in C++, it’s not considered as observable behavior, and I would have expected the same in Rust. In zig, since the allocator is passed explicitly, I think that it would have counted as observable.

But I totally agree that it’s not at all an easy optimization. I’m not really surprised that it was not optimized (yet). I mostly wanted to know if there was some place to create high-level optimization regression tests.

robinm · November 4, 2021, 6:40pm

It’s correctly optimized out godbolt. I don’t understand what is different in the assembly however (I’m not familiar enough with asm), I think it’s strictly instruction ordering.

cuviper · November 4, 2021, 6:44pm

There's no guarantee at the Rust library level, so it's something that could possibly be specialized, but at LLVM's level the capacity is definitely an observable effect.

robinm · November 4, 2021, 7:06pm

I played a bit with the assembly:

If the signature of the function is pub fn foo(v: Vec<u32>) -> Vec<u32>, then all 3 are equivalent (godbolt) except for some instruction re-ordering:

v
v.into_iter().map(std::convert::identity).collect()
v.into_iter().enumerate().map(|(_idx, elem)| elem).collect()

So it’s seems that the only thing that prevent the optimization in the original example is the map when it’s not an identity.

For fun I also tested v.clone(), and it’s not optimized away, even if v is not re-used (godbolt).

InfernoDeity · November 4, 2021, 7:28pm

This seems like it could run into a problem of "what if the codegen changes to something ruled out, because it's faster on the particular CPU (with -C target-cpu=native or -Z tune-cpu=native)"

mathstuf · November 4, 2021, 7:30pm

Right. Optimizations are usually couched in "as-if" language. The collect induces a reallocation which does not have a forced link to the old capacity, so that is "observable behavior" as far as the optimizer is concerned.

AFAIU, Clone can only be optimized away if it is also Copy (though I'm not sure if this is actually happening). See these threads:

There was a more recent one mentioning it too, but I'm not seeing it right now.

InfernoDeity · November 4, 2021, 7:32pm

Clone can be optimized away iff the compiler can prove it's trivial, other otherwise that it is pure. Copy is irrelevant to it, it's only a logical requirement that T::clone() be trivial if T: Copy.

newpavlov · November 4, 2021, 7:45pm

I think most such issues are marked as I-slow + A-LLVM. Unfortunately, such issues indeed have a very low priority and we often simply wait for LLVM to improve. For example, I had several such issues reported and I closed them myself long time after they apparently got resolved on the LLVM side.

In my opinion, one problem is that such issues often do not get mirrored into the LLVM bug tracker (lack of open registration makes reporting them only harder), so LLVM developers may not know about those issues in the first place.

scottmcm · November 4, 2021, 7:46pm

Well, that test doesn't pass those flags, so they're not really a concern.

But also, LLVM will always canonicalize things at the IR level -- so it'll always simplify multiplication to shifts as part of doing the optimizations. Then if it's actually faster on a particular target to use multiplication for shifts, then the codegen part will do that, but the test isn't looking at the assembly output to see that.

At the same time, yes, the codegen tests are absolutely more tempermental than most of the other kinds of tests in the compiler. You'll notice all the comments at the top of that one, restricting it to only 64-bit, disabling debug assertions, ...

steffahn · November 4, 2021, 8:17pm

For the record, this specialization infrastructure exists. It's for standard library iterator adapters that do not increase the length of the iterator, so e. g. filter works as well. It's become far less discoverable since the documentation of the involved unstable traits has been hidden in 1.54, e.g. InPlaceIterable in std::iter - Rust. Apparently it supports vec and binary_heap.

steffahn · November 4, 2021, 8:53pm

To show something that does work: At least for a small statically-known-size Vec, the two functions produce very similar assembly:

Compiler Explorer

mbartlett · November 5, 2021, 1:34pm

And even with the original code, that specialization does kick in, meaning that there is no allocation or deallocation. It seems to be just that the loop can't be optimised away.

Example:

pub fn iter_thru(v: usize, slice: &mut [u32]) {
    assert!(slice.len() > 42);
    assert!(slice.len() < v);
    for item in 0..v {
        if item == 42 {
            slice[42] = 17;
            // With a `return;` added here, the function suddenly is
            // just as optimised as the one below.
        }
    }
}

pub fn iter_thru_2(v: usize, slice: &mut [u32]) {
    assert!(slice.len() > 42);
    assert!(slice.len() < v);
    slice[42] = 17;
}

(Godbolt)

SadiinsoSnowfall · November 5, 2021, 5:13pm

Continuing the discussion from Is there a place to report missed optimisation?:

Another reduced test case :

pub fn iter_to_2(v: usize, slice: &mut [u32; 8]) {
    if v <= slice.len() { return; } // remove panic noise

    for item in 0..v {
        if item > 0 {
            slice[5] = 17;
        }
    }
}

example::iter_to_2:
        cmp     rdi, 9
        jb      .LBB3_6
        mov     dword ptr [rsi + 20], 17
        lea     rcx, [rdi - 1]
        add     rdi, -2
        mov     eax, ecx
        and     eax, 7
        cmp     rdi, 7
        jb      .LBB3_4
        and     rcx, -8
        neg     rcx
.LBB3_3:
        add     rcx, 8
        jne     .LBB3_3
.LBB3_4:
        test    rax, rax
        je      .LBB3_6
.LBB3_5:
        add     rax, -1
        jne     .LBB3_5
.LBB3_6:
        ret

godbolt link (contains one more example)

The compiler is smart enough to know that the slice will be affected and thus perform the operation right away (after checking that v > slice.len(), of course, or else v could be lower or equal to 8, which would trigger an early return, or equal to 0 in which case the slice wouldn't be affected even without the first if). But then, there is still some assembly that doesn't make much sense to me.

qm3ster · November 9, 2021, 1:07pm

Where can I read about why they were hidden from documentation?

steffahn · November 9, 2021, 1:11pm

I don't know. Possibly because they're considered implementation details? I could try to locate the PR that made them hidden.

The only good way I’m aware of for reading what they currently look like is to clone the rust repository and build the hidden docs locally e.g. with RUSTDOCFLAGS='--document-private-items --document-hidden-items' ./x.py doc library/std --open. I only know about these due to the fact that they used to be visible in the online standard library docs.

steffahn · November 9, 2021, 1:16pm

Here:

github.com/rust-lang/rust

Public unstable feature="inplace_iteration" APIs without a tracking issue

opened 04:27PM - 16 May 21 UTC

closed 09:56PM - 17 May 21 UTC

SimonSapin

T-libs-api B-unstable Libs-Small

The `SourceIter` and `InPlaceIterable` unstable traits are public, so they and t…heir `impl`s show up in docs. For example https://doc.rust-lang.org/1.52.1/std/iter/trait.InPlaceIterable.html. However unlike other unstable public APIs, there is no link to a tracking issue. These traits were added by @the8472 in https://github.com/rust-lang/rust/pull/70793. It’s not clear from the PR’s description if the traits are intended as private implementation details of the standard library that are only technically `pub` so that they can be used across multiple standard library crates, or if they are intended as a new public API with a possible path to eventual stabilization. Source definitions of the traits and `impl`s should either have `#[doc(hidden)]` attributes added, or a tracking issue should be filed and the `#[unstable]` attributes modified to point there.

Topic		Replies	Views
Idea: `#[opt_level(level)]` for performance-sensitive functions	9	647	April 1, 2024
Resolving Rust's forward progress guarantees compiler	52	6770	August 6, 2020
Using a custom optimisation pass pipeline compiler	7	3822	March 25, 2019
Benchmark optimization woes	10	3454	March 25, 2019
Why does Rust generate 10x as much unoptimized assembly as GCC? compiler	23	4606	November 2, 2021

Is there a place to report missed optimisation?

Related Topics