Missing optimizations with slices?


#1

Let’s say I want to write a function that takes a slice of ints and returns the sum of the first two values. I assume that the slice must contain two elements, otherwise I want to panic.

It looks like the simplest thing is the better for the compiler:

pub fn test(ns: &[i32]) -> i32 {
    assert!(ns.len() >= 2);
    return ns[0] + ns[1];
}

This produces the following:

example::test:
        push    rax
        cmp     rsi, 1
        jbe     .LBB7_1
        mov     eax, dword ptr [rdi + 4]
        add     eax, dword ptr [rdi]
        pop     rcx
        ret
.LBB7_1:
        lea     rdi, [rip + .Lbyte_str.9]
        lea     rdx, [rip + .Lbyte_str.8]
        mov     esi, 31
        call    std::panicking::begin_panic
        ud2

Exactly what I want: one check and one branch.

Now, something slightly different:

pub fn test(ns: &[i32]) -> i32 {
    return ns[0] + ns[1];
}

I would expect that I have two checks for the size, the slice is const and the two checks can be collapsed (because sz > 0 && sz > 1 == sz > 1). Let’s see…

example::test:
        push    rax
        test    rsi, rsi
        je      .LBB6_3
        cmp     rsi, 1
        je      .LBB6_4
        mov     eax, dword ptr [rdi + 4]
        add     eax, dword ptr [rdi]
        pop     rcx
        ret
.LBB6_3:
        lea     rdi, [rip + .Lpanic_bounds_check_loc.5]
        xor     esi, esi
        xor     edx, edx
        call    core::panicking::panic_bounds_check@PLT
        ud2
.LBB6_4:
        lea     rdi, [rip + .Lpanic_bounds_check_loc.6]
        mov     esi, 1
        mov     edx, 1
        call    core::panicking::panic_bounds_check@PLT
        ud2

Uuuhhh… Ok, I understand what is going on. The two panics are different, because they refer to distinct problems. But is what the writer of the function is expecting? I would just think “ok, I want to panic if there are less than 2 elements, indexing already does that for me. Why bothering with assert!?”. Honestly, I don’t know if there is something that could be done here, but it was worth to mention the situation.

Let’s go on. We don’t like to use panicing indexing. But for now we still want to panic instead of returning an error, for reasons. We have beautiful zero-overhead abstractions, right?

pub fn test(ns: &[i32]) -> i32 {
    match ns.get(..2usize) {
        Some(&[a, b]) => a + b,
        _ => unreachable!(),
    }
}

We use slice pattern matching to get the two first values in case they exist, and for now we can just use unreachable!() to panic. Let’s look at the assembly:

example::test:
        push    rax
        xor     ecx, ecx
        cmp     rsi, 2
        cmovae  rcx, rdi
        cmp     rsi, 1
        jbe     .LBB8_1
        mov     eax, dword ptr [rcx + 4]
        add     eax, dword ptr [rcx]
        pop     rcx
        ret
.LBB8_1:
        lea     rdi, [rip + .Lbyte_str.d]
        lea     rdx, [rip + .Lbyte_str.c]
        mov     esi, 40
        call    std::panicking::begin_panic
        ud2

Maybe some of you which are asm experts do expect this, but honestly I don’t get it. Why there are 2 comparisons against rsi? rsi contains the size of the slice, and in case it is above or equal (greater or equal for unsigned values), rdi is moved to rcx. Now the size is checked again; if it is less or equal to 1 we jump to the panic trampoline (how it is called that part? :thinking:), otherwise we get the two values, sum them together in eax and we return the result. My point is that rcx is used only in case we have at least 2 values, and rdi could be used directly. To, me, it looks totally useless to check two times the value of rsi, conditionally moving rdi to `rcx’ and then using this register.

Am I missing something?


#2

As the author of the program, you could write ns[1] + ns[0]. That way, the compiler can prove that the second access will never be out of bounds. Coalescing multiple bounds checks is pretty hard for the optimizer unless it can prove that some of them are simply impossible to occur.


#3

Yes, sorry, I forgot to mention it. The fact is In this case that the programmer is probably aware of the issue, and at that point he will probably use an assertion.

I think that in this case we strictly want different behaviour, in the sense that the two panics must be different. I am wondering if we need this kind of strictness in these sort of situations.


#4

Usually taking a fixed-length slice, and using that slice instead of the original, is clear enough for the compiler to elide bound checks.

let tmp = &slice[0..2];
tmp[1]

#5

That’s true, but IMHO this exactly like what I said before: people who already know the issue would just use the assertion. The others would simply use the slice[0] + slice[1] notations without knowing that there is a cost that could be avoided. Maybe the best thing for this case could be a clippy lint…

However, I am much more interested in the second issue, because I think that the compiler could simplify and optimize the resulting code.


#6

It’s a good idea to look at the LLVM IR as well, since that has more information:

; playground::test
; Function Attrs: nonlazybind uwtable
define i32 @_ZN10playground4test17hcb15e025c7806543E([0 x i32]* noalias nonnull readonly %ns.0, i64 %ns.1) unnamed_addr #3 {
start:
  %0 = icmp ult i64 %ns.1, 2
  %1 = getelementptr inbounds [0 x i32], [0 x i32]* %ns.0, i64 0, i64 0
  %spec.select.i.i.i = select i1 %0, i32* null, i32* %1
  br i1 %0, label %bb2, label %bb4

bb2:                                              ; preds = %start
; call std::panicking::begin_panic
  tail call fastcc void @_ZN3std9panicking11begin_panic17h7da1939c2e94f92eE()
  unreachable

bb4:                                              ; preds = %start
  %2 = load i32, i32* %spec.select.i.i.i, align 4
  %3 = getelementptr inbounds i32, i32* %spec.select.i.i.i, i64 1
  %4 = load i32, i32* %3, align 4
  %5 = add i32 %4, %2
  ret i32 %5
}

It’s already very weird here, in particular %spec.select.i.i.i: it’s loaded from, but somehow there’s a select that can put null into it.


#7

You haven’t said which variant of the code this IR came from but if I had to guess, it looks like the variant matching on ns.get(..2) (the null being the None result from get). Normally that would be a phi(null, %1), but SimplifyCfg likes to turn small phis into selects, in this case probably before get is inlined. The phi-to-select transform is known to cause issues for later optimizations (see https://lists.llvm.org/pipermail/llvm-dev/2018-August/125313.html for example), this might be another instance of that problem.