Bounds checking optimizations

Hey!

I have been working on optimizing bounds checks for C++ in LLVM (especially -fsanitize=array-bounds) and have been checking whether some of my work translates to Rust as well.

Some time ago I implemented an optimization in LLVM that focuses on loops with only thread-local side effects and allows hoisting the bounds checks out of them. For instance:

pub fn do_stuff(x: &[i32], y: &mut [i32], s: usize) {
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}

Can be transformed to something along the lines of

pub fn do_stuff(x: &[i32], y: &mut [i32], s: usize) {
    if s > x.len() || s > y.len() {
        panic!();
    }
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}

This almost works for Rust as well, except that the calls to core::panicking::panic_bounds_check are a) not memory(argmem: read) (which is a precondition for the optimization) and b) take the failing index, which makes them dependent on the specific loop iteration. While the compiler is able to figure out that the index is always the slice's length, it only does so after indvars has already run. a) should be easy to fix, but b) is more involved.

If I change rustc to codegen llvm.trap() instead, unsurprisingly the optimization works. That brings code size improvements as well, because we don't need to set up the registers (my toy loop above shrinks by 3% after stripping the object file). We have found that code size improvements generally come with improved performance. I ran bench_runtime_local with generally positive results (range: [-4.53%, +1.26%], mean: -1.18%, pinned to one core with the performance governor on my machine).
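For illustration, the hoisted trap variant corresponds roughly to this source-level shape (a sketch using std::process::abort() to stand in for llvm.trap; not what rustc actually emits):

```rust
pub fn do_stuff(x: &[i32], y: &mut [i32], s: usize) {
    if s > x.len() || s > y.len() {
        // Stand-in for llvm.trap(): no message and no failing index,
        // so nothing here depends on the loop iteration and the check
        // is trivially hoistable.
        std::process::abort();
    }
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}
```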

Original codegen
do_stuff:
        .cfi_startproc
        pushq   %rax
        .cfi_def_cfa_offset 16
        testq   %r8, %r8
        je      .LBB0_8
        cmpq    %rsi, %rcx
        movq    %rsi, %rax
        cmovbq  %rcx, %rax
        leaq    -1(%r8), %r9
        cmpq    %r9, %rax
        cmovaeq %r9, %rax
        cmpq    $7, %rax
        ja      .LBB0_3
        xorl    %eax, %eax
        jmp     .LBB0_5
.LBB0_3:
        incq    %rax
        movl    %eax, %r9d
        andl    $7, %r9d
        movl    $8, %r10d
        cmovneq %r9, %r10
        subq    %r10, %rax
        xorl    %r9d, %r9d
        movaps  .LCPI0_0(%rip), %xmm0
        .p2align        4
.LBB0_4:
        movups  (%rdi,%r9,4), %xmm1
        movups  16(%rdi,%r9,4), %xmm2
        xorps   %xmm0, %xmm1
        xorps   %xmm0, %xmm2
        movups  %xmm1, (%rdx,%r9,4)
        movups  %xmm2, 16(%rdx,%r9,4)
        addq    $8, %r9
        cmpq    %r9, %rax
        jne     .LBB0_4
        .p2align        4
.LBB0_5:
        cmpq    %rax, %rsi
        je      .LBB0_10
        cmpq    %rax, %rcx
        je      .LBB0_9
        movl    (%rdi,%rax,4), %r9d
        xorl    $50, %r9d
        movl    %r9d, (%rdx,%rax,4)
        incq    %rax
        cmpq    %rax, %r8
        jne     .LBB0_5
.LBB0_8:
        popq    %rax
        .cfi_def_cfa_offset 8
        retq
.LBB0_10:
        .cfi_def_cfa_offset 16
        leaq    .Lanon.d5329084f0fbe5323db051eb288edcc7.1(%rip), %rdx
        movq    %rsi, %rdi
        callq   *_RNvNtCsiTQtmXicy1o_4core9panicking18panic_bounds_check@GOTPCREL(%rip)
.LBB0_9:
        leaq    .Lanon.d5329084f0fbe5323db051eb288edcc7.2(%rip), %rdx
        movq    %rcx, %rdi
        movq    %rcx, %rsi
        callq   *_RNvNtCsiTQtmXicy1o_4core9panicking18panic_bounds_check@GOTPCREL(%rip)
.Lfunc_end0:
        .size   do_stuff, .Lfunc_end0-do_stuff
        .cfi_endproc
Codegen with trap
do_stuff:
        .cfi_startproc
        testq   %r8, %r8
        je      .LBB0_9
        cmpq    %rsi, %rcx
        movq    %rsi, %rax
        cmovbq  %rcx, %rax
        leaq    -1(%r8), %r9
        cmpq    %r9, %rax
        cmovaeq %r9, %rax
        cmpq    %rax, %rsi
        je      .LBB0_10
        cmpq    %rax, %rcx
        je      .LBB0_10
        cmpq    $8, %r8
        jae     .LBB0_6
        xorl    %eax, %eax
        jmp     .LBB0_5
.LBB0_6:
        movq    %r8, %rax
        andq    $-8, %rax
        xorl    %ecx, %ecx
        movaps  .LCPI0_0(%rip), %xmm0
        .p2align        4
.LBB0_7:
        movups  (%rdi,%rcx,4), %xmm1
        movups  16(%rdi,%rcx,4), %xmm2
        xorps   %xmm0, %xmm1
        xorps   %xmm0, %xmm2
        movups  %xmm1, (%rdx,%rcx,4)
        movups  %xmm2, 16(%rdx,%rcx,4)
        addq    $8, %rcx
        cmpq    %rcx, %rax
        jne     .LBB0_7
        jmp     .LBB0_8
.LBB0_10:
        ud2
.LBB0_5:
        movl    (%rdi,%rax,4), %ecx
        xorl    $50, %ecx
        movl    %ecx, (%rdx,%rax,4)
        incq    %rax
.LBB0_8:
        cmpq    %rax, %r8
        jne     .LBB0_5
.LBB0_9:
        retq
.Lfunc_end0:
        .size   do_stuff, .Lfunc_end0-do_stuff
        .cfi_endproc

Performance aside, having worked with crashes, one of the nice things about ud2 is that it leaves the program's state completely alone: if you have a register dump, you can try to reconstruct what happened from the disassembly. With some work it should be possible to encode in debug information which register contained the faulting index (as long as we don't deduplicate the ud2s), so in theory we can get the same debugging experience. For my work on -fsanitize=array-bounds, I have a script that reconstructs the register from a crash using Capstone, and it works very well.

In our work on sanitizers we go to great lengths to optimize traps in LLVM, a lot of which should be applicable to Rust as well.

Would a flag that hard traps on out of bounds be likely to be accepted? If so, are people interested in improving the debug experience for traps in the long term?

12 Likes

First, thanks for working on bounds check eliminations! I've sometimes had trouble with them not being well handled in LLVM since they do things that would be weird in hand-written code.

Note that we have panic=abort, which is closer to trap than unwind, but still not quite the same. Maybe what you're thinking about is basically Tracking Issue for immediately-aborting panics (currently -Cpanic=immediate-abort) · Issue #147286 · rust-lang/rust · GitHub ?

TBH I've often pondered having -C overflow-checks=trap as well, for reasons similar to the ones you describe here. If I turn overflow checks on in release, I usually don't actually want them all to be unwind locations, and would be happy to just get a Watson crash dump from them, with the greatly reduced binary size (by not having all that panic code, just the jo+ud2) being a really nice bonus.

FWIW we generally tell people to do that themselves in their rust code. I call it reslicing (Understanding Rusts Auto-Vectorization and Methods for speed increase - #5 by scottmcm - help - The Rust Programming Language Forum).

(Since making the length of each slice and the upper bound of the iteration the same SSA value unsurprisingly makes it really easy for LLVM to optimize out the checks -- though its inequality propagation has gotten much smarter lately, so it tends to do well even without that these days.)
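A minimal sketch of that reslicing pattern, applied to the running example from the first post:

```rust
pub fn do_stuff(x: &[i32], y: &mut [i32], s: usize) {
    // Reslice up front: the panic (if any) happens here, before the loop.
    // Inside the loop, the bound `s` equals both slice lengths, so LLVM
    // can elide the per-iteration bounds checks.
    let (x, y) = (&x[..s], &mut y[..s]);
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}
```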

As for having the index in the panic, it seems that's what vector::at does in clang too, so it's not obvious that we'd want to change it. It can certainly be useful at times, even though yes, it does mean that things like x[0] + x[1] emit somewhat more code than x[1] + x[0], since the former has two possible panics and the latter only one.

One of my oldest LLVM issues is related to that, actually Also outline conditions that pick which cold BB to go to · Issue #55759 · llvm/llvm-project · GitHub

Is that for performance reasons specifically? Because I feel other than that it would be preferable if people didn't have to do that.

EDIT: Actually, playing with this, it doesn't seem to codegen into ud2, but still into the same function call.

2 Likes

Also because I think it's a better reflection of the semantics that people actually want, especially if they ever catch panics.

I don't think anyone ever wants "write some then panic", even though that's what they technically wrote. They want "write all of them" or "panic and write none of them".

(This is related to how I think Iterator::zip is often meh for people, because the common case is "I expect them to be the same length or for one to be infinite", not "I'm intentionally taking the common prefix length".)

So starting with assert_eq!(x.len(), y.len()) or assert!(x.len() <= n); assert!(y.len() <= n); is good even when it doesn't make things faster. Clearer to the caller, too.
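As a sketch of that pattern (function name is hypothetical, just for illustration):

```rust
// Asserting the lengths up front makes the contract explicit to the
// caller and gives LLVM a single dominating check to reason from.
pub fn xor_into(x: &[i32], y: &mut [i32]) {
    assert_eq!(x.len(), y.len());
    for i in 0..x.len() {
        // The assert above lets LLVM prove both indexing ops in bounds,
        // so neither needs its own panic path.
        y[i] = x[i] ^ 0x32;
    }
}
```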


Though I'd also really like a type for EqualLengthSlices<'a, T, U> or something. That particular case comes up really often, and it'd be nice to encode it in the type system instead of needing the asserts. (See things like https://doc.rust-lang.org/std/primitive.slice.html#method.swap_with_slice.)

3 Likes

Makes sense. But from an optimization perspective: do you have any intuition about how often people actually do this vs. don't? If they often don't, it is still worth trying to optimize.

5 Likes

Couldn't the Rust compiler (which presumably has a higher-level understanding of the bounds checks it adds than LLVM does) do the hoisting itself already (presumably in MIR or such)? That way we could still include the index.

It seems that would be duplicating some of the loop optimizations that LLVM already does.

In this case, LLVM could do it if we taught it to. It seems to be a pass-ordering problem: in the final IR, LLVM does realize that the index always has to be the length of the slice.

The need to specify the failing index is problematic for optimisations, but LLVM is still able to handle it in some cases. For example, with this code at optimisation level 3:

pub fn dot_product(x: &[i32], y: &[i32], count: usize) -> i32 {
    let mut r = 0;
    for i in 0..count { r += x[i] * y[i]; }
    r
}

LLVM vectorises by doing the bounds checks in advance, doing a vectorised loop if they pass, and doing the iterations one at a time if they fail. (It needs to generate the one-at-a-time code anyway to handle the situation where count isn't a multiple of the vector size.)

I think this general technique is useful in most cases (in all of C, C++ and Rust) where bounds checks are done mid-loop but it's possible to tell in advance whether one of them will be hit or not – the challenge may be in getting LLVM to apply it in every situation where it helps.
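To make the shape LLVM produces here concrete, a hypothetical source-level sketch (this is an illustration of the described structure, not the actual codegen):

```rust
pub fn dot_product_sketch(x: &[i32], y: &[i32], count: usize) -> i32 {
    let mut r = 0;
    if count <= x.len() && count <= y.len() {
        // Fast path: all bounds checks proven in advance,
        // so this loop is free to be vectorised.
        for i in 0..count {
            r += x[i] * y[i];
        }
    } else {
        // Slow path: one iteration at a time; the first out-of-bounds
        // access hits the ordinary panic with the failing index.
        for i in 0..count {
            r += x[i] * y[i];
        }
    }
    r
}
```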

TBH, most of the bounds checks aren't anything special at all. If you write foo[i..j], for example, that's just normal Rust code that calls a function that panics:
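Roughly (an illustrative sketch, not the real std implementation, which has more layers):

```rust
use std::ops::Index;

// foo[i..j] is ordinary trait dispatch to Index::index, which does the
// range check itself and calls a cold panicking function on failure.
// rustc inserts no special bounds-check pass of its own.
fn explicit_index(foo: &[i32], i: usize, j: usize) -> &[i32] {
    Index::index(foo, i..j)
}
```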

There's no "go through and add all the bounds checks" pass to have a holistic understanding.

Really, rustc has basically nothing that understands loops in codegen. We leave that entirely up to LLVM.

4 Likes

I have the impression that we want to move optimizations from LLVM to rustc where possible, for a variety of reasons.

I don't really know. The thing that makes it particularly hard to answer is that the other general advice is to just not use indexing like this in the first place.

In simpler cases the linter will even tell you to stop using indexing:

warning: the loop variable `i` is only used to index `y`
 --> src/lib.rs:2:14
  |
2 |     for i in 0..s {
  |              ^^^^
  |
  = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.93.0/index.html#needless_range_loop
  = note: `#[warn(clippy::needless_range_loop)]` on by default
help: consider using an iterator
  |
2 -     for i in 0..s {
2 +     for <item> in y.iter_mut().take(s) {
  |

So if people are writing the example as

#[unsafe(no_mangle)]
pub fn do_stuff(x: &[i32], y: &mut [i32]) {
    std::iter::zip(x, y).for_each(|(x, y)| *y = *x ^ 0x32);
}

then it has no bounds checks and autovectorizes https://rust.godbolt.org/z/1oqv5Mrzz.

But of course, even if they don't, then with the latest LLVM they still get it vectorized up to umin(x.len(), y.len(), s) anyway (if I'm reading https://rust.godbolt.org/z/nKbjYrdd5 right), so I don't know how much they'd necessarily even notice that technically there's still a bounds-check path in there :person_shrugging:

I think "possible" is probably too strong there. I'd be perfectly happy to never move the autovectorization stuff over into rustc, for example, because of just how much machine and target information that needs that I don't think it's really worth replicating.

But yes, if we ever get LIR (so that writing optimizations isn't so painful), then bringing in the basics so we could clean things up in a backend-agnostic way would be nice. I just don't know that we should ever bring machine-specific latency or execution-slot-count aware things into rustc, because that's just such a giant rabbit hole.

1 Like