Hey!
I have been working on optimizing bound checks for C++ in LLVM (especially -fsanitize=array-bounds) and have been checking whether some of my work translates to Rust as well.
Some time ago I landed an optimization in LLVM that focuses on loops with only thread-local side effects and allows hoisting the bounds checks out of them. For instance:
pub fn do_stuff(x: &[i32], y: &mut [i32], s: usize) {
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}
This can be transformed into something along the lines of:
pub fn do_stuff(x: &[i32], y: &mut [i32], s: usize) {
    if s > x.len() || s > y.len() {
        panic!();
    }
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}
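As an aside, the effect of the transformed version can already be approximated at the source level today (the function name here is mine): re-slicing both slices up front performs each bounds check once, which lets LLVM drop the per-iteration checks in much the same way. A sketch:

```rust
// Hand-hoisted variant (name is mine, not from the transform itself):
// the slicing expressions panic up front if s is out of range, so every
// index in the loop body is provably in bounds and LLVM can elide the
// per-iteration checks.
pub fn do_stuff_sliced(x: &[i32], y: &mut [i32], s: usize) {
    let x = &x[..s]; // panics here if s > x.len()
    let y = &mut y[..s]; // panics here if s > y.len()
    for i in 0..s {
        y[i] = x[i] ^ 0x32;
    }
}
```

The proposed transform would make this pattern unnecessary for the common case where the programmer just writes the naive loop.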
This transform almost works for Rust as well, except that the calls to core::panicking::panic_bounds_check are a) not memory(argmem: read) (which is a precondition for the optimization) and b) take the failing index, which makes them dependent on the specific loop iteration. The compiler is able to figure out that the index is always the slice's length, but it only does so after indvars has already run. a) should be easy to fix; b) is more involved.
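For (a), the fix at the LLVM level would look roughly like the following. The demangled symbol and parameter names are illustrative (the real callee is the mangled core::panicking::panic_bounds_check, which already carries cold/noreturn):

```llvm
; Illustrative declaration, names mine. Precondition (a) amounts to the
; callee carrying a memory attribute that limits it to reading its
; argument memory, in addition to the existing attributes:
declare void @panic_bounds_check(i64 %index, i64 %len, ptr %location)
    cold noreturn memory(argmem: read)
```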
If I change rustc to codegen llvm.trap() instead, the optimization unsurprisingly kicks in. This also improves code size, because we no longer need to set up the argument registers for the panic call (my toy loop above shrinks by 3% after strip on the object file). We have generally found that code size improvements come with improved performance. I ran bench_runtime_local with generally positive results (range: [-4.53%, +1.26%], mean: -1.18%, pinned to one core with the performance governor on my machine).
Original codegen
do_stuff:
.cfi_startproc
pushq %rax
.cfi_def_cfa_offset 16
testq %r8, %r8
je .LBB0_8
cmpq %rsi, %rcx
movq %rsi, %rax
cmovbq %rcx, %rax
leaq -1(%r8), %r9
cmpq %r9, %rax
cmovaeq %r9, %rax
cmpq $7, %rax
ja .LBB0_3
xorl %eax, %eax
jmp .LBB0_5
.LBB0_3:
incq %rax
movl %eax, %r9d
andl $7, %r9d
movl $8, %r10d
cmovneq %r9, %r10
subq %r10, %rax
xorl %r9d, %r9d
movaps .LCPI0_0(%rip), %xmm0
.p2align 4
.LBB0_4:
movups (%rdi,%r9,4), %xmm1
movups 16(%rdi,%r9,4), %xmm2
xorps %xmm0, %xmm1
xorps %xmm0, %xmm2
movups %xmm1, (%rdx,%r9,4)
movups %xmm2, 16(%rdx,%r9,4)
addq $8, %r9
cmpq %r9, %rax
jne .LBB0_4
.p2align 4
.LBB0_5:
cmpq %rax, %rsi
je .LBB0_10
cmpq %rax, %rcx
je .LBB0_9
movl (%rdi,%rax,4), %r9d
xorl $50, %r9d
movl %r9d, (%rdx,%rax,4)
incq %rax
cmpq %rax, %r8
jne .LBB0_5
.LBB0_8:
popq %rax
.cfi_def_cfa_offset 8
retq
.LBB0_10:
.cfi_def_cfa_offset 16
leaq .Lanon.d5329084f0fbe5323db051eb288edcc7.1(%rip), %rdx
movq %rsi, %rdi
callq *_RNvNtCsiTQtmXicy1o_4core9panicking18panic_bounds_check@GOTPCREL(%rip)
.LBB0_9:
leaq .Lanon.d5329084f0fbe5323db051eb288edcc7.2(%rip), %rdx
movq %rcx, %rdi
movq %rcx, %rsi
callq *_RNvNtCsiTQtmXicy1o_4core9panicking18panic_bounds_check@GOTPCREL(%rip)
.Lfunc_end0:
.size do_stuff, .Lfunc_end0-do_stuff
.cfi_endproc
Codegen with trap
do_stuff:
.cfi_startproc
testq %r8, %r8
je .LBB0_9
cmpq %rsi, %rcx
movq %rsi, %rax
cmovbq %rcx, %rax
leaq -1(%r8), %r9
cmpq %r9, %rax
cmovaeq %r9, %rax
cmpq %rax, %rsi
je .LBB0_10
cmpq %rax, %rcx
je .LBB0_10
cmpq $8, %r8
jae .LBB0_6
xorl %eax, %eax
jmp .LBB0_5
.LBB0_6:
movq %r8, %rax
andq $-8, %rax
xorl %ecx, %ecx
movaps .LCPI0_0(%rip), %xmm0
.p2align 4
.LBB0_7:
movups (%rdi,%rcx,4), %xmm1
movups 16(%rdi,%rcx,4), %xmm2
xorps %xmm0, %xmm1
xorps %xmm0, %xmm2
movups %xmm1, (%rdx,%rcx,4)
movups %xmm2, 16(%rdx,%rcx,4)
addq $8, %rcx
cmpq %rcx, %rax
jne .LBB0_7
jmp .LBB0_8
.LBB0_10:
ud2
.LBB0_5:
movl (%rdi,%rax,4), %ecx
xorl $50, %ecx
movl %ecx, (%rdx,%rax,4)
incq %rax
.LBB0_8:
cmpq %rax, %r8
jne .LBB0_5
.LBB0_9:
retq
.Lfunc_end0:
.size do_stuff, .Lfunc_end0-do_stuff
.cfi_endproc
Performance aside, having worked with crashes, one of the nice things about ud2 is that it leaves the program's state completely alone: if you have a register dump, you can try to reconstruct what happened from the disassembly. With some work it should be possible to encode in debug information which register contained the faulting index (as long as we don't deduplicate the ud2s), so in theory we can retain the same debugging experience. For my work on -fsanitize=array-bounds, I have a script that reconstructs the register from a crash using capstone, and it works very well.
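The idea behind such a script can be sketched as follows. This is a toy text-based illustration with names of my choosing, not the actual capstone-based tool (which works on the raw instruction bytes): it finds the block containing ud2 and reports the register pair compared by each conditional jump guarding it, one of which held the faulting index.

```rust
// Toy crash-triage sketch (names mine): scan AT&T-syntax disassembly,
// locate the basic block that ends in ud2, and for every conditional
// jump targeting it report the registers of the cmp that set its flags.
fn trap_compare_registers(disasm: &str) -> Vec<(String, String)> {
    let lines: Vec<&str> = disasm
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();

    // 1. Find the label of the basic block that contains the ud2 trap.
    let mut trap_label = None;
    for w in lines.windows(2) {
        if w[1] == "ud2" && w[0].ends_with(':') {
            trap_label = Some(w[0].trim_end_matches(':'));
        }
    }
    let Some(trap_label) = trap_label else {
        return Vec::new();
    };

    // 2. For each conditional jump to that label, record the registers
    //    of the most recent cmp (which set the flags being tested).
    let mut pairs = Vec::new();
    let mut last_cmp: Option<&str> = None;
    for &line in &lines {
        if line.starts_with("cmp") {
            last_cmp = Some(line);
        } else if line.starts_with('j')
            && !line.starts_with("jmp")
            && line.split_whitespace().nth(1) == Some(trap_label)
        {
            if let Some(cmp) = last_cmp {
                let regs: Vec<&str> = cmp
                    .split(|c: char| c.is_whitespace() || c == ',')
                    .filter(|t| t.starts_with('%'))
                    .collect();
                if let [a, b] = regs[..] {
                    pairs.push((a.to_string(), b.to_string()));
                }
            }
        }
    }
    pairs
}
```

Run against the trap codegen above, this reports (%rax, %rsi) and (%rax, %rcx) for the two guards of .LBB0_10, matching the x-length and y-length checks.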
In our work on sanitizers we go to great lengths to optimize traps in LLVM, and much of that should be applicable to Rust as well.
Would a flag that hard-traps on out-of-bounds accesses be likely to be accepted? If so, is there interest in improving the debugging experience for traps in the long term?