Annotation to lint for when bounds checks occur?

I am currently writing mesh processing code in Rust, and I hit performance bottlenecks more often than I'd expect, sometimes because there is a bounds check deep inside some nested inner loop. While I could just switch those accesses to unsafe unchecked indexing, that would be a huge pain, and I appreciate the default bounds checking, especially while developing an algorithm.

I have taken some time to drop my code into godbolt and manually inspect the asm to see if there's a bounds check, but that's not always feasible: sometimes the code is spread across a bunch of files and it's cumbersome to analyze all of their assembly, and it's also hard when a single long function produces a lot of assembly.

What I would really like is to be able to determine whether my code has bounds checks without having to manually inspect it. Currently I find I am often dealing with "Schrödinger's bounds checking": until I manually inspect a piece of code, I have to assume there are bounds checks, even if I have littered asserts about the sizes of arrays/vectors all over. That bubbles up into me being unsure why my code is slow: is my implementation/algorithm poor, or are there tons of bounds checks in hot loops killing my performance?

This especially matters for geometry processing, where I am storing many indices into other structures and later indexing into those structures; iterators can't easily replace them, since there are many different structures with indices which I implicitly know to be the same size. In C++ they could equivalently be replaced with pointers, but storing references in Rust is a huge PITA.
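To make it concrete, the kind of layout I mean is roughly this (names made up for illustration):

struct Mesh {
    positions: Vec<[f32; 3]>,
    // Indices into `positions`; I know they are always in range,
    // but the compiler doesn't.
    triangles: Vec<[u32; 3]>,
}

fn centroid_sum(mesh: &Mesh) -> [f32; 3] {
    let mut acc = [0.0f32; 3];
    for tri in &mesh.triangles {
        for &i in tri {
            // Data-dependent index: each lookup here may carry a bounds
            // check in the generated code, and that's what I'd like to
            // be told about without reading the asm.
            let p = mesh.positions[i as usize];
            acc[0] += p[0];
            acc[1] += p[1];
            acc[2] += p[2];
        }
    }
    acc
}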

I'm not sure if it would be possible to create an annotation which emits a lint if a bound check occurs, but I think for the speed-fiends and optimization hunters it would be a boon, since then I am positive what I'm calling is doing what I expect.

In my head, even though C is much less safe, the certainty that each operation isn't doing anything I don't expect (which isn't strictly true, but at least I know it's mostly optimizing for speed) is beneficial for mesh processing, since I'm not too concerned about security.

1 Like

There are tools like cargo-show-asm for this: it's like your own local godbolt for any Cargo project.

Besides this, it's a hard problem, because checks depend on optimizations. Linting on the source code level would force you to use either unsafe everywhere, even where not needed, or .get()? everywhere, which still may have a branch.

1 Like

What I'm saying is that I don't want to have to decide to manually inspect asm to determine if an optimization was applied, I'd rather be able to opt-in to being warned if some optimization is not applied. If I have already decided to inspect asm, that means I have realized there was some regression and am trying to understand why, but by that point I may have missed it for a while and it could be stacked under many changes, which is quite tedious.

Currently, my main problem is non-determinism: it is a lot of effort to go back and forth to examine what lines are causing optimizations to occur/not occur. I certainly don't expect linting to happen on the source code level, since that's not where the non-determinism comes from. The non-determinism comes from conditional optimization, which happens both at the MIR and the LLVM level. It's tricky to reason about these optimizations, for example, deleting an assert can slow perf because the compiler suddenly has less information, but I might expect removing the assert to also reduce branching. These are both just guesses, and I can't know until I go and look at the asm, as maybe some other line provides sufficient information to reduce branching. If optimization is happening in LLVM, then I imagine that's out of the hands of rustc (for now), but if optimization is happening in MIR, then information about that can certainly be provided to the user in some way.

Of course, this doesn't replace benchmarking the actual use of the code in the end, but can provide more useful signals when implementing something.

1 Like

The optimizer essentially can't optimize that out. It doesn't infer things like "hey, all the values you put into that Vec are less than the length of that other Vec".

That said, are you sure bounds checks for that are really causing that much of a perf hit? On modern chips (superscalar with branch prediction) the cost of the bounds checks is often so small as to be immeasurable, for that kind of "wild" indexing.

(It's kinda like inlining: function calls themselves are cheap, but inlining is still important not because it makes things faster itself, but because it's a gateway optimization. Compiler-eliminated bounds checks are really only critical in simple loops where the optimizer can take advantage of the simpler control flow from not having the checks to do other optimizations like autovectorization, but that's not something that typically happens for things like "the indexing is an edge lookup in a graph algorithm".)

All the interesting ones are in LLVM. The ones that rustc can remove are things like "you looked at a constant index in an array -- not even a slice".

Here's a really interesting post on avoiding bounds checking by using std::hint::assert_unchecked: No-Panic Rust: A Nice Technique for Systems Programming
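The core of the technique, as I understand it, is roughly this (my own sketch, not the post's exact code):

fn sum_first_three(a: &[u64]) -> u64 {
    // Promise the optimizer that the slice is long enough. This is UB if
    // the promise is wrong, so in practice it's derived from a real check
    // somewhere higher up.
    unsafe { std::hint::assert_unchecked(a.len() >= 3) };
    // With the hint above, LLVM can usually drop the bounds checks here.
    a[0] + a[1] + a[2]
}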

Having the compiler automate the technique he describes there would be tricky, because you only know whether the bounds check was eliminated after codegen, and realistically only in release builds. One approach could be to use a different symbol for the panic handler when emitting bounds-check code while this "lint" is enabled, and then check at link time whether it's present. But maybe, if there is some LLVM equivalent to static_assert, it could emit a compile error or warning if the assert doesn't optimize out? That would be a cleaner approach, but I have no idea if that's possible.
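One way to hack up the link-time version by hand today is with a deliberately undefined symbol, so the build only links if the optimizer removed the call (all names made up, and it only makes sense for optimized builds):

extern "C" {
    // Never defined anywhere: if a call to this survives optimization,
    // the final link fails, which tells us a bounds-check (or panic)
    // path is still present.
    fn bounds_check_not_eliminated() -> !;
}

pub fn sum_first_three(a: &[u64]) -> u64 {
    assert!(a.len() >= 3);
    match (a.get(0), a.get(1), a.get(2)) {
        (Some(x), Some(y), Some(z)) => x + y + z,
        // We expect the optimizer to prove this arm dead and drop the call.
        _ => unsafe { bounds_check_not_eliminated() },
    }
}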

(I do think implementing something like this would be, at the very least, good marketing for Rust, since it provides an out for the most obvious way Rust can be "slower" than C.)

2 Likes

I agree it would be useful. It's just the architecture of rustc and LLVM that makes such a language feature very cumbersome to add.

Currently, the bounds checks always exist for as long as rustc can see them [1], so any rustc feature for asserting if the bounds checks are there would just say "yes, they're always there".

rustc builds always-bounds-checked code, then hands it off to LLVM, and doesn't see it again. If you enable LTO, then rustc isn't even optimizing the code itself, the linker is! So the bounds checks disappear long after rustc is done compiling Rust, and then it's pretty late in the pipeline to display any warnings about them.

This is basically the same problem as asserting there are no panics left in the optimized code, and it requires ugly hacks for the same reasons:

The Rust compiler itself uses LLVM FileCheck to assert whether the code is optimized properly. If you really need to have automated regression checks, that's probably the way to do it.
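For example, a codegen test using FileCheck looks roughly like this (directive syntax from memory, so treat it as a sketch):

//@ compile-flags: -Copt-level=3
#![crate_type = "lib"]

// CHECK-LABEL: @sum_first_three
#[no_mangle]
pub fn sum_first_three(a: &[u64]) -> u64 {
    // The assert dominates the indexing, so the optimized IR should no
    // longer contain the bounds-check panic path.
    // CHECK-NOT: panic_bounds_check
    assert!(a.len() >= 3);
    a[0] + a[1] + a[2]
}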


  1. well, there are some MIR optimizations that could remove them early, but in practice all the interesting optimizations that remove non-trivial bounds checks from loops are in LLVM. ↩︎

3 Likes

Hm, so it seems for the most part that at the rustc level this is not really possible, and it needs to be something that happens in LLVM (I'm also not using LTO currently; maybe I should be). I guess it would also be possible to make some architectural change in rustc so that it emits LLVM IR without bounds checks, but then you get into a whole host of safety problems which become the responsibility of rustc, which I imagine is a can of worms the devs don't want to open.

I think the no-panic approaches make the most sense, but as the blog post aszs linked points out, you're then stuck doing a bit of an optimization dance. I guess another difference from the blog post is that I'm not doing embedded dev, and there are large chunks of code where I don't care whether there are bounds checks or panics; it's just for a few key parts that knowing is useful.

Is it possible to tell rustc/LLVM to increment an atomic whenever a bounds check happens? I'd be surprised if it isn't.

It would be nice to have additional performance metrics from the stdlib/language (not just those from the hardware) to find inefficient code.

For example: RUSTFLAGS='-Zperf-counters' cargo +nightly build ... (or as a separate flag), which would add a few static atomic usize counters and provide functions to read them. That way you could write something like this:

#[test]
fn test() {
    #[cfg(...)] // Only if perf-counters are enabled
    let c: usize = std::perf::get_boundary_check();
    do_something();
    #[cfg(...)] // Only if perf-counters are enabled
    assert!(std::perf::get_boundary_check().wrapping_sub(c) < 10);
}

And in the std-lib:

unsafe impl<T> SliceIndex<[T]> for usize {
    type Output = T;

    #[inline]
    fn get(self, slice: &[T]) -> Option<&T> {
        // Probably need a macro to tell rustc to tell LLVM;
        // the macro does nothing if the perf-counter flag is disabled.
        let cond = std::perf::inc_boundary_check!(self < slice.len());
        // SAFETY: `self` is checked to be in bounds.
        if cond { unsafe { Some(&*get_noubcheck(slice, self)) } } else { None }
    }
}

Yes, that code change can result in different codegen and optimization decisions, but at the very least it gives a useful indicator, and with enough counters an indication of where you lose performance without having to look at the assembly, similar to how performance counters work in hardware. That would have the benefit of you not necessarily having to know where to look.

Other counters that could be useful:

  • memory allocation count/size (see the allocator sketch after this list)
  • memory dealloc count/size
  • reallocs
  • integer overflow/underflow checks (only relevant when they don't happen of course)
  • other expensive operations (especially when called somewhere deep in the stdlib)
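At least the allocation counters can be approximated today by wrapping the global allocator; a minimal (untested) sketch:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Read these before and after the code under test, e.g. in a #[test].
fn allocations_so_far() -> (usize, usize) {
    (ALLOC_COUNT.load(Ordering::Relaxed), ALLOC_BYTES.load(Ordering::Relaxed))
}

The bounds-check counter, on the other hand, would need compiler support, which is the point of the flag.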

What if - hypothetically, I know it's not trivial - you could hit "show ASM" in your editor and it opens a view showing the asm for a particular function? I.e. is this a matter of friction, having to search for the output, or a problem with looking at asm even if it were easy to obtain?

Having written some of those tests myself... the checking can be quite unergonomic. It lacks negative lookaround, so it's difficult to write things like: between this start and end point (the inner loop body) there should be no branches or function calls.

It's tolerable for small handcrafted snippets that are set in stone, but I think for production code it would be unbearable since even small changes can lead to perturbations that invalidate matches.

Not really, because removing bounds checks depends on the optimizer but the optimizer isn't allowed to stop incrementing an atomic.

Remember that "a bounds check" isn't really anything special. It's just some code doing an i >= n check, where it turns out that LLVM can sometimes figure out that it'll always be false and so remove that block of code. There's no "this is a bounds check" annotation or anything in LLVM for it to see.

Maybe there should be a dedicated LLVM intrinsic or instruction for a bounds check or a panic branch?

Currently LLVM has no way of optimizing a[0] + a[1] + a[2] to only check a.len() > 2, and emits 3 checks instead.

1 Like

Having run-time checks to measure performance is tricky, since the runtime checks themselves will lead to different codegen, as you suspect. At that point, you could likely create your own stdlib replacement which does something like that (akin to replacing malloc in C with a counting version). That being said, allocation is something I can easily reason about without having to dive into asm: if I touch a vec or map of some kind, I generally hoist it above hot loops since I know it will be slow. External tooling, such as valgrind/memcheck, can also check some aspects of allocation. I think what you're proposing can generally live externally.

Well, I'm primarily using vanilla vim... so it's a bit annoying. My main problem is that I want to prove to myself that code has optimized a certain way, without having to second-guess if I need to check the asm.

Does it emit this because of compatibility with the C abstract machine and some ordering requirement (I'm imagining some case where indexing an array has side-effects which can't be ignored)? If you instead hoisted the line a[2] above the other 2 indexing ops, would it emit 1 check? For cases where I am doing something like this in a nested loop indexing one vector into another, I will sometimes add an assert that all values in a vector are less than the length of the other vector, but I can't know for certain if the compiler uses that information to optimize out the bounds check.
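Concretely, the pattern looks something like this, and whether the up-front assert actually removes the check inside the loop is exactly what I can't tell without reading the asm:

fn gather_sum(values: &[f32], indices: &[u32]) -> f32 {
    // Hopefully this tells the optimizer that every later lookup is in range...
    assert!(indices.iter().all(|&i| (i as usize) < values.len()));
    let mut sum = 0.0;
    for &i in indices {
        // ...but I don't know whether this still carries a bounds check.
        sum += values[i as usize];
    }
    sum
}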

After more time spent thinking, I feel like run-time restricted value/pattern types (i.e. for all values (a: usize) in (A: Vec<usize>), a < B.len(), where B must not be mut) could solve my specific problem if they are guaranteed to remove bounds checks. Of course, implementing something like that is non-trivial, and I have no idea how it would be structured or whether it is more generally useful. I also don't know if just slapping a ton of asserts in the code would/could have the same effect.

That's because we tell it it can't, because we include the failing index in the panic message.

Change it to try { a.get(0)? + a.get(1)? + a.get(2)? }.unwrap() and then it can.
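(On stable, without try blocks, the same shape can be written with a closure; rough sketch:)

fn sum_first_three(a: &[u64]) -> u64 {
    // With get(), the failure path is just "return None" rather than
    // "panic with this particular index", which leaves the optimizer
    // free to merge the three length comparisons.
    (|| Some(a.get(0)? + a.get(1)? + a.get(2)?))().unwrap()
}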

1 Like

Yeah, I was hoping for a way to tell LLVM that it should handle a block/instruction as if it doesn't have side-effects + a way to force it to exist as long as the condition exists. Like a block of code it isn't allowed to optimize internally (see inline assembly?) or split apart, through which the condition has to be passed:

// With a promise to not have side effects
do_check:
    // Somehow hidden from the optimizer
    atomic_increment
    // Force the condition to go through this code
    _out = _in

Unfortunately I couldn't find such a thing (though my search wasn't long), so I still have no idea if this is possible.

Maybe. Would have the benefit of not impacting codegen too much, but to actually increment the counter you'd still need one of the following:

  • A debugger/runtime/interpreter that increments a counter external to the application
  • A way to tell the hardware to do it (probably ends up being the debugger)
  • A post-processing step that takes the binary as input and replaces all relevant places with a jump that does the original thing and additionally increments the counter (I think that might be how debuggers do it, in-memory)
  • An additional step as part of linking that inserts the counter increments after compilation

All of those still have a performance impact and require inspecting the debug information to know which branches belong to which check (which can likely get really difficult for release builds, even with debug information enabled).

But I do agree that it might be better for this to live outside of compilation + codegen (where possible). Doesn't Valgrind do something similar, inserting additional instructions into a binary?