Replace array bounds checking with modulo?

So I’m a newbie with Rust, but quite an experienced programmer otherwise (most experienced in C++).

I really like what I’ve seen about Rust, and have been doing some exploration.

I was looking through rust_doom for a starting point for getting some actual concrete stuff up and running with Rust, and came across a comment in this source file which made me stop and think.

    // Sorry for the messy/unsafe code, but the array bounds checks in this
    // tight loop make it 6x slower.

The thing is that a big trend for me in my work has been to move away from iterators in data representations and use simple indices instead.

The context for this is PathEngine, where we do a lot of stuff with mesh data structures. And I’ve written a bit about this subject on my blog, in this series of posts.

Many years ago our meshes started out as being ‘list like’, so with pointers for access to individual mesh elements, and these pointers then wrapped up in iterators.

Since then we’ve moved from list-like to vector-like, so with contiguous buffers replacing lots of small allocations (and memory fragmentation), but also with simple indices replacing iterators.

Some advantages of simple integer indices for this kind of stuff are:

  • data structures are relocatable, and can be saved and loaded directly to/from memory (or moved around memory) very easily
  • integer indices are more compact than iterators
  • integer indices can easily refer to more than one data structure in parallel (so we can have multiple, independent, data channels for stuff like mesh face attributes, and use the same integer indices across these multiple data channels)

As well as being used to look up mesh-related data, indices are used for traversing around the mesh, e.g. looking up edges in a face and neighbouring faces across those edges, and this has to be very fast.

Adding bounds checking to all of this would be a huge performance hit, then. So as soon as I saw that comment, this came up straight away as a big red flag for me!

I saw some previous discussion about this here: https://www.marshut.net/iswyph/compiling-with-no-bounds-checking-for-vectors.html And actually, I agree with the ‘compiler flag for unsafe goes against the spirit of Rust’ argument. It is really cool, I think, to have some kind of full safety guarantees, and this is part of why I am interested in Rust.

So I’m thinking about the whole issue of array bounds checking and safety, and it occurred to me: what if, instead of array accesses being bounds checked and out-of-bounds cases being an error, the indices were instead forced into the valid range?

So, for example, indices could be clamped to the valid range, or a modulo operator applied.
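
To make that concrete, here is a minimal sketch of the two variants (the function names are just mine, and both assume a non-empty slice, which turns out to matter later in this thread):

    fn get_wrapped(data: &[i32], i: usize) -> i32 {
        // Modulo forces any index into the valid range: memory safe, but
        // silently returns the "wrong" element for an out-of-range index.
        // (Panics on an empty slice, since i % 0 is a divide by zero.)
        data[i % data.len()]
    }

    fn get_clamped(data: &[i32], i: usize) -> i32 {
        // Clamping pins out-of-range indices to the last element instead.
        data[i.min(data.len() - 1)]
    }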

This may seem kind of odd, in that it would silently cover up program errors that you would prefer to know about, and result in some strange and (initially) unintuitive behaviour. But what I’m looking for very specifically is:

  • safety guarantees of the kind ‘your program cannot write or read outside of allocated memory’ and ‘your program will not crash or result in undefined behaviour’
  • as close to the metal as possible performance

So, personally, I’m willing to accept a bit of unintuitiveness for this.

The other side of it is then: what kind of difference would this actually make to performance? Clamping would seem to require conditional execution, but modulo can be done without any conditional execution, for what that’s worth (although it is still quite an expensive operation, I think).

Taking this idea a bit further, if we knew that every array range is inside some power of 2 range owned by the program, perhaps this could be reduced to a very fast bitmask? (But that would be a lot more complicated and intrusive a change, if even possible, than just inserting a modulo operation!)
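
Just to illustrate the mask itself (with a made-up power-of-two LEN constant):

    const LEN: usize = 1024; // must be a power of two for the mask to work

    fn get_masked(data: &[i32; LEN], i: usize) -> i32 {
        // A single AND replaces the modulo; the index is provably < LEN,
        // so the compiler can drop the bounds check as well.
        data[i & (LEN - 1)]
    }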

What are people’s thoughts on this? If there’s a reasonable chance of this actually getting into the language, I’m willing to have a go at patching this in and setting up some kind of benchmark to get an idea about what kind of difference this would actually make, if any…

I don’t find the code that messy. And it is fine to opt in to unsafe code if performance is really needed, as long as it spans a reasonable amount of code.

The thing is that the current behaviour should fit the majority of use cases, and it is the safest behaviour possible. Your proposed wrap-around vector (using modulo) is memory safe, but it will lead to silent errors quite often. I believe it would be a regression for the majority of vector use cases, because it is almost always more valuable to have correct code than extremely fast code.

Moreover, you can create your own vector type that does the modulo and uses unchecked gets internally, but provides a memory-safe interface. You just have to avoid handing out slices of such a vector.
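
Something like this rough sketch, say (the naming is mine; get_unchecked is the unchecked slice accessor):

    struct WrappingVec<T>(Vec<T>);

    impl<T> WrappingVec<T> {
        fn get(&self, i: usize) -> &T {
            // Panics only if the vec is empty (modulo by zero).
            let idx = i % self.0.len();
            // Sound: idx is provably in range after the modulo.
            unsafe { self.0.get_unchecked(idx) }
        }
    }

    // Usage: WrappingVec(vec![10, 20, 30]).get(4) returns &20 (4 % 3 == 1).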

Anyone else have any thoughts on this? I was thinking that there are most likely faster ways than modulo to force an index into a non-power-of-2 range, e.g. with things like compare and conditional move (but that all starts to get processor-specific pretty quickly). Also, while I quite like the idea of this kind of behaviour actually being the defined behaviour of a language (!), it could also just be something that could be turned on for release builds if desired, by compile flag…

Rust aims to be safe, concurrent, and fast. Your proposal fails to adhere to the first goal.

It’s that simple really. The modulo is not safe. Allowing corrupt state in the program must certainly be a niche situation, not applicable to the general case. If you have a situation where this is acceptable, go right ahead and create your own vector type that does this. For general usage, and for inclusion into the language, this is totally out of the question.


Look. If you know that your index will stay within bounds, you can use the unsafe accessors that don’t do any bounds checking at all. This is even faster than the modulo trick or what have you.
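
For example (a minimal sketch; it is entirely on the caller to guarantee every index is in range, otherwise this is undefined behaviour):

    fn sum_at(data: &[f32], indices: &[usize]) -> f32 {
        let mut total = 0.0;
        for &i in indices {
            // No bounds check at all; UB if i >= data.len().
            total += unsafe { *data.get_unchecked(i) };
        }
        total
    }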

Although I strongly disagree with it, Rust’s current philosophy in many cases seems to be that memory safety is paramount, but program correctness is dispensable. Using modulo instead of bounds checking would preserve memory safety, but not correctness. In that sense it would be highly consistent with (and analogous to) Rust’s current choice that integer overflow wraps around, which is nearly always wrong, but is fast and doesn’t compromise memory safety. (That is if modulo is faster than bounds checking - a benchmark would be nice.)

(All I’m saying is that there is at least a reasonable argument in favor of this proposal. But my strong preference would be to go in the opposite direction: keep the existing bounds checking and add optional overflow checks as well.)


@glaebhoerl

Rust wraps on integer overflow because integer arithmetic is pure (up to UB) in all common languages (can you even find 1 counterexample?), trying not to break that almost-universal trend (and because immutability is important in Rust).

On the other hand, raising some exception on OOB access is quite common.

Divide by zero is frequently quite impure: it raises exceptions or sets an error flag.

That may be one reason for wrapping on integer overflow, but the issue has been discussed thoroughly on the mailing list in the past - it was lack of performance of overflow traps (especially where they inhibit optimization) that seemed to ruin what was otherwise a strongly desired feature.

(Also, pure up to UB isn’t really pure. It still means that evaluating the wrong expression can, and sometimes will, cause the rest of the program to work incorrectly.)

How much faster is an idiv compared to a branch?

According to www.agner.org/optimize/instruction_tables.pdf , IDIV on Haswell has a latency of 39-103 cycles. That sounds a lot more painful than a well-predicted branch.

I would only use such an optimization with arrays with power-of-two size, so that I can mask instead of divide. But in that case, I wouldn’t want such an optimization built into the language. I’d write a wrapper myself.

Modulo operator is just one example, and yes this could be quite expensive, so modulo operator is probably not the best suggestion.

Another example that satisfies the same goals (a certain type of safety guarantee plus good performance) could be something like forcing out-of-range indices to zero. I'm not an expert on assembly, and this is platform dependent, but it seems like this could come down to something like a compare and then a conditional move.
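
As a sketch of what I mean (whether the backend actually emits a conditional move rather than a branch here is up to the compiler):

    fn get_or_first(data: &[i32], i: usize) -> i32 {
        // A simple select; out-of-range indices all land on index 0.
        let idx = if i < data.len() { i } else { 0 };
        data[idx] // still assumes a non-empty slice
    }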

There's a comment about bounds checking in this thread on reddit, btw:

And I'm not sure that the cost of checking just comes down to a single conditional. I think I saw comments somewhere else to the effect that there is also an issue with stack frame requirements (or something like that), because a bounds checking failure may need to be reported somehow…

What do these proposals do when you index into an empty slice?


The cost in the not-out-of-bounds case is the same for both failing and clamping to 0, since the failure code will never be run.

It costs in terms of executable size and compile time, though, since the compiler needs to generate more code for failure than it does for clamp-to-zero. But honestly this is more the fault of failure not calling abort than the current bounds-checking strategy being sub-par.

If you don’t want the safety and correctness that comes with real bounds-checking, don’t use it. It seems silly to invent different kinds of bounds-checking that still add overhead and that kill correctness. Just index the pointer directly if that’s what you want.


The possibility of empty slices complicates the situation, and throws a bit of a spanner in the works. ☹️

The worst case is indexed writes to potentially empty slices, I guess, but I’m thinking that indexed reads and writes could be treated differently. If there is a way to speed up indexed reads only, but not writes, then that would still seem pretty worthwhile.

But indexed reads from potentially empty slices is still a tricky case.

Maybe it would be possible to require all arrays to hold an extra sentinel value, and then just read off the end of the slice. (Messier, but can still provide meaningful memory safety guarantees.)
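
A rough sketch of the sentinel idea (all names made up). Note that this also covers the empty case, because the sentinel slot is always there:

    struct SentinelVec<T> {
        buf: Vec<T>, // always holds len + 1 elements; the last is the sentinel
        len: usize,
    }

    impl<T> SentinelVec<T> {
        fn get(&self, i: usize) -> &T {
            // Every out-of-range index (including any index into an
            // "empty" vec) lands on the sentinel slot at buf[len].
            &self.buf[i.min(self.len)]
        }
    }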

In some situations it would be ok to read from some global zero filled memory space, I guess, since, in some situations, zero filled memory can work as valid default object initialisation. (I’m thinking about C# structs, for example.)

I don’t know enough about Rust to know whether any of this is actually workable, though…

Or treat empty slices differently.

Most extreme would be to disallow empty slices completely. But that would be very annoying, of course, because it would add the need for special case checks in a bunch of places where no special case checks were previously required.

Or, extend the type system to include some extra information which can be used to compile differently (without bounds checking). e.g. Allow me to specify for some slice type that the slice is not permitted to be empty.
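
For example, something like this sketch (purely illustrative, not an existing type):

    struct NonEmptySlice<'a, T>(&'a [T]);

    impl<'a, T> NonEmptySlice<'a, T> {
        fn new(s: &'a [T]) -> Option<Self> {
            // Emptiness is checked once, at construction.
            if s.is_empty() { None } else { Some(NonEmptySlice(s)) }
        }
        fn get_wrapped(&self, i: usize) -> &T {
            // len() >= 1 is guaranteed, so no divide by zero here.
            &self.0[i % self.0.len()]
        }
    }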

I'm assuming that it will be possible to do clamping without a branch on at least some architectures, e.g. with conditional move on x86 and x64, and that this should then be faster. I don't have any proof of concept or benchmark to show this in practice, however…

Some relevant benchmarking here: http://danluu.com/integer-overflow/ (AMD64)

He compares estimated and measured performance costs for branch on integer overflow, so this seems quite similar to a lot of branch on range bound exceeded cases, but with the difference that branch on range bound exceeded will also need:

  • a cmp instruction, and
  • the bound to be brought in from memory, or use up a register

It seems like the branch on overflow basically gets set up to be correctly predicted for the non-overflowing case, and I guess the same is likely to happen for a branch on range bound exceeded.

My takeaway (for the branch on overflow case, at least) is that the theoretical cost of this is really not so bad. The actual cost can be very significant, but it seems like this is something that should be fixed at the level of the compiler…

Another interesting blog post on this subject:

http://bluishcoder.co.nz/2013/05/07/ranged-integer-types-and-bounds-checking.html

This kind of system with ranges actually coded into types statically is what I would need to be able to code something like PathEngine both memory safely and efficiently, I think…
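
As a very rough sketch of the shape of thing I mean (all names made up, and this alone is not watertight):

    struct FaceIndex(usize);

    struct Mesh {
        face_attrs: Vec<u32>,
    }

    impl Mesh {
        // Indices are only handed out by the mesh itself...
        fn first_face(&self) -> Option<FaceIndex> {
            if self.face_attrs.is_empty() { None } else { Some(FaceIndex(0)) }
        }
        // ...so the lookup could skip the bounds check, *if* the type
        // system could also stop an index from one mesh being used with
        // another (that is the part that needs ranged/branded types).
        fn attr(&self, f: &FaceIndex) -> u32 {
            unsafe { *self.face_attrs.get_unchecked(f.0) }
        }
    }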

If there’s a performance problem with subscript checking, that should be looked into. If you write that inner loop in a straightforward way, is the compiler missing an optimization? Could the subscript check be hoisted out of the loop?

That code looks like something transliterated from C. Would a redesign help?
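
On the hoisting question: one pattern that tends to work today is to pay for a single check up front by re-slicing, e.g. (a sketch):

    fn sum_prefix(data: &[f32], n: usize) -> f32 {
        let data = &data[..n]; // one bounds check here, up front
        let mut total = 0.0;
        for i in 0..n {
            // The optimizer can usually prove i < n == data.len() and
            // drop the per-iteration check.
            total += data[i];
        }
        total
    }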