Subscripts and sizes should be signed

There is a paper about this in C++ by Bjarne Stroustrup: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf

There are also some videos where he talks about this problem. Unsigned indexes were a mistake in the C++ STL design, and I don't understand why Rust repeats this mistake.

The main idea: if we work with the bit representation of some value, it is better to use an unsigned type; otherwise we need a signed type. Indexes and sizes of arrays must therefore be signed too.

Some of Stroustrup's arguments are applicable to Rust. Yes, there is a difference between size_t in C++ and usize in Rust: in C++ an unsigned type has modular (wrapping) arithmetic, but in Rust it does not.

Really, where are the advantages of unsigned indexes?

Maybe checking an unsigned index is more efficient (1 branch instead of 2)? No, there is a simple hack to check a signed index using one branch: (index as usize) < (length as usize). Sometimes it seems to me that the people who made the decision about unsigned indexes simply did not know about this hack.
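
Here is a minimal runnable sketch of that hack (assuming the length never exceeds isize::MAX, which holds for any slice of non-zero-sized elements):

fn get_signed<T>(slice: &[T], index: isize) -> Option<&T> {
    // One comparison: a negative index wraps to a huge usize and fails.
    if (index as usize) < slice.len() {
        Some(&slice[index as usize])
    } else {
        None
    }
}

fn main() {
    let v = [10, 20, 30];
    assert_eq!(get_signed(&v, 1), Some(&20));
    assert_eq!(get_signed(&v, -1), None); // -1 wraps to usize::MAX
    assert_eq!(get_signed(&v, 3), None);
}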

Maybe checking an unsigned index prevents some errors? No. For example, suppose I need to enumerate all elements of a vector except the last. With signed indexes I could just write for i in 0 .. v.len() - 1. But with unsigned indexes this code crashes, both in C++ and in Rust, when the vector is empty. It is an example of an error generated by unsigned indexes! With unsigned indexes I need to write something like for i in 0 .. max(v.len() as isize - 1, 0) as usize. Iterator fans may say that I am very old-fashioned and indexes are too vintage, and that it is better to write for (i, e) in v.iter().enumerate().filter(|(i, _)| i + 1 < v.len()). If they think this loop header is clearer, I can't tell them anything else.
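
A minimal demonstration of that failure mode, plus one unsigned workaround using saturating subtraction:

fn main() {
    let v: Vec<i32> = Vec::new();
    // v.len() is 0usize, so v.len() - 1 underflows:
    // a panic in debug builds, usize::MAX in release builds.
    // for i in 0..v.len() - 1 { ... } // would blow up right here
    for i in 0..v.len().saturating_sub(1) {
        println!("{}", v[i]);
    }
}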

Maybe an unsigned index is good because we get a compile-time invariant that the type corresponds to valid values? No, it is not, because if some vector has size s, the data type corresponding to all valid indexes would be [0 .. s-1]. But an unsigned type corresponds to numbers from 0 to 2^ptr_size - 1. So we have a type which matches the valid values only at the lower bound, and we still need to check the upper bound at runtime.
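
To make that concrete: the type admits any usize, so the upper bound is still enforced only at runtime:

fn main() {
    let v = vec![1, 2, 3];
    let i: usize = 10; // perfectly valid as a usize, invalid as an index
    assert!(v.get(i).is_none()); // the upper bound is a runtime check
}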

Also, the claim that the lower bound is compile-time-checked is a lie. Suppose we write arr[x-y]. How can the compiler prove that x-y >= 0? It can't. In fact this "compile-time invariant" is supported by runtime checks. A "compile-time invariant" in this context is just pretty words for hipsters, without any useful purpose.

Also, about the idea of bounding each piece of data by a type corresponding only to valid values: why is that good? It is not obvious to me. Very often we need to make calculations where an intermediate result is out of bounds, and then we have to write dirty code full of casts, which is not good!

Also, I have an example where an unsigned index is slower than a signed one.

if let Some(element) = arr.get(x-y) { ... }

With signed indexes this code would have only one branch. But indexes are unsigned, so we must check overflow and underflow in different ways, and the compiler can't optimize it: if let Some(element) = if x < y { None } else { arr.get(x - y) } { ... }
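
A runnable version of the unsigned spelling (lookup is a hypothetical helper name, just for illustration):

// Hypothetical helper showing the extra branch with unsigned indices:
// one check for underflow of x - y, plus the bounds check inside get().
fn lookup<T>(arr: &[T], x: usize, y: usize) -> Option<&T> {
    if x < y { None } else { arr.get(x - y) }
}

fn main() {
    let arr = [10, 20, 30];
    assert_eq!(lookup(&arr, 2, 1), Some(&20));
    assert_eq!(lookup(&arr, 1, 2), None); // x < y: would have underflowed
}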

If you have an example where an unsigned index is more efficient, please write it, because I don't know of such examples.

You can see that with unsigned indexes the code is dirtier, more verbose, and slower. So I think we need to fix this in several steps:

  1. Add the possibility to use signed indexes and sizes in the language and std.
  2. Add a requirement to use signed indexes in the slice traits.
  3. Mark unsigned indexes as deprecated.
  4. Disable unsigned indexes. The Rust codebase is not as big as the C++ one, so it is not too late to fix this design mistake.

Optimally you would be able to index slices and provide sizes with any integral type (signed or unsigned).

In some cases you want platform-dependent types (usize / isize), but in many other cases it would be better to use platform-independent types (u16 etc).

  • If your code is supposed to handle slices of any size supported by the machine, you want usize (or isize)
  • If, on the other hand, the size of your slice is platform-independent (e.g. you know it's 1000000), it's better to use the appropriate platform-independent type (e.g. i32).

Using a platform-independent type makes your code more portable.

For instance, with usize:

if big_data {
    let a = vec![0; 1000000usize];
    // ...
}

will compile on 32-bit platforms but not on 16-bit platforms, which is a portability problem. usize was supposed to make code more portable, but in this case it makes it less portable.
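
One way to keep the count platform-independent is to check the conversion at the boundary; a sketch:

use std::convert::TryFrom;

fn main() {
    let n: u32 = 1_000_000; // platform-independent count
    // Convert to usize only where the allocation needs it; on a 16-bit
    // platform this fails loudly instead of silently truncating.
    let len = usize::try_from(n).expect("length does not fit in usize");
    let a = vec![0u8; len];
    assert_eq!(a.len(), len);
}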

Of course the problem with implementing Index<u32> etc for slices is that it would probably break a lot of existing code.


I think you are talking about another problem. It is the old problem of the C language, where integer types have no fixed size: int can be 16, 32, or 64 bits, the relative sizes of int and long vary, and so on. In the old days that really did make some code more portable. But now it makes code less portable, and every project uses uint32_t, size_t, int32_t, etc.

I think using unsigned types is not so bad. I've often wanted to use u32 or u8 for indexing, but haven't really felt the need to use isize.

Almost all of the problems that Stroustrup talks about aren't about using unsigned types per se, they are about implicit conversions between unsigned and signed types. Rust doesn't have those implicit conversions. So most of what he says doesn't apply to Rust.

If you use as usize you might suffer from some of those problems. Using as for such conversions is bad and should eventually be deprecated in favor of .into(), try_into(), .wrapping_into(), and such.

Yes, there is a danger of overflow when using index - 1 or length - 1. That's the only real argument I see in favor of using signed types.

Most examples of dangerous code from Stroustrup's paper will just not compile in Rust because it mixes signed and unsigned numbers in arithmetic or comparisons.


Most people are in agreement that in a perfect world you'd be able to index with any primitive integer type. The limitation is that adding the trait implementations would be unacceptably inference-breaking without some future, as-yet-undesigned features to refine type inference of indexing.

In the meantime, you'll see some people using a macro along the lines of

macro_rules! ix {
    ($i:expr) => {
        $i.try_into().unwrap_or_else(|_| panic!("index out of range"))
    };
}
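
Hypothetical usage, assuming the macro above is in scope (try_into needs std::convert::TryInto on pre-2021 editions):

fn main() {
    let v = vec![1, 2, 3];
    let i: i64 = 2;
    let idx: usize = ix!(i); // panics if i is negative or doesn't fit
    assert_eq!(v[idx], 3);
}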

The main pitfalls of unsigned indexing in C++ result from the usual integer coercions. Rust doesn't implicitly coerce integer types, so most of them are a lot less potent in Rust.

I'd typically write this as either

if !v.is_empty() {
    for e in &v[..v.len() - 1] {
        // ...
    }
}
// or perhaps if I'm feeling clever
for e in v.get(..v.len().wrapping_sub(1)).unwrap_or(&[]) {
    // ...
}

This (the one-branch (index as usize) < (length as usize) trick) isn't actually sufficient, by the way: you can have length > isize::MAX when the array element type is zero-sized. It's only sufficient when the element type has positive size, and then only because of the restriction that no single object can be bigger than isize::MAX bytes.


Why can't we write code without conversions? Conversions make code harder to write and harder to read; code with a lot of conversions looks ugly. Just make isize the default integer type.

Why do you do that? Why can't you use the same type for all indexes? Fewer casts, fewer problems.

This problem is not from the real world. It is a fantastic situation: it can appear only on 32 bits, and if you need more than 2^31 elements, you don't need usize -- in that case you need 64 bits.

OK, this unsigned code looks so clear, so easy to read, so easy to understand! Really, why do you try to defend a feature which makes code ugly in all real cases?!

I think it's more of a mistake for C++ because of integer promotion, unsigned wrapping vs. signed overflow UB, and the lack of bounds checking in general.

One unsigned drawback that Rust does share is that we can't actually use the full usize range for object sizes. That's relatively recent, though: we only recently codified that a Layout size is limited to isize::MAX.


You can do this:

if let [elems @ .., _] = v {
    for e in elems {
        // ...
    }
}

It makes sense to have an if statement because it forces you to think about what to do if there is no last element to skip -- it's not clear; it's a special case. If you're counting the elements, is it 0 elements or -1 elements? So the type system forces you to consider the special case.
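
A self-contained version of the snippet above, assuming v is a Vec (the pattern matches on the slice view):

fn main() {
    let v = vec![1, 2, 3];
    // `elems` binds everything except the last element; an empty vector
    // simply doesn't match, so the loop body is skipped.
    if let [elems @ .., _last] = v.as_slice() {
        for e in elems {
            println!("{e}"); // prints 1 and 2
        }
    }
}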

For indices you can do this if you're sure that you want to saturate at 0 elements:

for i in 0..v.len().saturating_sub(1) {
}

I agree signed indices would be cleaner for this. But you can get the same speed with usize:

if let Some(element) = arr.get(x.wrapping_sub(y)) { ... }

This doesn't work if the indices are bigger than isize::MAX -- but neither does using isize.
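
A quick check that the wrapping trick behaves as described:

fn main() {
    let arr = [10, 20, 30];
    // 1 - 2 wraps to usize::MAX, which get() rejects, so underflow and
    // out-of-bounds collapse into the single bounds check.
    assert!(arr.get(1usize.wrapping_sub(2)).is_none());
    assert_eq!(arr.get(2usize.wrapping_sub(1)), Some(&20));
}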

Because with u8 or u32 I know how many bits I have, while with usize I don't. It's sometimes easier to deal with overflow when I know how many bits I have. It also makes the code more portable and testable. And it sometimes saves memory: if my arrays are 100 elements long, u8 suffices, while usize wastes 7 bytes per index.
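
A tiny sketch of the memory point, assuming every stored index fits in u8:

fn main() {
    let data = [7u32; 100];
    // Each stored index takes 1 byte instead of 8 on a 64-bit target.
    let picks: Vec<u8> = vec![3, 14, 15];
    for &i in &picks {
        println!("{}", data[usize::from(i)]); // widen losslessly at the use site
    }
}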


I think there's a reasonable argument that unsigned indices are better for catching some mistakes, because going below 0 will wrap around to some big number and almost inevitably cause a crash. When I translated one codebase from C to Rust, I found a ton of "index by -1" errors that were silently ignored because it used i32 for indexing. Combine that with the common practice of using -1 as a sentinel value and an allocator using that space for metadata, and you have some nasty memory corruption to debug, potentially.


Wouldn't they be caught even with signed indices, thanks to the bounds checking in Rust?

There's also an argument that the opposite is the case: the proper way to catch these bugs is "bounds-check-then-convert-to-offset". The current practice requires all callers to do "convert-then-bounds-check", which loses some of the intent of the original data during the lossy conversion to a different integer range (whether by use of as or by other means).

If the index type is larger than the platform's pointer size (possibly the most common case: an i64 file offset on a 32-bit system), this can even be subtly devious if a negative index wraps around to a valid but incorrect positive index during conversion. Or, when a signed integer of smaller size is used as an optimization to calculate some local index and the programmer mistakenly used zero-extension instead of sign-extension for the conversion, which can also produce a valid but incorrect index.
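
A small illustration of the truncation hazard, using u32 to stand in for a 32-bit usize:

use std::convert::TryFrom;

fn main() {
    // A 64-bit file offset that doesn't fit in 32 bits:
    let offset: i64 = 0x1_0000_0005;
    // `as` silently truncates to a valid but incorrect index...
    assert_eq!(offset as u32, 5);
    // ...while a checked conversion keeps the intent and fails instead.
    assert!(u32::try_from(offset).is_err());
}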

Keeping programmer intent, and given the above use case for the check, suppose that negative indices were handled just like positive out-of-bounds indices, since there is no difference in intent between the two. That is, return None from a use of get and panic on subscript. I would expect:

let arr = [(); 1];
assert!(arr.get(1).is_none());
assert!(arr.get(0..2).is_none());
assert!(arr.get(-1i64).is_none());
assert!(arr.get(-1i64..).is_none());
assert!(arr.get(i64::MIN).is_none());
assert!(arr.get(i8::MIN..).is_none());
arr[-1]; // Panics

That is an argument about index checking, not about unsigned indices.
And if I want to check an index using the .get method, I want that method to recognise the underflow; I don't want to use special tricks at the subtraction stage.

Yes, both of these code examples work. OK, in each case there is a way to write it with the same performance using unsigned indices; yes, you can dodge the underflow errors in each case. But with signed indices you don't need to dodge anything -- you can write simple code using simple arithmetic operators! If you need a special method for each case, and you have to think hard about each special case, that is a problem. It is a good indicator of a mistake in the indexing design.

These types are useful only for communicating over FFI, low-level optimisations, working with bit representations, and embedded code. But I think such code is not the bigger part of the Rust codebase.

Yes, that's the point I want to make. I provided some examples where signed indices make the code clearer or faster, and I don't know of examples where usize indices make the code clearer or faster.

Rust doesn't make this mistake because it didn't repeat the core mistake from C++ in the first place.

If you look at the paper, the primary reason is

Signed values are the result of most integer calculations

That's just not true in Rust.

C++ has the "usual arithmetic conversions" which mean that if you add unsigned short to unsigned short you get not another unsigned short, but an int! (Well, probably. Depending on your platform's integer sizes things could be even weirder.)

In Rust if you add a u16 to a u16 you get a u16, not an i32.
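
Concretely (a small demonstration; the checked/wrapping methods just make the overflow visible):

fn main() {
    let a: u16 = 40_000;
    let b: u16 = 30_000;
    // The sum has type u16, not i32: there is no implicit promotion.
    assert_eq!(a.wrapping_add(b), 4_464); // 70_000 mod 2^16
    assert_eq!(a.checked_add(b), None);   // the overflow is explicit, not hidden
}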

So unsigned types aren't second-class in Rust the way they are in C++, and thus it's good and correct for indexing and .len() and such to be unsigned in Rust.


And that is a problem. If you subtract a u8 from a u8, the result lies in the range [-255 .. 255], but under Rust's rules the result is a u8 too.
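
For example, here is what the unsigned design forces us to write (a small sketch):

fn main() {
    let a: u8 = 42;
    let b: u8 = 137;
    // a - b panics in debug builds and wraps in release builds, so we
    // have to pick an explicit workaround every time:
    assert_eq!(a.checked_sub(b), None);           // detect the underflow
    assert_eq!(a.wrapping_sub(b), 161);           // 42 - 137 mod 256
    assert_eq!(i16::from(a) - i16::from(b), -95); // or widen to a signed type
}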

When we talk about overflow at the top of the range, it is a very rare case, so it is not a problem. But unsigned underflow is the result of any subtraction of a bigger number from a smaller one. And if we need to do calculations with indices and subtractions, we have to make the code dirtier with a lot of ugly casts. Unsigned indices are a big problem in these cases. And I don't know of cases where signed indices are a problem. No, really, do you need an array bigger than 2^31 elements in 32-bit code?

That's true. 42 - 1337 is a negative number anyway. Sometimes the result cannot be represented in the result's type, and we get a panic in debug and wrapping in release. That is not the expected result in most cases.

Sorry, I did not understand this logical implication. Why can I not say: "Signed types aren't second-class in Rust, and thus it's good and correct for indexing and .len() and such to be signed in Rust"?

I like unsigned numbers because they have a simpler, more natural range. n bits: 0..2^n. Simple.

Imagine I write a chess program. I know coordinates on the chessboard go from 0 to 7. So I use the u8 type. Why should I use a type that depends on the size of RAM in a platform-dependent way for this? Makes no sense.

Or I calculate chess endgame tablebases. I know there are 125,246,598 positions with 4 pieces. So I can encode a 4-piece endgame position in a u32. Using usize for this is not only wasteful on 64-bit platforms, it's also wrong on 16-bit platforms.

Even the standard library uses types other than usize sometimes, e.g. usize::from_str_radix takes the radix as a u32 (seems arbitrary that u32 was chosen and not the more clearly motivated u8, but oh well).

Clearly, many people in C++ think that they should be unsigned -- and things like vector's size_type in C++ are unsigned -- or else he wouldn't have needed to write the paper in the first place.

So people would prefer them to be unsigned in C++ too; it's just that there are some unfortunate interactions with old C mistakes that keep that from working as well as it should, hence the paper suggesting signed instead.

But languages that don't repeat those mistakes from C can freely use unsigned types, which better represent the actual domain, the way many people would like to in C++ too.


That's because radix is not intended to be pointer-sized. I agree the decision to use u32 here is surprising.


This discussion vaguely reminds me of the new-ish Strict Provenance work in std::ptr: Tracking Issue for strict_provenance · Issue #95228 · rust-lang/rust (github.com). And the exposition for it: The Tower of Weakenings: Memory Models For Everyone - Faultlore.

Also note that many of the pointer methods like wrapping_offset() take isize, which more or less implies that impl<T> SliceIndex<[T]> for isize (and friends) should also exist. But I'm also missing a huge chunk of context between the implementations of wrapping_offset and SliceIndex<T>, so don't take my word for it.

If I'm reading this right, these two things (ptr::wrapping_offset() and SliceIndex<T>) exist at different levels of abstraction, so there is no explicit connection between them. It also appears that if there is an implicit connection at all, the isize/usize mismatch was just an oversight?