Subscripts and sizes should be signed

fn main() {
    for i in (0..10).rev().skip(1).rev() {
        println!("{}", i)
    }
}
2 Likes

offset takes isize, which is why wrapping_offset takes isize. And it took signed because the underlying LLVM intrinsic takes something signed.

However, offset taking something signed is actually a pain. And now that add -- which takes usize instead of isize exists -- it's much nicer to use instead, allowing replacing lots of .offset(i as isize) with just .add(i).

Similarly, as I was working on https://github.com/rust-lang/rust/pull/95837, I found out that nothing actually wanted .offset_from(p) which returned signed. Every call turned into .sub_ptr(p), since they were all doing as usize on the return value anyway.

4 Likes

Very good example. You have a chess knight, and you want to describe all of its possible moves. They have offsets (-1, -2), (-2, -1), (-1, 2), (-2, 1), (1, -2), (2, -1), (1, 2), (2, 1). As you can see, the offsets are signed. Also, you cannot move off the board. If you use isize for the position and the offset, you can just write
if board.is_valid(horse.x + offset.x, horse.y + offset.y)
But you selected u8, so how would you write this code? The same, but with a hundred casting operators? Checking each subtraction result for underflow? How does that make the code clearer? How does it make it safer? It looks like you just created new problems which do not exist with isize positions. Using "bounding types" is good when you only need a discrete set of values, like an enum. But if you need arithmetic, bounded types are useless and create a lot of problems.

Do you really work in embedded? Sorry, but it looks like you are just applying premature optimization.

Is that a real case? I don't know of any 16-bit platforms in Rust.

Using i32 as a size is good, but there are rare cases where we really need more than 2**31 elements (on 64-bit platforms, of course); in Java this is sometimes a problem. Using i64 is slow in 32-bit mode. So we need to use isize.

What are the advantages of "using the actual domain"? I read these words very often, as a "compile-time invariant", and the people who say them assume it is something good. But I don't understand why. I provided two examples where a signed type is safer and faster and makes the code clearer, but I have not seen examples of the opposite.

Very good example! It shows that using usize as the index type is a terrible design mistake! Let's look at this diff:

- let dst = dst_buf.offset(i as isize);
+ let dst = dst_buf.add(i);

If the index type were a signed isize, this problem would not have appeared. That is proof that you merely "solved" a problem created by unsigned usize indices in the first place.

I don't think we need (or should) discuss pointer::* methods at the same time as array subscript operators. The former are very low-level methods that really care about provenance and their safety requirement includes the value range being in-bounds. My own argument in the paragraph above is not applicable to pointer because it argues the behavior for out-of-bounds index values—which would already be undefined behavior in the pointer operation and thus the conclusion a non-sequitur. Wrapping doesn't happen in the unsafe methods by assumption hence there's no reason to worry about it happening.

And then it seems easier to document the pointer primitive for the case of choosing a single, complete index type: isize. Clarity of documentation is of high importance for unsafe operations, any genericity would only hurt that clarity. And usize is not a complete index type for pointers since negative indices are actually necessary for some raw pointer operations. This explains the historical order of the earlier primitive offset(isize) then the helper add(usize).

Negative pointer offsets are valid and have a use in production code. A common pattern: when passing a structure around by pointer, we use a pointer to its most commonly used field instead of its actual beginning.

One example of such data structure is BSTR from Microsoft's COM standard: a BSTR is a UTF-16 string, represented as a pointer to the first code point of the string, and also storing the byte length of the string immediately before the data. Thus, to read a code point at a specific index, you simply dereference the BSTR as a conventional array; but, to obtain the length, you need to perform a read at a negative offset:

BSTR bstr = ...;
uint16_t word_at_index_n = bstr[n];
int32_t bstr_size = ((int32_t*)bstr)[-1];

Another example is Glasgow Haskell's info tables. As Haskell is a lazy language, almost any value can either hold the actual value contents (the numerical value of an Int, the characters of a String, etc), or be an unevaluated thunk, which needs to be normalized before its contents can be inspected. The types which allow such laziness are called "lifted" and, to support this functionality, the lifted types have a specific memory structure. At the start of a lifted type, there is always a pointer to a so-called "info table", analogous to a virtual table in C++-like languages, that contains some necessary information about the object - whether it's an already evaluated value or a thunk, how the GC should traverse the object, some debug information and, in case of a thunk, the machine code to evaluate this object.

Since evaluating suspended thunks is the most common operation, GHC decided to optimize for this specific scenario, and laid out the info table like this - the pointer from the object points directly to the machine code of the thunk (so that "evaluating" a thunk becomes a simple indirect function call), while the rest of the information is stored immediately before the entry point of the machine code - thus, at negative indices.

LLVM needs to support all possible scenarios of pointer usage, and some of those scenarios include legal use of negative offsets - therefore, the offset intrinsic of LLVM is required to support them.

The way I understand it, this sentence refers not to the particularities of C++, but to the general pattern in programming as a whole - when choosing a representation for some numbers, programmers tend to use signed types by default, and only switch to unsigneds when there is sufficient motivation (such as the number being a bit field or an element of a finite group with modulo arithmetic).

So, with this interpretation, the argument looks like this:

  • When the index comes from the array itself (e.g. we simply iterate over the array's elements), there is no difference between signed or unsigned - we just move an object from one place to another without doing anything. For all we know, it could be a completely opaque object with no support for any arithmetic whatsoever.
  • When the index comes from a non-trivial arithmetic expression - each type conversion incurs additional cost, be it in the form of additional runtime checks, reduced code clarity or a risk of semantic mistakes. So, for this scenario, there is an objective preference: we want to use the same type for indexes as is used for the majority of arithmetic expressions that generate them.
  • In programming practice, the majority of arithmetic expressions operate with signed integers.

Unless the last point fails to hold in Rust (why would it? is there a reason for that? is there a statistical study?), this argument does indeed apply there too.

What are the use cases for this feature? Does its positive impact outweigh the increased complexity that it incurs, or would incur?

1 Like

There is a mistake in Rust's design that it defined usize to be intptr_t (same size as a pointer), but uses it as a substitute for size_t (max size of a collection).

This happens to work on most platforms where intptr_t == size_t, but is technically incorrect. Pointers can be larger than collection sizes, e.g. 32-bit "far" pointers on machines with 16-bit segmented memory, or pointers with extra bits for capabilities/permissions.

So Rust either has a de-facto size_t and a specification bug, or has intptr_t and lacks a proper type for indexing arrays and collections.

And on top of that byte offsets in Rust are not allowed to exceed isize::MAX, but there's no unsigned type for this, so it has to be implemented as a manual check on usize.


As for signed indexing: there's zero chance of this changing in Rust. It can't be done due to language backwards compatibility constraints, policy of not making high-churn changes, consistency with other systems programming languages, etc. There's no point even discussing it further.

7 Likes

That's an argument that i8 may be better than u8 (if you implement knight moves that way, rather than by using bitboards), but the context of the chess example was not signed/unsigned, I was answering the idea that one should only use pointer-sized integral types.

Regardless of whether there are or aren't such platforms (there is at least one), what is the benefit of using usize and depending on the platform not being 16-bit instead of using the appropriate type that is guaranteed to be exactly what I need, i.e. u32?

Now I look at 5-piece endgames, of which there are 25,912,594,054. Should I use usize to represent them, or should I use u64?

1 Like

It is an argument for using a type which can hold all possible values and all possible intermediate results of the calculation. The initial idea was "let's use u8 because the coordinates of chess pieces cannot be negative", but once we need to apply negative offsets and check for negative results, this idea obviously creates a lot of problems and does not solve any of them.
The same applies to the index type: it is better to use the type that is used in the calculations. Calculations with negative numbers and negative offsets are common, and there are no calculations on indices which really require only unsigned integers.

There is no best fixed-size type for array sizes. i16 is too small on 32- and 64-bit platforms, but i64 is too slow on 16- and 32-bit platforms. So we have only two options:

  • Use platform-specific type, like isize.
  • Allow users to specify type for size of any vector, array, slice.

Only 64-bit types can hold this value, so you need i64 if you want to use it as a numeric value, or u64 if you want to use bitwise operations on it.

No, it is a human error caused by not checking your code. Signed indexes would not prevent the error; they just happen to do what you expect. In order to verify that your code is doing the right thing you should still check the edge case where v.len() is 0, and by using signed indexes you're hiding it from the reader.

And if you do some calculation that happens to make x = 0 and y = -1, then x - y comes out greater than 0 and no error is thrown. Except maybe you didn't intend that to happen, and now you have a subtle bug which you could have caught in debug mode if x and y were unsigned.

Because it allows you to immediately restrict the domain you're working with. If Vec::len returned an isize I would be confused as to when it could be negative. It being a usize makes it immediately clear that it can't be negative.


Overall I feel like the problem is that checked_*, wrapping_* and saturating_* are too long and hard to write/read. As others showed, many of these examples can be written more clearly with unsigned indexes using those methods; it's just that the code feels a bit too long and messy with them.

5 Likes

That is a very strange point. I think if code is valid, it's valid. There are no hidden errors, because those errors are created by the unsigned size alone.
If code is valid with signed types and invalid with unsigned types, then the unsigned types are the error-prone part of this code.

Maybe I really want to have negative numbers as an intermediate result of a calculation? If I end up with a negative final result, the .get method will return None and I will see the problem. And with the current semantics, the only way to do calculations with negative intermediate results is to write a lot of ugly casts and ugly operators.

Why is that good? I have many examples where it creates a problem and leaves us with dirty code full of casts, wrapping operators, etc. Maybe we should not use "beautiful" words if they are not supported by a practical advantage?

No, it is just because a signed type is more useful in calculations. You can read the documentation of the method and see that the length cannot be negative, and you only need to do that once. But you need to write ugly casts many times while using usize.
If I see the usize return type of the .len() method, can I expect that it can return values bigger than 2**63 on a 64-bit platform? By your logic, I can.

Yes, it is. And if you forget to write these methods, the compiler will not protect you. So you will get runtime errors and rewrite the code in an ugly way; unsigned indices thus make code unsafe, slow, and dirty. And that really matters.

To me, valid code is not enough; you should be able to easily see why it's correct.

That's one case; not wanting negative numbers in intermediate results is another. You should strive to find the best solution for both of them, not only your preferred one.

Maybe I didn't explain myself well enough, but it allows the reader of the code to immediately see what they're working with. The expectation (at least mine) is that if a signed type is used then negative values are possible; otherwise an unsigned one would have been used instead. Sure, it doesn't guarantee that preconditions are always met, but it sets a clear boundary where they should be true.

If everything could be solved by reading documentation we wouldn't need types though.

And in fact it can for ZSTs, although you could argue it's not essential.

Signed indexes are not that much better in my opinion. You still need to think about the exact same situations; you can just solve them with less code.

1 Like

Here is why this code is error-prone even with signed indices. This has a bug (playground):

fn print_all_but_one(elems: &[i32]) {
    let n: isize = elems.len() as isize - 1;
    println!("The first {n} elements are:");
    for i in 0..n {
        println!("{}", elems[i as usize])
    }
}

What you actually want to do (playground):

fn print_all_but_one(elems: &[i32]) {
    // If there are no elements, we don't skip anything.
    let n: usize = elems.len().saturating_sub(1);
    println!("The first {n} elements are:");
    for i in 0..n {
        println!("{}", elems[i])
    }
}
3 Likes

This is absolutely language-dependent. There are lots of people who've only ever used languages that only have signed types (like Java).

In Rust it's generally my experience that the vast majority of things are best as either unsigned or floating-point. Negative integers just aren't that important.

This came up a bunch in the add/sub method discussions. It's obviously useful to be able to go backwards. But it's extremely rare that it's reasonable for a single index in Rust code to plausibly be both negative and positive. So yes, you need .cast::<i32>().sub(1) to get to the length in the BSTR. But you still don't need -1 in the Rust code.

8 Likes

Yes. C code mostly uses signed int because that's the type of literals like 5, and because integral promotion rules make your life difficult if you try to use other types. Some C++ code uses unsigned size_t because that's what vector uses (although that's actually implementation-defined AFAIK; you're supposed to use vector<T>::size_type).

A lot of Rust code uses usize for integers to avoid casts when indexing slices. I think if indexing allowed all types, people would start using other types a lot more. Not sure if they would prefer signed or unsigned types. I personally like unsigned types.

There is also the fact that in Rust there is a default type for literals, i32. I wish an unsigned type was the default, or at least the second choice for literals, because sometimes you have a trait that is implemented only for various unsigned types and then you can't use a literal without a suffix.

3 Likes

Related to this is the concept that indexes shouldn't be "raw" integers at all - if Rust had value restricted types, then the allowed type for indexing would not be usize, but would instead be a value-restricted type whose permitted range was exactly the same as the indexing range. This begins to take you in the direction of dependent types, though.

The extreme is not having bitwidth specifiers for integers at all; you'd specify an integer as having a range (e.g. Int<0..=255>), and the compiler would determine the signedness, bit width, etc. from that range (so Int<0..=255> would be a u8 in today's Rust; Int<0..=254> would also be represented as a u8, but 255 would not be a possible value; Int<-8..=16> would be signed, and at least 5 bits large).

I think it is worth pointing out that there are actual crates where indexing with negative numbers is allowed. See for example the ringbuffer crate. I could see sense in std::slice::Iter implementing indexing with negative indices meaning access to a previous element. For slices (or Vec) themselves I do not have much of a preference. Perhaps I am slightly inclined toward isize being better, but not important enough to be worth the churn of changing all existing code. And if someone really wants to index a Vec with isize they can always implement Index<isize> on a wrapper (although std::slice::SliceIndex would not help).

It works on any input data on any platform, and is not held together only by compiler hacks, so what's the problem?

Could you provide small example, please?

Reading documentation once is not a problem, because you only do it once. Inventing workarounds to dodge unsigned problems is something you keep doing throughout your entire Rust programming practice.

Is that a real case?

I provided two examples where unsigned indexes create an objective problem. These problems are bad not only in my opinion, because you can easily compare the number of branches or the amount of boilerplate code.

Yes, that's true: when you only need to enumerate values, there is no difference between signed and unsigned. In that case you write the same code either way.

Numbers bigger than 2**63 - 1 are much less important than negative numbers.

We need to cast every time we need negative numbers. But if indices in Rust had initially been isize, when would you ever need to cast to u32 or usize? Maybe the code would be clearer in most cases?

I can wrap Vec, but how can I wrap the language's built-in slices/arrays?

It makes sense for ringbuffer, but for the default vector, Python-like semantics are not good, I think, for many reasons which I think you know. In Vec a negative index should be invalid, of course.

That is fine for the [ ] operator, but for the .get method you need a wider type. And it is better to make it wider on both sides: why is the "overflow" side any better than the "underflow" side?

If we had a single generic integer type Int<std::ops::Range>, then […] indexing would take an integer with the exact range required; get(…) would take an Int<T> and do the conversion to the required Range, returning None if that conversion fails.

Beyond that, though, I have no strong opinions - I rarely, if ever, use indexing with Vec or slices (although I do with BTreeMap and HashMap, where the index type is set by a parameter on the collection), because I find iterators are almost always clearer about my actual intent than indexing.