Subscripts and sizes should be signed

Obviously subscripts are special in that the representing datatype is generally larger then the set of valid subsets which will allways be the range 0..=N where N is some nonnegative integer smaller them or equal to isize::MAX.

As for choosing the underlying datatype, I think usize has some clear benefits over a sized type:

a) The set of invalid indices consists of only one kind - too large indices. Checking whether an index is valid thus requiers only a single comparision operation to the index set lengh.

b) Given that the case N=isize::MAX is extreamly rare, there will allways be an range of too large but representable indices, no matter wheter you use usize or isize. The fact that usize can represent more numbers towards positive infinity them isize, does not really matter. It will only increase an allready existing set of too large indices.

c) The production of an negative, and thus invalid, index will be spotted egerly, when it is produced, not when it is used in an attempted array access. This is good, because, this is where you would have to fix your code.

d) For some index i the range expressions 0..=i and 0..i are allways valid, this wouldn't be the case for signed indices.

e) If a special, invalid indices are required, you can also use usize::MAX just as easy as -1. If you explicitly want these special indices to be obtainable during calculation, it is good to explicitly mark this in your code and using wrapping sub instead of normal sub is a good way to do so. Similarily casts between usize and isize can be justified, as they just remap invalid indices.

2 Likes

If you read my starting post - checking index requires single comparison with signed index too. Moreover, if you need to check result of subtraction - you need one comparison with signed indexes and two comparisons with unsigned (because you need extra checking to prevent subtracting bigger number from smaller).

Why it is applied only for index underflow (index<0), not for index overflow (index>=len)? Situation looks non-symmetric. If we has no way to check overflow while creating index, may be we should check overflow and underflow at one place? And it will be faster.

When we get subslice, we need runtime checks anyway, so this constraint does not give us the way to optimize code.

Wrapping sub is a default sub in C++, you can read Stroustrup's article about it. It is ok if I really need modulo arithmetic by semantic, but it is not ok if I mean signed semantic.

I think there is a conceptual simplicity of using a single comparison, regardless of performance. Natural numbers are simpler to think about than integers:

// Verify that the index is within the collection.
assert!(index < collection_size);

is simpler than either:

assert!(index >= 0 && index < collection_size);

or:

assert!((0..collection_size).contains(index));

Yes, by reinterpreting the index as unsigned using the wapping convention and then checking the unsigned index. If you need that reinterpretation, do it explicitly and concisefull map invalid negative integers to invalid positive integers.

Why it is applied only for index underflow (index<0), not for index overflow (index>=len)? Situation looks non-symmetric. If we has no way to check overflow while creating index, may be we should check overflow and underflow at one place? And it will be faster.

We are talking about different kind of checks. The one I am talking about is the debug mode check for integer overflow. In release mode, the value will wrap around to invalid high indices for unsigned. Notice that using signed integer, Rust also performs an debug check for integer underflow for signed integers. The difference is that this one is not helpfull in fixing the invalid index generation however.

When we get subslice, we need runtime checks anyway, so this constraint does not give us the way to optimize code.

For unsigned, we can ommit this runtime check when the lower bound is zero.

Wrapping sub is a default sub in C++, you can read Stroustrup's article about it. It is ok if I really need modulo arithmetic by semantic, but it is not ok if I mean signed semantic.

You do not need signed arithmatic. Indexes outside the valid range are invalid and only make sense using additional means of interpretation.

1 Like

Stroustrup makes a strange claim here:

Modular arithmetic makes “overflow” well-defined behavior and thus inhibits error detection and handling.

He's saying undefined behavior makes debugging easier. This is contrary to usual wisdom.

3 Likes

This is a thing specific to C/C++.

In Rust wrapping operations are a special well defined operations, different to normal integer operations, that can be accessed using specialized methods. These then have a very solid mathmatical interpretation. In Rust overflow is effectivly also a sort of "undefined" operation, although rustc uses instant panic in debug and wrapping arithmatic in release mode, making the behavior not unessesarily unpredictable.

In C/C++ you do not have this luxury. You would have to define all arithmatic to be wrapping, something which is indeed not beneficial.

It is easier to think about .is_valid_index as one blackbox. If we are not writing library code, we dont need to think about inner realisations, we can think only aboud set of valid indides.

It is dirty hack. It is ok for library code, but not ok for user code. And I dont need modulo semantic. I need signed semantic, I need to acquire negative intermediate result of calculation and send it to .get method.

There are many ways to generate invalid index, all of them will be checked in [ ] operator or .get method.
Unsigned numbers are not helpful when you have overflow, and they generate problems, when underflow can really be a valid intermediate result by a context.

Yes, but for checking argument using .get we need wider range, wider to both sides.

If we try to compile slice operator, and we inlined it, and try to use information than left bound is zero, we anyway can use only one comparison for right bound.

I think, he tells that modulo arithmetic sometimes prevent detecting errors.

If you think of indices as black boxes to be dealt with by somebody else, then it doesn't matter to you what type they are. You are discussing arithmetic on indices, so you can't now say let's treat them as black boxes.

He implied that if unsigned overflow was also undefined behavior, it would be easier to debug errors. That's just really really false. I would agree if he compared it to panicking, but he compared it to undefined behavior.

2 Likes

@T4r4sB It's really unclear what is your goal with this thread and what you are trying to argue. If you propose to change the API of Vec, that will never happen. It would be a violation of Rust's stability guarantees, and even if it could be implemented as an edition change (which is almost certainly impossible), it would cause an extreme amount of churn in the ecosystem for very dubious gains.

Moreover, people have given you multiple reasons why the current behaviour is better, more convenient and more correct. You have brushed them off without much consideration. It may be the case that you have a codebase where it's really impossible to properly use unsigned integers, but in that case I'd like to take a look at it, because in my experience I have never wanted Vec indices to be unsigned, and extremely rarely wanted them to be something other than usize (which can be easily solved by writing a newtype around Vec with different indexing).

So, what are you trying to achieve by continuing to argue? You don't seem inclined to change your mind, the other people certainly won't, and even if they do, the Vec cannot be changed.

8 Likes

Technically implementing Index<other_types> for slices wouldn't violate stability guarantees since additional trait impls are explicitly allowed.

It may break code that relies on type inference. Or even change the meaning of some such code due to the i32 default if Index<i32> is implemented.

Of course removing Index<usize> as OP proposes is a total no-go.

I do not find this approach easier. I find the "non-negative number" but only up to a certain limit interpretations the easiest to work with, as you only need to remember one single bound.

It is dirty hack. It is ok for library code, but not ok for user code. And I dont need modulo semantic. I need signed semantic, I need to acquire negative intermediate result of calculation and send it to .get method.

There are many ways to generate invalid index, all of them will be checked in [ ] operator or .get method.
Unsigned numbers are not helpful when you have overflow, and they generate problems, when underflow can really be a valid intermediate result by a context.

Underflow and overflow are per se undefined results. (Unless you use wrapping arithmatic and carfully fought about it). You can also break things using signed integers if you allow underflow and overflow to propergate. Notice that in both cases (signed and unsigned) you cannot rule out that excessive wrapping brings you back into the valid range and hence is not detected by the element access operation. It is a good thing that all elements where the bit representation of the index starts with a 1 are allways out of range.

Yes, but for checking argument using .get we need wider range, wider to both sides.

Why? We only need to distinglish between valid and invalid but representable values.

If we try to compile slice operator, and we inlined it, and try to use information than left bound is zero, we anyway can use only one comparison for right bound.

We cannot. If the right hand sight bound is an integer, we must allways check at runtime, whetjer it is negative.

It is ok when you use index just like an unique id of something. But each time you need to have arithmetic with subtraction, it is very important to have signed indices. And compiler will not protect you from potential errors, you need to have problems in runtime. Writing unsigned code is much harder, and has no benefits.

I provided real code examples. I dont believe in common worlds, like "compile-time invariant is cool, because using these clever words sounds beautiful". Self-documentation? How Qt and Java users live without it? Checking errors exactly at place where it was generated? Why do you think that you know, which kind of errors I need? And why it works only with underflow?

Looks like you used indices only as unique id, and never used it as coordinates where you need some arithmetic.

  1. Rust's community need to agree that unsigned indices and sizes is a design mistake (on one side are real examples, how unsigned indices make code slow, dirty, unsafe, on other side - just beautiful words).
  2. Provide a way to move to signed semantics.
    You can look to same process in C++ STL's design. First step - community postulated that using unsiged numbers is fail (excluding context where you need bit representation), second step - STL provided methods like ssize etc.

Remembering "index>=0" is not hard.

If you use wider type, they have defined result. Lets use wider type.

Yes, it is. But with unsigned semantic we need to have a first check before subtracting, and second check inside .get method. And the only way to dodge double checking is using hack with wrapping operators.

Easy

fn get_slice_from_zero(&self, upper_bound: isize) -> &[T] {
  assert!(upper_bound as usize < len as usize);
  self.get_slice_unchecked(0, upper_bound)
}

We have only one check, which is necessary with unsigned sizes too.

Technically I know, that it is impossible, but why cant I dream about it?

The community consensus is overwhelmingly "unsigned indices are the bee's knees".

You don't have any benchmarks to prove that it makes code "slow", and there is copious evidence otherwise. It doesn't make code "unsafe", the word "unsafe" in Rust has very specific meaning "violates memory safety", and unsigned indices can't violate it. You won't convince anyone of anything if you try to redefine terminology how you see fit. "Dirty" - well, it's your opinion. So I'd say one side has facts, data and experience, while you have only provocative words.

If you need index arithmetics, you just do it. Signed is not in any way easier to do arithmetics with than unsigned. If you do multiplication, you are just as likely to overflow. If you do addition or subtraction, you need to check bounds just as carefully. At least unsigned integers have methods like checked_sub or saturating_sub, which make it easy to define behaviour on zero overflow.

Because that's the stable API, and it won't change. Any code written today should compile forever. If you need different error handling and different indexing, write your own vector implementation. You can wrap a Vec to let it do the heavy lifting.

Then you don't understand anything about Rust. There is no point to discuss it further, because you are just talking past each other. You're welcome to write your code however you write, but don't expect people to side with you when you don't understand or appreciate basic design principles of the language.

3 Likes

We dont need benchmarks to compare code with 2 cmp instructions and 1 cmp instruction.

Ok, "unsigned indices sometimes provoke writing a code generating panics in runtime, when the same signed-indices code works correctly". It is better?

Ok, you right. It is only my subjective opinion, that
for e in v.get(..v.len().wrapping_sub(1)).unwrap_or(&[]) {
harder to write and harder to read than
for i in 0 .. v.len()-1.
May be for somebody else first example looks more intuitive.

In most cases isize overflow means you have incorrect input data. usize underflow in most cases means that you just forgot to write easy and intuitive words try_cast::<isize>.unwrap(). It is too easy to remember to write try_cast::<isize>.unwrap() each time when you need to subtract two numbers.

Yes, I know it, but community can find some way to fix the unsigned underflow problem.

How can I wrap slices and arrays?

What are Rust's design principes? I thought, it is writing code with less count of possible errors and less possible performance penalty, isn't it? If some conception has beautiful name but make code only more verbose and slower, and provoke runtime errors without compiler protections - this conception doesnt follow Rust's principes, and it is usable only for hipster's smalltalks in barbershops.

Indeed - and if you look at the optimized output from a compiler, unsigned index bounds checks have a single cmp instruction (if the bounds check hasn't been optimized out completely), while a signed bounds check starts with at least two cmp instructions before optimization. I have yet to see a case in which a signed index guarantees fewer runtime instructions than an unsigned index (which is one comparison to do a bounds check, at most) - your "trick" at the beginning of this thread is to treat the signed index as an unsigned index, to get the same bounds check as unsigned indexes use.

5 Likes

The "undefined behavior" means that the implementations are free (though not obliged) to define whatever meaning for the operation as they see fit. In the case of signed overflow, this allows the compiler to define it as "pause the program, open the debugger, issue a diagnostic message and ask the programmer what to do next". Or "cough in the stderr and halt the program". Or "issue a CPU trap and let OS deal with it".

Additionally, when a static analyzer determines that, at a particular point in the program, a signed overflow is highly likely or even inevitable - it can be 100% sure that it's an error and tell the programmer all about it.

Unsigned "overflow", on the other hand, is already defined by the Standard to wrap the result around, so any of the behaviors above would be actually a violation of the C++ Specification. At runtime, when unsigned overflow occurs - the machine has to wrap around and continue. Of course, it can still trigger a breakpoint - when already under a debugger, but in no way it is allowed to halt or raise an exception. Similarly, when an unsigned overflow is discovered during static analysis - the toolchain should always consider the possibility that this is intentional, and as a result, they end up much more conservative with warning messages. As an example, look at how GCC and Clang gave a warning only on signed overflow and remained silent on unsigneds: Compiler Explorer

3 Likes

As a possible example where signed indexes would make sense, consider an image DFT.

The result is also a 2D matrix, where each element is a complex number measuring the amplitude and phase of a specific 2D frequency in the original image, and the 2D index corresponds linearly to the frequency. That is, each matrix element at index (i,j) contains the complex amplitude of (S i + i0, S j + j0) and represents the wave Exp[I * ((S i + i0) x + (S j + j0) y)]. What are the offsets (i0, j0)? This depends on where in the frequency matrix we decide to put the DC axes - this is, technically, a matter of convention (there is a working DFT/IDFT pair for any such choice); but the most common convention places the DC in the middle. If you search "image dft" on google, practically all the pictures will look like this - the low-frequency components, containing most of the energy in the image, are placed in the middle; while the higher frequencies are places further to the sides. It makes perfect sense to define the DC component as (0, 0), and index the rest of the matrix relative to it - so that the index-frequency dependence becomes a simple proportionality (S i, S j). However, under this convention, since the left and the top halves of the DFT represent negative horizontal and vertical frequencies respectively - they get assigned negative indexes as well.

Also, in this scenario, the negative indexes also have physical meaning (as negative frequencies) and can be used as numerical quantities to calculate some function values, for example - to obtain the attenuation factor for a specific frequency, in order to perform a convolution.

So, the image DFT, in my opinion, is a pretty good example of an object that is simultaneously:

  • an array, as the components are placed contiguously in memory
  • and uses negative indexes, as they have direct physical meaning of a spatial frequency, measured in units of the inverse image size.

Sounds like this image DFT type should be indexed by (isize, isize) pairs and work for negative coordinates. The language already supports this: impl Index<(isize, isize)> for ImageDft.

2 Likes

It's a niche case, wouldn't you agree? Yout argument is reasonable, but the proper solution would be to write a separate type which would support sized indices. The topic is about general-purpose vectors and slices, and in that case there is no such symmetry argument.

1 Like