[Pre-RFC] Implicit number type widening

Obviously there are safety issues, like u8 / u16 overflow on a 32-bit or 64-bit CPU. I mean compile-time type issues: obviously it hits an assert at runtime, but until you run the code you have no idea that you mistakenly used a u8 or u16 instead of usize.

1 Like

I don't see how this works. If there's a conversion between two integer types of different width, one of the directions necessarily involves losing information. Furthermore, I was referring to the scenario where not even a conversion is necessary for a semantic error (so whether you are actually losing information with the conversion doesn't even matter, and consequently the upcast being lossless doesn't help): if you use u32 as an index or counter or length or whatever, and you hand the indexing/counting operation something that is bigger than 4G elements, you won't be able to access the end of the elements (in the case of indexing) or you'll overflow (in the case of counting).

But my point is exactly that it's not noise. You need the conversion because you are doing something fallible. Hiding it might be more convenient to write but it's certainly more error prone and harder to debug.

The burden is not in the types or the conversion itself. The burden is inherent as it is caused by the kind of impedance mismatch whereby you are trying to do something platform-specific in a more generalized way, or vice versa. Any "solution" that hides the need for extra attention and thought in these cases introduces sloppiness.

Yes, but this ignores situations where you are supposed to use a wider integer for either correctness or generality (with the latter, I again refer to the case where the use of u32 instead of usize prevents 64-bit platforms from handling data structures larger than 2^32 elements). In my definition, that is a bug (at best a design error, at worst an actual loss of data), which is currently prevented by a type error and would slip under the radar with the introduction of implicit widening.

First, we’re talking past each other. Your case is about using u64 for files, while I was speaking about using u32 or smaller for things that are known at design time to never ever be big.

So let’s look at them separately:

size > 32-bit

If I’m working with files, I’m supposed to use u64 instead of usize, and in general there’s little need for implicit widening.

However, if a file format uses 32-bit lengths or offsets (and plenty still do), I’d still need to convert them to u64 for seek, and usize for read. Suddenly there are legitimately 3 different types involved. All fine and perfectly correct with implicit widening, and noisy with casts.
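For illustration, here is a minimal sketch of how the three widths show up in such code today (the read_chunk helper and its 32-bit fields are made up here, not taken from any real format):

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Read a chunk described by 32-bit offset/length fields, as many on-disk
// formats still use. Three integer widths are involved: u32 from the format,
// u64 for seeking, usize for the in-memory buffer.
fn read_chunk(file: &mut File, offset: u32, len: u32) -> io::Result<Vec<u8>> {
    file.seek(SeekFrom::Start(u64::from(offset)))?; // seek takes u64
    let mut buf = vec![0u8; len as usize];          // allocation takes usize
    file.read_exact(&mut buf)?;
    Ok(buf)
}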

size <= 32-bit

That’s the case where I’d like implicit widening.

There are many values in programs that are not related to files and are never even close to being as large as 4 billion. It may be the number of currently open program windows, the number of wheels on a car, or the number of occupied nodes in a fixed-width B-tree.

As soon as you involve such a quantity in indexing or a length comparison, Rust insists on it being usize. There is nothing technically wrong with it being usize, apart from wasting bytes and using larger arithmetic instructions on values that never get that large.

So if I decide to store the value in memory in the smallest type it needs, Rust makes me litter the code with as usize. This doesn't make anything more correct and doesn't add useful information.
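As a toy illustration of that cast noise (the Car type is invented here, not from the original post):

struct Car {
    wheel_pressures: Vec<f32>, // one entry per wheel
    wheel_count: u8,           // never anywhere near 4 billion
}

impl Car {
    fn pressure_of_last_wheel(&self) -> f32 {
        // Both the length comparison and the indexing force `as usize`,
        // even though widening u8 -> usize is always lossless.
        // (Assumes the car has at least one wheel.)
        assert!(self.wheel_count as usize <= self.wheel_pressures.len());
        self.wheel_pressures[self.wheel_count as usize - 1]
    }
}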

7 Likes

But it is important only in cases where the file is too big for the architecture, like > 4 GB on 32-bit. So it is not a common case. Why can you not write code like this:

struct BigFile(Vec<u8>);

impl std::ops::Index<u64> for BigFile {
    type Output = u8;
    fn index(&self, index: u64) -> &u8 {
        // Do the u64 -> usize conversion (and bounds handling) here, in one place.
        unimplemented!();
    }
}

and do the proper thing in only one place?

But all of them are case-specific: in one case uN is the right index type, in another it is not, and using it is a mistake. If it is tedious to convert from uN to usize back and forth, why not implement a proper std::ops::Index? Why make this decision at the language level?

I don't see how any bytes are wasted: you can hold the index in any type and convert back and forth to usize only during indexing.

larger arithmetic instructions on values that are never as large

this is valid, but in some cases it would mean extra instructions for working with small types, since most CPU instructions operate only on machine words.

I consider this a non-issue, because arr[usize::MAX] or let idx = -1; arr[idx as usize] will also panic at runtime. One needs to get the value correct when indexing regardless of the type, so I don't see any harm in allowing additional types.

This is the same as how the following compiles, though it obviously panics at runtime:

let shift = std::u128::MAX;
let _ = 1_u8 << shift;

(I agree it would be bad if extending the domain forced it to go from total to non-total, but that absolutely doesn't happen for indexing. If anything, the opposite can happen: if I have a [T; 256] lookup table, I know that a u8 is in-bounds --- that's why the table is the size it is --- but if I have to index by usize I can get the value out of bounds.)
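A small sketch of that lookup-table point (the classify function is made up for illustration): today the widening must be written out, but it can never fail or go out of bounds.

// Indexing a 256-entry table by a u8 is total: every u8 value is in bounds.
fn classify(table: &[u8; 256], byte: u8) -> u8 {
    table[usize::from(byte)]
}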

5 Likes

These are absolutely different things. usize is the machine word; when you want to address something beyond the 64-bit address space on a 64-bit machine, you are doing something very, very specific. It doesn't happen in most programs.

When, because of some mistake, you use u8 for indexing, overflow happens with very high probability.

I’ve always viewed `as` casts as gross but a “cost of doing business”, similar to how items do not participate in type inference. A lot of this flavor of sugar design boils down to compression of pain: the usual case becomes less painful, but you wind up with finicky corner cases.

While I don’t want to make this thread about implicit conversions in the large, I’ll use an example from C++ that I find annoying. std::optional<T> is constructed by a non-explicit ctor, and implements operator bool. So, unit tests that test things returning optionals look like this:

EXPECT_EQ(MyFoo(), kExpected);

unless MyFoo() types at std::optional<bool>, in which case you must write

EXPECT_EQ(MyFoo(), std::make_optional(kExpected));

to avoid the implicit conversion of MyFoo() to bool via operator bool. The general case is painless, but now you’ve got a bit of a trap in the pathological case.

My worry is that implicit conversions (even innocent ones like zero/sign-extending integers) do not play ball with inference and can lead to unpleasant surprises.

For example: if K: Index<u64> and K: Index<usize>, and further k: K, i: u32, what expectation should I have for k[i]?

  • Pick one by some kind of integer hierarchy (C++ does this in some cases via its complicated promotion rules).
  • Pick neither and emit a compiler error (C++ does this in some cases, such as char into either short or long).

The former is something I really don’t want because Rust is already complicated and I think C++'s promotion rules are generally agreed to be too complicated. Also, a library addition could cause promotion to choose a different overload… though this is a problem with any overload mechanism, and not necessarily a con.

The latter is far easier to reason about, but is a form of negative reasoning that makes it possible for pure additions to break existing code.
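To make the concern concrete, here is a sketch of that situation with today's explicit conversions (the type K and its impls are invented purely for illustration):

use std::ops::Index;

struct K(Vec<u8>);

impl Index<usize> for K {
    type Output = u8;
    fn index(&self, i: usize) -> &u8 { &self.0[i] }
}

impl Index<u64> for K {
    type Output = u8;
    fn index(&self, i: u64) -> &u8 { &self.0[i as usize] } // sketch: truncating cast on 32-bit
}

fn demo(k: &K, i: u32) -> (u8, u8) {
    // Today the caller must pick a target width explicitly; both impls are reachable.
    let a = k[i as usize];
    let b = k[u64::from(i)];
    // Under implicit widening, a bare `k[i]` could widen to either impl,
    // so the compiler would need either a promotion hierarchy or an ambiguity error.
    (a, b)
}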

Mind, this doesn’t mean we need to keep as. I think that providing Index<uN> implementations for the standard containers is not insane, nor is providing widening Into conversions and fancy saturating and wrapping conversions (though Into<usize> is definitely not a good idea).

8 Likes

Agreed. "Index into that array with this u16/u32/..." is a perfectly well-defined operation. It is partial, just like normal indexing. It could be implemented as

fn get<I>(&self, i: I) -> Option<&T>
where
    usize: TryFrom<I>,
{
    // If the conversion to usize fails (the value doesn't fit), return None,
    // just as an out-of-bounds index does.
    self.get_with_usize(usize::try_from(i).ok()?)
}

If either the cast or the indexing fails, we get None, otherwise we get Some. There is nothing lossy about this.

Is there any situation in which this method would not do what is intended, would lead to subtle bugs or would generally not be desirable? Off the top of my head, I don't see any.
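For what it's worth, a standalone version of the same idea can be written and tried today as a free function over slices (get_any is just a made-up name):

use std::convert::TryFrom;

fn get_any<T, I>(slice: &[T], i: I) -> Option<&T>
where
    usize: TryFrom<I>,
{
    slice.get(usize::try_from(i).ok()?)
}

fn main() {
    let v = [10, 20, 30];
    assert_eq!(get_any(&v, 1_u8), Some(&20));
    assert_eq!(get_any(&v, 1_u64 << 32), None); // conversion or bounds check fails
    assert_eq!(get_any(&v, 5_u128), None);
}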

9 Likes

The only time this would be problematic is when doing math on small types that would otherwise not have type-checked. A somewhat reasonable example I thought of: a matrix type that uses (u8, u8) for indexing into a manually flattened array. The correct way to write this would be (untested)

use std::ops::Index;

struct Matrix<T> {
    raw: Box<[T]>,
    width: u8,
    height: u8,
}

impl<T> Index<(u8, u8)> for Matrix<T> {
    type Output = T;

    fn index(&self, ix: (u8, u8)) -> &T {
        // Widen to usize *before* the arithmetic, so the flattened index
        // is computed in usize and cannot overflow in u8.
        let row: usize = ix.0.into();
        let col: usize = ix.1.into();
        &self.raw[col * usize::from(self.width) + row]
    }
}

and the version with unintended overflow is

impl<T> Index<(u8, u8)> for Matrix<T> {
    type Output = T;

    fn index(&self, (row, col): (u8, u8)) -> &T {
        // Under the proposal this would compile, but `col * self.width + row`
        // is evaluated in u8 and can overflow before ever being widened to usize.
        &self.raw[col * self.width + row]
    }
}

Interestingly, the version written with unsafe would be less likely to fall into this trap, as it would use unchecked indexing or pointer offsetting after checking bounds itself, both of which would continue to take only usize/isize. (A real version would probably store a *mut T raw box with width * height as the slice length and use pointer offsetting to index after checking its semantic bounds. (And the "cursed" space-saving optimization is to stuff the matrix dimensions in the unused bits of the pointer.))

1 Like

If you’re on a relatively normal platform where u32 fits into usize, then you shouldn’t have to deal with an Option if you write arr[some_u32].

For get, though, that seems fine.

I had imagined the same conversion behaviour as Ralf, and would expect any [idx] to work like get(idx).unwrap(); I'd also expect automatic conversion to avoid bugs. Right now an access with these types needs to do the conversion itself, and if it fails to consider the differing pointer and usize sizes, this can lead to subtle, unexpected non-failure. On all 32-bit targets, this does not panic, out of 'luck':

vec![0; 1][(1_u64 << 32) as usize]; // idx `0` on 32-bits

If one were to code on 64-bit where everything 'works fine'—and considering that coercion is an easier conversion, TryFrom is also quite new and more verbose—it might be tempting to simply insert an as usize instead of properly converting and it seems to work. But then on a 32-bit deploy suddenly everything goes wrong.

If that were to already be engrained in indexing, the behaviour would be consistent.

// Some(_) everywhere
vec![0; 1].get(0_u64);
// None everywhere
vec![0; 1].get(1_u64 << 32);
// Would panic everywhere
vec![0; 1][1_u64 << 32];

4 Likes

vec![0; 1] doesn't have an element with index 1, only with index 0.

Oops, that should have been .get(0_u64). Thanks for pointing it out.

Until one tries to port that code to a (partially defined but not yet available) RISC-V RV128I system, where usize is 128 bits. It is naïve to presume that Rust will be replaced by another language before a computer exists where usize > 64 bits.

I’m not sure what you are getting at. It reads like you don’t agree with what I said, but I also agree with your conclusion, so I just want to make sure I’ve been understood. My point is that treating u32 and u64 as indices for slices would make code more portable if done right. The whole example of as usize going wrong is written from the perspective of a naïve programmer, to show that current ergonomic choices may already make some code less portable. Of course this becomes even more problematic with future compatibility when thinking about usize = u128 as well.

2 Likes

I’m all for making sure that we don’t assume the size of usize. I don’t think adding some extra impls of Index and IndexMut will lead to that, though. We can handle Index with u8, u16, u32, u64, and u128.

Would it also make sense to allow signed indices for the same reasons?

Allowing signed indices seems much less reasonable to me; some earlier part of the program should have either declared something unsigned or done an appropriate check for negative numbers.

4 Likes

A common argument against indexing using other unsigned types was also that some earlier part of the program should have used usize instead – I’m just trying to make sure that we’re drawing the line at the right place here. I agree with you that indexing using unsigned integers is more reasonable than indexing using signed integers.

Indices smaller than usize that are stored in efficient data structures (e.g., in an ECS) naturally give rise to index expressions of uN types smaller than usize. Virtually every programmer encounters such situations frequently. For me the primary thrust of this thread was widening such expressions to usize at the point where they are applied as index expressions, but not earlier. That's the use of the unchecked as usize conversion that troubles so many of us.
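A sketch of that pattern (the Positions store is made up as an ECS-flavored example): compact u32 slots, widened only at the point where they are used as indices.

// Component storage keeps entity slots as u32 to stay compact;
// the widening to usize happens only where the slot is used as an index.
struct Positions {
    data: Vec<(f32, f32)>,
}

impl Positions {
    fn get(&self, slot: u32) -> Option<&(f32, f32)> {
        // Today: an explicit, unchecked-looking conversion at the indexing site.
        // The thread's proposal is to let this lossless widening be implicit.
        self.data.get(slot as usize)
    }
}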

3 Likes