Revisiting the unsigned


#1

Yesterday I came across a reddit comment complaining about the use of unsigned integer types in std. I found it interesting, especially this part:

just find Going Native 2013 conference video called “ask us anything” where all the C++ gurus discuss size_t and integer types and that using size_t in STL was a really big mistake.

in my opinion as I said, it’s a non-technical question at the end. “I’m writing an API, what type should I use?”, and the answer here should be as simple as possible.

I know this rant. It probably stems from the well-known fact that mixing signed and unsigned arithmetic is dangerous in C/C++. At the very least, careless programmers can make mistakes because of it. As many of you may already know, Google's coding style guide has a similar guideline.

Since the commenter mentioned that some other languages use a signed integer as the container length type, I decided to check which ones actually do. The result surprised me: among the languages I looked at, only D uses an unsigned length type. All the others, including Swift, Go, Haskell, Java, C#, and OCaml, use a signed integer. Some languages simply have no unsigned types (like Java), but others, such as Swift, explicitly advise against using unsigned.

I think there's a reason behind this. Since Rust has overflow/underflow checks, the risk of misusing unsigned types is relatively low. But we can still make mistakes, and the checks are currently disabled in release builds. That may be why Swift, which is also an overflow-checking language, still discourages the use of unsigned types.
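To make the underflow concern concrete, here is a minimal sketch. The function names `span_checked` and `span_wrapping` are made up for illustration; `checked_sub` and `wrapping_sub` are the standard integer methods.

```rust
/// Distance from `start` to `end`, or `None` if `end` comes before `start`.
/// `checked_sub` reports the underflow regardless of build profile.
fn span_checked(start: usize, end: usize) -> Option<usize> {
    end.checked_sub(start)
}

/// The same subtraction with explicit wrapping. This is the result a plain
/// `end - start` produces when the overflow checks are turned off.
fn span_wrapping(start: usize, end: usize) -> usize {
    end.wrapping_sub(start)
}
```

With `start = 5` and `end = 2`, `span_checked` returns `None`, while `span_wrapping` returns a value close to `usize::MAX`, which is exactly the kind of silently huge length the discussion is worried about.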

There's one more advantage. Unlike usize, isize can represent every i32 value, so a cast from i32 is trivial, although this does not hold on 16-bit systems. We currently don't have any integer type coercions, but if we decide to add a limited form of them in the future, sticking to isize may help.
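A small sketch of the difference, assuming a 32-bit or 64-bit target. The helper names are hypothetical; `as` casts and `TryFrom` are standard.

```rust
use std::convert::TryFrom;

/// Widening a negative `i32` to `isize` keeps its value.
fn widen_signed(x: i32) -> isize {
    x as isize
}

/// Casting the same value to `usize` with `as` reinterprets the bits,
/// so a negative input turns into a very large positive number.
fn cast_unsigned(x: i32) -> usize {
    x as usize
}

/// The fallible conversion refuses negative inputs instead.
fn convert_unsigned(x: i32) -> Option<usize> {
    usize::try_from(x).ok()
}
```

So `widen_signed(-1)` is still `-1`, `cast_unsigned(-1)` is `usize::MAX`, and `convert_unsigned(-1)` is `None`.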

One counter-argument is that unsigned types rule out some illegal values (negative ones) and thus help catch bugs in advance. This is a good point, although some may raise the counter-counter-argument that even then the types still allow illegal values, namely large positive ones.

I admit this is largely a matter of style, and more of a cultural thing than a technical one. But it can affect the way we design container APIs. As the reddit user said, many crates will be modeled after std in the future.

Unfortunately, the most effective counter-argument is that we have already released a beta. We should be really careful before making another breaking change. Still, I am somewhat anxious that if this really is a problem, keeping it in 1.0 will hurt us in the future. I really hope I am wrong. Any thoughts on this? I tried to find relevant discussion in the forum and the RFC repository, but couldn't find any.

Thanks!


#2

Changing the sizes to be signed would break all code everywhere, since there are no implicit coercions. It’s not a bug fix, and even if you argued that it is, at that level of breakage pragmatism takes precedence. It is not a minor correction or addition to a library API. Furthermore, even prior to 1.0 one would have had to make a super-humanly good case for such disruptive changes. “Other languages do it and here’s a few rare bugs it prevents, at the cost of not preventing this and that bug” is not very convincing IMHO.

So, with the disclaimer that I am not a core team member or even a contributor, I’d say: Not a snowball’s chance in hell.

That said, let's talk hypothetically. If we had no existing code and wanted to design a new language, what should be preferred? Well, to make things short, I don't think there are very good reasons one way or the other. In C and C++, there is significant danger from mixing signed and unsigned (UB can lead to silent misoptimization), but in Rust and in other sensible language designs, at worst overflow/underflow wraps around. Then it's just this class of minor bugs versus that class of minor bugs.


#3

If we had understood the implications of LLVM's pointer arithmetic earlier, namely that vectors and indexing by pointer offset are restricted to the isize range and not the usize range, then we could have used isize for collection sizes and indices all along.

On the other hand, unsigned values require only an upper-bound check to see if an index is valid; there is no need to also check that it is non-negative.


#4

Containers could bitcast the isize to a usize and check whether it's in bounds. That removes the need for a separate check that it's not negative.
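A minimal sketch of that trick. `get_signed` is a hypothetical helper, not an existing std API, and it assumes the slice length never exceeds `isize::MAX`.

```rust
/// Returns the element at a signed index, or `None` if it is out of range.
/// Assumption: `slice.len() <= isize::MAX as usize`.
fn get_signed<T>(slice: &[T], index: isize) -> Option<&T> {
    // A negative `index` reinterpreted as `usize` is larger than
    // `isize::MAX as usize`, so this one comparison rejects negative
    // indices together with indices that are too large.
    let i = index as usize;
    if i < slice.len() {
        Some(&slice[i])
    } else {
        None
    }
}
```

For a three-element slice, indices `0` through `2` return the element, while both `3` and `-1` return `None` through the same single comparison.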


#5

FWIW, it seems that we already have a convention that restricts the actual range of a collection to be that of isize. If this is the case, wouldn’t it be better to reflect this fact in the type signature?
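As a rough illustration of what reflecting this in the signature could look like, here is a sketch with a hypothetical `signed_len` helper (nothing like it exists in std). As far as I know, the conversion can only fail for slices of zero-sized types, whose length is not bounded by the allocation size.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: the length of a slice as an `isize`.
/// Returns `None` if the length does not fit in `isize`.
fn signed_len<T>(items: &[T]) -> Option<isize> {
    isize::try_from(items.len()).ok()
}
```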