Pre-RFC: Generic integers (uint<N> and int<N>)

Basically: I’ve actually written the RFC, but considering how this is such a big change, I figured that I’d open it up to internals before actually posting it in the RFCs repo.

In summary, this adds the builtin types uint<N> and int<N>, allowing integers with an arbitrary size in bits. For now, 0 < N ≤ 128, but this restriction can be removed in the future. The RFC is here: https://github.com/clarcharr/rust-rfcs/blob/generic_int/text/0000-generic-int.md

This doesn’t actually coincide with the Rust 2018 goals, and I don’t intend to get it accepted any time soon, especially considering how we haven’t even implemented const generics yet! But I would definitely love feedback and to see what people think; feel free to send a PR my way too if you notice a typo or think something should be changed.

16 Likes

On the one hand, I’d love to have arbitrarily sized integers handled well in the type system, especially to improve the handling of bitfields. On the other hand, if we do it at all, we’d need to do so efficiently and in a well-specified way with respect to the underlying machine.

For instance, if you store such a type in an array, how much space does it use? If you store such types in a struct, how much space do they use (with or without packing)? Depending on that storage, can you use these types in a multithreading-safe way? Does writing such a type involve a read-modify-write?

You may want to specify, for instance, that these types always get represented by rounding up the number of bits to the next power of two, always get read or written as such, and just mask the result.

10 Likes

Here is a link with links to some related prior discussions

3 Likes

I’d like a possibility at least that Rust can use the upper bits as niche space – so sizeof::<Option<uint<31>> can be the same as sizeof::<u32>. (It’s not as great as just using u32::MAX as None but it’s better than lacking it.)

That said, specifying that sizeof::<uint<33..64>> == sizeof::<u64> seems like the safe choice.


Just as a side note, I’ve found it interesting that having true bounded integers also solves the problem of supporting different bit sized integers with a smart compiler. Given an Integer[0..100), the compiler knows how many bit patterns it needs and can reserve the rest for other uses.

I agree with this idea.

Some general remarks for consideration and for you to discuss / expand upon:

  • You should highlight all the code with Rust syntax highlighting by using: ```rust instead of ```.

  • The types usize and isize could have been (u)int<ptr_size> where ptr_size is a cfged const item. Unfortunately, coherence prevents us from making this change as u64 and usize are nominally different even if they have the same representation (provided that ptr_size == 64. You can include this note if you like (or not if you believe it adds no useful info…).

  • You say that: “This is a big change to the language, not to be taken lightly.” However, it doesn’t strike me as very large change either at the compiler level proviso that const generics already exists, and neither is it a big addition complexity-budget wise. People will continue to use usize and similar types…

  • You say that: “No other language provides this kind of generalisation”. Have you checked languages with dependent typing and/or const generics?

  • The rationale for “bool remain entirely separate types” is not discussed. It might be the right choice… but there are no coherence problems here, so we could make u1 and bool the same types nominally. What problems do you see with that approach?

  • It seems right to me that it should be uint and int, but entertain the notion of int and iint – why did you make the first choice?

  • The note that “For the moment, a monomorphisation error will occur if N > 128, to minimise implementation burden.” does not square with how Rust avoids monomorphisation errors. Not erroring post-monomorphisation would also require that:

    impl<const N: usize, const M: usize> Shl<uint<M>> for uint<N> { .. }
    

    becomes:

    impl<const N: usize, const M: usize> Shl<uint<M>> for uint<N>
    where N <= 128, M <= 128
    { .. }
    
  • The restriction 0 <= N <= 128 should be explained. Why is it technically necessary? It seems to complicate things instead…?

4 Likes

But how should bounds checking work? Can we find a single behavior that fits all use cases or is it better to have a separate library for each use case?

  • The rust standard approach is to panic on integer overflow
  • Wrap around would be useful for group theory (and bit manipulation??)
  • Ranged integers could also be useful for indexing arrrays. Here it could be pretty to do something like usize<MIN1, MAX1> + usize<MIN2, MAX2> =usize<MIN1+MIN2, MAX1+MAX2>

By the way the const generics RFC allows types that are generic over an integer. This could be useful for creating a library from ranged integers, but I don’t know if it has all the neccesary features.

Parsing, type checking etc. isn't that difficult but for layout and backend purposes, strange-sized integers are almost but not quite entirely unlike {u,i}{8,16,32,128} (more on this below).

Probably not, but it is conceptually and philosophically a departure from the "we offer a selected set of numeric primitives that match things many CPUs can operate on" perspective which probably doesn't describe everyone's opinion on primitive types but is, today, a valid perspective.

There are quite a few things you can do with integers but not with bools (arithmetic) and vice versa (branching, boolean operators). So making this work would require not just special-casing 1-bit integers to extend its capabilities, but also arbitrarily restricting it in other ways compared to all other integer type. Such arbitrary differences based on parameter values is pretty bad for generic code (the poster child is C++'s std::vector<bool>).

Also, it invites the bikeshed of whether booleans should be signed or unsigned :bike:

Funnily, we do have monomorphization-time errors due to type sizes.

As mentioned above, code generation for arbitrarily large integers is a challenge. Loading, storing, bitwise operations and other things that can be broken down into independent byte-sized chunks is conceptually simple (but someone still has to write the code to do that!). Arithmetic, however, can't be decomposed that simply.

For smaller types, you could promote to the next largest normal integer size and truncate the result (but again, this too has to be implemented) but a 1024 x 1024 multiply or remainder or float<->int cast is highly non-trivial. With {i,u}128 we could mostly lean on LLVM and compiler-rt but we still had to deal with backends not handling it properly and ultimately rewrote the compiler-rt routines in Rust. For types beyond 128 bit, we'll be entirely on our own. We probably can't even take code from bignum libraries since most/all? of those either allocate heap memory or assume the results fit in a pre-allocated buffer.

Well... there actually is a "fixed size bignum" implementation in libcore, but it's been written for float<->decimal and is correspondibly specialized. Not to mention that lowering "primitive" operations to calls to language items is a whole new can of worms for various reasons. though one we'll probably have to eventually face eventually.

3 Likes

I dislike the asymmetry of uint and int. How about u<N> and i<N>? I think this would be clear because of the similarities to u32 and i32. Would this disallow variables u and i? Alternatively we could use uint and iint or u_int and i_int. We could imidiatly use uN and iN but it may be better to leave those for the power of two integer types.

You talk about allowing conversion from u1 and i1 to bool but I’m not sure the conversion from i1 would be obvious. This would map -1 to true where true is normally only represented as 1.

I’m also interested in the posibility to have usize be implemented as u<PTR_SIZE> as @Centril suggested.

@hanna-kruppe You make fine points in general; I want to emphasize that my points above were more discussion items than statements. It is a good idea to encode these discussions in the RFC itself :slight_smile:

Sure; Is that an important property?

That's a good point.

I contemplated for a second whether uint<1> and int<1> should be nominally the same type, but it creates overlap (in impls) and seems hazardous for other unspecified reason [fill in].

That is funny :wink: but also quite extreme and massively unlikely. uint<256> however, not so much ^,-

Interesting. So correct me if I'm wrong... this seems to boil down to complicated busy work in terms of implementation but not the theory behind it? Conceptually, I'd think of uint<256> as [uint<128>; 2] (or using tuples...) and then building arithmetic in the "dumb" and inefficient way you might represent multiplication in terms of smaller sizes. It doesn't have to be performant initially.

That's not a bad idea... I feel like there should be some reason against it, but I can't think of any really.

That's not surprising tho; false in terms of integers is usually defined as == 0 and anything not that is true.

I only brought it up to say I don't think afaik that it is possible. :wink:

I reallize that true is mostly seen as x != 0 and false as x == 0. However integers dont have this conversion today, presumably because the x != 0 is more clear than x as bool. I can see the reasoning for u1 but I don’t think i1 should be an exception to this.

3 Likes

Yup, I was supplying points for the RFC to include.

I'm not sure. It is if you want primitive types to be fast, but that is already sort of not true for u128 (though I do believe that one is reasonably fast on many platforms, roughly comparable to 64 bit integers on 32 bit targets).

I suppose it's not any challenge from a language design perspective. If we're considering "theory" more generally, quite a lot of ink has been spilled about the algorithmic complexity of fundamental arithmetic operations and the best practical way to do arithmetic on large integers. Though now that said ink has been spilled, it is mostly a matter of engineering work to put that into a form we can use to actually generate code.

For the very first unstable implementation, no, but if it's stabilized and people are supposed to use it, then it would be pretty bad if a type from crates.io is routinely much faster. As an analogy, we didn't want to stabilize {i,u}128 before they were portable to all sorts of platforms to prevent people from needing to use libraries emulating those types.

1 Like

Perhaps performant for u256 and u512 as well -- given AVX, for a limited set of cases?

Answering in order:

You may want to specify, for instance, that these types always get represented by rounding up the number of bits to the next power of two, always get read or written as such, and just mask the result.

This actually is clarified in the RFC; was it not clear?

I’d like a possibility at least that Rust can use the upper bits as niche space – so sizeof::<Option<uint<31>> can be the same as sizeof::. (It’s not as great as just using u32::MAX as None but it’s better than lacking it.)

I should probably specify this explicitly; I alluded to it in the sense that Option<uint<7>> is one byte large. But yes, this would happen.

Just as a side note, I’ve found it interesting that having true bounded integers also solves the problem of supporting different bit sized integers with a smart compiler. Given an Integer[0..100), the compiler knows how many bit patterns it needs and can reserve the rest for other uses.

Would you then consider this an alternative to this, or another feature to be added on top?

You say that: “No other language provides this kind of generalisation”. Have you checked languages with dependent typing and/or const generics?

Yep, and I found none. If you do find one, though, let me know!

The rationale for “bool remain entirely separate types” is not discussed. It might be the right choice… but there are no coherence problems here, so we could make u1 and bool the same types nominally. What problems do you see with that approach?

Quite simply: bool doesn't do a lot of things integers do. You'd be able to add uint<1>, but bool addition can only be emulated with xor. We can't simply make one a type alias of the other.

  • The rust standard approach is to panic on integer overflow
  • Wrap around would be useful for group theory (and bit manipulation??)

I'm a bit confused here; wrapping_add and the like would still exist.

I dislike the asymmetry of uint and int. How about u and i? I think this would be clear because of the similarities to u32 and i32. Would this disallow variables u and i?

Considering how common i is as a variable, I'd say this is a 100% no. Shadowing one-char symbols seems like a bad idea.

1 Like

I think saying that AVX gives you an u256/u512 is a bit misleading, in the sense that this instruction sets does not give you “true” 256-bit operations. The only things you can perform are vectorized operations on batches of numbers which are up to 64-bit in size.

You can use these primitive ops to emulate 256-bit ops in software, but that’s quite a bit of work due to the need to manage carries and such.

2 Likes

As uint<4> is roughly equivalent to Integer[0..16), I'd say that the latter can be formulated as an extension of the former. It might even be better for performance-by-default to only have the former, as truncating is cheap, whereas the only sane behavior for bounded integers I've come up with is either statically preventing overflow or failing on overflow; there's no simple way to implement wrapping that would make it more performant.

Bounded integers also opens up the question of weird ranges. Integer[5..261) can be stored in a byte (unless I messed up counting), but do you want to? Bounded integers are great for correctness but I personally don't think they're ready for prime time in a performant lower abstraction level language.

TL;DR I support bounded integers as an experiment and bit counted integers for practical use, if only to tell the compiler that it can use those unused bits as niche space for optimizations.

3 Likes

One small detail in the RFC that seems odd to me:

It is not guaranteed that the upper bits of these types will be masked after operations, and it's up to the compiler to determine if masking should be done during casting or after every operation. This means that (x: uint) as uint is not necessarily equal to transmute::<uint, uM>> if N isn't a power of two; this does not have to be linted, but might be a good idea for a clippy lint in the future.

I guess this is intended to give the compiler more freedom to generate better code? But allowing basically arbitrary garbage in some of the bits to be observable has serious downsides, such as:

  • It's a big footgun for unsafe code dealing with these types (bigger than the usual "you can't store certain values in it or it's UB", since even reading is dangerous)
  • Probably complicates the unsafe code guidelines: since the upper bits can be arbitrary garbage, what's the principled reason for disallowing unsafe code from writing any bit pattern it likes there?
    • Relatedly, such bit-level tagging complicates miri (if it wants to catch any UB related to these bits)
  • As these types are supposed to provide niches for layout optimizations, I don't see how the compiler can actually exploit this freedom (other than in simple function-local cases, where it already has free reign since it sees all the relevant code and thus can do optimizations without affecting observable behavior). For example, when operating on a &mut uint<7>, you generally can't write anything above 0x7F to that memory since it might be the payload of an Option<uint<7>>.
  • Unspecified bits are unprecedented in Rust semantics AFAIK (there's padding but that's separate bytes)

I'd like to hear your rationale. Right now I think we should define the in-memory representation of uint<N> and int<N> as: zero-extended to however many bytes it occupies in memory. Any other bit pattern is an invalid value the same way a byte with value 3 is an invalid bool (which is precisely what enables the niche optimization).

7 Likes

I just realized this implies we generally can't store int<N> sign extended, which means absent optimizations we need to sign extend on every load – and from a strange bit position too, so there's likely no hardware support for it. I've not worked through all the details but this seems to be a serious performance hazard for signed types compared to unsigned types of the same width. Yet, at the same time, I don't really see how to avoid zero-extending if we want to use niches.

Edit: this is false, see discussion below

4 Likes

Would changing,

to, "in-memory representation of uint<N> and int<N> as: zero-extended to however many bytes it occupies in memory for unsigned types and sign-extending to however many bytes ... for signed types" be a useful solution? So, for unsigned types, the unused bits are guaranteed to be all zero. For signed types, the high order bits are guaranteed to be either all ones or all zero, depending upon the sign.

Would that help?

Here’re two ideas:

Arithmetic operations on a type of an unusual size which is additionally not properly aligned would just be slow. Wouldn’t it be better to require explicit conversions instead and to not make arithmetic ops available on this type?

Since the primary use case for this feature are bitfields, why not call the type bits<n> (e.g. bits<12>)?

2 Likes

This is basically defining the criteria for a niche in a bit more of a complex way, and I think it’d work. For example, before we might check if the upper bit of Result<int<7>, ()> is set to determine which variant we’re in. In this case, we’d check the top two bits: if they’re the same, we’re in one variant, and if they’re not, we’re in the other.

This would incur a penalty of one extra instruction to check the niche but no penalty otherwise, assuming that the variant where the bits are the same is the one that contains the int<7>.