Pre-RFC: Arbitrary bit-width integers

Summary

Add arbitrary uXX and iXX primitive integer types.

Motivation

Some algorithms need to work with very large numbers, or numbers that do not have 2n bits.

For example, the twelve_bit crate has u12, which is "useful for implementing Chip-8 assemblers and interpreters safely" from its description. The Ethereum Virtual Machine's scalar type is 256 bits, and ethnum was created to make it easier to work with those types.

ethnum has a feature to enable uses of LLVM intrinsics to help with performance, but it needs to use clang to compile and requires enabling LTO for optimal performance, since FFI calls are not inlined without LTO.

Having arbitrary bitwidth numbers support makes use cases above easier and removes the need to write software arithmetic implementations for integers with any number of bits in the future, since LLVM already has them. If an author of a library needs to change the implementation to be more specialized in a specific domain (for example, embedded targets, scientific computing), they could use a wrapper type and focus on the functionalities they need to change, and use LLVM's implementations for insignificant functionalities.

Guide-level explanation

Arbitrary bitwidth primitives can be sized any number of bits, specified by its number, for example, i24 represents a signed integer with 24 bits (3 bytes) of storage.

Arbitrary bitwidth integer primitives are very similar to existing primitives.

Below is an example of declaring u24 literals:

// Works.
let u24_max = 0b11111111_11111111_11111111u24;

// Errors.
let u24_invalid = 0b1_11111111_11111111_11111111u24;

Arbitrary bitwidth integers always have an alignment of 1.

These primitives used outside of #[repr(packed)] structures will round bits up to a multiple of 8 for its size.

Reference-level explanation

Valid range of bits

Arbitrary integers use 1 bit at minimum, and can be as large as 8388608 (223) bits, as supported by LLVM.

Any type with a specified number of bits outside this range are not recognized as a valid type.

// invalid
let u0: u0 = 0;

// 2^23 + 1 = 8388609
// invalid
let u8388609: u8388609 = 0;

Signed Integers

Signed integers are negative when the most significant bit is set. This means an i1 has two possible values: -1 and 0, i2 has three: 0, 1, -2 and -1, and so on.

Literals

The compiler now allocates an arbitrary precision integer on the heap to allow writing arbitrary large numbers with arbitrary bit-width integers.

Layout

Arbitrary bitwith integer types have a fixed alignment of 1. To specify an alignment, downstream users should define a newtype struct with #[repr(packed(ALIGN))].

Irregular types

Types are irregular when their size in bits is not a multiple of 8.

repr(Rust) structures are never irregular.

reprs

Irregular types used as fields in repr(Rust) structures will round up to the next multiple of 8 bits for its size. This means struct S { i: i1 } will take 1 byte instead of just 1 bit.

#[repr(bitpacked)] will merge fields that have irregular bit-widths. If the sum of bits of the fields are not a multiple of 8, the structure remains being irregular.

Casting

The types conform to the semantics for numeric casting:

  • Casting between integers of the same size (e.g. i4 -> u4) is a no-op.

  • Casting from a larger integer to a smaller integer (e.g. u9 -> u8) will truncate

  • Casting from a smaller integer to a larger integer (e.g. u7 -> u14) will

    • zero-extend if the source is unsigned

    • sign-extend if the source is signed

  • Casting from an integer to float will produce the closest possible float

    • if necessary, rounding is according to roundTiesToEven mode

    • on overflow, infinity (of the same sign as the input) is produced

Drawbacks

Hard to write inherent impls

We cannot add inherent functions to these types as there are a lot of them and we do not have a trivial solution to generalize adding functions to arbitrary bitwidth integers.

Trait impls must be handled by the compiler

There are a lot of arbitrary bitwidth integers, and we cannot just name all of them by hand and start writing trait impls. This could lead to a lot of work required for the compiler to support trait implementations and lowering them to codegen IR.

Rationale and alternatives

Const generic definitions

The initial design for this RFC involved having const generic parameters (e.g. Int<4> for i4) to specify its bit width.

We could write inherent and trait impls for the types in the standard library if we used this approach, but it had its own drawbacks:

  • Int<32> is not necessary the same as i32, as the alignment for arbitrary bit-width integers is always 1.

  • The const generic parameter has a valid range. You cannot declare Int<0> or Int<1000000000000000> because we cannot support them. It requires a lot of refactoring for the current trait system in place to detect them.

  • Having const generic parameters on a primitive can be confusing, especially when we use angle brackets.

Prior art

The Zig programming language supports arbitrary bitwidth integer types.

Their approach is very similar to this RFC.

Unresolved questions

impls

How do we define inherent methods and trait impls? Should this be special cased by the compiler?

Hybrid approach?

Can we use a hybrid approach to resolve the impls problem above? For example, i4 is what an end user uses, and ab(u)int<BITS> is what libcore uses for defining impls. This way we only need to resolve the type in a special way instead of requiring codegen to handle trait implementations.

Future possibilities

2 Likes

This seems like a feature, not a drawback, to me. These integers as a category have different properties from the standard integers, such as the alignment of 1 and (I assume) inability to take references to them, and it is better to have that category be specially marked and complete than to say that u63 is unlike u64.

This also addresses the entire drawbacks section of the proposed design, since impls can be generic over the width.

16 Likes

I think the Int<N> approach is better, and that alignment should be the largest power-of-2 divisor of N and not 1 (possibly capped at the cache line size or smallest MMU page size on the target platform), so that the primitive types can become type aliases to Int<N>.

I don't understand the "valid range" problem, any N: usize (including of course N = 0) can be supported. Since N is the number of bits, this only allows to define integers that take up to 1/8 of the virtual address space, but that's not likely an issue in practice. If LLVM only supports 1 to 2^24, it can be patched to support 0 to 2^usize, and meanwhile the compiler can just throw an error on monomorphization, or generate code using an array and function calls instead of built-in types (or for u0/i0, generate nothing).

3 Likes

LLVM doesn't support them: LLVM Language Reference Manual — LLVM 18.0.0git documentation

N = 0 is trivial to take care of in the frontend (it's just another ()). For the 2²⁴ limit: "just" make the const generic u24, since we're adding the built-in type anyway.

If arbitrary bit with integers behave notably different from the current ones (e.g. align 1 rather than align n/8) then they should be different from the native int types, and a confirming one should be provided for native sizes. (Another wrinkle is if/when we decide to e.g. add u256 properly, does it get the native alignment or is it stuck with the arbitrary alignment.)

11 Likes

As far as I know, we already don't specify a specific alignment for even things like u64 (it might be 4 on 32-bit and 8 on 64-bit), so I don't think that we'd need to specify the alignment for these primitives either.

Since you didn't mention it in the OP, I'll bring up https://github.com/rust-lang/rfcs/pull/2581

This is a huge feature. I suggest you avoid it for now. For example, how would a &u4 work? Is there a size_of that can tell you the size in bits instead of bytes? Why can't I make an irregular struct with a u2 and a u5 in it? ...

3 Likes

From my reading, you can, it just has to be repr(not_Rust)if you want the bitsize to be 7. If it were repr(Rust), it would "fill out" to a multiple of 8 bits in size.

Why would it work any differently? repr(packed) already doesn't support access to members by direct reference if that's what you're getting at. I suppose this would add repr(C) with these types as well (since they may be packed together), but that might be able to be limited to the affected fields?

Wouldn’t i2 have -2 as a possible value as well? (With bit pattern 10)

8 Likes

as a minor point, although u0 would make perfect sense (it's always zero), i0 has a conceptual problem, in that it's ambiguous whether its singular value would logically be 0 or -1.

7 Likes

I do not think it is much ambiguous. uX is a sum of X elements in form bit * 2.pow(bit_nr). Zero bits means zero elements in sum means single value is zero. The only difference with iX is that for bit position equal to X - 1 element is bit * 2.pow(X - 1) * (-1). Still zero elements in sum means zero.

Or, if you think that last element should be moved out of the sum it would be ambiguous whether single value of i0 will be 0 or -1/2, but it is 0 vs -1/2, not 0 vs -1. And -1/2 is rather odd value for an integer.

A fixed-sized, unsigned integer can be viewed as having an infinite amount of zeros in the higher bits. A fixed-sized, signed integer can be viewed as having either an infinite amount of zeros or an infinite amount of ones in the higher bits – and the choice of which assumption to make is defined by sign-extension (you assume that the higher bits match the sign bit). An i0 has no room for a sign bit, breaking the definition by removing the thing that specifies whether you have an infinite amount of zeros (0) or an infinite amount of ones (-1).

4 Likes

If this were implemented it would completely solve the bitfield problem. You wouldn't need any special syntax for bit-fields -- you'd just need to figure out the types you'd need, and fro bit-fields that'd be pretty straightforward.

This should not be invisible. If people want to declare an arbitrary integer on the stack, they should be able to, just as they can put [u8; 1024] on the stack.

2 Likes

I meant the compiler needs to store literals on the heap because users might want to write large literals.

2 Likes

Yes, that was an error

Thank you for linking me to some previous work! It looks like the major concern in that RFC was that it was hard to constrain the range of the BITS const generic parameter.

Unfortunately I don't think that solution would work very well. You would have a const generic type specified as the type of a const generic parameter. I'm not confident to say if that will be supported. Unless we decide to add uXX and U<BITS>, then we can just write U<const BITS: u24>.

We could also define a bits<const BITS: usize> primitive and make Int and Uint newtype wrappers with WF predicates constraining the const generic parameters to be within range. But that makes arithmetic harder to handle since Int and Uint are now lang items instead of primitive types.

The easiest approach is probably to make Int and Uint generic primitive types, and use WF predicates to ensure the specified BITS is within range.

1 Like

Oh, I see! Yeah, the compiler would need to store oversized literals out-of-line, perhaps with an enum that has variants for "small number | Box(large number)".

1 Like

That doesn't completely solve bitfields. You'd also have to ensure that the compiler lays out bitfields in a well-specified way, and in particular that repr(C) bitfields are laid out exactly as C would have.

And C's layout depends not just on the number of bits but also on the type specified; you can get a different layout if you write unsigned char x : 3; than if you write unsigned int x : 3;. (That can affect the alignment applied before starting the bitfield, as well as what happens when starting the next non-bitfield.)

1 Like

What qualifies as a "small" and "large" number, though? A [u8; 1024] can fit on the stack, and so can a [u32; 65536] on contemporary mainstream OSes (though I think that in the past I've crashed the compiler doing something like that). I'd say make it obvious when a stack/heap allocation is occurring so that developers can prepare the underlying environment for it. That way, they don't have to wonder when the compiler will decide to allocate 4096 bytes when their heap doesn't have that much room available because its a small embedded system and they need nearly all the heap space for other tasks too.

1 Like

C bitfields have some problems:

  • C doesn't specify how bitfields are packed within a "storage unit".
  • When bitfields straddle across storage units (e.g. when storage unit is u8, struct Foo(u4, u5, u7) the second field doesn't fit in the unit) the behavior is implementation-defined
  • Definition of a "storage unit" is intentionally vague.

An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.

Here are some solutions I can come up with.

  • Disallow references to individual fields inside a bit-packed struct. Not even addr_of! can be called. This is similar to the repr(packed) story but stricter.
  • Hard error when bitfields straddle across storage units.
  • Force users to write out the individual storage unit types for a struct.

This approach is different to the post in some ways:

  • #[repr(bitpacked)] does not allow you to create irregular types from other irregular types. We need another repr for that.
  • Full examples for using bit-packed repr:
#[repr(C, bitpacked(u16, u8))] // use two storage units
struct Foo {
    a: u7,
    b: u4,
    c: u5, // above three fields are packed into `u16`.
    // note that it cannot fit inside `u8` and would need to overlap.
    // overlapping is not allowed.

    d: u2,
    e: u6,
    // two fields packed into `u8`.

    // f: u8,
    // ^ above would error due to no storage left.
}
  • storage unit := u8 | u16 | u32 | u64 | u128 | _, For _ it only applies when having non-integer types as a fields, then the fields are their own storage units.

Below are some design choices that I did not write in the main post but is what I had in mind:

  • #[repr(transparent)] over an irregular type e.g. u3 acts as if it were u3.
  • References to irregular types are valid and you can assume that those do not belong to a bitpacked structure (therefore &u3 points to a byte instead of just three bits).

We will keep the idea that usages of irregular types will always round up to the nearest byte, if not contained in repr(bitpacked).

I just clarified it above: I was talking about the compiler storing literals. This has nothing to do with generated code and it will not allocate integers on the heap for generated code.

We would probably try parsing it to an u128, before using to a big integer like num_bigint::BigUint.

1 Like

I disagree. At least principally, and by that I mean that the compiler would lay out bit-fields precisely as you specified (assuming your bit-field filled the bit width of the next power-of-8 native integer size, which pretty much every bit-field I've seen does), at least on #[repr(packed)] types. For all intents and purposes, then, #[repr(packed)] structs containing these integer types would be bit-fields, regardless of whether they conformed to C's idea of a bit-field (which is unspecified, if I remember the standard properly) or not. Usually, my #[repr(C)] structs when I'm working with hardware are also packed (#[repr(C, packed)]), so that would constitute a bit-field, just not in the "C-way" of doing things, which is, IMO, a perfectly acceptable solution.

It won't be perfect: you won't be passing them over FFI. But no FFI ABI standard I've seen even covers bit-fields, so I'd generally consider that an incredibly bad idea to begin with, since the language your working with on the other end may or may not allow you to control the bit ordering (big/little endian, for example) and may or may not specify how bit-fields are constructed in memory, so that'd be a gamble.