Pre-RFC: Generic integers v2

Note that this is incompatible with i47 having a niche, since that requires the unused bits to have a fixed value (0 according to the Pre-RFC), but LLVM seems to consider them to be more akin to padding bytes, which can take any value. As a side note, is reading those unused bits UB for LLVM?

i<4,4> would not be a valid type, since 1/2 of a byte is not a valid alignment. Bit-level alignment doesn't make sense in general: alignment has to be a positive number of bytes and is generally assumed to be a power of two.

2 Likes

Note that we'd never store an i47 in the LLVM we emit, because that doesn't have well-defined semantics. We'd always zext or sext it (depending on u<47> or i<47>) to whatever byte size we decide on and store that, which is the point at which things like that niche note would matter.

The actual values in registers are LLVM's problem, not something that'd ever be observable by a Rust programmer.

It's an i47. If you lshr it by one 47 times, you're guaranteed to get zero. Asking about the bits above the 47th in that type is like asking what's in the 100th bit of an i64.
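
Purely as an illustration of the "zext or sext to a byte size" point above, here is a minimal sketch of keeping a 47-bit value canonical inside a 64-bit slot; the function names and masking scheme are mine, not anything rustc actually emits:

```rust
const BITS: u32 = 47;

/// Zero-extension: for a hypothetical u<47>, the 17 unused high bits are always 0.
fn canonical_u47(raw: u64) -> u64 {
    raw & ((1u64 << BITS) - 1)
}

/// Sign-extension: for a hypothetical i<47>, the unused high bits copy bit 46.
fn canonical_i47(raw: i64) -> i64 {
    (raw << (64 - BITS)) >> (64 - BITS)
}

fn main() {
    assert_eq!(canonical_u47(u64::MAX), (1u64 << BITS) - 1);
    assert_eq!(canonical_i47(1 << 46), -(1i64 << 46)); // bit 46 set, so the value is negative
}
```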

1 Like

What about sending an i42 to/from inline assembly?

Due to compiler limitations, yes. What I envision in a future RFC is to have a single type for range-bound integers (without specifying the number of bits). I know this is doable given my previous experiments.

The only thing I'm asking is for the naming to be considered. I thought I was extremely clear on that from the start, where I asked

do you have any ideas for forward compatibility with ranged integers name-wise?

That remains my question. uint is an obvious choice for generic integers that I have no objection to. int, on the other hand, has the potential to cause significant confusion with range-bound integers (for which my personal preference is naming them int). There are iN and uN for generic integers, which would work nearly as well, while range-bound integers have no other easy-to-think-of alternatives.

Bikeshedding on names: how about uint and sint for "unsigned integer" and "signed integer", leaving int unused? This doesn't quite match up to existing Rust u32 and i32 types, but is clear about what's intended. Or use rint for range integers?

I do want to consider it; I just want to make sure that the thing being considered is something that can be done, would be reasonable to do, and makes sense under the int name, and I'm not sure all of those apply here.

I haven't looked into this nearly as much as folks like you have, which is why I can't rule this out right now. It's just that, from everything you and others have said, there are still a lot of unanswered questions.

I talk about the type of the bounds, but honestly, the bigger issue is how exactly these integers should behave, and what "fixed precision, bounded integers" would look like in the language. Do they just never overflow, the bounds growing each time you perform operations on them? How would assignment operations work for these, or would they just not have them? Do they overflow, wrapping within their range, in that case?

And again, we're deciding to break the symmetry for the name of a fundamental type in the language, going with uint and sint instead of the uint and int that would match uN and iN everywhere, all because this hypothetical ultimate integer type is maybe, eventually going to be designed in a couple of years and then implemented a couple of years later?

Why can't the ranged integer type just be called Int? Or integer? It feels like it'll be way less used than the existing integer types for sure, since it adds a lot of complexity to bounds when most programs just choose integer types as a way of optimising the size of things in memory, not strictly bounding things in their APIs. Certainly, it would be weird that the generalisation of i32 is int<-0x80000000..=0x7FFFFFFF> or sint<32> and not int<32>.

Like, choosing the name int for the signed, bit-generic integer type makes a lot of sense, and unless ranged integers feel both canonical and implementable in the near future, I don't see why we should use the int name for them. They feel canonical mathematically, but they don't feel canonical in the way the language is designed, and that's the bigger issue IMHO.

3 Likes

I don't want to totally derail this thread into a full-on discussion about the precise behavior of range-bound integers. As such, I won't be responding to the questions in that regard. Feel free to create another topic if you'd like to discuss it, however.

Generic integers are hypothetical at this point as well. Let's not act as if one is a foregone conclusion and the other is a moonshot. I personally plan on getting around to ranged integers sooner than multiple years out.

All primitive types begin with a lowercase letter, and integer is quite verbose.

Likewise for this proposal, for what it's worth.

If I'm understanding you correctly, you're essentially arguing for a "first come first served" naming basis here?


In the pre-RFC, you bring up using u<N> and i<N> as an alternative, dismissing them due to conflicts with let i (and presumably u as well, albeit far less common). I think it's worth noting that values and types live in different namespaces, so the conflict here would be minimal if any. It's perfectly legal to have a variable named u8, after all.
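
For what it's worth, the namespace point is easy to demonstrate; the bindings below are arbitrary examples:

```rust
// Values and types live in separate namespaces, so bindings named like
// existing or proposed types do not conflict with those types.
fn main() {
    let u8: u8 = 255; // a binding literally named `u8`, of type `u8`
    let i = 42;       // `let i` would coexist with a hypothetical type `i<N>`
    println!("{u8} {i}");
}
```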

3 Likes

I'd argue there is a stronger argument for int<N> over int<A..=B>: int<N> directly describes the memory layout, maps directly to the existing iN and uN types, has clear and intuitive overflow behavior that is easy and performant to implement, and will be familiar to those coming from other languages.

To me this feels more like a primitive type (being closer to the hardware) than int<A..=B>, which contains all the information needed to determine a memory layout but does not have a clear mapping to the underlying memory representation.

For example: int<100..=200> fits into a u8, but which bits/range does it use? 0x64-0xc8 (100-200) or 0x00-0x64? There is an argument for both: the first stores the actual value, making it trivial to convert to a u8 (which a lot more functions would accept); the second is merely sufficient to store the data. You'd probably suggest the first representation.
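
To make the two options concrete, here is a hedged sketch of both layouts for the hypothetical int<100..=200>; the type names are made up and nothing here is prescribed by the Pre-RFC:

```rust
/// Option 1: store the value itself (0x64..=0xc8); converting to u8 is free.
struct ValueRepr(u8);

/// Option 2: store value - 100 (0x00..=0x64); recovering the real value
/// requires adding back the minimum taken from the type.
struct OffsetRepr(u8);

impl OffsetRepr {
    const MIN: u8 = 100;
    fn get(self) -> u8 {
        Self::MIN + self.0 // 0x00..=0x64 maps back to 100..=200
    }
}

fn main() {
    let value = ValueRepr(150);
    let offset = OffsetRepr(50);
    assert_eq!(value.0, offset.get()); // both encode the number 150
}
```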

On the other hand, you can also define an int<4294967140..=4294967240> type (0xffffff64-0xffffffc8). Does this now mean it's a u32 because it is trivially convertible? Or is it a u8 (0-100) because that's sufficient? Or is it a u8 that maps only via bitwise and/or, to make conversion easier?

The last option could require a type larger than needed: int<0x01f0..=0x02e0> has a range smaller than 256, but how would it be represented in memory (a code sketch follows the list):

  • u8: 0x00-0xf0 - Converting to the "real" number (u16 or larger) requires an addition, and you need to know the minimum from the type when debugging/reverse engineering.
  • u9/u16: 0x0f0-0x1e0 - Conversion back is just adding the aligned base 0x0100 (it would be a plain bitwise or if the range were better aligned), but you need at least 9 bits because the range is not well "aligned"
  • u16: 0x01f0-0x02e0 - Makes the type larger than really needed, but stores the "real" number
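
As a concrete illustration of the three options above (constants and names are mine, purely for demonstration):

```rust
fn main() {
    const MIN: u16 = 0x01f0;
    const ALIGNED_BASE: u16 = 0x0100;
    let real: u16 = 0x0250; // some value inside the range

    let a = (real - MIN) as u8;  // option 1: offset from the minimum, fits in a u8
    let b = real - ALIGNED_BASE; // option 2: offset from an aligned base, needs 9 bits
    let c = real;                // option 3: store the real value, costs a full u16

    assert_eq!(a as u16 + MIN, real);
    assert_eq!(b + ALIGNED_BASE, real);
    assert_eq!(c, real);
}
```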

TLDR

I'm not trying to argue against ranged/bounded integers, but I think there are still open questions about the memory layout that are not intuitive from the type alone. The name and generics of the type don't show or explain the memory representation, which is in my opinion really important for such a basic type as an integer. That direct-to-hardware mapping and intuitive memory representation (for anyone familiar with other languages) makes int<N> a lower-level primitive than an int<A..=B>, and thus it should in my opinion have the shorter/more intuitive name.

As I have said, I do not intend to debate the full workings of range-bound integers here, as that is off-topic. However, I will push back against

Other primitive types — namely pointers, references, and tuples — have largely unspecified layouts with only minimal guarantees. bool, oddly enough, doesn't seem to have layout guarantees either: its documentation only mentions that it's 0 or 1 when cast to an integer.

Given that there are three primitive types that already don't have specified layouts, this clearly can't be a blocker.

1 Like

It does have layout guarantees: Boolean type - The Rust Reference.

1 Like

The reference is explicitly not normative.

2 Likes

bool has size 1 (https://doc.rust-lang.org/std/mem/fn.size_of.html), and thus must have align 1.
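
Both facts are easy to check directly:

```rust
fn main() {
    assert_eq!(std::mem::size_of::<bool>(), 1);
    assert_eq!(std::mem::align_of::<bool>(), 1);
}
```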

1 Like

They're already implemented in the layout system, actually. That's how NonZero and NonNull work.

rustc_layout_scalar_valid_range_start and rustc_layout_scalar_valid_range_end are the attributes (although they can't be tied to generic parameters, or even to negative numbers)
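
For reference, a minimal nightly-only sketch of how those compiler-internal attributes are used today; the type name is made up and none of this is stable API:

```rust
#![feature(rustc_attrs)]

#[rustc_layout_scalar_valid_range_start(1)] // 0 is excluded and becomes a niche
#[repr(transparent)]
struct NonZeroU8Like(u8);

fn main() {
    // Constructing a value is unsafe: the caller promises the range invariant holds.
    let x = unsafe { NonZeroU8Like(1) };
    // The niche at 0 is what lets Option<NonZeroU8Like> stay one byte.
    assert_eq!(core::mem::size_of::<Option<NonZeroU8Like>>(), 1);
    let _ = x;
}
```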

1 Like

I think that considering their semantics would be worthwhile in this case if we choose to believe that ranged integers are a more canonical representation than bit-generic integers. That said, I won't push for more details if you don't want to provide them here.

The reason why I mention Int as an option is that it's not necessarily guaranteed that said ranged integer type would be a primitive. For example, NonNull is entirely implemented on top of existing primitive types, with rustc_layout_scalar_valid_range_start and rustc_layout_scalar_valid_range_end as mentioned. I don't see why ranged integers couldn't have similar treatment, especially if we're assuming that they could easily coexist with bit-generic integers. I also think that ultimately, it would be good longer-term to allow coercing literals into types like NonNull, and the "primitiveness" of a type shouldn't affect whether we're allowed to do that.

The reason I mentioned integer is actually because the compiler refers to integer literals as {integer}, and thus it feels appropriate for ranged integers in this case. We would effectively be replacing {integer} with a proper integer type, since these literals would also be used as the bounds for the type.

I mean, that is how we've decided Rust works as a language. Since names can't be changed, the existing iN and uN types are forever enshrined as the primitive integer types, and it makes sense to model the name of a bit-generic integer on the ones that already exist in the language. Except, as you mention…

I… legitimately didn't even consider this, and you're probably right. It still feels weird to have single-letter types effectively in the prelude, but honestly, this makes u<N> and i<N> feel more canonical than uint<N> and int<N>.

And that would solve the issue of int<...> being for ranged integers. I'll have to think about it a bit more, but honestly, I feel silly and will probably update the RFC to just use the single-letter names.

Have you considered uint and sint? I haven't seen any responses to those suggestions.

1 Like

To me it is an issue to fix rather than a good guideline to follow for new types.

I'm not sure it is technically possible, but the way I would like to introduce ranged types is by deprecating the current primitive integers and replacing them with uppercase counterparts that support a range as an optional parameter. They would act like the current integer primitives if the range is not specified.

According to this vision, generic integers could be I<SIZE> or U<SIZE> with an optional second parameter for the range, and automatically sized integers would be Int<RANGE>.

This is a very bad idea: single-letter uppercase types are conventionally used for generic type parameters, and I and U are quite common for this purpose.

2 Likes

This is only the case if we make the current primitive integers into aliases for generic integers.

There are a few arguments against that. For example, crates that have integer-related traits may want a fast impl for the regular integer types and then a generic impl for generic integers, and they probably don't want to wait for specialization for that to be possible.

1 Like

The current RFC lists generalizing code over integers and decluttering documentation as two of its main goals, and they can only be achieved if the current primitive integers become aliases for generic integers. Otherwise you'll actually make those situations worse, since you're adding yet another integer type that is harder to generalize code for (the existing macros don't expect the integer types to have a generic parameter).
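
To illustrate the kind of code this affects, here is the usual macro pattern crates use today (trait and macro names are made up); if the primitives were aliases of generic integers, a single generic impl could replace the whole list:

```rust
trait MyNum {
    fn zero() -> Self;
}

// Today every primitive integer gets its own impl via a macro like this.
// A separate generic integer type would need yet another entry (or its own impl).
macro_rules! impl_my_num {
    ($($t:ty),*) => {
        $(impl MyNum for $t {
            fn zero() -> Self { 0 }
        })*
    };
}

impl_my_num!(u8, u16, u32, u64, u128, i8, i16, i32, i64, i128);

fn main() {
    assert_eq!(u32::zero(), 0);
}
```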

5 Likes

My general response is that adding an s for signed is at least not what we've been doing throughout the rest of the language. We use iN, not sN, so it feels a bit out of place.

As mentioned, u<N> and i<N> do seem canonical despite my (wrong) assumptions earlier, so it feels right to just go with them instead of changing up the formula.

2 Likes