Pre-RFC: `usize` semantics

quinedot · September 4, 2023, 1:04am

The difference in how you break things^[1] as I see it is that usize being pointer sized is one of the most normatively specified things in Rust.^[2] Not only is that a bigger deal than breaking an assumption which was never promised, it penalizes the people who went to the effort to figure out, to the extent possible, what Rust's guarantees are. We should support those people, not penalize them.

It's not "just" the official documentation, although that's very important. RFC 0544 is the one that defined usize. They are explicitly pointer-sized integers. Technically that was "just" a rename,^[3] and you can see this was true before the rename (when they were int/uint) as well. They're also explicitly not ssize_t/size_t:

The names fail to indicate the precise semantics of the types - pointer-sized integers.

[...]

The names remind people of C's ssize_t/size_t, but isize/usize don't share the exact same semantics with the C types.

RFCs and discussions around that time (which was the leadup to 1.0) include...

And if you read through them, the constant is that the type is pointer sized, and pointer sized is always "good enough" for indexing even if a smaller size would be more efficient.

It's also true of any issue on the topic I've seen post-stabilization.

RFC 1861 also defines pointers to externs to be usize.

An undertone or presumption in the pre-1.0 discussion is that Rust should only have one signed/unsigned integer type where the width depends on the platform. In that world, the defining feature^[7] has to be pointer width and not size_t, because it's "good enough" for the other uses, while the opposite is not true.

Part of this issue is basically "opting for only one platform integer^[8] was wrong", which I can empathize with. The problem with the proposal here is that size_t is not the defining factor which has been promised, and people have -- reasonably and responsibly! -- relied on the definition.

I'm not convinced there isn't a way to change that definition in a way that doesn't blatantly break Rust's backwards compatibility promises, but anything I've thought of so far involves some sort of ecosystem split (like explicit opt-in, or crate non-interoperability across an edition).

[1] usize not pointer sized or usize not size_t
[2] That's the definition any official place you look, should you choose to look.
[3] there was a ton of pre-stable contention around integers
[4] Excerpt:

int and uint are always defined to have the same number of bits as a pointer on the target platform

[...]

it frequently happens that you have integers that are tied to the size of memory: for example, indices into an array or the size of an allocation. In these cases, uint and int are an excellent choice, though it may make sense to use smaller, fixed-size types if you know that the length of the array (or size of the allocation, etc) is limited.

[5] Excerpt:

it is certainly true that none of the listed proposals included variable-size integer types except for pointer-sized (and they all assume a flat address space as well). This is no accident.

[6] Core team decision excerpt:

On the other hand, seeing usize in the context of slice indexing or as the return of the len function is unlikely to lead to too much surprise for newcomers. A type like umem, on the other hand, is likely to raise eyebrows. Since "size" is general enough to refer to both the size of the address space and the size of a container and its indexes (which are, of course, closely related), and reasonably intuitive, we feel it is the best choice.

[7] without committing to a stronger size_t == uintptr_t requirement
[8] modulo signage

dlight · September 4, 2023, 2:42am

The problem is that, in practice, Rust code uses usize mainly as the integer that we use to index a Vec, and this is the role of size_t. There's unfortunately a mismatch between what's been defined (usize as uintptr_t) and what's used in practice. This has been discussed before.

I think the best thing to do is to admit the mistake and fix the definition of usize. The reason this is possible is that it doesn't break any currently running code, since on all platforms where Rust is currently supported, size_t = uintptr_t.

quinedot · September 4, 2023, 5:27am

Yes, I'm aware of that. I'm also aware of that thread, participated in that thread, and have read it in its entirety. Same for all the threads I cited, modulo participation.

Those who made the decision were also aware, hence the notes about pointer size being "good enough" for other uses and the citations provided (read the inlined footnotes at least, if you haven't already). As they said,^[1] it's sufficient (even when inefficient) to use something uintptr_t sized for size_t uses.

But the use of usize for indexing or other size_t-esque uses is not the only use in practice. People also use it for uintptr_t-esque uses... and that's the normative definition. See @tcsc's concrete examples in this topic, for one example.

It's a violation of backwards compatibility. It would make code which is valid today, under the normative semantics of Rust, invalid tomorrow. What's "running" today is irrelevant to that fact; if I wrote an edition 2015 library whose soundness^[2] relies on usize being pointer-sized on May 15, 2015, it should still be sound when compiled tomorrow or any other time during Rust 1.x. I shouldn't dread the minor release that changes the normative definition I relied on for soundness under my feet.

If I'm a programmer depending on such a crate, my code should also be sound when compiled tomorrow. (And what edition I'm on should be irrelevant.)

Look, although I'm generally against breaking changes, I get that the decision to buy into "only one platform dependent integer"^[3] is painful for some platforms. And that introducing a second one that is size_t while usize remains uintptr_t doesn't address most complaints. And that the decision was probably motivated by older platforms and something like CHERI was not anticipated.

But here's what really bugs me about these discussions writing off the existing guarantees.

The people who really care about usize = uintptr_t are those working with unsafe. And they have so few resources for "what can I soundly do". Any advances in that area are taking forever to flesh out (Niko's memory model, UCG, team opsem...). They have so few resources and it's taking so long that it's a running joke. But here we have an extremely unambiguous normative guarantee that has existed since before Rust 1.0: usize is pointer sized.

What do we want to give people working with unsafe, and what do we want from them? We want to give them the tools to create sound libraries we can use without unsafe, and we want them to do the due diligence to make sure they are making their libraries sound.

| | Looked up what is normative | Did not look up what is normative |
|---|---|---|
| Relied on `usize = uintptr_t` for soundness | Group A | Group B |
| Relied on `usize = size_t` for soundness | Group C | Group D |

Group A did the reasonable and responsible thing. They did what we wanted, and they are the unsafe developer we desire. If the backwards-incompatible change goes through, it will make their crate unsound when before it was forwards-compatibly sound. We will punish people who did the right thing. Their takeaway: Rust doesn't actually have a reliable backwards-compatibility guarantee, and I can't really trust such documentation or other normative resources. Incentive to do the right thing is diminished. Meaningfulness of any upcoming spec or opsem output is also reduced. Maybe they'll change their minds later.

Group C willingly wrote a forwards-looking unsound crate. They should have at a minimum thrown some asserts in. They would benefit from the backwards-incompatible changes, but are normally the group that gets ostracized for neglecting soundness in Rust culture.
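As a rough sketch of what "throwing some asserts in" could look like: a crate can state its size assumption as a compile-time check, so that a target violating it fails to build rather than misbehaving at run time. The check below tests the pointer-width assumption purely as an illustration; a crate relying on `usize = size_t` would have to compare against whatever `size_t` definition its C bindings provide. The helper `address_of` is hypothetical.

```rust
use std::mem::size_of;

// Compile-time guard: the build fails on any target where `usize` and a
// thin raw pointer differ in size.
const _: () = assert!(size_of::<usize>() == size_of::<*const ()>());

/// Hypothetical helper that stores a reference's address in a `usize`.
/// The guard above documents the size assumption it relies on.
fn address_of<T>(r: &T) -> usize {
    r as *const T as usize
}

fn main() {
    let x = 7u32;
    let addr = address_of(&x);
    // References are never null and are always suitably aligned.
    assert_ne!(addr, 0);
    assert_eq!(addr % std::mem::align_of::<u32>(), 0);
    println!("{addr:#x}");
}
```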

Group B and Group D lacked due-diligence; which group lucked out depends on whether or not the backwards-incompatible change goes through.

So in addition to being against breaking changes, I don't think we should be punishing Group A or diminishing the incentive to believe in Rust's supposed backwards compatibility guarantees or other incentives to do the right thing. And I think it will be harmful to Rust going forward: the message is that you can't actually trust anything to be normative.

Making usize != uintptr_t opt-in is less damaging in that respect, but potentially ecosystem-splitting. I would still consider that more palatable than going back on the existing guarantees.

If there's an option to not punish Group A and not be ecosystem-splitting, again, I haven't thought of it but am all ears.

[1] and one of the alternatives explored in the OP
[2] or correctness otherwise
[3] modulo sign

CAD97 · September 4, 2023, 6:09am

Here's the thing though: Group A's code is unsound on CHERI anyway. You can't freely do arithmetic to a pointer as uintptr_t on CHERI; you need to use the equivalent of the strict provenance APIs to manipulate the memory address and recombine it with the signed pointer.
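To make the strict provenance style concrete, here is a minimal sketch that tags the low bit of a pointer by changing only its address while keeping the original pointer's provenance. It assumes Rust 1.84 or later, where `addr` and `map_addr` are stable on raw pointers; the helper names `tag_low_bit`, `untag`, and `is_tagged` are made up for illustration.

```rust
/// Hypothetical helper: set the low bit of the address as a tag.
/// `map_addr` changes only the address; provenance is carried over.
fn tag_low_bit(p: *const u64) -> *const u64 {
    p.map_addr(|a| a | 1)
}

/// Hypothetical helper: clear the tag bit again.
fn untag(p: *const u64) -> *const u64 {
    p.map_addr(|a| a & !1)
}

/// Hypothetical helper: inspect the address without exposing provenance.
fn is_tagged(p: *const u64) -> bool {
    p.addr() & 1 == 1
}

fn main() {
    let value = 42u64;
    let p: *const u64 = &value;

    let tagged = tag_low_bit(p);
    assert!(is_tagged(tagged));

    let clean = untag(tagged);
    assert_eq!(clean, p);
    // SAFETY: `clean` has the same address and provenance as `p`,
    // which points to the live local `value`.
    assert_eq!(unsafe { *clean }, 42);
}
```

No integer is ever cast back to a pointer here, which is the property that matters on a target where a pointer carries more than its address.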

Making the definition usize_Rust == size_t_C == uintaddr_t_C doesn't break anything. Group A is implicitly assuming that uintaddr_t_C == uintptr_t_C, as is all of the pre-1.0 discussion about usize being pointer-sized; at that time, nobody knew what "pointer provenance" was, let alone had an idea that a hardware processor would have pointers which are a power of two larger than the addressable memory space.

fuzzy_provenance_casts would make pointer as usize and usize as pointer into hard errors on any platform which doesn't have usize_Rust == size_t_C == uintaddr_t_C == uintptr_t_C. Code of all four groups would continue to work, and would fail to compile on the targets where it's a potential issue.

The one case where it wouldn't be caught is if a crate exclusively uses transmute_copy and pointer reinterpret casts to cast between pointers and usize. And that code is already unsound, because doing so breaks the provenance chain on today's targets.

If fuzzy_provenance_casts is an error on platforms that don't have usize_Rust == uintptr_t_C — and that's the plan laid out — then there is no code which is sound on x86_64 which also compiles on CHERI but is unsound.

Caring about "exotic targets" is opt-in, and for CHERI the opt-in is spelled #![deny(fuzzy_provenance_casts)]. Just, if your dependencies' code happens to not fire that lint, and thus has no issue on CHERI, it'll work on CHERI without needing to opt in to breaking CHERI being an error.

RalfJung · September 4, 2023, 6:28am

That's not entirely correct, is it? See the example of code using size_of::<usize>() to stand for "pointer-sized" to compute offsets.
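A minimal sketch of the pattern in question, assuming a hypothetical table of pointer slots: the offset is computed with `size_of::<usize>()` standing in for "pointer-sized", and no pointer/integer cast appears anywhere, so a cast lint has nothing to fire on. Both function names are invented for illustration.

```rust
use std::mem::size_of;

/// Byte offset of the `i`-th pointer slot, with `usize` standing in
/// for "pointer-sized" (the pattern described above).
fn slot_offset_via_usize(i: usize) -> usize {
    i * size_of::<usize>()
}

/// The same computation phrased in terms of the pointer type itself.
fn slot_offset_via_ptr(i: usize) -> usize {
    i * size_of::<*const u8>()
}

fn main() {
    // Under the documented guarantee that `usize` is pointer-sized,
    // the two computations agree for every index.
    for i in 0..4 {
        assert_eq!(slot_offset_via_usize(i), slot_offset_via_ptr(i));
    }
}
```

If `usize` were narrower than a pointer, the first function would silently compute the wrong offsets while still compiling.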

quinedot · September 4, 2023, 8:44am

I'm pretty sure this part is false for the pre-1.0 discussion, there are acknowledgements in the discussion to them being different.^[1] My interpretation is that they didn't want to be too much more restrictive than C,^[2] but didn't want the soup of C int types, so the performance hit was worth the cost on "legacy" or exotic systems like a SNES.^[3] (Some of those making that decision are still around, so there's no reason to rely on my interpretation if you want to ask them.)

I also don't see how it follows for Group A. For example the w65 platform implementation thread wanted the independent size_t != usize without changing usize = uintptr_t. And also Ralf's comment just above.

In that case, support for CHERI should definitely be opt in... if ever supported on Rust 1.x.

Is this agreed upon generally? I've not gotten this impression re: fuzzy_provenance_casts before, but perhaps I just missed it.^[4]

Let's say I just missed it. Then making fuzzy_provenance_casts a hard error^[5] could basically be an ecosystem split approach, wherein Rust's guarantees without opt-in are extended to size_t == uintptr_t. The RFC for such should not be phrased as usize = size_t, it should be phrased as uintptr_t = size_t. usize = size_t follows from the pre-existing usize = uintptr_t guarantee. And that would be backwards compatible; the question becomes whether the (future-when-uintptr_t != size_t-platforms-are-sorta-supported) ecosystem split is acceptable.

If it's acceptable, I believe this could be phased in over time in a non-breaking manner.

But I take it that what you actually had in mind is basically what you were pitching before, wherein backwards compatibility is not guaranteed but only a "best effort", as per the exceptions listed, and usize is redefined. I still don't see how that can be introduced in a non-breaking manner. Or how the benefit of a smaller-than-usize-index-integer can be introduced without the redefinition.

[1] Again the portions about pointer size being adequate if not ideal. Maybe I should direct quote my footnotes?
[2] quote use Rust libraries as a "drop-in replacement" for C, unquote
[3] CHERI hadn't been pitched at that time.
[4] Citation welcome!
[5] with no other changes like redefining usize

graymalkin · September 4, 2023, 11:30am

We understand the hazard in breaking changes to the semantics. We do believe that this change is one which is benign on the major platforms Rust currently supports, and mitigations like fuzzy_provenance_casts and the crate attribute suggested early in this discussion will make these changes tolerable for bringing up support for platforms with CHERI as/when they become generally available.

For what it's worth, the existing documentation is a little inconsistent about how it guarantees usize == uintptr_t.

In the UCG it's clear (UCG):

The isize and usize types are pointer-sized signed and unsigned integers. They have the same layout as the pointer types for which the pointee is Sized, and are layout compatible with C's uintptr_t and intptr_t types

In the Rust book it is less clear (book.usize):

The pointer-sized unsigned integer type.

The size of this primitive is how many bytes it takes to reference any location in memory. For example, on a 32 bit target, this is 4 bytes and on a 64 bit target, this is 8 bytes.

And in the reference for numeric types (book.numeric):

The usize type is an unsigned integer type with the same number of bits as the platform's pointer type. It can represent every memory address in the process.

The isize type is a signed integer type with the same number of bits as the platform's pointer type. The theoretical upper bound on object and array size is the maximum isize value. This ensures that isize can be used to calculate differences between pointers into an object or array and can address every byte within an object along with one byte past the end.

usize and isize are at least 16-bits wide.

The first definition is very clear, the second two are contradictory for any platform where C's size_t is not uintptr_t. These definitions would all need to be tidied up and made more precise in any attempt to add either CHERI support to Rust, or any platform like w65/8086.
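The role the quoted reference text gives to isize (differences between pointers into one object, including one past the end) can be sketched as follows. The helper `element_distance` is hypothetical; `offset_from` is the standard raw-pointer method and returns the distance in elements as an `isize`.

```rust
/// Hypothetical helper: distance in elements from the start of a slice
/// to one past its end, computed as a pointer difference.
fn element_distance(data: &[u32]) -> isize {
    let start = data.as_ptr();
    // SAFETY: one past the end of the same slice is a valid pointer to
    // form (it is never dereferenced here).
    let end = unsafe { start.add(data.len()) };
    // SAFETY: both pointers are derived from the same slice.
    unsafe { end.offset_from(start) }
}

fn main() {
    let data = [10u32, 20, 30, 40];
    assert_eq!(element_distance(&data), 4);
    // The object's size in bytes fits in `isize`, as the quote requires.
    assert!(std::mem::size_of_val(&data) <= isize::MAX as usize);
}
```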

For developers in Group A, they may well have found that the guarantees of Rust only really extend to platforms like x86/aarch64 where uintptr_t == size_t.

Vorpal · September 4, 2023, 3:07pm

Based on the issues with breaking changes, conflict with other obscure architectures (i.e. w65 etc) and most importantly that this hardware is not available for purchase (and no one knows for sure that it will ever be more than a research project) it seems more prudent to focus on support for architectures that actually exist currently.

We have no idea if CHERI will be commercialised, and even if it is, it might flop. There are several niche architectures (such as w65) that (while not popular) at least exist.

It seems strange to prioritise CHERI such that it will break those, and break the backward compatibility guarantee of the ecosystem (which Rust previously promised it would only do for soundness issues).

This doesn't seem relevant to implement until it is possible for a general consumer to buy CHERI hardware (e.g. something like a pi or dev board with CHERI). It is fine to discuss it beforehand in order to be prepared of course. But it seems very premature to actually change anything yet.

Edit: it also doesn't seem impossible that an actually commercialised version of this might have different limitations. So that is another reason to wait.

pitaj · September 4, 2023, 4:29pm

This sounds like a good space where we should start experimenting with clippy lints.

bjorn3 · September 4, 2023, 4:55pm

I think group A can be further subdivided into code that only assumes that usize can losslessly represent all pointer values and vice versa, and code that actually depends on usize and pointers having exactly the same size. The former case will either work fine or reliably crash on CHERI with SIGPROT, depending on whether the compiler optimizes away the ptr2int2ptr cast or not. If it is not optimized away, the pointer will be unambiguously an invalid pointer due to the capability bit not being set. This makes it a denial-of-service at worst. The latter case of actually depending on equal size is the actually worrisome case, as it can silently corrupt things and thus could be exploitable.
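A minimal sketch of the first subgroup, code that only assumes a pointer survives a round trip through usize. The helper `round_trip` is hypothetical. On targets where usize is pointer-sized this works today; the description above concerns what happens to the same code on CHERI.

```rust
/// Hypothetical helper: the ptr2int2ptr round trip discussed above.
fn round_trip(p: *const i32) -> *const i32 {
    let addr = p as usize; // pointer to integer
    addr as *const i32 // integer back to pointer
}

fn main() {
    let value = 123i32;
    let p: *const i32 = &value;

    let q = round_trip(p);
    assert_eq!(q, p);
    // SAFETY: on targets where `usize` is pointer-sized, `q` refers to the
    // same live local as `p`.
    assert_eq!(unsafe { *q }, 123);
}
```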

tcsc · September 16, 2023, 9:28pm

This seems nearly impossible to lint against to me. The cases where I've seen this would be completely impossible because the layout computation happens completely independently from use. I think that's the common case.

tcsc · September 16, 2023, 9:32pm

The documentation for size_of is also quite clear: size_of in std::mem - Rust

The types *const T, &T, Box<T>, Option<&T>, and Option<Box<T>> all have the same size. If T is Sized, all of those types have the same size as usize.

I believe we have other places this guarantee exists scattered throughout our docs.
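The quoted size_of guarantee can be spot-checked directly. This is only a sketch for one concrete Sized type (`u32`), and the helper `pointer_like_sizes` is invented for the example.

```rust
use std::mem::size_of;

/// Hypothetical helper: sizes of the pointer-like types named in the
/// `size_of` documentation, instantiated with the Sized type `u32`.
fn pointer_like_sizes() -> [usize; 5] {
    [
        size_of::<*const u32>(),
        size_of::<&u32>(),
        size_of::<Box<u32>>(),
        size_of::<Option<&u32>>(),
        size_of::<Option<Box<u32>>>(),
    ]
}

fn main() {
    // Per the quoted documentation, each of these equals size_of::<usize>().
    for size in pointer_like_sizes() {
        assert_eq!(size, size_of::<usize>());
    }
}
```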

tgross35 · February 21, 2024, 9:40pm

Is an RFC in the works for this?

If the decision is to go forward here, it could be reasonable to make this semantics change as part of edition 2027 (or, less likely, 2024). This wouldn't really have any meaningful impact unless a lint is brought with it, but if it is well-publicized then at least crate authors can be made aware of the semantic change. Architectures like CHERI can just fail to build dependencies before a specific edition.

Also note that 128-bit pointers with different sized offsets may not be just limited to CHERI in the future, there are talks about 128-bit pointers in the kernel in a decade or so ("Zettalinux", supposedly you could have an entire cluster with RAM, file system, GPU memory, MMIO etc all in the same address space). The thread starting at Re: [PATCH] usercopy: use unsigned long instead of uintptr_t - Matthew Wilcox is discussing some ways they might address this with C types.

dlight · February 21, 2024, 10:10pm

That's Python 3 level of ecosystem split. Even if it applies only to a niche platform, it's kind of against the Rust ethos of stability without stagnation.

In particular, there are tons of high quality libraries that were made in the last few years and are feature complete now, and won't have a new release for many years.

pitaj · February 21, 2024, 10:27pm

It's not even close to that level of ecosystem split. It's kinda offensive you would even make that comparison.

What @tgross35 suggested would enable the normal Rust compiler to soundly target sizeof(*_) != sizeof(usize) platforms, and provide a simple and largely automated way for library authors to support those platforms as well.

saethlin · February 21, 2024, 11:07pm

Someone would need to first explain what we could even lint on here, in response to Thom's comment above. Are we going to lint on size_of::<usize>()? That seems bananas. Linting against as or transmutes between pointers and integers might even be done by 2027 for other reasons.

Also note that 128-bit pointers with different sized offsets may not be just limited to CHERI in the future,

If anything, the interesting part of these seems to be discussion of near and far pointers. But in my reading I don't see anything about pointers with more bits than their address, which is the tripping point with CHERI.

tgross35 · February 22, 2024, 1:34am

On the contrary: there is no guarantee that code written for previous editions works in newer ones, just that future compilers will continue to compile it at a pinned edition. Edition changes are exactly for this kind of breaking behavior, such as introducing new keywords or changing panic! usage.

CHERI is less than unstable anyway, there can be no ecosystem split here.

Yeah, it is impossible to catch every single case, but linting at the pointer<->integer entrypoint should hopefully catch the vast majority of mistakes. Really this could start out as a clippy lint not long after strict_provenance/exposed_provenance APIs become stable.

Right, it's not exactly the same problem. But depending on implementation, it could wind up being a case where usize === sizeof(address) === sizeof(offset) === sizeof(max_sized_object) may not hold up.

IBM-i is another better real-world example where IIUC ptrdiff_t is the size of a fat pointer, 2*size_t (IBM Documentation). Obviously Rust support isn't really feasible there for other reasons, but it is an interesting case that is in use now.

I haven't seen hardware support features like ARM TBI or Intel LAM mentioned either, where the maximum allocation is effectively 48, 57, or 63 bits stored with metadata as a u64. This probably doesn't affect usize too much, but it definitely makes the usize<->pointer conversions interesting.
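A toy model of why such features complicate usize<->pointer conversions, using plain u64 arithmetic. It assumes a layout where the top 8 bits of a 64-bit word hold metadata and the low 56 bits hold the address, in the style of a top-byte tag; the constants and function names are illustrative only and do not describe any particular hardware precisely.

```rust
// Assumed layout for this sketch: top 8 bits are metadata, low 56 bits
// are the address.
const TAG_SHIFT: u32 = 56;
const ADDR_MASK: u64 = (1u64 << TAG_SHIFT) - 1;

/// Hypothetical helper: store `tag` in the top byte of `word`.
fn with_tag(word: u64, tag: u8) -> u64 {
    (word & ADDR_MASK) | ((tag as u64) << TAG_SHIFT)
}

/// Hypothetical helper: read the metadata byte back.
fn tag_of(word: u64) -> u8 {
    (word >> TAG_SHIFT) as u8
}

/// Hypothetical helper: strip the metadata, leaving only the address bits.
fn address_part(word: u64) -> u64 {
    word & ADDR_MASK
}

fn main() {
    let addr = 0x0000_7fff_dead_beef_u64;
    let tagged = with_tag(addr, 0xA5);

    assert_eq!(tagged, 0xA500_7fff_dead_beef);
    assert_eq!(tag_of(tagged), 0xA5);
    assert_eq!(address_part(tagged), addr);
    // The same location now has two different integer representations.
    assert_ne!(tagged, addr);
}
```

The last assertion is the interesting part: comparing or hashing the integer form no longer identifies a location uniquely.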

RalfJung · February 22, 2024, 6:52am

Many of these will be caught on CHERI anyway due to transmute doing size-checks. It's the ones where transmute is hidden behind raw ptr cast, and the ones involving ptr arithmetic, that I am most concerned about.

I met some of the CHERI folks in London earlier this year and they wanted to know what the chances are for Rust on CHERI to happen. So there's definitely people there that would be willing to put in some work, if they can have project liaisons that can help them navigate our processes.

There was some discussion of this on Zulip. I don't think the int/ptr casts are the problem here, the more interesting question is how "pointers to seemingly different addresses" being aliases should work. The status quo is that it doesn't work (not on the LLVM level and not on the Rust level).

talchas · February 25, 2024, 5:20pm

If CHERI-rust wants to use usize == uintptr_t for the minimum breakage (and maximum subtlety of the remaining breakage, as is typical for that sort of tradeoff), then you wind up with a usize that fits u64 and yet has size_of::<usize>() be 16, right? (And then has some set of PVI-like behaviors for the hidden capability part for math and comparisons and such)

I'm assuming this because AIUI the fully-u128 alternative wouldn't have hardware support and might be even weirder on the behavior around usize<->ptr. (Or at least have the weirdness be visible)

(TBH the usize == size_t option is imo the most principled choice for CHERI-rust, and just completely remove the from_exposed_addr/expose_addr and the as casts on that platform. It's already unlikely to be fully spec-compliant from requiring MaybeUninit<*const ()> or equivalent as the fundamental unit of memory for an arbitrary unknown type instead of MaybeUninit<u8>; might as well not pretend to have expose when rust has with_addr/map_addr for a fully working API on CHERI and rust is generally less yolo than C)

Vorpal · February 25, 2024, 6:50pm

Exposing addresses seems like it might be needed for FFI still.
