[Pre-RFC] usize is not size_t

InfernoDeity · September 23, 2021, 12:26am

`usize` is not `size_t`

Brief

Currently, while Rust does not explicitly guarantee compatibility between usize and size_t, some Rust FFI code assumes it. This affects ABIs such as w65 and CHERI, where the size type has a different size from the pointer-sized type. Thus, to support Rust development targeting these platforms, this Pre-RFC seeks to explicitly declare that this compatibility is not guaranteed.

Background

In Rust, usize is an integer type which is compatible with (the same size as) thin raw pointer types. The standard library docs describe it as

[A primitive for which the size] is how many bytes it takes to reference any location in memory. For example, on a 32 bit target, this is 4 bytes and on a 64 bit target, this is 8 bytes.

The Unsafe Code Guidelines notes that

The isize and usize types are pointer-sized signed and unsigned integers. They have the same layout as the pointer types for which the pointee is Sized

The Unsafe Code Guidelines also notably defines that usize and isize are respectively compatible with uintptr_t and intptr_t defined in C.

This has interesting implications, most notably for FFI. This Pre-RFC seeks to formally declare that the C types size_t and ptrdiff_t are not necessarily compatible with usize and isize respectively; that is, size_t and ptrdiff_t aren't required to be compatible with uintptr_t and intptr_t in the platform C ABI in order for that platform to support Rust.

This has been discussed on Zulip (archive), and briefly in the ABI discussion for w65, which demonstrates potential issues where compatibility is assumed.
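
To make the layout claims in this section concrete, here is a minimal sketch that checks them on whatever target it is compiled for. The function names are invented for this example. It only shows that usize matches thin pointers on today's targets; it says nothing about size_t.

```rust
use std::mem::size_of;

/// Returns true when `usize` and `isize` have the same size as thin raw
/// pointers on the target this is compiled for.
pub fn usize_is_pointer_sized() -> bool {
    size_of::<usize>() == size_of::<*const u8>()
        && size_of::<isize>() == size_of::<*mut u8>()
}

/// Size in bytes of a pointer to a slice. The pointee is not `Sized`, so the
/// pointer also carries a length and is wider than a thin pointer.
pub fn slice_pointer_size() -> usize {
    size_of::<*const [u8]>()
}
```

On the targets Rust supports today, the first function is expected to return true, and the slice pointer is twice the size of usize.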

Where this Applies

On most platforms, and on all that Rust currently supports, size_t and uintptr_t are the same type, as are ptrdiff_t and intptr_t. However, as Rust evolves to support new platforms, some may not make such a guarantee. For example, the w65 ABI defined by the SNES-Dev Project (which is designed to allow users to develop SNES homebrew in modern languages, and intends to eventually include Rust) defines size_t as a typedef for unsigned int and uintptr_t as a typedef for unsigned long. These types notably have different sizes (2 bytes for size_t and 4 bytes for uintptr_t). As a result, under this ABI, the usize type is not compatible with size_t. Indeed, the two C types use entirely different parameter and return value conventions: size_t is passed in a hardware register, and uintptr_t in a specially designated memory location. This was initially noted in the ABI discussion for w65.

As another example, the CHERI platform, which stores machine-level provenance information in pointer values, has a similar issue - uintptr_t, which stores a capability, is 128-bit, and size_t, which does not, is 64-bit.
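
As a rough illustration of the w65 case, the sketch below models the two C types with fixed-width Rust integers and converts between them explicitly. The aliases `W65SizeT` and `W65UintPtrT` and the function `to_size_t` are hypothetical names for this example, not part of any real ABI binding.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-ins for the w65 C types described above:
/// `size_t` is 2 bytes and `uintptr_t` is 4 bytes.
pub type W65SizeT = u16;
pub type W65UintPtrT = u32;

/// Converts a pointer-sized length into the ABI's size type, returning
/// `None` instead of silently truncating when the value does not fit.
pub fn to_size_t(len: W65UintPtrT) -> Option<W65SizeT> {
    W65SizeT::try_from(len).ok()
}
```

A plain `as` cast would instead drop the upper 16 bits without any error, which is the kind of silent mismatch that FFI code assuming usize == size_t would run into.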

Why this is a problem

FFI code currently assumes that usize is "the size type" and is compatible with size_t, despite nothing explicitly guaranteeing it.

In libc's API definition, the size_t type is defined as identical to usize. Additionally, bindgen has a flag for this, --size_t-is-usize, though size_t vs usize · Issue #1671 · rust-lang/rust-bindgen · GitHub notes this exact issue, and seems to give it as the reason the flag is not the default.

While I am not specifically aware of other crates that have this issue, I wouldn't find it hard to believe that it is relied upon elsewhere, or that something in the future may come to rely upon it.
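
For readers less familiar with the pattern, this is roughly what the assumption looks like in FFI code. The alias and the function `sum_bytes` are made up for this sketch; the C prototype it is meant to match would be something like `size_t sum_bytes(const uint8_t *data, size_t len);`.

```rust
/// Mirrors the assumption discussed above: `size_t` is declared as `usize`.
/// This alias is local to the example and is only correct on targets where
/// the two types share a size and a calling convention.
#[allow(non_camel_case_types)]
pub type size_t = usize;

/// A function with the C ABI whose length parameter and return value are
/// typed as `size_t` on the C side and as `usize` on the Rust side.
///
/// # Safety
/// `data` must point to `len` readable bytes.
pub unsafe extern "C" fn sum_bytes(data: *const u8, len: size_t) -> size_t {
    // SAFETY: the caller promises that `data` is valid for `len` bytes.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    bytes.iter().map(|&b| b as usize).sum()
}
```

On a target where size_t is narrower than usize, a C caller and this Rust definition would disagree about how `len` and the return value are passed.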

Note: While this can be considered a de facto breaking change, there are a couple of reasons it can be justified:

No existing platform that Rust supports has this issue - only new platforms.
FFI code typically needs to be tailored for the specific platform in other ways, so I find it less likely that some old code wouldn't otherwise have issues when ported to a new architecture.
In the case of w65, which is primarily a freestanding architecture with only freestanding targets, much of its FFI will be specific to the platform, and thus written specifically for the platform, with the ability to account for this ABI difference. I cannot speak as to whether or not the same reasoning applies to CHERI.

Alternatives

There are a number of alternatives to consider:

Rust could do nothing explicitly, leaving it as is. This is potentially the most dangerous option, as Rust ABIs for platforms where the types are not compatible may emerge and come into use while FFI code continues to rely on the assumption.
Rust could explicitly declare that usize and size_t are compatible. This would turn a de facto guarantee into an official one. Platforms would then, where possible, need to ensure this is the case. Where that is impossible (the ABI already exists and is stable) or infeasible, those platforms simply cannot be supported by Rust. I would prefer this option not be taken, as it would likely rule out Rust support for the w65 platform (making size_t 4 bytes may impermissibly penalize both ABI and codegen involving the type).
A hybrid approach could be taken, where hosted targets (ones that run with the benefit of an operating system, and, in particular, have std available, rather than just core and/or alloc) guarantee the compatibility, and freestanding targets explicitly do not.

tcsc · September 23, 2021, 1:49am

Honestly, it's probably better for FFI code to rely on this, even though it's not strictly true. In practice, the alternative to using --size_t-is-usize (which isn't on by default) is situations like rust-sdl2/sdl_bindings.rs at 56f36cfb732166fd86994ea3febfebeb0dbf718c · Rust-SDL2/rust-sdl2 · GitHub (have fun using that definition on win64).

Honestly, IMO the default for bindgen should change because of these kinds of issues (it's too easy to incorrectly use bindgen), regardless of the decision here.

Regarding the actual question at hand, well, I'm weakly against it. It's very clear that in the past Rust has deliberately made the decision not to support certain exotic platforms in order to make the common case easier to program against, and I think that was a very good move.

That said, I see this as somewhat inevitable, as indicated by my being the author of Add `c_size_t` and `c_ssize_t` to `std::os::raw`. by thomcc · Pull Request #88340 · rust-lang/rust · GitHub.

InfernoDeity · September 23, 2021, 2:05am

To be fair, w65 is less "exotic" than some. It has 8-bit bytes, and (mostly) well-behaved pointers in the ABI defined above (notably an ABI-level change made to accommodate Rust: see this Zulip discussion (archive)). The main issue is the one at hand: whether size_t can differ from uintptr_t.

josh · September 23, 2021, 2:33am

What size is ptrdiff_t on CHERI?

What is the maximum array size on w65?

I think we need to determine whether usize is the maximum size of any one data item (e.g. the maximum array size, or the reasonable difference between any two pointers in an address space), or if usize can actually hold and round-trip a pointer. Most code expects at least the former; some code additionally expects the latter. Despite what the UCG currently says, I feel more willing to break the assumptions of the latter (and tell them they need to use uintptr_t types) than the former (and force FFI to use c_size_t or similar).

I would, on balance, prefer if usize was size_t, rather than guaranteeing the full size of a pointer.

InfernoDeity · September 23, 2021, 2:51am

Unfortunately, my knowledge of CHERI ends with the basics I already mentioned, so I can't offer a definitive answer. However, I would assume ptrdiff_t is just a signed size_t, so it would also be 64-bit. Pointers and uintptr_t are 128 bits because they store a machine address and permissions.

The maximum size of any object, which includes an array with an element size >= 1, in the w65 ABI mentioned above is SIZE_MAX, 65535 bytes. Whether this also restricts the maximum extent of an array of a ZST isn't answered by the ABI (i.e., can you have only [(); 65535] as a type, or is [(); usize::MAX] a valid type that can be allocated, given that its size is within the maximum size for the ABI?).

I'm not sure it would be valid to do so, as multiple official sources of documentation state that usize is pointer-sized, varying from non-normative (Reference, UCG) to "probably normative" (I would personally consider the stdlib docs normative, at least with respect to the standard library). If it were possible, that would be ideal imo, but I doubt that it can be done without breaking actual guarantees.

CAD97 · September 23, 2021, 3:21am

cc @RalfJung for fun provenance things since usize can't "really" hold a pointer (plus all of its shadow state) in the first place.

However, I think this is basically entirely a nonstarter of a change, at least if you want to say this is always the case (even on platforms where it happens that size_t === uintptr_t). It is safe to do ptr as usize as *mut _ (only dereferencing the result is unsafe), and if size_of::<usize>() < size_of::<c_uintptr_t>(), then this would silently truncate the value.
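
A small sketch of the cast chain in question, with invented function names. The second function uses u16 purely to simulate what a usize narrower than a pointer would do; it does not model any real target.

```rust
/// Round-trips a pointer through `usize` using only safe casts.
pub fn round_trip(p: *mut u32) -> *mut u32 {
    let addr = p as usize; // pointer-to-integer cast
    addr as *mut u32 // integer-to-pointer cast
}

/// Casts a pointer to a narrower integer. Any address bits above the low 16
/// are dropped, and the compiler reports no error.
pub fn truncated_address(p: *const u32) -> u16 {
    p as u16
}
```

Today the round trip preserves the address; if usize behaved like the u16 here, the same safe code would quietly produce a different pointer.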

A lot of code does pointer-to-usize casts in order to do bitbanging tricks with the pointer value (e.g. alignment tagging, NaN boxing, etc.). I will be hard against any changes that cause this to silently break on any supported platform.
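
As one example of such tricks, here is a minimal alignment-tagging sketch. The names `TAG_MASK`, `tag`, and `untag` are invented for illustration, and the code assumes u32 is at least 2-byte aligned, so the low bit of a valid address is free.

```rust
/// Mask for the single low bit used as a tag.
const TAG_MASK: usize = 0b1;

/// Packs a boolean flag into the low bit of a pointer's address.
pub fn tag(p: *mut u32, flag: bool) -> usize {
    let addr = p as usize;
    debug_assert_eq!(addr & TAG_MASK, 0, "pointer must be at least 2-byte aligned");
    addr | flag as usize
}

/// Splits a tagged value back into the original pointer and the flag.
pub fn untag(tagged: usize) -> (*mut u32, bool) {
    ((tagged & !TAG_MASK) as *mut u32, tagged & TAG_MASK != 0)
}
```

This only works if usize can hold every bit of the address, which is exactly the property that would be lost if usize were narrower than a pointer.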

The only way I'd personally even consider usize !== c_uintptr_t is if this as cast is special cased to be a hard error when it would lose data, notably unlike ptr as u32.

The assumption that usize === size_t === uintptr_t is too ingrained in the ecosystem to allow code making this assumption in the obvious manner to silently miscompile.

quinedot · September 23, 2021, 4:39am

usize (formerly uint) being defined as the size of a pointer can be seen in RFC 0544 and various other threads around that period. More recently, RFC 1861 also defined a pointer to an extern type to be the same size as a usize, and it's also implicit in RFC 2580 (e.g. Thin).

H2CO3 · September 23, 2021, 6:10am

I have seen official-ish recommendations to treat size_t as usize, even in extern declarations themselves (on my phone right now so I'm not gonna dig them up).

Otherwise, I think breaking this de facto guarantee would be devastating. The weird platforms that you wish to support should be fixed in some other way, but most definitely not by breaking the other 99% of the whole ecosystem.

Isn't Rust supposed to do away with C's weirdness, anyway? This is a bad enough corner case that we must not explicitly try to guide people towards shooting themselves in the foot.

Disagree. Authors of half-decent C libraries know how to write portable code. The problem is that due to the lack of fixed-size integer types before C99, most libraries resort to defining their own integer types. We should instead strive to support this idiom, e.g. in bindgen.

Sure, it is easier for language designers to throw up their hands and say "you are holding it wrong", but that's hardly the right thing to do in a language that is centered around correctness. There has been so much discussion around perceived "ergonomics" of insignificant syntactical features, it's not even funny. Why aren't we focusing on ergonomics in this case, too? Let's try to declare rules that make the simple/"obvious" approach correct. Anything else is highly detrimental to the usability of the language.

quinedot · September 23, 2021, 9:11am

Existing issue: Support index size != pointer width.

Related: core dependency on c_int in memcmp.

InfernoDeity · September 23, 2021, 12:11pm

How would you suggest this be fixed without penalizing the ABI (and possibly codegen/assembly heavily) of arguably the most important functions in existence: memcpy, memmove, memcmp, memchr, and memset?

This still suggests alternative 3 is a reasonable option. I highly doubt most portable FFI was developed to target freestanding platforms. Also, even if the library itself is truly portable, the FFI still may not be. As mentioned, libSDL incorrectly relies on unsigned long being u64 on 64-bit platforms, which is not true on Windows. Bindings generated by bindgen are not portable and need to be regenerated on host systems. Also, I am dubious about whether people really do write portable code: I have seen a lot of things that rely on int being 32-bit, long being 64-bit, long being 32-bit, size_t being uintptr_t, etc. without preprocessor guards; none of this, except code relying on long being 32-bit, will work on w65. In fact, in my first draft of such an ABI, I considered and made int 32-bit and long 64-bit. However, it doesn't take a benchmark to figure out that forcing two 16-bit memory accesses (6 cycles each since they can be to direct page) to pass a parameter and to read the parameter is not the most efficient ABI possible on the platform, and int is intended to be the most efficient integer type. That is why, in the latest attempt (the version I listed), int, long, and long long all use their Standard-defined minimum widths.

That being said, even if this is the case, would this not be fixed by Rust defining a c_size_t integer in std::os::raw and potentially in core::ffi (which is proposed in https://github.com/rust-lang/rust/issues/88345)? FFI could use this without issue if std is available. Defining it in core would be tricky, though, as IIRC it does depend on the operating system.

To me, the obviously correct approach would be to have different core language types - usize and uptr, where usize is the size type, and uptr is the pointer-sized type. However, as argued, this ship has probably sailed.

InfernoDeity · September 23, 2021, 12:43pm

I was wondering if any RFCs guaranteed this - I would consider those 100% normative. If an RFC guarantees `uintptr_t` = `usize`, then changing that would definitely be a breaking change.

kornel · September 23, 2021, 12:44pm

I strongly disagree.

Rust sizes and indexes collections by usize, and that is size_t's job, not uintptr_t's. There's even size in the name!

I know some Rust spec somewhere said usize is meant to be uintptr_t, but this has been largely ignored by everyone. Given that even Rust's own libstd is confused about it, the spec should be changed to equate usize with size_t, rather than trying to change the whole world to the mistaken spec.

I suggest redefining usize as exactly size_t, and adding uintptr to Rust instead.

InfernoDeity · September 23, 2021, 12:53pm

The issue is the impact of the change: as mentioned, Rust unsafe code relies on the definition of usize being the pointer-sized type. It would be a significant breaking change to reverse this definition, especially without any kind of error, as @CAD97 mentioned. In contrast, the amount of code that uses usize as size_t for soundness purposes (so, really, FFI), that would otherwise work without modification on freestanding w65, and that would be silently broken by this change is, imo, likely comparatively small. Code that, in Rust, indexes slices/containers by usize is not an issue since it doesn't cross an FFI boundary, and rustc can turn it into the proper type when performing pointer arithmetic; it only suffers a penalty in its ABI on w65 (which likely could be optimized away).

nacaclanga · September 23, 2021, 1:02pm

In an ideal world, this would be the case. The problem is how to fix this. Currently Rust code assumes usize to be uintptr_t. To fix this, one would have to do these things:

a) Introduce a new uaddr type that is coercible into usize and vice versa. This works only on platforms where size_t is uintptr_t.

b) Adjust the signatures of std functions to use the correct type (uaddr or usize).

c) Add a lint warning when coercing between usize and uaddr.

d) Add targets where size_t is not uintptr_t. On these targets, compiling code that tries to coerce between uaddr and usize should simply fail.

e) In a new Rust edition, disallow coercing between usize and uaddr.
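
To give a feel for the end state of these steps, here is a rough sketch that models the proposed type as a library newtype. `UAddr` and its methods are hypothetical; a real uaddr would be a built-in integer type, and on today's targets it would simply have the same width as usize.

```rust
/// Hypothetical address-sized integer, standing in for the proposed `uaddr`.
/// It wraps `usize` here because the two have the same width on current targets.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct UAddr(usize);

impl UAddr {
    /// Captures the address of a pointer.
    pub fn from_ptr<T>(p: *const T) -> Self {
        UAddr(p as usize)
    }

    /// Turns the stored address back into a pointer.
    pub fn as_ptr<T>(self) -> *const T {
        self.0 as *const T
    }

    /// Explicit conversion to a size or index. There is deliberately no
    /// implicit coercion, so every crossing between the two roles is visible.
    pub fn to_usize(self) -> usize {
        self.0
    }
}
```

Because the conversions are explicit method calls, a target where the two types differ could make `to_usize` fallible or remove it, and every affected call site would show up at compile time.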

chrisd · September 23, 2021, 3:52pm

You've never seen code that uses pointer as usize? Or uses a union of the two? I've seen these patterns quite a bit.

scottmcm · September 23, 2021, 6:02pm

Interestingly, outside of ZSTs usize is arguably too big for indexing. It would have been nice if array lengths and size_of could have been a different type that only takes the values usize ∩ isize. That niche would be handy, and would immediately remove a bunch of the messier safety preconditions from from_raw_parts and such.

Of course that's another of those "needs a time machine" problems, I think. And it wouldn't help here at all because it'd be neither size_t nor ptrdiff_t.
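
The "usize ∩ isize" idea can be sketched as a library newtype. `Len` is a made-up name for this example; the real suggestion concerns a built-in type with a niche, which a plain wrapper like this does not provide.

```rust
/// Hypothetical length type restricted to values that are valid for both
/// `usize` and `isize`, i.e. `0..=isize::MAX`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Len(usize);

impl Len {
    /// Largest value a `Len` may hold.
    pub const MAX: usize = isize::MAX as usize;

    /// Returns `None` when `n` exceeds `isize::MAX`.
    pub fn new(n: usize) -> Option<Len> {
        if n <= Self::MAX {
            Some(Len(n))
        } else {
            None
        }
    }

    /// The length as a `usize`.
    pub fn get(self) -> usize {
        self.0
    }

    /// The length as an `isize`. This cannot overflow because of the check in `new`.
    pub fn as_isize(self) -> isize {
        self.0 as isize
    }
}
```

With the range checked at construction, functions taking a `Len` would not need a separate "must not exceed isize::MAX" precondition.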

quinedot · September 23, 2021, 6:51pm

I wouldn't say it's "some spec somewhere"; every issue, documentation, and RFC I've found so far is unambiguous about usize being pointer sized. This includes the RFC that renamed int and uint to isize and usize.

Drawbacks of isize/usize:

The names fail to indicate the precise semantics of the types - pointer-sized integers. (And they don't follow the i32/u32 pattern as faithfully as possible, as 32 indicates the exact size of the types, but size in isize/usize is vague in this aspect.)

The names favour some of the types' use cases over the others.

The names remind people of C's ssize_t/size_t, but isize/usize don't share the exact same semantics with the C types.

Size being in the name was just bike-shedding.

jacobbramley · September 23, 2021, 7:27pm

It does not, unfortunately. Whilst CHERI and its implementations so far are research architectures, they aim to explore complete deployments of full, conventional systems.

That said, declaring usize and a hypothetical uptr to have different types only for CHERI is a plausible approach for piecemeal deployment; the subsequent changes necessary in crates will probably be similar to those necessary for much C and C++ code.

jacobbramley · September 23, 2021, 7:40pm

For today's purposes, 64 bits.

If you read through the CHERI papers and reports, you'll find a few variants and encodings that they have explored through the years. We shouldn't rule those out, but our concern for now is really about 64-bit address spaces and 128-bit capabilities, like Arm's Morello.

Note that a uintptr_t, holding a 128-bit capability, still behaves like a 64-bit integer if you treat it as a plain arithmetic type. This makes it very inefficient when used only as a size, and is one reason that I dislike the usize-is-uintptr_t approach suggested elsewhere (though it does make a very useful step).

Another is that it's not easy to determine how to compile {usize} + {usize}, since it's nonsensical to add two pointers, but either of those arguments could have capability provenance (and it could change at run time).

quinedot · September 23, 2021, 8:42pm

Ping @gnzlbg re: Deprecate pointer-width integer aliases · Issue #1400 · rust-lang/libc · GitHub
