Pre-RFC: `usize` semantics

Summary

Change the definition of usize to pave the way for supporting new provenance-based architectures like CHERI, and to better fit current common usage.

The size of this primitive is how many bytes it takes to reference any location in memory. For example, on a 32 bit target, this is 4 bytes and on a 64 bit target, this is 8 bytes.

becomes

The size of this primitive is how many bytes it takes to store a memory address. For example, on a 32 bit target, this is 4 bytes and on a 64 bit target, this is 8 bytes.

Motivation

The existing definition of usize makes adding a CHERI (and likely any similar architecture) target to Rust unnecessarily difficult. These architectures are built around the idea that there is more to a pointer than just an address, which breaks many assumptions about how pointers and usize behave. Given the safety and security focus of CHERI, and its increasing prominence of late, it seems likely that the Rust community will want to have the option to support it in future.

Additionally, the current definition does not seem to match up well to real world usage of Rust. Updating the definition as proposed would make it a better fit for use cases that appear to be by far the most common: indexing and calculations about object sizes and addresses.

This change makes a lot of difference in making it easier to support CHERI, but requires very little actual work (three lines of documentation need updating), and doesn't break any existing code on any existing target.

It makes sense to have this discussion now because at least two academic groups have separately run experiments with CHERI ports of Rust. The two experiments provide some useful perspective on how different definitions of usize work with CHERI.

Guide-level explanation

At this point in time, this change only affects wording in documentation. It will matter more for some targets that might exist in future, but on current targets no code changes are needed, and no breakage is expected.

The most obvious possible target that would complicate this is CHERI, which adds metadata to pointers (in addition to addresses). On targets like CHERI it becomes important that usize will specifically hold addresses (not metadata) which makes sure that common kinds of index, address, and size calculations continue to work properly. Safe Rust code is automatically forward compatible with CHERI, though some pointer casts could produce warnings when building for CHERI. &data as *mut u32 as usize will still get the address of data as you'd currently expect, even on CHERI targets (where the metadata would be discarded). On all current targets this cast remains a no-op as far as hardware is concerned.

Examples

use std::mem::size_of;
use libc::{size_t, uintptr_t};

assert_eq!(size_of::<usize>(), size_of::<*const u8>()); // will pass on current targets, will fail on CHERI.
assert_eq!(size_of::<usize>(), size_of::<size_t>()); // will pass on current targets, will pass on CHERI, likely to always pass.
assert_eq!(size_of::<usize>(), size_of::<uintptr_t>()); // will pass on current targets, will fail on CHERI.
let data = 314159 as u32;
let pointer = &data as *const u32;
let _ = pointer as usize; // continues to get address of data, as expected, on all targets.
let _ = unsafe { *(pointer as usize as *const u32) }; // will work on current targets, will crash on CHERI.

FFI

libc::size_t remains an alias of usize on all targets, and is unlikely to change in future.

libc::uintptr_t remains an alias of usize on current targets, but may change on future targets like CHERI.

What if I want my unsafe code to be CHERI compatible?

The following operations will fail on CHERI and similar targets:

unsafe {
	let data = 314159 as u32;
	let pointer = &data as *const u32;
	let address = pointer as usize;
	let _ = *(address as *const u32); // fails at dereference.
	std::mem::transmute::<usize, *const u32>(address); // fails because types have different sizes.
	std::mem::transmute::<*const u32, usize>(pointer); // fails because types have different sizes.
}

They can be avoided in a few ways, here are some example cases:

“I want to align a pointer”

unsafe fn align<T>(pointer: *mut T, alignment_bytes: isize) -> *mut T {
	let misalignment = (pointer as usize % alignment_bytes as usize) as isize;
	if misalignment == 0 {
		// Already aligned, do nothing.
		pointer
	} else {
		// Adjust alignment without casting to an integer type.
		(pointer as *mut u8).offset(alignment_bytes-misalignment) as *mut T
	}
}
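
As a variation on the above, here is a minimal sketch that avoids `unsafe` by using `wrapping_add` instead of `offset`. The name `align_up` and the power-of-two requirement are assumptions made for this illustration, not part of the proposal. The address is only ever read as a usize; the integer is never cast back into a pointer.

```rust
/// Rounds `pointer` up to the next multiple of `alignment_bytes`.
/// Assumption for this sketch: `alignment_bytes` is a non-zero power of two.
fn align_up<T>(pointer: *mut T, alignment_bytes: usize) -> *mut T {
    assert!(alignment_bytes.is_power_of_two());
    // Inspect the address only; no integer-to-pointer cast happens anywhere.
    let misalignment = pointer as usize & (alignment_bytes - 1);
    // Zero when the pointer is already aligned.
    let padding = (alignment_bytes - misalignment) & (alignment_bytes - 1);
    // Move the original pointer, so any metadata it carries travels with it.
    (pointer as *mut u8).wrapping_add(padding) as *mut T
}

fn main() {
    let mut buffer = [0u8; 64];
    // Start one byte past the beginning of the buffer.
    let start = buffer.as_mut_ptr().wrapping_add(1);
    let aligned = align_up(start, 16);
    assert_eq!(aligned as usize % 16, 0);
    assert!((aligned as usize) - (start as usize) < 16);
    // The result still points into `buffer`, so it remains usable.
    unsafe { *aligned = 7 };
    let index = (aligned as usize) - (buffer.as_ptr() as usize);
    assert_eq!(buffer[index], 7);
}
```

Because `wrapping_add` is a safe method, the caller only needs `unsafe` at the point of dereference.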

“I want to store a flag in the low bit of a pointer”

fn is_tagged<T>(pointer: *mut T) -> bool {
	(pointer as usize & 1) == 1
}
unsafe fn set_tag<T>(pointer: *mut T, tag: bool) -> *mut T {
	// Remove any previous tag.
	let clean = pointer as usize & (!1 as usize);
	// Add tag.
	let tagged = clean | if tag { 1 } else { 0 };
	// Apply to pointer without casting to an integer type.
	let offset = if tagged < pointer as usize {
		-((pointer as usize - tagged) as isize)
	} else {
		(tagged - pointer as usize) as isize
	};
	(pointer as *mut u8).offset(offset) as *mut T
}
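
The same idea can probably be written more compactly. The following is a sketch only, assuming the pointee has an alignment of at least 2 so that the low bit is free; it replaces the signed offset calculation with `wrapping_sub` and `wrapping_add`, which also removes the need for `unsafe` in `set_tag`.

```rust
fn is_tagged<T>(pointer: *mut T) -> bool {
    (pointer as usize & 1) == 1
}

fn set_tag<T>(pointer: *mut T, tag: bool) -> *mut T {
    let bytes = pointer as *mut u8;
    // Step back by one byte if a tag is already present, by zero otherwise.
    let clean = bytes.wrapping_sub(pointer as usize & 1);
    // Step forward by one byte if the tag is requested.
    clean.wrapping_add(tag as usize) as *mut T
}

fn main() {
    let mut value = 314159u32;
    let pointer = &mut value as *mut u32;

    let tagged = set_tag(pointer, true);
    assert!(is_tagged(tagged));

    // Clearing the tag returns the original pointer, which is still usable.
    let untagged = set_tag(tagged, false);
    assert!(!is_tagged(untagged));
    assert_eq!(untagged, pointer);
    assert_eq!(unsafe { *untagged }, 314159);
}
```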

“I want a type compatible with C's uintptr_t for FFI”

Use libc's uintptr_t alias, or if you really must, a void pointer.

“I want to store a pointer in a usize”

You can't, they're fundamentally different types. If you could do this, CHERI's safety guarantees wouldn't mean anything any more.
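
In many cases the storage can simply be given a pointer type instead. As a sketch, code that keeps a pointer in an AtomicUsize could use the standard library's `AtomicPtr`; the `Slot` type and its methods below are hypothetical names invented for this illustration.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical shared slot that stores a pointer as a pointer, not as a usize.
struct Slot {
    current: AtomicPtr<u32>,
}

impl Slot {
    fn new() -> Self {
        Slot { current: AtomicPtr::new(ptr::null_mut()) }
    }

    fn store(&self, pointer: *mut u32) {
        self.current.store(pointer, Ordering::Release);
    }

    fn load(&self) -> *mut u32 {
        self.current.load(Ordering::Acquire)
    }
}

fn main() {
    let mut value = 314159u32;
    let slot = Slot::new();
    assert!(slot.load().is_null());

    slot.store(&mut value);
    // The loaded value is a real pointer, so no integer-to-pointer cast is needed.
    let pointer = slot.load();
    assert_eq!(unsafe { *pointer }, 314159);
}
```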

“What about the strict provenance API?”

#![feature(strict_provenance)]
fn do_something_unhelpful<T>(pointer: *mut T) -> *mut T {
	pointer.with_addr(pointer.addr() ^ 0b10101010)
}

Reference-level explanation

The resulting sizes of usize on different targets:

| Target | Bit-width of usize | Bit-width of pointers |
|---|---|---|
| x86 | 32 | 32 |
| x86-64 | 64 | 64 |
| arm | 32 | 32 |
| aarch64 | 64 | 64 |
| Morello (aarch64+CHERI) | 64 | 128 |

The definition of isize is unchanged, so it continues to be the same bit-width as usize.
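
To see where the host target falls in this table, a small check along the following lines could be used. This is a sketch; the helper name `widths` is made up for illustration. Under the proposed definition, the pointer width may exceed the usize width but should never be smaller.

```rust
use std::mem::size_of;

/// Returns the bit-widths of (usize, isize, *const u8) on the current target.
fn widths() -> (u32, u32, u32) {
    let pointer_bits = (size_of::<*const u8>() * 8) as u32;
    (usize::BITS, isize::BITS, pointer_bits)
}

fn main() {
    let (usize_bits, isize_bits, pointer_bits) = widths();
    println!("usize: {usize_bits} bits, isize: {isize_bits} bits, pointer: {pointer_bits} bits");
    // isize always matches usize.
    assert_eq!(usize_bits, isize_bits);
    // Equal on current targets; strictly greater on a target like Morello.
    assert!(pointer_bits >= usize_bits);
}
```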

Standard library functions and types

No changes are needed to core or std. Functions that operate on indices and lengths (slice.get(), slice.len()) should continue to use usize. Functions that operate on addresses and sizes (std::mem::size_of(), pointer.offset()) should continue to use usize/isize. Neither core nor std have any use of usize equivalent to C's uintptr_t. It may be worth noting that the documented behavior of ptr::hash() becomes ambiguous on systems with pointer metadata.

libc's size_t and uintptr_t aliases will need to be set appropriately for any future CHERI target, and might benefit from being examined sooner. size_t is currently aliased to usize, which will continue to be correct. uintptr_t is also aliased to usize, which will no longer be guaranteed to be correct. There are multiple options for replacements. Whatever is done is unlikely to affect users as size_of::<size_t>() == size_of::<uintptr_t>() on all current targets.

Strict provenance

The experimental strict provenance API is already a good fit for any possible CHERI target, and already uses types compatible with this proposal. Specifically, it already uses usize in ways that are correct for this definition.

Examples on a CHERI target:

let data = 314159 as u32;
let pointer = &data as *const u32;
assert_eq!(pointer.addr(), pointer as usize); // passes.
assert_eq!(pointer.with_addr(pointer.addr()), pointer); // passes.
assert_eq!(unsafe { *pointer.with_addr(pointer.addr()) }, data); // passes.

Corner cases

Targets using some banked memory configurations (i.e. a complete address contains both a bank number and an offset within the bank) may need further examination. Rust currently supports no targets like this.

Concrete changes to make

Update documentation (standard library, language reference, unsafe guide).

Update size_t and uintptr_t aliases in libc.

Drawbacks

This change overturns a long-standing definition. While it might not break code in the sense of compilation, it may make the intent of some code incorrect. This is unfortunate, but does not seem to be avoidable.

The definition chosen for this RFC is subtly different to the definition of size_t used by C. This could in theory become awkward for some targets (see explanation in Rationale section).

Rationale and alternatives

Current situation

Several pieces of documentation define usize to be a pointer-sized integer compatible with C's uintptr_t:

In practice, use of usize seems to be heavily weighted toward indexing and size calculations rather than manipulation of entire pointers. Given that indexing operations only accept usize, and the frequency of array accesses in everyday use of Rust, it seems likely this is one of the most common uses of usize.

usize also sees use in FFI as both size_t and uintptr_t. Bindgen already assumes that usize is equivalent to C's size_t by default (https://github.com/rust-lang/rust-bindgen/pull/2062).

Typical use cases for the current uintptr_t definition are casting a pointer to an integer for manipulation or storage. Provenance analyses used for compiler optimisations (see discussions around strict provenance) and the rise of platforms like CHERI may make these sorts of manipulations increasingly undesirable.

New Evidence

Work on CHERI has led to two experiments in adding CHERI targets to Rust, both targeting ARM's Morello CPU (aarch64+CHERI).

The first was by Nicholas Sim (https://github.com/nw0/rust/tree/cheri), who chose to use the current definition of usize. This made usize 128 bits wide to match Morello's 128 bit pointers. In the resulting dissertation (https://nw0.github.io/cheri-rust.pdf) Sim reported that this caused significant overhead for indexing, and that limited support for 128 bit integers in LLVM made the implementation difficult. Sim's overall assessment was that this approach should be avoided. The results of Sim's investigation were fed into a number of community discussions at the time.

The second was by a team at the University of Kent (https://github.com/kent-weak-memory/rust), this time using the definition of usize given in this RFC. This made usize 64 bits wide to match Morello's 64 bit address space. This port currently passes all tests for core, and can compile many crates from crates.io. In the process, the team found exactly one component of std that used a round-trip pointer-usize-pointer conversion, which has since been replaced in mainline std (independent of the researchers). This appears to provide some support for our prediction that usize is rarely used to hold entire pointers.

Options

There are a number of plausible definitions for usize. The definitions are as follows:

  • use C's definition for size_t, in short: an integer large enough to describe the largest object/allocation size
  • use C's definition for uintptr_t, in short: an integer which can store a pointer that will still be usable afterward
  • the definition presented in the summary: an integer that can hold any address, but not necessarily a complete pointer

The impacts and properties of these definitions on Morello are summarised in the following table:

| Solution | Bit size (Morello)¹ | Indexing² | Round-trip³ | C FFI⁴ | Object size⁵ | Provenance⁶ |
|---|---|---|---|---|---|---|
| usize = size_t | 64 | ok | no*² | yes (size_t) | ok (64 bit) | awkward*⁵ |
| usize = uintptr_t | 128 | inefficient*¹ | yes | yes (uintptr_t) | excessive*⁴ (128 bit) | inefficient*⁶ |
| usize = address | 64 | ok | no*² | yes*³ (size_t) | ok (64 bit) | ok |

¹ What size is usize on Morello?

² How well does usize work for indexing arrays?

³ Is *mut T as usize as *mut T guaranteed to work?

⁴ Can usize be used to define types for C FFI?

⁵ What is the maximum size of an object? usize defines the width of isize. The maximum value of isize defines the maximum object size. Result: usize defines the maximum object size.

⁶ Does the experimental strict provenance API work?

*¹ 128 bit indices cause extra overhead on a system that uses 64 bit words.

*² Round trips will fail on CHERI if usize is smaller than a 128 bit pointer.

*³ Still matches size_t on all current targets, but definition is technically different.

*⁴ On Morello, causes 128 bit object size, but address space is only 64 bit.

*⁵ Provides enough space to hold an address on current targets, but definition doesn't guarantee that.

*⁶ If used to hold addresses, 50% of the value will be empty on Morello.

Is CHERI relevant to Rust?

The aims of the CHERI and Rust projects seem to be quite similar in terms of providing better memory safety. Both provide protection against out of bounds accesses, and both seek to provide some measure of protection against use after free and similar bugs. This becomes useful for Rust in unsafe code, where CHERI can restore protection of many of the safety properties that unsafe traditionally loses, though with the limitation of only doing so at run time. Remember that errors in unsafe can propagate outward and break safety guarantees in otherwise safe code, so even though unsafe code is quite rare this could have an effect on much more!

Possible Extension: Warnings

We could gather further information for discussion of this RFC by separately introducing warnings designed to detect uses of usize incompatible with the proposed definition. These would need to be gated behind an unstable feature, and could be used to evaluate how many crates use an incompatible idea of usize by compiling a broad cross-section, possibly via a Crater run.

Cases where warnings could be useful:

  • value_usize as *mut T
  • transmute::<usize, *mut T>()
  • transmute::<*mut T, usize>()
  • atomic_usize.to_ptr()

Unions provide another way to cast between types, and it would be useful to warn about usize to pointer casts here too. By their nature, unions allow a wide variety of powerful casting operations, while providing very little information about the programmer's intent. Unfortunately, this makes issuing meaningful warnings about unions difficult. It's notable that there are already no warnings for other unsafe uses of unions, like casting between types of different sizes. Given that unions are already highly unsafe in a number of ways, and usually only useful for FFI, not warning on this case does not seem to be a significant problem.

Some examples of ways unions can cause problems:

union Problem1 {
	a: usize,
	b: *const u32,
}
unsafe {
	let data = Problem1{a: 1234};
	let pointer = data.b; // casts to field of different size on CHERI.
	let _ = *pointer; // crashes due to invalid pointer on CHERI.
}

#[derive(Clone, Copy)]
struct A {
	padding: u32,
	data: u32,
}
#[derive(Clone, Copy)]
struct B {
	pointer: *const u32,
}
union Problem2 {
	a: A,
	b: B,
}
unsafe {
	let data = Problem2{a: A{padding: 1234, data: 4321}};
	let pointer = data.b.pointer; // fields have the same size on Morello and possibly other CHERI targets.
	let _ = *pointer; // crashes due to invalid pointer on CHERI.
}

The same or very similar warnings could, as an alternative, provide a tool that users could opt in to as a way to check their code for CHERI compatibility. This would be less inconvenient than having to set up an entire cross compilation toolchain, and could be available before any real CHERI target has been implemented.

Possible Extension: uintptr_t

As currently written, this RFC provides no built in representation for pointers other than... pointers. It could be useful to provide a built in integer type that can contain a valid pointer for purposes of arithmetic manipulation, though given that Rust's pointers already provide good facilities for this, it may not be very useful. The University of Kent port currently only has to define uintptr_t for FFI use, and they have been able to make do by aliasing *mut () without any problems (that have so far been found).
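
For illustration only, one way a pointer-backed stand-in might look is a transparent wrapper around `*mut ()`. This is a sketch under assumptions: the name `UintPtr` and its methods are hypothetical, and the port described above uses a plain alias rather than a newtype.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for C's uintptr_t, backed by a pointer rather than an integer.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct UintPtr(*mut ());

impl UintPtr {
    fn from_ptr<T>(pointer: *mut T) -> Self {
        UintPtr(pointer as *mut ())
    }

    fn as_ptr<T>(self) -> *mut T {
        self.0 as *mut T
    }

    /// The address alone, without any pointer metadata.
    fn addr(self) -> usize {
        self.0 as usize
    }
}

fn main() {
    let mut value = 314159u32;
    let pointer = &mut value as *mut u32;
    let stored = UintPtr::from_ptr(pointer);

    // Same size as a pointer on every target, including ones where usize is narrower.
    assert_eq!(size_of::<UintPtr>(), size_of::<*mut ()>());
    assert_eq!(stored.addr(), pointer as usize);
    // The round trip goes pointer to pointer, so it never passes through an integer.
    assert_eq!(unsafe { *stored.as_ptr::<u32>() }, 314159);
}
```

A wrapper like this offers no integer arithmetic, which is the trade-off raised under Unresolved questions.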

Prior art

usize has been discussed at length in several places. There have been (at least) two previous RFC proposals:

One pre-RFC on the internals forum:

...and a number of tickets for various repositories:

None of these discussions seems to have reached much of a satisfactory conclusion. It seems that the community has leant somewhat in the same direction as this RFC as time has gone on.

In addition to these discussions, there have been at least two academic attempts to port the Rust compiler to CHERI (both mentioned above) which have had to grapple with usize:

Unresolved questions

While not immediately an issue for this RFC, it isn't clear how to define aliases for uintptr_t. Using *mut () may work, but isn't technically an integer type, so it could cause unexpected errors for users.

Also not of immediate concern: how many crates will not be forward compatible with CHERI? Adding feature-gated compatibility warnings and compiling a large number of crates (possibly via a Crater run) seems like the easiest way to answer this.

Future possibilities

The main aim for this proposal is to make it possible in future to add support for CHERI or any similar architectures that might appear. Hopefully we've already made the case for why this is interesting! A full CHERI target would still be a significant undertaking, as we have found from our own work on building one.

This proposal fits nicely with the strict provenance API, and could benefit from it, but it is also not dependent upon it.

I'm a bit confused by the first two quotes in your proposal, because both refer to "location in memory" or "memory address", which can be interpreted like uintptr_t. I think the wording could be more explicit about equivalence to size_t.

I wholeheartedly agree with officially accepting usize as size_t. IMHO it's already assumed to be size_t in too many places, and the boat has sailed. Trying to enforce its usage as uintptr_t that Rust has originally intended would cause more churn than accepting the de-facto usage.

Rust is already experimenting with pointers with provenance, and this is a chance to create new CHERI-compatible types, instead of burdening usize with additional bits.

I think it would be good to write these examples using strict provenance APIs instead of using as casts. That will likely result in more clear code, and after all CHERI was one of the motivations for having the strict provenance APIs in the first place.

How does this work with targets that want size_t < uintptr_t but also ptraddr_t = uintptr_t, namely w65 and 8086 (with far pointers)?

A mitigation option would be to be very picky and add a crate attribute #![assert_usize_pointer_equivalence_guranteed = true/false/auto].

The default option (if no attribute is specified), is auto, which evaluates to true if:

  • Unions are used
  • Certain specifically labeled functions in core/alloc/std are used
  • Direct casts between pointers and usize are used.
  • Upstream dependencies have this option set to true or auto evaluating to true.

In doubt auto should evaluate to true.

If the value is set to false, any code that assumes usize-to-pointer cast equivalence is showing undefined behavior. Such a crate must not have any upstream dependencies, where the value is true or auto evaluating to true.

On most targets, if the value is set to true or auto evaluating to true, a warning is issued, like: "This crate may exploit usize to pointer equivalence and thus cannot be compiled on some targets. If you do not rely on this behavior you can state so by adding #![assert_usize_pointer_equivalence_guranteed = false] in the crate root."

On some targets like CHERI, where size_t != uintptr_t, if the value is set to true or auto evaluating to true, the crate simply cannot be compiled.

In a future edition of Rust, the default for assert_usize_pointer_equivalence_guranteed should be changed to false. A migration tool would then only have to add the attribute explicitly if it is missing.

Possibly it would make sense to introduce the attribute a few releases prior to issuing the warning, so upstream dependencies can prepare for the situation to not scare everybody.

I think this RFC makes no progress towards supporting those targets. It helps with other targets like CHERI, but it doesn't magically solve all "weird" targets.

I would like to see those addressed by any action on this issue, and I'm certainly willing to help with that.

The problem is that for CHERI we have a nice and non-invasive solution that should hopefully get us very far -- no new integer type required, "just" use strict provenance everywhere and don't cast/transmute between integers and pointers ever.

For your targets, I am not aware of any solution that would be similarly non-invasive. Or is there one I have missed?

So, it seems like we have to choose between blocking everything on some way to actually introduce a new integer type into Rust, or making progress on CHERI without a new integer type, but unfortunately also without progress for other targets.

The main thing I'm worried about is progress here permanently barring progress on those other targets. IE. picking a solution that not only cannot be applied to those targets, but also cannot be expanded or modified to include those targets.

I don't see how adopting this solution for CHERI makes the other targets any harder to support.

And I don't think it is reasonable to block progress on CHERI on also supporting these other targets, when there is no credible proposal for how to do that with similar cost for the overall ecosystem.

I'm not sure we can do this, since it's a breaking change of the worst kind -- the kind which will subtly invalidate assumptions of unsafe code in uncheckable ways.

We've made changes to the language that are "soft" breaking before. This seems like pretty low risk.

But if we want to really hedge our bets, we could force-forbid fuzzy-provenance-casts for every crate when compiling for CHERI or other sizeof(usize) != sizeof(*const _) targets.

I'm curious as to what your proposed alternative solution to this problem would be.

The issue is that it suggests folding usize into size_t, but that guarantees strict_provenance apis are correct to categorize code. If both of these apply generally (and not specific to CHERI), then that would prohibit w65/8086 which would only satisfy strict_provenance being correct for uintptr_t, not size_t.

The 8086 is very old. Is it even relevant at all today (e.g. I know some old CPU architectures are still used in embedded/industrial)? And I have been unable to find anything at all relevant about w65.

There are lots of old architectures that rust will never support. I would love to see rust on a vacuum tube computer (presumably as a hobby thing) or maybe an old Soviet base-3 system. Realistically neither will happen. What makes 8086 and w65 more relevant?

The formal name of w65 is WDC 65c816. I use the abbreviated form from GNU's config.sub adopted by lccc.

And yes, I would like to support 8086 and w65 in lccc, allowing people to write code for these old platforms. For the latter, I have an entire project about this called SNES-Dev. Other than the size_t issue (which, for 8086 applies only to the far ptr abi), the targets otherwise (nearly) perfectly fit within the current Rust model.

My alternative solution would be that CHERI-rustc is not quite a compliant rustc compiler, the same way their C compilers are not quite spec-compliant either (despite the infamously lenient C spec).

Ultimately a relevant question to how much breakage this causes is whether or not random rust crates without obviously platform-dependent impls on crates.io are generally expected to support CHERI (things like arcswap or dashmap or whatever). If they are, then this is 100% a major breaking change (and frankly you'd want strict_provenance stable first before people go around harassing crate authors with this change); if they're not, that would need to be very clearly communicated and would instead be, imo, a major change to the expectations on using crates.io.

This is fine unless CHERI becomes the dominant architecture. If CHERI comes to dominate, then Rust will deeply regret not addressing it today. And the arguments against making this change today will only apply more in the future, when more code depending on the current behavior exists.

Plus, defining the size of usize as independent from the size of pointers is useful beyond just CHERI.

For those of us who may not be well acquainted with the various types being discussed:

  • uintptr_t: an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the pointer
  • ptraddr_t: unsigned integer large enough for any virtual address
  • size_t: unsigned integer large enough to describe the size of the largest contiguous allocation supported by the target

Making things worse for programmers now just in case an extremely speculative bet pays out (slowly, over the course of years) is questionable even without the breaking change aspect. That (usability) is why usize was defined as both in the first place. If CHERI actually becomes dominant then crates will have had years to consider this when CHERI (and CHERI-not-quite-rustc) was merely "relevant and increasing in popularity".

Uhh, citation needed. Are you talking about even less relevant archs like ancient 8086 (which as Inferno mentioned, this might not even work for), or something else entirely?
