Summary
Change the definition of usize to pave the way for supporting new provenance-based architectures like CHERI, and to better fit current common usage.
The size of this primitive is how many bytes it takes to reference any location in memory. For example, on a 32 bit target, this is 4 bytes and on a 64 bit target, this is 8 bytes.
becomes
The size of this primitive is how many bytes it takes to store a memory address. For example, on a 32 bit target, this is 4 bytes and on a 64 bit target, this is 8 bytes.
Motivation
The existing definition of usize makes adding a CHERI (and likely any similar architecture) target to Rust unnecessarily difficult.
These architectures are built around the idea that there is more to a pointer than just an address, which breaks many assumptions about how pointers and usize behave.
Given the safety and security focus of CHERI, and its increasing prominence of late, it seems likely that the Rust community will want to have the option to support it in future.
Additionally, the current definition does not seem to match up well to real world usage of Rust. Updating the definition as proposed would make it a better fit for use cases that appear to be by far the most common: indexing and calculations about object sizes and addresses.
This change makes a lot of difference in making it easier to support CHERI, but requires very little actual work (three lines of documentation need updating), and doesn't break any existing code on any existing target.
It makes sense to have this discussion now because at least two academic groups have separately run experiments with CHERI ports of Rust.
The two experiments provide some useful perspective on how different definitions of usize work with CHERI.
Guide-level explanation
At this point in time, this change only affects wording in documentation. It will matter more for some targets that might exist in future, but on current targets no code changes are needed, and no breakage is expected.
The most obvious possible target that would complicate this is CHERI, which adds metadata to pointers (in addition to addresses).
On targets like CHERI it becomes important that usize specifically holds addresses (not metadata), which ensures that common kinds of index, address, and size calculations continue to work properly.
Safe Rust code is automatically forward compatible with CHERI, though some pointer casts could produce warnings when building for CHERI.
&data as *const u32 as usize will still get the address of data as you'd currently expect, even on CHERI targets (where the metadata would be discarded).
On all current targets this cast remains a no-op as far as hardware is concerned.
Examples
use std::mem::size_of;
use libc::{size_t, uintptr_t};
assert_eq!(size_of::<usize>(), size_of::<*const u8>()); // will pass on current targets, will fail on CHERI.
assert_eq!(size_of::<usize>(), size_of::<size_t>()); // will pass on current targets, will pass on CHERI, likely to always pass.
assert_eq!(size_of::<usize>(), size_of::<uintptr_t>()); // will pass on current targets, will fail on CHERI.
let data = 314159 as u32;
let pointer = &data as *const u32;
let _ = pointer as usize; // continues to get address of data, as expected, on all targets.
let _ = unsafe { *(pointer as usize as *const u32) }; // will work on current targets, will crash on CHERI.
FFI
libc::size_t remains an alias of usize on all targets, and is unlikely to change in future.
libc::uintptr_t remains an alias of usize on current targets, but may change on future targets like CHERI.
What if I want my unsafe code to be CHERI compatible?
The following operations will fail on CHERI and similar targets:
unsafe {
let data = 314159 as u32;
let pointer = &data as *const u32;
let address = pointer as usize;
let _ = *(address as *const u32); // fails at dereference.
std::mem::transmute::<usize, *const u32>(address); // fails because types have different sizes.
std::mem::transmute::<*const u32, usize>(pointer); // fails because types have different sizes.
}
They can be avoided in a few ways. Here are some example cases:
“I want to align a pointer”
unsafe fn align<T>(pointer: *mut T, alignment_bytes: isize) -> *mut T {
let misalignment = (pointer as usize % alignment_bytes as usize) as isize;
if misalignment == 0 {
// Already aligned, do nothing.
pointer
} else {
// Adjust alignment without casting to an integer type.
(pointer as *mut u8).offset(alignment_bytes-misalignment) as *mut T
}
}
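As a usage sketch for the alignment case above, here is a self-contained variant that works on byte pointers. It swaps `offset` for the safe `wrapping_add`, which is our substitution and not part of the RFC text, and the names `align_up` and `padding_needed` are hypothetical. The address is only ever read through a pointer-to-usize cast; the pointer itself is moved by pointer arithmetic, so any metadata it carries travels with it.

```rust
/// Rounds a byte pointer up to the next multiple of `alignment_bytes`.
/// (Hypothetical helper; `alignment_bytes` must be non-zero.)
fn align_up(pointer: *mut u8, alignment_bytes: usize) -> *mut u8 {
    // Reading the address is fine: this cast keeps only the address part.
    let misalignment = pointer as usize % alignment_bytes;
    if misalignment == 0 {
        pointer
    } else {
        // Move the pointer itself instead of rebuilding it from an integer.
        pointer.wrapping_add(alignment_bytes - misalignment)
    }
}

/// Returns how many bytes `align_up` skipped for a pointer that starts
/// one byte into a local buffer.
fn padding_needed(alignment_bytes: usize) -> usize {
    let mut buffer = [0u8; 64];
    let start = buffer.as_mut_ptr().wrapping_add(1);
    let aligned = align_up(start, alignment_bytes);
    // The result is aligned, and was reached without an integer-to-pointer cast.
    assert_eq!(aligned as usize % alignment_bytes, 0);
    aligned as usize - start as usize
}
```

The padding is always smaller than the requested alignment, so a 64 byte buffer is large enough for the alignments of 8 and 16 used here.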
“I want to store a flag in the low bit of a pointer”
fn is_tagged<T>(pointer: *mut T) -> bool {
(pointer as usize & 1) == 1
}
unsafe fn set_tag<T>(pointer: *mut T, tag: bool) -> *mut T {
// Remove any previous tag.
let clean = pointer as usize & (!1 as usize);
// Add tag.
let tagged = clean | if tag { 1 } else { 0 };
// Apply to pointer without casting to an integer type.
let offset = if tagged < pointer as usize {
-((pointer as usize - tagged) as isize)
} else {
(tagged - pointer as usize) as isize
};
(pointer as *mut u8).offset(offset) as *mut T
}
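A minimal usage sketch for the tagging case, assuming a `u32` target whose alignment leaves the low address bit free. The helpers are restated compactly so the block stands alone; `set_tag` here uses the safe `wrapping_offset` in place of `offset` (our substitution), and `tag_round_trip` is a hypothetical name. The point it illustrates is that the tag can be set and cleared using only pointer-to-usize casts and pointer offsets, never a usize-to-pointer cast.

```rust
/// Returns true if the low bit of the pointer's address is set.
fn is_tagged<T>(pointer: *mut T) -> bool {
    (pointer as usize & 1) == 1
}

/// Sets or clears the low bit by offsetting the pointer.
fn set_tag<T>(pointer: *mut T, tag: bool) -> *mut T {
    let address = pointer as usize;
    let tagged = (address & !1usize) | usize::from(tag);
    // Signed distance between the two addresses: -1, 0 or 1.
    let offset = tagged.wrapping_sub(address) as isize;
    (pointer as *mut u8).wrapping_offset(offset) as *mut T
}

/// Tags a pointer, clears the tag again, and reads through the result.
fn tag_round_trip() -> u32 {
    let mut value: u32 = 7;
    let plain: *mut u32 = &mut value;

    let tagged = set_tag(plain, true);
    assert!(is_tagged(tagged));

    let restored = set_tag(tagged, false);
    assert!(!is_tagged(restored));
    assert_eq!(restored, plain);

    // Only the untagged pointer is dereferenced.
    unsafe { *restored }
}
```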
“I want a type compatible with C's uintptr_t for FFI”
Use libc's uintptr_t alias, or if you really must, a void pointer.
“I want to store a pointer in a usize”
You can't, they're fundamentally different types. If you could do this, CHERI's safety guarantees wouldn't mean anything any more.
“What about the strict provenance API?”
#![feature(strict_provenance)]
fn do_something_unhelpful<T>(pointer: *mut T) -> *mut T {
pointer.with_addr(pointer.addr() ^ 0b10101010)
}
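The same low-bit tagging can be sketched with the strict provenance methods. This assumes a toolchain on which `addr` and `map_addr` are available on raw pointers without the feature gate shown above; the function names are hypothetical. Because `map_addr` derives the new pointer from the old one, only the address changes and the rest of the pointer is carried over.

```rust
/// Sets the low bit of the address, keeping the pointer's provenance.
fn tag<T>(pointer: *mut T) -> *mut T {
    pointer.map_addr(|address| address | 1)
}

/// Clears the low bit of the address, keeping the pointer's provenance.
fn untag<T>(pointer: *mut T) -> *mut T {
    pointer.map_addr(|address| address & !1)
}

/// Reads the tag from the address alone.
fn has_tag<T>(pointer: *mut T) -> bool {
    pointer.addr() & 1 == 1
}

/// Tags and untags a pointer to a local value, then reads through it.
fn provenance_round_trip() -> u32 {
    let mut value: u32 = 42;
    let plain: *mut u32 = &mut value;

    let tagged = tag(plain);
    assert!(has_tag(tagged));

    let restored = untag(tagged);
    assert_eq!(restored, plain);

    // Only the untagged pointer is dereferenced.
    unsafe { *restored }
}
```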
Reference-level explanation
The resulting sizes of usize on different targets:
Target | Bit-width of usize | Bit-width of pointers |
---|---|---|
x86 | 32 | 32 |
x86-64 | 64 | 64 |
arm | 32 | 32 |
aarch64 | 64 | 64 |
Morello (aarch64+CHERI) | 64 | 128 |
The definition of isize is unchanged, so it continues to be the same bit-width as usize.
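The rows of the table can be checked against whichever target the code is built for. The sketch below uses hypothetical helper names and only `std::mem::size_of`. On current targets all three widths agree; under the proposed definition a CHERI target would report pointers wider than usize, so the only relationship that should hold everywhere is that a pointer is at least as wide as usize.

```rust
use std::mem::size_of;

/// Bit-width of usize on the target being compiled for.
fn usize_bits() -> usize {
    size_of::<usize>() * 8
}

/// Bit-width of isize, which this RFC keeps equal to usize.
fn isize_bits() -> usize {
    size_of::<isize>() * 8
}

/// Bit-width of a thin raw pointer; may exceed usize_bits() on CHERI.
fn pointer_bits() -> usize {
    size_of::<*const u8>() * 8
}
```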
Standard library functions and types
No changes are needed to core or std.
Functions that operate on indices and lengths (slice.get(), slice.len()) should continue to use usize.
Functions that operate on addresses and sizes (std::mem::size_of(), pointer.offset()) should continue to use usize/isize.
Neither core nor std has any use of usize equivalent to C's uintptr_t.
It may be worth noting that the documented behavior of ptr::hash() becomes ambiguous on systems with pointer metadata.
libc's size_t and uintptr_t aliases will need to be set appropriately for any future CHERI target, and might benefit from being examined sooner.
size_t is currently aliased to usize, which will continue to be correct.
uintptr_t is also aliased to usize, which will no longer be guaranteed to be correct.
There are multiple options for replacements.
Whatever is done is unlikely to affect users as size_of::<size_t>() == size_of::<uintptr_t>() on all current targets.
Strict provenance
The experimental strict provenance API is already a good fit for any possible CHERI target, and already uses types compatible with this proposal.
Specifically, it already uses usize in ways that are correct for this definition.
Examples on a CHERI target:
let data = 314159 as u32;
let pointer = &data as *const u32;
assert_eq!(pointer.addr(), pointer as usize); // passes.
assert_eq!(pointer.with_addr(pointer.addr()), pointer); // passes.
assert_eq!(unsafe { *pointer.with_addr(pointer.addr()) }, data); // passes.
Corner cases
Targets using some banked memory configurations (i.e. a complete address contains both a bank number and an offset within the bank) may need further examination. Rust currently supports no targets like this.
Concrete changes to make
Update documentation (standard library, language reference, unsafe guide).
Update size_t and uintptr_t aliases in libc.
Drawbacks
This change overturns a long-standing definition. While it might not break code in the sense of compilation, it may make the intent of some code incorrect. This is unfortunate, but does not seem to be avoidable.
The definition chosen for this RFC is subtly different to the definition of size_t used by C.
This could in theory become awkward for some targets (see explanation in Rationale section).
Rationale and alternatives
Current situation
Several pieces of documentation define usize to be a pointer-sized integer compatible with C's uintptr_t:
- https://doc.rust-lang.org/std/primitive.usize.html
- https://rust-lang.github.io/unsafe-code-guidelines/layout/scalars.html#layout-of-scalar-types
- https://doc.rust-lang.org/reference/types/numeric.html#machine-dependent-integer-types
In practice, use of usize seems to be heavily weighted toward indexing and size calculations rather than manipulation of entire pointers.
Given that indexing operations only accept usize, and the frequency of array accesses in everyday use of Rust, it seems likely this is one of the most common uses of usize.
usize also sees use in FFI as both size_t and uintptr_t.
Bindgen already assumes that usize is equivalent to C's size_t by default (https://github.com/rust-lang/rust-bindgen/pull/2062).
Typical use cases for the current uintptr_t definition are casting a pointer to an integer for manipulation or storage.
Provenance analyses used for compiler optimisations (see discussions around strict provenance) and the rise of platforms like CHERI may make these sorts of manipulations increasingly undesirable.
New Evidence
Work on CHERI has led to two experiments in adding CHERI targets to Rust, both targeting ARM's Morello CPU (aarch64+CHERI).
The first was by Nicholas Sim (https://github.com/nw0/rust/tree/cheri), who chose to use the current definition of usize.
This made usize 128 bits wide to match Morello's 128 bit pointers.
In the resulting dissertation (https://nw0.github.io/cheri-rust.pdf) Sim reported that this caused significant overhead for indexing, and that limited support for 128 bit integers in LLVM made the implementation difficult.
Sim's overall assessment was that this approach should be avoided.
The results of Sim's investigation were fed into a number of community discussions at the time.
The second was by a team at the University of Kent (https://github.com/kent-weak-memory/rust), this time using the definition of usize given in this RFC.
This made usize 64 bits wide to match Morello's 64 bit address space.
This port currently passes all tests for core, and can compile many crates from crates.io.
In the process, the team found exactly one component of std that used a round-trip pointer-usize-pointer conversion, which has since been replaced in mainline std (independent of the researchers).
This appears to provide some support for our prediction that usize is rarely used to hold entire pointers.
Options
There are a number of plausible definitions for usize.
The definitions are as follows:
- use C's definition for size_t, in short: an integer large enough to describe the largest object/allocation size
- use C's definition for uintptr_t, in short: an integer which can store a pointer that will still be usable afterward
- the definition presented in the summary: an integer that can hold any address, but not necessarily a complete pointer
The impacts and properties of these definitions on Morello are summarised in the following table:
Solution | Bit size (Morello)¹ | Indexing² | Round-trip³ | C FFI⁴ | Object size⁵ | Provenance⁶ |
---|---|---|---|---|---|---|
usize = size_t | 64 | ok | no*² | yes (size_t) | ok (64 bit) | awkward*⁵ |
usize = uintptr_t | 128 | inefficient*¹ | yes | yes (uintptr_t) | excessive*⁴ (128 bit) | inefficient*⁶ |
usize = address | 64 | ok | no*² | yes*³ (size_t) | ok (64 bit) | ok |
¹ What size is usize on Morello?
² How well does usize work for indexing arrays?
³ Is *mut T as usize as *mut T guaranteed to work?
⁴ Can usize be used to define types for C FFI?
⁵ What is the maximum size of an object?
usize defines the width of isize.
The maximum value of isize defines the maximum object size.
Result: usize defines the maximum object size.
⁶ Does the experimental strict provenance API work?
*¹ 128 bit indices cause extra overhead on a system that uses 64 bit words.
*² Round trips will fail on CHERI if usize is smaller than a 128 bit pointer.
*³ Still matches size_t on all current targets, but definition is technically different.
*⁴ On Morello, causes 128 bit object size, but address space is only 64 bit.
*⁵ Provides enough space to hold an address on current targets, but definition doesn't guarantee that.
*⁶ If used to hold addresses, 50% of the value will be empty on Morello.
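The reasoning in note ⁵ (usize fixes the width of isize, and the maximum value of isize bounds object size) can be observed through `std::alloc::Layout`. This is a small sketch with hypothetical helper names, assuming a recent toolchain on which `Layout` rejects sizes above `isize::MAX`.

```rust
use std::alloc::Layout;

/// The largest object size implied by the width of usize/isize.
fn largest_object_size() -> usize {
    isize::MAX as usize
}

/// Returns true if a byte array of the given length has a valid layout.
fn byte_layout_is_valid(bytes: usize) -> bool {
    Layout::array::<u8>(bytes).is_ok()
}
```

With a 64 bit usize this bound is the same whether pointers are 64 or 128 bits wide, which is why the "usize = address" row keeps a 64 bit object size on Morello.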
Is CHERI relevant to Rust?
The aims of the CHERI and Rust projects seem to be quite similar in terms of providing better memory safety.
Both provide protection against out of bounds accesses, and both seek to provide some measure of protection against use after free and similar bugs.
This becomes useful for Rust in unsafe code, where CHERI can restore protection of many of the safety properties that unsafe traditionally loses, though with the limitation of only doing so at run time.
Remember that errors in unsafe can propagate outward and break safety guarantees in otherwise safe code, so even though unsafe code is quite rare this could have an effect on much more!
Possible Extension: Warnings
We could gather further information for discussion of this RFC by separately introducing warnings designed to detect uses of usize incompatible with the proposed definition.
These would need to be gated behind an unstable feature, and could be used to evaluate how many crates use an incompatible idea of usize by compiling a broad cross-section, possibly via a Crater run.
Cases where warnings could be useful:
value_usize as *mut T
transmute::<usize, *mut T>()
transmute::<*mut T, usize>()
atomic_usize.to_ptr()
Unions provide another way to cast between types, and it would be useful to warn about usize to pointer casts here too.
By their nature, unions allow a wide variety of powerful casting operations, while providing very little information about the programmer's intent.
Unfortunately, this makes issuing meaningful warnings about unions difficult.
It's notable that there are already no warnings for other unsafe uses of unions, like casting between types of different sizes.
Given that unions are already highly unsafe in a number of ways, and usually only useful for FFI, not warning on this case does not seem to be a significant problem.
Some examples of ways unions can cause problems:
union Problem1 {
a: usize,
b: *const u32,
}
unsafe {
let data = Problem1{a: 1234};
let pointer = data.b; // casts to field of different size on CHERI.
let _ = *pointer; // crashes due to invalid pointer on CHERI.
}
#[derive(Clone, Copy)]
struct A {
padding: u32,
data: u32,
}
#[derive(Clone, Copy)]
struct B {
pointer: *const u32,
}
union Problem2 {
a: A,
b: B,
}
unsafe {
let data = Problem2{a: A{padding: 1234, data: 4321}};
let pointer = data.b.pointer; // fields have the same size on Morello and possibly other CHERI targets.
let _ = *pointer; // crashes due to invalid pointer on CHERI.
}
The same or very similar warnings could, as an alternative, provide a tool that users could opt in to as a way to check their code for CHERI compatibility. This would be less inconvenient than having to set up an entire cross compilation toolchain, and could be available before any real CHERI target has been implemented.
Possible Extension: uintptr_t
As currently written, this RFC provides no built in representation for pointers other than... pointers.
It could be useful to provide a built in integer type that can contain a valid pointer for purposes of arithmetic manipulation, though given that Rust's pointers already provide good facilities for this, it may not be very useful.
The University of Kent port currently only has to define uintptr_t for FFI use, and they have been able to make do by aliasing *mut () without any problems (that have so far been found).
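To make the `*mut ()` aliasing idea concrete, here is a minimal sketch. It is not taken from the Kent port, whose actual definitions may differ. The alias and the `pass_through` function (a stand-in for a C function that takes and returns an opaque uintptr_t) are hypothetical. Because the value stays a pointer throughout, it can be cast back and dereferenced after crossing the FFI-style boundary.

```rust
// Hypothetical alias: uintptr_t represented as an untyped pointer
// instead of an integer.
#[allow(non_camel_case_types)]
type uintptr_t = *mut ();

/// Stand-in for a C function that treats its argument as opaque.
extern "C" fn pass_through(handle: uintptr_t) -> uintptr_t {
    handle
}

/// Sends a pointer through the uintptr_t-typed function and reads
/// the original value back through the returned handle.
fn handle_round_trip() -> u32 {
    let mut value: u32 = 9;
    let handle: uintptr_t = (&mut value as *mut u32).cast();
    let returned = pass_through(handle);
    // The handle never became an integer, so it is still a usable pointer.
    unsafe { *returned.cast::<u32>() }
}
```

The drawback is that such an alias supports no integer arithmetic, so C APIs that expect callers to do arithmetic on uintptr_t values would need extra care.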
Prior art
usize has been discussed at length in several places.
There have been (at least) two previous RFC proposals:
- Policy for assumptions about the size of usize
- Add two new pointer-sized integer types; uptr and iptr
One pre-RFC on the internals forum:
...and a number of tickets for various repositories:
- Rust: Support index size != pointer width
- libc: Deprecate pointer-width integer aliases
- Unsafe Guide: Are raw pointers to sized types usable in C FFI ?
- Bindgen: size_t vs. usize
None of these discussions seems to have reached much of a satisfactory conclusion. It seems that the community has leant somewhat in the same direction as this RFC as time has gone on.
In addition to these discussions, there have been at least two academic attempts to port the Rust compiler to CHERI (both mentioned above) which have had to grapple with usize:
- Nicholas Sim's compiler port: https://github.com/nw0/rust/tree/cheri
- ...and the accompanying dissertation: https://nw0.github.io/cheri-rust.pdf
- our compiler port: https://github.com/kent-weak-memory/rust
- ...and the accompanying paper: https://drops.dagstuhl.de/opus/frontdoor.php?source_opus=18232
Unresolved questions
While not immediately an issue for this RFC, it isn't clear how to define aliases for uintptr_t.
Using *mut () may work, but isn't technically an integer type, so it could cause unexpected errors for users.
Also not of immediate concern: how many crates will not be forward compatible with CHERI? Adding feature-gated compatibility warnings and compiling a large number of crates (possibly via a Crater run) seems like the easiest way to answer this.
Future possibilities
The main aim for this proposal is to make it possible in future to add support for CHERI or any similar architectures that might appear. Hopefully we've already made the case for why this is interesting! A full CHERI target would still be a significant undertaking, as we have found from our own work on building one.
This proposal fits nicely with the strict provenance API, and could benefit from it, but it is also not dependent upon it.