[Pre-RFC] usize is not size_t

Is it an option to make it a hard error, in any edition, on targets where size_t and uintptr_t are not equivalent? Maybe if we added a cfg gate for it.

Making it an error only on such targets will make portability to such targets harder, because most people won't notice; if those types may differ, we should at least warn on all targets.

Yeah, I wouldn't expect this to replace a warn-by-default lint that becomes deny-by-default or a full hard error in a future edition; it would sit alongside such a lint.

There would be a few things in flight:

  • Introducing a new type to represent pointer width
  • Changing std APIs to move away from usize to this new uptr type
  • Changing third party crates to use uptr when appropriate
  • Adding language support for both explicit and implicit usize <-> uptr conversions

To summarize some of the concerns I see:

  • Introducing a new type will cause churn
  • Autoconversion of these types without warning would just be the status quo
  • Autoconversion with lint warnings/errors would be annoying on the currently supported platforms for people who don't care about supporting "niche" platforms
  • Moving the ecosystem to the new type will be slow
  • Making the conversion of types manual causes immediate churn
  • Likely MSRV concerns
  • Increased complexity for "marginal" gains
  • Having to make a decision on whether usize should have been size_t or uintptr_t, and communicate this to the whole ecosystem

I think that we can and should do this, and indeed keep usize as size_t and introduce a new ptr type. I furthermore think that we should have autocasting between these two, with an allow-by-default lint on all platforms, with eventual warn-by-default behavior, and a hard error on the "niche" platforms. It's very likely that any crates that would care to cater to these platforms wouldn't need additional incentives, and any crate that doesn't explicitly target them is likely to be unsuitable due to the kind of tradeoffs they've made.

I think that if we can go about a change like this in a way that is transparent for people that don't care, moving slowly until, let's say, the next edition rolls over, it'll be plenty of time for anyone who cares to prepare for these platforms and for the ecosystem to adapt.

The biggest unsolvable problem I see is the one about MSRV, but there are two things we can do:

  1. provide a new crate that on older versions is just a reexport of type uptr = usize;
  2. rely on specific crates to cfg the new behavior away for a while until they raise their MSRV again at some point in the future
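
A minimal sketch of what the first option could look like, assuming the new type ends up being called uptr as in this thread. The helper `addr_of_ptr` is a hypothetical name used only for illustration; on a toolchain with a native uptr the alias would be replaced by a re-export.

```rust
// Hypothetical compatibility shim: on toolchains without a native `uptr`,
// the name is just an alias for `usize`.
#[allow(non_camel_case_types)]
pub type uptr = usize;

/// Extracts the address of a thin pointer as a `uptr`.
/// (`addr_of_ptr` is an illustrative name, not an existing API.)
pub fn addr_of_ptr<T>(p: *const T) -> uptr {
    p as uptr
}
```

Downstream crates would depend on the shim and write `uptr` everywhere a pointer-sized integer is meant, so their source would not need to change when the alias becomes a real type.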

AFAIK this is as yet undecided for trait objects and user-defined DSTs, but the most basic approach could be to create a method that returns a tuple of (size_t, ()) or (size_t, u64) depending on platform. To avoid adding magical associated types to pointers the tag type could be an alias in std::ptr.

For example:

let (address, tag): (usize, std::ptr::Tag) = ptr.to_int_parts();
let ptr: *const u8 = std::ptr::from_int_parts(address, tag);

I'm not sure what to do about segmented memory. Should it be (size_t, segment, tag) with either segment or tag being (), or return (size_t, tagment) and let the other integer be either of them, and give these methods/parameters generic names?

There can also be:

let int: std::ptr::Intptr = ptr.to_integer_representation();
let ptr: *const u8 = std::ptr::from_integer_representation(int);
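
As a rough sketch of how the first pair of functions might behave on a conventional target, assuming the tag alias is `()` there. None of these names exist in std; they are written as free functions here only to keep the example self-contained.

```rust
/// Hypothetical alias: on conventional targets a pointer carries no tag.
pub type Tag = ();

/// Splits a thin pointer into its address and its (empty) tag.
pub fn to_int_parts<T>(ptr: *const T) -> (usize, Tag) {
    (ptr as usize, ())
}

/// Rebuilds a thin pointer from an address and a tag.
/// With `Tag = ()` this is the same as a plain integer-to-pointer cast.
pub fn from_int_parts<T>(address: usize, _tag: Tag) -> *const T {
    address as *const T
}
```

On a target with authentication tags, `Tag` would be a non-trivial integer type and `from_int_parts` would have to recombine both halves, which is why callers should thread the tag through instead of discarding it.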

Rust currently implements (not explicitly) the equivalent of:

impl<T: Sized> Pointee for T {
    type Metadata = ();
}

How bad would it be if indeed for some platforms we had:

struct AuthenticationTag(u64);

impl<T: Sized> Pointee for T {
    type Metadata = AuthenticationTag;
}

Interestingly, this might resolve the current ambiguity about whether *const u8 should be used for a 'location in memory' or not. Under that idea, a *const u8 would have a platform-dependent tag and would not be a pure usize/address. Instead, we'd need a new type (unsized, under the above semantics!) that specifically has the () tag on all platforms. Quite curious indeed.

And how much does std depend on the assumption that statically sized means Metadata = ()?
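
One way to make that assumption visible is to check pointer sizes directly. This is only a sketch; the function names are made up for illustration, and the results hold on today's mainstream targets but would not on a target where thin pointers carry a tag.

```rust
use core::mem::size_of;

/// True when a thin pointer to `T` occupies exactly one `usize`,
/// i.e. when the pointer has no room for metadata or a tag.
pub fn thin_pointer_is_usize_sized<T>() -> bool {
    size_of::<*const T>() == size_of::<usize>()
}

/// Number of `usize`-sized words in a slice pointer (address plus length).
pub fn slice_pointer_words() -> usize {
    size_of::<*const [u8]>() / size_of::<usize>()
}
```

Code that transmutes between `*const T` and `usize`, or stores pointers in `usize`-sized slots, silently relies on the first check being true.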

One thing that would help a lot with a transition is to have a real target which people can easily build Rust code for and test it on using an emulator. By easily I mean: the Rust target should be available in rustup (even if as a preview), and the emulator and OS image should be readily available without building anything from source. Ideal features for the target would be:

  • uptr is larger than usize, to catch cases where I accidentally cast a pointer to usize and back.
  • uptr is 128-bit, to catch cases where I accidentally cast a pointer to u64 and back.
  • As 'normal' as possible in every way not related to pointer representation – which further rules out w65, as much as I like SNES.
  • Perhaps CHERI should actually be enforced, to catch cases where I manually mangle pointers in a way CHERI won't accept. Though perhaps this could be a separate step.

Not only would such a target help catch mistakes in porting, it would reinforce the idea that the goal of uptr is to port Rust to new hardware with exciting new security guarantees, not just make churn.

Unfortunately, there is no CHERI target upstream in LLVM, but both Morello (ARM) and the original CHERI project have actively maintained LLVM forks. Both CHERI projects also have emulators. As for operating systems, CheriBSD is probably the best choice since it already exists and supports both Morello and the CHERI extensions to MIPS and RISC-V; meanwhile, porting Linux to Morello is apparently in progress.

In principle it should be feasible for rustc to have an official experimental target for one of these platforms, building against one of those LLVM forks, similarly to how it used to build emscripten targets against a separate version of LLVM (or so I hear). No, I'm not volunteering to do the work. :wink:

w65 wouldn't be a viable candidate here anyways. The gcc port is not yet prepared - I was waiting on potentially blocking issues with the ABI from the rust side before proceeding (and have been busy with other parts of the project, mostly the assembler/linker as they are currently in-use), and it would need rustc_codegen_gcc anyways (which I believe is not yet quite upstreamed in rustc - please correct me if I am wrong).

That's a portability issue for any architecture whose address size is not equivalent to u64, such as MSP430 or RISC-V RV128.

Or ix86.

In the interests of full disclosure: I work for Arm and have responsibilities both on Rust and Morello. That said, I do think that Morello is a good candidate here (for whatever approach we pick), for the reasons you give. CheriBSD and the LLVM ports are quite stable enough for an experimental target.

No, I'm not volunteering to do the work.

It fits within my remit, so I can, in principle, contribute some time. I'm more familiar with Morello than with Rust's internals, though, and don't have a good feel for the amount of work required (other than that it's quite substantial). Rust-on-CHERI is a popular topic, though.

Indeed, casting a ptr to a usize and back is already necessarily a lossy operation.

So I wonder if the path forward should not involve fully embracing the idea that pointers and integers are fundamentally distinct kinds of values. As far as I am concerned, in an ideal universe, there would be no way to convert a usize to a pointer -- instead one would be required to explicitly declare which provenance that pointer should have, e.g. by giving another pointer whose provenance should be used:

/// Returns a pointer pointing to `addr`, with the provenance
/// taken from `provenance`.
fn ptr_from_int<T>(addr: usize, provenance: *const T) -> *const T

I assume this API is easy to support on CHERI as well. After all, using usize to represent pointer offsets is still perfectly fine; the issue is "just" that casting a pointer to usize is lossy in ways that are much more obvious than in Rust on regular targets. In other words, the only operation that is problematic if one considers Rust-on-CHERI with usize being 64-bit in size is the int-to-ptr cast, which I think is a cursed operation anyway. Two birds, one stone!
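
For illustration, ptr_from_int can be sketched on current Rust using nothing but pointer arithmetic, so no integer-to-pointer cast is involved. This is one possible implementation under the assumption that T is a sized type, not a proposed final form; recent Rust versions expose a similar operation as the `with_addr` method on raw pointers.

```rust
/// Sketch of `ptr_from_int`: returns a pointer to `addr` that carries the
/// provenance of `provenance`.
pub fn ptr_from_int<T>(addr: usize, provenance: *const T) -> *const T {
    // The only pointer-to-integer cast: it reads the address, nothing more.
    let base = provenance as usize;
    // Byte distance from the donor pointer to the requested address.
    let offset = (addr as isize).wrapping_sub(base as isize);
    // Pointer arithmetic keeps the donor's provenance on the result.
    provenance.cast::<u8>().wrapping_offset(offset).cast::<T>()
}
```

The result may only be dereferenced if `addr` lies inside the allocation that `provenance` is allowed to access; the function itself is safe because it never reads memory.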

Basically, what I imagine is that with the CHERI target, usize-to-ptr casts would fail to compile. We could have an allow-by-default lint against such casts that helps people ensure their code is portable to CHERI. (transmute between pointers and usize would also fail due to their different size, but then again that is already a cursed operation. This one might be harder to lint against, but it should be possible.)

ptr_from_int, together with the existing ptr as usize cast that extracts the address (and loses provenance), is enough to implement things like "packing extra booleans into the aligned part of a pointer".

I think it also suffices to implement schemes such as the OCaml garbage collector where the last bit of a word is used to distinguish pointers from ints, if we further assume some global const DUMMY_PROVENANCE: *const () that can be used to create pointers that cannot be dereferenced (but that can be cast back to usize). Then we could use *const () as type for such a "pointer or int" value.
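
A small sketch of that "pointer or int" scheme, assuming a null pointer is an acceptable stand-in for DUMMY_PROVENANCE. The `Value` type and its methods are hypothetical names, and ptr_from_int is repeated here in one possible form so the example compiles on its own.

```rust
/// Stand-in for the `DUMMY_PROVENANCE` constant from the text (assumption:
/// a null pointer, which is never dereferenced).
pub const DUMMY_PROVENANCE: *const () = core::ptr::null();

/// One possible `ptr_from_int`, built from pointer arithmetic only.
pub fn ptr_from_int<T>(addr: usize, provenance: *const T) -> *const T {
    let offset = (addr as isize).wrapping_sub(provenance as usize as isize);
    provenance.cast::<u8>().wrapping_offset(offset).cast::<T>()
}

/// A word that is either a pointer (low bit 0) or an integer (low bit 1).
#[derive(Clone, Copy)]
pub struct Value(*const ());

impl Value {
    /// Stores `n` shifted left by one with the low bit set as the int marker.
    pub fn from_int(n: usize) -> Value {
        Value(ptr_from_int((n << 1) | 1, DUMMY_PROVENANCE))
    }

    /// Stores a real pointer; `T` must be at least 2-byte aligned so that
    /// the low bit of the address is zero.
    pub fn from_ptr<T>(p: *const T) -> Value {
        Value(p.cast::<()>())
    }

    /// Returns the integer payload, if this word holds one.
    pub fn as_int(self) -> Option<usize> {
        let bits = self.0 as usize;
        if (bits & 1) == 1 { Some(bits >> 1) } else { None }
    }

    /// Returns the pointer, provenance intact, if this word holds one.
    pub fn as_ptr<T>(self) -> Option<*const T> {
        let bits = self.0 as usize;
        if (bits & 1) == 0 { Some(self.0.cast::<T>()) } else { None }
    }
}
```

Because the field is always a pointer type, real pointers keep their provenance for the whole round trip, and only the integer case uses the dummy one.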

How would this work if you have something like a special Box with manual niche optimizations in the alignment bits? Where would you get the provenance pointer from without adding extra storage for the original pointer (which defeats the point of the optimization)?

I don't know what such a Box provides as an API (no idea which kind of niche optimizations you are referring to). You would keep using pointer types for all values on which you want to preserve provenance, so no extra field to remember the provenance should be needed.

If you mean exploiting alignment to store a boolean, then to set that boolean you

  • cast the ptr to an int,
  • do the bit-ops on that int,
  • and use ptr_from_int to turn that int together with the original ptr provenance back into the new pointer

So you only "need extra storage" inside this set function (one of your local variables will be live slightly longer than it used to), but not in your Box type. Late-stage compiler optimizations (post provenance erasure) could easily reduce that live range; I assume register allocation would happen late enough that this won't affect the generated assembly.

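impl MyBox {
    fn set(&mut self, data: bool) {
        // Read the address; this cast drops provenance.
        let i = self.ptr as usize;
        // Clear the least significant bit, then set it to `data`.
        let i = (i & !1) | (data as usize);
        // Rebuild the pointer, reusing the provenance of the original.
        self.ptr = ptr_from_int(i, self.ptr);
    }
}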
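
For completeness, here is a self-contained sketch of such a box, assuming it owns a heap-allocated u64 and keeps one boolean in the low alignment bit. `MyBox`, `flag`, `value` and `untagged` are illustrative names, and ptr_from_int is repeated in one possible form so the example stands alone.

```rust
/// One possible `ptr_from_int`, built from pointer arithmetic only.
pub fn ptr_from_int<T>(addr: usize, provenance: *const T) -> *const T {
    let offset = (addr as isize).wrapping_sub(provenance as usize as isize);
    provenance.cast::<u8>().wrapping_offset(offset).cast::<T>()
}

/// Hypothetical box over a `u64` with a boolean packed into bit 0.
pub struct MyBox {
    ptr: *mut u64,
}

impl MyBox {
    pub fn new(value: u64) -> MyBox {
        // `u64` is 8-byte aligned, so the low bit starts out as 0.
        MyBox { ptr: Box::into_raw(Box::new(value)) }
    }

    /// Stores `data` in the low bit, keeping the original provenance.
    pub fn set(&mut self, data: bool) {
        let i = (self.ptr as usize & !1) | (data as usize);
        self.ptr = ptr_from_int(i, self.ptr as *const u64) as *mut u64;
    }

    /// Reads the packed boolean.
    pub fn flag(&self) -> bool {
        (self.ptr as usize & 1) == 1
    }

    /// Returns the pointer with the flag bit cleared, provenance intact.
    fn untagged(&self) -> *mut u64 {
        let i = self.ptr as usize & !1;
        ptr_from_int(i, self.ptr as *const u64) as *mut u64
    }

    /// Reads the boxed value through the untagged pointer.
    pub fn value(&self) -> u64 {
        unsafe { *self.untagged() }
    }
}

impl Drop for MyBox {
    fn drop(&mut self) {
        // Free the allocation through the untagged pointer.
        unsafe { drop(Box::from_raw(self.untagged())) }
    }
}
```

The struct stays one pointer wide; the provenance donor is always the field itself, so no second copy of the pointer is stored.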

I was actually wondering how casting to usize (or whatever uptr type we come up with) erasing provenance would work with a hardware provenance model that doesn't come with a way to invent such.

Or would this be covered by "the result of casting a pointer to usize is unspecified"?

I find it hard to imagine a hardware provenance model where a pointer does not have a notion of "address" that defines the actual location in memory that the pointer refers to. Doing a load has to extract the address from a pointer after all; ptr as usize can just do the same thing.

The question is whether the reverse operation is possible if the integer type doesn't carry provenance information and the hardware doesn't allow you to invent provenance at all (I don't know if this is the actual case on CHERI - more a theoretical question - possibly something for UCG to explore in detail).

That solves a lot of problems, both now and in formalizing Rust semantics. Great idea! As far as I'm concerned that's the proposed migration path that we should be discussing.

You were talking about casting to usize though?

Now it seems you are asking about ptr_from_int, and specifically ptr_from_int(i, DUMMY_PROVENANCE). Indeed if some platform does not have a "dummy provenance" then we would have a problem here.

But that seems extremely hypothetical, so I don't think we have to solve this problem now.

Yeah, sorry, I was wondering in general about the roundtrip, but specifically mentioned the destructive as usize cast.

That only works for something like CHERI, where the actual address size is the same as the size type. This doesn't help w65, which is what I'm concerned about, where (long) addresses are 24-bit, pointers are 32-bit, and size_t is 16-bit.