[Pre-RFC] usize is not size_t

That's only a problem with the optimizers, not a fundamental issue with language-level provenance. There are also potentially ways around it.

I mean, sure, if you are okay with throwing away our optimizers, then it is just a problem with those. As someone who considers "must be reasonably optimizable" a crucial constraint for language design, to me this is very much a fundamental issue with language-level provenance: if your language-level provenance is so broken that it cannot be translated to LLVM IR, it is not feasible.

"Potential ways around it" exist for all problems of mankind, but none have been proposed for this particular issue yet, as far as I know. Even PNVI, despite trying very hard, cannot entirely fix this. So I don't think hoping for a solution to magically appear is a productive way forward here -- until a concrete proposal that actually solves the problem surfaces (which I doubt will happen, I think there are fundamental impossibilities here), this pure hypothetical should not affect our discussion.

Yes, we break that, uintptr_t preserves provenance in CHERI C. Doing anything else is basically impossible without really disgusting hacks, and would be contrary to the principles of CHERI. We see it as a feature not a bug that we break this, and in practice almost nothing relies on such horrible practices. Every pointer has a single source of provenance and it is never ambiguous (as would be the case when hacking to make it work, e.g. maintaining a table "somewhere" in the runtime that tracks all pointers that were cast to ints and then searching this table when casting back, but even that isn't entirely possible because there might be more than one pointer in the table that it could be). The rules in C are really only what they are today because uintptr_t is defined as long or equivalent on traditional architectures and they have to keep that working; you could introduce a new opaque integral type for uintptr_t that has the same representation as long but different semantics so you can track its provenance, but you can't retroactively change existing architectures to use a different type for uintptr_t. That corner of the C standard does not make any sense in the context of CHERI and is not necessary for real-world C code, it's primarily just a side-effect of trying to retroactively invent semantics for uintptr_t that make the normal cases we do support work.

Considering that the optimizations that can be performed on rust-with-provenance-through-int2ptr should be a strict superset of those that can be performed on rust-without-provenance-through-int2ptr, I don't see an issue. It also rules out the reasonable implementation of runtime provenance tracking with no possibility of inventing or expanding it (in this manner).

Or, put another way, I don't see a reason that because one implementation cannot take advantage of language-level provenance preservation, that none should be allowed to.

Oh, I agree. The same is true for the formal language semantics, except that those actually do add the disgusting hacks that CHERI avoids. That's why I think we should just discourage the use of this primitive that requires so many disgusting hacks, and replace it by a better one that doesn't. Then we can also prove a theorem that the CHERI backend actually correctly implements the Rust semantics, which seems like it would currently not be possible with the approach you pursued for C.

We already have such a type, it is called void*. Most kinds of arithmetic would anyway not be possible on such an alternative uintptr_t since it is unclear how they should affect provenance.

They cannot, though.

Except that they can, because rust-without-provenance-through-int2ptr is a valid implementation for rust-with.

@InfernoDeity Integer substitution (replacing x by y inside an if x == y) is incorrect in Rust-with-provenance-in-integers. And compiling such a Rust to Rust with provenance on pointers but not integers is almost certainly wrong since you cannot reconstruct the provenance on the cast back. So no, the language semantics you are sketching simply do not work out.
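To make the substitution concrete, here is a minimal sketch. The function names are hypothetical and only illustrate the shape of the rewrite. On plain integers the two functions are indistinguishable, which is why optimizers perform this rewrite freely; if integers carried provenance, `x` and `y` could be equal as addresses yet differ in provenance, and the two versions would no longer be interchangeable.

```rust
/// Original form: the branch returns `x`.
/// (Hypothetical name, for illustration only.)
fn pick_original(x: usize, y: usize) -> usize {
    if x == y {
        x
    } else {
        0
    }
}

/// After integer substitution: inside the branch, `x` has been replaced by `y`.
/// For provenance-free integers this is the same function. If integers carried
/// provenance, a later cast of the result back to a pointer could yield a
/// pointer with `y`'s provenance instead of `x`'s.
fn pick_substituted(x: usize, y: usize) -> usize {
    if x == y {
        y
    } else {
        0
    }
}
```

A typical problematic case is `x` being the address one past the end of one allocation and `y` being the start of an adjacent one: the addresses compare equal, but only one of the two pointers may be used to access the second allocation.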

That substitution is, but that's an optimizer problem, not a language problem.

For us it's simple: the provenance is always that of the pointer you cast to uintptr_t, no matter what arithmetic you perform on it. The only exception is that, due to our bounds compression scheme (so pointers are "only" twice the size despite having 64-bit base and length), if any intermediate calculation takes the pointer "too far" outside its bounds (one past the end is always fine, and in practice you get either ~12.5% or ~25% leeway, I forget the exact details, it's a bit messy) then it is not guaranteed that the result will be valid (the validity tag may be cleared). Thus you can think of any arithmetic on a uintptr_t "as if" you did (char *)uptr + (new_address - (size_t)uptr) where new_address is just a size_t. Which is of course just what your ptr_from_int does, but implicitly with the right pointer provenance guaranteed.
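The "as if" rule above can be written out in Rust terms. This is only a sketch of the idea, not CHERI's implementation: `rederive` is a hypothetical helper name, and it uses wrapping pointer arithmetic so that the intermediate computation itself is never UB.

```rust
/// Re-derives a pointer at `new_address` from `base`, keeping `base`'s
/// provenance. This mirrors `(char *)uptr + (new_address - (size_t)uptr)`.
/// (Hypothetical helper, for illustration only.)
fn rederive(base: *const u8, new_address: usize) -> *const u8 {
    // Distance from the base address to the requested address, modulo 2^N.
    let delta = new_address.wrapping_sub(base as usize);
    // Wrapping arithmetic never invokes UB by itself; dereferencing the
    // result is only valid if it lands inside the original allocation.
    base.wrapping_add(delta)
}
```

For example, with `let data = [10u8, 20, 30, 40];` and `let base = data.as_ptr();`, calling `rederive(base, base as usize + 2)` gives a pointer to the third element that may be dereferenced, because it still carries the provenance of `data`.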

This allows us to support all arithmetic operations on uintptr_t, with the caveat about not straying too far out of bounds. Some are of course silly on CHERI if it's actually storing a pointer, like * and / do not make any sense on pointers (though % does), only integer addresses, but we still let you write that code because not doing so does break real code and they are "fine" to use so long as the value is just storing a pointer (or it's a very strange situation where the multiplied/divided address is still in bounds, which is unlikely but not impossible for anything other than a second operand of 1), even if our generated code would be more efficient if the input had just used a size_t (since the other operand could be 1 at run time, etc.).

@RalfJung How would you propose to deal with running programs in a statically known address space if there's no "int-to-ptr" operation? For example, I need a pointer to 0x0400_0000, how do I get that, or from where?

Presumably the compiler that gives meaning to that magic address (note that by the standard C model, dereferencing an int2ptr(integer-literal) is always UB) would also provide some magic built-in provenance to use for such known addresses.

I don't know about the C standard at all, but LLVM has rules for volatile accesses, and it's actually the first step in the flow chart for the memory model that volatile just does some target specific thing.

And a non-volatile access of such a pointer I don't care about, so it can be UB, that's fine by me.
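For reference, this is roughly what the pattern under discussion looks like in today's Rust. It is a minimal sketch: `REG_ADDR`, `read_reg`, and `write_reg` are hypothetical names, and whether `0x0400_0000` is actually a device register depends entirely on the target.

```rust
use core::ptr;

/// The address from the question above; what lives there is target specific.
#[allow(dead_code)]
const REG_ADDR: usize = 0x0400_0000;

/// Volatile 32-bit load from an integer address.
///
/// # Safety
/// `addr` must be 4-byte aligned and valid for a volatile `u32` read on the
/// current target.
unsafe fn read_reg(addr: usize) -> u32 {
    // The int-to-ptr cast is the operation being debated in this thread.
    unsafe { ptr::read_volatile(addr as *const u32) }
}

/// Volatile 32-bit store to an integer address.
///
/// # Safety
/// `addr` must be 4-byte aligned and valid for a volatile `u32` write on the
/// current target.
unsafe fn write_reg(addr: usize, value: u32) {
    unsafe { ptr::write_volatile(addr as *mut u32, value) }
}
```

On the real target one would call `read_reg(REG_ADDR)`; on a hosted system the same functions can be exercised against the address of an ordinary `u32` variable.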

C and C++ both say that casting an integer to a pointer yields an unspecified result, except in the round-trip case (which is defined to return the same pointer).

If you're actually running on CHERI, the hardware won't let you just dereference 0x0400_0000; you need a special instruction (e.g. SCTAG on Morello), which, in combination with other instructions, can mint a capability given a pointer and a size. Presumably there would be some kind of intrinsic for that, though it couldn't just be wired up to integer-to-pointer casts because those don't have a size. Also note that that instruction requires kernel-level privilege.

If you're not on CHERI, then you'd probably use good old usize-to-pointer casting, and even if the "allow-by-default lint against such casts" that RalfJung suggested is eventually made warn-by-default, you would #[allow] it under the reasoning that you know your program won't run on CHERI.

It was brought up in chat that the Win32 API tends to use void * and pointer-sized integers interchangeably for things that may or may not be pointers. Surely this FFI presents a problem for any optimizations? For example, say you can pass a pointer or an integer through a callback. Maybe you use a tag to distinguish the two. Presumably the optimizer can't track what's going on here? As far as it's concerned it's just a pointer used in one place and perhaps a new pointer magically appearing from an integer somewhere else, no?
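A small sketch of the pattern being described, with made-up names (`LParam` and `Kind` are stand-ins, not the actual Win32 definitions): one pointer-sized integer parameter carries either a plain number or a pointer, and a separate tag says which.

```rust
/// Stand-in for a Win32-style pointer-sized parameter (hypothetical alias).
type LParam = isize;

/// Tag telling the callback how to interpret the parameter.
#[derive(Clone, Copy)]
enum Kind {
    Integer,
    Pointer,
}

/// Interprets `param` according to `kind`.
///
/// # Safety
/// If `kind` is `Kind::Pointer`, `param` must have been produced by casting a
/// live, aligned `*const i32` to `LParam`.
unsafe fn callback(kind: Kind, param: LParam) -> i32 {
    match kind {
        // Plain integer payload: no pointer is involved at all.
        Kind::Integer => param as i32,
        // Pointer payload: the pointer "reappears" from an integer here, far
        // away from the place where it was cast to an integer.
        Kind::Pointer => unsafe { *(param as *const i32) },
    }
}
```

From the optimizer's point of view, the caller performs a pointer-to-integer cast and the callback performs an unrelated-looking integer-to-pointer cast, which is exactly the situation the question is about.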

@comex I think your answer is dodging the question. I'm aware of what to do in modern Rust and that I could use an "allow" in future Rust, but I want to know what things would look like if int-to-ptr were removed entirely. My current impression is that either a language has some form of int-to-ptr or it's simply incapable of effectively expressing MMIO. I'd like to know if there's some third path I'm not seeing.

If the address is truly static, you can just link to a symbol and treat it like FFI. If it's somehow dynamic then presumably you would use an intrinsic and/or malloc-like operation that is allowed to mint new objects.

I find it unlikely that they'd be removed entirely, in light of the usual deprecation/leniency processes. I would be happy if they weren't introduced in the first place for some new architectures/targets. What you had been wanting is platform-dependent behavior in any case. So what about replacing them with platform-dependent (constant) intrinsics/functions, as in unsafe fn ptr::mmio(usize) -> *const u8, that return a pointer with provenance for specific parts of the address space, with unsafe preconditions to taste? We pretty certainly do not want to allow conjuring up pointers to any stack from thin air, for example. So I would imagine 'must not alias any address usable by an allocator', whatever that means.
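A rough sketch of what such a function might look like if written as a library shim on top of today's casts. Everything here is an assumption for illustration: `mmio` as a free function, the `MMIO_WINDOW` range, and the debug-only check all stand in for whatever a real platform-specific intrinsic would do.

```rust
use core::ops::Range;

/// Hypothetical device window for some target; a real platform would define
/// its own range (or several).
const MMIO_WINDOW: Range<usize> = 0x0400_0000..0x0500_0000;

/// Returns true if `addr` lies inside the assumed device window.
fn in_mmio_window(addr: usize) -> bool {
    MMIO_WINDOW.contains(&addr)
}

/// Stand-in for the proposed `ptr::mmio`: yields a pointer meant to carry
/// provenance only for the device window, never for allocator-managed memory.
///
/// # Safety
/// `addr` must lie in the platform's MMIO window and must not alias any
/// address usable by an allocator.
unsafe fn mmio(addr: usize) -> *const u8 {
    debug_assert!(in_mmio_window(addr), "address outside the MMIO window");
    // Today this is just a cast; a real intrinsic could attach a dedicated
    // provenance (or, on CHERI, derive a suitably bounded capability).
    addr as *const u8
}
```

The point of the design is that the only way to conjure a pointer from an integer goes through a function whose preconditions exclude the stack and the heap, so the optimizer can keep assuming that ordinary allocations are not reachable this way.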

Regarding uptr vs uaddr:

I acknowledge there are problems with the uaddr name, but I should note that there are also problems with the uptr name. Namely: We currently already have core::ptr as a module for actual pointers, and uptr would suggest that it is a pointer, when it is not: it is a sequence of bits (here often referred to as an "integer") that can hold all the values that addressing somewhere in memory may involve.

I acknowledge that it would be rather... ironic... for us to deviate from the preferred terminology of CHERI here when naming a new type because of the semantic drift between usize, size_t, and uintptr_t, and specifically so we could improve compatibility with targets like CHERI, which are attaching a different meaning to ptraddr_t, but I think people should weigh that against Rust already having an implicit desire to not view raw pointers as "just a number", due to provenance and such being involved, so uptr and iptr may risk injecting more confusion into that matter.

I also think having a different name would underscore that an actual conversion (and thus, a subtle loss of information) occurs if you do the trivial as uaddr cast.

If nothing else, we certainly kinda seem to be where we are today due to fudging our nomenclature and simply hoping people would interpret it correctly, rather than choosing glaringly distinct names.