[Pre-RFC] usize is not size_t

InfernoDeity · September 29, 2021, 7:57pm

That's only a problem with the optimizers, not a fundamental issue with language-level provenance. There are also potentially ways around it.

RalfJung · September 29, 2021, 8:00pm

I mean, sure, if you are okay with throwing away our optimizers, then it is just a problem with those. As someone who considers "must be reasonably optimizable" a crucial constraint for language design, to me this is very much a fundamental issue with language-level provenance: if your language-level provenance is so broken that it cannot be translated to LLVM IR, it is not feasible.

"Potential ways around it" exist for all problems of mankind, but none have been proposed for this particular issue yet, as far as I know. Even PNVI, despite trying very hard, cannot entirely fix this. So I don't think hoping for a solution to magically appear is a productive way forward here -- until a concrete proposal that actually solves the problem surfaces (which I doubt will happen, I think there are fundamental impossibilities here), this pure hypothetical should not affect our discussion.

jrtc27 · September 29, 2021, 8:01pm

RalfJung:

jrtc27:

It provides a means for you to accidentally amplify the bounds of your pointer compared with the original one you cast to usize if your programming style is sloppy. Contrast that with just defining a uptr that is a real pointer, where the bounds are carried all the way through and doing so is impossible unless you deliberately do something to amplify bounds (i.e. you use something like your proposed little function and bypass uptr), which would stand out like a sore thumb as being dubious code, compared with your proposal where such patterns would be forced.

I am not sure what happens in CHERI, but with regular C/Rust provenance, when you cast a pointer to an integer and back, your provenance does get amplified. That is the only way to make all this existing code work, given the fact that the conversion loses the exact provenance: when doing an int-to-ptr cast, we have to construct a conservative overapproximation of the provenance that the pointer might have had.

If CHERI maintains the exact provenance here, then it is actually incorrect in the sense of ruling out valid Rust (or C) programs that rely on the fact that doing a ptr-int-ptr roundtrip loses some precision in the provenance.

Yes, we break that, uintptr_t preserves provenance in CHERI C. Doing anything else is basically impossible without really disgusting hacks, and would be contrary to the principles of CHERI. We see it as a feature not a bug that we break this, and in practice almost nothing relies on such horrible practices. Every pointer has a single source of provenance and it is never ambiguous (as would be the case when hacking to make it work, e.g. maintaining a table "somewhere" in the runtime that tracks all pointers that were cast to ints and then searching this table when casting back, but even that isn't entirely possible because there might be more than one pointer in the table that it could be). The rules in C are really only what they are today because uintptr_t is defined as long or equivalent on traditional architectures and they have to keep that working; you could introduce a new opaque integral type for uintptr_t that has the same representation as long but different semantics so you can track its provenance, but you can't retroactively change existing architectures to use a different type for uintptr_t. That corner of the C standard does not make any sense in the context of CHERI and is not necessary for real-world C code, it's primarily just a side-effect of trying to retroactively invent semantics for uintptr_t that make the normal cases we do support work.

InfernoDeity · September 29, 2021, 8:03pm

Considering that the optimizations that can be performed on rust-with-provenance-through-int2ptr and rust-without-provenance-through-int2ptr should be a strict superset, I don't see an issue. It also rules out the reasonable implementation of runtime provenance tracking with no possibility of inventing or expanding it (in this manner).

InfernoDeity · September 29, 2021, 8:06pm

Or, put another way, I don't see a reason that because one implementation cannot take advantage of language-level provenance preservation, that none should be allowed to.

RalfJung · September 29, 2021, 8:06pm

Oh, I agree. The same is true for the formal language semantics, except that those actually do add the disgusting hacks that CHERI avoids. That's why I think we should just discourage the use of this primitive that requires so many disgusting hacks, and replace it by a better one that doesn't. Then we can also prove a theorem that the CHERI backend actually correctly implements the Rust semantics, which seems like it would currently not be possible with the approach you pursued for C.

We already have such a type, it is called void*. Most kinds of arithmetic would anyway not be possible on such an alternative uintptr_t since it is unclear how they should affect provenance.

They cannot, though.

InfernoDeity · September 29, 2021, 8:07pm

Except that they can, because rust-without-provenance-through-int2ptr is a valid implementation for rust-with.

RalfJung · September 29, 2021, 8:09pm

@InfernoDeity Integer substitution (replacing x by y inside an if x == y) is incorrect in Rust-with-provenance-in-integers. And compiling such a Rust to Rust with provenance on pointers but not integers is almost certainly wrong since you cannot reconstruct the provenance on the cast back. So no, the language semantics you are sketching simply do not work out.
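To make the objection concrete, here is a minimal sketch of the kind of code at issue. The function name `roundtrip_if_equal` is made up for illustration; it does not come from the thread or from any library.

```rust
/// Hypothetical illustration: cast two pointers to integers, compare them,
/// and cast one of the integers back to a pointer.
fn roundtrip_if_equal(p: *const u8, q: *const u8) -> Option<*const u8> {
    let x = p as usize;
    let y = q as usize;
    if x == y {
        // On plain integers an optimizer may rewrite `x` to `y` here, since
        // the two compare equal. If integers carried provenance, `x` (derived
        // from `p`) and `y` (derived from `q`) could still differ in
        // provenance, so the rewrite would change which allocation the
        // resulting pointer is allowed to access.
        Some(x as *const u8)
    } else {
        None
    }
}
```

The sketch only compares addresses; whether the returned pointer may be dereferenced is exactly the question under dispute.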

InfernoDeity · September 29, 2021, 8:10pm

That substitution is, but that's an optimizer problem, not a language problem.

jrtc27 · September 29, 2021, 8:27pm

For us it's simple: the provenance is always that of the pointer you cast to uintptr_t, no matter what arithmetic you perform on it. The only exception is that, due to our bounds compression scheme (so pointers are "only" twice the size despite having 64-bit base and length), if any intermediate calculation takes the pointer "too far" outside its bounds (one past the end is always fine, and in practice you get either ~12.5% or ~25% leeway, I forget the exact details, it's a bit messy) then it is not guaranteed that the result will be valid (the validity tag may be cleared). Thus you can think of any arithmetic on a uintptr_t "as if" you did (char *)uptr + (new_address - (size_t)uptr) where new_address is just a size_t. Which is of course just what your ptr_from_int does, but implicitly with the right pointer provenance guaranteed.
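A rough Rust rendering of the "as if" rule above, assuming ordinary (non-CHERI) raw pointers. The helper name `retarget` is hypothetical, and the sketch says nothing about CHERI's representability limits.

```rust
/// Hypothetical helper: produce a pointer at `new_address` that is derived
/// from `base`, mirroring `(char *)uptr + (new_address - (size_t)uptr)`.
fn retarget(base: *const u8, new_address: usize) -> *const u8 {
    // Only the difference between the two addresses is applied as an offset,
    // so the result is computed from `base` rather than from a bare integer.
    let delta = new_address.wrapping_sub(base as usize);
    base.wrapping_add(delta)
}
```

For example, calling `retarget(buf.as_ptr(), buf.as_ptr() as usize + 2)` on a byte buffer `buf` gives a pointer to `buf[2]` that is still derived from `buf`.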

This allows us to support all arithmetic operations on uintptr_t, with the caveat about not straying too far out of bounds. Some are of course silly on CHERI if it's actually storing a pointer, like * and / do not make any sense on pointers (though % does), only integer addresses, but we still let you write that code because not doing so does break real code and they are "fine" to use so long as the value is just storing a pointer (or it's a very strange situation where the multiplied/divided address is still in bounds, which is unlikely but not impossible for anything other than a second operand of 1), even if our generated code would be more efficient if the input had just used a size_t (since the other operand could be 1 at run time, etc.).

Lokathor · September 30, 2021, 8:42pm

@RalfJung How would you propose to deal with running programs in a statically known address space if there's no "int-to-ptr" operation? For example, I need a pointer to 0x0400_0000, how do I get that, or from where?

CAD97 · September 30, 2021, 8:53pm

Presumably the compiler that gives meaning to that magic address (note that by the standard C model, dereferencing an int2ptr(integer-literal) is always UB) would also provide some magic built-in provenance to use for such known addresses.

Lokathor · September 30, 2021, 9:15pm

I don't know about the C standard at all, but LLVM has rules for volatile accesses, and it's actually the first step in the flow chart for the memory model that volatile just does some target specific thing.

And a non-volatile access of such a pointer I don't care about, so it can be UB, that's fine by me.
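For reference, the pattern under discussion looks roughly like this in today's Rust. This is a sketch only: the address is the one from the question above, while the 16-bit register width and the function names are assumptions made for illustration.

```rust
use std::ptr;

/// Address from the question above; the register width (u16) is an assumption.
const REG_ADDR: usize = 0x0400_0000;

/// Builds the raw pointer with a plain integer-to-pointer cast.
fn reg_ptr() -> *mut u16 {
    REG_ADDR as *mut u16
}

/// Volatile store through `reg`.
/// Safety: `reg` must be valid for an aligned 2-byte write on the target.
unsafe fn write_reg(reg: *mut u16, value: u16) {
    unsafe { ptr::write_volatile(reg, value) }
}

/// Volatile load through `reg`.
/// Safety: `reg` must be valid for an aligned 2-byte read on the target.
unsafe fn read_reg(reg: *const u16) -> u16 {
    unsafe { ptr::read_volatile(reg) }
}
```

On a target where that address is a real device register, one would call `write_reg(reg_ptr(), value)`; on a hosted platform the address is almost certainly unmapped, so the helpers should only be exercised against ordinary memory there.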

InfernoDeity · September 30, 2021, 9:18pm

C and C++ both say that casting an integer to a pointer yields an unspecified result, except in the round-trip case (which is defined to return the same pointer).

comex · September 30, 2021, 9:57pm

If you're actually running on CHERI, the hardware won't let you just dereference 0x0400_0000; you need a special instruction (e.g. SCTAG on Morello), which, in combination with other instructions, can mint a capability given a pointer and a size. Presumably there would be some kind of intrinsic for that, though it couldn't just be wired up to integer-to-pointer casts because those don't have a size. Also note that that instruction requires kernel-level privilege.

If you're not on CHERI, then you'd probably use good old usize-to-pointer casting, and even if the "allow-by-default lint against such casts" that RalfJung suggested is eventually made warn-by-default, you would #[allow] it under the reasoning that you know your program won't run on CHERI.

chrisd · September 30, 2021, 10:09pm

It was brought up in chat that the Win32 API tends to use void * and pointer-sized integers interchangeably for things that may or may not be pointers. Surely this FFI presents a problem for any optimizations? For example, say you can pass a pointer or an integer through a callback. Maybe you use a tag to distinguish the two. Presumably the optimizer can't track what's going on here? As far as it's concerned it's just a pointer used in one place and perhaps a new pointer magically appearing from an integer somewhere else, no?

Lokathor · September 30, 2021, 10:26pm

@comex I think your answer is dodging the question. I'm aware of what to do in modern Rust and that I could use an "allow" in future Rust, but I want to know what things would look like if int-to-ptr were removed entirely. My current impression is that either a language has some form of int-to-ptr or it's simply incapable of effectively expressing MMIO. I'd like to know if there's some third path I'm not seeing.

rpjohnst · September 30, 2021, 10:30pm

If the address is truly static, you can just link to a symbol and treat it like FFI. If it's somehow dynamic then presumably you would use an intrinsic and/or malloc-like operation that is allowed to mint new objects.

HeroicKatora · September 30, 2021, 10:33pm

I find it unlikely that they'd be removed entirely, in light of the usual deprecation/leniency processes. I would be happy if they weren't introduced in the first place for some new architectures/targets. What you had been wanting is platform-dependent behavior in any case. So what about replacing them with platform-dependent (constant) intrinsics/functions, as in unsafe fn ptr::mmio(usize) -> *const u8, that return a pointer with provenance for specific parts of the address space, with unsafe preconditions to taste? We pretty certainly do not want to allow conjuring up pointers to any stack from thin air, for example. So I would imagine 'must not alias any address usable by an allocator', whatever that means.
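A very rough sketch of what such a function could look like, under heavy assumptions. No `ptr::mmio` exists in Rust; the window bounds below are placeholders rather than real platform values, and the body falls back to an ordinary integer-to-pointer cast because today's Rust offers nothing else. A real intrinsic would be supplied by the compiler for each target.

```rust
/// Hypothetical MMIO window. Both bounds are placeholders chosen for
/// illustration, not values taken from any real platform.
const MMIO_START: usize = 0x0400_0000;
const MMIO_END: usize = 0x0500_0000;

/// Returns true if `addr` lies inside the assumed MMIO window.
fn in_mmio_window(addr: usize) -> bool {
    (MMIO_START..MMIO_END).contains(&addr)
}

/// Stand-in for the suggested `ptr::mmio`.
/// Safety: `addr` must be inside the platform's MMIO window and must not
/// alias any address usable by an allocator.
unsafe fn mmio(addr: usize) -> *const u8 {
    // Checked in debug builds only; release builds rely on the caller.
    debug_assert!(in_mmio_window(addr), "address outside the MMIO window");
    addr as *const u8
}
```

The intent of the split is that the precondition is stated once, at the single place where a pointer is minted, instead of at every cast site.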

workingjubilee · October 6, 2021, 11:09pm

Regarding uptr vs uaddr:

I acknowledge there are problems with the uaddr name, but I should note that there are also problems with the uptr name. Namely: We currently already have core::ptr as a module for actual pointers, and uptr would suggest that it is a pointer, when it is not: it is a sequence of bits (here often referred to as an "integer") that can hold all the values that addressing somewhere in memory may involve.

I acknowledge that it would be rather... ironic... for us to deviate from the preferred terminology of CHERI here when naming a new type because of the semantic drift between usize, size_t, and uintptr_t, and specifically so we could improve compatibility with targets like CHERI, which are attaching a different meaning to ptraddr_t, but I think people should weigh that against Rust already having an implicit desire to not view raw pointers as "just a number", due to provenance and such being involved, so uptr and iptr may risk injecting more confusion into that matter.

I also think having a different name would underscore that an actual conversion (and thus, a subtle loss of information) occurs if you do the trivial as uaddr cast.

If nothing else, we certainly kinda seem to be where we are today due to fudging our nomenclature and simply hoping people would interpret it correctly, rather than choosing glaringly distinct names.
