I wrote my answer before you added that paragraph. To be honest, it wasn't at all clear to me that this isn't supposed to refer to any realistic implementation. Also, you are using the word "integer" in a sense that implies a type which is much more than just integers. I suggest you use some other word for it to avoid ambiguity.
It models the integers of the implementation-side version of the language, though—not calling it an integer would be even more confusing.
The only thing that needs to be made clear here is that this is a semantic model that has to apply to any possible implementation, because it’s going to be used to determine which operations are allowed—including those where “the programmer has to know” something.
However, that doesn’t mean that this isn’t supposed to refer to any realistic implementation. It doesn’t describe the operational semantics of compiled Rust code running on x86, sure, but it can and probably even should describe the operational semantics of something like MIRI. Implementing something that actually tracks this kind of information as the program executes allows it to do really useful things like detect UB in the general sense, regardless of what any particular implementation does with it.
Yes. Moreover, and more importantly, Rust running on x86 must not do anything that Rust couldn't also do on this abstract machine. Only when the abstract machine says UB may the x86 code do anything.
I will try to go back to the post and see if I can somehow disentangle the type of integers from their set of values.
I picked it up, but you might put more explicit emphasis on why the disentanglement matters:
1. If pointers are integers, then there are no invalid pointers, because all integers are valid. (@H2CO3's model)
2. But we want these store-forwarding optimizations, which require a concept of "invalid" pointers. (Actually, we need them here because implementations already do them.)
3. Therefore pointers must not be integers. (In the semantics; obviously we want the semantics to be efficiently implementable.)
And later: But we also want to be able to cast back and forth between (or do arithmetic with) integers and (valid) pointers in some cases, because this is a low-level language and that is something we do all the time.
P.S. I enjoyed the twin-allocation paper earlier this week, and your continuing work on this subject.
I'm probably missing something, but what are the use cases for pointer-integer casting?
EDIT this isn't meant to be a reply to @H2CO3 but I don't know how to change that now the post exists.
I have extended that part where pointer-integer casts come into play, and turned it into a subsection called "The Model Falls Apart". I hope that helps clarify some of this confusion. Notable quotes:
This disconnect between the type and the value may seem somewhat strange, but we are actually not very concerned with types at this point. Types serve to classify values with the goal of establishing certain guarantees about a program, so we can only really start talking about types once we are done defining our set of values and program behaviors. Still, this means there are safe programs that miri cannot execute, such as
(Box::into_raw(Box::new(0)) as usize) * 2
This [doing nothing when converting a pointer to an integer] is the laziest thing to do, and we do it because it is not clear what else to do -- in our abstract machine, there is no single coherent "address space" that all allocations live in, which we could use to map every pointer to a distinct integer. Every allocation is just identified by an (unobservable) ID. We could now start to enrich this model with extra data like a base address for each allocation, and somehow use that when casting integers back to pointers... but that's where it gets really complicated, and anyway discussing such a model is not the point of this post. The point is to discuss the need for such a model. If you are interested, I suggest you read this paper that explores the above idea of adding a base address.
Pointer->integer casts can be useful for things like storing extra data in unused bits of a pointer, or using an address as part of a hash key or unique identifier, or implementing an xor linked list.
I get the distinction between uninitialized and actual byte value, and that's definitely useful. However, I'm not sure about giving pointer values special status.
If I understand correctly, you’d like to keep the extra edge-case-solving hidden properties of pointers even through casts to integers (and back?), so you’re trying to attach metadata to pointer-derived bytes and integers.
That doesn't fit my mental model. In my ideal machine, the check whether something is a valid pointer would not depend on how the value was obtained, but only on whether it happens to look like any valid pointer (of compatible type) at the moment the pointer is created from an integer (i.e., at the moment of `foo as *mut u8`, search through all possible valid values of pointers and decide which pointer it is).
The pointer->integer->pointer cast does create ambiguity. You can't know whether it was one-past-the-end of the first array, or the start of the second array. That's fine. Just pick one (preferably the latter) in a well-defined fashion. If you need this to be unambiguous for static analysis, then leave gaps between arrays.
So while `x[y-x]` is UB, the same thing "laundered" through integers IMHO should be allowed and point to `y`.
In other words IMHO:
let x = &[0u8; 10];
let y = rand() as *const u8;
`y` could point to `x` sometimes.
That would prohibit several optimizations that we do today. And to what end? So the formal model can have a simpler definition for integers? What does that buy you?
That's insufficient. The gaps would have to be infinitely large.
I’m not certain I understand the problem correctly, but I feel like the gist of the matter is how should the abstract machine build a pointer from an integer?
The way I understand the article, miri actually stores the representation of a pointer as an id + an offset.
So if I'm not mistaken, the practical question is which id we should attribute to a pointer created from an integer (the integer possibly being created from a previous pointer). We have to pick exactly one to store in the pointer representation.
- An existing id? Which one do we pick?
- A fresh id? This means that the freshly created pointer cannot alias with another one (if I understand correctly, “aliasing” requires two pointers to have the same id). Also, what characteristics do we give to the new allocation that corresponds to that fresh id: e.g., its size, to enforce the rules for knowing if the pointer is legal or out-of-bounds?
It is an important question, because after the pointer has been created from the integer, its representation is used in the VM like the representation of any other pointer, and may cause/inhibit the same kind of optimizations.
EDIT: And I believe that even before that, the question is what range of values an integer can have when it is created from a pointer. This question is important to an optimizer, because it can simplify things according to the possible values of the integer. For instance, it could know that two successive iterations of a for loop that increment a pointer and take its address are building integers that are only separated by the size of the pointed-to type. An option is to assume that we know nothing of the possible values of an integer created from a pointer, but then we lose potential for optimization. Another option would be to associate a common "base" with all pointers that share the same ID, so as to keep some information on the relative values of integers created from pointers with the same ID, but apparently this model has its drawbacks too.
I'm assuming the whole idea is only for const evaluation and symbolic execution, which could be slow/inefficient, but the desire is to have extra correctness. If every byte of memory is an `enum`, then all efficiency is already lost.
If we keep the rule that a valid pointer can only point to its allocation or 1 element past it, AND make the allocator (including statics & stack) always add a gap of one element*, then I think it's possible to have an unambiguous mapping of pointers to integers.
*) hmm, that’d have to be size of maximum possible element
let a = 1;
let b = &a as *const i32;
let c = (b as usize) + 2000;
let d = c as *const i32;
This code must not be undefined behavior, because it's allowed in safe Rust. Dereferencing `d`, on the other hand, must be undefined behavior, even if `d` happens to end up with the same integer representation as another pointer.
let a = 1;
let b = &a as *const i32;
let c = b as usize;
let d = c as *const i32;
assert_eq!(unsafe { *d }, 1);
The memory model must allow you to round-trip from pointer to integer and back again. Otherwise, you wouldn't be able to use byte operations like `libc::memcpy` to move pointers around.
This can be solved by adding extra metadata to pointers, not integers:
enum Pointer {
    Valid { id: usize, offset: usize },
    Invalid { int: usize },
}
with `Invalid` created at the point of the cast. Dereferencing an `Invalid` pointer is of course UB.
No, because this needs to be valid:
let a = [1; 2001];
let b = &a as *const _ as *const i32;
let c = (b as usize) + 2000;
let d = c as *const i32; // how does the memory model know if `c` is within an allocation
By keeping an index of all valid allocations (GC, essentially).
for allocation in all_allocations {
    if c.is_within(allocation) {
        return Valid { id: allocation.id, offset: c.offset };
    }
}
for allocation in all_allocations {
    if c.is_one_past(allocation) {
        return Valid { id: allocation.id, offset: c.offset };
    }
}
return Invalid { int: c };
Then this produces a valid pointer:
let _x1 = [1; 2001];
let a = 1;
let _x2 = [1; 2001];
let b = &a as *const i32;
let c = (b as usize) + 2000;
let d = c as *const i32;
Indexing into either `_x1` or `_x2`, depending on whether memory addresses on the stack go up or down.
As a strictly hypothetical and/or pedagogical question of no practical value, since I’ve never written bit twiddling or pointer frobnication code, would we want to support code like:
let a = 1;
let b = &a as *const i32;
let c = (b as usize) + 2000;
let d = c as *const i32;
or
let a = 1;
let b = &a as *const i32;
let c = b as usize;
let d = c as *const i32;
assert_eq!(unsafe { *d }, 1);
or
let a = [1; 2001];
let b = &a as *const _ as *const i32;
let c = (b as usize) + 2000;
let d = c as *const i32;
if Rust didn’t already support them? Do they have some intrinsic motivation?
Even in the simple round-tripping pointers case, I’m missing why you’d want to make it a usize if you aren’t going to do anything with it when it’s a usize.
Yes. If your integer happens to hit a valid allocation, then it's fine. `rand() as *const u8` may sometimes be a valid pointer too.
It’s a much simpler model than trying to ascribe origin and intent to integers. I should be able to write a pointer cast as an integer to a file, or send it over a network, and back, and still have it be a valid pointer.
No.
Not only would that memory model basically disallow all forms of memory access reordering, it would also make things like undefined behavior sanitizer useless as a tool for finding buggy code. It would make code that relies on the internals of the memory allocator valid according to the memory model.
Unless I'm fundamentally misunderstanding it, that model makes Heartbleed defined behavior.
There's no motivation to have them, and they're technically forbidden. It's more of a problem that they do exist, and Rust allows them to exist for performance and interoperability reasons. Basically, it's because `usize` is used for both pointer size and offset size.
If Rust allowed pointers to be larger than offsets (e.g. 128-bit pointers, but 64-bit offsets), then it would be possible to prevent the ambiguity in all these cases, essentially by encoding the whole `(id, offset)` pointer representation as an integer. But because `(id, offset)` has to fit in the same space as `offset`, something has to give.