Pointers Are Complicated, or: What's in a Byte?

RalfJung · July 24, 2018, 2:41pm

This summer, I will again work (amongst other things) on a “memory model” for Rust/MIR. However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that “pointers are simple: they are just integers”. Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers.

I also want to define a piece of the memory model that has to be fixed before we can even talk about some of the more complex parts: Just what is the data that is stored in memory? It is organized in bytes, the minimal addressable unit and the smallest piece that can be accessed (at least on most platforms), but what are the possible values of a byte? Again, it turns out “it’s just an 8-bit integer” does not actually work as the answer.

I hope that by the end of this post, you will agree with me on both of these statements.

https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html

bgeron · July 24, 2018, 2:52pm

Looking forward to seeing your “2018 memory model”! Is it going to be similar to Compcert’s memory model?

RalfJung · July 24, 2018, 2:55pm

No, not at all. CompCert’s memory model is actually pretty close to the simple model I described in the post.

It’s going to be more like @arielb1’s and @ubsan’s ACA(-like) models, but operational.

H2CO3 · July 24, 2018, 3:11pm

Interesting post.

However, this simple model starts to fall apart once you consider pointer-integer casts. In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable whose value is a pointer (i.e., an allocation-offset pair). Multiplying that integer by 2 leads to an error, because it is entirely unclear what it means to multiply such a pair by 2. A full definition of a language like C++ or Rust of course cannot take this shortcut, it has to explain what really happens here

I don't understand why that is, or at least why it is a problem. If you cast a pointer to an integer, then the resulting value is an integer, now isn't it? Why would you care where it came from? After all, it's not a pointer anymore, you don't want to use it as a pointer – you want to use it as an integer, that's why you cast it away. Why would you need to specify (in a formal model) what pointer operations you can perform on an integer, or what integer operations you can do with a pointer? Probably none, they are distinct types…

However, this means that a simple operation like loading a byte from memory cannot just return a u8. What if that byte is part of a pointer?

Again, why would you care? If you wanted the pointer, you would have read the entire pointer, not only its first byte. If you want to copy a pointer, using memcpy for example, you would just… copy all of its bytes, and you could be sure it's the same pointer that you started with. I don't see the point in this, or why it would be useful.

notriddle · July 24, 2018, 3:52pm

Because you need to be able to cast the integer back into a pointer again.

RalfJung · July 24, 2018, 4:07pm

How do you compute that integer? All you have is some ID for the allocation, and an offset into this allocation.

You could start to decide that allocations have not just an ID, but also a base address... and if you take a couple more steps, you arrive at this 2015 paper. Which still has some problems, like making a pointer->integer cast a side-effecting operation that cannot be removed even if the result is unused.
Also, of course, you then have to choose an arbitrary base address, which is really hard for interpreters like miri. (They can effectively only explore the execution with one such choice, and may miss issues that arise with other choices.)

So, the alternative we used for miri is that a usize can be an integer, or it can be an allocation-offset pair (aka a pointer) that was obtained by casting from a pointer. If all you ever do with the usize is add 2 and cast back, you will not even notice that you had a somewhat "strange" value. Don't try multiplying it by 2, though...

You have to define what happens when loading one byte of a pointer. The result is stored in a local variable, which value does that variable have? You must make a choice, maths doesn't let you say "uh I'd prefer not to answer". So, you can say this is UB, but then byte-wise memcpy is UB if the data you copy contains a pointer. Not great. (But in fact not uncommon either, in fact CompCert does this.) Or you do something like what I described. There are other solutions, of course.

H2CO3 · July 24, 2018, 5:07pm

Well, how do I? How does rustc do it? How does LLVM do it? How does malloc do it? There has to be some sort of rule. For heap allocations, the library/kernel/whatever searches for unallocated memory, marks it as allocated, and e.g. returns its virtual address, which is representable by a single integer. For stack allocations, tell the dynamic loader to add such and such to the stack pointer. And so on.

If I want an integer which I multiply by 2, I don't want to cast it back to a pointer because the operation as such doesn't make sense, regardless of the representation. I.e. I still don't see why I need to care about someone multiplying the integer by 2 in order to "obtain a new pointer" whereas they certainly won't, because that's not an operation that makes sense to do to pointers.

Of course you have to define. But don't pointers have an in-memory representation? Literally everything in RAM is just a bunch of contiguous bytes. Why is there a need to do special tricks with pointers (or with any other type)? If I want the n-th byte of a pointer, then I want the n-th one of the however many (4 or 8 or 16 or so) bytes a pointer takes up when its physical representation resides in RAM.

I didn't assert that it would be a solution anywhere – I don't believe it's necessary.

RalfJung · July 24, 2018, 5:24pm

That's not true, and part of the point of this post is to explain that. (Or rather, it is true if you define Byte the way I do in the post. It is not true if you define Byte as "an element of 0..256.) You can't just assume there to be an in-memory representation. You have to define it. And these complex pointers do not fit into "realistic" representations that have 8 bits per byte. Hence the problem.

In other words, those complex pointers don't really exist on the machine. They are an abstract concept introduced to define the language semantics, in something like a "virtual machine". I wrote a blog post last year that explains this idea in more detail.

So you are saying it is not a problem that this operation is not allowed? That's a fair position.

Unfortunately for us, it is possible to do this in safe Rust: (Box::new(0).into_raw() as usize) * 2. So we kind of have to define what this does, we cannot make it UB.

Another use case is having a HashMap where the key is a pointer type. Then you need to hash pointers, which typically involves things "much worse" than multiplying by 2.

You also didn't say what to do instead, though.

Or are you not accepting that in a program like

use std::mem;

unsafe fn memcpy(src: *const u8, dest: *mut u8, n: usize) {
    for i in 0..n {
        let v = *src.offset(i as isize);
        *dest.offset(i as isize) = v;
    }
}

fn main() { unsafe {
    let x = Box::into_raw(Box::new(0));
    let mut y : *mut i32 = mem::uninitialized();
    memcpy(&x as *const _ as *const u8, &mut y as *mut _ as *mut u8, mem::size_of::<*mut i32>());
    let z = Box::from_raw(y);
    drop(z);
} }

we have to define what the value of v is each time around the loop? This is reading a pointer byte-per-byte. We have to explain what happens. We have to pick the set of "values" v can have, and say which of those it is that it actually has in this execution. I am explaining in the post why just considering integers is not enough for this set of values.

Those are all implementations of the language. None of them defines, abstractly, what the behavior must be for all conforming implementations. It would be a rather bad choice to say "you must use exactly what glibc's malloc/jemalloc do" just because that's what rustc happens to do right now.

Such a concrete model is also not suited for optimization, because it specifies way too many low-level details. For example, most of these fancy alias-analysis based optimizations would be entirely illegal (and my post explains why) if LLVM would actually let you rely on the fact that the pointers you are getting are integers computed by malloc.

RalfJung · July 24, 2018, 5:49pm

I have added a paragraph that I hope helps explaining this:

One important point to stress here is that we are just looking for an abstract model of the pointer. Of course, on the actual machine, pointers are integers. But the actual machine also does not do the kind of optimizations that modern C++ compilers do, so it can get away with that. If we wrote the above programs in assembly, there would be no UB, and no optimizations. C++ and Rust employ a more "high-level" view of memory and pointers, restricting the programmer for the benefit of optimizations. When formally describing what the programmer may and may not do in these languages, as we have seen, the model of pointers as integers falls apart, so we have to look for something else. This is another example of using a "virtual machine" that's different from the real machine for specification purposes, which is an idea I have blogged about before.

theduke · July 24, 2018, 6:04pm

@RalfJung thanks for your delightful posts.

What is the formal name for this kind of theoretical language analysis and formalization? Do you have any literature to recommend? (not papers, more like textbooks)

CAD97 · July 24, 2018, 6:37pm

The way I see it:

The physical representation of a pointer is the address in memory.

The abstract identity of the pointer is both of that reality – that it points to something – and the identity of what is being pointed to. It’s not ~~valid~~ useful to say pointer::from_unsafe(rand()) for good reason, and that difference is why a pointer is not just a number.

Formalizing this difference is hard, and what RalfJung is attempting to get closer to here, and in the coming 2018 abstract memory model.

H2CO3 · July 24, 2018, 6:52pm

It certainly is. The bits in the RAM are always in either the state 0 or 1. Not in "undefined", "both", or something else. Just 0 or 1. You can't change that fact by "redefining" them.

But the hardware, i.e. physical reality defines it. You can try to map other abstract concepts to it, but it doesn't make much sense if you want to do actual work with the language on an actual hardware implementation.

It's not doing what I was referring to though, because you didn't cast the pointer back to a usize. As long as you only have that integer, it can't cause a problem, because you are not treating it as a pointer. Wouldn't it be UB to write the following, i.e. cast the integer back and use the pointer, or just even cast the non-pointer-valued integer to a pointer?

unsafe {
    *((Box::new(0).into_raw() as usize) * 2 as *const i32)
}

And again, for that you can (and typically do) just convert the pointer to an integer based on some sort of predefined rule, then manipulate it as an integer, and don't care what the particular value is any more, because you won't try to reuse it as a pointer, because that was not what you wanted to do in the first place.

I totally am accepting that. What I say is that it's trivial: its value is what it is. There is a pointer in memory. It occupies some space, and it is populated with a particular bit pattern. The bit pattern is broken up into groups (of 8, typically) bits. The value of v on each iteration is the value you obtain by interpreting that bit pattern as an (8-bit) integer.

I am a long time C and C++ programmer and I am perfectly aware of the rules around strict aliasing, array access, and one-past-end pointers. I could recite your post's first half by heart. I'm not by any means trying to assert that "pointers ARE just integers" because I know they are not. I feel either or both of us misunderstands the other one's point. I'm only trying to say that there is a sensible and natural way of defining pointer -> integer conversions, without that affecting alias analysis. See, my point is that pointers and integers are different types, and when you cast a pointer to an integer, you lose information. If you want to cast to a pointer to something that isn't a pointer yet it doesn't lose the "pointer-ness" of the information, then I'm sorry but that's not a simple integer, and you shouldn't be using or referring to it as such.

H2CO3 · July 24, 2018, 6:56pm

That's exactly right! Again, observe how this is talking about the invalidity/unsafety of an integer -> pointer conversion, but not the other way around. Because a pointer is more than just an integer, it's not possible to recover the lost information when you cast a pointer to an integer and then back. Trying to define integers in a manner that you still can seems very much like madness, because this means that all integers basically need to be typed as fully-fledged pointers at all times. That just doesn't seem useful.

CAD97 · July 24, 2018, 7:07pm

Are you suggesting that ptr as usize as *const T should be UB then? Because if we get rid of the information when turning it into a number, we can’t get it back.

Conveniently, (*const T)::offset is implemented as an intrinsic, so you could say that the manual implementation is UB (((self as usize) + count * size_of::<T>()) as *const T), but that seems wrong to prohibit.

For as long as it’s safe to turn usize into *const T, we need to define what that means. And unless you want to forbid the obvious no-op of round-tripping, the extra pointer context needs to hang around while it’s temporarily a “plain old number” as well.

hanna-kruppe · July 24, 2018, 7:35pm

This sort of inflammatory comparison wouldn't be appropriate, or helpful for convincing anyone, even if the underlying point were correct.

Moreover, it's not even right. As explained for example in the paragraph @RalfJung added later, we're talking about an abstract machine that defines the semantics of the Rust language, not about implementations on real machines. It is a category error to dredge up physical RAM in this discussion. It must be possible to implement the language semantics efficiently on physical machines, but for the purposes of defining semantics mathematically, including undefined behavior that enables optimizations, it is not just perfectly possible, but eminently reasonable to have a richer data model than a flat array of binary symbols.

RalfJung · July 24, 2018, 8:41pm

In the machine I have chosen here, they are not just different types, they correspond to two disjoint sets of possible values that variables can have. I have argued in the post for why the additional class of values (Pointer) is needed. You keep saying it is not needed but without actually countering my arguments.

Have you read the paragraph that I added? Here's the part that answers this statement specifically:

If we wrote the above programs in assembly, there would be no UB, and no optimizations.

You seem to be thinking in terms of assembly. That's fine, but it's also off-topic in this discussion. The goal here is to find a definition that actually allows the optimizations LLVM et al do. It is not possible to do that while sticking to this simple model of RAM that you are promoting. I have some counterexamples in my post.

That's... that's not a definition. That's not saying anything.

That's explaining one unknown in terms of another, which doesn't help. Now you have to define the concept of an "address" in memory. I have used the term "pointer" in my post, but could have replaced it by "address" equivalently -- both identify the part of memory that we are talking about in some operation. And as it turns out, that "address" is not (in a language like C++ or Rust) just an integer, it has some more complex structure for the reasons laid out above.

It's actually perfectly valid to say that (rand() as usize as *mut u8 is even safe code). Just using that pointer is not valid.

But this is not the problem when it comes to formalization. It is possible, not even very hard, to define a memory model where pointers are just integers, where pointer-integer cast is trivial, where none of this is a problem. Unfortunately, we then also lose tons of optimizations, which is why people only do this when defining hardware, but not when defining programming languages.

H2CO3 · July 24, 2018, 8:47pm

No-no, I’m not saying that it’s unconditonally UB. I’m saying that the values themselves don’t know when it is – the programmer has to know. If an integer is really just an integer, then value-wise there’s no difference between a pointer obtained from one integer or another. If the information as to how those integers were obtained persists in the value, then its type is not just an integer, but something more with some metadata that allows correct conversion back to a pointer.

RalfJung · July 24, 2018, 8:47pm

I'm glad you like it. I hope the other discussion here doesn't make this answer drown entirely...

So, I didn't learn this stuff from textbooks, more from lectures or papers. So I haven't read any of what I am going to recommend. But I asked around and was told that a good start is Winskel's "Formal Semantics of Programming Languages". A standard book is also Pierce's "Types and Programming Languages".

H2CO3 · July 24, 2018, 8:50pm

I don't see what is inflammatory in writing down a true statement.

The post was also talking about physical RAM as well. It's all about bytes, addresses and reading from memory. If that wasn't the intent, it should have been clearer. It also seems to me, again, that when OP refers to "integers", he doesn't mean ordinary integers, but a more powerful data type that's an integer with some other metadata. That is also confusing.

CAD97 · July 24, 2018, 8:52pm

There, that's the disconnect.

We're trying to define the type inclusive of that "the programmer has to know" context. We're trying to define exactly when it is and isn't UB, and you can't just say "the programmer has to know", because there is some rule that we have to understand to mandate implementations / soundly optimize.

That extra metadata is the "programmer has to know" external context.

There is no intent to change physical representation. This is merely about formalizing when a pointer is a valid pointer to a value. That external knowledge about validity can be formalized. That's what this is an attempt to do.

Topic		Replies	Views
Understanding of rust Raw pointers libs	4	977	February 9, 2023
Pointers Are Complicated II, or: We need better language specs	148	9268	January 17, 2021
Reviving the bit-data discussion language design	36	5845	March 25, 2019
Re: why Rust references are not pointers language design	12	5063	March 25, 2019
Int2ptr and runtime provenance models Unsafe Code Guidelines	24	2808	February 12, 2022

Pointers Are Complicated, or: What's in a Byte?

Related topics