Pointers Are Complicated, or: What's in a Byte?

Two reasons:

  1. Bit packing tricks like "tagged pointers"

    A popular implementation trick for dynamically-typed languages exploits pointer alignment rules to store numbers in a place where pointers would normally go. For example, if my JavaScript runtime always stores objects on eight-byte boundaries, then the bottom three bits will always be 0. I could, then, store 61-bit integers inline, without a pointer indirection, set a lower bit to 1 whenever I do that, and add a branch to every arithmetic operation to see if it's stored inline or if it needs to follow a pointer.

    Performing bit-level ops on a pointer requires you to convert it to an integer, store the tweaked integer, and then, before dereferencing it, turn it back.

  2. Byte-shoveling code

    ptr::copy takes its parameters, casts them to *const u8, and calls memcpy. This should produce valid pointers on the other side, even though they're being round-tripped through u8.

The alternative of tracking what pointer an integer is derived from can get arbitrarily complex, e.g. when you xor pointers, the integer “holds” all of them:

let mut a = 0;
let pointers: Vec<_> = (0...1_000_000_000).map(|_| malloc(1) as *mut u8).collect();
for ptr in pointers {
   a = a ^ ptr as usize;
}
pointers.remove(rand() % pointers.len());
for ptr in pointers {
   a = a ^ ptr as usize;
}
let b = a as *mut u8; // this will recover the pointer that was randomly removed

So I can shove a million pointers into a single integer, guarantee it can be cast to a valid pointer, but make it impossible to statically know which one.

3 Likes

Wouldn’t a lot of the problems go away if the int -> pointer cast would take a base pointer as an additional argument? If a pointer is a triple (usize, offset, capacity), one could make pointer -> int work by just adding the first two elements, and (int, basepointer) -> pointer would just work by computing the offset as int - basepointer.usize, and taking the capacity from the basepointer. It would be easy to spot “non-valid” pointers because their offset is too large.

Maybe that’s too simple, but if you want to go int -> pointer, it feels like you need something like this.

Now this is the discussion I was hoping for -- lots of people coming up with examples and models to figure out how to explain pointers. :smiley:

Yes and no. Yes, miri keeps the extra hidden state through integers, but that's just because implementing it different would be a whole bunch of extra work. However, even if a round-trip through integers removes all that extra state (which it does! not doing so invalidates some optimizations), a round-trip through memory must preserve all the extra state. So we still need "fancy" bytes that can carry this state.

Only if we also ignore restrict. But in Rust, we specifically want to use LLVM's noalias (which is kind-of the same thing), so we cannot ignore restrict. So, unfortunately, such a simple approach cannot work.

I am not even sure about the "must be UB". You are doing your computations as integers.

For example, taking two points into integers, then swapping them using the XOR trick, and then using them again -- that better be defined:

fn swap (x: &mut usize, y: &mut usize) {
  *y = *x ^ *y;
  *x = *y ^ *x;
  *y = *x ^ *y;
}

let mut a = Box::into_raw(Box::new(0)) as usize;
let mut b = Box::into_raw(Box::new(0)) as usize;
swap(&mut a, &mut b);
// All defined behavior
let a2 = Box::from_raw(b);
let b2 = Box::from_raw(a);
drop(a2);
drop(b2);

miri does not support this. But a real model should.

The one execution where rand() returns the same as another pointer that I cast to an integer is actually valid:

let x = Box::new(0).into_raw() as usize;
let y = rand();
if x == y {
  // Now this must be valid. Integers cannot have "extra hidden state", so x and y can be used interchangeably.
  // (Actually, there are optimizations that will replace x by y if they see fit.)
  let z = Box::from_raw(y);
  drop(z);
}

I really enjoyed reading this. I truly appreciate this kind of formalization. Seeing this kind of work shoring up the underpinnings of Rust makes me feel that Rust is destined to be THE major language of the coming generation. I truly am in awe of all of those, like yourself, work to make Rust such a great language and ecosystem. Shout out to yourself and all the developers of Rust and Rustc! Thank you all!

5 Likes

Are you saying from a "UB checker" perspective or from a perspective of a normal implementation? The idea of these id-offset pairs is to just use them abstractly to specify program behavior. Implementing this is horrible for performance. Lucky enough, we can compile this the way you all expect it is compiled, and that compilation is correct -- we can just not detect UB any more. But usually, we just want to run our code, and run it fast, so that's okay.

I don’t feel like memcpy should (necessarily) maintain pointerness of values. but clearly the standard authors disagree.

(in other words I want a language where this can be done without UB: https://stackoverflow.com/questions/51344116/how-can-i-print-a-dangling-pointer-for-demonstration-purposes )

Given that memcpy is (conceptually, at least) used to pass around larger arguments, not preserving the values exactly would lose tons of optimizations.

In fact the memory model I am going to propose crucially relies in reliably provenance tracking for references. Copying them around in any way must preserve the extra "pointer" data.

That's a completely separate question. It is about accessing indeterminate values, which is not possible in any way and has nothing to do with pointers per se.
However, I would be all in favor of removing the rule that says that dangling pointers are indeterminate values. (LLVM doesn't have it. Rust doesn't have it.) The same effect for optimizations can be achieved by saying that just pointer comparison of dangling pointers is UB.

So, you can actually have such a language. It is called Rust. :slight_smile:

fn main() {
    let x = Box::new(0);
    let x_ptr = &*x as *const _;
    drop(x);
    println!("Dangling pointer: {:?}", x_ptr);
}
1 Like

I assume miri currently has x != y for all possible executions, and, based on your OP, it'll panic or something if you try to observe the numeric value of an integer-typed-pointer (so it can get away with saying "nuh uh not equal" no matter what the memory address happen to be).

Yeah, the "real" compiler can't get away with that. So, you're right, it can't be that simple.

I'm guessing the memory model you want to create is going to distinguish between pulling a pointer out of an integer, which could produce pointers with any arbitrary aliasing, and pointer offsetting (which would also be the underlying operation behind slice indexing) which is not allowed to escape an allocation?

This is a great post, but I feel like the basic thesis, although pretty simple, is not really stated explicitly; doing so at the beginning of the post might clear up some confusion.

It is possible that I have misunderstood this thesis; however, I'll do my best to state it as I understand it:

Every data type in a language has both a run-time representation, which is equivalent to an integer or an array of integers, and some associated compile-time metadata (or metadata managed by an interpreter), which must be accounted for in a formal model of the language semantics.

In the case of pointers, the metadata includes both type information (which is explicitly part of the language) and allocation information (which is not). Since the allocation metadata is not explicit in the language, and since casting from pointers to integer types or conversion to byte arrays is possible, it must be possible in the compiler, interpreter, or formal model to associate the allocation metadata with non-pointer types.

I haven't thought of a simple way to summarize the bit about uninitialized memory, but I think that part is fairly simple anyway once this idea of metadata that is neither present at runtime nor explicit in the language is understood.

I plan to change miri to just error (not the same as panic) when you do this comparison. Maybe later we'd like to have a "lax mode" that would allow more things (but might also miss more things), that would then say integers can never equal a pointer.

Yes indeed. LLVM does that as well, the former is normal arithmetic and the later is getelementptr.

I'm not sure if I fully agree with this.

The "language", in its abstract sense, should be described without talking about machine registers or RAM. So I don't think of the formal stuff as existing "next to" the integers on the machine. When looking at the formal abstract model, these abstract pointers are the only thing that exists. It's like a VM that has a different notion of what RAM is.

When compiling code to assembly, we have to compare what the source program (in our VM) and the target program (on a real CPU) do. If they observably do the same thing, we are good. It doesn't matter that internally, they function entirely different.

5 Likes

Would something like this work as a memory model?..

There is a set of memory allocations. These could refer to allocations on the heap or on the stack, whatever. An allocation consists of a base address and a type, the base address however is purely abstract - it is represented as a free variable of type usize, not as some specific value.

More specifically, the base address is represented by 64 boolean variables (I’ll assume a 64 bit architecture throughout this post). For some allocation named $a I’ll refer to these variables as $a0 to $a63.

Next we define a type of boolean polynomials over these variables.

struct Anded {
    anded: HashSet<VarId>,
}

struct Bit {
    xored: HashSet<Anded>,
}

Here, VarId refers to some specific boolean variable (eg. $a23). A Bit represents formulas like:

($a2 * $a23 * $b7) + ($a0 * $b1) + $c3

Where * is AND and + is XOR. The important thing about these polynomials is that they are the normal forms of a more generalised language of expressions containing variables, 1, AND, OR, XOR and NOT. eg. If I have some arbitrary expression like:

(!$a0 | $b23) & $c45

I can normalise to (in this case)

($a0 & $b23 & $c45) ^ ($a0 & $c45) ^ $c45

This means that any two boolean expressions, consisting of boolean variables and these operations, have decidable equality. Note that 1 is just the AND of an empty set of variables.

Next we define all our int and pointer types as being arrays of Bits.

struct U8([Bit; 8]);

struct Ptr([Bit; 64]);

// etc.

We can also define all our arithmetic operations on these types. eg.

U8([0, 0, 0, 0, 0, 0, 0, $a0]) + U8([0, 0, 0, 0, 0, 0, 0, $b0])
=> U8([0, 0, 0, 0, 0, 0, $a0 * $b0, $a0 + $b0])

When we perform a memory allocation, or get the address to a value on the stack, we get a value of the form Ptr([$a0, $a1, $a2, ... , $a63]). Note what this is saying: the allocation address is some arbitrary value, unknown to the model, but can still perform any kind of bitwise or arithmetic operation on it.

When we dereference a pointer, the pointer must be of the form $a + x where $a is some allocation and x is a valid offset. Anything else is undefined behaviour.

Reading uninitialized data also gives us variables, only these variable don’t correspond to any allocation.

Does this work? It seems to match my mental model, allow a lot of safe ways of generating valid addresses while still permitting optimizations.

1 Like

I didn't say anything about hardware; I said "equivalent to...an array of integers". I mean "equivalent" in the sense of mathematical isomorphism, i.e., every model of runtime computation is equivalent to the operation of a Turing machine, and the state of any section of the data tape on a Turing machine at any point in the computation process is representable as an array of numbers.

A model (formal or not) of a language is by definition a mathematical description of how the written symbols determine the behavior of the runtime. This is true whether the runtime is an actual execution on a piece of silicon or an abstract model of computation.

The major point, it seems to me, is that there is metadata that is neither indicated in the symbology of the language nor present in the data being used or operated on during the computation. This is what I was trying to say above, and what I believe is the main point of your blog post.

1 Like

I’ve been mulling this post over for the last few days… @RalfJung, do you think there’s value in trying to write down a fictional struct or enum representation of a pointer? You did something similar with Byte, and I think that was pretty illustrative. E.g.,

struct Index { /* opaque */ }
enum Pointer { // I feel like some of these need a 'a...
    Null,
    Exec(Segment, Index), // .data, .rodata, .text, etc
    Stack(Thread, Frame, Index),
    Alloc(Allocator, Index),
    // more exotic stuff, unclear if it should be included
    Gc(???),
    Rma(???), // remote memory access!
    Register(???), 
    // ..
}
1 Like

Don’t forget to add volatile to register.

Wow, this gets mighty complicated. It reminds me a bit of this memory model based on symbolic values. They are using a different algebra though.

One issue I think with your algebra is IIRC, arithmetic operations like multiplication become very complex on bit patterns. You very quickly end up in a situation where you need a SAT solver to test if two elements of the algebra are equal.

I hope we can find something less complicated.

What do you mean by "type"? A Rust type? I strongly believe that memory should be untyped. CompCert originally had typed memory but moved away from it. LLVM has untyped memory. C... is somewhat unclear, but it does allow type punning when "done right", and C with -fno-strict-aliasing has untyped memory (I would argue).


I do. :slight_smile: The one I described in the post is very simple:

pub struct Pointer {
    pub alloc_id: AllocId,
    pub offset: Size,
}

Oh, I didn’t notice it! That’s even better than I expected!

I feel like, for completeness, an Allocation could also be shortly defined in the post, even as a simplification containing only the vec of bytes. It would help remind the reader that we use the AllocId in the pointer to get the alloc, and check that the pointer is “inbounds”?

I didn’t want to define a full memory model. The post is already pretty long as it stands…

Yes I can understand that. Thank you.