Pointers Are Complicated II, or: We need better language specs

Some time ago, I wrote a blog post about how there’s more to a pointer than meets the eye. One key point I was trying to make is that

just because two pointers point to the same address, does not mean they are equal in the sense that they can be used interchangeably.

This “extra information” that distinguishes different pointers to the same address is typically called provenance. This post is another attempt to convince you that provenance is “real”, by telling a cautionary tale of what can go wrong when provenance is not considered sufficiently carefully in an optimizing compiler. The post is self-contained; I am not assuming that you have read the first one. There is also a larger message here about how we could prevent such issues from coming up in the future by spending more effort on the specification of compiler IRs.

https://www.ralfj.de/blog/2020/12/14/provenance.html

38 Likes

This provenance seems to me to be an elided proposition, and such propositions cause problems when we later require the proof and have not stored it. That is, we can have a plain pointer, a pointer plus a proof that it is safe to dereference it, or a proof bounding what that pointer can access. Similarly, an integer coming from a pointer cast could be accompanied by a proof of what properties that pointer had.

This is also closely related to the different concepts of equality. For example, if two functions give the same output for each possible input, a mathematician will say they are (extensionally) equal. An engineer would see that their implementations differ and say the functions themselves are different; indeed, just having different names can be enough reason to consider two functions different. For pointers the question is: does pointing to the same address make two pointers equal, or do they also need to have equal provenance?
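For pointers, the same distinction can be sketched like this (my own minimal example; a and b are hypothetical):

char p[1], q[1];
char *a = p + 1;  // one-past-the-end of p; provenance: p
char *b = q;      // provenance: q
// On some targets a and b compare equal as addresses ("extensional"
// equality of the values), yet only b may be used to access q[0]:
// a's provenance is still p.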

However, in Rust we do not have explicit proofs. The compiler can check many things, such as correct lifetimes, and in some special cases we resort to runtime checks, as in RefCell and Rc. Could this be the approach for pointers: have the compiler carry intrinsic proofs for the common cases, plus some special types for more complicated ones? The usage of raw pointers in Rust is already quite special, and it seems hard to me to pin down their common use.

I am going to assume that having fully explicit proofs as language terms in Rust (compare with the Coq language) is impossible to obtain. Thus, we cannot store full proofs alongside pointers as their provenance. We can still have an allocation_id and offset, much like the counter in Rc.

The most important point is still for the reference to be clear about what is being done in every case. For example, LLVM could take only a pointer for some optimizations, and a pointer plus provenance for others.

I don't think there is any connection between provenance and proofs / proof irrelevance (other than a vague notion of being "difficult around equality"). In particular, this paper formally defines a memory model with provenance (and it is certainly not the first to do so), and the provenance there has nothing to do with any kind of proof.

5 Likes

I was thinking of &T as a pointer plus a proof of liveness, and &mut T as a pointer plus a proof of exclusivity. The 'proofs' are implicit, costing only compile time. In the case of Rc, we have unsafe blocks in its implementation because we cannot state the proposition AmountReferalsTo(self.value) = self.strong, where strong is the number of strong references. That is, being able to state arbitrary propositions could reduce the number of unsafe uses.

For the case of pointers, if we are sure that a given model (like block id plus offset) is enough, then the implicit proposition is always the same and there is no need to state it. Stating propositions could then be useful for allowing the user to define new pointer models, which should be extremely rare. I am not actually proposing that we have to write these propositions; I was just acknowledging that they implicitly exist, that they can be chosen relatively freely, and that the compiler must either be able to track them or incur a runtime cost.

One could also argue that &T is already a pointer plus provenance, and that the proposed model is just weaker than &T and stronger than the current *T.

1 Like

Excellent read. I like the overview table. Very interesting to see how the memory model falls apart on parallelism and on optimisations over indirection.

Regarding [Out of memory]: "Designing a memory model that is sound in this respect seems challenging and it is not clear the result would be useful in practice."

With custom allocators this should be possible: you either allocate the whole memory pool up front or fail, and that becomes part of the program semantics, which can be used to decide where to put what memory. However, operating systems need to support this kind of action (and the few 0.x% of time saved would need to justify the complexity). With the current situation of memory overcommitment being enabled by default, this seems unlikely.
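As an illustration of the allocate-or-fail idea, a minimal sketch (names like pool_init are mine; note that with overcommit, a successful malloc still does not guarantee physical pages, which is exactly the OS-support problem):

#include <stdlib.h>

static unsigned char *pool;
static size_t pool_size, pool_used;

int pool_init(size_t size) {
  pool = malloc(size);  // the only point that can fail due to OOM
  if (pool == NULL) return -1;
  pool_size = size;
  pool_used = 0;
  return 0;
}

void *pool_alloc(size_t n) {  // ignoring alignment for brevity
  if (pool_size - pool_used < n) return NULL;  // deterministic, local failure
  void *p = pool + pool_used;
  pool_used += n;
  return p;
}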

Moving to a more technical aspect: one thing I have not seen considered, either in the blog or in the paper, is reusing a pointer variable. In the following example code, how should LLVM track the provenance? Is the second comparison UB, a physical-to-logical pointer comparison, or something else?

char *p = malloc(..);
char *q = malloc(..);
*p = compute();
*q = 0;
char *foo = p;  // foo has the provenance of p
if (*foo == 3)
{
	foo = q;
}
// foo provenance: any?
if (foo == q)
{
	*foo = 7;
}
check(*q);  // can it have changed?

Another similar task would be to conditionally swap two pointers without losing the non-aliasing information between them.
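For concreteness, that task might look like this (a sketch; cond stands for an arbitrary condition):

char *a = p, *b = q;  // p and q do not alias
if (cond)
{
	char *t = a;
	a = b;
	b = t;
}
// a and b still never alias each other, but each one's provenance is now
// "p or q" -- can the compiler keep exploiting the non-aliasing?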

Pointer equality is never UB (in C, at least, unlike pointer ordering operators). I would like to believe that in a reasonable world, the above would be rewritten to

if (compute() == 3) {
  check(3);
} else {
  check(7);
}

since the only scenario in which foo and q are equal as pointers is if *foo was 3 (escape analysis can trivially show that no other thread has access to the lvalue foo, and p and q are guaranteed to not alias). The rest of the rewriting follows from folding the two if basic blocks into one, and propagating constants through pointers.

I'm not sure how provenance factors into this, exactly. The pointers never get cast to integers, so you can reason about them as opaque objects.

There is no need for casts to have problems. The first example in @RalfJung's paper only uses out-of-bounds accesses. The example I gave is very trivial, and as you have seen it can be simplified away before these considerations arise. It is not too hard to make a contrived example that cannot be simplified in such a way (although it perhaps admits other, less trivial transformations):

char *p = malloc(size);
char *q = malloc(2 * size);
compute(p, size);
compute(q, 2 * size);
char *foo = p;
for (size_t i = 1; i < size; i++)
{
	if (foo[i] == 10)
	{
		if (p + i == foo)
		{
			// p+i is logical to p.
			// foo starts logical to p, but later changes to q.
			// Is the compiler allowed to delete this branch?
			bad_thing();
		}
		else
		{
			strncpy(p + i, foo, size - i);  // still overlaps sometimes
		}
		foo = q + i;
	}
}

Initially, p plus something non-zero is greater than foo (which equals p). Later, when foo points into q, comparing it with p+i should always yield false. Thus, in both cases you may erase the p+i == foo branch. But since the foo variable lives outside the loop, is it possible to track the provenance?

I really liked the post! You put into words something that I've been feeling for a long time, that IRs aren't sufficiently formal. However, I disagree with this statement:

Do we need a formal mathematical definition to avoid such ambiguities? I do not think so;

My personal belief is that they should be formally mathematically defined. Whenever there is ambiguity, it's a coin toss as to what people will decide the ambiguity really means. This isn't just a problem for different people; I've had cases where I started out thinking that an ambiguity resolved one way, then slowly (and imperceptibly to myself!) changed my view. As a result, early code acted in one way, while later code acted differently, and their combination was a bug.

Formally defined, unambiguous IR has a second benefit: we can define higher-level languages in terms of the IR, which means that the higher-level languages are formally defined for free¹.

¹The formal definition of free is 'a ridiculous amount of work defining the IR in a formal, unambiguous manner, followed by a ridiculous amount of work mapping the higher-level language(s) to the IR in a formal, unambiguous manner that also captures what we really want the higher-level language(s) to do'.

6 Likes

I found both this post and your original to be very informative and well written.

I would first like to say that I am somewhat new to compiler development, and I apologize if any of my questions below are naive.

It seems to me that there is a somewhat conflated notion of what a pointer is, and that this causes a lot of confusion. On the one hand, pointers are just locations in memory: "they are just integers". On the other, they are used to represent some element of an array. Is there a reason these two concepts are not separated more? Is there something I am missing? For instance, in the example of casting a pointer to int and back, it becomes more obvious (at least to me) that the optimization is wrong if you think of it as casting from an "element of an array" pointer to a "just an integer" pointer.

Should I instead be thinking of the "just an integer" style pointer as a pointer to the 0th element of an implicit length-1 array? In that case, would it make sense for intermediate representations to annotate pointers with the array they are referencing? Is this a pointless distinction? To again use the pointer-to-int-and-back example, the optimization is more obviously wrong because we're casting from a pointer into one array (p) to a pointer into another (implicit/unknown) one.

Great post! However, I disagree with the conclusion that integers should not have provenance. If pointers that are cast from integers have no provenance, then dereferencing them is always UB. This would break a lot of code in practice, including parts of the standard library. A common example is adding metadata to a pointer by packing it into the alignment bits, which allows the result to fit in an AtomicUsize.
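For reference, that idiom looks roughly like this (a minimal sketch assuming 8-byte-aligned allocations; the function names are mine):

#include <stdint.h>

// Pack a 3-bit tag into the alignment bits of an 8-byte-aligned pointer.
uintptr_t pack(void *p, unsigned tag) {
  return (uintptr_t)p | (tag & 0x7);     // ptr -> int, tag in the low bits
}

void *unpack_ptr(uintptr_t v) {
  return (void *)(v & ~(uintptr_t)0x7);  // int -> ptr: the cast that must
                                         // preserve provenance
}

unsigned unpack_tag(uintptr_t v) {
  return (unsigned)(v & 0x7);
}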

In fact I believe your proposal would even violate the C standard which allows pointers to round-trip through uintptr_t while preserving their value, but I am not fully sure since the standard wording is a bit ambiguous as to whether provenance is preserved.

I would instead argue that integers and pointers should be treated identically in terms of provenance. However my definition of provenance is slightly different from yours: it is the set of addresses that a value may have been computed from, rather than a single value.

One thing to note about this definition is that it is conservative: it is always legal for an optimization pass to grow the provenance set of a value, all this does is pessimize later passes since it artificially increases the amount of aliasing in the program.

Let's look at the first optimization again with this definition:

// Original code
char p[1], q[1] = {0};
uintptr_t ip = (uintptr_t)(p+1); // provenance: p
uintptr_t iq = (uintptr_t)q; // provenance: q
if (iq == ip) {
  *(char*)iq = 10;
  print(q[0]);
}

// Transformed code
char p[1], q[1] = {0};
uintptr_t ip = (uintptr_t)(p+1); // provenance: p
uintptr_t iq = (uintptr_t)q; // provenance: q
uintptr_t ip2 = ip; // provenance: p + q
uintptr_t iq2 = iq; // provenance: p + q
if (iq2 == ip2) {
  *(char*)ip2 = 10;
  print(q[0]);
}

Several things happened here:

  1. We created copies of ip and iq and replaced later uses of them with those copies. This is valid since the copies are identical to the original values.
  2. We expanded the provenance set of ip2 and iq2 to cover both p and q. This is valid as explained above.
  3. Since ip2 and iq2 are identical in both value and provenance within the if body, we can safely substitute iq2 with ip2. Such a substitution is not valid if values have different provenance.

This transformation also works if we directly modify the provenance of ip and iq in place:

char p[1], q[1] = {0};
uintptr_t ip = (uintptr_t)(p+1); // provenance: p + q
uintptr_t iq = (uintptr_t)q; // provenance: p + q
if (iq == ip) {
  *(char*)ip = 10;
  print(q[0]);
}

You mention that allowing provenance on integers would block arithmetic optimizations such as y = x - x, but this is not the case: the value of y is independent of the value of x, so x is not part of the provenance set of y. However, a less sophisticated compiler may conservatively include x in the provenance set of y if it cannot determine that the result of a calculation is independent of its inputs.
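To illustrate with hypothetical annotations (given some char *p):

uintptr_t x = (uintptr_t)p;  // provenance set: {p}
uintptr_t y = x - x;         // value independent of x: provenance set {}
uintptr_t z = x + 0;         // still derived from x: provenance set {p}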

This definition of provenance does cause problems for tools like Miri, though: determining whether a result is independent of a calculation's inputs is intractable in the general case, so computing the minimal provenance set for a value is too. This might make it impossible to prove whether a particular pointer dereference is valid.

Phew. If you made it this far, thanks for reading through my late-night brain dump. Obviously you have much more experience in this area than I do, so I would be happy to hear about any holes in my model. I also realize that my definitions aren't very rigorous and probably need to be reviewed, but I hope that the general gist was clear enough.

6 Likes

My (very informal) understanding of pointer provenance so far has been that it's a property required in order for a dereference to be sound. That would mean that integers would also have to track provenance.

The requirement of provenance is how I would argue *(0x005FC460usize as *const usize) is always UB at language level, no matter whether anything is at that address. (If it's a globally known address, either the system header tags a definition or the compiler knows and treats that location as having no provenance requirements.)

One of the hard questions for me about pointer provenance is whether the following (C syntax)

char* p = malloc(1);
*p = '?';
char x;
if ( (uintptr_t)p == 0x005FC460 )
{
    x = *((char*)0x005FC460);
}
free(p);

contains potential UB (dereferencing an invalid pointer). My intuition says "yes, because provenance."

The heaviest hammer of all would be to say that integers carry pointer provenance, and to define every integer operation's effect on provenance. I personally would be comfortable saying that any "interesting" operations done to the integer value of a pointer need to be undone exactly in reverse to get back a valid pointer, but I admit that is a potentially strong position to take.
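As a sketch of what "undone exactly in reverse" would permit (hypothetical code; roundtrip is my name for it):

#include <stdint.h>

char *roundtrip(char *p, uintptr_t key) {
  uintptr_t v = (uintptr_t)p;  // ptr -> int
  v ^= key;                    // an "interesting" operation...
  v ^= key;                    // ...undone exactly, in reverse order
  return (char *)v;            // under this position: a valid pointer again
}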

Thanks all for the great feedback. :slight_smile:

The Abstract Machine (think: Miri) has provenance that it always tracks. This gives LLVM the permission to exploit provenance when it sees fit. However, the behavior of provenance is designed in such a way that LLVM does not have to track it all the time.

This is similar to uninitialized memory: the Abstract Machine has rules like "performing addition on an uninitialized integer is UB". Miri and the Abstract Machine always precisely track which things are (un)initialized. LLVM does not and cannot, and that is okay since treating something uninitialized as if it was initialized is correct -- the Abstract Machine and Miri are carefully designed to make this true. Imagine if we added an operation that tests if some memory is initialized. In the Abstract Machine and Miri this is easy to do, but now LLVM would actually be required to track precisely which things are initialized, and now we have a problem.

Provenance, like uninitialized memory, can and has to be defined in a way that the compiler may track it but does not have to track it.

Oh, absolutely. But when I tell the LLVM and Rust communities that the only way to solve this problem is a fully formal spec, I think the reaction would be "well that's way too much effort". Also, writing a fully formal spec requires a background that few people have. I think by phrasing things in terms of "make the spec precise enough for an interpreter", we give many more people the chance to contribute, and thus we might even reach the goal of a sufficiently precise spec sooner.

We should have a fully formal spec! All I am saying here is that there's something much easier that we can do, and that more people with less PL background can contribute to, that already goes a long way to avoiding these issues. Going from an interpreter like Miri to a mathematical spec is actually comparatively easy.


No, it does not. The original program in the example chain has no UB. Or rather, you are using the term "no provenance" in a different way than I am.

Pointers cast from integers need to have a form of "broadcast" provenance that allows them to access at least the allocation that they were created from. The details of this can be defined in many different ways. But in all of them, the goal of course is to not make it UB to use the resulting pointer.

For example, we could, at the moment the int-to-ptr cast happens, check which allocations exist at that location in memory, and define that the pointer obtained from the integer has the provenance of those allocations. This paper describes a slightly more elaborate but still fairly simple version of that model. This model says pointers have provenance, integers do not, and yet casting a valid ptr to an integer and back yields again a valid ptr (the ptr might be different than the one we started with, as in it could have different, less restrictive provenance, but that's okay).
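Concretely, that idea plays out roughly like this (a sketch; the comments describe the Abstract Machine, not anything the compiled program computes):

char *p = malloc(1);
uintptr_t i = (uintptr_t)p;  // integers carry no provenance
char *r = (char *)i;         // the cast checks what is allocated at address
                             // i and gives r ("broadcast") provenance for
                             // that allocation
*r = 1;                      // so this access is fine, not UB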

I didn't actually say what my definition of provenance is. :wink: Yours is certainly a possibility. However, this means that the first optimization becomes incorrect. You made it "correct" by adding comments, but those comments need to be reflected by actual operations in the code since they affect program behavior! So the first transformed program would have to look something like

char p[1], q[1] = {0};
uintptr_t ip = (uintptr_t)(p+1); // the cast implicitly adds the provenance of p to this integer
uintptr_t iq = (uintptr_t)q; // the cast implicitly adds the provenance of q to this integer
uintptr_t ip2 = add_provenance(ip, q);
uintptr_t iq2 = add_provenance(iq, p);
if (iq2 == ip2) {
  *(char*)ip2 = 10;
  print(q[0]);
}

This is a different optimization than the one I showed in the post. The optimization as shown there is, in general, incorrect in your model since integers have provenance.

The reason that adding an explicit add_provenance is crucial in your semantics is that if an optimization pass "grows the provenance set", it needs to ensure that all optimization passes running later work with the larger provenance set. The optimization pass cannot grow the set, do the optimization, and then shrink it again (which is what happens in your example if we "forget the comments"). That would be incorrect.

Also, replacing x-x by 0 is incorrect in your model, since 0 presumably has the empty "set of addresses" associated with it. It is legal for an optimization pass to add add_provenance instructions and grow the set, but it is not legal for it to shrink the set. (See below for why I think your objection to this does not work.)

IOW, your proposal is in conflict with naive GVN and with at least one basic arithmetic identity. There might be a modified form of GVN that is still correct, if it adds enough add_provenance calls, but the details of this are tricky.

How do you define the operation of - in a way that it behaves like this? Your input is two integer values i1, i2 with their provenance sets; you must output an integer value and its provenance set. I had expected this to be:

  • the integer output is i1 - i2
  • the provenance output is the union of the two provenance sets

But clearly you had something else in mind.

There's no way to define "does this value depend on that value" by just doing local things in each operation along the way. I don't recall the details, but this is a formal result shown in the context of information-flow control, which is all about tracking which values depend on which other values. "Depends on" is a relational property, one that requires two traces of the program to be defined, and you cannot precisely capture it by only considering a single program execution at a time.

One fairly simple example: assume we know x is 0 or 1, then we can replace x by if x == 0 { 0 } else { 1 }. Already here it becomes very hard to define the semantics in a way that the resulting value captures the fact that it depends on x.
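Spelled out with provenance sets (a sketch):

// Assume uintptr_t x is known to be 0 or 1, with provenance set {p}.
uintptr_t y = (x == 0) ? 0 : 1;  // y's value comes from the literals 0 and 1,
                                 // whose provenance sets are empty -- so a
                                 // local rule gives y the empty set, even
                                 // though y clearly depends on x.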

@Amanieu made a proposal for that, so see above for the issues with this approach. :slight_smile:

3 Likes

(emphasis mine)

Wait... what about the MADV_FREE problem?

I’m 100% ignorant of how LLVM works, but shouldn’t it, in theory, rely on things that sound more like "treating an uninitialized value/register (as opposed to a piece of uninitialized memory in RAM) as initialized is okay" and "treating something initialized as something that may or may not be initialized is okay"?

1 Like

I have a probably stupid and slightly off-topic question: What is casting integers to pointers good for? Disregarding backwards compatibility, what would be lost by disallowing casting integers to pointers, or at least making it UB to obtain a pointer aliasing any other variable that way?

I know this wouldn't be a solution to the problem from the blog post, which is why it's kind of off-topic, but I'm curious.

Edit: Found one example on HN: low-level optimizations using the lower 3 bits of aligned pointers to store metadata. Though maybe that could be done on the pointer without casting it to an integer...

1 Like

Okay, I should have been more precise. You people are too good at catching me when I get sloppy. :wink:

Compiling something uninitialized to something initialized is correct. (Formally: each initialized value refines the uninitialized value.) LLVM doesn't have to know that something is uninitialized to correctly compile operations on it. For example, addition has some rules for what happens when the input is undef/poison/whatever, but these rules are carefully picked in a way that using normal hardware addition will produce a correct result. (Formally: hardware addition refines addition on the extended domain of LLVM values that includes uninitialized things.)

If the rules said "undef + 0 equals 0", then this same compilation strategy would be wrong. Under such a strange set of rules, LLVM would have to track initializedness precisely. Does that make sense?
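To spell the example out (a deliberately informal sketch; reading r like this is of course itself UB in real C, so take it as Abstract Machine notation):

int r;          // uninitialized; abstractly: "undef"
int a = r + 0;  // Abstract Machine: a is undef, i.e. any value is allowed.
                // Hardware: a is whatever garbage the register held -- one
                // of the allowed values, so the compilation is correct.
// Under the hypothetical rule "undef + 0 == 0", the hardware result would
// not be among the allowed values, so plain hardware addition would be an
// incorrect compilation of +.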

Indeed entirely forgetting about the mere possibility of uninitialized values is incorrect, as the MADV_FREE problem shows (and this paper shows some other examples). That's just like how it is incorrect to forget about the existence of provenance, as my post shows.

2 Likes

You may want to take a look at this:

I have a huge comment block above ArcHandle. At the lowest level, a pointer is just a usize, but at the same time the higher-level abstractions don't let me use every operation that is valid for usize, e.g. shifting left/right. The way I use the value requires those operations for efficiency, though. Only using addition and subtraction for this would be infeasible.

12 posts were split to a new topic: Formalizing LLVM IR and Rust

That is not necessarily true. The language with its logic can be proven, since you can separate it into multiple steps (of which types are compatible to operate on). The hard part there is to find the common denominator of the memory + parallelism model, as you can see in this and the cited papers.

A far harder part is to prove that the implementations represent the formal model, and on top of that the optimizations.

BTW: SPARK also does not have a fully formalized language model.

Set analysis / bounds checking / reachability analysis (also sometimes referred to as safety analysis), without runtime evaluation, always involves multiple runs with overapproximation, or model checking (SAT solving). If you want to be exact, you end up with term rewriting systems / graph reduction, with even worse runtime.

Generally, having any semantics be defined by a (growing) set of values is costly to check. This very quickly becomes infeasible to do in the compiler.