Thanks all for the great feedback.
The Abstract Machine (think: Miri) has provenance that it always tracks. This gives LLVM permission to exploit provenance when it sees fit. However, provenance is designed in such a way that LLVM does not have to track it all the time.
This is similar to uninitialized memory: the Abstract Machine has rules like "performing addition on an uninitialized integer is UB". Miri and the Abstract Machine always precisely track which things are (un)initialized. LLVM does not and cannot, and that is okay, since treating something uninitialized as if it were initialized is correct -- the Abstract Machine and Miri are carefully designed to make this true. Imagine we added an operation that tests whether some memory is initialized. In the Abstract Machine and Miri this is easy to do, but now LLVM would actually be required to track precisely which things are initialized, and now we have a problem.
Provenance, like uninitialized memory, can and has to be defined in a way that the compiler may track it but does not have to track it.
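To make the analogy concrete, here is a minimal C sketch of the kind of program these rules are about (the comments describe the Abstract Machine's view; the exact diagnostics are illustrative):

```c
#include <stdio.h>

int main(void) {
    int x;             // x is left uninitialized
    int y = x + 1;     // UB in the Abstract Machine: addition on an uninitialized integer
    printf("%d\n", y); // a Miri-style interpreter rejects the program at the line above;
                       // compiled code happily prints *some* value, since LLVM does not
                       // (and need not) track initialization at runtime
    return 0;
}
```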
Oh, absolutely. But if I told the LLVM and Rust communities that the only way to solve this problem is a fully formal spec, I think the reaction would be "well, that's way too much effort". Also, writing a fully formal spec requires a background that few people have. I think by phrasing things in terms of "make the spec precise enough for an interpreter", we give many more people the chance to contribute, and thus we might even reach the goal of a sufficiently precise spec sooner.
We should have a fully formal spec! All I am saying here is that there's something much easier we can do, something that more people with less PL background can contribute to, and it already goes a long way towards avoiding these issues. Going from an interpreter like Miri to a mathematical spec is actually comparatively easy.
No, it does not. The original program in the example chain has no UB. Or rather, you are using the term "no provenance" in a different way than I am.
Pointers cast from integers need to have a form of "broadcast" provenance that allows them to access at least the allocation that they were created from. The details of this can be defined in many different ways. But in all of them, the goal of course is to not make it UB to use the resulting pointer.
For example, we could, at the moment the int-to-ptr cast happens, check which allocations exist at that location in memory, and define that the pointer obtained from the integer has the provenance of those allocations. This paper describes a slightly more elaborate but still fairly simple version of that model. In this model, pointers have provenance, integers do not, and yet casting a valid pointer to an integer and back yields a valid pointer again (the resulting pointer might be different from the one we started with, in that it could have different, less restrictive provenance, but that's okay).
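For illustration, here is a minimal C sketch of the round-trip such a model is designed to allow; the comments describe the (hypothetical) provenance bookkeeping of the model sketched above:

```c
#include <stdint.h>

int main(void) {
    char buf[1] = {0};
    uintptr_t i = (uintptr_t)&buf[0]; // ptr-to-int: the integer carries no provenance
    char *p = (char *)i;              // int-to-ptr: in the model above, p picks up the
                                      // provenance of the allocations live at this
                                      // address -- here, (at least) buf
    *p = 1;                           // hence this access is fine, not UB
    return buf[0];
}
```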
I didn't actually say what my definition of provenance is. Yours is certainly a possibility. However, it means that the first optimization becomes incorrect. You made it "correct" by adding comments, but those comments need to be reflected by actual operations in the code, since they affect program behavior! So the first transformed program would have to look something like this:
```c
char p[1], q[1] = {0};
uintptr_t ip = (uintptr_t)(p+1); // the cast implicitly adds the provenance of p to this integer
uintptr_t iq = (uintptr_t)q;     // the cast implicitly adds the provenance of q to this integer
uintptr_t ip2 = add_provenance(ip, q);
uintptr_t iq2 = add_provenance(iq, p);
if (iq2 == ip2) {
    *(char*)ip2 = 10;
    print(q[0]);
}
```
This is a different optimization than the one I showed in the post. The optimization as shown there is, in general, incorrect in your model since integers have provenance.
The reason that adding an explicit `add_provenance` is crucial in your semantics is that if an optimization pass "grows the provenance set", it needs to ensure that all optimization passes running later work with the larger provenance set. The optimization pass cannot grow the set, do the optimization, and then shrink it again (which is what happens in your example if we "forget the comments"). That would be incorrect.
Also, replacing `x - x` by `0` is incorrect in your model, since `0` presumably has the empty "set of addresses" associated with it. It is legal for an optimization pass to add `add_provenance` instructions and grow the set, but it is not legal for it to shrink the set. (See below for why I think your objection to this does not work.)

IOW, your proposal is in conflict with naive GVN and with at least one basic arithmetic identity. There might be a modified form of GVN that is still correct, if it adds enough `add_provenance` calls, but the details of this are tricky.
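To make the `x - x` point concrete, here is a sketch in your model; `guess_address` is a purely hypothetical stand-in for any way of obtaining the right number without ever touching `p` (and hence with an empty provenance set):

```c
char p[1] = {0};
uintptr_t ip = (uintptr_t)p;      // provenance set {p}
uintptr_t diff = ip - ip;         // value 0; by the union rule, provenance set {p}
uintptr_t addr = guess_address(); // hypothetical: numerically equals ip, empty set
char *r = (char *)(addr + diff);  // the permission to access p flows in only via diff
*r = 1;                           // allowed in the model, thanks to diff's provenance
// Replacing `ip - ip` by the literal 0 would shrink diff's set to {},
// turning this write into UB -- so that rewrite is not correct here.
```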
How do you define the operation of `-` in a way that it behaves like this? Your input is two integer values `i1`, `i2` with their provenance sets; you must output an integer value and its provenance set. I had expected this to be:

- the integer output is `i1 - i2`
- the provenance output is the union of the two provenance sets

But clearly you had something else in mind.
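For concreteness, here is a minimal sketch of the definition I had expected, with the provenance set represented as a made-up bitmask type (all names here are invented for illustration):

```c
#include <stdint.h>

// Made-up representation: a provenance set as a bitmask over allocations.
typedef uint64_t ProvenanceSet;

typedef struct {
    uintptr_t value;      // the numeric value
    ProvenanceSet prov;   // the allocations this integer "knows about"
} Int;

// Subtraction: value-wise subtraction, provenance-wise union of the inputs.
Int int_sub(Int i1, Int i2) {
    return (Int){ .value = i1.value - i2.value, .prov = i1.prov | i2.prov };
}
```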
There's no way to define "does this value depend on that value" by just doing local things in each operation along the way. I don't recall the details, but this is a formal result shown in the context of information-flow control, which is all about tracking which values depend on which other values. "Depends on" is a relational property, one that requires two traces of the program to even define, and you cannot precisely capture it by considering only a single program execution at a time.
One fairly simple example: assume we know `x` is 0 or 1; then we can replace `x` by `if x == 0 { 0 } else { 1 }`. Already here it becomes very hard to define the semantics in a way that the resulting value captures the fact that it depends on `x`.
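To see where purely local tracking breaks down, here is a sketch of naive per-value dependency tracking applied to exactly this rewrite (all names are made up):

```c
#include <stdbool.h>

// Made-up: track, per value, whether it "depends on x".
typedef struct {
    int value;
    bool depends_on_x;
} Tracked;

// Semantically the identity on {0, 1}, so it can replace x itself.
Tracked rewritten_x(Tracked x) {
    if (x.value == 0)
        return (Tracked){ 0, false }; // literal 0: locally, "no dependency"
    else
        return (Tracked){ 1, false }; // literal 1: likewise
    // The result clearly *does* depend on x, but the dependency flowed
    // through control flow, which tracking one execution a single
    // operation at a time cannot see.
}
```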
@Amanieu made a proposal for that; see above for the issues with this approach.