Pointers Are Complicated II, or: We need better language specs

By the way, there is a pointer version of x - x:

pub fn foo(x: *const i8) -> *const i8 {
    x.wrapping_offset(-(x as isize))
}

which LLVM will currently optimize to return constant null, breaking provenance.

However, unlike with integers, it would be more feasible to simply forbid that optimization on pointers, since there is little reason to write such code in practice.

Which is not a problem in and of itself though, right? As long as no other optimizations that consider pointers to have provenance are carried out after the "x - x" is replaced by "0". Anyway, this replacement with 0 in fact doesn't seem to be restricted to late optimization passes only:

pub unsafe fn foo(x: *const i8) -> i8 {
    *x.wrapping_sub(x as _).wrapping_add(x as _)
}

fn main() {
    let x = 42;
    println!("{}", unsafe {foo(&x)});
}

This should be safe (right?? I'm not actually 100% sure; the docs of those wrapping operations are a bit ambiguous about whether the pointer value is allowed to leave the object's bounds and go back in bounds afterwards), but it only works in Debug builds (you get an illegal instruction on a Release build).

That seems like a bug to me. @RalfJung, any idea?

I wonder if this section applies https://llvm.org/docs/GetElementPtr.html#can-i-cast-an-object-s-address-to-integer-and-add-it-to-null (wrapping_add is GEP without inbounds).

That sounds like it is the reason for it behaving like this (since it would apply to ptr::null().wrapping_add(x as _), and we've seen that x.wrapping_sub(x as _) optimizes to the ptr::null() literal), but a naive reading of that seems to only be saying "you can't offset null to an object and access the object," not "you can't offset the address of an object to null and back then access the object." Or IOW, that says that *(0 + ptr) is invalid, but says nothing about *(ptr + 0) or (*ptr - ptr + ptr).
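
To picture the decomposition being described, here is a hypothetical "after the first fold" version of the snippet from earlier (an illustration of the reasoning only, not the IR LLVM actually produces):

pub unsafe fn foo_after_folding(x: *const i8) -> i8 {
    // Assume the optimizer has already folded `x.wrapping_sub(x as _)`
    // into a null-pointer literal; the rest of the expression then looks
    // like offsetting null back up to `x`'s address -- which is exactly
    // the *(0 + ptr) case the GEP documentation calls out.
    let base: *const i8 = std::ptr::null();
    *base.wrapping_add(x as usize)
}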

The thing is, you need to make sure that the definition you propose agrees with what LLVM thinks it should be. If you formalize something at the current stage, you'll have to make plenty of decisions, and then if the LLVM community says "nope we're going to make a different decision" (a) you didn't actually formalize 'LLVM', and (b) you need to adjust your formalization and all the proofs.

You can define 'true' and 'false' all day long, but if you want this to benefit actual LLVM users you also need to convince the LLVM community to accept your definition of 'true' and 'false'.

So maybe "useless" is too strong of a term, but there's no reasonable way to claim that one has formalized the semantics of LLVM at the current stage. All one can do is formalize a semantics for LLVM and point out the consequences of the choices one made: which optimizations are lost, which ones are still working. IOW, all one can do is map out a part of the design space, but one cannot actually decide where in the design space real LLVM will end up being.

(Unfortunately there are some papers out there claiming to formalize X without mentioning these caveats, i.e., they present a formalization as having solved the problem so we can now move on to the next thing. I think that is problematic, since the problem of putting actual LLVM on a more solid foundation is not solved by such a paper at all -- and if there is no discussion or dialog or theorems, this barely even helps towards solving the problem. That's why I am emphasizing the importance of working with the community as part of the formalization process.)

Note that this is an incredibly fragile argument to make. Basically you are saying there are two LLVM IRs, IR1 and IR2, with the same syntax but different semantics: IR1 has provenance and IR2 doesn't. The given optimization can only be correct in IR2, so at some point in the optimization pipeline there needs to be a clear transition from working with IR1 to working with IR2. IR1 optimizations can easily be wrong on IR2 and vice versa, so putting each optimization on the right side of this transition is crucial.

So far I have seen no indication that LLVM IR is intended to have multiple semantics, with different optimizations applying at different stages. Until that changes, I will assume the intention is that all optimizations can be applied in any order, and that there is a single semantics going along with their single IR.

Hihi, this is a fun one. :slight_smile: Yes I think it is a bug. If we write + for getelementptr, then 0+x is, to my knowledge (and as @bluss pointed out), explicitly intended to not give you a valid pointer, since 0 is considered its own, separate "object" and no amount of getelementptr can turn a NULL ptr into a ptr to something else. (It is even questionable what this will do for other integer literals.) So *(0+x) is always UB. Combine that with x + (-x) optimizing to 0, and you got yourself a nice provenance-related miscompilation.

Who wants to create a bug report?

It is explicitly intended to be allowed to leave the object and then go back inbounds. Looks like we need to clarify the docs.
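
For illustration, the pattern being blessed here is something like the following minimal sketch (assuming the clarified reading of the wrapping_* docs):

fn out_and_back(p: *const u8) -> *const u8 {
    // Goes 1000 bytes past the object and comes straight back.
    // The intermediate pointer is out of bounds, but it is never
    // dereferenced, and the final pointer is back in bounds.
    p.wrapping_add(1000).wrapping_sub(1000)
}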

There are some tasks that require an ability to robustly determine whether two arbitrary pointers represent the same address, even in cases where they are based on different objects that are adjacent in memory. For example, even if a language doesn't provide a means of forcing top-level objects to be placed in adjacent storage, an implementation may allow programs to import symbols defined in other languages that do, and a program might e.g. use such symbols to mark the beginning and end of a region of storage whose size is determined at link time.

There are some tasks where it would be acceptable for a comparison between a pointer which identifies some object and one that points just past an immediately-preceding object to independently yield true or false, provided that the behavior of each comparison is consistent with it either yielding true with no side effects, or yielding false with no side effects. For example, a task might require using one of two functions, which would be usable in an overlapping set of cases. If a program uses pointer comparisons to determine which function to use, and weird pointer comparisons would only occur in cases where both functions would be usable, then the fact that an arbitrary comparison result might cause the program to arbitrarily select which function gets used shouldn't cause a problem.
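
As a rough sketch of that second category (illustrative only, not from any real codebase): whether the two pointers below compare equal depends entirely on how a and b happen to be laid out, and a program in this category only needs the answer to be consistent, not to be any particular value.

fn main() {
    let a = [0u8; 4];
    let b = [0u8; 4];
    let past_a = a.as_ptr().wrapping_add(a.len()); // one past the end of `a`
    let start_b = b.as_ptr();                      // start of `b`
    // May legitimately print either `true` or `false`, depending on layout.
    println!("{}", past_a == start_b);
}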

Finally, there are tasks which would never involve comparison of pointers that could identify the same address without being based upon the same object, and where it would be acceptable for a program to behave in arbitrary fashion if such comparisons were performed.

Each category of task would be compatible with some optimizations that would be incompatible with the preceding category. The only way a language specification could be optimally suitable for all three categories of tasks would be if it recognized the different categories of tasks and their different semantic requirements, and allowed programs to specify what semantics they actually required. If a language only offers one way of comparing pointers, any specification would either render the language unsuitable for some tasks, or needlessly impair optimizations which would have improved the performance of some others.

The ability to form pointers from integers is essential in low-level programming scenarios where the programmer knows things about the environment that the compiler cannot, such as regions of address space the environment makes available for various purposes. Further, the ability to convert pointers to integers is often essential in low-level programming scenarios where a program may need to tell the environment about an address it should do something with (e.g. start a background I/O operation of N bytes starting at address P). Some compiler writers insist that it would be unacceptably expensive to treat every pointer-to-integer cast whose dependencies aren't fully tracked as though it might leak the address to the outside world, and to treat every integer-to-pointer cast as though it might derive a pointer from any address that has been leaked, but such claims are fallacious. Although hyper-cautious treatment of such casts would force a compiler to generate many loads and stores that it otherwise wouldn't, and although compiler writers might not know of any useful purpose such loads and stores might serve, such casts are rare except in programs that interact with the environment in ways compilers can't expect to know about. Eliminating the loads and stores would thus be likely to convert a program that interacts with the environment in some useful fashion into one that is more "efficient" (but useless) because it optimizes out such interaction.
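
A minimal sketch of the first kind of scenario; the register name and the address 0x4000_0000 are purely hypothetical stand-ins for whatever the platform documentation would actually specify:

// Hypothetical memory-mapped status register documented by the platform,
// not by the compiler or by any Rust code the compiler can see.
const STATUS_REG_ADDR: usize = 0x4000_0000;

unsafe fn read_status() -> u32 {
    let status = STATUS_REG_ADDR as *const u32; // integer-to-pointer cast
    // Volatile read, so the access is performed exactly as written and is
    // not assumed by the compiler to be removable.
    unsafe { status.read_volatile() }
}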

In a low-level language, when a pointer is formed from an integer, one of three scenarios should apply:

  1. The pointer targets a region of address space unrelated to anything the implementation knows about. In that case, an implementation should assume that the programmer somehow knows something about the environment the compiler doesn't.

  2. The pointer targets a region of address space belonging to a live object whose address has been examined, converted to an integer, or exposed to the outside world. In that case, it should behave as though it identifies the object in question.

  3. The pointer targets a region of address space that the implementation has received from the environment to use as it sees fit, and which does not correspond to any live object as described in #2. Accessing the storage associated with such a pointer would be Critical Undefined Behavior, whose effects can generally not be reasonably predicted even by someone familiar with the implementation and the environment.

Provided a compiler treads cautiously in the vicinity of casts between integers and pointers (essentially treating pointer-to-integer casts with release semantics, and integer-to-pointer casts with acquire semantics), it shouldn't need to know or care which of the above scenarios applies. In most cases, code that uses pointer-to-integer conversions will do so for the purpose of exploiting features of the target environment in ways the compiler shouldn't be expected to fully understand, so any "loss of optimizations" would be illusory.

That is not what any compiler writer I know would say, for all compilers I'm aware of assume that casting a pointer to an integer causes the pointer to escape and every integer-to-pointer conversion creates a pointer that may alias any external pointer (including the aforementioned escaping pointers). The problem that's been identified is that this is not sufficient.

If you were to describe the semantics operationally, you'd observe that pointers carry information in them that is more than just the integer address of their location, and this extra information is not saved upon conversion to an integer in a naïve semantics. Converting an integer back into a pointer must generate some information for the pointer. A naïve semantics would imply that a pointer-to-integer-to-pointer roundtrip is equivalent to doing nothing at all (on the basis of it all being no-ops at the hardware level), but careful exploration of operational semantics suggests that it is instead equivalent to clearing those extra information fields, and it is not a valid refinement to fill that information in with a subset of everything.
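
In Rust terms, the roundtrip in question is just the following (a sketch of the semantic question being posed, not a claim about any particular compiler's behavior):

fn roundtrip(p: *const u8) -> *const u8 {
    let addr = p as usize; // pointer-to-integer: the "extra information" is not stored in `addr`
    addr as *const u8      // integer-to-pointer: where does that information come from now?
}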

With LLVM, there is another wrinkle, in that memory is untyped, so effectively every load or store of a pointer value is converting it to or from an integer. So LLVM is permitted (and used to until recently, albeit very controversially!) to turn a store (load i8*) into store (load i64), which will introduce inttoptr/ptrtoint into code where the user previously wrote only pointers. (And yes, this has caused optimization problems--I've been bitten by this "bug".) Take the implicit inttoptr literally, and literally every pointer variable in your program escapes because of the presence of the memory inttoptr; ignore the implicit inttoptr entirely and you probably introduce a soundness hole by folding a user inttoptr into the implicit memory one and miss that you actually need one.

Semantics are difficult, even for the people who know what they're doing. A lot of potential pitfalls with semantics aren't apparent until you can find examples of where things go wrong or try to write the optimizations that you intend to write. C++'s experience with relaxed atomics is an example of where something seemingly simple turned out to be more complex than expected, and the issues LLVM has with inttoptr and undefined values are a long-running exercise in "vague, obvious semantics don't work, and most compiler authors aren't prepared to reason about these in a more correct semantic model" (it doesn't help that the LLVM community isn't particularly welcoming towards formalizing their semantic model).

That is not what any compiler writer I know would say, for all compilers I'm aware of assume that casting a pointer to an integer causes the pointer to escape and every integer-to-pointer conversion creates a pointer that may alias any external pointer (including the aforementioned escaping pointers). The problem that's been identified is that this is not sufficient.

It doesn't sound to me like clang treats pointer-to-int conversions as having release semantics with regard to later int-to-pointer conversions, since if the conversions really were being processed with such semantics, any pointer produced by an integer-to-pointer conversion would be recognized as potentially aliasing every pointer that has leaked.

With LLVM, there is another wrinkle, in that memory is untyped, so effectively every load or store instruction is converting its type to an integer.

So, LLVM needs to include acquire/release notations when the conversions occur as a result of source-code constructs and, until tools that generate LLVM can properly generate such notations, a "force soundness" mode that would sacrifice optimizations for the benefit of people who would prefer a compiler that generates code that's reliably correct, even if it's sub-optimal, to one which generates more "efficient" code that will "probably" work.

(it doesn't help that the LLVM community isn't particularly welcoming towards formalizing their semantic model).

I don't think it would be possible to formalize anything resembling the LLVM semantic model without making some fundamental changes to it first. Such changes would include the abandonment of the idea that optimization must never observably affect the behavior of correct programs, and its replacement with specifications describing when transformations may be generally performed, and what constructs would block them. A program whose behavior might be affected by such transformations could be a correct program if it would meet all its application requirements with any allowable combination of transforms applied.

Consider, for example, the following two language rules: 1. If a program would enter a loop whose exit point is statically reachable, but the loop would never terminate, an implementation may behave in arbitrary fashion; 2. If no individual operation within a loop would be observably ordered before any later operation that is statically reachable from it, the execution of the loop or any portion thereof may be reordered with respect to that later operation, or omitted entirely if it is not observably ordered before any observable operation.

If one allows that an implementation may impose a limit on a program's execution time, and raise a signal if it determines that the limit either has been exceeded, or would inevitably be exceeded in the absence of the signal, that would give an implementation license to trap if it happens to detect that it will never escape a particular loop. I can easily imagine situations where it may be useful to have an implementation trap on endless loops that it happens to detect, or where it would be useful to postpone or omit the execution of a loop which would be unable to affect program behavior under any circumstance where it would terminate. Note that if a compiler were given something like:

long long no_side_effects(long long x); /* defined elsewhere; has no side effects */

int test(long long x, int mode)
{
  do
    { x = no_side_effects(x); }
  while(x & 15);
  if (!mode) return 0;
  return x & 1;
}

Under the latter rules, a compiler would be allowed to generate code that skips the execution of the loop in cases where mode is zero, or that performs the loop unconditionally and then returns zero without bothering to look at the value of mode. If mode is non-zero, however, the read of x within the return statement would be sequenced after its computation within the loop. If a compiler observes that x & 15 is zero, it would be entitled to assume that x & 1 would likewise be zero, but a compiler that postpones the evaluation of x & 15 that would have occurred within the loop would not be entitled to use computations it hasn't performed to justify assumptions about the value of x & 1.

The discussion about formalization itself has been moved to a separate thread, because discussing pointers specifically, and compiler IR and optimization in general, in the same topic was getting to be a bit of a tangled read.

Doesn't seem like that powerful of a point to me. Of course if you derive pointers via some invalid means it will potentially have downstream negative effects on the optimizer. This goes for any invalid pointer nonsense:

int n = 5;
int* p = (int*)rand(); // by an impossible coincidence, p now points to n
*p = 2; // now n == 2
printf("%d\n", n); // OPTIMIZATIONS BROKEN!!

@jamesblacklock your example program has UB. The way UB interacts with non-determinism, if any non-deterministic choice leads to UB, then the compiler can assume that this program point is already unreachable. (It needs to make sure that it is consistent about it, which can be tricky, but it can be done.)

Since the program has UB, the optimization doesn't break the program -- it already was broken before.

@RalfJung That's my point; the program is broken because of wonky pointer misuse, not because of optimizations. If you decide to do wonky invalid things with pointers, whether that be generating random numbers or reaching past the end of an array, then stuff (including optimizations) is going to break. I could be wrong, but I don't think the C standard even guarantees the layout of variables on the stack. Even if it does, it's a really bad idea to leverage that.

Unless an implementation specifies the behavior of rand() and how it will place n, a program execution where rand() happened to yield the address of n but optimization failed to accommodate that would be indistinguishable from one where rand() had simply yielded some other address. If, however, the program had been something more like:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int n = 5;
int main(void)
{
  uintptr_t pn = 3*(uintptr_t)&n;
  uintptr_t r = rand();
  uintptr_t rn = r*15;
  if (pn == rn)                /* pn == rn implies (uintptr_t)&n == r*5 */
    *(int*)(r*5) = 9;
  printf("%llX %llX %d\n", (unsigned long long)pn, (unsigned long long)rn, n);
}

an implementation which doesn't document both how rand() is computed and how static objects are placed would be free to arrange things in such a way that pn and rn could never be equal, in which case it would not have to allow for the possibility that the assignment to (int*)(r*5) would affect n. On the other hand, I don't think an implementation should define uintptr_t (which is an optional type) unless it would treat conversion of a pointer to uintptr_t that may be observed in ways whose consequences the compiler can't fully understand as a potential leak, and a pointer formed by casting a uintptr_t as a potential alias for any object whose address has leaked. If e.g. r happened to be 0x1110 and the address of n happened to be 0x5550, then whether or not the compiler could see any way by which a programmer might know that (int*)(r*5) would yield the address of n, it should allow for the possibility that such a thing might occur. I would have no problem with implementations that were incapable of upholding such a guarantee, if they would refrain from defining uintptr_t and intptr_t.

Note that in terms of what the Standard actually mandates, it requires nothing beyond the fact that if a pointer is round-tripped through uintptr_t and back to its original type, the resulting pointer must compare equal to the original. The Standard specifies nothing about whether an implementation must process any other actions involving the pointer in meaningful fashion, instead leaving that as a Quality of Implementation issue. Some compiler writers interpret the Standard's failure to forbid wacky treatment of a construct as an endorsement of such treatment, but in reality the Standard makes no effort to prevent the creation of "conforming" implementations that are of such garbage quality as to be totally useless, indeed allowing for the possibility that an implementation could be conforming without having to meaningfully process any programs that aren't contrived and useless.

Your program is broken because of "wonky pointer misuse". But the first program in my blog post is not; it is only doing things that are explicitly supported in the standard: casting a pointer to an integer, and then casting the very same integer back to a pointer. Nothing wonky happening here.

Looks real wonky to me. Casting an out-of-bounds pointer to an integer. There is no real program where you should ever do this. It is a bug. If this were in a code review, it would be rejected. Is it interesting that it manifests itself in this particular way? Absolutely! But that certainly doesn't make the optimizer broken. It makes your program broken.

Whether or not a program is broken is defined by the specification, not by "do real programs do this". The specification has no objections with this program.

Also, real programs use one-past-the-end ptrs all the time (for iteration loops). I would not be surprised at all if some of them cast those to an integer as well. So I don't buy your assertion that no real program would do this. But it doesn't even matter, since the compiler has to be correct for all programs, not just for those you consider realistic.
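
For instance, an ordinary iteration pattern like this sketch uses a one-past-the-end pointer, and the variant that compares the addresses as integers is easy to imagine in real code (illustrative only, not taken from any particular program):

fn sum(bytes: &[u8]) -> u32 {
    let mut p = bytes.as_ptr();
    let end = p.wrapping_add(bytes.len()); // one past the end
    let mut total = 0;
    while (p as usize) < (end as usize) {  // addresses compared as integers
        total += unsafe { *p } as u32;
        p = p.wrapping_add(1);
    }
    total
}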

Whether or not a program is broken is defined by the specification, not by "do real programs do this". The specification has no objections with this program.

Nothing an implementation does in response to an attempt to dereference a round-tripped pointer could render it non-conforming. In cases where two equal pointers could exist which would be usable only to access disjoint sets of objects, a capricious but conforming implementation could make any conversion of a round-tripped pointer yield a pointer which would compare equal to the original, but would not be usable to access any of the same objects. So in that sense, almost all programs that make use of integer-to-pointer conversions would be "broken" unless they are only intended for use on implementations that uphold guarantees stronger than what the Standard requires.

That having been said, if the example were changed slightly:

#include <stdint.h>
char x[1], y[1] = {0};
int test(unsigned char *p)
{
    uintptr_t ip = (uintptr_t)p;
    uintptr_t iq = (uintptr_t)(x+1);
    y[0] = 123;
    if (iq == ip) {
      *p = 45;
    }
    return y[0];
}

Here the code never dereferences the round-tripped pointer, but clang still processes the code nonsensically.

What's going on is similar to the way clang and gcc treat some goofy corner cases in the Standard involving restrict-qualified pointers, under which the C Standard would define the behavior of a construct of the form (someFlag && a || b) && doSomething when a is 0 (meaning someFlag would be evaluated but doSomething would be unconditionally skipped) or when b is 1 (meaning someFlag would be evaluated but doSomething would be unconditionally executed), and yet not define program behavior in cases where execution of doSomething is conditioned on someFlag. I can't think of any reasonable abstraction model in which the reason for a compiler to decide to execute or skip a piece of code would play any role in determining whether its behavior is defined, but that's how the Standard specifies the behavior of restrict, and clang and LLVM are simply applying the same logic elsewhere.