Comparing dangling pointers

RalfJung · December 21, 2015, 5:30pm

Are there any definitive statements anywhere about the rules Rust follows for testing equality of dangling pointers? I think this is something that should be covered in the language reference.

Concretely, consider the following function:

fn compare_dangling_pointers()
{
    let p1: *const i32;
    {
        let x = Box::new(0);
        p1 = &*x;
    }
    {
        let y = Box::new(0);
        let p2: *const _ = &*y;
        let b = p1 == p2;
        println!("Are they equal? {} {}", b, b, p1 == p2);
    }
}

With “rustc 1.5.0-beta.2 (b0f440163 2015-10-28)”, this prints “true true true”. I assume “false false false” would also have been allowed, showing that even safe Rust allows some non-determinism here.

Personally, I found that very surprising: In C, once an object is deallocated, all pointers to it have an indeterminate value. This is the same or at least closely related to the values of uninitialized variables and padding bytes. Generally, Rust successfully shields its users from all the complications around indeterminate values - with the exception of dangling pointer comparison, it seems.

Complications around indeterminate values

The rules around what is and what is not allowed for such values are fairly unclear, to the effect that one of the very few formalizations of C [1] makes pretty much anything here (including comparison of an indeterminate value with anything else) undefined.

In LLVM, as far as I know, indeterminate values become “undef”, admitting many surprising effects. There are attempts to formalize the intended semantics of “undef” [2], with interesting consequences: Under these semantics of LLVM, the above Rust program could output any of the 8 combinations of three boolean values. In particular, not only is the result of comparing a dangling pointers indeterminate and hence non-deterministic, it is actually kind of “lazily picked”: Multiple comparisons can have different results. The non-determinism also does not happen when the equality test takes place. Using the same variable that contains such an indeterminate value multiple times can result in different outcomes, and this propagates through further arithmetic operations (think of let b2 = b | 0, now b2 could still have different values when used multiple times - that’s assuming BitOr would be implemented in the obvious way for bool) and potentially even through memory.

I will have to do some more digging to find the official LLVM documentation related to testing equality of dangling pointers, to figure out whether that is really modeled as “undef” or whether they treat this indeterminate value differently. I am reasonably sure that “undef” is used for uninitialized variables, which are also having indeterminate values according to C.

[1] http://robbertkrebbers.nl/thesis.html [2] https://www.cis.upenn.edu/acg/papers/popl12_vellvm.pdf

Indeterminate values in Rust

I was unable to find out whether Rust actually admits the other 6 combinations of output from the above program. However, unless extra measures have been taken in the translation, I would assume that LLVM actually permits all of them. I think this is important (and strange) enough to be actually documented in the Rust semantics. In particular, as far as I am aware, this is the only piece of underspecification that Rust inherits from C. Generally, Rust has been very careful not to have anything strange like that, either by ruling out certain cases through its type-system (like dereferencing dangling pointers, or using uninitialized variables) or by explicitly defining way more than C does (i32 is 32bit wide and naturally wraps on overflow - or panics, but nothing else). So, most of the time, there is not a big need for defining the exact semantics of safe Rust programs, they mostly have only one “obvious” behavior. For unsafe Rust, the answer is “go check LLVM’s semantics”, and that’s fine. However, I think it’d be a shame if that would also be needed for (any piece of) safe Rust.

Summary

By allowing comparison of raw pointers in safe code, Rust actually opens a can of worms that - in C - is pretty much equivalent to uninitialized variables. This is not documented in the official documentation (to my knowledge), and it’s unlikely to be widely known. It may be too late to close this Pandora’s box and disallow comparing raw pointers in safe code (this would be a breaking-change), but at the very least its intended semantics should be carefully documented, and should be checked to actually match what LLVM does.

Or maybe I am entirely missing a point here, in which case I’d be happy to become enlightened

Gankra · December 21, 2015, 5:44pm

I can’t imagine why this would be a problem? It’s just integers, right? Is there some context to this?

Edit: Oh! Will wait for full post.

RalfJung · December 21, 2015, 6:09pm

That would be fine - If you wouldn't compile to LLVM and would not like to have reasonable aliases analysis - with assumptions like, different calls to malloc will never return the same result . In C, pointers are certainly way more complicated beasts than integers, and by inheritance this is also true for LLVM. I suggest the second chapter of http://robbertkrebbers.nl/thesis.html for some more details.

glaebhoerl · December 22, 2015, 1:10am

I was unable to persuade playpen to segfault during some brief experimentation, but if LLVM really produces undefs here that can lead to undefined behavior and violation of memory safety as shown by e.g. @bluss’s comment here.

bluss · December 22, 2015, 6:05am

I was just quoting @comex!

If it can be exploited in rust, then we must consider if we can tune llvm to avoid it.

RalfJung · December 22, 2015, 11:46am

The C standard would certainly allow LLVM to produce undef any time a dangling pointer is involved. However, that doesn’t have to automatically lead to undefined behavior - I can’t think of a way that an undef boolean leads to anything more weird than non-deterministic control flow, and magic changing variables.

Unfortunately, their language reference I found at http://llvm.org/docs/LangRef.html does not say anything about the effects of free on dangling pointers to the deallocated objects. The latest version does not mention free at all, a completely outdated version at http://llvm.org/releases/1.1/docs/LangRef.html#i_free does, but it doesn’t say anything about what happens to the dangling pointers. So I guess this has to be brought upstream.

I would be surprised, though, if LLVM would not perform optimizations that would be invalid if pointers were “just integers”. Consider the following C code

int *p = malloc(sizeof(int)); assert (p != NULL);
free(p);
int *q = malloc(sizeof(int)); assert (q != NULL);
if (p == q) { // p is indeterminate due to the free
   *q = 10;
   *p = 14;
   printf("%d\n", *q); // p and q alias, expected to print 14
}

According to [1], Clang will optimize this to inline the value 10 in the call to printf. C compilers want this to be possible, because many analyses rely on distinct call sites of malloc returning different values. Hence p and q cannot alias, and hence the read from q has to return 10. However, it would be really hard for compilers to ensure that the integer representations of these pointers actually differ. Hence they have the license to cheat here: The value of p is indeterminate, and hence can change at any point in time (or worse - other interpretations say that any use of p, including comparison, is undefined behavior). So p can be equal to q in the conditional, but then different again when p is written to.

Notice that there are more parts in the C standard which show that pointers are clearly not just integers. For example, the following program has undefined behavior:

int *p = malloc(sizeof(int)); assert (p != NULL);
int *q = p + 180;
int *r = q - 180;
*r = 42;

Pointer values only remain valid as long as they stay within the boundaries of the object they point to, in particular, within the boundaries of a malloc. This is captured in LLVM with the inbounds variant of getelementptr.

[1] http://robbertkrebbers.nl/thesis.html

jethrogb · December 22, 2015, 12:29pm

LLVM is used to compile many languages, surely not all of those languages have the same UB as C. Can you give some examples of actual IR (as opposed to C) that leads to problems?

RalfJung · December 22, 2015, 1:03pm

I’m sorry I don’t know LLVM IR very well. I suppose it is possible to get clang to dump the IR of code like the above, and play with it - it’ll take me a while to figure out the tools for this, though.

Of the many languages used with LLVM, how many have pointers that can still be talked about after deallocating the object they talk to?

RalfJung · December 22, 2015, 2:30pm

I played around with LLVM a little. Here’s the result: The above C code translates to the LLVM code at http://pastebin.com/F0EeHLqs. To me, this looks like a fairly straight-forward compilation; in particular, it is actually performing two writes (to q and p, in that order), followed by a read from q, followed by the printf.

On that code in dangp.ll, I now run

 opt -O2 dangp.ll -o dangp.bc && llc dangp.bc -o dangp.s && gcc dangp.s -o dangp && ./dangp

This prints “10”. Clearly, something in the chain made use of the fact that unallocated pointers are indeterminate. I don’t know how to read LLVM bitcode, but the assembly contains

    movl    $10, (%rax)
    movl    $14, (%rbx)
    movl    $.L.str, %edi
    movl    $10, %esi
    xorl    %eax, %eax
    callq    printf

I’m bad at reading assembly, but I think this hard-codes the printed result to 10. So, it was one of the two LLVM pases that did this. Replacing -O2 by -O1 changes the result of the program to 14, so it’s probably the opt that did this.

oli-obk · December 22, 2015, 2:37pm

you can let llvm print out all the intermediate steps during optimization by passing -p

Gankra · December 22, 2015, 5:22pm

I discussed this with Dan Gohman on IRC. LLVM does not use C’s “indeterminate after free” rule.

Edit: see below.

nikomatsakis · December 22, 2015, 5:30pm

Interesting. This is a tricky case and certainly falls into that category of things we ought to clarify -- all the more urgently since this operation is permitted in safe code. I personally think that in retrospect we ought to have much tighter rules around what is possible with unsafe pointers in safe code. (Another example of something that probably ought to be unsafe: casting an unsafe pointer of type *T to *U.)

That said, this WOULD be hard to rollback, given that *const T implements the PartialOrd trait (and, what's worse, Ord!). I wonder if we could sidestep some of these problems by casting to usize first in the LLVM IR?

In general, the fact that LLVM makes it hard to "opt out" of various bits of C UB can be rather frustrating.

Gankra · December 22, 2015, 5:33pm

Discussing it further, the comparison of x == y may produce surprising results, but it will generally reflect the runtime values of the variables being compared. That is, LLVM may decide to reuse allocations of values with distinct lifetimes, and therefore be the case that x == y deterministically. This is likely what the original post is witnessing.

RalfJung · December 24, 2015, 11:22am

Casting things around should not help. Also, in this case, I have to say LLVM doesn't really have much of a choice - if you want to do a strong alias analysis, and treat malloc as always returning fresh pointers, while maintaining low-level access to the pointer representation - I don't see a way around "weird stuff". It doesn't have to be full UB, but something like values that can be different each time you look at them I think cannot be avoided.

[quote="Gankra, post:11, topic:3019"] I discussed this with Dan Gohman on IRC. LLVM does not use C's "indeterminate after free" rule. [/quote]It doesn't use that rule? Then, how else does LLVM justify the optimization I witnessed above? As far as I am concerned, that example proves that LLVM does funky stuff with dangling pointers. If pointer values remain "sensible" after deallocation, then LLVM should not be allowed to re-order the writes to p and q in a conditional branch that is taken only if these two pointers have equal representation.

Or, taken from the other perspective: If pointers are entirely determined by the bit pattern that represents them, and deallocating an object has no effect on the pointers that point into it - then certainly, in the example above, p and q are the same pointer: They have the same representation, as witnessed by the run-time test, and nothing funky is allowed to go on, and the representation entirely determines the pointer. But if p and q are the same pointer, then re-ordering writes through them changes the semantics of the program, and is illegal.
Hence any semantics that allows reordering these writes, including the C and LLVM semantics, must have a more complicated story for pointers.

Unfortunately, the problem is not solved by saying "it compares the runtime representation". You cannot just define problems away, that's not how things work . You have to pick, semantically, what pointers "are" and how they are represented in memory, how bit-level operations behave on them, etc. And once these definitions are set, we can check which transformations are or are not allowed. The choice here has to be consistent, you cannot sometimes say "pointers are just integers", and in other moments say "oh but if these integers are equal, I may still sometimes consider the pointers different".

Sure, LLVM could consistently say "pointers are integers that address into memory", but as I argued in the previous paragraph that would immediately invalidate lots of useful compiler optimizations which LLVM certainly performs.

Florob · December 24, 2015, 3:39pm

Personally I think this conflates two issues. The value of a pointer, and the validity of a pointer. Notably the C standard (also) says:

Among the invalid values for dereferencing a pointer by the unary * operator are a null pointer, an address inappropriately aligned for the type of object pointed to, and the address of an object after the end of its lifetime.

Note that there is no requirement that this address has not been reused by another stack allocation, or malloc() call.

Hence, my personal mental model has always been that a pointer's validity and its value are separate properties. I.e. in your example p and q can have the same value, but since p has been invalidated by freeing its pointee writing to *p still triggers undefined behaviour.

It might also be worth pointing out that there are tricks that would have to work, if the only property involved here was that a pointer's value becomes indeterminate/undef. Consider for example:

int *x = malloc(sizeof(int));
uintptr_t xi = (uintptr_t)x;
free(x);
int *y = malloc(sizeof(int));
int *fresh_x = (int*)xi;

*y = 4;
*fresh_x = 12;
printf("%i %i\n", fresh_x == y, *y);

The value of xi never becomes indeterminate (or at least I'm not aware of this happening retrospectively to values that were derived from something that later becomes indeterminate). Yet clang is happy to treat *fresh_x and *y as two separate objects, even though fresh_x and y have the same value and neither is undef (assuming malloc() reuses the memory, which is usually the case).

eefriedman · December 24, 2015, 9:16pm

You're right, it does use that rule. It just doesn't use it in a way which affects safe Rust: the relevant optimizations only trigger if you actually dereference a dangling pointer.

arielb1 · December 25, 2015, 10:50pm

Rust is unfortunately a “pointers are just integers” language. We use other ways to control aliasing.

RalfJung · December 26, 2015, 12:12pm

I agree, this is a sensible interpretation of the standard. However, that just goes to confirm what I said above, that the in-memory representation of a pointer (what you call it's value) has to be separated from the actual, abstract value that is the pointer. That's why I would be careful with the term "value of a pointer here". The distinction I make here is equivalent to the distinction between the abstract, mathematical concept of the number "3", and the bit sequence "00000011" that represents "3" as a member of the type unsigned char. Similarly, pointers will have an abstract value (and it's unclear what this should best be, though the PhD thesis I quited above makes a proposal that I like), and a representation in terms of bits on the machine. The two must not be conflated.

So, it seems we agree that two pointers can have the same bit representation, but be abstractly different values.

This is indeed a very important question, and the standard is unclear - in fact, the standard and compiler writers don't even agree on all points here: How far do these indeterminate values propagate, and what happens at the "end" of propagation - do we see undefined behavior, or "standard" non-determinism. In your example, clang is using the fact that pointers with equal bit-representation can be different pointers, and on pointer-to-integer conversion preserving the bit representation. I would argue that the transformation relies on the fact that the indeterminate value propagates through the integer. Unfortunately, pointer-to-integer conversion is formally very poorly understood.

So, maybe one could say that while LLVM uses the liberty granted by the C standard (pointers become indeterminate when the object they point to is deallocated) to justify making a distinction between a pointer bit pattern, and a pointer value, such that the same bit pattern can be "different" pointers. However, other than that, it doesn't do anything funky with the bit pattern; in particular, it does not allow the bit pattern to change in random ways, and pointer comparison would remain deterministic.

This would mean, for example, that even though LLVM assumes p and q in my example not to alias, it would not be allowed to optimize p == q or p != q away into a constant. it would have to always preserve the original check. Is that the case?

Of course, this means someone will have lots of fun, eventually, to figure out what the heck this all means formally...

LLVM is not a "pointers are just integers" language, so certainly unsafe Rust cannot be such a language, either. Maybe safe Rust can uphold that fiction. Since safe code cannot do anything with raw pointers except for comparing them (and casting them), right now I can't find a way that safe code alone could witness strange pointer behavior - provided that LLVM only uses the "indeterminate values" license granted by C for optimizing pointer accesses, but not for pointer comparisons.

notriddle · December 26, 2015, 6:04pm

// convert an undef bool to an undef u32.
let a = undef_bool;
let mut b: u32 = 0;
for i in 0..32 {
    if a {
        b |= 1 << i;
    }
}
let c = [ 0 ];
 // UB because the bounds check need not use the same value as the actual memory read
// See also the discussion surrounding OOR shifts
c[b as usize]

However, when I try to run this, LLVM refuses to turn the comparison into undef. Weird.

glaebhoerl · December 26, 2015, 9:26pm

(That’s basically the same thing I tried, I think, with the same result.)

Topic		Replies	Views
Taming Undefined Behavior in LLVM Unsafe Code Guidelines	6	2229	March 25, 2019
Dyn pointer equality	6	881	April 10, 2020
[off-topic] uintptr_t round tripping in C	4	2104	March 25, 2019
Pointers Are Complicated II, or: We need better language specs	148	9244	January 17, 2021
Lifetime bug in safe part (String - ptr)? Unsafe Code Guidelines	14	708	August 28, 2024

Comparing dangling pointers

Complications around indeterminate values

Indeterminate values in Rust

Summary

Related topics