Comparing dangling pointers

There seems to be some confusion in this thread about what LLVM actually does. You can see the attributes being applied to library functions here:

http://llvm.org/docs/doxygen/html/FunctionAttrs_8cpp_source.html

The only relevant one is malloc's noalias, which means that alias analysis can assume that pointers it returns don't alias with anything else.

In particular, there is nothing special on free. While the C standard would allow LLVM to turn the argument to a free call into undef afterwards, it doesn't actually do so, presumably because this wouldn't be a useful optimization on any real programs - rather, the standard clause in question is a standardsese attempt to justify the alias optimization. This could theoretically change but I wouldn't hold my breath.

As for noalias -

This would mean, for example, that even though LLVM assumes p and q in my example not to alias, it would not be allowed to optimize p == q or p != q away into a constant. it would have to always preserve the original check. Is that the case?

Glancing at the LLVM code, and checking by looking at assembly output, it seems like that is currently the case, but said optimization sounds more like something that could hypothetically be added in the future, as one could imagine it being useful to remove redundant pointer equality checks in inlined functions or whatever. I certainly don't see any language in the definition of noalias suggesting it would be out of bounds.

However, were LLVM to add such an optimization, I believe Rust would have no choice but to disable it: the optimizer is pretty powerful, so if there is any situation where safe code can trick it into assuming something that can turn out to be false, it is likely possible to get it to mistakenly eliminate bounds checks or whatnot with enough contortions. (As mentioned, you can certainly do that with undef, though that isn't the issue here.)

(If you intend to test anything, note that noalias is inherently applied only to the C malloc function itself. AFAIK, LLVM can propagate noalias towards callers all the way to Box::new, but only if it actually sees the whole chain, and only if the C malloc function is actually being used. The former requires LTO, and the latter requires using the system allocator rather than jemalloc directly, so you will have to compile with both to possibly notice a difference.

Alternately there may be something I don't know about that disables this functionality entirely for Rust, but I don't think so..)

LLVM IR really doesn’t have a concept of pointers acquiring indeterminate values.

There are two completely independent sets of rules governing pointers. There are the bitwise rules, and there are the aliasing rules. The bitwise rules are completely intuitive. A bitwise value never becomes indeterminate; it may be undef from the start, but a non-undef value never becomes undef later on.

The aliasing rules only apply to loads and stores, and not comparisons. The aliasing rules are concerned with the method of computation that produced a pointer value rather than the actual bits of the pointer value. The aliasing rules are an additional constraint on loads and stores and can trigger UB.

When an object is deallocated, another object may be allocated at the first object’s address. When this happens, a pointer comparison will correctly report that the addresses are the same, because comparisons only consider the bitwise semantics. But loads and stores additionally consider the aliasing semantics, and the aliasing semantics care where the pointer came from. A pointer to the deallocated object and a pointer to the new object are considered non-aliasing even when they compare equal, because they’re pointing to two different objects. Accesses to non-aliasing objects may be reordered by the optimizer.

To opt out of this behavior, you’ll need to find ways to neutralize the aliasing semantics, specifically. Removing noalias attributes is one such way of doing this.

3 Likes

Oh noes, noalias enter the stage... that's another very poorly understood bit of LLVM, as far as I know :-/ . Is there a coherent documentation somewhere of what it is supposed to "mean", semantically, and how LLVM uses it? Existing academic research on the LLVM IR skips over noalias, just like it skips over C's restrict - for good reasons...

I agree with your analysis of what LLVM does. However, I think adding noalias to malloc is only valid because the C standard allows the value of pointers to change on free. The trouble is that return values of malloc actually can alias - looking just as malloc itself, the annotation is plain wrong. However, due to what the C standard says about free, it is impossible to observe aliasing in C programs, and hence noalias becomes valid.

[quote="comex, post:21, topic:3019"]Glancing at the LLVM code, and checking by looking at assembly output, it seems like that is currently the case, but said optimization sounds more like something that could hypothetically be added in the future, as one could imagine it being useful to remove redundant pointer equality checks in inlined functions or whatever. I certainly don't see any language in the definition of noalias suggesting it would be out of bounds.

However, were LLVM to add such an optimization, I believe Rust would have no choice but to disable it: the optimizer is pretty powerful, so if there is any situation where safe code can trick it into assuming something that can turn out to be false, it is likely possible to get it to mistakenly eliminate bounds checks or whatnot with enough contortions. (As mentioned, you can certainly do that with undef, though that isn't the issue here.) [/quote]

Right now, Rust would have to disable it because it allows comparing raw pointers safely. I think if that was prohibited, then this optimization would be fine. Indeterminate values arise from uninitialized variables as well as dangling pointers, Rust can rule out the use of both in safe code.

Thanks a lot, this is very helpful information! Is this your reconstruction based on what LLVM does, or is there actually some kind of official stanza on this? The closes thing to a language definition of the LLVM IR I could find is http://llvm.org/docs/LangRef.html, but that doesn't touch all these complicated points. In particular I wonder whether...[quote="sunfishcode, post:22, topic:3019"] The aliasing rules only apply to loads and stores, and not comparisons. [/quote]... is a "guarantee", or whether it could be changed without much of an announcement. Rust is relying on this right now.

1 Like

Continuing the discussion from Comparing dangling pointers:

From LLVM's perspective, the return values of malloc don't ever alias, even when they compare equal, because for two pointers from malloc to compare equal one of them must be freed and thus no longer valid to load or store from. LLVM itself never considers pointers to change value after a free.

LLVM's semantics evolved fairly organically with C/C++ rules pushing on one side and "what an LLVM-style optimizer really wants" pushing on the other. The result is something which still has holes and fuzzy areas and isn't fully documented, but otherwise seems to hold together, so my guess is that it's unlikely to significantly change, unless the C/C++ rules dramatically change, which seems unlikely at this point.

I think we agree here. I'm not saying LLVM's approach is unreasonable or so, and I think I understood it. (Well, except for the part where we don't have a formal definition of noalias, and where I was unable to even find the official documentation of all of this for LLVM. But that's a separate issue.)

What I was talking about above is: Clang claims to be a correct C compiler, and it adds the noalias annotation to calls to malloc. For this to be sound, the LLVM approach to this ("Results to malloc never alias") has to be justified based on the C semantics. This is the part which, I think, needs to refer to the part where C says that pointers to deallocated objects are indeterminate. Due to the C standard being ambiguous (not just because it's plain English, but even that language doesn't always seem to be clear), reasonable people may disagree on what exactly in the C standard justifies clang's translation to LLVM.

This is hardly relevant for Rust though, which targets LLVM directly. The reason I brought up C in the beginning of this thread is that I didn't know about how LLVM handles this (did I mention it's hard to find documentation?^^). I knew LLVM was designed such that clang is sound, so I started from there.

Now, you all convinced me that for the case of pointer comparison, LLVM actually is entirely predictable here, right now. (Assuming that was has been said here, reflects the official position of the LLVM devs.) So, Rust doesn't have a problem due to allowing raw pointer comparison in safe code. Great :slight_smile:

I still would have preferred if this question wouldn't even have to be raised - if raw pointers could only be compared in unsafe code - but oh well. I'll argue for that when it comes to Rust 2.0 :smiley:

What I think I would like is a way to "freeze" an undef value -- perhaps this does exist, but I am not sure. That is, something like:

let x = undef;

where x is now some random value, but it is the same random value each time you look at it.

To be clear: I don’t claim this would magically address the issues we’re talking about here. It’d just be useful for handling out-of-bounds checks where the index may be undef. :slightly_smiling:

Yeah, the real question is whether LLVM might find it useful to use aliasing to inform comparisons at some point in the future. It seems plausible to me, but ultimately it's hard to know.

It seems fairly clear to me that there are more operations on unsafe pointers that ought to have been unsafe: casting them and, now, comparing them. We have discussed the idea of phasing in these rules by making a lint that warns if such operations are performed outside of an unsafe block (which perhaps eventually graduates to an error at some point). What is needed though is a strong justification: the possibility of undefined behavior could be such a thing.

One concern that I always have in these situations though is that I want rules for undefined behavior that match what programmers actually do and sort of expect to work. This is why, for example, I find C's strict aliasing rules unfortunate, because I think they violate most people's mental model of how the computer works and what should be allowed. Basically, people should be able to think of the machine as a kind of simple turing machine that runs along and loads and stores data, and strict aliasing rules break that. Put another way: it's not enough that, in the event of a crash, we know there was unsafe code involved. Ideally, there'd be no crash in the first place. :slightly_smiling:

Along those lines, it does seem to me that having free magically transmute the freed pointer into undef would be rather surprising. That is, I could imagine people trying to do wicked tricks in unsafe code like:

free(x);
let y = malloc(...);
if x == y { /* actual aliasing occurred! recover from some annoying situation */ }

but I don't really have an actual example of how this could be useful. :slightly_smiling:

Anyway, I'm mostly just musing out loud here at the moment.

1 Like

For this to be sound, the LLVM approach to this ("Results to malloc never alias") has to be justified based on the C semantics.

LLVM seems to use an access-based approach to aliasing - aliasing controls accesses, not pointer equality. LLVM has the weird pointer-provenance rule for noalias, which I don't completely understand, but other than that it is quite sane.

Well, also, I like to think of Rust as independent from LLVM -- it's only our current choice of backend -- so it's valuable to think in terms of a more abstract specification. Perhaps the LLVM distinction of "aliasing that affects loads" as distinct from comparisons is the right way to go, but that's not entirely clear.

The LLVM Reference manual disagrees:

This example points out that two ‘undef‘ operands are not necessarily the same. This can be surprising to people (and also matches C semantics) where they assume that “X^X” is always zero, even if X is undefined.

x could have a different value every time (if the "undef" is intended in the LLVM way).

Right, I know that undef can be different each time. What I was trying to say is that it would be useful if LLVM offered a “fix(V)” operator/intrinsic/whatever which, if V is undef, yields a random (but no longer undef) value, but otherwise yields V. As far as I know, though, no such operator exists.

Perhaps someone might be using the raw pointer as a key in a hash table that holds metadata about the pointer's referent. They might want to clean up the hash table entry after the pointer itself has been freed.

Yeah, type-based aliasing analysis (TBAA) is kind of surprising. I read claims that LLVM doesn't do that, but then I also read (and verified) that Clang does use TBAA. The documentation of load in the LLVM language reference unfortunately does not say anything about what happens when the load is performed at the "wrong" type.

Of course the Rust memory model should be defined independent from LLVM. However, considering that the translation Rust performs right now should better be sound, and that Rust (IMHO) should have as little undefined behavior as possible - there's probably not going to be too much wiggle-room between the two memory models. The advantage of a memory model for Rust is that it can remain much more Rust-specific, e.g. it doesn't have to define anything as general as noalias.

[quote="jimb, post:33, topic:3019"] Perhaps someone might be using the raw pointer as a key in a hash table that holds metadata about the pointer's referent. They might want to clean up the hash table entry after the pointer itself has been freed. [/quote]... which of course would break real bad if pointer addresses are ever re-used in a real program.

There is no such thing as a "wrong" type for a load in LLVM. clang implements TBAA by attaching metadata to loads to indicate they don't alias (which has nothing to do with LLVM type of the load).

There are architectures where simply accessing a pointer to an unmapped area can cause a fault. And it’s not impossible that the C library can unmap a block if there are no allocations pinning it.

Well, what happens if I store an i32 into a location, and then load from that location into an f32, but the bit pattern is not a valid floating point? Does LLVM commit to using IEEE floating point semantics? Is this equivalent to doing a bitcast, with the same UB in the same situations?

Actually, I just noticed bitcast is defined as "this is like storing to memory and loading back", and then it goes on saying that converting to/from a pointer and a non-pointer is not allowed.

However, memcpy can certainly copy pointers, and (I assume) it does use an integer type, so... maybe loading a pointer using i64 is fine, and storing it back to a different location as i64 and then loading that 2nd location as a pointer is fine, but doing anything else with the intermediate i64 is bad because it is the result of a ptr-to-int conversion that did not go through ptrtoint.

That's great for Rust, then!

Does LLVM commit to using IEEE floating point semantics? Is this equivalent to doing a bitcast, with the same UB in the same situations?

Quoting the LLVM documentation:

The ‘bitcast‘ instruction converts value to type ty2. (...) The conversion is done as if the value had been stored to memory and read back as type ty2.

Seems like that is indeed the case.

There is no such thing as a bit pattern which isn't valid floating-point.

ptrtoint is semantically the same as a bitcast if the input and output sizes are the same. It exists mostly as a historical artifact; at one point, it was possible to compile for multiple targets with different pointer sizes using the same IR file.

Oh, I see. Well, then it seems all LLVM types can contain all bit patterns, the types merely serve to describe their intended use.

The doc kind of contradicts you here

Pointer (or vector of pointers) types may only be converted to other pointer (or vector of pointers) types with the same address space through this instruction.

I read this as "if you violate this condition, you get UB". Of course ptrtoint of equal-sized types and bitcast are (probably) implemented the same way in the backends, but that would still leave some room for optimizations to treat them differently.