Pointers Are Complicated II, or: We need better language specs

By the way, there is a pointer version of x - x:

pub fn foo(x: *const i8) -> *const i8 {
    x.wrapping_offset(-(x as isize))
}

which LLVM will currently optimize to return constant null, breaking provenance.

However, unlike with integers, it would be more feasible to simply forbid that optimization on pointers, since there is little reason to write such code in practice.

Which is not a problem in and of itself though, right? As long as no other optimizations that consider pointers to have provenance are carried out after the "x - x" is replaced by "0". Anyway, this replacement with 0 in fact doesn't seem to be restricted to late optimization passes only:

pub unsafe fn foo(x: *const i8) -> i8 {
    *x.wrapping_sub(x as _).wrapping_add(x as _)
}

fn main() {
    let x = 42;
    println!("{}", unsafe {foo(&x)});
}

This should be safe (right?? I'm not actually 100% sure; the docs of those wrapping operations are a bit ambiguous about whether the pointer value is allowed to leave the object's bounds and go back in bounds afterwards), but it only works in Debug builds (you get an illegal instruction on a Release build).

That seems like a bug to me. @RalfJung, any idea?

I wonder if this section applies https://llvm.org/docs/GetElementPtr.html#can-i-cast-an-object-s-address-to-integer-and-add-it-to-null (wrapping_add is GEP without inbounds).

That sounds like it is the reason for it behaving like this (since it would apply to ptr::null().wrapping_add(x as _), and we've seen that x.wrapping_sub(x as _) optimizes to the ptr::null() literal), but a naive reading of that seems to only be saying "you can't offset null to an object and access the object," not "you can't offset the address of an object to null and back then access the object." Or IOW, that says that *(0 + ptr) is invalid, but says nothing about *(ptr + 0) or (*ptr - ptr + ptr).
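
To picture the decomposition being described, here is a hypothetical "after the first fold" version of the snippet from earlier (an illustration of the reasoning only, not the IR LLVM actually produces):

pub unsafe fn foo_after_folding(x: *const i8) -> i8 {
    // Assume the optimizer has already folded `x.wrapping_sub(x as _)`
    // into a null-pointer literal; the rest of the expression then looks
    // like offsetting null back up to `x`'s address -- which is exactly
    // the *(0 + ptr) case the GEP documentation calls out.
    let base: *const i8 = std::ptr::null();
    *base.wrapping_add(x as usize)
}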

The thing is, you need to make sure that the definition you propose agrees with what LLVM thinks it should be. If you formalize something at the current stage, you'll have to make plenty of decisions, and then if the LLVM community says "nope we're going to make a different decision" (a) you didn't actually formalize 'LLVM', and (b) you need to adjust your formalization and all the proofs.

You can define 'true' and 'false' all day long, but if you want this to benefit actual LLVM users you also need to convince the LLVM community to accept your definition of 'true' and 'false'.

So maybe "useless" is too strong of a term, but there's no reasonable way to claim that one has formalized the semantics of LLVM at the current stage. All one can do is formalize a semantics for LLVM and point out the consequences of the choices one made: which optimizations are lost, which ones are still working. IOW, all one can do is map out a part of the design space, but one cannot actually decide where in the design space real LLVM will end up being.

(Unfortunately there are some papers out there claiming to formalize X without mentioning these caveats, i.e., they present a formalization as having solved the problem so we can now move on to the next thing. I think that is problematic, since the problem of putting actual LLVM on a more solid foundation is not solved by such a paper at all -- and if there is no discussion or dialog or theorems, this barely even helps towards solving the problem. That's why I am emphasizing the importance of working with the community as part of the formalization process.)

Note that this is an incredibly fragile argument to make. Basically you are saying there are two LLVM IRs, IR1 and IR2, with the same syntax but different semantics: IR1 has provenance and IR2 doesn't. The given optimization can only be correct in IR2, so at some point in the optimization pipeline there needs to be a clear transition from working with IR1 to working with IR2. IR1 optimizations can easily be wrong on IR2 and vice versa, so putting each optimization on the right side of this transition is crucial.

So far I have seen no indication that LLVM IR is intended to have multiple semantics, with different optimizations applying at different stages. Until that changes, I will assume the intention is that all optimizations can be applied in any order, and that there is a single semantics going along with their single IR.

Hihi, this is a fun one. :slight_smile: Yes I think it is a bug. If we write + for getelementptr, then 0+x is, to my knowledge (and as @bluss pointed out), explicitly intended to not give you a valid pointer, since 0 is considered its own, separate "object" and no amount of getelementptr can turn a NULL ptr into a ptr to something else. (It is even questionable what this will do for other integer literals.) So *(0+x) is always UB. Combine that with x + (-x) optimizing to 0, and you got yourself a nice provenance-related miscompilation.

Who wants to create a bug report?

It is explicitly intended to be allowed to leave the object and then go back inbounds. Looks like we need to clarify the docs.
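
For illustration, the pattern being blessed here is something like the following minimal sketch (assuming the clarified reading of the wrapping_* docs):

fn out_and_back(p: *const u8) -> *const u8 {
    // Goes 1000 bytes past the object and comes straight back.
    // The intermediate pointer is out of bounds, but it is never
    // dereferenced, and the final pointer is back in bounds.
    p.wrapping_add(1000).wrapping_sub(1000)
}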

There are some tasks that require an ability to robustly determine whether two arbitrary pointers represent the same address, even in cases where they are based on different objects that are adjacent in memory. For example, even if a language doesn't provide a means of forcing top-level objects to be placed in adjacent storage, an implementation may allow programs to import symbols defined in other languages that do, and a program might e.g. use such symbols to mark the beginning and end of a region of storage whose size is determined at link time.

There are some tasks where it would be acceptable for a comparison between a pointer which identifies some object and one that points just past an immediately-preceding object to independently yield true or false, provided that the behavior of each comparison is consistent with it either yielding true with no side effects, or yielding false with no side effects. For example, a task might require using one of two functions, which would be usable in an overlapping set of cases. If a program uses pointer comparisons to determine which function to use, and weird pointer comparisons would only occur in cases where both functions would be usable, then the fact that an arbitrary comparison result might cause the program to arbitrarily select which function gets used shouldn't cause a problem.
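
As a rough sketch of that second category (illustrative only, not from any real codebase): whether the two pointers below compare equal depends entirely on how a and b happen to be laid out, and a program in this category only needs the answer to be consistent, not to be any particular value.

fn main() {
    let a = [0u8; 4];
    let b = [0u8; 4];
    let past_a = a.as_ptr().wrapping_add(a.len()); // one past the end of `a`
    let start_b = b.as_ptr();                      // start of `b`
    // May legitimately print either `true` or `false`, depending on layout.
    println!("{}", past_a == start_b);
}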

Finally, there are tasks which would never involve comparison of pointers that could identify the same address without being based upon the same object, and where it would be acceptable for a program to behave in arbitrary fashion if such comparisons were performed.

Each category of task would be compatible with some optimizations that would be incompatible with the preceding category. The only way a language specification could be optimally suitable for all three categories of tasks would be if it recognized the different categories of tasks and their different semantic requirements, and allowed programs to specify what semantics they actually required. If a language only offers one way of comparing pointers, any specification would either render the language unsuitable for some tasks, or needlessly impair optimizations which would have improved the performance of some others.

The ability to form pointers from integers is essential in low-level programming scenarios where the programmer knows things about the environment that the compiler cannot, such as regions of address space the environment makes available for various purposes. Further, the ability to convert pointers to integers is often essential in low-level programming scenarios where a program may need to tell the environment about an address it should do something with (e.g. start a background I/O operation of N bytes starting at address P). Some compiler writers insist that it would be unacceptably expensive to treat every pointer-to-integer cast whose dependencies aren't fully tracked as though it might leak the address to the outside world, and to treat every integer-to-pointer cast as though it might derive a pointer from any address that has been leaked, but such claims are fallacious. Although hyper-cautious treatment of such casts would force a compiler to generate many loads and stores that it otherwise wouldn't, and although compiler writers might not know of any useful purpose such loads and stores might serve, such casts are rare except in programs that interact with the environment in ways compilers can't expect to know about. Eliminating the loads and stores would thus be likely to convert a program that interacts with the environment in some useful fashion into one that is more "efficient" (but useless) because it optimizes out such interaction.
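
A minimal sketch of the first kind of scenario; the register name and the address 0x4000_0000 are purely hypothetical stand-ins for whatever the platform documentation would actually specify:

// Hypothetical memory-mapped status register documented by the platform,
// not by the compiler or by any Rust code the compiler can see.
const STATUS_REG_ADDR: usize = 0x4000_0000;

unsafe fn read_status() -> u32 {
    let status = STATUS_REG_ADDR as *const u32; // integer-to-pointer cast
    // Volatile read, so the access is performed exactly as written and is
    // not assumed by the compiler to be removable.
    unsafe { status.read_volatile() }
}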

In a low-level language, when a pointer is formed from an integer, one of three scenarios should apply:

  1. The pointer targets a region of address space unrelated to anything the implementation knows about. In that case, an implementation should assume that the programmer somehow knows something about the environment the compiler doesn't.

  2. The pointer targets a region of address space belonging to a live object whose address has been examined, converted to an integer, or exposed to the outside world. In that case, it should behave as though it identifies the object in question.

  3. The pointer targets a region of address space that the implementation has received from the environment to use as it sees fit, and which does not correspond to any live object as described in #2. Accessing the storage associated with such a pointer would be Critical Undefined Behavior, whose effects can generally not be reasonably predicted even by someone familiar with the implementation and the environment.

Provided a compiler treads cautiously in the vicinity of casts between integers and pointers (essentially treating pointer-to-integer casts with release semantics, and integer-to-pointer casts with acquire semantics), it shouldn't need to know or care which of the above scenarios applies. In most cases, code that uses pointer-to-integer conversions will do so for the purpose of exploiting features of the target environment in ways the compiler shouldn't be expected to fully understand, so any "loss of optimizations" would be illusory.

That is not what any compiler writer I know would say, for all compilers I'm aware of assume that casting a pointer to an integer causes the pointer to escape and every integer-to-pointer conversion creates a pointer that may alias any external pointer (including the aforementioned escaping pointers). The problem that's been identified is that this is not sufficient.

If you were to describe the semantics operationally, you'd observe that pointers carry information in them that is more than just the integer address of their location, and this extra information is not saved upon conversion to an integer in a naïve semantics. Converting an integer back into a pointer must generate some information for the pointer. A naïve semantics would imply that a pointer-to-integer-to-pointer roundtrip is equivalent to doing nothing at all (on the basis of it all being no-ops at the hardware level), but careful exploration of operational semantics suggests that it is instead equivalent to clearing those extra information fields, and it is not a valid refinement to fill that information in with a subset of everything.
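
In Rust terms, the roundtrip in question is just the following (a sketch of the semantic question being posed, not a claim about any particular compiler's behavior):

fn roundtrip(p: *const u8) -> *const u8 {
    let addr = p as usize; // pointer-to-integer: the "extra information" is not stored in `addr`
    addr as *const u8      // integer-to-pointer: where does that information come from now?
}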

With LLVM, there is another wrinkle, in that memory is untyped, so effectively every load or store of a pointer value is converting it to or from an integer. So LLVM is permitted (and used to until recently, albeit very controversially!) to turn a store (load i8*) into store (load i64), which will introduce inttoptr/ptrtoint into code where the user previously wrote only pointers. (And yes, this has caused optimization problems--I've been bitten by this "bug".) Take the implicit inttoptr literally, and literally every pointer variable in your program escapes because of the presence of the memory inttoptr; ignore the implicit inttoptr entirely and you probably introduce a soundness hole by folding a user inttoptr into the implicit memory one and miss that you actually need one.

Semantics are difficult, even for the people who know what they're doing. A lot of potential pitfalls with semantics aren't apparent until you can find examples of where things go wrong or try to write the optimizations that you intend to write. C++'s experience with relaxed atomics is an example of where something seemingly simple turned out to be more complex than expected, and the issues LLVM has with inttoptr and undefined values are a long-running exercise in "vague, obvious semantics don't work, and most compiler authors aren't prepared to reason about these in a more correct semantic model" (it doesn't help that the LLVM community isn't particularly welcoming towards formalizing their semantic model).

That is not what any compiler writer I know would say, for all compilers I'm aware of assume that casting a pointer to an integer causes the pointer to escape and every integer-to-pointer conversion creates a pointer that may alias any external pointer (including the aforementioned escaping pointers). The problem that's been identified is that this is not sufficient.

It doesn't sound to me like clang treats pointer-to-int conversions as having release semantics with regard to later int-to-pointer conversions, since if the conversions really were being processed with such semantics, any pointer produced by an integer-to-pointer conversion would be recognized as potentially aliasing every pointer that has leaked.

With LLVM, there is another wrinkle, in that memory is untyped, so effectively every load or store instruction is converting its type to an integer.

So, LLVM needs to include acquire/release notations when the conversions occur as a result of source-code constructs and, until tools that generate LLVM can properly generate such notations, a "force soundness" mode that would sacrifice optimizations for the benefit of people who would prefer a compiler that generates code that's reliably correct, even if it's sub-optimal, to one which generates more "efficient" code that will "probably" work.

(it doesn't help that the LLVM community isn't particularly welcoming towards formalizing their semantic model).

I don't think it would be possible to formalize anything resembling the LLVM semantic model without making some fundamental changes to it first. Such changes would include the abandonment of the idea that optimization must never observably affect the behavior of correct programs, and its replacement with specifications describing when transformations may be generally performed, and what constructs would block them. A program whose behavior might be affected by such transformations could be a correct program if it would meet all its application requirements with any allowable combination of transforms applied.

Consider, for example, the following two language rules: 1. If a program would enter a loop whose exit point is statically reachable, but the loop would never terminate, an implementation may behave in arbitrary fashion; 2. If no individual operation within a loop would be observably ordered before any later operation that is statically reachable from it, the execution of the loop or any portion thereof may be reordered with respect to that later operation, or omitted entirely if it is not observably ordered before any observable operation.

If one allows that an implementation may impose a limit on a program's execution time, and raise a signal if it determines that the limit either has been exceeded, or would inevitably be exceeded in the absence of the signal, that would give an implementation license to trap if it happens to detect that it will never escape a particular loop. I can easily imagine situations where it may be useful to have an implementation trap on endless loops that it happens to detect, or where it would be useful to postpone or omit the execution of a loop which would be unable to affect program behavior under any circumstance where it would terminate. Note that if a compiler were given something like:

long long no_side_effects(long long x); /* defined elsewhere; has no side effects */

int test(long long x, int mode)
{
  do
    { x = no_side_effects(x); }
  while(x & 15);
  if (!mode) return 0;
  return x & 1;
}

Under the latter rules, a compiler would be allowed to generate code that skips the execution of the loop in cases where mode is zero, or that performs the loop unconditionally and then returns zero without bothering to look at the value of mode. If mode is non-zero, however, the read of x within the return statement would be sequenced after its computation within the loop. If a compiler observes that x & 15 is zero, it would be entitled to assume that x & 1 would likewise be zero, but a compiler that postpones the evaluation of x & 15 that would have occurred within the loop would not be entitled to use computations it hasn't performed to justify assumptions about the value of x & 1.

The discussion about formalization itself has been moved to a separate thread, because discussing pointers specifically, and compiler IR and optimization in general, in the same topic was getting to be a bit of a tangled read.

Doesn't seem like that powerful of a point to me. Of course if you derive pointers via some invalid means it will potentially have downstream negative effects on the optimizer. This goes for any invalid pointer nonsense:

int n = 5;
int* p = (int*)rand(); // by an impossible coincidence, p now points to n
*p = 2; // now n == 2
printf("%d\n", n); // OPTIMIZATIONS BROKEN!!

@jamesblacklock your example program has UB. The way UB interacts with non-determinism, if any non-deterministic choice leads to UB, then the compiler can assume that this program point is already unreachable. (It needs to make sure that it is consistent about it, which can be tricky, but it can be done.)

Since the program has UB, the optimization doesn't break the program -- it already was broken before.

@RalfJung That's my point; the program is broken because of wonky pointer misuse, not because of optimizations. If you decide to do wonky invalid things with pointers, whether that be generating random numbers or reaching past the end of an array, then stuff (including optimizations) is going to break. I could be wrong, but I don't think the C standard even guarantees the layout of variables on the stack. Even if it does, it's a really bad idea to leverage that.

Unless an implementation specifies the behavior of rand() and how it will place n, a program execution where rand() happened to yield the address of n but optimization failed to accommodate that would be indistinguishable from one where rand() had simply yielded some other address. If, however, the program had been something more like:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int n = 5;
int main(void)
{
  uintptr_t pn = 3*(uintptr_t)&n;
  uintptr_t r = rand();
  uintptr_t rn = r*15;
  if (pn == rn)                /* pn == rn implies (uintptr_t)&n == r*5 */
    *(int*)(r*5) = 9;
  printf("%llX %llX %d\n", (unsigned long long)pn, (unsigned long long)rn, n);
}

an implementation which doesn't document both how rand() is computed and how static objects are placed would be free to arrange things in such a way that pn and rn could never be equal, in which case it would not have to allow for the possibility that the assignment to (int*)(r*5) would affect n. On the other hand, I don't think an implementation should define uintptr_t (which is an optional type) unless it would treat conversion of a pointer to uintptr_t that may be observed in ways whose consequences the compiler can't fully understand as a potential leak, and a pointer formed by casting a uintptr_t as a potential alias for any object whose address has leaked. If e.g. r happened to be 0x1110 and the address of n happened to be 0x5550, then whether or not the compiler could see any way by which a programmer might know that (int*)(r*5) would yield the address of n, it should allow for the possibility that such a thing might occur. I would have no problem with implementations that were incapable of upholding such a guarantee, if they would refrain from defining uintptr_t and intptr_t.

Note that in terms of what the Standard actually mandates, it requires nothing beyond the fact that if a pointer is round-tripped through uintptr_t and back to its original type, the resulting pointer must compare equal to the original. The Standard specifies nothing about whether an implementation must process any other actions involving the pointer in meaningful fashion, instead leaving that as a Quality of Implementation issue. Some compiler writers interpret the Standard's failure to forbid wacky treatment of a construct as an endorsement of such treatment, but in reality the Standard makes no effort to prevent the creation of "conforming" implementations that are of such garbage quality as to be totally useless, indeed allowing for the possibility that an implementation could be conforming without having to meaningfully process any programs that aren't contrived and useless.

Your program is broken because of "wonky pointer misuse". But the first program in my blog post is not; it is only doing things that are explicitly supported in the standard: casting a pointer to an integer, and then casting the very same integer back to a pointer. Nothing wonky happening here.

Looks real wonky to me. Casting an out-of-bounds pointer to an integer. There is no real program where you should ever do this. It is a bug. If this were in a code review, it would be rejected. Is it interesting that it manifests itself in this particular way? Absolutely! But that certainly doesn't make the optimizer broken. It makes your program broken.

Whether or not a program is broken is defined by the specification, not by "do real programs do this". The specification has no objections with this program.

Also, real programs use one-past-the-end ptrs all the time (for iteration loops). I would not be surprised at all if some of them cast those to an integer as well. So I don't buy your assertion that no real program would do this. But it doesn't even matter, since the compiler has to be correct for all programs, not just for those you consider realistic.
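
For instance, an ordinary iteration pattern like this sketch uses a one-past-the-end pointer, and the variant that compares the addresses as integers is easy to imagine in real code (illustrative only, not taken from any particular program):

fn sum(bytes: &[u8]) -> u32 {
    let mut p = bytes.as_ptr();
    let end = p.wrapping_add(bytes.len()); // one past the end
    let mut total = 0;
    while (p as usize) < (end as usize) {  // addresses compared as integers
        total += unsafe { *p } as u32;
        p = p.wrapping_add(1);
    }
    total
}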

Whether or not a program is broken is defined by the specification, not by "do real programs do this". The specification has no objections with this program.

Nothing an implementation does in response to an attempt to dereference a round-tripped pointer could render it non-conforming. In cases where two equal pointers could exist which would be usable only to access disjoint sets of objects, a capricious but conforming implementation could make any conversion of a round-tripped pointer yield a pointer which would compare equal to the original, but would not be usable to access any of the same objects. So in that sense, almost all programs that make use of integer-to-pointer conversions would be "broken" unless they are only intended for use on implementations that uphold guarantees stronger than what the Standard requires.

That having been said, if the example were changed slightly:

#include <stdint.h>
char x[1], y[1] = {0};
int test(unsigned char *p)
{
    uintptr_t ip = (uintptr_t)p;
    uintptr_t iq = (uintptr_t)(x+1);
    y[0] = 123;
    if (iq == ip) {
      *p = 45;
    }
    return y[0];
}

Here the code never dereferences the round-tripped pointer, but clang still processes the code nonsensically.

What's going on is similar to the way clang and gcc treat some goofy corner cases in the Standard involving restrict-qualified pointers, under which the C Standard would define the behavior of a construct of the form (someFlag && a || b) && doSomething when a is 0 (meaning someFlag would be evaluated but doSomething would be unconditionally skipped) or when b is 1 (meaning someFlag would be evaluated but doSomething would be unconditionally executed), and yet not define program behavior in cases where execution of doSomething is conditioned on someFlag. I can't think of any reasonable abstraction model in which the reason for a compiler to decide to execute or skip a piece of code would play any role in determining whether its behavior is defined, but that's how the Standard specifies the behavior of restrict, and clang and LLVM are simply applying the same logic elsewhere.