The question is mostly one of working with the LLVM community to resolve the ambiguities and contradictions in the current specification. We made a concrete proposal for a spec of the subtle parts of LLVM IR that are involved in the discussion here, but this is probably not the spec -- it is just a basis for discussion.
This process is tedious and slow. For example, getting freeze into LLVM took 3 years from when this paper got published. But it is unavoidable; there is no way to just say "here's a formal spec of LLVM, that's it"; this cannot be imposed onto the LLVM community, it needs to be worked out with them. I barely know anyone in that community, so I am not the right person to do that work. Nuno and John, and other people like Juneyoung, are doing great work here, but it takes time.
Officially formalizing these aspects of LLVM IR only makes sense once the informal spec is unambiguous and free of contradictions. At that point, formalizing it e.g. in Coq is a matter of putting in the work, which could take a PhD student a few months to a year depending on their prior background and how far the spec goes in covering all of the IR. A formalization on its own is not terribly useful either; it needs to come with a way to test that the implementation matches the spec, and ideally one can prove some theorems... there are many things one can do here, but I think the time is not yet ripe for that.
Summary: Formalization doesn't magically make these hard problems go away, and in particular a formalization that makes no attempt at actually matching what LLVM optimizations do is pretty much useless -- and what LLVM optimizations do is not self-consistent, so LLVM needs to change before a formalization can have any chance of matching what LLVM does. This requires working closely with the LLVM community.
I agree with the earlier parts of your post, but the last paragraph is highly speculative. What about ((x+3)-x)-3? Now add other arithmetic and bitwise operations to this, and these questions become quite complicated. You are basically suggesting to turn part of the program into a symbolic computation and decide something about it; I can feel undecidability lurking around the corner. So if the goal is to keep our semantics decidable, I am far from convinced that it is "possible".
And that is not even considering the problem of tracking this symbolic computation through memory or whatever other means of communication there are. How are symbolic names even assigned to values at runtime? What if the program writes x to a file and reads the content back as y and then computes x-y? Your proposal raises more questions than it answers.
It seems much more acceptable to lose such transformations (converting between control and data dependencies, arithmetic simplifications) on pointer types than on integers (and you seem to agree, based on your last post). That's why I think "provenance on pointers only" is a reasonable trade-off.