At the moment, we have two representations of Rust code in the compiler: the AST and LLVM’s IR. With the help of many, many side tables, the various passes are driven by the AST, and the compiler eventually builds an IR representation which LLVM then goes and optimizes and performs magic on.
My problem with this is that the AST is too abstract to capture a Rust program properly, while LLVM’s IR is too low-level to reason about effectively using Rust semantics. All the side tables we carry around exist to fill in the gaps that the AST leaves about what is really going on in the code. The types, autoderef behaviour, method calls and so on have to be stored in these side tables because the AST has no way of incorporating them. Using all these side tables is clunky and error-prone. It’s easy to overlook something or forget about one of the tables.
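To make the side-table problem concrete, here is a minimal sketch (all names invented for illustration) contrasting the two approaches: with an AST, a node’s type lives in a separate map keyed by node id, and a pass that forgets to populate or consult the map fails silently at runtime; with types stored inline, the information cannot be missing by construction.

```rust
use std::collections::HashMap;

type NodeId = u32;

#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Int,
    Bool,
}

// AST-style: the node itself knows nothing about its type.
struct AstExpr {
    id: NodeId,
}

// A pass must consult the side table and handle the "missing entry" case.
fn ast_type_of(e: &AstExpr, types: &HashMap<NodeId, Ty>) -> Option<Ty> {
    // Silently None if some earlier pass forgot to fill in this node.
    types.get(&e.id).cloned()
}

// IR-style: the type is stored inline, so it can never be absent.
struct IrExpr {
    ty: Ty,
}

fn main() {
    // Nobody populated the table: the mistake only shows up at runtime.
    let types: HashMap<NodeId, Ty> = HashMap::new();
    let ast = AstExpr { id: 0 };
    assert_eq!(ast_type_of(&ast, &types), None);

    // Inline storage: the field must be supplied to construct the node at all.
    let ir = IrExpr { ty: Ty::Int };
    assert_eq!(ir.ty, Ty::Int);
}
```

The design point is that the `Option` (and the bug class it represents) disappears entirely in the inline version.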
With this in mind, I would like to propose the implementation and use of an intermediate representation that is more conducive to the kind of analysis and processing we want to do. This IR would exist at a much higher level than LLVM’s IR, but be more explicit than the AST. It would ideally be the primary representation of the code after typechecking.
Requirements
As I see it, the requirements for this IR are as follows:
- Fully typed. It would use the full power of Rust’s type system, with everything having an explicit type attached.
- Control- and Data-flow friendly. Dataflow especially is important to pretty much every part of the compiler and is closely related to control flow.
- No side tables. Or at least, minimal side tables. As much as possible, any information that is currently stored in side tables should either be stored inline (e.g. types) or represented in an expanded or rewritten form in the IR (e.g. autoderef, operator overloading).
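As a hypothetical sketch of the third requirement (all names here are invented, not an actual design): where the AST represents `a + b` as a bare `BinOp` node whose meaning depends on side tables, the proposed IR would rewrite it into an explicit, fully typed call to the overload method, with the type carried on the node itself.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    I32,
    Point,
}

// Every IR expression carries its type inline (requirements 1 and 3).
#[derive(Debug)]
enum Expr {
    Local {
        name: &'static str,
        ty: Ty,
    },
    // Operator overloading is expanded: `a + b` on a user type becomes an
    // explicit call to its `add` method rather than a bare `+` node.
    Call {
        func: &'static str,
        args: Vec<Expr>,
        ty: Ty,
    },
}

fn ty_of(e: &Expr) -> &Ty {
    match e {
        Expr::Local { ty, .. } | Expr::Call { ty, .. } => ty,
    }
}

fn main() {
    // IR for `a + b`, where a and b are locals of type Point.
    let sum = Expr::Call {
        func: "Point::add",
        args: vec![
            Expr::Local { name: "a", ty: Ty::Point },
            Expr::Local { name: "b", ty: Ty::Point },
        ],
        ty: Ty::Point,
    };
    // No side table lookup: the type is right there on the node.
    assert_eq!(ty_of(&sum), &Ty::Point);
}
```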
Advantages
- Less error-prone. By storing information inline, we cannot have missing entries in tables; instead, such errors surface at compile time rather than at runtime.
- Better represents the program, not the source. The source code is, ultimately, just a notation for a graph that performs the computation. By using a representation closer to that, it should be easier to reason about the behaviour of the program.
- Allows for optimisation. While I’m not suggesting we try to do what LLVM does, there are certain things that LLVM cannot do (like NRVO). We can also perform better analysis on this data structure to help with generating good code to start with.
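For readers unfamiliar with NRVO (named return value optimisation), here is the kind of code it targets. Without the optimisation, `big` is constructed on `make_big`’s stack frame and then copied into the caller’s slot on return; with NRVO, the compiler constructs it directly in the caller-provided location. This snippet only illustrates the pattern; whether the copy is actually elided is up to the compiler.

```rust
// A deliberately large value, so the return-by-value copy is expensive.
struct Big {
    data: [u64; 1024],
}

fn make_big() -> Big {
    let mut big = Big { data: [0; 1024] };
    for (i, slot) in big.data.iter_mut().enumerate() {
        *slot = i as u64;
    }
    big // a named local returned by value: the NRVO candidate
}

fn main() {
    let b = make_big();
    assert_eq!(b.data[0], 0);
    assert_eq!(b.data[1023], 1023);
}
```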
So what do you guys think? Bad idea? Worst idea?