At the moment, we have two representations of Rust code in the compiler: the AST and LLVM’s IR. With the help of many, many side tables, the various passes are driven by the AST, and the compiler eventually builds an IR representation which LLVM then optimizes and works its magic on.
My problem with this is that the AST is too abstract to capture a Rust program properly, while LLVM’s IR is too low-level to reason about effectively using Rust semantics. All the side tables we carry around exist to fill in the gaps the AST leaves about what is really going on in the code. Types, autoderef behaviour, method resolutions and more have to be stored in these side tables because the AST has no way of incorporating them. Using all these side tables is clunky and error-prone: it’s easy to overlook something or forget about one of the tables.
With this in mind, I would like to propose the implementation and use of an intermediate representation that is more conducive to the kind of analysis and processing we want to do. This IR would exist at a much higher level than LLVM’s IR, but be more explicit than the AST. It would ideally be the primary representation of the code after typechecking.
Requirements
As I see it, the requirements for this IR are as follows:
Fully typed. Using the full power of Rust’s type system, with everything carrying an explicit type.
Control- and data-flow friendly. Dataflow especially is important to pretty much every part of the compiler and is closely related to control flow.
No side tables. Or at least, minimal side tables. As much as possible, any information that is currently stored in side tables should either be stored inline (types) or represented in an expanded or rewritten form in the IR (autoderef, operator overloading), as sketched below.
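For instance, here is one illustrative desugaring (not necessarily the exact form such an IR would take) of an overloaded operator and an autoderef’d method call into fully explicit calls:

```rust
use std::ops::Add;

// What the programmer writes:
fn source(a: String, b: &str, v: Box<Vec<i32>>) -> (String, usize) {
    let s = a + b;   // overloaded `+`
    let n = v.len(); // method call found through autoderef
    (s, n)
}

// What a fully explicit IR would record instead:
fn lowered(a: String, b: &str, v: Box<Vec<i32>>) -> (String, usize) {
    let s = Add::add(a, b); // the operator overload rewritten as a plain call
    let n = Vec::len(&*v);  // the deref through the Box made explicit
    (s, n)
}
```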
Advantages
Less error-prone. By storing information inline, we cannot have missing entries in tables. Instead, such mistakes are caught when the compiler itself is built, rather than showing up at runtime (see the sketch after this list).
Better represents the program, not the source. The source code is, ultimately, just a notation for a graph that performs the computation. By using a representation closer to that, it should be easier to reason about the behaviour of the program.
Allows for optimisation. While I’m not suggesting we try to do what LLVM does, there are certain things that LLVM cannot do (like NRVO). We can also perform better analysis on this data structure to help with generating good code in the first place.
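To make the “less error-prone” point concrete, a minimal sketch (hypothetical names throughout) contrasting a side-table lookup with inline storage:

```rust
use std::collections::HashMap;

pub struct Ty; // placeholder for a real type representation

// Side-table style: the type of a node lives in a separate map; a
// missing entry only surfaces as a failure while the compiler runs.
pub struct SideTables {
    pub node_types: HashMap<u32, Ty>,
}

pub fn type_of(tables: &SideTables, node_id: u32) -> &Ty {
    tables.node_types.get(&node_id).expect("no type for node") // can panic
}

// Inline style: the type is a field of the node itself, so building a
// node without one is rejected when rustc itself is compiled.
pub struct TypedExpr {
    pub ty: Ty,
    // ...kind, operands, span, etc.
}
```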
I was skeptical about this until Swift did it. Note that, as far as I can tell, Swift is going to try to recreate all or most of LLVM’s optimization passes and ultimately use LLVM just for codegen; I think Rust is close enough to C++ that we don’t need to go that far, though.
Cameron Zwarich has also put forth a good argument that taking full advantage of Rust’s type system for optimization is going to entail reasoning about memory the same way LLVM reasons about registers, which LLVM is likely never going to be able to really do. (For example, unique &mut memory locations can basically be treated as registers.)
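As a rough illustration of the kind of aliasing fact meant here (my sketch, not Cameron’s example):

```rust
// Because `x: &mut i32` is guaranteed unique, `y` cannot alias it: the
// first store to `*x` is dead, and `*x` can be kept in a register across
// the read of `*y`. An optimiser seeing only two C-like pointers of the
// same type cannot, in general, assume any of this.
pub fn example(x: &mut i32, y: &i32) -> i32 {
    *x = 1;     // dead store under Rust's aliasing rules
    let a = *y; // cannot observe or modify *x
    *x = 2;
    a + *x
}
```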
In general I’m OK with it, but we should be sure not to regress compiler performance. Another challenge is making sure we can map back to the AST with full fidelity for error reporting. But, if these two problems can be solved, I see no issues with it in principle.
One way to do the migration might be to start with the CFG that we already have, migrate all passes to be driven off it, then upgrade the CFG to become a full IR.
I think you actually want more than one IR; then we view the compiler as a series of lowering steps with a little processing along the way. Something like: frontend AST (also an interface to syntax extensions) -> compiler AST -> CFG -> trans IR -> LLVM. The compiler AST would have semantic info rather than syntactic, and metadata would become just a serialisation of that AST. Optimisation passes would work on the trans IR, borrow checking on the CFG. I imagine the trans IR would be almost in SSA form, and would use Rust types.
I think the split is important: the needs of the middle end and of trans are very different. Even in the frontend, there is a lot of stuff that works better with different representations, hence the split between the CFG and the compiler AST.
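For concreteness, that pipeline as a rough Rust sketch (all of these stage types are made up for illustration; none are real rustc types):

```rust
pub struct FrontendAst; // surface syntax; what syntax extensions see
pub struct CompilerAst; // semantic rather than syntactic; metadata is its serialisation
pub struct Cfg;         // control-flow graph; borrow checking runs here
pub struct TransIr;     // typed, roughly-SSA IR; optimisation passes run here

pub fn compile(ast: FrontendAst) -> TransIr {
    let sem = to_compiler_ast(ast);
    let cfg = to_cfg(&sem);
    to_trans_ir(&cfg) // the result is then handed to LLVM for codegen
}

fn to_compiler_ast(_ast: FrontendAst) -> CompilerAst { CompilerAst }
fn to_cfg(_sem: &CompilerAst) -> Cfg { Cfg }
fn to_trans_ir(_cfg: &Cfg) -> TransIr { TransIr }
```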
In the GHC chapter of the book The Architecture of Open Source Applications, some “Key Design Choices” are discussed. I think “No side tables” corresponds to “No Symbol Table” there. They discuss the advantages and disadvantages of that choice in light of their experience, which I think we would be wise to listen to.
They also write, “It is hard to know whether it would be better or worse overall to use symbol tables, because this aspect of the design is so fundamental that it is almost impossible to change. Still, avoiding symbol tables is a natural choice in the purely functional setting, so it seems likely that this approach is a good choice for Haskell.”
Combined with what @nrc said, that would be a benefit, yes. Removing from the frontend AST some of the stuff that only exists to help the compiler would aid stabilisation, since it decouples the frontend from the rest of the compiler. We’d be able to tweak the behind-the-scenes stuff without affecting the parsing stages.
I wonder if such a structure could actually reduce overall compile time for TDD-like programming styles (frequent compilations with only minor changes) by caching some or all of the Rust IR(s). Though this would need additional disk storage to persist it between compiler executions, that is normally not a problem, and where it is, the feature could be disabled.
Beyond that, the idea of introducing an additional IR seems to bring a lot of benefits when it comes to optimisations and separation of concerns.
Nevertheless, I wonder if it will come with a noticeable performance regression for non-incremental builds. The implementation also seems to be a lot of work, including temporary regressions, which would make it hard to introduce between minor Rust releases.
Where did you hear that? I had a talk with Ted Kremenek a while back about SIL (among other things), and he believed it was important for doing a lot of optimizations that can't be expressed very well at the LLVM IR level, and that if Clang were rewritten today it would probably use its own intermediate language as well (although LLVM IR is much better suited to C optimizations than to Swift, so there's less of a need there). But I never got the impression that SIL was meant to replicate LLVM's optimizations; after all, why do any work that LLVM is already willing to do?
I’d be very much in favor of giving rustc’s architecture a general overhaul after 1.0 is solid. Getting rid of the plethora of side tables certainly seems like a good idea to me. And I don’t think this has to be accompanied by performance regressions, especially if a new architecture makes use of the opportunities for doing things in parallel.
Oh yes, something like this (probably the earlier IR I mentioned, rather than the trans-stage IR) would make incremental compilation much easier. Without such a thing, or a massive overhaul of metadata (which is itself easier with such a re-architecture), incremental compilation is pretty much impossible.
Note that I think we could get a significant improvement with relatively little work by simply collapsing all the side tables into a small number of them, similar to how clang’s Sema works.
I had a great experience using two-level types (as described in http://blog.ezyang.com/2013/05/the-ast-typing-problem/). They made it really convenient to have almost an IR per pass. I suspect this would limit the downside of having no side tables, as the highly specialized IRs leak less information to phases that don’t need them.
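Roughly, the idea transplanted to Rust (illustrative types only):

```rust
// One tree definition, generic over its annotation: "the AST with only
// spans" and "the AST with types inline" are two instantiations of the
// same type, rather than one tree plus a side table.
pub struct Span { pub lo: u32, pub hi: u32 }
pub struct Ty; // placeholder for a real type representation

pub enum Expr<Ann> {
    Lit(i64, Ann),
    Add(Box<Expr<Ann>>, Box<Expr<Ann>>, Ann),
}

pub type ParsedExpr = Expr<Span>;       // before typechecking
pub type TypedExpr  = Expr<(Span, Ty)>; // after: every node carries its type
```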
I was thinking about IRs with errors and metaprogramming recently, and wanted to record some thoughts here.
The real reason you want an IR is that you want some core language that is easy to reason about (very few cases) but contains all the semantic information you care about (fully typed). GHC’s main IR is “just” a typed lambda calculus with coercions, but this has worked out really well for us. A core Rust calculus capable of expressing all Rust programs after borrow checking would be the ideal.
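As a toy illustration (nothing like GHC’s actual Core, and not a worked-out Rust core language), such a calculus can be very small while still being fully typed:

```rust
// A deliberately tiny, lambda-calculus-flavoured core: very few cases to
// reason about, and every term that needs one carries its type explicitly.
pub struct Ty; // placeholder for a real type representation

pub enum Core {
    Lit(i64),
    Var(String, Ty),
    Lam(String, Ty, Box<Core>),    // \x: T. body
    App(Box<Core>, Box<Core>, Ty), // f e, annotated with the result type
}
```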
GHC doesn’t allow mapping back from Core to the source AST, which means that all user-level type checking is done directly on the source AST, where useful locations are still available. (Well, GHC also has a “renamed” syntax tree where binders are resolved to their definition sites, but we don’t do any optimizations on that.) If you’re optimizing your IR (that’s why you want one, right?) then you wouldn’t generally expect the transformation to be reversible, though you could build a decompiler to make the IR easier to read. It seems like an odd thing to do, but not as odd as you’d think… (although GHC doesn’t do it).
Procedural macros are an interesting problem. Template Haskell ended up going the route of giving users access to the user-level AST rather than the core language; there are some people who use this, but dealing with Haskell’s large and complicated source language can be quite a pain. However, working with a lower-level IR poses challenges of its own: users would have to be able to interpret low-level IR code (here a decompiler would be useful), and they would have to be able to discharge any proof obligations that were baked into the IR when they generate it. It could work out, but not obviously so.
The IR is a great boon for incremental compilation. For example, the cache of a file you have compiled could contain IR snippets for inlinable code, rather than the original source syntax.
One (or even more) additional human-readable intermediate representation would have another advantage: working on the compiler would become more approachable.
Since it introduces a well-defined interface between different steps of the compilation process, it would enable people to work on specific parts without having to grok all of rustc.