The span issue, mentioned by @kpreid, is something I've thought a bit about before. I'd really like it if spans could be made relative to the named item that contains them. So, for example, in the code given above, the panic message would contain a span relative to the start of the function `bar` and a `DefId` of `bar`, or some stable ID derived from the item path. At runtime, or when a panic occurs, the actual line number could be looked up by calling some function, passing it the relative span, the `DefId` and a reference to a table that maps `DefId`s to their spans. Structured like this, when you make an edit, all that would need to change would be the function you edited and the table mapping `DefId`s to spans.
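A minimal sketch of what that could look like. All the names here (`RelativeSpan`, `StableDefId`, `DefSpanTable`) are hypothetical, invented for illustration, not an existing rustc API:

```rust
use std::collections::HashMap;

/// A span relative to the start of its enclosing item rather than to
/// the start of the source file. (Hypothetical type sketching the idea.)
#[derive(Clone, Copy)]
struct RelativeSpan {
    /// Line offset from the first line of the enclosing item.
    line_offset: u32,
    column: u32,
}

/// A stable ID derived from the item path, e.g. a hash of `my_crate::foo::bar`.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StableDefId(u64);

/// Maps each item to the absolute line it starts on. After an edit, only
/// the entries for edited items (and this table itself) need to change.
struct DefSpanTable {
    item_start_line: HashMap<StableDefId, u32>,
}

impl DefSpanTable {
    /// Resolve a relative span to an absolute (line, column), e.g. when
    /// formatting a panic message.
    fn resolve(&self, def: StableDefId, span: RelativeSpan) -> Option<(u32, u32)> {
        let start = self.item_start_line.get(&def)?;
        Some((start + span.line_offset, span.column))
    }
}

fn main() {
    let mut table = DefSpanTable { item_start_line: HashMap::new() };
    table.item_start_line.insert(StableDefId(42), 100); // say `bar` starts at line 100
    let span = RelativeSpan { line_offset: 3, column: 5 };
    assert_eq!(table.resolve(StableDefId(42), span), Some((103, 5)));
}
```

The point is that the panic machinery baked into the binary only holds `(StableDefId, RelativeSpan)` pairs, which survive edits to unrelated code, while the small table is the only per-build artifact.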
A related issue is the size of the codegen unit. At the moment, even in debug builds, many functions are packed together into a single codegen unit. At least for a non-optimised (dev) build, it'd be ideal if each function were a separate codegen unit. That way, when a single function is changed, only that one function needs to be recompiled. If the changed function was inlined into another function, then that function would also need to be recompiled, but other functions that weren't affected shouldn't need to be. I think I recall @bjorn3 mentioning that the cranelift backend compiles each function separately.
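The invalidation this implies is essentially a reverse-dependency walk over inlining edges. A rough sketch, assuming we track which functions each function was inlined into; the names and data structures are illustrative, not rustc internals:

```rust
use std::collections::{HashMap, HashSet};

/// For each function, the set of functions it was inlined into.
/// (Illustrative; rustc tracks this information differently.)
type InlinedInto = HashMap<String, Vec<String>>;

/// Given the functions whose bodies changed, return everything that must
/// be recompiled: the changed functions plus, transitively, every
/// function they were inlined into.
fn needs_recompile(changed: &[String], inlined_into: &InlinedInto) -> HashSet<String> {
    let mut dirty: HashSet<String> = changed.iter().cloned().collect();
    let mut stack: Vec<String> = changed.to_vec();
    while let Some(f) = stack.pop() {
        if let Some(callers) = inlined_into.get(&f) {
            for caller in callers {
                // Walk each function at most once.
                if dirty.insert(caller.clone()) {
                    stack.push(caller.clone());
                }
            }
        }
    }
    dirty
}

fn main() {
    let mut inlined_into = InlinedInto::new();
    // Hypothetical edit: `bar` changed, and it had been inlined into `foo`.
    inlined_into.insert("bar".into(), vec!["foo".into()]);
    let dirty = needs_recompile(&["bar".into()], &inlined_into);
    assert!(dirty.contains("bar") && dirty.contains("foo"));
}
```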
Another thing related to incremental compilation that I've thought a bit about is whether it's primarily pull-based or push-based. In a pull-based model (which is what we have now), the compiler starts by effectively asking what it needs in order to build the binary. It parses everything, then runs queries, reusing cache hits from previous runs. One issue with this is that if you have a very large tree of queries, you need to traverse the tree right down to the leaves before you can determine that those leaves, and thus their parents in the tree, haven't actually changed. Another problem is that some things don't lend themselves to queries like this at all, because they always change. An example is the list of monomorphised items, i.e. the list of all functions that need to be passed to codegen. Any code edit might have changed this list, so it doesn't make sense to cache it. Recomputing it from scratch on every compile takes time and is something that often shows up in `-Ztime-passes`.
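A toy illustration of the pull-based shape, showing why even a cache hit costs a walk down to the leaves. Nothing here corresponds to rustc's actual query system:

```rust
use std::collections::HashMap;

/// Toy pull-based query cache: a cached result can only be reused after
/// re-fingerprinting the leaf input underneath it to prove it unchanged.
struct Queries {
    file_contents: HashMap<&'static str, String>,
    /// Results from the previous run, keyed by file, along with the
    /// input fingerprint each result was computed from.
    cache: HashMap<&'static str, (u64, String)>,
}

impl Queries {
    /// Stand-in for a real content hash.
    fn fingerprint(input: &str) -> u64 {
        input.bytes().map(u64::from).sum()
    }

    /// Pull-based lookup: even on a cache hit we must read and hash the
    /// leaf (the file) before we can trust the cached result.
    fn parsed(&mut self, file: &'static str) -> String {
        let contents = self.file_contents.get(file).cloned().unwrap_or_default();
        let fp = Self::fingerprint(&contents);
        if let Some((cached_fp, cached)) = self.cache.get(file) {
            if *cached_fp == fp {
                return cached.clone(); // leaf verified unchanged: reuse
            }
        }
        let result = format!("ast of {contents}"); // stand-in for real parsing
        self.cache.insert(file, (fp, result.clone()));
        result
    }
}
```

With thousands of queries stacked on top of each other, that verification walk is pure overhead in the common case where almost nothing changed.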
The alternative model (although potentially both models can be used together) is push-based. In a push-based model, the compiler starts by determining which inputs have changed, e.g. it looks at all of its input files and finds that just one has changed. It then reparses just that one file. Taking the parsed items, it pushes these changes through the stages of the compiler; only the items that have actually changed need to be pushed. So by the time you get to, say, the list of monomorphised items, rather than computing it from scratch, you've got a set of deltas: functions added, removed or redefined.
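In code, the deltas arriving at that stage might look something like this. Purely illustrative; none of these types exist in rustc:

```rust
use std::collections::HashMap;

/// The change a push-based pipeline propagates for one item, instead of
/// recomputing the whole list. (Illustrative, not rustc internals.)
enum Delta {
    Added(String),     // new function body
    Removed,
    Redefined(String), // replacement body
}

/// The list of monomorphised items, updated in place from deltas rather
/// than rebuilt from scratch on every compile.
fn apply_deltas(mono_items: &mut HashMap<String, String>, deltas: Vec<(String, Delta)>) {
    for (name, delta) in deltas {
        match delta {
            Delta::Added(body) | Delta::Redefined(body) => {
                mono_items.insert(name, body);
            }
            Delta::Removed => {
                mono_items.remove(&name);
            }
        }
    }
}
```

The same delta stream can then flow onwards, e.g. only the `Added` and `Redefined` entries need to go to codegen.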
A push-based model is also ideal for integrating with an incremental linker, since the compiler can pass just the bits that have changed rather than passing everything and making the linker figure out what changed. I've been writing a linker called Wild with the plan to make it incremental, so this has been on my mind.