Next steps for reducing overall compilation time

nikomatsakis · September 19, 2018, 7:26pm

Continuing the discussion from 2019 Strategy for Rustc and the RLS, let’s talk about ways to improve compilation time in a big way (over a 1-2 year time horizon, but with incremental wins along the way).

What are the big changes that make sense? What data can we gather to validate whether they make sense? What are the dependencies between them? Can we make these changes in a way that yields gains as we go, and not all at the end?

nikomatsakis · September 19, 2018, 8:06pm

Let me re-post a list of ideas from an earlier post, slightly extended:

End-to-end queries
- Right now, parsing + macro-expansion + name-resolution are not part of the query system and incremental can’t do anything with them
- Creating the incremental infrastructure from the start would let us start to change that
- This would be very useful for RLS integration of queries, I think
- However, this “session” code is grungy and old and I think this is probably non-trivial. We sketched out some plans at Rust All Hands, I’d have to go back to our notes, but I remember the conclusion was “this will be work”.
- This is also one place where the libsyntax2 story definitely intesects
Parallelize the compiler (tracking issue)
- @Zoxc did awesome work enabling us to parallelize the compiler
- But the last mile isn’t done! There were various refactorings and architectural changes we wanted to make as well.
- It is not integrated with incremental, either
- We’ve not done much measuring of perf
Multicrate sessions
- The “full vision” for queries is that we should be able to compile many crates at once, as I talked about earlier
- This might allow us to skip a lot of work
MIR-only rlibs (tracking issue)
- Right now, our rlibs contain actual executable code
- The idea here is that we just store MIR in the rlibs so that we can do a better job reducing duplicates and so forth
Erasing types where possible
- We’ve often discussed (most recently here) the idea of erasing types (to reduce monomorphization overhead) when we can tell it’s safe. This is a big job but a worthwhile one.
Optimizations:
- Optimizing on the MIR level saves work for LLVM. Plus we can do things it can’t. This might start with inlining plus propagating stuff we know to be immutable because of &T types, as well as CSE and copy-prop like things that should eliminate a lot of memcopies.
- The main goal here is improving compilation time: LLVM probably does most of this already, if not all of it, but it takes longer to do it, and it does it post-monomorphization (hence: many times).

Zoxc · September 19, 2018, 9:16pm

When it comes to parallelizing the compiler here are some of my idea / plans:

Test the existing code. Currently you can build rustc with Rayon supports which parallelizes some things. However this support is not tested. We need a way for end-users to easily be able to try out and test this. We should also do crater runs.
Extend the test suite to handle non-determinstic error message orderings so we can actually run the test suite with Rayon support enabled.
Run compiler passes in parallel. Instead of using Rayon internally in the compiler passes which has high overhead, try to run the compiler passes in parallel instead. This is a bit hard to do due to the early exits spread around in the passes to cut down error messages. There are also implicit dependencies between passes which would need to get removed.
Look for opportunities to parallelize serial parts of the compiler. This would mostly be in the front end, parsing, macro expansion, procedural macros, name resolution, etc.
Create a rustc server which cargo can use to run compilation sessions instead of spawning multiple rustc processes. This server would use a single Rayon thread pool and allow the compilation sessions to efficiently share CPU cores. There’s likely some overlap with multicrate sessions here.
Use Rayon to generate LLVM bitcode and run LLVM optimizations instead of the current custom thread pool used to run LLVM optimizations. This would be a major simplification of the code and would make the LLVM bitcode generation run in parallel.

Note that incremental compilation and parallel compilation can both be used at the same time, but there is a race condition which can cause non-ideal error message orderings.

Zoxc · September 19, 2018, 10:47pm

Another way incremental performance can be improved is to split compiler passes into queries operating on modules. Where we only run them if the module change instead of running them for the entire crate. This is useful for cheap passes where having a query per item would have too much overhead. I did try to introduce this for some passes, but ran into some issues. This probably needs some attention from @michaelwoerister. I also have a local branch which applies this to a nice number of passes, some of them in a very questionable way. It’s this approach very applicable to lints in particular, since the are usually cheap and local.

Doing this would also allow running the passes on the modules in parallel, without per-item overhead. I think this approach is probably one of the easiest ways to improve performance.

earthengine · September 20, 2018, 6:05am

Is it possible to have a distributed compiler like distcc? Instead of just parallelizing, this is much better and should be considered the next step after parallelizing. (I am dreaming to use 10 Raspberry Pi for speed up, or even 100 Amazon/GCloud/Azure VMs…)

What shall we call it? drustc or distrustc? Or I think we can just simply make rustc compatible with distcc and don’t reinvent the wheel?

scottmcm · September 20, 2018, 6:16am

distrust

n. Lack of trust or confidence.

earthengine · September 20, 2018, 6:20am

Looks like a good name for a block-chain currency... Maybe we can create one that you gain credit when you share your host to serve compile requests?

nikomatsakis · September 20, 2018, 1:57pm

It is possible, yes. We've talked about it in rust-lang/rust#47518.

Another related idea is "pre-compiling" crates.io crates for a certain set of architectures and so forth.

repax · September 20, 2018, 2:49pm

How much of the generated MIR code can be reused if one downloads a crate? And are there possible security issues with trusting that the compiled MIR code correctly corresponds with the source code?

nikomatsakis · September 20, 2018, 2:51pm

There are a lot of question marks here. In particular, the generated code depends very much on the precise versions of all the dependencies and so forth.

I'm not aware of any security holes but I wouldn't be surprised if I'm overlooking things. (Modulo the obvious thing of needing to trust the builder.)

Personally, I think I lean towards improving the compiler vs caching results, but I'm not sure. And it's at least plausible to do some of both. (Or maybe make it easy for projects to setup their own caching...?)

Soni · September 20, 2018, 4:28pm

Can you JIT the type system?

nikomatsakis · September 20, 2018, 8:45pm

My long-term plans for Chalk are to compile trait matching (and possibly more operations which we lower to logical predicates) to a byte-code and interpret it. This is pretty close. I probably wouldn't actually want to jit code though hey never say never.

Soni · September 20, 2018, 8:51pm

<(Zero, One) as Plus<(Zero, One)>>::Result == (One, Zero)

tmandry · November 5, 2018, 3:18pm

Can we collect diagnostics from parallel parts, then stitch the error messages back together in a deterministic way?

I'm guessing this would require a refactoring of diagnostics, so we could collect them in TLS and then "commit" them as a separate step.

bjorn3 · November 6, 2018, 7:30am

There exists Emitter in rustc_errors::emitter - Rust

nikomatsakis · November 13, 2018, 7:33pm

So I made a diagram that tries to show all these ideas and some of the dependencies between them:

diagram

I put an edge A -> B if B requires A to be done first (in some cases, the edges have labels for other meanings).

Some things are clear from this diagram:

We should pursue parallel queries. It’s the closest thing to low-hanging fruit and it would benefit end-to-end performance, which in turn benefits RLS
Some good topics for design discussion:
- RLS (and the relationship to rust-anayzer as well
- Design of “MIR 2.0” and MIR optimizations
  - MIR 2.0 is perhaps an overwrought term. But the idea is to change some aspects of MIR to help enable optimizations (while not breaking the borrow checker =)
- End-to-end queries and the “longer term” look of the compiler

Anyway, I’m curious if people have thoughts for “major initiatives” that are worth adding to the diagram but are not part of it. One thing that comes to mind is the Unsafe Code Guidelines work.

felix91gr · November 14, 2018, 2:11am

I might be wholly misinformed, but isn't Const Generics getting implemented right after the Genesis of Generic Germination PR? And if it is, does that imply that Chalk is implemented already? Or maybe Chalk is what's behind the GGG PR's code?

nikomatsakis · November 14, 2018, 2:07pm

That PR enables a lot of things, but realizing the “full design” is still blocked on lazy normalization, at least as far as I know.

The “Chalk” label is a shorthand for moving the compiler over to adopt the Chalk version of trait solving, as described in the rustc-guide.

alexcrichton · November 14, 2018, 3:18pm

One thing I’ve been thinking recently in terms of long term planning and high level ideas is to explore better cached and distributed compilation with rustc. The more adoption we get the more this tends to come up where it may be one of the next “low ish hanging fruit” for big perceived wins in compilation speed.

This is along the lines of sccache which has made some great strides in this area but I’d like to be more ambitious about running sccache on remote machines and running compilations across a cluster instead of always centralized on one machine.

Most of this work won’t actually go into rustc itself, but I worry that rustc makes it difficult, today at least, to enable this work. For example we can’t architecturally parallelize compilation of one crate across many machines. We also can’t easily kick off compilations of downstream crates once metadata is ready, even though that’s all it needs to proceed. Finally the story for incremental in rustc doens’t integrate well with sccache-like strategies or otherwise external caches.

I think it’d be awesome if we could take a look at the possibilities here and see what capabilities need to be added to rustc to enable these sort of compilation schemes, and then once the tools are in place we could work in parallel externally on tools like sccache to leverage rustc’s new capabilities

michaelwoerister · November 14, 2018, 3:55pm

Yes, I think we should review the entire compilation model and see if things could be improved. This ties in with (at least) “multi-crate compilation” and “MIR-only rlibs” from the diagram above.

Topic		Replies	Views
Next steps for RLS, Rustc, and IDE support compiler	21	4043	March 25, 2019
So you want to help with incremental compilation! compiler	6	3441	March 25, 2019
2019 Strategy for Rustc and the RLS compiler	88	20611	March 25, 2019
The RLS / Rustdoc interface to rustc compiler	9	1810	March 25, 2019
Help us benchmark incremental compilation!	48	12267	March 25, 2019

Next steps for reducing overall compilation time

Related topics