tl;dr: I propose adding an explicit codegen_units flag to Cargo, and making the Rust and Servo build systems use codegen_units implicitly in some situations. Long term, I would like Cargo to also use codegen_units implicitly in some circumstances. I propose not turning on parallel codegen by default for rustc.
Background
Parallel codegen is a facility for parallelising some of Rust’s compilation. Specifically, it allows the LLVM passes to be parallelised; all of rustc itself still runs sequentially. Since the LLVM passes can take up to 50% of build time, this can, in the right circumstances, have a significant impact on build times.
You can use parallel codegen by adding the -Ccodegen_units=4 flag to rustc. You can use values other than 4; 1 is the default and gives a regular (non-parallel) build. If you try it, please let me know how it goes (build times, any bugs, etc.).
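For example (a minimal sketch, assuming a plain library crate at src/lib.rs; note that rustc’s help spells the option with a hyphen, -C codegen-units):

```
# Debug build of a single crate with four codegen units (paths illustrative).
rustc -C codegen-units=4 --crate-type=lib src/lib.rs

# The same flag on an optimised build, where LLVM does more of the work.
rustc -O -C codegen-units=4 --crate-type=lib src/lib.rs
```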
In a normal build we generate one LLVM module for each crate. LLVM then processes that module into an object file, which is eventually linked with other object files (from other crates or libs) to make the final program. Under parallel codegen, rustc creates multiple LLVM modules per crate (one per ‘codegen unit’); these modules are processed in parallel to produce multiple object files, which are then linked together into a single object file for the whole crate. That object file can then be further linked as normal.
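If you want to see the split for yourself, one rough way (a sketch; the exact names and locations of the intermediate files vary between rustc versions) is to keep the temporary outputs around:

```
# -C save-temps keeps LLVM's intermediate files, so with codegen-units > 1
# you should see several per-unit bitcode/object files for the one crate
# before they are combined into a single object file.
rustc -C codegen-units=4 -C save-temps --crate-type=lib src/lib.rs
ls *.bc *.o
```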
There is some overhead to this - the additional linking step takes some time, and due to inlining, LLVM can end up duplicating some work across modules. The generated code is also slightly lower quality - since each LLVM module is smaller, there is less opportunity for LLVM to optimise. Thus, parallel codegen is not always a win on compile times, and it is not suitable for release builds. (I don’t actually have measurements of the code quality cost; I would like to benchmark parallel codegen’ed code and quantify how much of an impact there is.)
It is hard to say when exactly parallel codegen is useful. As guidelines, it is more effective to use parallel codegen in the following circumstances:
- optimised builds - because LLVM is doing more work
- large crates - the overhead is more likely to be outweighed by the benefit
- where there are low levels of parallelism in the overall build - because there is more ‘spare parallelism’ to take advantage of
This implies that relatively monolithic projects are a good target, i.e., where there is one large crate which is compiled more or less alone - for example, the librustc crate in Rust (although much less so now than a year ago), or the script crate in Servo.
There is also some mystery - some projects just seem to get more benefit than others. Presumably this has something to do with inlining, but I don’t have anything more precise.
Where parallel codegen does help, the optimum number of codegen units is nearly always 4. Occasionally 2 is better, but that is rare, and I’ve never seen more than 4 give an advantage. (Measured across a bunch of different crates and a few different machines; more data would be nice here.)
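If you want to gather this kind of data for your own code, a simple timing loop is enough (a sketch; substitute your own crate and flags):

```
# Compare wall-clock build times for different codegen unit counts.
for n in 1 2 4 8; do
    echo "codegen-units=$n"
    time rustc -O -C codegen-units=$n --crate-type=lib src/lib.rs
done
```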
See this reddit thread for some user reports, and see the previous discussion from the last time we thought parallel codegen was ready.
Next steps
The next step is to make better use of parallel codegen. That means making it more accessible, and perhaps using it by default in some situations. However, there are some trade-offs: parallel codegen works much better on optimised builds than debug builds, but the generated code is not good enough for release builds. It is not always a win for compile times, and can sometimes make things worse. Furthermore, in order to make a good guess about whether it will be a win or not, you generally need information about the whole program being compiled, not just the current crate (which is all rustc knows about).
Proposals
rustc
I propose no changes to rustc. I.e., codegen_units will never be used unless explicitly requested.
Cargo
I propose adding a codegen_units flag to cargo build. This should improve the visibility of the facility and encourage people to use it. However, since this can be done already by passing the flag through to rustc, it might not be worth the flag clutter. The other downside is that people might want to tailor the number of codegen units per crate, rather than for the whole project. That suggests some additions to the manifest. Suggestions welcome! My worry is that that will make things too complicated for most users to bother with.
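For what it’s worth, per-crate control of the top-level crate is already possible via cargo rustc, which forwards extra flags to rustc for the current package only (a stop-gap rather than proper manifest support):

```
# Pass codegen-units through to rustc for the current package only;
# dependencies are built with Cargo's usual settings.
cargo rustc -- -C codegen-units=4
```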
Long term, Cargo seems like the place we should set codegen_units by default. It has the information it needs - the dependency graph, and thus which crates build in parallel - and it can easily find the size on disk of crates, which is a good heuristic for the size of the code. It could then use this info (and the kind of build) to supply a number of codegen units to rustc for each crate. This seems optimal in terms of usability and close to optimal in terms of compile times. It’s obviously a bit more work though.
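As a rough illustration of the size-on-disk heuristic, you can already eyeball which crates are the big ones from Cargo’s output directory (a sketch, assuming the default target/debug/deps layout):

```
# List compiled dependencies by size; the largest rlibs are the crates
# most likely to repay extra codegen units.
ls -lS target/debug/deps/*.rlib | head
```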
Rust and Servo
We should make use of codegen_units in Rust and Servo. When doing non-release builds, we should use codegen_units=4 for appropriate crates - this will take a bit of measuring to get right, but shouldn’t be too hard. I think even using it for all crates will give some improvement. It should be possible to turn that off via a configure or mach option.
Alternatives
Use codegen_units=2 for all rustc debug builds. This seems viable - there will be some gains in some circumstances and some losses in others. But the losses should be relatively minor and the occasional gain will make this worthwhile. I would like more data before doing this though.
Introduce a new optimisation level for rustc (and maybe Cargo). We currently don’t really advertise -Copt-level=1. We could add a top-level flag that uses that together with codegen_units=4, or possibly even a flag for -O -Ccodegen-units=4, or something similar. The intuition here is for when you want an optimised, but not release-ready, build. Perhaps there should be a Cargo option (instead, or as well) which is not --release but does do some optimisations while still being suitable for debugging.
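Concretely, the kind of invocation such a flag or option would expand to already works today; a manual sketch of the proposed mode (not a new flag):

```
# An 'optimised but still debuggable' middle ground: light optimisation,
# parallel codegen, and debuginfo kept.
rustc -C opt-level=1 -C codegen-units=4 -g src/main.rs
```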
This might be a good idea in addition to the proposals above, but it is a wide design space and I’m not really sure what would be best. Suggestions welcome!