Parallel codegen plans


#1

tl;dr I propose adding an explicit codegen_units flag to Cargo and making the Rust and Servo build systems use codegen_units implicitly in some situations. Long term, I would like Cargo to also use codegen_units implicitly in some circumstances. I propose not turning on parallel codegen by default in rustc.

Background

Parallel codegen is a facility for parallelising some of Rust’s compilation. Specifically it allows for parallelisation of the LLVM passes. All of rustc itself still runs sequentially. Since the LLVM passes can take up to 50% of build time, this can, in the right circumstances, have significant impact on build times.

You can use parallel codegen by passing the -C codegen-units=4 flag to rustc. Values other than 4 work too; 1 is the default and gives a regular (non-parallel) build. If you try it, please let me know how it goes (build times, any bugs, etc.).
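Concretely, assuming a standalone crate root called main.rs (a placeholder name), the invocations look like this:

```shell
# Compile with four codegen units so the LLVM passes run in parallel.
rustc -C codegen-units=4 main.rs

# Combine with optimisation, where the effect is largest.
rustc -O -C codegen-units=4 main.rs
```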

In a normal build we generate an LLVM module for each crate. LLVM then processes that module into an object file, which is eventually linked with other object files (from other crates or libs) to make the final program. Under parallel codegen, rustc creates multiple LLVM modules per crate (one per ‘codegen unit’); these modules are processed in parallel to produce multiple object files. These object files are then linked together to produce a single object file for the whole crate, which can be further linked as normal.

There is some overhead to this - the additional linking step takes some time and due to inlining, LLVM can end up duplicating some work across modules. The generated code is also slightly lower quality - since the LLVM module is smaller, there is less opportunity for LLVM to optimise. Thus, parallel codegen is not always a win on compile times, and is not suitable for release builds. (I don’t actually have measurements for the code quality, I would like to measure performance on some benchmarks of parallel codegen’ed code and quantify how much of an impact there is).

It is hard to say when exactly parallel codegen is useful. As guidelines, it is more effective to use parallel codegen in the following circumstances:

  • optimised builds - because LLVM is doing more work
  • large crates - the overhead is more likely to be outweighed by the benefit
  • where there are low levels of parallelism in the overall build - because there is more ‘spare parallelism’ to take advantage of

This implies relatively monolithic projects are a good target. I.e., where there is one large crate which is compiled more or less alone, e.g., the librustc crate in Rust (although much less so now than a year ago), or the script crate in Servo.

There is also some mystery - some projects just seem to get more benefit than others. Presumably this has something to do with inlining, but I don’t have anything more precise.

Where parallel codegen does help, the optimum number of codegen units is nearly always 4. Occasionally 2 is better, but that is rare. I’ve never seen >4 give an advantage. (Measured across a bunch of different crates and a few different machines, more data would be nice here).

See this reddit thread for some user reports, and see the previous discussion from the last time we thought parallel codegen was ready.

Next steps

The next step is to make better use of parallel codegen. That means making it more accessible, and perhaps using it by default in some situations. However, there are some trade-offs: parallel codegen works much better on optimised builds than debug builds, but the generated code is not good enough for release builds. It is not always a win for compile times, and can sometimes make things worse. Furthermore, in order to make a good guess about whether it will be a win or not, you generally need information about the whole program being compiled, not just the current crate (which is all rustc knows about).

Proposals

rustc

I propose no changes to rustc. I.e., codegen_units will never be used unless explicitly requested.

Cargo

I propose adding the codegen_units flag to cargo build. This should improve visibility of the facility and encourage people to use it. However, since this can already be done using the rustc flag, it might not be worth the flag clutter. The other downside is that people might want to tailor the number of codegen units per crate, rather than for the whole project. That suggests some additions to the manifest. Suggestions welcome! My worry is that this will make things too complicated for most users to bother with.

Long term, Cargo seems like the place we should set codegen_units by default. It has the information it needs - the deps, and thus which crates build in parallel, and it can easily find the size on disk of crates, which is a good heuristic for the size of the code. It could then use this info (and the kind of build) to supply a number of codegen units to rustc for each crate. This seems optimal in terms of usability and close to optimal in terms of compile times. It's obviously a bit more work though.

Rust and Servo

We should make use of codegen_units in Rust and Servo. When doing non-release builds, we should use codegen_units=4 for appropriate crates - this will take a bit of measuring to get right, but shouldn’t be too hard. I think even using it for all crates will give some improvement. It should be possible to turn that off via a configure or mach option.

Alternatives

Use codegen_units=2 for all rustc debug builds. This seems viable - there will be some gains in some circumstances and some losses in others. But the losses should be relatively minor and the occasional gain will make this worthwhile. I would like more data before doing this though.

Introduce a new optimisation level for rustc (and maybe Cargo). We currently don’t really advertise -Copt-level=1. We could add a top-level flag that combines that with codegen_units=4, or possibly even a flag for -O -Ccodegen-units=4 or something similar. The intuition here is the case where you want an optimised, but not release-ready, build. Perhaps there should be a Cargo option (instead, or as well) which is not --release but does some optimisation while remaining suitable for debugging.

This might be a good idea as well as the proposals above, but it is a wide design space and I’m not really sure what would be best. Suggestions welcome!


#2

Turns out Cargo already has some facility for this via the manifest, but per-project, rather than per-crate: https://github.com/rust-lang/cargo/issues/1691
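For reference, that manifest knob lives in the profile sections of Cargo.toml, so it applies per profile across the whole project rather than per crate - something like:

```toml
# Cargo.toml: a per-project, per-profile setting (not per-crate).
[profile.dev]
codegen-units = 4

[profile.release]
codegen-units = 1   # keep release builds at full optimisation
```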


#3

I’ve personally been quite hesitant in the past to add extra flags to cargo build which end up just exporting all the flags of the compiler itself, so I’d be somewhat wary of doing this. It looks like you’ve found the codegen-units manifest option, but as @huonw points out it may not be the most appropriate location for this.

An alternative, I think, is to add a new .cargo/config key so the value can be set globally, per-machine (e.g. one for dev and one for release, perhaps).
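A sketch of what such a key might look like - the key names here are invented for illustration, and the exact shape would need to be designed:

```toml
# .cargo/config (hypothetical sketch): machine-wide defaults,
# split by build kind, overridable per project.
[build]
dev-codegen-units = 4
release-codegen-units = 1
```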

Unfortunately the custom flags passed to cargo rustc are only passed to the top crate in question, not transitively throughout the dependency graph.

I would personally still consider the compiler as perhaps the best place to set this default. Not everyone’s building with Cargo all the time, and if the “best option” happens by default within the compiler itself, it seems like it’d benefit more projects (e.g. rust itself). Technically the compiler also has the disk size of the crate it’s compiling, so unless information about the dependencies is used, I’m not sure Cargo has much more information than the compiler.


#4

This is a bit confusing to me. I mean, I understand what causes it, but I’m wondering whether you think that better heuristics would improve the situation? It seems like most C++ projects work on this principle and they do ok – I always thought of our “whole module optimization” as gravy, not a requirement. Am I wrong in that? It’d be nice to get some numbers here, too.


#5

I don’t think there is a ‘best option’ for compiling any given crate in isolation - it depends very much on whether other crates are being built at the same time. E.g., given a large crate, if it is like librustc and is a bottleneck in the build process, then it is worth using codegen_units=4. However, if the same crate is built in parallel with three other crates, then the optimal codegen_units is probably 1. Although rustc knows the size of the crate, it doesn’t know whether other builds are happening at the same time, or about the dependency tree, and I think those are the most important things to know when choosing how parallel to go.

It might also be worth choosing a default in rustc based purely on size (it would be much easier, for a start). Indeed it might be that the simple size heuristic is good enough to give good results.
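As a sketch of such a heuristic - the function name, threshold, and inputs here are invented for illustration, not measured:

```rust
// Hypothetical sketch of a codegen-units default chosen from crate
// size and build-level parallelism. Numbers are guesses, not data.
fn suggest_codegen_units(crate_size_bytes: u64, crates_building_in_parallel: usize) -> u32 {
    // If several crates already build in parallel, the cores are busy;
    // extra codegen units would mostly add linking/inlining overhead.
    if crates_building_in_parallel >= 3 {
        return 1;
    }
    // For a large, mostly-lone crate (like librustc or Servo's script
    // crate), 4 units is nearly always the sweet spot measured above.
    if crate_size_bytes > 1_000_000 {
        4
    } else {
        1
    }
}

fn main() {
    // A large bottleneck crate built alone gets parallel codegen...
    println!("{}", suggest_codegen_units(5_000_000, 0));
    // ...while a small crate in an already-busy build does not.
    println!("{}", suggest_codegen_units(100_000, 4));
}
```

The simple size-only version would just drop the first check; the point of the sketch is how little input the decision needs.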


#6

We certainly get better generated code by using the “whole module optimisation”; I’m not sure how required that is. I know that with C++ the difference between O3 and O3 + LTO is around 15% on benchmark perf (with Firefox), and I’m guessing we really don’t want to throw away that kind of perf increase. I’m not sure that carries across to Rust. As you say, we need to measure.

I don’t think we will get much better generated code with better heuristics. Or rather, my intuition is that we can, but the improvements will be small relative to the amount of implementation effort.


#7

OK, that all sounds about right to me. This reinforces for me the idea that the current “debug/release” breakdown is too simplistic, and we probably want another level that is more for “day-to-day” execution:

  • includes debug info and debug assertions;
  • does optimization, but:
    • uses parallel codegen if that seems useful
    • perhaps avoids some optimization that mess up debugability?

(Presumably the ability to control debug info and assertions is something people might want to toggle independently as well.)
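Expressed as a Cargo profile, that middle level could look something like this - the profile name and the pairing of these exact values are invented for illustration:

```toml
# Hypothetical "day-to-day" profile sketch for Cargo.toml.
[profile.day-to-day]
opt-level = 1           # some optimisation, but not release-level
debug = true            # keep debug info
debug-assertions = true # keep debug assertions
codegen-units = 4       # parallel codegen where it seems useful
```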


#8

So…

  • Release: gotta go fast, give it the beans, must outperform C/C++/Java/Go/Nim on microbenchmarks, compile time is no object, look the in-laws are coming over and I want to make a good impression.
  • Regular: optimise but not if it takes too long, don’t strip debug info, whatever you feel like, I just want an executable why are you even asking me, just do the thing that makes it run idk.
  • Debug: oh no everything is on fire, we’re all going to die, how did you even set fire to sheet metal, you need to tell me exactly what you did and in how many states it’s illegal.

#9

The tradeoff here is a spectrum: at one end, a simple-to-use and easy-to-comprehend debug/release “mode” that offers very limited control; at the other, very precise control over every aspect of the compiler via command line flags, which makes for an unwieldy, big, and complicated interface.

I’d suggest a middle ground: provide support for “profiles” - basically a configuration file in a known and well documented format. Then the user can simply specify one of the bundled default profiles by name (which is equivalent to what we have today) or provide a profile definition file where they can fine-tune all aspects of the compiler.

If we look at other toolchains, DMD (the official D compiler) accepts a configuration file in addition to command line arguments for similar purposes.