tl;dr: I think we should turn up `codegen-units` to the number of cores by default to get 2-3x faster builds with virtually no runtime impact.
It’s been quite a while now since `-C codegen-units` started working again, and we’ve had very few reports related to it, so I think it’s time to discuss turning it on by default. As a recap, the compiler currently builds up one giant LLVM module for each crate, optimizes it, and then generates an object file from that LLVM module. The `codegen-units` option indicates that this giant module should be split up into N units, allowing each unit to be optimized and codegen’d in parallel.
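To make that concrete, here’s a tiny toy sketch (not the compiler’s actual code) of the model: the crate’s items are split into N codegen units, and each unit is optimized and codegen’d on its own thread:

```rust
use std::thread;

// Toy stand-in for running the LLVM module/function/codegen passes on one unit.
fn optimize_and_codegen(unit: Vec<String>) {
    for item in unit {
        println!("optimizing {}", item);
    }
}

fn main() {
    // Pretend these are the translated items of a crate.
    let items: Vec<String> = (0..100).map(|i| format!("fn_{}", i)).collect();
    let codegen_units = 4;

    // Split the items into N codegen units...
    let chunk_size = (items.len() + codegen_units - 1) / codegen_units;
    // ...and process each unit on its own thread.
    let handles: Vec<_> = items
        .chunks(chunk_size)
        .map(|unit| {
            let unit = unit.to_vec();
            thread::spawn(move || optimize_and_codegen(unit))
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```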
Builds can typically be 2-3x faster with `-C codegen-units` at an appropriate value. The speedup is not linear because the rest of the compiler still takes a good chunk of time and runs sequentially. Additionally, one of the longest stages of the compiler, translation, is still serial. Finally, creating lots of LLVM modules is pretty memory-heavy, and as I’ll show below it does degrade compile time if there are too many.
To help prepare for this transition, I’ve collected some data to help show why I think that parallel codegen should be enabled by default.
## Compile-time improvements
First, it should be mentioned that all numbers here are for optimized crates only. The win for parallel codegen isn’t very large for debug builds, since codegen is only a small portion of an unoptimized build.
This first table was collected by compiling the Cargo crate itself. The compiler used `-C opt-level=3` and the numbers were generated with the current stable compiler. All timings just used the Unix `time` utility and are in seconds on my machine. Each column is the number of codegen units used, and each row is a metric measured with that number of codegen units. Some terminology:
- build (all) - time it took for the whole build to finish
- build (llvm) - time taken for only the LLVM passes
- build (trans) - time taken for just the translation pass of the compiler
- foo avg - average time it took to do “foo” across each codegen unit
- foo std - standard deviation of the time it took each codegen unit to do “foo”
- module - the “llvm module passes” optimization pass
- function - the “llvm function passes” optimization pass
- codegen - the “llvm codegen passes” to generate an object file
| | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build (all) | 86.120 | 72.180 | 40.814 | 40.310 | 44.646 | 53.524 | 73.050 | 108.710 | 188.920 | 359.600 | 751.00 |
| build (llvm) | 71.681 | 56.437 | 23.563 | 20.529 | 19.820 | 20.809 | 24.246 | 28.933 | 44.858 | 83.161 | 204.102 |
| build (trans) | 7.519 | 8.717 | 10.316 | 12.810 | 17.825 | 25.659 | 41.613 | 72.358 | 136.275 | 267.906 | 536.293 |
| module avg | 55.01 | 36.53 | 13.69 | 10.13 | 9.07 | 7.55 | 6.00 | 6.64 | 11.03 | 18.67 | 36.98 |
| module std | 0.00 | 9.38 | 2.97 | 2.61 | 2.96 | 3.50 | 3.56 | 3.05 | 3.20 | 6.74 | 16.87 |
| codegen avg | 14.03 | 7.80 | 3.45 | 2.14 | 1.32 | 1.20 | 0.79 | 0.66 | 0.45 | 0.37 | 0.33 |
| codegen std | 0.00 | 0.47 | 0.59 | 0.43 | 0.42 | 0.40 | 0.52 | 0.92 | 0.45 | 0.54 | 0.52 |
| function avg | 1.69 | 1.71 | 0.84 | 0.73 | 1.44 | 2.06 | 3.25 | 5.62 | 10.75 | 22.95 | 50.69 |
| function std | 0.00 | 0.08 | 0.20 | 0.12 | 0.28 | 0.60 | 1.13 | 1.69 | 3.75 | 9.30 | 22.45 |
| max rss (MB) | 490 | 593 | 640 | 753 | 951 | 1379 | 2133 | 3526 | 6459 | 12323 | 24070 |
| size (MB) | 6.5 | 6.7 | 6.8 | 6.9 | 7.0 | 7.1 | 7.1 | 7.1 | 7.1 | 7.1 | 7.1 |
Given all this data, there are some interesting conclusions here!
- We do not benefit from increasing codegen units without bound. My machine has 8 cores, and there’s a sharp increase in build time when going far beyond that number.
- We get quite a nice speedup going up to the number of cores on the machine, but the parallelization wins aren’t quite a factor of N where N is the number of codegen units.
- More codegen units are pretty memory-intensive.
- Binary size is affected by codegen units, but not by a large amount.
- Increasing the number of codegen units takes a pretty heavy toll on translation time.
One reason lots of codegen units may be pretty inefficient is that the compiler currently spawns one thread per codegen unit. I tweaked the compiler to have a `-C codegen-threads` option and recompiled Cargo with a constant 8 codegen threads and N codegen units to get the following data (a rough sketch of this thread-capping scheme is shown below):
| | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build (all2) | 74.450 | 58.418 | 33.258 | 30.998 | 29.962 | 37.528 | 47.703 | 70.310 | 117.590 |
Note that these timings are a bit lower in general than the previous set of numbers, but I’d just chalk that up to this compiler being the current nightly, which is presumably faster than the current stable. Overall, though, it looks like we do get some nice wins by limiting the number of threads; for example, 16 codegen units no longer spikes up, but there is still a general upward trend in compile time with too many codegen units.
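Roughly speaking, the thread-capping experiment amounts to draining the codegen units with a fixed pool of worker threads instead of spawning one thread per unit. A hypothetical sketch (not the actual patch):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// A fixed pool of `threads` workers drains a shared queue of codegen units,
// rather than spawning one thread per unit.
fn run_units(units: Vec<Vec<String>>, threads: usize) {
    let queue = Arc::new(Mutex::new(units));
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // Pop the next codegen unit, if any remain.
                let unit = match queue.lock().unwrap().pop() {
                    Some(unit) => unit,
                    None => break,
                };
                for item in unit {
                    println!("codegen {}", item);
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
}

fn main() {
    // 32 codegen units, but only 8 threads actually working on them.
    let units: Vec<Vec<String>> = (0..32)
        .map(|u| (0..4).map(|i| format!("unit{}_fn{}", u, i)).collect())
        .collect();
    run_units(units, 8);
}
```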
Now another aspect of `codegen-units` is precisely how the compiler splits up a crate into separate LLVM modules. Currently the compiler will simply switch codegen units whenever it starts translating a new literal Rust module. This unfortunately means that the maximum number of codegen units a crate can benefit from is the number of Rust modules. In Cargo, for example, only 54 of the 1024 generated object files (for 1024 codegen units) had code in them; the remaining 970 were all blank!
I modified the compiler so that instead of only calling `.rotate()` when translating a Rust module, it rotates on each call to `trans_item`. This loosely means that each function goes into a new codegen unit (in a round-robin fashion), which should give us a nice distribution across codegen units and make sure we fill up all of the object files (a rough sketch of this partitioning is shown below). Let’s take a look at these compile numbers with the same hard limit of 8 codegen threads as above:
| | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build (all3) | 74.580 | 43.077 | 33.611 | 28.048 | 33.350 | 41.665 | 57.750 | 104.850 | 237.160 |
Here it looks like the benefit isn’t that great for Cargo. The upward trend is steeper, and it’s not clearly beneficial in the 8/16 case. Note that this may benefit smaller crates more (such as regex below which has only 8 Rust modules), and it may also help keep cores warm over time as codegen units are more evenly sized. I suspect some more investigation can happen here in terms of the best way to split up a crate for LLVM.
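Schematically, the partitioning change amounts to something like the following (a toy sketch, not the compiler’s real translation code):

```rust
// Round-robin assignment of translated items to codegen units, loosely what
// rotating on every `trans_item` call accomplishes: all units end up roughly
// the same size instead of following Rust module boundaries.
fn partition_round_robin(items: Vec<String>, codegen_units: usize) -> Vec<Vec<String>> {
    let mut units = vec![Vec::new(); codegen_units];
    for (i, item) in items.into_iter().enumerate() {
        units[i % codegen_units].push(item);
    }
    units
}

fn main() {
    let items: Vec<String> = (0..10).map(|i| format!("fn_{}", i)).collect();
    for (i, unit) in partition_round_robin(items, 4).iter().enumerate() {
        println!("unit {}: {:?}", i, unit);
    }
}
```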
That’s all just the Cargo crate, however, so let’s also take a look at compiling the regex crate:
| | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build regex | 6.437 | 4.786 | 4.129 | 4.126 | 5.186 | 6.449 | 8.800 | 14.258 | 24.061 | 45.858 | 98.680 |
This data basically confirms what we found with the Cargo crate, so nothing too exciting here!
## Runtime impact
Alright, so after talking about compile time, let’s talk about runtime performance. The main drawback of `codegen-units` is that by splitting up the LLVM module you’re in theory missing inlining opportunities, which can in turn decrease performance. As inlining is the “mother of all optimizations” (especially in Rust), let’s take a look at the numbers. But first, here’s what each row means:
- cargo foo - using the Cargo library generated above, a cargo binary was built and used to perform “cargo foo” in the Servo source tree as a no-op build. Quite a lot has to happen to determine that nothing needs to be built, so this is a relatively good benchmark of Cargo’s performance.
- regex foo1K - the regex crate’s benchmarks were compiled and the foo1K benchmarks (med1K and hard1K in the table) were run, with the ns/iter output reported here.
- *3 - these timings were measured with the “more round robin” approach where each item being translated caused the compiler to rotate codegen units.
| | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| cargo fetch | 0.311 | 0.328 | 0.318 | 0.330 | 0.325 | 0.318 | 0.327 | 0.320 | 0.319 | 0.312 | 0.317 |
| cargo build | 0.544 | 0.538 | 0.548 | 0.533 | 0.547 | 0.549 | 0.559 | 0.559 | 0.563 | 0.543 | 0.560 |
| cargo build3 | 0.512 | 0.518 | 0.552 | 0.546 | 0.546 | 0.530 | 0.534 | 0.537 | 0.551 | -- | -- |
| regex med1K | 2107 | 2108 | 2117 | 2140 | 2146 | 2116 | 2117 | 2139 | 2145 | 2119 | 2119 |
| regex hard1K | 32388 | 39505 | 53087 | 53168 | 52140 | 53326 | 53139 | 53150 | 53391 | 52151 | 53160 |
| regex3 med1K | 2105 | 2165 | 2164 | 2175 | 2164 | 2169 | 2117 | -- | -- | -- | -- |
| regex3 hard1K | 32426 | 38240 | 51240 | 52150 | 52355 | 52139 | 52051 | -- | -- | -- | -- |
Some interesting results from this!
- The performance of Cargo was entirely unaffected by codegen units; in all cases it performed the same no matter what.
- The regex benchmarks were also pretty flat, with the one exception that the "hard1K" benchmark took a dive in performance between 1 and 4 codegen units and then flattened off after that.
This sort of confirms our initial suspicion about runtime impact. Larger applications like Cargo probably don’t rely nearly as much on inlining as microbenchmarks like those in the regex crate do. One reason there isn’t a total cliff in performance, I believe, is that even with codegen units `#[inline]` functions are still inlined into each unit (I may be wrong on this though).
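For reference, `#[inline]` is just an attribute on the function definition; the intent is that a function marked this way has its body made available for inlining outside its own codegen unit (and crate):

```rust
// `#[inline]` is meant to make a function's body available outside the codegen
// unit it was assigned to, so it can still be inlined at call sites that ended
// up in a different unit.
#[inline]
fn square(x: u64) -> u64 {
    x * x
}

fn main() {
    // Even if `main` and `square` landed in different codegen units, `square`
    // remains a candidate for inlining here thanks to the attribute.
    println!("{}", square(12));
}
```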
## Changing the defaults
So given all this data, my conclusion would be that if we used a number of codegen units equal to the number of cores when optimizing, it would not regress runtime performance in many cases and would drastically improve compile times (roughly 2x faster in most cases). You’d of course always be able to tell the compiler to go back to 1 codegen unit to get the current behavior.
Concretely, I would propose changing the default value of `codegen-units` to the number of CPUs on the machine (in all builds, debug and optimized). At this time I wouldn’t propose any other changes (such as the ones I’ve made here). Some future possibilities would be:
- Adding a `-C codegen-threads` flag. Above it is clear that you don’t want an absurd number of codegen threads (you benefit from maxing out at the number of cores), but it wasn’t clear that having more codegen units than threads was a clear win.
- Tweaking how the LLVM modules are sharded. Above I experimented with a sort of "maximal even sharding", but there weren’t clearly many compile-time benefits to it. In theory inlining may be primarily important within a module (e.g. like a C++ file), so the current module-based splitting continues to seem like a reasonable decision.
- Figuring out why 1K codegen units is so very costly. In theory having that many codegen units should be a pretty low-impact operation, but it seems disproportionately expensive above. There may be obvious inefficiencies in how codegen units are managed which, once fixed, would address this.
- Adding “jobserver”-like functionality to the compiler and Cargo. Currently Cargo spawns many compiler processes in parallel, and if each of these compilers spawns N codegen threads (one per core) then that’s a bit too much contention. Make has a jobserver to manage this kind of parallelism, and the compiler/Cargo could do something similar to limit total parallelism (a rough sketch of such a token scheme is shown below). For now, though, most builds are waiting on one crate to compile, so this probably won’t be too beneficial to start out.
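As a rough illustration of that jobserver idea, here’s a hypothetical token-pool sketch where only as many jobs run concurrently as there are tokens:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // One token per core; a "job" must hold a token while it runs, much like
    // make's jobserver caps total parallelism across processes.
    let cores = 8;
    let (token_tx, token_rx) = mpsc::channel();
    for _ in 0..cores {
        token_tx.send(()).unwrap();
    }

    let mut handles = Vec::new();
    for job in 0..32 {
        // Acquire a token before starting this job (a crate's codegen, say).
        token_rx.recv().unwrap();
        let token_tx = token_tx.clone();
        handles.push(thread::spawn(move || {
            println!("job {} running", job);
            // Hand the token back so another job can start.
            token_tx.send(()).unwrap();
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
}
```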
## Conclusions
Bumping up the number of codegen units by default seems like a win-win situation to me, but what does everyone else think? Perhaps some more data should be gathered? Do you think these measurements aren’t representative? Have you seen other problems with codegen-units? Curious to hear your thoughts!