Fellow Rustaceans, I bring good news! Today’s nightly may be able to improve compile times in release mode for your project by 2x or more, and if you’d like to read no further, please execute the following:
$ cargo +nightly build
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release
and report back with (a) whether compile times got faster or slower in either mode and (b) whether your project’s runtime in release mode got slower. If the answers are yes and no, we’d still love to know!
In this chapter of the war against Rust’s long compile times we’re going to take a look at LLVM optimizations and code generation. Although it’s not true for every Rust project, more often than not I’ve seen the costliest part of a compilation be LLVM optimization and/or generation of machine code. In debug mode the code generation itself is often quite expensive, while in release mode the optimization passes tend to dwarf the code generation.
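If you’re curious where the time goes for your own crate, rustc can print per-pass timings via the unstable -Z time-passes flag (nightly only); the LLVM-related entries in its output are the ones in question here:
$ cargo +nightly rustc -- -Z time-passes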
As a bit of a recap of the current state of the compiler: today rustc will by default create one codegen unit which is fed to LLVM for optimization and code generation. This basically means that your entire crate is compiled into one object file, which is eventually handed to the linker. Due to a variety of factors rustc tends to generate quite a large single codegen unit, which then naturally takes quite some time in LLVM!
These huge modules we’re sending to LLVM are one of the most fertile grounds for optimization. There are various tricks you can employ as a crate author to reduce the size of the codegen unit sent to LLVM, but we’ve also been investigating ways of optimizing codegen units as they are today. One of the most obvious optimizations to look at first is parallelization! LLVM, however, is not internally parallelized during these phases, which means that while you’re waiting for the large codegen unit to be optimized and/or translated to machine code you’re burning just one CPU on what is likely an otherwise-idle multi-core system.
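As an aside, one such crate-author trick is keeping generic functions thin: every monomorphized copy of a generic function adds more code to the codegen unit, so delegating the real work to a non-generic inner function (a pattern the standard library itself uses) shrinks what LLVM has to chew through. A quick sketch, with hypothetical names:

use std::path::Path;

// The generic shim is monomorphized once per concrete `P`, but it’s tiny;
// the non-generic `inner` holding all the real work is compiled only once.
pub fn load_config<P: AsRef<Path>>(path: P) -> std::io::Result<String> {
    fn inner(path: &Path) -> std::io::Result<String> {
        std::fs::read_to_string(path)
    }
    inner(path.as_ref())
}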
It turns out that for quite some time we’ve actually had the ability to run optimization passes and code generation in parallel. You can see this in action today by configuring your project’s Cargo.toml:
[profile.release]
codegen-units = 32
This instructs rustc to split the one large codegen unit into 32 different portions. Each codegen unit is then submitted to LLVM in parallel, which ensures that we’re always using all your cores while compiling a crate. This sounds great, but you may be wondering why we haven’t turned it on by default! Unfortunately it turns out that multiple codegen units don’t come without their downsides.
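If you’d rather experiment without editing Cargo.toml, the same knob is exposed as a compiler flag, which is exactly what the RUSTFLAGS invocation at the top of this post uses:
$ RUSTFLAGS='-C codegen-units=32' cargo build --release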
One of the first problems we ran into with multiple codegen units was managing all this parallelization across multiple instances of rustc. For example, if Cargo spawns 10 rustc processes and each of those processes spawns 30 separate threads, that can quickly overload a system! To solve this we implemented a jobserver in Cargo, an implementation of the GNU make jobserver protocol. This, when integrated into rustc, effectively enabled a global rate limit for Rust compilations, ensuring that cargo build -j10 indeed never had more than 10 units of work in flight at any one point in time.
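Cargo’s implementation lives in the standalone jobserver crate, so you can play with the idea directly; here’s a minimal sketch (assuming the jobserver crate as a dependency) that caps 30 spawned jobs at 10 running concurrently:

use jobserver::Client;

fn main() -> std::io::Result<()> {
    // Cap the whole "build" at 10 concurrent jobs, mirroring -j10.
    let client = Client::new(10)?;
    let mut handles = Vec::new();
    for i in 0..30 {
        // Blocks until one of the 10 tokens is free.
        let token = client.acquire()?;
        handles.push(std::thread::spawn(move || {
            println!("working on job {}", i);
            // Dropping the token releases it, letting another job start.
            drop(token);
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    Ok(())
}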
The next problem we ran into was that rustc was consuming excessive amounts of memory! The problem here was that we would build up all codegen units in memory and keep them all there before we submitted them to LLVM. Worse still, while working on some codegen units all the others were just sitting idle in memory. To solve this problem we implemented “async translation”, which enabled rustc to run LLVM optimization and code generation in parallel with translation of Rust to LLVM IR. Better still, we’d pause translation whenever enough codegen units were already in flight. Overall, this helped keep memory under control when lots of codegen units came into the picture.
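That pausing is essentially backpressure through a bounded queue sitting between the translation thread and the LLVM workers. A simplified sketch of the pattern (not rustc’s actual internals) might look like:

use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // A bounded channel holding at most 4 pending codegen units: once it
    // fills up, `send` blocks, pausing "translation" until a worker
    // catches up and drains an entry.
    let (tx, rx) = sync_channel::<String>(4);
    let translator = thread::spawn(move || {
        for i in 0..16 {
            tx.send(format!("codegen unit #{}", i)).unwrap();
        }
    });
    // One consumer standing in for the pool of LLVM worker threads.
    let worker = thread::spawn(move || {
        while let Ok(unit) = rx.recv() {
            println!("optimizing {}", unit);
        }
    });
    translator.join().unwrap();
    worker.join().unwrap();
}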
The final problem we’ve now faced (modulo a huge slew of other assorted bugs) is that codegen units hurt runtime performance! Rust has almost always compiled code in one codegen unit by default, and a large motivating factor for this is that it enables more inlining and optimization opportunities in LLVM. When we split one codegen unit into many, LLVM loses those optimization opportunities. Sure enough, in practice this causes huge regressions, as optimization in Rust relies so heavily on inlining.
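To make that loss concrete: with many codegen units a small helper may land in a different unit than its callers, and LLVM can’t inline an opaque call into another unit. Today you can restore the opportunity by hand, as in this illustrative example:

// Without cross-unit information, a call to `square` from another codegen
// unit can’t be inlined. #[inline] makes the function’s IR available in
// every unit that uses it, restoring the inlining opportunity manually.
#[inline]
pub fn square(x: u64) -> u64 {
    x * x
}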
We’re hoping to solve this runtime performance problem with an implementation of ThinLTO. Announced some time ago, ThinLTO is a novel implementation of LTO in LLVM which is optimized for incremental compilation and parallelization. Glossing over a bunch of details, it basically means that you can get the runtime benefits of a full LTO build without paying the significant compile-time cost of today’s LTO.
The primary way we’re thinking of leveraging ThinLTO right now is to recover the performance lost to multiple codegen units when compiling one crate. This means that when you compile a crate with multiple codegen units in release mode you’ll initially lose a lot of inlining and optimization opportunities, but with ThinLTO we should be able to regain those losses while still retaining the benefits of parallel codegen and optimization.
Now that’s quite a lot of background! Long story short, we’re hoping to improve compile times in rustc by ensuring that any work in LLVM uses all the cores available on a machine. This means that what is often the longest part of a compilation, LLVM, should now be parallelized and scale better with the hardware it’s running on.
Parallel code generation was already enabled by default for debug mode and is continuing to receive performance improvements. Note that the loss of inlining opportunities doesn’t matter too much in debug mode!
The next step is to enable parallel code generation by default in release mode, but we’re hesitant to do so until we’re sure that we won’t cause 10x regressions throughout the ecosystem! We’re hoping that ThinLTO can help us solve this problem, but we need your help to test it out and find bugs to ensure it’s production-ready!
The implementation of ThinLTO just hit nightly and likely has a bug or two in it, so it’d be greatly appreciated if you could test it out on your local project and see how compile times and runtimes are affected. With your feedback we should soon be able to turn this on by default and ensure we all benefit from faster compiles!