Fellow Rustaceans, I bring good news! Today’s nightly may be able to improve
compile times in release mode for your project by 2x or more. If you’d like
to read no further, please execute the following:
$ cargo +nightly build
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release
and report back with (a) whether compile times got faster or slower in either
mode and (b) whether your project’s runtime in release mode got slower. Even if
the answers are "faster" and "no", we’d still love to know!
In this chapter of the war against Rust’s long compile times we’re going to
take a look at LLVM optimizations and code generation. Although it’s not true
for every Rust project, more often than not I’ve seen the costliest part of a
compilation be LLVM’s optimization passes and/or generation of machine code. In
debug mode the code generation is often quite expensive, and in release mode
the optimizations tend to dwarf the code generation.
As a bit of a recap of the current state of the compiler: today rustc will by
default create one codegen unit which is fed to LLVM for optimization and
code generation. This basically means that your entire crate is compiled into
one object file which is eventually fed to the linker. Due to a variety of
factors (monomorphization of generics being a big one) rustc tends to generate
quite large single codegen units, which naturally take quite some time in LLVM!
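You can see the single-unit model directly by asking rustc for object file
output: with the default of one codegen unit you get exactly one object file
back. A rough illustration, with a hypothetical lib.rs:
$ rustc --crate-type=lib --emit=obj lib.rs
$ ls *.o
lib.o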
These huge modules we’re sending to LLVM are one of the most fertile grounds
for optimization. There are various tricks you can employ as a crate author to
reduce the size of the codegen unit sent to LLVM (one is sketched below), but
we’ve also been investigating ways of optimizing codegen units as they are
today. One of the most obvious optimizations to look at first is
parallelization! LLVM, however, is not internally parallelized during these
phases, which means that while you’re waiting for one large codegen unit to be
optimized and/or translated to machine code you’re burning just one CPU on what
is likely an otherwise-idle multi-core system.
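As an aside, one well-known trick for shrinking what’s sent to LLVM (used in a
few places in the standard library) is to move the bulk of a generic function
into a non-generic inner function, so the interesting code is translated once
rather than once per instantiation. A minimal sketch with hypothetical names:

pub fn process<P: AsRef<std::path::Path>>(path: P) -> usize {
    // Only this thin shim is monomorphized for each caller's type...
    inner(path.as_ref())
}

// ...while the real work is codegen'd exactly once.
fn inner(path: &std::path::Path) -> usize {
    path.to_string_lossy().len()
}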
It turns out that for quite some time we’ve actually had the ability to
run optimization passes and code generation in parallel. You can see this in
action today by configuring your project’s Cargo.toml:
[profile.release]
codegen-units = 32
This instructs rustc to split the one large codegen unit into 32 smaller
portions. The codegen units are then submitted to LLVM in parallel, which keeps
all of your cores busy with work while compiling a crate.
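If you’d rather experiment without editing Cargo.toml, the same knob is also
exposed as a codegen flag, mirroring the commands at the top of this post:
$ RUSTFLAGS='-C codegen-units=32' cargo build --release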
This sounds great, but you may be wondering why we haven’t turned this on by
default! Unfortunately it turns out that multiple codegen units don’t come
without their downsides.
One of the first problems we ran into with multiple codegen units was managing
all this parallelism across multiple instances of rustc. For example, if Cargo
spawns 10 rustc processes and each of those processes spawns 30 separate
threads, that can quickly overload a system! To solve this we implemented a
jobserver in Cargo, modeled after the jobserver in GNU make. When integrated
into rustc, this effectively enables a global rate limit on Rust compilations,
ensuring that cargo build -j10 indeed never has more than 10 units of work in
flight at any one point in time.
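If you’re curious what this looks like, the protocol is also available as a
standalone jobserver crate on crates.io. A minimal sketch of the token idea,
assuming that crate as a dependency (this is illustrative, not rustc’s code):

use jobserver::Client;
use std::thread;

fn main() -> std::io::Result<()> {
    // One token per unit of work we want running at a time.
    let client = Client::new(10)?;

    let mut handles = Vec::new();
    for i in 0..30 {
        let client = client.clone();
        handles.push(thread::spawn(move || {
            // Blocks until a token is free; the token returns to the
            // pool when `_token` is dropped at the end of this closure.
            let _token = client.acquire().unwrap();
            println!("codegen unit {} running", i);
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    Ok(())
}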
The next problem we ran into was that rustc was consuming excessive amounts of
memory! The problem here was that we would build up all codegen units in memory
and keep them all there before submitting them to LLVM. Worse still, while LLVM
worked on some codegen units all the others were just sitting idle in memory.
To solve this problem we implemented "async translation", which enables rustc
to optimize and generate code for some codegen units in parallel with
translating others to LLVM. Better still, rustc pauses translation when enough
codegen units are already in flight. Overall, this helped keep memory under
control when lots of codegen units came into the picture.
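That "pause when enough are ready" behavior is essentially backpressure. A
loose sketch of the idea (emphatically not rustc’s actual code) using a bounded
channel, where the producer stands in for translation and the consumer for
LLVM:

use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // At most 4 translated codegen units may be queued at once; a full
    // queue blocks the producer, capping peak memory use.
    let (tx, rx) = sync_channel::<String>(4);

    let translation = thread::spawn(move || {
        for i in 0..32 {
            tx.send(format!("codegen unit {}", i)).unwrap();
        }
    });

    // Stand-in for LLVM optimizing units as they arrive.
    for unit in rx {
        println!("optimizing {}", unit);
    }

    translation.join().unwrap();
}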
The final problem we’ve now faced (modulo a huge slew of other assorted bugs)
is that multiple codegen units hurt runtime performance! Rust has almost always
compiled code in one codegen unit by default, and a large motivating factor for
this is that it enables more inlining and optimization opportunities in LLVM.
When we split one codegen unit into many, LLVM loses those optimization
opportunities. Sure enough, in practice this causes huge regressions, as
optimization in Rust relies so heavily on inlining.
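A concrete way to see the inlining problem: each codegen unit is optimized by
LLVM in isolation, so a call to a function that lives in another unit is an
opaque call unless rustc has been asked to make the body available everywhere.
A tiny example:

// Generic functions and #[inline] functions can still be inlined across
// codegen units, because their bodies are instantiated in each unit that
// uses them.
#[inline]
pub fn double(x: u32) -> u32 {
    x * 2
}

// A plain non-generic function like this one can only be inlined by
// callers that happen to land in the same codegen unit.
pub fn triple(x: u32) -> u32 {
    x * 3
}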
We’re hoping to solve this runtime performance problem with an implementation
of ThinLTO. Announced some time ago, ThinLTO is a novel implementation of LTO
in LLVM which is designed for incremental compilation and parallelization.
Glossing over a bunch of details, it basically means that you can get the
runtime benefits of a full LTO build without paying the significant
compile-time cost that comes with today’s LTO.
The primary way we’re thinking of leveraging ThinLTO right now is to win back
the performance lost to multiple codegen units when compiling one crate. This
means that when you compile a crate with multiple codegen units in release mode
you’ll initially lose a lot of inlining and optimization opportunities, but
with ThinLTO we should be able to regain those losses while still retaining
the benefits of parallel codegen and optimization.
Now that’s quite a lot of background! Long story short, we’re hoping to
improve compile times in rustc by ensuring that any work in LLVM uses all the
cores available on a machine. This means that what is often the longest part of
a compilation, LLVM, should now be parallelized and scale better with the size
of the hardware it’s running on.
Parallel code generation was already enabled by default for debug mode and is
continuing to receive performance improvements. Note that the loss of inlining
opportunities doesn’t matter too much in debug mode!
The next step is to enable parallel code generation by default in release mode,
but we’re hesitant to do this until we’re sure that we won’t cause 10x
regressions throughout the ecosystem! We’re hoping that ThinLTO can help us
solve this problem, but we need your help to test it out and find bugs to ensure
it’s production ready!
The implementation of ThinLTO just hit nightly and likely has a bug or two in
it, so it’d be greatly appreciated if you could test it out on your local
project and see how compile times and runtimes are affected. With your feedback
we should soon be able to turn this on by default and ensure we all get the
benefits of faster compiles!