Help us test incremental ThinLTO!


#1

TL;DR - The nightly compiler can do incremental ThinLTO now and we’d like to know how it affects (1) compilation time and (2) run-time performance of artifacts generated this way.

For a few weeks now, the nightly compiler has supported combining incremental re-compilation with ThinLTO. This should be good news for anyone who regularly re-compiles things that need good runtime performance even for testing, like benchmarks and soft real-time applications.

Incremental compilation has always supported release builds, but the resulting binaries were often 2-3 times slower than their non-incrementally built counterparts. The reason is that incremental compilation splits crates into many small compilation units that are optimized in isolation, so LLVM misses many inter-procedural optimizations that it can do for a regular build. ThinLTO was invented to alleviate exactly this problem: after each compilation unit is optimized, ThinLTO does an analysis pass that takes note of what things could be inlined into other compilation units. Using this information, it then does another pass, doing optimizations across compilation unit boundaries.

This is similar to what we already do for regular release builds, since ThinLTO is the default for those. However, incremental compilation partitions the crate in a more fine-grained way, and we’d like to know whether and how much this affects runtime performance (we hope that it’s not too much) and how much compilation time we can save by compiling incrementally. We have some idea about the latter from perf.rust-lang.org (compilation is 2-5 times as fast for small changes, depending on the project) but know very little about the former. This is where you come in:

How can I help?

It would be great to have two data points for as many real-world projects as possible:

  1. How does incremental compilation affect compile times for release builds?
  2. How do incrementally compiled programs perform at runtime, compared to non-incrementally built ones?

For that you need a project to test which either contains benchmarks or has some other way of measuring runtime performance. If your project uses cargo bench, you can do something like the following:

# Make sure we have the latest nightly
rustup update nightly

# Make sure all dependencies are downloaded and
# we start with a fresh `target` directory
cargo +nightly build && cargo +nightly clean

# Get the non-incremental baseline, collect the values in baseline.txt
# Note the build time here, it should be displayed as something like:
#   Finished release [optimized] target(s) in 5.84s
CARGO_INCREMENTAL=0 cargo +nightly bench | tee ./baseline.txt

# Make a small change in a program somewhere, something you'd likely do in
# between two benchmark runs.
#
# USER INTERACTION REQUIRED HERE

# Build again non-incrementally, in order to see how long re-compiling takes.
# You don't have to wait for the benchmarks to finish.
CARGO_INCREMENTAL=0 cargo +nightly bench


# Clear the target directory for good measure
cargo +nightly clean

# Build incremental now and run the benchmarks. The build time, again, should
# show up as something like:
#   Finished release [optimized] target(s) in 6.04s
CARGO_INCREMENTAL=1 cargo +nightly bench | tee ./incremental.txt

# Again, make a small change in a program somewhere, something you'd likely do in
# between two benchmark runs.
#
# USER INTERACTION REQUIRED HERE

# Now run `cargo bench` again and take note of the build time.
CARGO_INCREMENTAL=1 cargo +nightly bench

# If you don't have it yet, install cargo-benchcmp
cargo install cargo-benchcmp

# Print a comparison of the benchmark results
cargo benchcmp ./baseline.txt ./incremental.txt

If your project does not use cargo bench, you’ll have to adapt the above as needed. The most interesting questions are: Is building with CARGO_INCREMENTAL=1 faster than with CARGO_INCREMENTAL=0, and does the resulting program have roughly the same performance in both modes?
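If you'd rather not copy the build times by hand, something like the following could pull the number out of cargo's `Finished` line. This is only a rough sketch: `extract_build_time` is a made-up helper name, and it assumes the seconds-only format shown in the comments above (for longer builds cargo may print minutes as well, in which case this prints nothing).

```shell
# Hypothetical helper (not part of cargo): reads cargo output on stdin and
# prints the number of seconds from a line like
#   Finished release [optimized] target(s) in 5.84s
# Assumption: the time is given in seconds only; other formats are skipped.
extract_build_time() {
    sed -n 's/.*Finished .* in \([0-9.]*\)s.*/\1/p'
}
```

Cargo writes its status lines to stderr, so a possible use would be `CARGO_INCREMENTAL=1 cargo +nightly bench --no-run 2>&1 | extract_build_time`, where `--no-run` builds the benchmarks without executing them.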

Looking forward to seeing results :)


#2

@anp Is there some way to get a special one-off run of lolbench to get all this data in a systematic way?


#3

All of lolbench can be run locally and I think this could be hacked together in that environment, but setting compiler flags and other variables like that isn’t supported in the infrastructure yet. I’m pretty busy atm but could provide some direction if someone wanted to either build support for this into lolbench or if they wanted to run the suite locally and collect those results.