Help test out ThinLTO!


#1

Fellow Rustaceans I bring good news! Today’s nightly may be able to improve compile times in release mode for your project by 2x or more, and if you’d like to read no further, please execute the following:

$ cargo +nightly build
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release

and report back with (a) did compile times get faster/slower in either mode and (b) did your project’s runtime in release mode get slower. If the answers are yes and no, then we’d still love to know!


In this chapter of the war against Rust’s long compile times we’re going to take a look at LLVM optimizations and code generation. Although not true for every Rust project, more often than not I’ve seen the costliest part of a compilation be LLVM optimizations and/or generation of machine code. In debug mode code generation is often quite expensive and in release mode the optimizations tend to dwarf the code generation.

As a bit of a recap of the current state of the compiler, today rustc will by default create one codegen unit which is fed to LLVM for optimizations and code generation. This basically means that your entire crate is compiled to one object file which is then eventually fed to the linker at a later date. Due to a variety of factors rustc tends to generate quite large single codegen units which then naturally take quite some time in LLVM!

These huge modules we’re sending to LLVM are one of the most fertile grounds for optimization. There are various tricks you can employ as a crate author to reduce the size of the codegen unit sent to LLVM, but we’ve also been investigating ways of optimizing codegen units as they are today. One of the most obvious optimizations to look at first is parallelization! LLVM is, however, not parallelized internally during these phases, which means that while you’re waiting for the large codegen unit to be optimized and/or translated to machine code you’re just burning one CPU on what is likely an otherwise-idle multi-core system.

It turns out that for quite some time we’ve actually had the ability to run optimization passes and code generation in parallel. You can see this in action today by configuring your project’s Cargo.toml:

[profile.release]
codegen-units = 32

This instructs rustc to split the one large codegen unit into 32 different portions. Each codegen unit is then submitted to LLVM in parallel which ensures that we’re always using all your cores for doing work while compiling a crate. This sounds great, but you may be wondering why we haven’t turned this on by default! Unfortunately it turns out that multiple codegen units don’t come without their downsides.

One of the first problems we ran into with multiple codegen units was managing all this parallelization across multiple instances of rustc. For example if Cargo spawns 10 rustc processes and then each of those rustc processes spawns 30 separate threads that can quickly overload a system! To solve this we implemented a jobserver in Cargo, an implementation of a GNU jobserver. This, when integrated into rustc, effectively enabled a global rate limit for Rust compilations, ensuring that cargo build -j10 indeed never had more than 10 units of work at any one point in time.
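To make the rate-limiting idea concrete, here is a minimal single-process sketch in Python. It is not how Cargo's jobserver is implemented (a real jobserver shares its tokens across separate processes); a `threading.BoundedSemaphore` simply stands in for the shared token pool, and the function name and the sleep placeholder are assumptions made for illustration.

```python
import threading
import time


def run_with_token_limit(num_processes, threads_per_process, max_tokens):
    """Run num_processes * threads_per_process units of work under one token pool."""
    # Hypothetical stand-in for the jobserver: one pool of tokens shared by everyone.
    tokens = threading.BoundedSemaphore(max_tokens)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "done": 0}

    def unit_of_work():
        with tokens:  # take a token before working, give it back afterwards
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.001)  # placeholder for optimizing one codegen unit
            with lock:
                state["running"] -= 1
                state["done"] += 1

    workers = [
        threading.Thread(target=unit_of_work)
        for _ in range(num_processes * threads_per_process)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return state["peak"], state["done"]


# 10 "rustc processes" with 30 threads each, limited as with `cargo build -j10`.
peak, done = run_with_token_limit(num_processes=10, threads_per_process=30, max_tokens=10)
print(f"peak concurrency: {peak}, units completed: {done}")
```

All 300 units still complete, but the recorded peak never exceeds the 10 available tokens, which is the property the jobserver is meant to guarantee.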

The next problem we ran into was that rustc was consuming excessive amounts of memory! The problem here was that we would build up all codegen units in memory and keep them all there before we submitted them to LLVM. Worse still, while working on some codegen units all the others were just sitting idle in memory. To solve this problem we implemented "async translation" which enabled rustc to optimize/code generate code in parallel with translation to LLVM. Better still we’d pause translation of code to LLVM when enough codegen units were ready. Overall, this helped keep memory under control when lots of codegen units came into the picture.
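The pause-when-enough-are-ready behaviour is essentially a bounded producer/consumer queue. The sketch below illustrates that pattern in Python; it is an analogy rather than rustc's actual code, and `pipeline`, `max_ready`, and the placeholder "work" are hypothetical names chosen for the example.

```python
import queue
import threading


def pipeline(num_units, max_ready, num_workers):
    """Translate units on the calling thread while workers consume them in parallel."""
    ready = queue.Queue(maxsize=max_ready)  # put() blocks once max_ready units are waiting
    finished = []
    finished_lock = threading.Lock()
    peak_ready = 0

    def worker():
        while True:
            unit = ready.get()
            if unit is None:  # sentinel: translation is complete
                return
            with finished_lock:
                finished.append(unit)  # placeholder for optimization + code generation

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for unit in range(num_units):
        ready.put(unit)  # "translation" pauses here while the queue is full
        peak_ready = max(peak_ready, ready.qsize())
    for _ in workers:
        ready.put(None)
    for w in workers:
        w.join()
    return sorted(finished), peak_ready


units, peak_ready = pipeline(num_units=32, max_ready=4, num_workers=4)
print(f"processed {len(units)} units, at most {peak_ready} waiting at once")
```

Because the queue is bounded, the number of translated-but-unprocessed units held in memory is capped at `max_ready`, regardless of how many units the crate is split into.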

The final problem we’ve now faced (modulo a huge slew of other assorted bugs) is that multiple codegen units hurt runtime performance! Rust has almost always compiled code in one codegen unit by default, and a large motivating factor for this is that it enables more inlining and optimization opportunities in LLVM. When we split one codegen unit into many, LLVM loses optimization opportunities. Sure enough, in practice this causes huge regressions as optimization in Rust relies so heavily on inlining.

We’re hoping to solve this runtime performance problem with an implementation of ThinLTO. Announced some time ago, ThinLTO is a novel implementation of LTO in LLVM which optimizes for incremental compilation and parallelization. Glossing over a bunch of details, it basically means that you can get the runtime benefits of a full LTO build without paying the significant cost that comes with today’s LTO.

The primary way we’re thinking of leveraging ThinLTO right now is to regain the performance loss of multiple codegen units when compiling one crate. This means that when you compile a crate with multiple codegen units in release mode you’ll initially lose a lot of inlining and optimization opportunities, but with ThinLTO we should be able to regain all of those losses while still retaining the parallel codegen and optimization benefits.


Now that’s quite a lot of background! Long story short is that we’re hoping to improve compile times in rustc by ensuring that any work in LLVM uses all the cores available on a machine. This means that what is often the longest part of a compilation, LLVM, should now be parallelized and scale better with the size of the hardware it’s running on.

Parallel code generation was enabled by default for debug mode and is continuing to receive perf improvements. Note that the loss of inlining opportunities doesn’t matter too much in debug mode!

The next step is to enable parallel code generation by default in release mode, but we’re hesitant to do this until we’re sure that we won’t cause 10x regressions throughout the ecosystem! We’re hoping that ThinLTO can help us solve this problem, but we need your help to test it out and find bugs to ensure it’s production ready!

The implementation of ThinLTO just hit nightly and likely has a bug or two in it, so it’d be greatly appreciated if you could test out your local project and see how compile times and runtimes are affected. With your feedback we should soon be able to turn this on by default and ensure we all get benefits of faster compiles by default!


#2

I tried this on Serde’s json benchmarks. Compile time was about twice as fast, and runtime ranged from 12% faster to 16% slower. Nicely done!

Before

                                DOM                STRUCT
======= serde_json ======= parse|stringify === parse|stringify ===
data/canada.json          11.0ms    11.4ms     4.3ms     7.0ms
data/citm_catalog.json     5.5ms     1.4ms     2.1ms     0.7ms
data/twitter.json          2.5ms     0.6ms     1.2ms     0.6ms

After

                                DOM                STRUCT
======= serde_json ======= parse|stringify === parse|stringify ===
data/canada.json           9.7ms    11.0ms     4.0ms     7.7ms
data/citm_catalog.json     6.0ms     1.4ms     2.0ms     0.8ms
data/twitter.json          2.9ms     0.7ms     1.2ms     0.6ms


#3

So I tried my rust-doom repo.

TL;DR thinlto regresses build times by 32% (compared to 1cgu), with no difference in runtime :frowning:

(user CPU seconds)  debug 1cgu   debug 16cgu   release 1cgu   release 16cgu   release 16cgu + thinlto
build #1                332.07        303.45         645.45          585.93                    874.60
build #2                329.96        302.81         654.78          592.34                    850.25
runtime #1                9.28          9.12           1.03            1.24                      1.06
runtime #2                9.33          9.15           1.13            1.21                      1.12

So adding -C codegen-units=16 speeds up the build by ~10% but increases run time by ~13%.

Adding thinlto brings back the runtime to the same speed, but the builds actually take longer by 32% than not using codegen units at all (and by 46% compared to 16 CGU and no thinlto).

Ages ago (well, two years) Niko asked me to create a branch with zero dependencies, ripping out graphics and all that, to use as a rustc benchmark. That code has rotted (and has been broken by (stable!) rustc upgrades), but I could do the same thing again with the newest version of the repo if that would be useful.

Raw data:

git clone https://github.com/cristicbz/rust-doom
git checkout 51e88df

nightly: rustc 1.22.0-nightly (150b625a0 2017-10-08)

build:
  debug:
    cargo +nightly build  332.07s user 14.08s system 488% cpu 1:10.81 total
    cargo +nightly build  329.96s user 14.31s system 480% cpu 1:11.58 total

  debug+cgu:
    RUSTFLAGS='-C codegen-units=16' cargo +nightly build  303.45s user 12.15s system 479% cpu 1:05.86 total
    RUSTFLAGS='-C codegen-units=16' cargo +nightly build  302.81s user 12.19s system 487% cpu 1:04.61 total

  release:
    cargo +nightly build --release  645.45s user 8.63s system 347% cpu 3:08.11 total
    cargo +nightly build --release  654.78s user 8.48s system 364% cpu 3:01.90 total

  release+cgu:
    RUSTFLAGS='-C codegen-units=16' cargo +nightly build --release  585.93s user 8.90s system 543% cpu 1:49.39 total
    RUSTFLAGS='-C codegen-units=16' cargo +nightly build --release  592.34s user 9.07s system 532% cpu 1:52.95 total

  release+thinlto:
    RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release  874.60s user 10.57s system 571% cpu 2:34.96 total
    RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release  850.25s user 9.64s system 576% cpu 2:29.06 total

runtime:
  debug:
    target/debug/rs_doom -i ~/data/DOOM2.WAD --check  9.28s user 0.14s system 96% cpu 9.710 total
    target/debug/rs_doom -i ~/data/DOOM2.WAD --check  9.33s user 0.06s system 99% cpu 9.406 total

  debug+cgu:
    target/debug/rs_doom -i ~/data/DOOM2.WAD --check  9.12s user 0.07s system 99% cpu 9.213 total
    target/debug/rs_doom -i ~/data/DOOM2.WAD --check  9.15s user 0.06s system 99% cpu 9.266 total

  release:
    target/release/rs_doom -i ~/data/DOOM2.WAD --check  1.03s user 0.07s system 85% cpu 1.277 total
    target/release/rs_doom -i ~/data/DOOM2.WAD --check  1.13s user 0.06s system 82% cpu 1.445 total

  release+cgu:
    target/release/rs_doom -i ~/data/DOOM2.WAD --check  1.24s user 0.06s system 84% cpu 1.535 total
    target/release/rs_doom -i ~/data/DOOM2.WAD --check  1.21s user 0.08s system 84% cpu 1.532 total

  release+thinlto:
    target/release/rs_doom -i ~/data/DOOM2.WAD --check  1.06s user 0.05s system 86% cpu 1.282 total
    target/release/rs_doom -i ~/data/DOOM2.WAD --check  1.12s user 0.06s system 87% cpu 1.348 total
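
The build percentages quoted above can be re-derived from the raw release-build numbers. The short Python sketch below does that, and also computes the same comparison on the `total` column, assuming that column is elapsed wall-clock time (as in zsh's `time` output). The helper names are illustrative only.

```python
def to_seconds(clock):
    """Convert 'M:SS.ss' (the 'total' column of the time output) to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def mean(values):
    return sum(values) / len(values)


def pct_change(baseline, new):
    """Positive means 'new' is larger (slower) than 'baseline'."""
    return (new - baseline) / baseline * 100


# Release builds, copied from the raw data above: (user seconds, wall-clock total).
builds = {
    "1cgu": [(645.45, "3:08.11"), (654.78, "3:01.90")],
    "16cgu": [(585.93, "1:49.39"), (592.34, "1:52.95")],
    "16cgu+thinlto": [(874.60, "2:34.96"), (850.25, "2:29.06")],
}

user = {name: mean([u for u, _ in runs]) for name, runs in builds.items()}
wall = {name: mean([to_seconds(t) for _, t in runs]) for name, runs in builds.items()}

for name in ("16cgu", "16cgu+thinlto"):
    print(
        f"{name:14s} user {pct_change(user['1cgu'], user[name]):+6.1f}%"
        f"   wall {pct_change(wall['1cgu'], wall[name]):+6.1f}%"
    )
```

On these numbers the user CPU time with ThinLTO is roughly 33% higher than the 1 CGU baseline, matching the figure quoted above, while the wall-clock total is roughly 18% lower. Which of the two counts as "build time" depends on whether you care about total CPU work or time spent waiting.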

#4

Excited to see the improvement, I ran it against a few things I had lying around.

Rayon

  • Before: 7.27 secs <-- FASTEST
  • ThinLTO (cgu=4): 8.13 secs
  • ThinLTO (cgu=8): 8.49 secs
  • ThinLTO (cgu=16): 8.18 secs

WebRender

  • Before: 301.85 secs
  • ThinLTO (cgu=4): 490.41 secs
  • ThinLTO (cgu=8): 285.80 secs <-- FASTEST
  • ThinLTO (cgu=16): 291.6 secs

A random personal project

  • Before: 15.62 secs <-- FASTEST
  • ThinLTO (cgu=4): 19.19 secs
  • ThinLTO (cgu=8): 18.77 secs
  • ThinLTO (cgu=16): 20.24 secs

#5

Tested crates.io:

cargo clean && RUSTFLAGS='-C codegen-units=16 -Zthinlto' cargo +nightly build --release: 175.77s

touch src/lib.rs && RUSTFLAGS='-C codegen-units=16 -Zthinlto' cargo +nightly build --release: 52.16 seconds

cargo clean && cargo +nightly build --release: 237.59s

touch src/lib.rs && cargo +nightly build --release: 119.96 seconds

So a full build including dependencies got about 1.35x faster (26% less time), and just rebuilding the main crate got about 2.3x faster (57% less time). Nice! (No benchmarks available on this one)


#6

Please note that stable is compiled with assertions disabled, which results in a very significant speedup. It’s better to compare builds using the current nightly, with and without ThinLTO.


#7

With thinlto

$ cargo clean && RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release --features simd
   ...
   Compiling base100 v0.3.0 (file:///home/adam/Programs/base100)
    Finished release [optimized] target(s) in 67.64 secs

Without thinlto

$ cargo clean && RUSTFLAGS='-C codegen-units=16' cargo +nightly build --release --features simd
   ...
   Compiling base100 v0.3.0 (file:///home/adam/Programs/base100)
    Finished release [optimized] target(s) in 81.79 secs

Without codegen-units or thinlto

$ cargo clean && cargo +nightly build --release --features simd
   ...
   Compiling base100 v0.3.0 (file:///home/adam/Programs/base100)
    Finished release [optimized] target(s) in 150.11 secs

I’m seeing a very significant improvement even for a single-file project (mostly from the clap dependency if I’m not mistaken). Looks promising. Good work, guys!


#8

Not enabled on windows?

>rustup update nightly
info: syncing channel updates for ‘nightly-x86_64-pc-windows-msvc’

  nightly-x86_64-pc-windows-msvc unchanged - rustc 1.22.0-nightly (150b625a0 2017-10-08)

>rustc --version
rustc 1.22.0-nightly (150b625a0 2017-10-08)

>cargo clean && set RUSTFLAGS="-C codegen-units=16 -Z thinlto" && cargo +nightly build --release
error: failed to run rustc to learn about target-specific information

Caused by:
  process didn’t exit successfully: rustc - --crate-name ___ --print=file-names "-C codegen-units=16 -Z thinlto" --target x86_64-pc-windows-msvc --crate-type bin --crate-type rlib (exit code: 101)
--- stderr
error: unknown debugging option: thinlto"


#9

Riiight, the instructions make it a little unclear since they don’t say run cargo +nightly build --release so I took that to mean I should compare the debug and release versions between stable and nightly rather than the “no thinlto” vs “with thinlto” within the newest nightly. Running stuff again now, will report back.


#10

Thanks so much for the data everyone! I should clarify what’s probably the most beneficial for timing though, sounds like there may be some confusion!

I’m particularly interested in three numbers:

  • First, debug compile times
$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build
# vs ...
$ RUSTFLAGS='-C codegen-units=16' cargo +nightly build
  • Next, release compile times with ThinLTO
$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build --release
# vs ...
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release
  • Finally, release runtimes
$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build --release
$ ./target/release/my_benchmark
# vs ...
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release
$ ./target/release/my_benchmark
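
For anyone who wants to collect the compile-time numbers in one go, the comparisons above could be scripted along these lines. This is a rough Python sketch, not an official tool: `timed_run`, `compare`, and the `cargo clean` between runs are my own assumptions, and it measures wall-clock time only.

```python
import os
import subprocess
import sys
import time


def timed_run(cmd, rustflags=None, cwd=None):
    """Run cmd, optionally with RUSTFLAGS set, and return (wall_seconds, exit_code)."""
    env = dict(os.environ)
    if rustflags is not None:
        env["RUSTFLAGS"] = rustflags
    start = time.perf_counter()
    result = subprocess.run(cmd, env=env, cwd=cwd)
    return time.perf_counter() - start, result.returncode


# The flag combinations from the list above: (label, RUSTFLAGS, cargo arguments).
COMPARISONS = [
    ("debug, 1 cgu", "-C codegen-units=1", ["build"]),
    ("debug, 16 cgu", "-C codegen-units=16", ["build"]),
    ("release, 1 cgu", "-C codegen-units=1", ["build", "--release"]),
    ("release, 16 cgu + thinlto", "-C codegen-units=16 -Z thinlto", ["build", "--release"]),
]


def compare(project_dir):
    """Time each configuration from a clean build in project_dir (needs a nightly toolchain)."""
    for label, flags, args in COMPARISONS:
        subprocess.run(["cargo", "clean"], cwd=project_dir, check=True)
        seconds, code = timed_run(["cargo", "+nightly", *args], rustflags=flags, cwd=project_dir)
        print(f"{label:28s} {seconds:8.2f}s (exit {code})")


# Harmless self-check that does not need cargo: time a no-op Python process.
seconds, code = timed_run([sys.executable, "-c", "pass"])
print(f"no-op process took {seconds:.3f}s, exit code {code}")
```

Calling `compare(".")` from a crate's root directory would run the four builds in sequence. Runtime benchmarks still need to be run by hand, since the benchmark binary differs per project.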

All the numbers so far are excellent! I’ll dig into some of the more worrisome cases when I can.


#11

Thanks for clarifying! I updated my numbers as requested, still a 32% build time regression :frowning:


#12

Awesome, thanks regardless for the update!

As a follow-up question, what’s the machine you’re compiling on? Namely, are you using an SSD and do you have multiple cores?


#13

Yes SSD, 4 cores (8 threads) Kaby Lake: 7th Generation Intel® Core™ i7-7700HQ Quad Core Processor (6M cache, 3.8 GHz). 32GB ram clocked at 2.4GHz.


#14

Hm fascinating, here’s the timings I get locally:

what    debug   debug + 16cgu   release   release + 16 cgu   release + 16 cgu + thinlto
build      73              67       174                110                          157

It’s definitely expected that ThinLTO is slower than just 16 codegen units (the extra work is what regains the original runtime performance), but the parallelization through multiple codegen units is expected to make it “easily” faster than the 1 codegen unit we have today.

FWIW locally I was running i7-4770 CPU @ 3.40GHz w/ 8 cores (some hyperthreaded). I’m surprised that your timings are so much slower and also in a different direction! Do you think you can drill in and see which crates are regressing the most when activating ThinLTO?


#15

Compile times for an internal work project:

$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build
      138.79 real       400.32 user        33.90 sys
$ RUSTFLAGS='-C codegen-units=16' cargo +nightly build
      119.96 real       506.58 user        41.06 sys
$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build --release
      370.87 real       934.23 user        32.40 sys
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release
      284.49 real      1357.25 user        47.25 sys

#16

Just to check, my builds were from cargo clean, are you just building the repo crates?

Build speed is very consistent for me, yours being so much faster is odd. Is there any way to spew out individual crate timings with cargo (also ideally build a single crate at a time, whilst parallelizing across codegen units? I think -j1 would stop all parallelism thanks to jobserver, right?).


#17

bindgen

Debug Compile Times

$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build
    Finished dev [unoptimized + debuginfo] target(s) in 32.49 secs
$ RUSTFLAGS='-C codegen-units=16' cargo +nightly build
    Finished dev [unoptimized + debuginfo] target(s) in 24.12 secs

Release Compile Times

$ RUSTFLAGS='-C codegen-units=1' cargo +nightly build --release
    Finished release [optimized] target(s) in 102.96 secs
$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly build --release
    Finished release [optimized] target(s) in 63.29 secs

Release Runtimes (Building Stylo Bindings)

$ RUSTFLAGS='-C codegen-units=1' cargo +nightly test --release --features testing_only_libclang_4 sanity_check_can_generate_stylo_bindings
  time:   438.615 ms.	parse
  time:     0.004 ms.	process_replacements
  time:    12.593 ms.	deanonymize_fields
  time:   475.890 ms.	compute_whitelisted_and_codegen_items
  time:    38.554 ms.	compute_has_vtable
  time:    33.133 ms.	compute_has_destructor
  time:   156.722 ms.	find_used_template_parameters
  time:   111.814 ms.	compute_cannot_derive_debug
  time:     0.000 ms.	compute_cannot_derive_default
  time:   124.499 ms.	compute_cannot_derive_copy
  time:    34.317 ms.	compute_has_type_param_in_array
  time:     0.000 ms.	compute_has_float
  time:     0.000 ms.	compute_cannot_derive_hash
  time:     0.000 ms.	compute_cannot_derive_partialord_partialeq_or_eq
  time:   208.608 ms.	codegen
Generated Stylo bindings in: Duration { secs: 3, nanos: 724134097 }

$ RUSTFLAGS='-C codegen-units=16 -Z thinlto' cargo +nightly test --release --features testing_only_libclang_4 sanity_check_can_generate_stylo_bindings
  time:   422.057 ms.	parse
  time:     0.000 ms.	process_replacements
  time:    22.555 ms.	deanonymize_fields
  time:   518.159 ms.	compute_whitelisted_and_codegen_items
  time:    38.898 ms.	compute_has_vtable
  time:    39.674 ms.	compute_has_destructor
  time:   175.396 ms.	find_used_template_parameters
  time:   124.995 ms.	compute_cannot_derive_debug
  time:     0.000 ms.	compute_cannot_derive_default
  time:   130.400 ms.	compute_cannot_derive_copy
  time:    40.068 ms.	compute_has_type_param_in_array
  time:     0.000 ms.	compute_has_float
  time:     0.000 ms.	compute_cannot_derive_hash
  time:     0.000 ms.	compute_cannot_derive_partialord_partialeq_or_eq
  time:   222.699 ms.	codegen
Generated Stylo bindings in: Duration { secs: 3, nanos: 994839330 }

#18

@sfackler can you try the debug build again with tonight’s nightly? Notably https://github.com/rust-lang/rust/pull/45075 should improve it. Similarly, could you try comparing release mode again with tonight’s nightly, passing -Z inline-in-all-cgus=no for release mode as well?


#19

I believe so yeah, always clean builds. You’re right in that -j1 would stop all parallelism, but if you’ve got more than one core then you should definitely be seeing a benefit in all cases.


#20

Will do. I’m assuming you want the inline-in-all-cgus to be added to the thinlto RUSTFLAGS?