Does Bazel support pipelining by default, or would it need some special implementation?
I did various measurements on my high-end Linux desktop machine, which has:
- 14 physical cores, 28 virtual cores
- 32 GiB RAM
- A fast SSD
rustc-perf
I measured all the multi-crate benchmarks in rustc-perf, plus a couple of extra ones.
Methodology:
- For debug builds, I used cargo +nightly build.
- For opt builds, I used cargo +nightly build --release.
- For incremental builds, I did a normal build, then touched all the .rs files, then rebuilt; this is not a realistic workflow.
- All measurements were only taken once.
- I have a script that I’m happy to share if anyone wants to benchmark these.
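The steps above can be sketched as a small shell function. This is a hypothetical reconstruction of the methodology, not the author's actual script; the function name and structure are my own.

```shell
# Hypothetical sketch of the measurement steps described above; the
# author's actual script may differ. Call as: bench_crate path/to/crate
bench_crate() {
    (
        cd "$1" || exit 1

        # Full debug build, from scratch.
        cargo clean
        time cargo +nightly build

        # "Incremental" debug rebuild: touch every .rs file, then rebuild.
        find . -name '*.rs' -exec touch {} +
        time cargo +nightly build

        # Full opt build from scratch, then the same incremental step.
        cargo clean
        time cargo +nightly build --release
        find . -name '*.rs' -exec touch {} +
        time cargo +nightly build --release
    )
}
```

Each pair of runs would be done once with and once without CARGO_BUILD_PIPELINING=true in the environment.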
Results:
cargo
- dbg : 81.91 s -> 80.69 s; 1.01x faster
- dbg incr: 4.04 s -> 4.07 s; .99x faster
- opt : 90.41 s -> 91.31 s; .99x faster
- opt incr: 25.57 s -> 24.20 s; 1.05x faster
clap-rs
- dbg : 10.87 s -> 10.73 s; 1.01x faster
- dbg incr: 2.08 s -> 2.04 s; 1.01x faster
- opt : 21.74 s -> 21.53 s; 1.00x faster
- opt incr: 20.83 s -> 20.75 s; 1.00x faster
coercions
- dbg : 1.53 s -> 1.48 s; 1.03x faster
- dbg incr: .44 s -> .44 s; 1.01x faster
- opt : 1.16 s -> 1.14 s; 1.02x faster
- opt incr: 1.01 s -> 1.01 s; .99x faster
cranelift-codegen/cranelift-codegen
- dbg : 16.93 s -> 16.90 s; 1.00x faster
- dbg incr: 3.63 s -> 3.62 s; 1.00x faster
- opt : 28.15 s -> 23.45 s; 1.20x faster
- opt incr: 20.83 s -> 20.99 s; .99x faster
crates.io
- dbg : 76.38 s -> 75.89 s; 1.00x faster
- dbg incr: 3.61 s -> 3.62 s; .99x faster
- opt : 82.45 s -> 85.14 s; .96x faster
- opt incr: 15.90 s -> 16.25 s; .97x faster
ctfe-stress-2
- dbg : 9.38 s -> 9.35 s; 1.00x faster
- dbg incr: 1.40 s -> 1.44 s; .97x faster
- opt : 5.95 s -> 5.95 s; .99x faster
- opt incr: 5.82 s -> 5.84 s; .99x faster
encoding
- dbg : 1.74 s -> 1.69 s; 1.03x faster
- dbg incr: .63 s -> .60 s; 1.04x faster
- opt : 1.88 s -> 1.78 s; 1.05x faster
- opt incr: 1.68 s -> 1.71 s; .98x faster
futures
- dbg : 1.38 s -> 1.32 s; 1.04x faster
- dbg incr: .27 s -> .27 s; 1.00x faster
- opt : 1.36 s -> 1.20 s; 1.13x faster
- opt incr: .90 s -> .93 s; .97x faster
html5ever
- dbg : 6.04 s -> 6.02 s; 1.00x faster
- dbg incr: 1.69 s -> 1.70 s; .99x faster
- opt : 7.58 s -> 7.44 s; 1.01x faster
- opt incr: 3.41 s -> 3.43 s; .99x faster
hyper
- dbg : 6.09 s -> 5.46 s; 1.11x faster
- dbg incr: .65 s -> .66 s; .99x faster
- opt : 8.71 s -> 5.99 s; 1.45x faster
- opt incr: 2.46 s -> 2.47 s; .99x faster
piston-image
- dbg : 7.79 s -> 7.31 s; 1.06x faster
- dbg incr: .84 s -> .85 s; .99x faster
- opt : 13.67 s -> 9.83 s; 1.38x faster
- opt incr: 4.53 s -> 4.45 s; 1.01x faster
regex
- dbg : 3.58 s -> 3.26 s; 1.09x faster
- dbg incr: .95 s -> .83 s; 1.14x faster
- opt : 5.62 s -> 4.29 s; 1.30x faster
- opt incr: 5.39 s -> 4.05 s; 1.33x faster
ripgrep
- dbg : 10.72 s -> 10.15 s; 1.05x faster
- dbg incr: 3.12 s -> 3.11 s; 1.00x faster
- opt : 20.19 s -> 17.39 s; 1.16x faster
- opt incr: 9.59 s -> 8.35 s; 1.14x faster
script-servo/components/script
- dbg : 267.78 s -> 257.57 s; 1.03x faster
- dbg incr: 42.45 s -> 42.27 s; 1.00x faster
- opt : 271.10 s -> 252.18 s; 1.07x faster
- opt incr: 142.91 s -> 141.21 s; 1.01x faster
sentry-cli
- dbg : 67.67 s -> 67.43 s; 1.00x faster
- dbg incr: 4.17 s -> 4.14 s; 1.00x faster
- opt : 75.06 s -> 76.83 s; .97x faster
- opt incr: 13.59 s -> 13.62 s; .99x faster
style-servo/components/style
- dbg : 61.03 s -> 60.56 s; 1.00x faster
- dbg incr: 13.60 s -> 13.45 s; 1.01x faster
- opt : 56.00 s -> 52.71 s; 1.06x faster
- opt incr: 42.37 s -> 42.50 s; .99x faster
syn
- dbg : 3.13 s -> 2.70 s; 1.15x faster
- dbg incr: .67 s -> .61 s; 1.09x faster
- opt : 5.37 s -> 3.28 s; 1.63x faster
- opt incr: 3.09 s -> 2.51 s; 1.23x faster
tokio-webpush-simple
- dbg : 12.17 s -> 11.70 s; 1.04x faster
- dbg incr: 1.44 s -> 1.43 s; 1.00x faster
- opt : 17.11 s -> 16.01 s; 1.06x faster
- opt incr: 2.85 s -> 2.89 s; .98x faster
webrender/webrender
- dbg : 35.79 s -> 32.18 s; 1.11x faster
- dbg incr: 2.21 s -> 2.20 s; 1.00x faster
- opt : 50.45 s -> 39.70 s; 1.27x faster
- opt incr: 13.16 s -> 12.90 s; 1.02x faster
webr_api/webrender_api
- dbg : 29.39 s -> 29.22 s; 1.00x faster
- dbg incr: 4.15 s -> 4.18 s; .99x faster
- opt : 50.50 s -> 48.70 s; 1.03x faster
- opt incr: 14.87 s -> 14.83 s; 1.00x faster
wg-grammar
- dbg : 9.25 s -> 9.05 s; 1.02x faster
- dbg incr: .28 s -> .27 s; 1.00x faster
- opt : 11.41 s -> 10.51 s; 1.08x faster
- opt incr: 2.47 s -> 2.49 s; .99x faster
Opt (non-incremental) results are the best, with a few big speed-ups: 1.63x for syn, 1.45x for hyper, 1.38x for piston-image, 1.30x for regex, 1.27x for webrender, 1.20x for cranelift-codegen, 1.16x for ripgrep, 1.13x for futures.
Debug (non-incremental) results are less good, but not bad, with the best speed-ups being 1.15x for syn, 1.11x for hyper, and 1.11x for webrender.
Incremental improvements are weak except for regex and syn, but as I said, my incremental workload wasn’t realistic so I don’t know how much to conclude from that.
Firefox
(Update: I removed these results because I realized I was not measuring with an appropriate version of Cargo.)
rustc
(Update: I removed these results because I realized I was not measuring with an appropriate version of Cargo.)
I ran erahm’s script for Firefox (which does a non-incremental opt build) on the same 14-core machine:
Baseline run without pipelining
===============================
Mean Std.Dev. Min Median Max
real 704.137 5.200 693.277 705.624 708.635
user 12297.793 44.110 12230.514 12292.957 12376.622
sys 388.200 1.736 386.268 388.233 391.457
Running with pipelining
=======================
Mean Std.Dev. Min Median Max
real 660.525 29.074 610.066 677.233 683.780
user 12561.307 256.090 12361.577 12394.073 12946.205
sys 390.659 2.502 388.449 389.257 394.441
This shows the following improvements, which are well outside the standard deviation:
- Mean 1.07x
- Min 1.13x
- Median 1.04x
- Max 1.03x
It’s interesting that the standard deviation increased with pipelining. Parallelism leads to higher variance? Seems plausible at first glance.
rules_rust maintainer here.
Bazel supports parallel execution of independent compilation steps. It executes rustc directly, though, and would need to be taught how to interpret the stderr messages rustc now emits when the "rmeta" is available in order to dispatch dependent Rust library actions. It's also not clear to me whether the "partial completion" of a rustc invocation is compatible with how Bazel models compilation actions, so we might not be able to benefit from this directly.
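To make the mechanics concrete, here is a hedged sketch of what such integration would key on. On the nightlies in question, Cargo passes rustc JSON diagnostics plus an unstable artifact-notification flag (-Zemit-artifact-notifications, from memory), and rustc then prints a JSON line on stderr when the .rmeta file is written. The exact flag and field names below are assumptions and should be checked against the actual rustc output:

```shell
# Hypothetical example of the kind of stderr line a build system would
# watch for (field names are assumptions; verify against rustc's output):
sample_line='{"artifact":"/path/to/target/debug/libfoo.rmeta","emit":"metadata"}'

# An orchestrator like Bazel would scan rustc's stderr and release
# dependent library compiles as soon as the metadata notification appears:
case "$sample_line" in
    *'"emit":"metadata"'*) echo "rmeta ready: start dependent compiles" ;;
esac
```

The hard part is not the parsing but, as noted above, that the action producing the rlib is still running when the rmeta becomes available, which may not fit Bazel's one-action-one-output completion model.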
We’re tracking this in https://github.com/bazelbuild/rules_rust/issues/228
Did a quick comparison of the tech stack we have here at Embark, with and without pipelining, for a full release build.
Build time went from 141 s -> 127 s. 1.11x faster
This was on a Threadripper with 16 physical cores, running Windows 10.
Compiling the mdbook project (biggest project I have at hand).
$ rustc --version
rustc 1.36.0-nightly (6afcb5628 2019-05-19)
$ cargo clean && time cargo build
cargo build 295.62s user 12.53s system 557% cpu 55.232 total
$ CARGO_BUILD_PIPELINING=true cargo clean && time cargo build
cargo build 295.14s user 12.36s system 555% cpu 55.307 total
$ cargo clean && time cargo build --release
cargo build --release 915.72s user 14.77s system 653% cpu 2:22.42 total
$ CARGO_BUILD_PIPELINING=true cargo clean && time cargo build --release
cargo build --release 905.33s user 14.23s system 672% cpu 2:16.80 total
specs:
- Linux x86_64 (ubuntu)
- 4 cores, 8 virtual cores
- 12GB RAM
- fast SSD
Is that a real command?
If so, then you ran only cargo clean with CARGO_BUILD_PIPELINING=true,
and time cargo build with CARGO_BUILD_PIPELINING unset:
$ cat test.sh
#!/bin/sh
echo "A: $A"
$ A=1 ./test.sh && ./test.sh
A: 1
A:
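For completeness, here is a sketch of how to scope the variable so the build actually sees it, using a stand-in function in place of cargo so the snippet is self-contained:

```shell
# A VAR=value prefix applies only to the single command it precedes,
# not to later commands in the same && chain or script.
unset CARGO_BUILD_PIPELINING
print_flag() { echo "pipelining: ${CARGO_BUILD_PIPELINING-unset}"; }

# The prefix here applies only to `env`, so print_flag sees nothing:
CARGO_BUILD_PIPELINING=true env >/dev/null
print_flag                      # prints "pipelining: unset"

# Fix 1: export it for the rest of the shell session:
export CARGO_BUILD_PIPELINING=true
print_flag                      # prints "pipelining: true"

# Fix 2, equivalent for the original commands: prefix the build itself:
#   cargo clean && CARGO_BUILD_PIPELINING=true cargo build
```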
Here are results for TiKV.
using rustc 1.36.0-nightly (50a0defd5 2019-05-21)
Sorry they aren’t very organized, but you can probably figure out what they mean. “without” means no pipelining, “with” means pipelining. The release build is with ThinLTO. This is only a single run in each configuration, on a 40-core machine. All the release builds show some kind of improvement. tikv_util is further back in the DAG, engine less far back, and tikv is the final lib in the DAG before three bins are built.
without debug full Finished dev [unoptimized + debuginfo] target(s) in 3m 00s
without debug touch tikv_util Finished dev [unoptimized + debuginfo] target(s) in 43.39s
without debug touch engine Finished dev [unoptimized + debuginfo] target(s) in 43.55s
without debug touch tikv Finished dev [unoptimized + debuginfo] target(s) in 40.71s
with debug full Finished dev [unoptimized + debuginfo] target(s) in 2m 54s
with debug touch tikv_util Finished dev [unoptimized + debuginfo] target(s) in 40.49s
with debug touch engine Finished dev [unoptimized + debuginfo] target(s) in 41.95s
with debug touch tikv Finished dev [unoptimized + debuginfo] target(s) in 41.39s
without release full Finished release [optimized + debuginfo] target(s) in 15m 17s
without release touch tikv_util Finished release [optimized + debuginfo] target(s) in 11m 26s
without release touch engine Finished release [optimized + debuginfo] target(s) in 10m 54s
without release touch tikv Finished release [optimized + debuginfo] target(s) in 11m 11s
with release full Finished release [optimized + debuginfo] target(s) in 12m 37s
with release touch tikv_util Finished release [optimized + debuginfo] target(s) in 10m 34s
with release touch engine Finished release [optimized + debuginfo] target(s) in 10m 37s
with release touch tikv Finished release [optimized + debuginfo] target(s) in 10m 33s
Here are my rustc-perf results again, in an easier-to-read form, and with the incremental results omitted because they are suspect.
-----------------------------------------------------------------------------
OPT
-----------------------------------------------------------------------------
syn 5.37 s -> 3.28 s; 1.63x faster
hyper 8.71 s -> 5.99 s; 1.45x faster
piston-image 13.67 s -> 9.83 s; 1.38x faster
regex 5.62 s -> 4.29 s; 1.30x faster
webrender 50.45 s -> 39.70 s; 1.27x faster
cranelift-codegen 28.15 s -> 23.45 s; 1.20x faster
ripgrep 20.19 s -> 17.39 s; 1.16x faster
futures 1.36 s -> 1.20 s; 1.13x faster
wg-grammar 11.41 s -> 10.51 s; 1.08x faster
script-servo 271.10 s -> 252.18 s; 1.07x faster
style-servo 56.00 s -> 52.71 s; 1.06x faster
tokio-webpush-simple 17.11 s -> 16.01 s; 1.06x faster
encoding 1.88 s -> 1.78 s; 1.05x faster
webrender_api 50.50 s -> 48.70 s; 1.03x faster
coercions 1.16 s -> 1.14 s; 1.02x faster
html5ever 7.58 s -> 7.44 s; 1.01x faster
clap-rs 21.74 s -> 21.53 s; 1.00x faster
ctfe-stress-2 5.95 s -> 5.95 s; .99x faster
cargo 90.41 s -> 91.31 s; .99x faster
sentry-cli 75.06 s -> 76.83 s; .97x faster
crates.io 82.45 s -> 85.14 s; .96x faster
-----------------------------------------------------------------------------
DEBUG
-----------------------------------------------------------------------------
syn 3.13 s -> 2.70 s; 1.15x faster
webrender 35.79 s -> 32.18 s; 1.11x faster
hyper 6.09 s -> 5.46 s; 1.11x faster
regex 3.58 s -> 3.26 s; 1.09x faster
piston-image 7.79 s -> 7.31 s; 1.06x faster
ripgrep 10.72 s -> 10.15 s; 1.05x faster
tokio-webpush-simple 12.17 s -> 11.70 s; 1.04x faster
futures 1.38 s -> 1.32 s; 1.04x faster
coercions 1.53 s -> 1.48 s; 1.03x faster
script-servo 267.78 s -> 257.57 s; 1.03x faster
encoding 1.74 s -> 1.69 s; 1.03x faster
wg-grammar 9.25 s -> 9.05 s; 1.02x faster
cargo 81.91 s -> 80.69 s; 1.01x faster
clap-rs 10.87 s -> 10.73 s; 1.01x faster
cranelift-codegen 16.93 s -> 16.90 s; 1.00x faster
crates.io 76.38 s -> 75.89 s; 1.00x faster
ctfe-stress-2 9.38 s -> 9.35 s; 1.00x faster
html5ever 6.04 s -> 6.02 s; 1.00x faster
sentry-cli 67.67 s -> 67.43 s; 1.00x faster
style-servo 61.03 s -> 60.56 s; 1.00x faster
webrender_api 29.39 s -> 29.22 s; 1.00x faster
I apologise for the lateness of this report, but here’s a breakdown of the current state of play on Rustup:
Building rustup with pipelining
This experiment was carried out on a Lenovo T480, building rustup using the pipelining capability in recent nightly Rust. The laptop has an NVMe SSD, 32 GB of RAM, and an i7-8550U (4 cores, 2 threads per core) at 1.8 GHz, peaking at 2.6 GHz.
The git revision of rustup was 2e97e32 for this test. This was a 266 crate build sequence, each time done from scratch.
As a baseline, building with current stable (1.35.0) in debug mode:
Command being timed: "cargo build"
User time (seconds): 401.88
System time (seconds): 17.19
Percent of CPU this job got: 579%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.34
Or in release mode:
Command being timed: "cargo build --release"
User time (seconds): 758.69
System time (seconds): 16.44
Percent of CPU this job got: 346%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:43.60
The reduction in apparent parallelism on the release builds likely comes down to the fact that rustup has quite a linear end to its build. We have three crates in our repo: download, then rustup, and then rustup-init, which is the final CLI app itself. While these don’t form a completely linear dep chain, the build seems quite single-threaded in the sense that I rarely see more than one of the crates being built at a time.
Unscientifically, we also see a choke point before then, with serde_derive, then cookie_store, then reqwest each being the only crate under compilation.
The tests were done on a relatively unloaded system: a general desktop session was running, but nothing really using a lot of CPU/IO; a 99%-idle type of situation.
Nightly
Nightly builds were done with cargo 1.37.0-nightly (545f35425 2019-05-23)
First up, without pipelining, debug:
Command being timed: "cargo +nightly build"
User time (seconds): 424.43
System time (seconds): 19.15
Percent of CPU this job got: 583%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.01
And release:
Command being timed: "cargo +nightly build --release"
User time (seconds): 771.45
System time (seconds): 17.14
Percent of CPU this job got: 347%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:46.69
So baseline vs. nightly without pipelining is pretty much the same in terms of the wallclock build time and also in terms of the parallelism achieved. This is good since it helps us to see how stable -> potential-new-stable behaves.
Nightly with pipelining in debug:
Command being timed: "env CARGO_BUILD_PIPELINING=true cargo +nightly build"
User time (seconds): 432.66
System time (seconds): 19.15
Percent of CPU this job got: 625%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.28
Nightly with pipelining in release:
Command being timed: "env CARGO_BUILD_PIPELINING=true cargo +nightly build --release"
User time (seconds): 788.07
System time (seconds): 17.44
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:27.17
As we can see, the parallelism definitely went up with pipelining, though for a debug build the gain pretty much only mitigates the additional cost of whatever has gone into the compiler since current stable. The release build benefits somewhat more, gaining nearly 19 s vs. non-pipelined and 16 s vs. current stable.
This isn’t a huge gain, and against a roughly 17-minute CI cycle it won’t really be noticed, but it is visible, and for someone doing regular compiles that affect more than the last few crates it will definitely be appreciated. Heck, every second before rls can respond is painful, so if this can also be used in rls then that’s worth it.
Thanks for your work on this,
Daniel.
Note: https://github.com/rust-lang/rls/issues/1484
Also, I once got a spurious error: error: crate `openssl` required to be available in rlib format, but was not found in this form.
Could you detail how you generated the error message about crates and the rlib format? That sounds like a bug in Cargo.
As far as I remember, I was just building a crate with cargo build. I don’t remember whether it was in release or debug mode. Maybe VS Code with RLS was also active. The build failed, but after I repeated the build command it went through.
I tried running while true; do sleep 2 && rm -Rf target/debug/ && lowps cargo -v build --all-features || break; done for a while with pipelining active, but haven’t reproduced the issue.
In TiKV we are attempting to upgrade our toolchain and turn on pipelining. Here is the PR: https://github.com/tikv/tikv/pull/4913
Our release build sees a 12% reduction in full build time on a 40-core machine.
This time I also tested cargo bench, and for running cargo bench --no-run I see an interesting 27% and 22% reduction for full and partial rebuilds respectively. I’m guessing this has to do with the large number of bench bins we have.
rust-simd-noise crate, Windows 10, nightly as of 2019-07-17, Ryzen 3900X (12 physical / 24 logical cores):
- release build, pipelining on: 31 seconds
- release build, pipelining off: 33 seconds
This 2-second difference was repeatable.
Info for the ggez crate devel branch, which might be somewhat of an odd duck because it has a lot of small, simple dependencies and a few large, complicated ones; currently 245 in total. I also tried on two different computers, a Ryzen 2400G (4c/8t) and a Ryzen 1700 (8c/16t):
- Ryzen 2400G debug clean nightly w/o pipelining: 1m 25s, w/ pipelining: 1m 25s, difference: 0%
- Ryzen 2400G release clean nightly w/o pipelining: 2m 52s, w/ pipelining: 2m 54s, difference: 1.1% slower
- Ryzen 1700 debug clean nightly w/o pipelining: 53s, w/ pipelining: 55s, difference: 3.7% slower
- Ryzen 1700 release clean nightly w/o pipelining: 1m 40s, w/ pipelining: 1m 40s, difference: 0%
Bit surprised by this, actually. I suppose there are just enough things to build, with a narrow and deep dependency tree, that it’s already overlapping as much as it can.