Does Bazel support pipelining by default, or would it need some special implementation?
I did various measurements on my high-end Linux desktop machine, which has:
- 14 physical cores, 28 virtual cores
- 32 GiB RAM
- A fast SSD
rustc-perf
I measured all the multi-crate benchmarks in rustc-perf, plus a couple of extra ones.
Methodology:
- For debug builds, I used cargo +nightly build.
- For opt builds, I used cargo +nightly build --release.
- For incremental builds, I did a normal build, then touched all the .rs files, then rebuilt; this is not a realistic workflow.
- All measurements were only taken once.
- I have a script that I’m happy to share if anyone wants to benchmark these.
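The steps above can be sketched as a small shell function. This is a hypothetical reconstruction of the methodology, not the author's actual script; the function name and structure are my own.

```shell
# Hypothetical sketch of the measurement steps described above; the
# author's actual script may differ. Call as: bench_crate path/to/crate
bench_crate() {
    (
        cd "$1" || exit 1

        # Full debug build, from scratch.
        cargo clean
        time cargo +nightly build

        # "Incremental" debug rebuild: touch every .rs file, then rebuild.
        find . -name '*.rs' -exec touch {} +
        time cargo +nightly build

        # Full opt build from scratch, then the same incremental step.
        cargo clean
        time cargo +nightly build --release
        find . -name '*.rs' -exec touch {} +
        time cargo +nightly build --release
    )
}
```

Each pair of runs would be done once with and once without CARGO_BUILD_PIPELINING=true in the environment.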
Results:
cargo
- dbg : 81.91 s -> 80.69 s; 1.01x faster
- dbg incr: 4.04 s -> 4.07 s; .99x faster
- opt : 90.41 s -> 91.31 s; .99x faster
- opt incr: 25.57 s -> 24.20 s; 1.05x faster
clap-rs
- dbg : 10.87 s -> 10.73 s; 1.01x faster
- dbg incr: 2.08 s -> 2.04 s; 1.01x faster
- opt : 21.74 s -> 21.53 s; 1.00x faster
- opt incr: 20.83 s -> 20.75 s; 1.00x faster
coercions
- dbg : 1.53 s -> 1.48 s; 1.03x faster
- dbg incr: .44 s -> .44 s; 1.01x faster
- opt : 1.16 s -> 1.14 s; 1.02x faster
- opt incr: 1.01 s -> 1.01 s; .99x faster
cranelift-codegen/cranelift-codegen
- dbg : 16.93 s -> 16.90 s; 1.00x faster
- dbg incr: 3.63 s -> 3.62 s; 1.00x faster
- opt : 28.15 s -> 23.45 s; 1.20x faster
- opt incr: 20.83 s -> 20.99 s; .99x faster
crates.io
- dbg : 76.38 s -> 75.89 s; 1.00x faster
- dbg incr: 3.61 s -> 3.62 s; .99x faster
- opt : 82.45 s -> 85.14 s; .96x faster
- opt incr: 15.90 s -> 16.25 s; .97x faster
ctfe-stress-2
- dbg : 9.38 s -> 9.35 s; 1.00x faster
- dbg incr: 1.40 s -> 1.44 s; .97x faster
- opt : 5.95 s -> 5.95 s; .99x faster
- opt incr: 5.82 s -> 5.84 s; .99x faster
encoding
- dbg : 1.74 s -> 1.69 s; 1.03x faster
- dbg incr: .63 s -> .60 s; 1.04x faster
- opt : 1.88 s -> 1.78 s; 1.05x faster
- opt incr: 1.68 s -> 1.71 s; .98x faster
futures
- dbg : 1.38 s -> 1.32 s; 1.04x faster
- dbg incr: .27 s -> .27 s; 1.00x faster
- opt : 1.36 s -> 1.20 s; 1.13x faster
- opt incr: .90 s -> .93 s; .97x faster
html5ever
- dbg : 6.04 s -> 6.02 s; 1.00x faster
- dbg incr: 1.69 s -> 1.70 s; .99x faster
- opt : 7.58 s -> 7.44 s; 1.01x faster
- opt incr: 3.41 s -> 3.43 s; .99x faster
hyper
- dbg : 6.09 s -> 5.46 s; 1.11x faster
- dbg incr: .65 s -> .66 s; .99x faster
- opt : 8.71 s -> 5.99 s; 1.45x faster
- opt incr: 2.46 s -> 2.47 s; .99x faster
piston-image
- dbg : 7.79 s -> 7.31 s; 1.06x faster
- dbg incr: .84 s -> .85 s; .99x faster
- opt : 13.67 s -> 9.83 s; 1.38x faster
- opt incr: 4.53 s -> 4.45 s; 1.01x faster
regex
- dbg : 3.58 s -> 3.26 s; 1.09x faster
- dbg incr: .95 s -> .83 s; 1.14x faster
- opt : 5.62 s -> 4.29 s; 1.30x faster
- opt incr: 5.39 s -> 4.05 s; 1.33x faster
ripgrep
- dbg : 10.72 s -> 10.15 s; 1.05x faster
- dbg incr: 3.12 s -> 3.11 s; 1.00x faster
- opt : 20.19 s -> 17.39 s; 1.16x faster
- opt incr: 9.59 s -> 8.35 s; 1.14x faster
script-servo/components/script
- dbg : 267.78 s -> 257.57 s; 1.03x faster
- dbg incr: 42.45 s -> 42.27 s; 1.00x faster
- opt : 271.10 s -> 252.18 s; 1.07x faster
- opt incr: 142.91 s -> 141.21 s; 1.01x faster
sentry-cli
- dbg : 67.67 s -> 67.43 s; 1.00x faster
- dbg incr: 4.17 s -> 4.14 s; 1.00x faster
- opt : 75.06 s -> 76.83 s; .97x faster
- opt incr: 13.59 s -> 13.62 s; .99x faster
style-servo/components/style
- dbg : 61.03 s -> 60.56 s; 1.00x faster
- dbg incr: 13.60 s -> 13.45 s; 1.01x faster
- opt : 56.00 s -> 52.71 s; 1.06x faster
- opt incr: 42.37 s -> 42.50 s; .99x faster
syn
- dbg : 3.13 s -> 2.70 s; 1.15x faster
- dbg incr: .67 s -> .61 s; 1.09x faster
- opt : 5.37 s -> 3.28 s; 1.63x faster
- opt incr: 3.09 s -> 2.51 s; 1.23x faster
tokio-webpush-simple
- dbg : 12.17 s -> 11.70 s; 1.04x faster
- dbg incr: 1.44 s -> 1.43 s; 1.00x faster
- opt : 17.11 s -> 16.01 s; 1.06x faster
- opt incr: 2.85 s -> 2.89 s; .98x faster
webrender/webrender
- dbg : 35.79 s -> 32.18 s; 1.11x faster
- dbg incr: 2.21 s -> 2.20 s; 1.00x faster
- opt : 50.45 s -> 39.70 s; 1.27x faster
- opt incr: 13.16 s -> 12.90 s; 1.02x faster
webr_api/webrender_api
- dbg : 29.39 s -> 29.22 s; 1.00x faster
- dbg incr: 4.15 s -> 4.18 s; .99x faster
- opt : 50.50 s -> 48.70 s; 1.03x faster
- opt incr: 14.87 s -> 14.83 s; 1.00x faster
wg-grammar
- dbg : 9.25 s -> 9.05 s; 1.02x faster
- dbg incr: .28 s -> .27 s; 1.00x faster
- opt : 11.41 s -> 10.51 s; 1.08x faster
- opt incr: 2.47 s -> 2.49 s; .99x faster
Opt (non-incremental) results are the best, with a few big speed-ups: 1.63x for syn, 1.45x for hyper, 1.38x for piston-image, 1.30x for regex, 1.27x for webrender, 1.20x for cranelift-codegen, 1.16x for ripgrep, 1.13x for futures.
Debug (non-incremental) results are less good, but not bad, with the best speed-ups being 1.15x for syn, 1.11x for hyper, and 1.11x for webrender.
Incremental improvements are weak except for regex and syn, but as I said, my incremental workload wasn’t realistic so I don’t know how much to conclude from that.
Firefox
(Update: I removed these results because I realized I was not measuring with an appropriate version of Cargo.)
rustc
(Update: I removed these results because I realized I was not measuring with an appropriate version of Cargo.)
I ran erahm’s script for Firefox (which does a non-incremental opt build) on the same 14-core machine:
Baseline run without pipelining
===============================
Mean Std.Dev. Min Median Max
real 704.137 5.200 693.277 705.624 708.635
user 12297.793 44.110 12230.514 12292.957 12376.622
sys 388.200 1.736 386.268 388.233 391.457
Running with pipelining
=======================
Mean Std.Dev. Min Median Max
real 660.525 29.074 610.066 677.233 683.780
user 12561.307 256.090 12361.577 12394.073 12946.205
sys 390.659 2.502 388.449 389.257 394.441
This shows the following improvements, which are well outside the standard deviation:
- Mean 1.07x
- Min 1.13x
- Median 1.04x
- Max 1.03x
It’s interesting that the standard deviation increased with pipelining. Parallelism leads to higher variance? Seems plausible at first glance.
rules_rust maintainer here.
Bazel supports parallel execution of independent compilation steps. It executes rustc directly, though, and would need to be taught how to interpret the stderr messages rustc now emits when the "rmeta" is available in order to dispatch dependent Rust library actions. It's also not clear to me whether the "partial completion" of a rustc invocation is compatible with how Bazel models compilation actions, so we might not be able to benefit from this directly.
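To make the mechanics concrete, here is a hedged sketch of what such integration would key on. On the nightlies in question, Cargo passes rustc JSON diagnostics plus an unstable artifact-notification flag (-Zemit-artifact-notifications, from memory), and rustc then prints a JSON line on stderr when the .rmeta file is written. The exact flag and field names below are assumptions and should be checked against the actual rustc output:

```shell
# Hypothetical example of the kind of stderr line a build system would
# watch for (field names are assumptions; verify against rustc's output):
sample_line='{"artifact":"/path/to/target/debug/libfoo.rmeta","emit":"metadata"}'

# An orchestrator like Bazel would scan rustc's stderr and release
# dependent library compiles as soon as the metadata notification appears:
case "$sample_line" in
    *'"emit":"metadata"'*) echo "rmeta ready: start dependent compiles" ;;
esac
```

The hard part is not the parsing but, as noted above, that the action producing the rlib is still running when the rmeta becomes available, which may not fit Bazel's one-action-one-output completion model.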
We’re tracking this in https://github.com/bazelbuild/rules_rust/issues/228
Did a quick comparison of the tech stack we have here at Embark, with and without pipelining, for a full release build.
Build time went from 141 s -> 127 s. 1.11x faster
This was on a Threadripper with 16 physical cores, running Windows 10.
Compiling the mdbook project (biggest project I have at hand).
$ rustc --version
rustc 1.36.0-nightly (6afcb5628 2019-05-19)
$ cargo clean && time cargo build
cargo build 295.62s user 12.53s system 557% cpu 55.232 total
$ CARGO_BUILD_PIPELINING=true cargo clean && time cargo build
cargo build 295.14s user 12.36s system 555% cpu 55.307 total
$ cargo clean && time cargo build --release
cargo build --release 915.72s user 14.77s system 653% cpu 2:22.42 total
$ CARGO_BUILD_PIPELINING=true cargo clean && time cargo build --release
cargo build --release 905.33s user 14.23s system 672% cpu 2:16.80 total
specs:
- Linux x86_64 (ubuntu)
- 4 cores, 8 virtual cores
- 12GB RAM
- fast SSD
Is that a real command?
If so, then you ran only cargo clean with CARGO_BUILD_PIPELINING=true,
and time cargo build with CARGO_BUILD_PIPELINING unset:
$ cat test.sh
#!/bin/sh
echo "A: $A"
$ A=1 ./test.sh && ./test.sh
A: 1
A:
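For completeness, here is a sketch of how to scope the variable so the build actually sees it, using a stand-in function in place of cargo so the snippet is self-contained:

```shell
# A VAR=value prefix applies only to the single command it precedes,
# not to later commands in the same && chain or script.
unset CARGO_BUILD_PIPELINING
print_flag() { echo "pipelining: ${CARGO_BUILD_PIPELINING-unset}"; }

# The prefix here applies only to `env`, so print_flag sees nothing:
CARGO_BUILD_PIPELINING=true env >/dev/null
print_flag                      # prints "pipelining: unset"

# Fix 1: export it for the rest of the shell session:
export CARGO_BUILD_PIPELINING=true
print_flag                      # prints "pipelining: true"

# Fix 2, equivalent for the original commands: prefix the build itself:
#   cargo clean && CARGO_BUILD_PIPELINING=true cargo build
```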
Here are results for TiKV.
using rustc 1.36.0-nightly (50a0defd5 2019-05-21)
Sorry they aren’t very organized, but you can probably figure out what they mean. “without” means no pipelining, “with” means pipelining. The release build is with ThinLTO. This is only a single run in each configuration, on a 40-core machine. All the release builds show some kind of improvement. tikv_util is further back in the DAG, engine less far back, and tikv is the final lib in the DAG before three bins are built.
without debug full Finished dev [unoptimized + debuginfo] target(s) in 3m 00s
without debug touch tikv_util Finished dev [unoptimized + debuginfo] target(s) in 43.39s
without debug touch engine Finished dev [unoptimized + debuginfo] target(s) in 43.55s
without debug touch tikv Finished dev [unoptimized + debuginfo] target(s) in 40.71s
with debug full Finished dev [unoptimized + debuginfo] target(s) in 2m 54s
with debug touch tikv_util Finished dev [unoptimized + debuginfo] target(s) in 40.49s
with debug touch engine Finished dev [unoptimized + debuginfo] target(s) in 41.95s
with debug touch tikv Finished dev [unoptimized + debuginfo] target(s) in 41.39s
without release full Finished release [optimized + debuginfo] target(s) in 15m 17s
without release touch tikv_util Finished release [optimized + debuginfo] target(s) in 11m 26s
without release touch engine Finished release [optimized + debuginfo] target(s) in 10m 54s
without release touch tikv Finished release [optimized + debuginfo] target(s) in 11m 11s
with release full Finished release [optimized + debuginfo] target(s) in 12m 37s
with release touch tikv_util Finished release [optimized + debuginfo] target(s) in 10m 34s
with release touch engine Finished release [optimized + debuginfo] target(s) in 10m 37s
with release touch tikv Finished release [optimized + debuginfo] target(s) in 10m 33s
Here are my rustc-perf results again, in an easier-to-read form, and with the incremental results omitted because they are suspect.
-----------------------------------------------------------------------------
OPT
-----------------------------------------------------------------------------
syn 5.37 s -> 3.28 s; 1.63x faster
hyper 8.71 s -> 5.99 s; 1.45x faster
piston-image 13.67 s -> 9.83 s; 1.38x faster
regex 5.62 s -> 4.29 s; 1.30x faster
webrender 50.45 s -> 39.70 s; 1.27x faster
cranelift-codegen 28.15 s -> 23.45 s; 1.20x faster
ripgrep 20.19 s -> 17.39 s; 1.16x faster
futures 1.36 s -> 1.20 s; 1.13x faster
wg-grammar 11.41 s -> 10.51 s; 1.08x faster
script-servo 271.10 s -> 252.18 s; 1.07x faster
style-servo 56.00 s -> 52.71 s; 1.06x faster
tokio-webpush-simple 17.11 s -> 16.01 s; 1.06x faster
encoding 1.88 s -> 1.78 s; 1.05x faster
webrender_api 50.50 s -> 48.70 s; 1.03x faster
coercions 1.16 s -> 1.14 s; 1.02x faster
html5ever 7.58 s -> 7.44 s; 1.01x faster
clap-rs 21.74 s -> 21.53 s; 1.00x faster
ctfe-stress-2 5.95 s -> 5.95 s; .99x faster
cargo 90.41 s -> 91.31 s; .99x faster
sentry-cli 75.06 s -> 76.83 s; .97x faster
crates.io 82.45 s -> 85.14 s; .96x faster
-----------------------------------------------------------------------------
DEBUG
-----------------------------------------------------------------------------
syn 3.13 s -> 2.70 s; 1.15x faster
webrender 35.79 s -> 32.18 s; 1.11x faster
hyper 6.09 s -> 5.46 s; 1.11x faster
regex 3.58 s -> 3.26 s; 1.09x faster
piston-image 7.79 s -> 7.31 s; 1.06x faster
ripgrep 10.72 s -> 10.15 s; 1.05x faster
tokio-webpush-simple 12.17 s -> 11.70 s; 1.04x faster
futures 1.38 s -> 1.32 s; 1.04x faster
coercions 1.53 s -> 1.48 s; 1.03x faster
script-servo 267.78 s -> 257.57 s; 1.03x faster
encoding 1.74 s -> 1.69 s; 1.03x faster
wg-grammar 9.25 s -> 9.05 s; 1.02x faster
cargo 81.91 s -> 80.69 s; 1.01x faster
clap-rs 10.87 s -> 10.73 s; 1.01x faster
cranelift-codegen 16.93 s -> 16.90 s; 1.00x faster
crates.io 76.38 s -> 75.89 s; 1.00x faster
ctfe-stress-2 9.38 s -> 9.35 s; 1.00x faster
html5ever 6.04 s -> 6.02 s; 1.00x faster
sentry-cli 67.67 s -> 67.43 s; 1.00x faster
style-servo 61.03 s -> 60.56 s; 1.00x faster
webrender_api 29.39 s -> 29.22 s; 1.00x faster
I apologise for the lateness of this report, but here’s a breakdown of the current state of play on Rustup:
Building rustup with pipelining
This experiment was carried out on a Lenovo T480, building rustup using the pipelining capability in recent nightly Rust. The laptop has an NVMe SSD, 32 GB of RAM, and an i7-8550U (4 cores, 2 threads per core) at 1.8 GHz, peaking at 2.6 GHz.
The git revision of rustup was 2e97e32 for this test. This was a 266 crate build sequence, each time done from scratch.
As a baseline, building with current stable (1.35.0) in debug mode:
Command being timed: "cargo build"
User time (seconds): 401.88
System time (seconds): 17.19
Percent of CPU this job got: 579%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.34
Or in release mode:
Command being timed: "cargo build --release"
User time (seconds): 758.69
System time (seconds): 16.44
Percent of CPU this job got: 346%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:43.60
The reduction in apparent parallelism on the release builds likely comes down to the fact that rustup has quite a linear end to its build. We have three crates in our repo: download, then rustup, and then rustup-init, which is the final CLI app itself. While these don’t form a completely linear dep chain, the build seems quite single-threaded in the sense that I rarely see more than one of the crates being built at a time.
Unscientifically, we also see a choke point before then, with serde_derive, then cookie_store, then reqwest each being the only crate under compilation.
The tests were done on a relatively unloaded system: a general desktop session was running, but nothing really using a lot of CPU/IO; a 99%-idle type of situation.
Nightly
Nightly builds were done with cargo 1.37.0-nightly (545f35425 2019-05-23)
First up, without pipelining, debug:
Command being timed: "cargo +nightly build"
User time (seconds): 424.43
System time (seconds): 19.15
Percent of CPU this job got: 583%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.01
And release:
Command being timed: "cargo +nightly build --release"
User time (seconds): 771.45
System time (seconds): 17.14
Percent of CPU this job got: 347%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:46.69
So baseline vs. nightly without pipelining is pretty much the same in terms of the wallclock build time and also in terms of the parallelism achieved. This is good since it helps us to see how stable -> potential-new-stable behaves.
Nightly with pipelining in debug:
Command being timed: "env CARGO_BUILD_PIPELINING=true cargo +nightly build"
User time (seconds): 432.66
System time (seconds): 19.15
Percent of CPU this job got: 625%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.28
Nightly with pipelining in release:
Command being timed: "env CARGO_BUILD_PIPELINING=true cargo +nightly build --release"
User time (seconds): 788.07
System time (seconds): 17.44
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:27.17
As we can see, the parallelism definitely went up with pipelining, though for a debug build the gain pretty much only mitigates the additional cost of whatever has gone into the compiler since current stable. The release build benefits somewhat more, gaining nearly 19 s vs. non-pipelined and 16 s vs. current stable.
This isn’t a huge gain, and against a roughly 17-minute CI cycle it won’t really be noticed, but it is visible, and for someone doing regular compiles that affect more than the last few crates it will definitely be appreciated. Heck, every second before rls can respond is painful, so if this can also be used in rls then that’s worth it.
Thanks for your work on this,
Daniel.
Note: https://github.com/rust-lang/rls/issues/1484
Also, I once got a spurious error: error: crate `openssl` required to be available in rlib format, but was not found in this form.
Could you detail how you generated the error message about crates and the rlib format? That sounds like a bug in Cargo.
As far as I remember, I was just building a crate with cargo build. I don’t remember whether it was in release or debug mode. Maybe VS Code with RLS was also active. The build failed, but after I repeated the build command it went through.
I tried running while true; do sleep 2 && rm -Rf target/debug/ && lowps cargo -v build --all-features || break; done for a while with pipelining active, but haven’t reproduced the issue.
In TiKV we are attempting to upgrade our toolchain and turn on pipelining. Here is the PR: https://github.com/tikv/tikv/pull/4913
Our release build sees a 12% reduction in full build time on a 40-core machine.
This time I also tested cargo bench, and for running cargo bench --no-run I see an interesting 27% and 22% reduction for full and partial rebuilds respectively. I’m guessing this has to do with the large number of bench bins we have.
rust-simd-noise crate, Windows 10, nightly as of 2019-07-17, Ryzen 3900X (12 physical / 24 logical cores):
- release build, pipelining on: 31 seconds
- release build, pipelining off: 33 seconds
This 2-second difference was repeatable.
Info for the ggez crate devel branch, which might be somewhat of an odd duck because it has a lot of small, simple dependencies and a few large, complicated ones; currently 245 in total. I also tried on two different computers, a Ryzen 2400G (4c/8t) and a Ryzen 1700 (8c/16t):
- Ryzen 2400G debug clean nightly w/o pipelining: 1m 25s, w/ pipelining: 1m 25s, difference: 0%
- Ryzen 2400G release clean nightly w/o pipelining: 2m 52s, w/ pipelining: 2m 54s, difference: 1.1% slower
- Ryzen 1700 debug clean nightly w/o pipelining: 53s, w/ pipelining: 55s, difference: 3.7% slower
- Ryzen 1700 release clean nightly w/o pipelining: 1m 40s, w/ pipelining: 1m 40s, difference: 0%
Bit surprised by this, actually. I suppose there are just enough things to build, with a narrow and deep dependency tree, that it’s already overlapping as much as it can.