Evaluating pipelined rustc compilation

alexcrichton · May 17, 2019, 4:09pm

Good Friday to you all! Recently landed in nightly is the ability for Cargo to execute rustc in a “pipelined” fashion, which has the promise of faster build times across the ecosystem. This support is turned off by default and the Cargo team is interested in gathering more data and information about this feature, and that’s where you come in! If you’re interested in faster compiles, we’re interested in getting your feedback on this feature!

To enable pipelined compilation in Cargo, you’ll need at least nightly-2019-05-17 (today’s nightly):

$ rustc +nightly -V
rustc 1.36.0-nightly (7d5aa4332 2019-05-16)
$ cargo +nightly -V
cargo 1.36.0-nightly (c4fcfb725 2019-05-15)

After doing so you can set CARGO_BUILD_PIPELINING=true or configure the build.pipelining = true key in .cargo/config:

$ CARGO_BUILD_PIPELINING=true cargo +nightly build
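
For the config-file route, a minimal sketch is below. `enable_pipelining` is a hypothetical helper name, and the sketch assumes the project's .cargo/config does not already contain a [build] table (if it does, add the pipelining key to that table by hand rather than appending a second one).

```shell
# Sketch: enable pipelining for one project via .cargo/config.
# Assumption: the file has no [build] table yet; appending a second
# [build] table would make the TOML invalid.
enable_pipelining() (
    dir="${1:-.}"                 # project root, defaults to the current directory
    mkdir -p "$dir/.cargo"
    printf '[build]\npipelining = true\n' >> "$dir/.cargo/config"
)
```

Calling `enable_pipelining` with no argument writes to .cargo/config under the current directory.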

What is “pipelined compilation”?

Today, whenever you execute cargo build, Cargo builds a DAG of the crates it needs to compile. This DAG represents the dependency graph between crates, and Cargo executes rustc for a crate only once all of that crate’s dependencies have completely finished compiling. The compiler, however, typically doesn’t have to wait for a dependency to fully finish compiling before starting the next one. In many cases all we need is “metadata” produced by the compiler to start the next compilation, and metadata from rustc is typically available much earlier in the compilation process.

With ASCII diagrams, let’s say we have a binary which depends on libB which in turn depends on libA. A compilation today might look like this:

         meta                meta
[-libA----|--------][-libB----|--------][-binary-----------]
0s        5s       10s       15s       20s                30s

With pipelined compilation, however, we can transform this to:

[-libA----|--------]
          [-libB----|--------]
                             [-binary-----------]
0s        5s       10s       15s                25s

So by simply tweaking how we call rustc we’ve shaved 5s off this compile! The exact intricacies of how this all works are somewhat complicated, but you can find some more notes on the compiler-team repository if you’re interested in more information! The general gist is that when Cargo compiles more than one crate it may be able to start rustc sooner than it does today, and if your build machine has enough parallelism this could mean faster build times.
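
The arithmetic behind the two diagrams can be spelled out as a small sketch. It assumes, as drawn, that each crate takes 10s to compile in full, that metadata is ready 5s in, and that the binary has to wait for libB to finish completely because it is a linked executable. The variable names are illustrative only.

```shell
# Timings read off the diagrams: 10s per crate, metadata ready after 5s.
full=10   # seconds to fully compile one crate
meta=5    # seconds until a crate's metadata is available

# Today: every step waits for the previous one to finish completely.
serial=$(( full + full + full ))

# Pipelined: libB starts as soon as libA's metadata exists. The binary
# still waits for libB to finish completely, since it has to be linked.
libb_start=$meta
libb_end=$(( libb_start + full ))
pipelined=$(( libb_end + full ))

echo "serial=${serial}s pipelined=${pipelined}s saved=$(( serial - pipelined ))s"
```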

When is pipelined compilation faster?

On the PR which implemented pipelined compilation in Cargo we did some loose measurements here and there, but we’ve unfortunately at this time been unable to find enough compelling use cases for this feature to justify the support in rustc and Cargo. That’s where you can come in though to help us gather more data!

In general it’s unlikely that pipelined compilation will provide an order of magnitude speedup across the board. Rather, a few ingredients are necessary for pipelined compilation to really shine:

The compiler must be called at least twice. If you’re only editing your leaf rlib, then pipelining doesn’t matter since there’s only one rustc instance.
The best wins will come from --release mode. Metadata in the compiler is currently produced just before translation to LLVM, optimization, and codegen. If metadata is produced 10ms before compilation finishes it’s not much of an opportunity to pipeline, but if metadata is produced 15s before compilation finishes then that could be 15s saved! Optimization typically takes longest in release mode.
Your machine needs to have idle parallelism. If Cargo could spawn rustc but your machine is already fully busy doing other things, then there’s no benefit to spawning rustc sooner, since the new process would simply wait for the previous rustc to finish anyway. Most Rust builds tend to be on multicore machines, however, and tend to have available cores towards the end of compilation.
Full crate graph builds may not see much benefit. We can’t pipeline all possible compilations due to details like procedural macros, build scripts, linked executables, etc. For these compilations Cargo still has to wait for the previous rustc to completely finish before proceeding.

Given all that, the real use case for pipelined compilation is when you’re incrementally compiling a project in release mode on a beefy machine and the incremental changes are a few crates down from the final product.

If you see benefits, though, we’re curious to hear about other use cases!

What measurements do we want?

The general idea of what we’d like to see is toggling CARGO_BUILD_PIPELINING=true in some scenarios and seeing if compilation is faster with pipelining enabled. Some ideas are:

Full crate graph

$ cargo clean && cargo build
$ cargo clean && CARGO_BUILD_PIPELINING=true cargo build

and release mode

$ cargo clean && cargo build --release
$ cargo clean && CARGO_BUILD_PIPELINING=true cargo build --release
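
To make these comparisons repeatable, the pairs of commands above can be wrapped in a small harness. This is only a sketch: `wall_seconds` and `compare_build` are hypothetical helper names, and the timing relies on GNU date supporting `%N`, which is an assumption about your system.

```shell
# wall_seconds CMD...: run CMD and print the elapsed wall-clock seconds on stdout.
# Assumption: GNU date (for %N). CMD's own stdout is redirected to stderr
# so that only the time ends up on stdout.
wall_seconds() (
    start=$(date +%s.%N)
    "$@" >&2
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", e - s }'
)

# compare_build CMD...: time CMD from a clean state, first without and
# then with pipelining enabled.
compare_build() (
    cargo clean || exit 1
    default=$(wall_seconds "$@")
    cargo clean || exit 1
    pipelined=$(wall_seconds env CARGO_BUILD_PIPELINING=true "$@")
    echo "default=${default}s pipelined=${pipelined}s"
)
```

For example, `compare_build cargo +nightly build` covers the debug case and `compare_build cargo +nightly build --release` the release case.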

On CI

Try enabling CARGO_BUILD_PIPELINING in nightly jobs on CI (e.g. Travis/AppVeyor/Azure Pipelines) and see if builds are faster.
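
One way to restrict this to nightly jobs is to check the toolchain in the CI script before exporting the variable. This is a sketch that is not tied to any particular CI provider; `is_nightly` is a hypothetical helper that only looks for the `-nightly` suffix shown in the `rustc -V` output earlier in this post.

```shell
# is_nightly VERSION_STRING: succeed if the string looks like a nightly
# toolchain, e.g. "rustc 1.36.0-nightly (7d5aa4332 2019-05-16)".
is_nightly() {
    case "$1" in
        *-nightly*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only opt in to pipelining when the job's rustc is a nightly build.
if command -v rustc >/dev/null 2>&1 && is_nightly "$(rustc -V)"; then
    export CARGO_BUILD_PIPELINING=true
fi
```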

Incremental Builds

Make a change to a crate (a few crates down in the dependency graph) and then measure the rebuild time with and without CARGO_BUILD_PIPELINING=true.

Other timing information

If you’ve got other scenarios you think might be worthwhile, let us know! If you can describe your scenario below, that’d also be great!
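
When describing a scenario, it can help to report the change as a percentage of the default build time. A small sketch, where `pct_faster` is a hypothetical helper and both arguments are wall-clock times in seconds:

```shell
# pct_faster DEFAULT PIPELINED: print how much faster the pipelined build was,
# as a percentage of the default time. A negative result means it was slower.
pct_faster() {
    awk -v d="$1" -v p="$2" 'BEGIN { printf "%.2f\n", (d - p) / d * 100 }'
}

pct_faster 30 25   # the 30s vs 25s example from the diagrams; prints 16.67
```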

I’d like to get more involved!

Feel free to hop into the wg-pipelining Zulip channel for this topic, and we’d love to chat!

alexcrichton · May 17, 2019, 4:17pm

I’ve just tested this out with the wasmtime repository (e.g. something including CraneLift) and got the following on a 14 core (28 hyperthread) machine:

Full Build Type	Default	Pipelined	Difference
debug	39.5s	33.8s	14.6% faster
release	52.4s	45.5s	13.1% faster

Then after touch wasmtime-obj/src/lib.rs

Build type	Default	Pipelined	Difference
debug	4.17s	4.06s	2.6% faster
release	2.50s	2.47s	1.2% faster

Then after find . -name '*.rs' | xargs touch

Build type	Default	Pipelined	Difference
debug	7.24s	7.26s	0.3% slower
release	9.73s	7.54s	22.5% faster

(note that I don’t really know that much about wasmtime/CraneLift, just wanted to test something that I figured was mostly Rust code with not a ton of procedural macros and/or C code!)

nikomatsakis · May 17, 2019, 4:33pm

I tested it on the Lark repository on a 14-core, 28 hyperthread machine. Results:

full build type	default	pipelined	difference
debug	66s	57s	13% faster
release	71s	52s	26% faster

tesuji · May 17, 2019, 4:41pm

Hmm, I wish I could test it now. Can’t wait for latest nightly?

mark-i-m · May 17, 2019, 4:44pm

I just tried this out on one of my side projects (https://github.com/mark-i-m/os2), which is an OS kernel. The build system uses make to drive cargo-xbuild, gas, and ld.

The project doesn’t really have sane non-release builds (the binaries are too large), so I will only report release values.

I ran these a few times and reported a representative value. I have a 4-core Intel Core i5-7500T CPU @ 2.70GHz, and 15GB RAM, and the system is mostly idle apart from firefox…

After building

Build type	Command	Time	Difference
full build	`make clean && time make`	40.661s	–
full build pipelined	`make clean && time CARGO_BUILD_PIPELINING=true make`	38.743s	4.7% faster

After touching lib.rs:

Build type	Command	Time	Difference
full build	`touch kernel/lib.rs && time make`	0.807s	–
full build pipelined	`touch kernel/lib.rs && time CARGO_BUILD_PIPELINING=true make`	0.848s	5.0% slower

After touching everything:

Build type	Command	Time	Difference
full build	`(find . -name '*.rs' | xargs touch) && time make`	0.885s	–
full build pipelined	`(find . -name '*.rs' | xargs touch) && time CARGO_BUILD_PIPELINING=true make`	0.891s	difference is within the error

alexcrichton · May 17, 2019, 4:45pm

Wow the incremental results for lark are also super impressive. After find . -name '*.rs' | xargs touch the build times for lark are:

build type	default	pipelined	difference
debug	7.97	7.29	8% faster
release	36.39	20.57	43% faster

mark-i-m · May 17, 2019, 4:46pm

Oh

cargo 1.36.0-nightly (c4fcfb725 2019-05-15)

no wonder my results are underwhelming

EDIT: actually, they are fine… just that my project doesn’t benefit much…

alexcrichton · May 17, 2019, 4:46pm

Er sorry if the OP isn’t clear, but no need to wait, tonight’s nightly has all the support necessary.

$ rustup update nightly
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'

  nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.36.0-nightly (7d5aa4332 2019-05-16)

info: checking for self-updates

mark-i-m · May 17, 2019, 4:48pm

@alexcrichton I’m assuming that metadata is dumped before monomorphization? If so, trait-heavy crates with lots of generic uses would see a large boost, right?

mark-i-m · May 17, 2019, 4:50pm

Hmm… ok now I’m not sure if my results are legit or not.

$ cargo +nightly --version
cargo 1.36.0-nightly (c4fcfb725 2019-05-15)

$ rustup update nightly
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'

  nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.36.0-nightly (7d5aa4332 2019-05-16)

info: checking for self-updates

alexcrichton · May 17, 2019, 4:50pm

Correct, metadata is output before any codegen is performed (or at least that’s the intent, there could be bugs).

Your results are fine btw, that’s using the right rustc/cargo (I’ve updated the OP to clarify a bit). Dates reported by tools are slightly off from build dates. (commit vs build vs time zone)

josh · May 17, 2019, 4:59pm

I’m curious if this could also work for “speculative pipelining”: if this is adopted, it’d be to rustc’s benefit to generate metadata as early as possible, even to the point of being able to tell what kinds of code changes result in metadata changes and what kinds don’t. Cache the previous metadata, try starting the next build assuming the metadata won’t change, restart it if it turns out to not match.

With cached metadata, rustc could often tell if changes affect the signatures of pub items or not. (rustc already has much of that information for incremental compilation.)

pwoolcoc · May 17, 2019, 5:26pm

Ran it on my biggest personal project (compiles ~151 crates):

                            cargo build  296.38s user 24.49s system 417% cpu 1:16.86 total
CARGO_BUILD_PIPELINING=true cargo build  316.31s user 26.07s system 442% cpu 1:17.30 total
                            cargo build --release  721.26s user 28.04s system 598% cpu 2:05.16 total
CARGO_BUILD_PIPELINING=true cargo build --release  740.17s user 28.17s system 609% cpu 2:06.13 total

Not sure if I’m doing something wrong or not, but it seems like the pipelining takes longer?

alexcrichton · May 17, 2019, 5:46pm

Those numbers look like it’s probably neither beneficial nor detrimental to that project (the numbers being seemingly in the noise), but would it be possible to build the code myself to double check?

pwoolcoc · May 17, 2019, 5:48pm

Sure, the repo is https://github.com/pwoolcoc/elefren

I should also mention that I compiled it on a 2016 MBP / 2.6 GHz Intel Core i7

josh · May 17, 2019, 5:50pm

The total time figures don’t substantially differ, but the user time goes up significantly, which suggests that some idle time became busy time.

Gankra · May 17, 2019, 6:05pm

clean build of webrender’s wrench tool (all of webrender), on my 2014 macbook pro (work laptop). Very prone to thermal throttling, tried to take cooldown breaks between every build.

build	default	pipelined	difference
debug	3m 19s	3m 04s	7.5% decrease
release	8m 13s	7m 13s	12.7% decrease

add println to webrender_api

build	default	pipelined	difference
debug	56.37s	55.80s	1% decrease
release	3m 37s	3m 30s	3% decrease

Encouraging clean results, underwhelming incremental results (I think we have too many compiler plugins/non-rust tasks for this to be a big win).

alexcrichton · May 17, 2019, 6:11pm

Ok thanks for the link! I’ve confirmed locally that on a beefy machine there’s modest speedups (2-3s in a 30s build ish) and limiting to 4 cpus doesn’t show anything awry, so I think this is likely just scheduling changes and whatnot and falling out in the noise.

nagisa · May 17, 2019, 7:09pm

On a certain internal project

build	default real	default user	pipelined real	pipelined user
debug	1m4.412s	12m54.256s	0m52.378s	13m51.236s
release	1m44.727s	6m6.764s	1m29.862s	6m15.468s

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2101.000
CPU max MHz:           2101.0000
CPU min MHz:           1000.0000
BogoMIPS:              4201.53
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku flush_l1d

tesuji · May 17, 2019, 7:15pm

I’ve just tested this with the rustup repository on a 32-core Linux machine:

(I disabled sccache in case it would affect the result)

Full Build Type	Default	Pipelined	Difference
debug	49.67s	41.62s	16.21% faster
release	4:10.02s	3:27.24s	17.11% faster
