Evaluating pipelined rustc compilation

Good Friday to you all! Recently landed in nightly is the ability for Cargo to execute rustc in a “pipelined” fashion, which has the promise of faster build times across the ecosystem. This support is turned off by default and the Cargo team is interested in gathering more data and information about this feature, and that’s where you come in! If you’re interested in faster compiles, we’re interested in getting your feedback on this feature!

To enable pipelined compilation in Cargo, you’ll need at least nightly-2019-05-17 (today’s nightly):

$ rustc +nightly -V
rustc 1.36.0-nightly (7d5aa4332 2019-05-16)
$ cargo +nightly -V
cargo 1.36.0-nightly (c4fcfb725 2019-05-15)

After doing so you can set CARGO_BUILD_PIPELINING=true or configure the build.pipelining = true key in .cargo/config:

$ CARGO_BUILD_PIPELINING=true cargo +nightly build

What is “pipelined compilation”?

Cargo today builds a DAG of crates to build whenever you execute cargo build. This represents the dependency graph between crates, and Cargo will execute rustc for a crate once all of the dependencies for that compilation have completely finished. The compiler, however, typically doesn’t have to fully wait for a dependency to finish compiling before starting the next one. In many cases all we need is “metadata” produced by the compiler to start the next compilation, and metadata from rustc is typically available much earlier in the compilation process.

With ASCII diagrams, let’s say we have a binary which depends on libB which in turn depends on libA. A compilation today might look like this:

         meta                meta
[-libA----|--------][-libB----|--------][-binary-----------]
0s        5s       10s       15s       20s                30s

With pipelined compilation, however, we can transform this to:

[-libA----|--------]
          [-libB----|--------]
                             [-binary-----------]
0s        5s       10s       15s                25s

So by simply tweaking how we call rustc we’ve shaved 5s off this compile! The exact intricacies of how this all works are somewhat complicated, but you can also find some more notes on the compiler-team repository if you’re interested in more information! The general gist is that when Cargo compiles more than one crate it may be able to start rustc sooner than it does today, and if your build machine has enough parallelism this could mean faster build times.
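
To make the arithmetic in the diagrams above concrete, here is a minimal Python sketch of the two schedules. It is only a toy model: the durations (10s per step, metadata after 5s) are the illustrative numbers from the diagrams, the function names are made up for this example, and it assumes a single linear chain with unlimited parallelism rather than Cargo's real scheduler.

```python
# Toy model of the libA -> libB -> binary chain from the diagrams above.
# Each step is (name, total_seconds, metadata_ready_after_seconds).
# Hypothetical helper names; this is not how Cargo schedules work internally.

CHAIN = [
    ("libA", 10, 5),
    ("libB", 10, 5),
    ("binary", 10, None),  # the final binary produces no metadata to wait on
]


def sequential_finish(chain):
    """Every step starts only after the previous step has fully finished."""
    return sum(total for _name, total, _meta in chain)


def pipelined_finish(chain):
    """A library may start once its dependency's metadata is ready.

    Assumption: the last step is a linked binary, so it still waits for
    everything before it to finish completely.
    """
    prev_finish = 0  # latest time at which all earlier steps are fully done
    prev_meta = 0    # time at which the previous step's metadata is ready
    for index, (_name, total, meta) in enumerate(chain):
        is_last = index == len(chain) - 1
        start = prev_finish if is_last else prev_meta
        prev_finish = max(prev_finish, start + total)
        prev_meta = start + (meta if meta is not None else total)
    return prev_finish


print(sequential_finish(CHAIN), pipelined_finish(CHAIN))  # 30 25
```

The 5s saved is exactly the gap between libA's metadata being ready and libA finishing; the binary gains nothing because it cannot start until libB is completely done.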

When is pipelined compilation faster?

On the PR which implemented pipelined compilation in Cargo we did some loose measurements here and there, but we’ve unfortunately at this time been unable to find enough compelling use cases for this feature to justify the support in rustc and Cargo. That’s where you can come in though to help us gather more data!

In general it’s unlikely that pipelined compilation will provide an order of magnitude speedup across the board. Rather a few ingredients are necessary for pipelined compilation to really shine:

  • The compiler must be called at least twice. If you’re only editing your leaf rlib, then pipelining doesn’t matter since there’s only one rustc instance.
  • The best wins will come from --release mode. Metadata in the compiler is currently produced just before translation to LLVM, optimization, and codegen. If metadata is produced 10ms before compilation finishes it’s not much of an opportunity to pipeline, but if metadata is produced 15s before compilation finishes then that could be 15s saved! Optimization typically takes longest in release mode.
  • Your machine needs to have idle parallelism. If Cargo could spawn rustc sooner but your machine is already fully busy with other work, there’s no benefit to starting it early, since the new rustc would simply have to wait for CPU time anyway. Most Rust builds tend to be on multicore machines, however, and tend to have available cores towards the end of compilation.
  • Full crate graph builds may not see much benefit. We can’t pipeline all possible compilations due to details like procedural macros, build scripts, linked executables, etc. For these compilations Cargo still has to wait for the previous rustc to finish completely before proceeding.

Given all that, the real use case for pipelined compilation is when you’re incrementally compiling a project in release mode on a beefy machine where the incremental changes are a few crates down from the final product.

If you see benefits, though, we’re curious to hear about other use cases!

What measurements do we want?

The general idea of what we’d like to see is toggling CARGO_BUILD_PIPELINING=true in some scenarios and seeing if compilation is faster with pipelining enabled. Some ideas are:

Full crate graph

$ cargo clean && cargo build
$ cargo clean && CARGO_BUILD_PIPELINING=true cargo build

and release mode

$ cargo clean && cargo build --release
$ cargo clean && CARGO_BUILD_PIPELINING=true cargo build --release
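
If you would rather script these comparisons than run them by hand, something along these lines might work. This is a rough sketch, not an official tool: the helper names are invented for this example, it assumes `cargo` is the rustup proxy (so `+nightly` is accepted), and it must be run from inside a Cargo project.

```python
import os
import subprocess
import time


def timed_build(extra_args=(), pipelining=False, toolchain="+nightly"):
    """Run `cargo clean`, then time `cargo build` and return wall-clock seconds.

    Hypothetical helper: `toolchain` is passed straight to the rustup proxy.
    """
    env = dict(os.environ)
    if pipelining:
        env["CARGO_BUILD_PIPELINING"] = "true"
    else:
        # Make sure a value inherited from the shell does not skew the baseline.
        env.pop("CARGO_BUILD_PIPELINING", None)
    subprocess.run(["cargo", toolchain, "clean"], check=True, env=env)
    started = time.perf_counter()
    subprocess.run(["cargo", toolchain, "build", *extra_args], check=True, env=env)
    return time.perf_counter() - started


def percent_faster(default_secs, pipelined_secs):
    """Positive means pipelining was faster, negative means it was slower."""
    return (default_secs - pipelined_secs) / default_secs * 100.0


# Example use (commented out so importing this file does not start a build):
#   default = timed_build(["--release"])
#   pipelined = timed_build(["--release"], pipelining=True)
#   print(f"{percent_faster(default, pipelined):.1f}% faster")
```

As a sanity check on the formula, the 40.661s and 38.743s full-build times reported later in this thread give `percent_faster(40.661, 38.743)` of roughly 4.7. Running each configuration several times and comparing representative values is advisable, since single runs can be noisy.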

On CI

Try enabling CARGO_BUILD_PIPELINING in nightly jobs on CI (e.g. Travis/AppVeyor/Azure Pipelines) and see if builds are faster.

Incremental Builds

Make a change to a crate (a few crates down in the dependency graph) and then measure before/after using CARGO_BUILD_PIPELINING=true.

Other timing information

If you’ve got other scenarios you think might be worthwhile, let us know! If you can describe your scenario below, that’d also be great!

I’d like to get more involved!

Feel free to hop into the wg-pipelining zulip channel for this topic, and we’d love to chat!

I’ve just tested this out with the wasmtime repository (e.g. something including CraneLift) and got the following on a 14-core (28-hyperthread) machine:

Full build type   Default   Pipelined   Difference
debug             39.5s     33.8s       14.6% faster
release           52.4s     45.5s       13.1% faster

Then after touch wasmtime-obj/src/lib.rs

Build type   Default   Pipelined   Difference
debug        4.17s     4.06s       2.6% faster
release      2.50s     2.47s       1.2% faster

Then after find . -name '*.rs' | xargs touch

Build type   Default   Pipelined   Difference
debug        7.24s     7.26s       0.3% slower
release      9.73s     7.54s       22.5% faster

(note that I don’t really know that much about wasmtime/CraneLift, just wanted to test something that I figured was mostly Rust code with not a ton of procedural macros and/or C code!)

I tested it on the Lark repository on a 14-core, 28 hyperthread machine. Results:

full build type   default   pipelined   difference
debug             66s       57s         13% faster
release           71s       52s         26% faster

Hmm, I wish I could test it now. Can’t wait for latest nightly?

I just tried this out on one of my side projects (https://github.com/mark-i-m/os2), which is an OS kernel. The build system uses make to drive cargo-xbuild, gas, and ld.

The project doesn’t really have sane non-release builds (the binaries are too large), so I will only report release values.

I ran these a few times and reported a representative value. I have a 4-core Intel Core i5-7500T CPU @ 2.70GHz, and 15GB RAM, and the system is mostly idle apart from firefox…

After building

Build type             Command                                                Time      Difference
full build             make clean && time make                                40.661s
full build pipelined   make clean && time CARGO_BUILD_PIPELINING=true make    38.743s   4.7% faster

After touching lib.rs:

Build type             Command                                                          Time     Difference
full build             touch kernel/lib.rs && time make                                 0.807s
full build pipelined   touch kernel/lib.rs && time CARGO_BUILD_PIPELINING=true make     0.848s   5.0% slower

After touching everything:

Build type             Command                                                                             Time     Difference
full build             (find . -name '*.rs' | xargs touch) && time make                                    0.885s
full build pipelined   (find . -name '*.rs' | xargs touch) && time CARGO_BUILD_PIPELINING=true make        0.891s   difference is within the error

Wow, the incremental results for lark are also super impressive. After find . -name '*.rs' | xargs touch the build times for lark are:

build type   default   pipelined   difference
debug        7.97      7.29        8% faster
release      36.39     20.57       43% faster

Oh :man_facepalming:

cargo 1.36.0-nightly (c4fcfb725 2019-05-15) 

no wonder my results are underwhelming :stuck_out_tongue:

EDIT: actually, they are fine… just that my project doesn’t benefit much…

Er sorry if the OP isn’t clear, but no need to wait, tonight’s nightly has all the support necessary.

$ rustup update nightly
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'

  nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.36.0-nightly (7d5aa4332 2019-05-16)

info: checking for self-updates

@alexcrichton I’m assuming that metadata is dumped before monomorphization? If so, trait-heavy crates with lots of generic uses would see a large boost, right?

Hmm… ok now I’m not sure if my results are legit or not.

$ cargo +nightly --version
cargo 1.36.0-nightly (c4fcfb725 2019-05-15)

$ rustup update nightly
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'

  nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.36.0-nightly (7d5aa4332 2019-05-16)

info: checking for self-updates

Correct, metadata is output before any codegen is performed (or at least that’s the intent, there could be bugs).

Your results are fine btw, that’s using the right rustc/cargo (I’ve updated the OP to clarify a bit). Dates reported by tools are slightly off from build dates. (commit vs build vs time zone)

I’m curious if this could also work for “speculative pipelining”: if this is adopted, it’d be to rustc’s benefit to generate metadata as early as possible, even to the point of being able to tell what kinds of code changes result in metadata changes and what kinds don’t. Cache the previous metadata, try starting the next build assuming the metadata won’t change, restart it if it turns out to not match.

With cached metadata, rustc could often tell if changes affect the signatures of pub items or not. (rustc already has much of that information for incremental compilation.)
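
As a very rough illustration of that idea, here is a Python sketch of the "assume the metadata is unchanged, restart on mismatch" loop. Everything in it is hypothetical: neither rustc nor Cargo exposes anything like these callbacks, the two compile steps run one after the other here rather than concurrently, and comparing a hash of the whole metadata blob is just one simple way to detect a change.

```python
import hashlib


def fingerprint(metadata: bytes) -> str:
    """Hash of a metadata blob, used to compare cached and fresh versions."""
    return hashlib.sha256(metadata).hexdigest()


def speculative_build(cached_metadata, compile_dependency, compile_dependent):
    """Build the dependent against cached metadata; redo it on a mismatch.

    compile_dependency() returns the dependency's fresh metadata.
    compile_dependent(metadata) returns the dependent's build output.
    Both are stand-in callbacks invented for this sketch.
    Returns (output, restarted).
    """
    # Speculate: start the dependent as if the metadata will not change.
    # A real implementation would run this concurrently with the dependency.
    speculative_output = compile_dependent(cached_metadata)

    fresh_metadata = compile_dependency()
    if fingerprint(fresh_metadata) == fingerprint(cached_metadata):
        return speculative_output, False  # the guess held, keep the result

    # The guess was wrong: discard the speculative work and rebuild.
    return compile_dependent(fresh_metadata), True
```

The trade-off is visible even in this toy form: a correct guess hides the dependency's whole compile time, while a wrong guess costs one wasted build of the dependent.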

Ran it on my biggest personal project (compiles ~151 crates):

                            cargo build  296.38s user 24.49s system 417% cpu 1:16.86 total
CARGO_BUILD_PIPELINING=true cargo build  316.31s user 26.07s system 442% cpu 1:17.30 total
                            cargo build --release  721.26s user 28.04s system 598% cpu 2:05.16 total
CARGO_BUILD_PIPELINING=true cargo build --release  740.17s user 28.17s system 609% cpu 2:06.13 total

Not sure if I’m doing something wrong or not, but it seems like the pipelining takes longer?

Those numbers look like it’s probably neither beneficial nor detrimental to that project (the numbers being seemingly in the noise), but would it be possible to build the code myself to double check?

Sure, the repo is https://github.com/pwoolcoc/elefren

I should also mention that I compiled it on a 2016 MBP / 2.6 GHz Intel Core i7

The total time figures don’t substantially differ, but the user time goes up significantly, which suggests that some idle time became busy time.
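
For anyone who wants to check that reading against the numbers posted above, here is a small Python sketch. It assumes the `cpu` percentage in that `time` output is (user + system) / total, which appears consistent with the reported figures; the function name is made up for this example.

```python
def cpu_percent(user_secs, system_secs, real_secs):
    """Average CPU utilisation over the run, as a percentage of one core."""
    return (user_secs + system_secs) / real_secs * 100.0


# Debug build figures from the post above (1:16.86 = 76.86s, 1:17.30 = 77.30s).
debug_default = cpu_percent(296.38, 24.49, 76.86)
debug_pipelined = cpu_percent(316.31, 26.07, 77.30)

# Release build figures (2:05.16 = 125.16s, 2:06.13 = 126.13s).
release_default = cpu_percent(721.26, 28.04, 125.16)
release_pipelined = cpu_percent(740.17, 28.17, 126.13)

extra_user_debug = 316.31 - 296.38  # additional CPU seconds spent in user mode
extra_wall_debug = 77.30 - 76.86    # additional wall-clock seconds

print(int(debug_default), int(debug_pipelined))      # 417 442
print(int(release_default), int(release_pipelined))  # 598 609
print(round(extra_user_debug, 2), round(extra_wall_debug, 2))  # 19.93 0.44
```

In the debug case roughly 20 extra seconds of user CPU time were spent while the wall-clock time moved by under half a second, which fits the suggestion that otherwise idle cores were doing more work.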

clean build of webrender’s wrench tool (all of webrender), on my 2014 macbook pro (work laptop). Very prone to thermal throttling, tried to take cooldown breaks between every build.

build     default   pipelined   difference
debug     3m 19s    3m 04s      7.5% decrease
release   8m 13s    7m 13s      12.7% decrease

add println to webrender_api

build     default   pipelined   difference
debug     56.37s    55.80s      1% decrease
release   3m 37s    3m 30s      3% decrease

Encouraging clean results, underwhelming incremental results (I think we have too many compiler plugins/non-rust tasks for this to be a big win).

Ok thanks for the link! I’ve confirmed locally that on a beefy machine there’s modest speedups (2-3s in a 30s build ish) and limiting to 4 cpus doesn’t show anything awry, so I think this is likely just scheduling changes and whatnot and falling out in the noise.

On a certain internal project

build     default real   default user   pipelined real   pipelined user
debug     1m4.412s       12m54.256s     0m52.378s        13m51.236s
release   1m44.727s      6m6.764s       1m29.862s        6m15.468s

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2101.000
CPU max MHz:           2101.0000
CPU min MHz:           1000.0000
BogoMIPS:              4201.53
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku flush_l1d

I’ve just tested this with the rustup repository on a 32-core Linux machine:

(I disabled sccache in case it would affect the result)

Full build type   Default    Pipelined   Difference
debug             49.67s     41.62s      16.21% faster
release           4:10.02s   3:27.24s    17.11% faster