Help test parallel rustc!

I’ll try to dig in further in a bit, but I got a very surprising result trying to build Volta: it shows some small but noticeable improvements in debug builds… and it crashed the laptop hard, twice in a row, when running the parallel release build on battery. The regular release build was fine. Once I’m on power again I’ll see what happens and figure out whether this was a fluke!

Just tested wasmtime with -Zthreads=72, and the time went from 1m9s (parallel rustc without -Zthreads) to 2m1s (parallel rustc with -Zthreads=72). The user time massively increased, from 16m50s to 50m.

Watching htop, it looks like the jobserver bits between cargo and rustc aren't actually limiting to 72 jobs at a time, because I see a load average around 300-500 and many hundreds of running (not blocked) rustc processes across many crates.
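
As a back-of-envelope check (my own arithmetic, not from the thread): if the jobserver tokens aren't shared between cargo's job count and each rustc's internal thread pool, the runnable-thread count multiplies:

```python
# If cargo launches up to J rustc processes and each rustc spawns T worker
# threads without drawing jobserver tokens for them, the worst case is
# J * T runnable threads on a machine with only J hardware threads.
def worst_case_runnable(jobs: int, threads_per_rustc: int) -> int:
    return jobs * threads_per_rustc

print(worst_case_runnable(72, 72))  # 5184
```

Even with only a fraction of those threads runnable at any instant, a load average of 300-500 is consistent with the tokens not being enforced across rustc's internal pool.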

Yes, this is a known bug we're working on; on my 28-thread system, -Zthreads=28 makes compile times quite bad locally. We'll be sure to reach out to you when we think we have a fix for this. 72 is the highest thread count we've seen so far!

Basically, it's expected that -Zthreads=72 performs pretty badly right now.


Results from compiling the spurs crate...

I ran each command twice.
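
The repeated runs below could be scripted with a small harness like this (a sketch of my own; the actual measurements used `cargo clean && time cargo build`):

```python
import subprocess
import time

def bench(cmd, runs=2):
    """Run `cmd` (a list of args) `runs` times; return each wall-clock time."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return times

# Example (hypothetical invocation):
#   bench(["cargo", "+nightly-2019-12-18", "build"])
```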

Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz


# full build, serial compiler
real	0m10.321s
user	0m36.706s
sys	0m2.307s

real	0m10.215s
user	0m36.790s
sys	0m2.333s

# full build, parallel compiler
real	0m11.396s
user	0m38.184s
sys	0m2.716s

real	0m12.243s
user	0m38.848s
sys	0m2.724s

# full build, release, serial
real	0m23.591s
user	1m27.648s
sys	0m2.180s

real	0m23.220s
user	1m27.705s
sys	0m2.254s

# full build, release, parallel

real	0m23.931s
user	1m29.144s
sys	0m2.432s

real	0m24.661s
user	1m28.921s
sys	0m2.435s

# check, serial
real	0m7.703s
user	0m20.652s
sys	0m1.827s

real	0m7.712s
user	0m20.718s
sys	0m1.816s

# check, parallel
real	0m7.921s
user	0m22.267s
sys	0m1.986s

real	0m8.068s
user	0m22.192s
sys	0m1.987s

So, in summary:

full build: parallel is 15% (~1.5 seconds) slower on average
full build, release: parallel is 3% (<1 second) slower on average
check: parallel is 3% (<1 second) slower on average
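
Recomputing those averages from the `real` times above (my arithmetic; the rounding comes out slightly differently depending on the baseline):

```python
def pct_slower(serial, parallel):
    """Percent slowdown and absolute delta of parallel vs. serial averages."""
    s = sum(serial) / len(serial)
    p = sum(parallel) / len(parallel)
    return (p - s) / s * 100, p - s

full_pct, full_d = pct_slower([10.321, 10.215], [11.396, 12.243])
rel_pct, rel_d = pct_slower([23.591, 23.220], [23.931, 24.661])
chk_pct, chk_d = pct_slower([7.703, 7.712], [7.921, 8.068])

print(f"full build: +{full_d:.1f}s ({full_pct:.0f}% slower)")
print(f"release:    +{rel_d:.1f}s ({rel_pct:.0f}% slower)")
print(f"check:      +{chk_d:.1f}s ({chk_pct:.0f}% slower)")
```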

On's codebase at 1fc6bfa on macOS Catalina 10.15.1, 3.1 GHz Dual-Core Intel Core i7, 16 GB RAM:

Full build

$ cargo clean && time cargo +nightly-2019-12-18 build
real	10m1.042s
user	18m47.605s
sys	1m50.826s
$ cargo clean && time cargo +nightly-2019-12-17 build
real	8m33.007s
user	17m14.295s
sys	1m38.205s

Soooo parallelism made it slower?

Full release build

$ cargo clean && time cargo +nightly-2019-12-18 build --release
real	18m58.546s
user	44m1.749s
sys	1m48.858s
$ cargo clean && time cargo +nightly-2019-12-17 build --release
real	16m54.551s
user	43m2.278s
sys	1m38.579s

Also slower for me :frowning:

Full check

$ cargo clean && time cargo +nightly-2019-12-18 check
real	6m35.014s
user	12m18.504s
sys	1m22.026s
$ cargo clean && time cargo +nightly-2019-12-17 check
real	5m16.316s
user	11m27.044s
sys	1m17.450s


I'm happy to run anything else that would be useful, or provide any other info; just let me know!


I tested on Fuchsia. Everything built and all tests passed correctly, so no problems there.

Here are three timings from builds that were run without parallel rustc enabled:

5284.35user 696.95system 3:00.15elapsed 3320%CPU (0avgtext+0avgdata 2963072maxresident)k
1704inputs+61728288outputs (173653major+94432200minor)pagefaults 0swaps

5337.72user 679.63system 3:07.12elapsed 3215%CPU (0avgtext+0avgdata 2962480maxresident)k
6217992inputs+61728224outputs (174415major+97968950minor)pagefaults 0swaps

5279.77user 705.58system 3:01.22elapsed 3302%CPU (0avgtext+0avgdata 2968028maxresident)k
14080inputs+61729144outputs (182088major+96475006minor)pagefaults 0swaps

And here are 3 runs with parallel rustc enabled:

5313.74user 689.57system 2:59.98elapsed 3335%CPU (0avgtext+0avgdata 2982788maxresident)k
1048inputs+60992440outputs (155345major+96491042minor)pagefaults 0swaps

5318.76user 692.16system 2:58.39elapsed 3369%CPU (0avgtext+0avgdata 2964200maxresident)k
1137304inputs+61014216outputs (160991major+98495705minor)pagefaults 0swaps

5330.31user 691.24system 2:57.67elapsed 3389%CPU (0avgtext+0avgdata 2965328maxresident)k
1040inputs+60995288outputs (159364major+97233184minor)pagefaults 0swaps

So it looks like we saw a small but significant improvement in build time. I suspect that enabling >4 threads would lead to further improvements.
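
To put a number on it (my calculation from the elapsed fields above):

```python
def secs(t: str) -> float:
    """Parse an elapsed time like '3:00.15' into seconds."""
    minutes, rest = t.split(":")
    return int(minutes) * 60 + float(rest)

serial = [secs(t) for t in ("3:00.15", "3:07.12", "3:01.22")]
parallel = [secs(t) for t in ("2:59.98", "2:58.39", "2:57.67")]
gain = (sum(serial) - sum(parallel)) / sum(serial) * 100
print(f"parallel builds were {gain:.1f}% faster on average")
```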

Note that there are some non-Rust steps in those builds, so the gain might be higher percentage-wise. (Only the Rust targets were invalidated, but we have some targets that depend on everything at the very end.)

This wasn't very scientific, so should probably be taken with a grain of salt. :slight_smile:

On the build of specifically our third party code (built with cargo; takes around 20s; included in the above builds) I did not notice any significant change.


Okay, after getting the machine on power, results:

  • cargo clean && time cargo +nightly-2019-12-18 build: 80.14 real 485.11 user 48.62 sys

  • cargo clean && time cargo +nightly-2019-12-17 build: 75.14 real 456.43 user 37.09 sys

  • cargo clean && time cargo +nightly-2019-12-18 build --release: 136.43 real 1227.13 user 43.18 sys

  • cargo clean && time cargo +nightly-2019-12-17 build --release: 138.98 real 1214.59 user 33.23 sys

(Best guess: I hit some odd condition with maxing the cores on low battery earlier.)


Quick followup: I did some perf runs on -Zthreads=72 builds, and it looks like the substantial user-time overhead (going from 16m50s to 50m, or 40m with a kernel patch to improve pipe wakeup fairness) consists heavily of attempts to do work-stealing:

  13.07%  rustc            [.] std::thread::local::LocalKey<T>::try_with
  10.95%  rustc            [.] crossbeam_epoch::internal::Global::try_advance
   6.93%  rustc            [unknown]                                    [k] 0xffffffff91a00163
   5.86%  rustc            [.] crossbeam_deque::Stealer<T>::steal
   4.14%  rustc            [.] <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
   3.02%  rustc                                     [.] _dl_update_slotinfo
   2.18%  rustc                                     [.] __tls_get_addr
   1.65%  rustc            [.] crossbeam_epoch::default::HANDLE::__getit
   1.39%  rustc            [.] crossbeam_epoch::default::pin
   1.14%  rustc                                     [.] update_get_addr
   1.14%  rustc                                   [.] __memmove_sse2_unaligned_erms
   1.10%  rustc            [unknown]                                    [k] 0xffffffff91a00b27
   1.05%  rustc            rustc                                        [.] free
   0.93%  rustc            rustc                                        [.] malloc
   0.70%  rustc                                     [.] __tls_get_addr_slow
   0.66%  rustc               [.] combineInstructionsOverFunction
   0.64%  rustc            [.] __tls_get_addr@plt

I dug further, and the calls to std::thread::local::LocalKey<T>::try_with come from crossbeam_deque::Stealer<T>::steal.
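
Summing the profile entries above that belong to the work-stealing loop or to the TLS lookups it triggers (my grouping; which symbols to count is a judgment call) suggests roughly 40% of samples go to stealing overhead rather than compilation:

```python
# Percentages copied from the perf output above; only entries that are
# part of work-stealing (crossbeam) or the TLS machinery it exercises.
steal_related = {
    "LocalKey::try_with": 13.07,
    "Global::try_advance": 10.95,
    "Stealer::steal": 5.86,
    "_dl_update_slotinfo": 3.02,
    "__tls_get_addr": 2.18,
    "HANDLE::__getit": 1.65,
    "default::pin": 1.39,
    "update_get_addr": 1.14,
    "__tls_get_addr_slow": 0.70,
    "__tls_get_addr@plt": 0.64,
}
total = sum(steal_related.values())
print(f"~{total:.1f}% of samples in stealing/TLS machinery")
```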


I used my latest project multi_file_writer and compared it with stable too.

The CPUs look more occupied in the load graph when 2019-12-18 is running. However, running the benchmark three times shows that the variance is bigger than the improvement. And on a laptop, cooling has a greater effect than the optimization.

  • Hardware: AMD Ryzen 7 PRO 3700U (8 cores), 24GB RAM, 1TB SSD (Lenovo T495)
  • Software: Kubuntu Linux 19.10, Kernel 5.3.0-24-generic
# full build
cargo clean && time cargo +nightly-2019-12-18 build
real    0m17.877s   0m16.555s   0m18.536s
user    1m35.093s   1m32.151s   1m39.613s
sys     0m4.695s    0m4.615s    0m5.139s

cargo clean && time cargo +nightly-2019-12-17 build
real    0m18.620s   0m17.727s   0m18.014s
user    1m25.832s   1m23.918s   1m24.582s
sys     0m3.884s    0m3.891s    0m3.960s

cargo clean && time cargo +stable build
real    0m18.954s   0m18.702s   0m19.721s
user    1m31.474s   1m29.693s   1m31.224s
sys     0m4.240s    0m3.892s    0m4.166s

# full release build
cargo clean && time cargo +nightly-2019-12-18 build --release
real    0m45.455s   0m40.779s   0m41.285s
user    4m42.960s   4m27.842s   4m27.474s
sys     0m5.058s    0m4.540s    0m5.088s

cargo clean && time cargo +nightly-2019-12-17 build --release
real    0m43.404s   0m43.381s   0m41.122s
user    4m35.760s   4m30.659s   4m25.856s
sys     0m4.376s    0m4.173s    0m4.233s

cargo clean && time cargo +stable build --release
real    0m40.928s   0m40.848s   0m42.138s
user    4m22.785s   4m24.846s   4m28.641s
sys     0m4.290s    0m4.274s    0m4.358s

# full check
cargo clean && time cargo +nightly-2019-12-18 check
real    0m12.896s   0m13.526s   0m12.870s
user    0m59.298s   0m59.617s   0m58.271s
sys     0m3.352s    0m3.519s    0m3.672s

cargo clean && time cargo +nightly-2019-12-17 check
real    0m13.727s   0m14.880s   0m13.569s
user    0m49.721s   0m53.186s   0m51.119s
sys     0m2.922s    0m2.983s    0m3.023s

cargo clean && time cargo +stable check
real    0m14.428s   0m14.235s   0m14.488s
user    0m52.424s   0m52.405s   0m53.769s
sys     0m3.142s    0m2.962s    0m3.148s

My CPU layout: [image]

If there is another benchmark, I would be happy to run that too.

Testing against reso_dd at commit 2d21506, a soon-to-be-released crate that is 2.2 MB of serde-annotated structs. The only functions it contains are custom serde serialization/deserialization. Dependencies are serde and chrono.

For this particular type of workload, the speedup seems quite significant.

Operating system: macOS 10.15.1. Hardware: MacBook Pro (16-inch, 2019), 2.4 GHz 8-Core Intel Core i9, 32 GB 2667 MHz DDR4.

# full build
cargo +nightly-2019-12-18 build -p reso_dd  63.89s user 3.26s system 208% cpu 32.208 total
cargo +nightly-2019-12-17 build -p reso_dd  58.59s user 2.94s system 134% cpu 45.696 total

# full release build
cargo +nightly-2019-12-18 build -p reso_dd --release  136.12s user 3.42s system 390% cpu 35.720 total
cargo +nightly-2019-12-17 build -p reso_dd --release  129.22s user 3.11s system 289% cpu 45.679 total

# full check
cargo +nightly-2019-12-18 check -p reso_dd  53.91s user 2.51s system 212% cpu 26.513 total
cargo +nightly-2019-12-17 check -p reso_dd  49.01s user 2.22s system 134% cpu 38.152 total

Using commit 80066510b54f1ae05d51f65b52e18bdd5357016c of differential dataflow and compiling some of the examples, I found:

  • a noticeable speedup when doing a full build, i.e. when dependencies are being compiled (~10% for debug, ~5% for release)
  • no speedup (or: lost in the noise) when just compiling the final binaries. This makes sense: rustc timings indicate most time is spent in LLVM, which also explains the debug/release difference. (I don't know how much re-typechecking the `touch` will cause.)
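
Taking just the `progress` example's elapsed times below (my arithmetic):

```python
def secs(t: str) -> float:
    """Parse an elapsed time like '2:20.92' into seconds."""
    minutes, rest = t.split(":")
    return int(minutes) * 60 + float(rest)

debug_gain = (secs("0:36.62") - secs("0:31.50")) / secs("0:36.62") * 100
release_gain = (secs("2:20.92") - secs("2:11.90")) / secs("2:20.92") * 100
print(f"debug: {debug_gain:.0f}% faster, release: {release_gain:.0f}% faster")
```

The full-build numbers for this single example come out a bit higher than the ~10%/~5% quoted above, which average over several examples.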
# Parallel debug builds:

$ rm -rf target

$ /usr/bin/time cargo +nightly-2019-12-18 build --example progress
    Finished dev [unoptimized + debuginfo] target(s) in 31.48s
110.99user 4.06system 0:31.50elapsed 365%CPU (0avgtext+0avgdata 920740maxresident)k
440inputs+1249848outputs (7major+1433125minor)pagefaults 0swaps

$ /usr/bin/time cargo +nightly-2019-12-18 build --example pagerank
    Finished dev [unoptimized + debuginfo] target(s) in 11.17s
28.30user 0.69system 0:11.17elapsed 259%CPU (0avgtext+0avgdata 804092maxresident)k
0inputs+444360outputs (0major+307311minor)pagefaults 0swaps

$ /usr/bin/time cargo +nightly-2019-12-18 build --example monoid-bfs
    Finished dev [unoptimized + debuginfo] target(s) in 10.13s
25.70user 0.81system 0:10.14elapsed 261%CPU (0avgtext+0avgdata 770656maxresident)k
0inputs+413664outputs (0major+290452minor)pagefaults 0swaps

# Non-parallel debug builds

$ rm -rf target

$ /usr/bin/time cargo +nightly-2019-12-17 build --example progress
    Finished dev [unoptimized + debuginfo] target(s) in 36.61s
101.69user 3.60system 0:36.62elapsed 287%CPU (0avgtext+0avgdata 960348maxresident)k
32inputs+1249832outputs (0major+1290682minor)pagefaults 0swaps

$ /usr/bin/time cargo +nightly-2019-12-17 build --example pagerank
    Finished dev [unoptimized + debuginfo] target(s) in 10.59s
28.35user 0.87system 0:10.60elapsed 275%CPU (0avgtext+0avgdata 845212maxresident)k
0inputs+444344outputs (0major+322688minor)pagefaults 0swaps

$ /usr/bin/time cargo +nightly-2019-12-17 build --example monoid-bfs
    Finished dev [unoptimized + debuginfo] target(s) in 10.11s
26.29user 0.71system 0:10.12elapsed 266%CPU (0avgtext+0avgdata 755828maxresident)k
0inputs+413624outputs (0major+277129minor)pagefaults 0swaps

# Parallel release builds 

$ rm -rf target

$ /usr/bin/time cargo +nightly-2019-12-18 build --example progress --release
    Finished release [optimized + debuginfo] target(s) in 2m 11s
297.09user 4.98system 2:11.90elapsed 229%CPU (0avgtext+0avgdata 2827380maxresident)k
0inputs+1018456outputs (0major+2283912minor)pagefaults 0swaps

$ touch examples/

$ /usr/bin/time cargo +nightly-2019-12-18 build --example progress --release
    Finished release [optimized + debuginfo] target(s) in 1m 35s
132.45user 1.34system 1:35.33elapsed 140%CPU (0avgtext+0avgdata 2835180maxresident)k
0inputs+424168outputs (0major+993830minor)pagefaults 0swaps

# Non-parallel release builds

$ rm -rf target

$ /usr/bin/time cargo +nightly-2019-12-17 build --example progress --release
    Finished release [optimized + debuginfo] target(s) in 2m 20s
289.44user 4.20system 2:20.92elapsed 208%CPU (0avgtext+0avgdata 2828940maxresident)k
0inputs+1042776outputs (0major+2153859minor)pagefaults 0swaps

$ touch examples/

$ /usr/bin/time cargo +nightly-2019-12-17 build --example progress --release
    Finished release [optimized + debuginfo] target(s) in 1m 37s
134.14user 1.39system 1:37.55elapsed 138%CPU (0avgtext+0avgdata 2845420maxresident)k
0inputs+448848outputs (0major+1003126minor)pagefaults 0swaps

Timing data for the turtle crate.

This is using a fairly powerful CPU. Full system info at the end of this comment.

cargo clean && time cargo +nightly-2019-12-18 build

real	0m23.471s
user	4m30.381s
sys	0m17.490s

cargo clean && time cargo +nightly-2019-12-17 build

real	0m26.377s
user	4m3.021s
sys	0m13.118s

cargo clean && time cargo +nightly-2019-12-18 build --release

real	0m43.751s
user	10m41.284s
sys	0m17.831s

cargo clean && time cargo +nightly-2019-12-17 build --release

real	0m47.365s
user	10m21.766s
sys	0m13.473s

The turtle crate has a lot of examples, so I thought it would be worth it to run those too:

cargo clean && time cargo +nightly-2019-12-18 build --examples

real	0m33.796s
user	6m52.774s
sys	0m36.885s

cargo clean && time cargo +nightly-2019-12-17 build --examples

real	0m36.875s
user	6m27.881s
sys	0m31.876s

System info


Testing Beamium (a Prometheus to Warp10 metrics forwarder) at commit 819ef8c:

cargo +nightly-2019-12-18 build
534,97s user 30,78s system 604% cpu 1:33,65 total

cargo +nightly-2019-12-17 build
517,03s user 26,19s system 577% cpu 1:34,11 total

cargo +nightly-2019-12-18 build --release
1319,95s user 29,98s system 515% cpu 4:22,00 total

cargo +nightly-2019-12-17 build --release 
1356,64s user 28,18s system 496% cpu 4:39,15 total

cargo +nightly-2019-12-18 check
341,53s user 26,41s system 689% cpu 53,353 total

cargo +nightly-2019-12-17 check 
338,01s user 22,66s system 654% cpu 55,137 total

On Rust stable, just in case (1.40.0):

cargo build
526,18s user 27,23s system 619% cpu 1:29,33 total

cargo build --release 
1322,03s user 27,14s system 520% cpu 4:19,12 total

cargo check
354,53s user 23,55s system 630% cpu 1:00,01 total

Tested on Dell XPS 15 (9550):

  • Intel® Core™ i7-6700HQ CPU @ 2.60GHz
  • 16GB of RAM
  • Ubuntu 19.10 64bits

Timing for grayarea-runtime on a 40-core Intel machine (32 GB RAM, Ubuntu 19) shows about a 25% perf boost for debug builds and check, and about 10% for release builds:

+nightly-2019-12-17 build
real    1m22.457s
user    8m46.634s
sys     0m37.100s
+nightly-2019-12-18 build
real    1m2.301s
user    10m8.246s
sys     0m51.905s
+nightly-2019-12-17 build --release
real    1m51.374s
user    26m32.222s
sys     0m42.592s
+nightly-2019-12-18 build --release
real    1m37.700s
user    29m30.259s
sys     0m58.299s
+nightly-2019-12-17 check
real    1m5.725s
user    5m40.797s
sys     0m26.380s
+nightly-2019-12-18 check
real    0m44.395s
user    6m37.955s
sys     0m39.464s

P.S. Do a dry run first to download the deps.

Building substrate 5dece712a on an AMD EPYC Rome machine (64 cores/128 threads, 256 GB RAM).

cargo clean && time cargo +nightly-2019-12-17 build
    Finished dev [unoptimized + debuginfo] target(s) in 6m 15s

real    6m17.934s
user    135m2.361s
sys     17m1.909s
cargo clean && time cargo +nightly-2019-12-18 build
    Finished dev [unoptimized + debuginfo] target(s) in 6m 29s

real    6m32.575s
user    228m14.238s
sys     32m0.462s
cargo clean && time cargo +nightly-2019-12-17 build --release
    Finished release [optimized] target(s) in 8m 38s

real    8m41.842s
user    411m59.930s
sys     23m30.786s
cargo clean && time cargo +nightly-2019-12-18 build --release
    Finished release [optimized] target(s) in 8m 45s

real    8m48.119s
user    537m58.740s
sys     45m5.127s
cargo clean && time cargo +nightly-2019-12-17 check
    Finished dev [unoptimized + debuginfo] target(s) in 4m 46s

real    4m49.572s
user    154m7.090s
sys     20m57.464s
cargo clean && time cargo +nightly-2019-12-18 check
    Finished dev [unoptimized + debuginfo] target(s) in 4m 26s

real    4m29.848s
user    210m40.669s
sys     39m16.688s

Running cargo +nightly-2019-12-18 check with RUSTFLAGS="-Zthreads=64", I get:

    Finished dev [unoptimized + debuginfo] target(s) in 20m 54s

real    20m57.729s
user    746m42.182s
sys     1532m35.727s
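
The striking thing about that run is the kernel time; parsing the user/sys fields above (my arithmetic):

```python
def to_minutes(t: str) -> float:
    """Parse a `time` field like '746m42.182s' into minutes."""
    minutes, seconds = t[:-1].split("m")
    return int(minutes) + float(seconds) / 60

user = to_minutes("746m42.182s")
sys_time = to_minutes("1532m35.727s")
print(f"sys/user ratio: {sys_time / user:.2f}")
```

Spending roughly twice as long in the kernel as in userspace is consistent with heavy contention (wakeups and scheduling churn) rather than useful compilation work.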

The second time I ran it, I got:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', src/libcore/
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/
   4: core::fmt::write
             at src/libcore/fmt/
   5: std::io::Write::write_fmt
             at src/libstd/io/
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/
   9: std::panicking::default_hook
             at src/libstd/
  10: rustc_driver::report_ice
  11: std::panicking::rust_panic_with_hook
             at src/libstd/
  12: rust_begin_unwind
             at src/libstd/
  13: core::panicking::panic_fmt
             at src/libcore/
  14: core::result::unwrap_failed
             at src/libcore/
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

error: internal compiler error: unexpected panic

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report:

note: rustc 1.41.0-nightly (3ed3b8bb7 2019-12-17) running on x86_64-unknown-linux-gnu

note: compiler flags: -Z threads=64 -C debuginfo=2 --crate-type lib

note: some of the compiler flags provided by cargo are hidden

query stack during panic:
end of query stack

Testing sv-parser at 6e887e2. The testing environment is an Intel Xeon Gold 6134 CPU @ 3.20GHz (16 cores/32 threads) with 1.5TB DRAM. Times improved by about 15-20%.

cargo clean && time cargo +nightly-2019-12-17 build

real    2m35.962s
user    5m6.225s
sys     0m21.336s

cargo clean && time cargo +nightly-2019-12-18 build

real    2m10.055s
user    5m40.738s
sys     0m26.435s

cargo clean && time cargo +nightly-2019-12-17 build --release

real    2m49.236s
user    16m11.879s
sys     1m29.441s

cargo clean && time cargo +nightly-2019-12-18 build --release

real    2m24.627s
user    17m15.008s
sys     1m26.060s

cargo clean && time cargo +nightly-2019-12-17 check

real    2m5.321s
user    2m55.150s
sys     0m13.223s

cargo clean && time cargo +nightly-2019-12-18 check

real    1m41.439s
user    3m26.564s
sys     0m15.532s

On my 2-CPU, 8-thread server (an old Xeon E5), all of my projects were faster with the 2019-12-18 nightly: across all projects, the best was 1.51x faster for check --release and the worst was 1.01x faster for another check --release. In no case was it slower.

Adding another set of data points, I just grabbed a few repos people were using in this thread. Seeing awesome improvements, nice job to the team working on this.


  • Processor: AMD 3970X
  • Memory: 64GB DDR4 @ 3000MHz C15
  • Storage: PCIE-Gen4 Corsair MP600


sv-parser @ 6e887e:
  2019-12-17 Build: 1:57.42
  2019-12-18 Build: 1:37.97

crates-io @ e0d149:
  2019-12-17 Build: 0:57.62
  2019-12-18 Build: 0:42.93

turtle @ 9efd23:
  2019-12-17 Build: 0:22.69
  2019-12-18 Build: 0:12.23

A build of Materialize is 5% faster on 2019-12-17, i.e. slower with the parallel compiler.

FreeBSD 13.0-current, 32-core (64-thread) 2990wx, 64gb RAM.

cargo +nightly-2019-12-17 build --release  6592.05s user 372.31s system 2733% cpu 4:14.79 total
cargo +nightly-2019-12-18 build --release  7017.31s user 436.78s system 2791% cpu 4:27.08 total

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.