Plan to test parallel rustc

Good evening all! The parallel rustc working group (you can find us on Zulip!) has been working on a parallel compiler for quite some time now, and we're getting very close to shipping! What we'd like to do now is lay out a plan for how we can get users testing this before we turn it on by default.

Note: if you're not familiar, the parallelism in the "parallel compiler" refers specifically to parallelism in the frontend static analysis (think typeck, borrowck, etc). Rustc already has a parallel LLVM backend, and that's here to stay and isn't changing.
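
For concreteness, on a parallel-enabled build the number of frontend threads is capped by rustc's unstable -Z threads flag (that flag name is an assumption on my part and may differ), while the pre-existing backend parallelism is still the familiar -C codegen-units:

RUSTFLAGS="-Zthreads=4" cargo +nightly build          # frontend: cap analysis threads (flag name assumed)
RUSTFLAGS="-Ccodegen-units=16" cargo +nightly build   # backend: the existing parallel LLVM codegen, unchanged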

We would like to start shipping parallel-enabled nightlies as soon as we're confident that parallelism is ready to be turned on by default, but we would like user testing before that. Unfortunately, unlike other compiler features, this isn't a simple flag to pass to the compiler; the whole compiler has to be rebuilt in "parallel mode". Nightlies are not currently built in "parallel mode", and as a result they can't be used to test parallelism.

We do, however, want users to be able to easily test these changes! That typically means using rustup in one form or another. To that end the parallel working group would like to propose the following plan for testing parallel rustc:

  1. One day this week (tomorrow, Tuesday if possible) we will land a PR to the master branch of rust-lang/rust which enables the parallel compiler by default. This would make its way to the nightly channel, for example nightly-2019-12-17.
  2. Immediately after the nightly is published, we revert this default, switching the master branch to what it is today (no parallelism in the frontend).
  3. Next, make a follow-up post on internals about gathering data. That way users can rustup to one nightly to test parallel rustc, and use the previous nightly as a baseline measurement of what today's performance looks like (assuming no major changes land in that one 24-hour window); a rough sketch of the commands is below.
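
Concretely, the testing workflow in step 3 would look roughly like this (the dates reuse the example from step 1; the follow-up post will confirm the real ones):

rustup toolchain install nightly-2019-12-16   # previous nightly, as the baseline (example date)
rustup toolchain install nightly-2019-12-17   # parallel-enabled nightly (example date)
cargo clean && time cargo +nightly-2019-12-16 build
cargo clean && time cargo +nightly-2019-12-17 build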

Would others be ok with this plan? We're relatively confident that the whole ecosystem won't break if one nightly has parallelism turned on by default, and even if lots of breakage does happen, it's at least only one day of breakage! Note that we also want to get this done while we're all still at work; it would be disastrous for something like this to go out over the holidays when most people aren't around to fix bugs!

31 Likes

Awesome to hear that parallel rustc is getting close!

This seems like a reasonable plan with our current infrastructure.

Ideally, I'd love to see a way for us to produce "special" builds that rustup can install: builds that won't ever be picked up by a plain rustup update or rustup install nightly, but will only get installed if specifically requested. Does that seem like something that would require an excessive amount of work in rustup to enable?

I can think of several other things we might want to test that way, rather than making everything we want to test into a runtime flag.

7 Likes

I agree that better infrastructure for these one-off builds would be great to have! At this time, though, I don't think it'd be easy to whip up, so we think a single "hopefully relatively stable" nightly is our best bet.

2 Likes

(Seconding a more robust system for arbitrary toolchain installs at some future date - this is one of the main obstacles to being able to run lolbench measurements on every merge, not just nightlies.)

1 Like

This plan sounds ok to me for this test. The PR that enables parallel rustc will go through CI in order to land to master, ensuring that it produces a compiler that behaves at least somewhat correctly. And even if something unexpected and catastrophic happens and we get a "bad" Nightly, there should still be a "good" one the day before and the day after.

As for a better way to distribute one-off builds, even if it's not out of the box in rustup, could we recommend https://github.com/kennytm/rustup-toolchain-install-master or some other tool that downloads a build and registers it with rustup toolchain link?
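
For reference, a rough sketch of that flow with rustup-toolchain-install-master (going from the tool's README; details may have shifted), or manually with rustup toolchain link (placeholder names throughout):

cargo install rustup-toolchain-install-master
rustup-toolchain-install-master <commit-sha>   # fetches that commit's CI artifacts
cargo +<commit-sha> build                      # the toolchain is named after the commit

rustup toolchain link my-one-off /path/to/extracted/toolchain   # manual equivalent
cargo +my-one-off build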

2 Likes

Wouldn't adding a new "nightly-parallel" channel be the proper solution?

Or changing the code so parallel mode can be turned on/off at runtime.

This shouldn't need a whole new channel; this just needs a few short-term one-off releases for test purposes. As mentioned earlier in the thread, we should have a mechanism in rustup to support explicitly requesting specific one-off builds that aren't part of any of the normal channels.

The opening post of this thread made it clear that this isn't a straightforward change.
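
For context, my understanding is that the switch is a build-time option in rust-lang/rust's own config.toml (assuming the option is still named parallel-compiler), so producing such a compiler yourself means roughly:

# in config.toml of a rust-lang/rust checkout
[rust]
parallel-compiler = true

# then rebuild the compiler
./x.py build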

One question about parallel rustc:

Are the results completely deterministic, no matter how many or how few parallel threads it runs?

Ok parallelism was enabled in https://github.com/rust-lang/rust/pull/67362 and is being disabled in https://github.com/rust-lang/rust/pull/67379. That means that tonight's nightly should have parallel support and tomorrow's nightly should be back to normal. Please report any regressions you see in CI as normal compiler bugs! I'll make a follow-up post tomorrow with instructions about what sort of data we're interested in gathering.

@SimonSapin we talked about recommending rustup-toolchain-install-master, and while it is pretty buttery-smooth, it's not quite as nice as simply using rustup; we wanted to minimize the effort needed to get a toolchain you can test locally.

@josh compilation should always be deterministic, as usual, regardless of the number of threads used. It's a bug if that's not the case!

It's possibly worth adding that we're expecting some crate-level non-determinism in this first build due to a known bug in how the jobserver acquires/releases tokens, but it shouldn't be noticeable unless you're, e.g., running with -j1. In any case, if you do notice something, please do file bugs; we want to hear about as many as possible!

To be clear so I test the right thing in the next couple days: the nightly named for 2019-12-17 is the one?

4 Likes

Is there a reported issue tracking that bug that you could link to?

I've just filed https://github.com/rust-lang/rust/issues/67385.

We'll have a dedicated post out tomorrow, but the correct nightly is installed via rustup update nightly-2019-12-18 and should report itself as rustc 1.41.0-nightly (3ed3b8bb7 2019-12-17) via rustc --version.
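
Put as commands (rustup update and rustup toolchain install are interchangeable for this purpose):

rustup update nightly-2019-12-18
rustc +nightly-2019-12-18 --version
# expected: rustc 1.41.0-nightly (3ed3b8bb7 2019-12-17)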

3 Likes

I did some preliminary tests, using wasmtime (commit hash 31472fbb5a6417ea3d9eb10417ff5ea49712998a) as a crate that takes a while to compile from scratch. I installed the parallel nightly and the previous nightly, and ensured that cargo had already downloaded all the dependencies for wasmtime.
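
For anyone wanting to reproduce this, the steps were roughly the following (cargo fetch here is just shorthand for pre-downloading the dependencies):

# from a checkout of wasmtime at the commit above
git checkout 31472fbb5a6417ea3d9eb10417ff5ea49712998a
cargo +nightly fetch                                           # make sure dependencies are already downloaded
cargo clean && time cargo +nightly build --release             # previous (non-parallel) nightly
cargo clean && time cargo +nightly-2019-12-18 build --release  # parallel nightly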

System details:

josh@jet:~$ rustc +nightly --version
rustc 1.41.0-nightly (99b89533d 2019-12-16)
josh@jet:~$ rustc +nightly-2019-12-18 --version
rustc 1.41.0-nightly (3ed3b8bb7 2019-12-17)
josh@jet:~$ nproc 
72

I first ran two builds with the non-parallel rustc, using time cargo +nightly build --release (cleaning in between), and got:

real	1m19.163s
user	15m30.836s
sys	0m19.799s
real	1m19.636s
user	15m26.183s
sys	0m19.938s

(Side note: the non-parallel rustc gave non-reproducible results, as the two builds produced different target/release/wasmtime binaries. That needs separate investigation, but it might just be an issue with some crate in wasmtime's dependencies. EDIT: It's an issue in cranelift itself, fixed upstream in commit 497b4e1ca1d33dfd54314366d8e3a27a9fea225f.)
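
(For reference, checking reproducibility here amounts to comparing hashes of the two artifacts, roughly like so; the copied file name is just for illustration:)

cargo clean && cargo +nightly build --release
cp target/release/wasmtime /tmp/wasmtime-first-build
cargo clean && cargo +nightly build --release
sha256sum /tmp/wasmtime-first-build target/release/wasmtime   # different hashes => non-reproducible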

Next, I ran two builds with the parallel rustc, using time cargo +nightly-2019-12-18 build --release (cleaning in between), and got:

real	1m9.260s
user	16m50.977s
sys	0m23.343s
real	1m9.191s
user	16m55.383s
sys	0m23.964s

So, the wall-clock time saw a consistent ~13% improvement (from roughly 79s to 69s), while the user time (total CPU time used across all cores) went up by roughly 80-90 seconds (from about 930s to about 1010s).

Also worth noting, the compiled binary size went up a bit from the non-parallel rustc to the parallel rustc.

Non-parallel rustc (1): 11899880 bytes
Non-parallel rustc (2): 11899608 bytes
    Parallel rustc (1): 11905352 bytes
    Parallel rustc (2): 11905376 bytes

Using the parallel rustc, wasmtime still passes its entire testsuite with no issues.

Next, two debug builds with non-parallel rustc:

real	1m2.775s
user	5m42.333s
sys	0m22.377s
real	1m5.372s
user	5m41.907s
sys	0m21.883s

And two debug builds with parallel rustc:

real	0m49.142s
user	6m27.121s
sys	0m24.957s
real	0m48.880s
user	6m25.960s
sys	0m24.767s

Again, user time goes up but wall-clock time goes down, this time from 62-65s to 49s.

Debug build size barely changes: 149980800 bytes with the non-parallel rustc versus 149947456 or 149947432 bytes with the parallel rustc (a slight decrease this time).

Overall, this looks like a great improvement in compile performance.

3 Likes

I ran another test with a much smaller crate (commit hash c570a015e15214be46a7fd06ba08526622738e20), and got less consistent results, much closer to the noise.

Non-parallel rustc:

real	0m19.857s
user	2m33.522s
sys	0m9.018s
real	0m19.849s
user	2m30.636s
sys	0m8.092s

Parallel rustc (EDIT: ignore the first time measurement, see below for how this happened):

real	0m20.149s
user	3m30.696s
sys	0m15.814s
real	0m18.522s
user	2m42.483s
sys	0m10.206s

Because those two builds with the parallel rustc gave such wildly different results, I tried a few more builds with the parallel rustc:

real	0m18.837s
user	2m43.548s
sys	0m10.142s
real	0m18.509s
user	2m43.081s
sys	0m10.904s
real	0m18.613s
user	2m43.926s
sys	0m10.744s

Those three runs seem far more consistent, both in user time and in wall-clock time.

EDIT: oh, I just realized the problem, and it has nothing to do with rustc. The build of git2 (and potentially other libraries) compiled C code, and that compilation used ccache, missing the cache the first time and hitting the cache the remaining times. Never mind, please ignore that first build. I'm leaving this up to document a potential pitfall for others doing build benchmarking.
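
(If you want to rule out this pitfall in your own runs, either do one untimed warm-up build so every timed run hits ccache equally, or take ccache out of the picture entirely; CCACHE_DISABLE is ccache's standard kill switch:)

CCACHE_DISABLE=1 cargo +nightly-2019-12-18 build --release   # C dependencies compile without the cache
# or: cargo clean && cargo build --release                   # untimed warm-up build, then time the later runs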

Looks like smaller crates get a decent improvement in build time as well, though not quite as substantial.

1 Like

It might be useful to also test builds where Cargo can already saturate the available cores (or come close) with separate rustc processes, as a worst case scenario.
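
One rough way to set that up is to pick a crate with a wide dependency graph and compare the two toolchains while cargo is already running as many rustc processes as there are cores; cargo's -Ztimings report (also used below in this thread) shows concurrency over time, which helps confirm the cores really are saturated:

cargo clean && time cargo +nightly build -j $(nproc) -Ztimings             # baseline, cargo-level parallelism only
cargo clean && time cargo +nightly-2019-12-18 build -j $(nproc) -Ztimings  # parallel frontend on top of that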

1 Like

I have a 4-core (8 logical processors) Windows 10 laptop. Here are my numbers for compiling cargo:

cargo> cargo clean

cargo> cargo +nightly-2019-12-17 check -Ztimings => 1m 47s
cargo> cargo +nightly-2019-12-18 check -Ztimings => 1m 44s

cargo> cargo +nightly-2019-12-17 build -Ztimings => 1m 48s
cargo> cargo +nightly-2019-12-18 build -Ztimings => 1m 43s

cargo> cargo +nightly-2019-12-17 build --release -Ztimings => 6m 44s
cargo> cargo +nightly-2019-12-18 build --release -Ztimings => 6m 44s

gist of timing files

Thanks for the initial measurements! If it's ok, though, I'd ask that you hold off on posting more results here; I'll make a dedicated thread with a lot more information about the data we'd like to gather. Stay tuned!

1 Like

Ok, I've posted a dedicated thread with more information about gathering measurements, and I'll copy over some of the results posted here so far. Thanks all!

3 Likes