Exploring Crate Graph Build Times with `cargo build -Ztimings`

Just landed on tonight's nightly @ehuss has done some amazing work to add a new unstable flag to Cargo, -Z timings, which can be used to help visualize how crates are getting compiled for you and where time is being spent. I wanted to post an announcement for this in case anyone's interested in giving it a spin! While we don't have a ton of plans for stabilization yet, we'd love to see where this feature can go.

To test this out you'll first want to update to the latest nightly:

$ rustup update nightly
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'

  nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.39.0-nightly (9b9d2aff8 2019-09-19)

info: checking for self-updates

Next you'll want to build locally with:

$ cargo +nightly build -Z timings

The timings option takes a list of comma-separated values, and defaults to printing information on the terminal about crate timings and generating a *.html report with everything in a human-readable form. You can read a bit more about the option online.

Some example graphs look like:

The graphs should be able to graphically show you what crates are taking a long time to compile, where crates are being compiled, etc. You can also see the effect that pipelined compilation has on your compile graph to see when dependencies get unlocked. The documentation for -Z timings has more information about how to interpret the graphs.

If you're curious give this flag a spin, and if you have feedback please open an issue!

49 Likes

As you flagged in serde-rs/serde#1146, procedural macros currently never get pipelined with their dependencies. Would it be reasonable for Cargo to automatically apply the transformation of turning all dependency-having proc macro crates into small shims around an underlying non proc macro impl crate to get better pipelining?

#[proc_macro_derive(Deserialize, attributes(serde))]
pub fn derive_deserialize(input: TokenStream) -> TokenStream {
    serde_derive_impl::derive_deserialize(input)
}

Before:

After:

1 Like

I don't personally know of (and I don't think there is) a way for Cargo to automatically perform that sort of transformation today, but that's the topic of this rustc issue, allowing more pipelining to happen with not only procedural macros but also binaries.

Also 0.56 seconds is actually crazy long for a serde_derive that has a tiny shim function like you mentioned, that may be worth investigating where that 500ms is going.

2 Likes

Here is the split commit that I was timing on: dtolnay/serde@aa70f0f.

Also if anyone's looking at graphs and feels that something should be compiled in parallel but isn't, turns out Cargo isn't impervious to bugs, and this is an excellent resource to discover those kinds of bugs!

4 Likes

Great work, this is an excellent tool and have already been super useful for us!

Spent the last few hours looking at our builds and fixing some bottlenecks and this was great to see just that and what crates to focus on to maximize parallelism and get things started as early as possible.

One particular dependency we ran into was the old winapi 0.2 crate which on Windows release build on a my 16-core Threadripper takes a whopping 65 seconds (!) to compile (out of a total 3.5 minute build). And as tons of smaller crates such as iovec and bytes depend on it, very big crates in the ecosystem gets pushed out 40-50 sec in the build reducing some major parallelism opportunities in our specific builds.

Filed a PR specifically on iovec to get rid of that old winapi dependency, and are tracking some more to likely make a major improvement to Windows build times:

Thanks!

4 Likes

Thanks for the awesome work on this!

I noticed on one of my side projects that the final link steps was taking almost 20 seconds, with only 4 seconds spent on codegen. Do you have any idea why that slowdown could be there? It makes it very frustrating to write integration tests and small incremental changes.

Here's a screenshot of the build times:

Project on github: https://github.com/jyn514/rcc

You're not the only one running into this actually, Cargo itself was building winapi 0.2! We had the same experience you did, very quickly realizing that it was a big chunk of the build and was also relatively easy to remove for us! (certainly not the case for everyone)

1 Like

I'm not entirely sure what you mean by link steps, but it sounds like you're talking about the "rcc" library crate and then the "rcc" binary? A rustc feature like this could help Cargo pipeline the binary with the library a bit more, but this chart won't actually help you learn much about why rustc is slow, only about when it was scheduled.

I see, I'm probably asking in the wrong place then.

When I say 'link steps', I mean running ld:

$ cargo +nightly rustc --bin rcc -- -Z time-passes
   Compiling rcc v0.1.0 (/home/joshua/Documents/Programming/rust/rcc)
... snip ...
    time: 3.862; rss: 201MB     running linker
  time: 3.865; rss: 201MB       linking
time: 4.292; rss: 139MB         total

Wow! Is there a plan to also print this out as JSON info for back-end analysis later, finding regressions and so on?

Ah yes, for actually debugging why the linker takes so long to execute you'd have to dig in more beyond just the data coming out of Cargo. That does looks suspicious though!

This is actually supported today! You can pass -Z timings=json. You can find more information for this in the online documentation

1 Like

This is a fantastic tool! Great work, @ehuss. You can really see the benefits of pipelining.

One small suggestion: it would be nice if the Gantt chart showed the front-end/codegen split for each crate as well as the total time. Currently, to determine that split you have to eyeball it or scroll down to the table.

4 Likes

Oh even better, thx!

This is slightly offtopic but if you have linking troubles, you could try applying the suggestions from this comment of mine: https://github.com/rust-gamedev/wg/issues/50#issuecomment-527160611

1 Like

Using lld and no debug info really helped, thank you so much! I went from ~5.5 seconds with a hot cache to ~4 seconds with a cold cache and ~.5 seconds with a hot cache, that makes small changes much more pleasant. It didn't help as much with tests, maybe because they don't usually have a hot cache? Still very helpful, thank you.

2 Likes

I am interested in trying this on rustc itself, but rustc uses the x.py script for the top level build commands, rather than cargo itself. So how should I invoke this tool for rustc? Thanks!

@nnethercote

You could place a cargo.arg("-Ztimings"); after

That will pass -Ztimings to every cargo invocation performed by ./x.py.

It seems no cargo-timing.html file is generated when the build fails.

1 Like

I have reduced clean compilation time of cg_clif with more than 2min with help of -Ztimings:

9 Likes