Measuring compiler performance


So, sadly, the website has been broken for quite some time (basically since rustbuild landed). @Mark_Simulacrum has done some awesome work (which they can describe) building up a new system that uses the pre-built binaries produced by Travis. This offers the promise of very precise information about which PR triggered a regression.

I’d like to spark a discussion on two topics:

  1. What are the minimum steps we can take to get some kind of results available again?
    • I really dislike having no measurements
    • Is it just a matter of getting the old server up and running again, or something more?
  2. How should we structure our test suite?

@Mark_Simulacrum and I have had quite a few conversations on the second point and it seems like a good idea to get broader input.

My current take is that there are roughly four kinds of measurements I would like:

  • Compilation times that target very specific parts of the compiler and workflows
    • this includes incremental flows, i.e., build from scratch, apply a diff, rebuild, etc. (see the sketch after this list)
  • Regression tests for known performance issues (kind of the same)
  • “Real-world” tests that correspond to frozen versions of actual crates
    • this includes the same incremental flows, i.e., build from scratch, apply a diff, etc.
  • Performance of generated code
    • not currently measured at all; obviously somewhat different from compilation time, but perhaps can share infrastructure
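
As a rough illustration of the incremental flows mentioned above, here is a minimal sketch of how one such compile-time benchmark could be driven. The crate path, diff name, and use of cargo are assumptions on my part, not what the harness actually does; a real runner would also pin the toolchain and repeat runs to average out noise.

    use std::process::Command;
    use std::time::Instant;

    // Hypothetical helper: time one `cargo build` in `dir`, with incremental
    // compilation enabled, and return the wall-clock seconds it took.
    fn timed_build(dir: &str) -> f64 {
        let start = Instant::now();
        let status = Command::new("cargo")
            .arg("build")
            .env("CARGO_INCREMENTAL", "1")
            .current_dir(dir)
            .status()
            .expect("failed to run cargo");
        assert!(status.success(), "build failed");
        start.elapsed().as_secs_f64()
    }

    fn main() {
        // Assumed layout: a frozen benchmark crate plus a small representative diff.
        let crate_dir = "benchmarks/some-crate"; // hypothetical path
        let patch = "change.diff";               // hypothetical diff, relative to the crate

        // 1. Build from scratch (empty incremental cache).
        Command::new("cargo").arg("clean").current_dir(crate_dir).status().unwrap();
        let from_scratch = timed_build(crate_dir);

        // 2. Apply the diff and rebuild, reusing the incremental cache.
        Command::new("git").args(["apply", patch]).current_dir(crate_dir).status().unwrap();
        let after_diff = timed_build(crate_dir);

        println!("from scratch: {:.2}s, after diff: {:.2}s", from_scratch, after_diff);
    }
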

Our current set of compilation-time benchmarks has grown somewhat organically and includes a smattering of the above categories. Perhaps we should carefully review them?

I made a brief stab at a set of runtime benchmarks as well, but that never quite got off the ground. Some more suggestions for entries there would be helpful.
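
For what it's worth, a runtime benchmark entry does not need much machinery. Below is a minimal sketch of the shape such an entry could take; the workload is made up, and a real suite would want a proper harness with statistics rather than a single average.

    use std::time::Instant;

    // Hypothetical workload standing in for a real generated-code benchmark
    // (e.g. a regex match loop or a serialization round-trip).
    fn workload() -> u64 {
        (0..10_000_000u64).fold(0, |acc, x| acc ^ x.wrapping_mul(2654435761))
    }

    fn main() {
        // Warm up once so the measurement is not dominated by first-run effects.
        let _ = workload();

        let iters = 20;
        let start = Instant::now();
        let mut sink = 0u64;
        for _ in 0..iters {
            sink ^= workload();
        }
        let per_iter = start.elapsed().as_secs_f64() / iters as f64;

        // Print the sink too, so the optimizer cannot delete the work entirely.
        println!("avg {:.3}s per iteration (sink = {})", per_iter, sink);
    }
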



Those benchmarks also need to include the peak RAM used during compilation and the size of the resulting binaries.
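
A minimal, Linux-only sketch of how those two numbers could be collected for a single compilation; the paths are hypothetical, and it assumes GNU time is installed at /usr/bin/time so that its -v output (which includes the peak resident set size) can be parsed.

    use std::fs;
    use std::process::Command;

    fn main() {
        // Hypothetical input and output paths for one benchmark.
        let src = "benchmarks/hello/main.rs";
        let bin = "benchmarks/hello/main";

        // Wrap the compiler in GNU time (-v) to capture its peak memory use.
        let out = Command::new("/usr/bin/time")
            .args(["-v", "rustc", "-O", src, "-o", bin])
            .output()
            .expect("failed to run /usr/bin/time");
        let stderr = String::from_utf8_lossy(&out.stderr);

        // GNU time prints "Maximum resident set size (kbytes): N" on stderr.
        let max_rss_kb = stderr
            .lines()
            .find(|l| l.contains("Maximum resident set size"))
            .and_then(|l| l.rsplit(':').next())
            .and_then(|n| n.trim().parse::<u64>().ok());

        // Binary size is just the size of the produced artifact on disk.
        let bin_size = fs::metadata(bin).map(|m| m.len()).ok();

        println!("peak RSS: {:?} kB, binary size: {:?} bytes", max_rss_kb, bin_size);
    }
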

80-90% of the times I update the Nightly compiler (64-bit Windows; I update it daily when possible) I see the binaries grow a little. Some months ago the binaries shrank hugely in one update, and we're still very far from regaining that lost binary weight, so it's not a big problem yet.

  • I think there is value in measuring all four of those categories.
  • There should be some pruning of the current test cases; we have two versions of regex in there, for example.
  • The visual presentation of performance numbers could be a lot better. More clearly separating the above categories would probably go a long way towards addressing concerns about mixing them.

One more thing to keep in mind is that the Travis nightlies have debug assertions and LLVM assertions turned on, I think, which skews measurements a bit.


I’m still holding on to the hope that winapi gets added to the set of crates being measured.