Measuring compiler performance


#1

So, sadly, the perf.rust-lang.org website has been broken for quite some time (basically since rustbuild landed). @Mark_Simulacrum has done some awesome work (which they can describe) building up a new system that uses the pre-built binaries produced by Travis. This offers the promise of very precise info about which PR triggered a regression.

I’d like to spark a discussion on two topics:

  1. What are the minimum steps we can take to get some kind of results available again?
    • I really dislike having no measurements
    • Is it just a matter of needing to get the old server up and going again, or what?
  2. How should we structure our test suite?

@Mark_Simulacrum and I have had quite a few conversations on the second point and it seems like a good idea to get broader input.

My current take is that there are roughly four kinds of measurements I would like:

  • Compilation times that target very specific parts of the compiler and workflows
    • this includes incremental flows, i.e., build from scratch, apply diff, etc
  • Regression tests for known performance issues (much the same as the previous item, but tied to a specific past problem)
  • “Real-world” tests that correspond to frozen versions of actual crates
    • this includes incremental flows, i.e., build from scratch, apply diff, etc
  • Performance of generated code
    • not currently measured at all; obviously somewhat different from compilation time, but perhaps can share infrastructure
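
As a rough sketch of what timing an incremental flow could look like: the code below times each step of a flow separately, so that "build from scratch" and "apply diff, rebuild" get their own numbers. The names `StepTiming` and `time_step` are made up for illustration, and the steps are stand-in closures that just sleep; a real harness would invoke the compiler there, which is not shown.

```rust
use std::time::{Duration, Instant};

/// Wall-clock time of one step in a benchmark flow,
/// e.g. "build from scratch" or "apply diff, rebuild".
/// (Hypothetical type, for illustration only.)
#[derive(Debug)]
struct StepTiming {
    name: String,
    elapsed: Duration,
}

/// Runs `step` once and records how long it took.
fn time_step<F: FnOnce()>(name: &str, step: F) -> StepTiming {
    let start = Instant::now();
    step();
    StepTiming {
        name: name.to_string(),
        elapsed: start.elapsed(),
    }
}

fn main() {
    // Stand-ins for real compiler invocations (assumption: a real
    // harness would spawn the compiler here instead of sleeping).
    let flow = vec![
        time_step("build from scratch", || {
            std::thread::sleep(Duration::from_millis(20))
        }),
        time_step("apply diff, rebuild", || {
            std::thread::sleep(Duration::from_millis(5))
        }),
    ];

    for t in &flow {
        println!("{:<22} {:?}", t.name, t.elapsed);
    }
}
```

Keeping one record per step (rather than one total per benchmark) is what would let a report show the from-scratch and incremental numbers side by side.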

Our current set of compilation-time benchmarks has grown somewhat organically and includes a smattering of the above categories. Perhaps we should carefully review them?

I made a brief stab at a set of runtime benchmarks as well, but that never quite got off the ground. Some more suggestions for entries there would be helpful.

Thoughts?


#2

Those benchmarks also need to include the peak memory (RAM) used during compilation, and the size of the resulting binaries.

About 80-90% of the time when I update the nightly compiler (64-bit Windows; I update daily when possible), I see the binaries grow a little. (Some months ago the binaries shrank hugely in one update, and we’re still very far from regaining that lost size, so it’s not a big problem.)
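
To make the two extra metrics concrete, here is a minimal sketch of how they might be collected. Binary size is just the file length. For peak memory, the sketch only parses the `VmHWM` line (peak resident set size) from text in the format of Linux's `/proc/<pid>/status`; that part is Linux-only, so it would not apply to my Windows setup, and how a harness would obtain that text for the compiler process is left out. The function names `binary_size` and `parse_peak_rss_kb` are made up for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Size in bytes of a produced artifact (for example an executable).
fn binary_size(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}

/// Extracts the peak resident set size in kB from text formatted like
/// Linux's /proc/<pid>/status, by looking for the `VmHWM:` line.
/// Returns None if the line is missing or malformed.
fn parse_peak_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmHWM:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kb| kb.parse().ok())
}

fn main() -> io::Result<()> {
    // Demonstration only: measure this program itself rather than
    // a compiler run.
    let exe = std::env::current_exe()?;
    println!("binary size: {} bytes", binary_size(&exe)?);

    match fs::read_to_string("/proc/self/status") {
        Ok(status) => println!("peak RSS: {:?} kB", parse_peak_rss_kb(&status)),
        Err(_) => println!("peak RSS: not available on this platform"),
    }
    Ok(())
}
```

Recording both numbers per benchmark next to the compile time would make a steady size creep like the one I describe visible across nightlies.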


#3

  • I think there is value in measuring all four of those categories.
  • There should be some pruning of current test cases; we have two versions of regex in there, for example.
  • The visual presentation of performance numbers could be a lot better. More clearly separating the above categories would probably go a long way towards addressing concerns about mixing them.

One more thing to keep in mind is that Travis nightlies have debug assertions and LLVM assertions turned on, I think. This skews measurements a bit.


#4

I’m still holding on to the hope that winapi gets added to the set of crates being measured.