Help Needed: corpus for measuring runtime performance of generated code

Update from my end:

2018-02-02 is the first nightly since 2018-01-01 on which all of the currently assembled benchmarks compile successfully, and I’m running a script on a spare machine to backfill data from that date over the last couple of months. Once more of those runs complete, I’m planning to start exploring a few strategies for how to present the data. At a minimum I think we need to identify some statistics which let us sort the benchmark graphs by some sort of “interestingness” metric, surfacing the most interesting graphs first. Right now I am assuming we want to know about a) runtime performance regressions and b) runtime performance improvements, and to focus on recent changes (within the last 6 weeks? 4?).
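
To make that concrete, here is a minimal sketch (not lolbench’s actual code) of one possible “interestingness” score: compare the median of the most recent window of nightlies against the median of the preceding baseline, and rank benchmarks by the relative shift. The window size and the nanoseconds-per-iteration input are assumptions for illustration.

```rust
/// Nanoseconds-per-iteration samples for one benchmark, ordered oldest -> newest.
/// Returns the relative shift between the recent window and the earlier baseline;
/// larger values mean a more "interesting" graph to surface.
fn interestingness(samples: &[f64], recent_window: usize) -> Option<f64> {
    if samples.len() < recent_window * 2 {
        return None; // not enough history to form a baseline
    }
    let (baseline, recent) = samples.split_at(samples.len() - recent_window);
    let shift = median(recent) - median(baseline);
    Some(shift.abs() / median(baseline))
}

fn median(xs: &[f64]) -> f64 {
    let mut v = xs.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = v.len() / 2;
    if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] }
}
```

Keeping the sign of `shift` instead of taking its absolute value would also let the dashboard distinguish regressions from improvements when sorting.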

I don’t have much experience dealing with this kind of data, so I’ve begun a bit of research on what kinds of analysis might be appropriate for finding interesting benchmarks, keeping a few notes at https://github.com/anp/lolbench/issues/7. So far, I’m pretty sure that:

  • we want to be very confident that something is a regression/improvement, not just noise
  • we don’t need to take any automated action based on the metric, other than making it easy for humans to know which benchmarks to look at
  • we don’t want to have to define lots of parameters up front for different benchmarks (there are too many individual benchmarks)
  • a solution should be as simple as possible so it doesn’t become a weird black box

I’ve collected a few ideas on that issue (one is sketched below); it would be great to hear from anyone with more experience with statistics.
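
As one example of an approach that seems to fit those constraints, here is a sketch of a permutation test on the shift in medians: it’s nonparametric, needs no per-benchmark tuning (only a global significance threshold), and only flags a change when it’s unlikely to be noise. It assumes the `rand` crate and reuses the `median` helper from the earlier sketch; nothing here is settled for lolbench.

```rust
use rand::seq::SliceRandom;

/// Approximate p-value for "the recent window differs from the baseline":
/// the fraction of random relabelings whose median shift is at least as
/// large as the one we actually observed.
fn permutation_p_value(baseline: &[f64], recent: &[f64], iterations: usize) -> f64 {
    let observed = (median(recent) - median(baseline)).abs();
    let mut pool: Vec<f64> = baseline.iter().chain(recent.iter()).cloned().collect();
    let mut rng = rand::thread_rng();
    let mut at_least_as_extreme = 0;
    for _ in 0..iterations {
        pool.shuffle(&mut rng);
        let (b, r) = pool.split_at(baseline.len());
        if (median(r) - median(b)).abs() >= observed {
            at_least_as_extreme += 1;
        }
    }
    at_least_as_extreme as f64 / iterations as f64
}
```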

While the backfill benchmarks are running, I’m going to try to tackle:

  • polish perf_events and publish to crates.io
  • upstream PMU measurements to criterion
  • make it really easy to contribute new benchmarks (a rough sketch of what one might look like is below)
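
I haven’t pinned down the contribution workflow yet, but a contributed benchmark will presumably be an ordinary nightly-style `#[bench]` function exercising code from a real crate. The crate, function, and fixture names below are placeholders, not anything lolbench uses today.

```rust
#![feature(test)]
extern crate test;
extern crate some_real_crate; // hypothetical crate under benchmark

use test::Bencher;

#[bench]
fn parse_large_document(b: &mut Bencher) {
    // hypothetical fixture; real benchmarks would use representative inputs
    let input = include_str!("fixtures/large_document.txt");
    b.iter(|| {
        // black_box keeps the optimizer from discarding the measured work
        test::black_box(some_real_crate::parse(test::black_box(input)))
    });
}
```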