OK! I spent some time putting benchmarks together. The repository currently has benchmarks from 12 projects in it, and I’m tracking several more I’d like to add. Some of the benchmarks are commented out to keep them from eating up all of my battery, but if you uncomment them you can run cargo bench in the crate root and get some nice output.
I have a few things I’d like to do still:
- integrate perf record into criterion runners (this may take a bit of effort) so that perf.rlo can display more granular metrics than just runtime — a rough sketch of the idea is below this list
- sort out a stable JSON format (I don’t know whether the criterion authors intend the files it currently writes to be a stable format)
- make criterion able to write data files and reports to a configurable directory
- add more benchmarks (see tracking issue)
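To make the perf integration item a bit more concrete, here’s a rough sketch of the kind of wrapper I have in mind: shell out to perf around an already-built benchmark binary and capture the counter output for later reporting. Note the substitutions and assumptions: I’ve used perf stat rather than perf record for simplicity, and the binary path, event list, and helper name are placeholders for illustration, not anything criterion or perf.rlo provides today.

```rust
use std::process::Command;

/// Hypothetical helper: run an already-built benchmark binary under
/// `perf stat` and return the raw comma-separated output for later parsing.
/// The event list and `--bench` flag are illustrative assumptions.
fn perf_stat_bench(bench_binary: &str) -> std::io::Result<String> {
    let output = Command::new("perf")
        .args([
            "stat",
            "-x", ",",                   // machine-readable, comma-separated output
            "-e", "instructions,cycles", // example hardware events beyond wall time
            "--",
            bench_binary,
            "--bench",                   // run the harness in bench mode
        ])
        .output()?;

    // `perf stat` writes its counters to stderr, not stdout.
    Ok(String::from_utf8_lossy(&output.stderr).into_owned())
}

fn main() -> std::io::Result<()> {
    // Placeholder path; in practice this would be whatever cargo built.
    let counters = perf_stat_bench("target/release/deps/some_benchmark")?;
    println!("{counters}");
    Ok(())
}
```

Parsing that output would let perf.rlo report instruction and cycle counts per benchmark in addition to runtime.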
So far my goal has been to avoid enabling any nightly features. While I would expect these benchmarks to always be run with a nightly compiler, keeping everything compiling on the latest nightly without relying on unstable features seems like a worthy goal. In a few places that has required more than a copy/paste migration of the benchmark code, but I’m not worried about the maintenance cost of that, since I hope the benchmark suite will stay append-only as much as possible.
Once the data collection is in a good place, one open question is how to handle the volume of data. Over time, I hope we’ll be able to prune the benchmark functions down to a more actionable set, but in the meantime there are almost 400 of them, and that number will grow quite a bit if we cover as many types of crates as proposed. In the past I’ve tried using a geometric mean of each crate’s benchmark times to detect regressions for an entire crate, but that seems a bit coarse-grained for a tool that’s supposed to help detect codegen regressions.
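For reference, the per-crate aggregation I mean is just the geometric mean of the benchmark times, computed via the mean of logs so the product doesn’t overflow. This is a generic sketch with made-up numbers, not code from the repository.

```rust
/// Geometric mean of a crate's benchmark times (e.g. nanoseconds per
/// iteration), computed as exp(mean(ln(x))) to avoid overflowing a
/// plain product. Returns None for empty or non-positive input.
fn geometric_mean(times_ns: &[f64]) -> Option<f64> {
    if times_ns.is_empty() || times_ns.iter().any(|&t| t <= 0.0) {
        return None;
    }
    let mean_log = times_ns.iter().map(|t| t.ln()).sum::<f64>() / times_ns.len() as f64;
    Some(mean_log.exp())
}

fn main() {
    // Made-up numbers: per-benchmark times for one crate, before and after a change.
    let before = [120.0, 80.0, 1500.0];
    let after = [125.0, 82.0, 1480.0];
    let ratio = geometric_mean(&after).unwrap() / geometric_mean(&before).unwrap();
    // A ratio noticeably above 1.0 flags the whole crate as regressed,
    // which is exactly the coarseness I'm worried about: one slow benchmark
    // can hide in the average, or drag the whole crate's number around.
    println!("crate-level change: {:+.2}%", (ratio - 1.0) * 100.0);
}
```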
One last question regarding what benchmarks to add: does anyone have any idea of what constitutes “coverage” for this kind of benchmark suite? Given the complexity of optimization passes, I’m not sure how one would be confident that a benchmark suite covers all of the language/compiler/std/toolchain features we’d care about, but it would be good to at least try. If anyone has more informed opinions about coverage targets we could set, I’d be happy to look into adding more benchmarks in that direction.
EDIT: Maybe there’s inspiration to be drawn from SPEC or similar projects?
EDIT2: At the very least, maybe we should include things from the Benchmarks Game so we don’t regress on the PR front.