Hey, Criterion.rs maintainer here. I think it's great that you're using Criterion.rs for this. I'm taking a bit of a break from active development for the moment but I'd be happy to assist with this work from the Criterion.rs side. Let me know if there's anything I can help with.
integrate perf record into criterion runners (this may take a bit of effort) so that perf.rlo can display more granular metrics than just runtime
This seems like it would be useful for others as well. I don't know much about perf specifically; can that be done in-process, or would it require running the benchmark in a sub-process?
sort out a stable JSON format (I don't know if the criterion authors expect to keep the current files they write as a stable format)
The JSON formats are deliberately unspecified and open to change, unfortunately. They've already changed at least once since 0.1.0. It might be possible to support this use-case more cleanly by allowing the user to provide a custom report that would receive the data directly. There's already something like that internally, but the API isn't really ready for public use.
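To illustrate what "a custom report that receives the data directly" could mean, here is a purely hypothetical sketch; none of these types or methods exist in criterion's public API, it is only the shape of the idea:

```rust
/// Hypothetical summary handed to a user-supplied report. The fields are
/// illustrative only; they are not criterion's actual data structures.
pub struct BenchmarkSummary {
    pub id: String,       // e.g. "fannkuch_redux/10"
    pub mean_ns: f64,     // point estimate of mean iteration time
    pub std_dev_ns: f64,  // spread of the sample
}

/// Hypothetical trait a user could implement to receive measurements directly,
/// instead of scraping whatever JSON files criterion happens to write to disk.
pub trait CustomReport {
    fn benchmark_complete(&mut self, summary: &BenchmarkSummary);
}

/// Example implementation that appends one hand-rolled JSON line per benchmark.
pub struct JsonLinesReport {
    pub out: std::fs::File,
}

impl CustomReport for JsonLinesReport {
    fn benchmark_complete(&mut self, summary: &BenchmarkSummary) {
        use std::io::Write;
        // Ignores write errors and doesn't escape the id; fine for a sketch.
        let _ = writeln!(
            self.out,
            r#"{{"id":"{}","mean_ns":{},"std_dev_ns":{}}}"#,
            summary.id, summary.mean_ns, summary.std_dev_ns
        );
    }
}
```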
make criterion able to write data files and reports to a configurable directory
This one should be pretty easy; I think all of the necessary code is there for it, but I just didn't want to stabilize an API for it until I needed to.
ping - @dikaiosune, I'm quite lost as to the status. Should we make a kind of tracking issue here? We're hitting "Yet Another" case where it'd be really nice to have some insight into the effects of various changes on runtime performance (in this case, it is @michaelwoerister's clever PR to reuse generics from dependencies, which offers a tidy win on compilation time, but comes at the cost of being able to do less inlining; for now, we are limiting to debug builds, but ...).
Sorry for the radio silence! Work's been a bit intense.
There's a bit more work to do before I think it would be a good idea to start collecting metrics on perf.rlo (and I still don't have a clear idea of how to make the data more navigable). In the meantime, it would be pretty straightforward to clone the repository onto a benchmark machine (bare metal preferred) and run it. I would recommend the following:
set a rustup override for the benchmark directory for the "base" toolchain (before the changes)
run cargo bench, save the results
set a rustup override for the benchmark directory with a custom toolchain from the PR
run cargo bench again, and criterion should tell you on the command line if there's a measurable difference compared to the base run
I haven't run the entire suite yet myself, so I don't know how long it'll take, but I'd estimate a couple of hours. I'm happy to help here or on IRC if someone wants to set this up.
I don't think we can afford a couple of hours of runtime on the current perf collector, so we'd need to consider finding another dedicated server or narrowing down the set of benchmarks we run before we take that step.
I have a physical box we can use for lolbench for the time being. I've done a bunch of configuration to try to make microbenchmark performance more predictable.
I have a fork of criterion.rs (which I need to rebase after some changes from @bheisler) that records hardware PMU counters from Linux for each microbenchmark, so we should be able to describe benchmark performance much more granularly than just ns/iter.
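For anyone curious, reading hardware counters in-process on Linux can look roughly like the sketch below. It uses the perf-event crate purely as an illustration; it is not the code in my fork, and the counter choices are arbitrary:

```rust
use perf_event::{events::Hardware, Builder};

/// Run `work` once while counting retired instructions and cache misses for
/// this process. Illustration only; a real harness would do this per sample.
fn count_pmus<F: FnOnce()>(work: F) -> std::io::Result<(u64, u64)> {
    let mut instructions = Builder::new().kind(Hardware::INSTRUCTIONS).build()?;
    let mut cache_misses = Builder::new().kind(Hardware::CACHE_MISSES).build()?;

    instructions.enable()?;
    cache_misses.enable()?;
    work();
    instructions.disable()?;
    cache_misses.disable()?;

    Ok((instructions.read()?, cache_misses.read()?))
}

fn main() -> std::io::Result<()> {
    let (insns, misses) = count_pmus(|| {
        // Stand-in for a benchmark body.
        let v: Vec<u64> = (0..1_000_000).collect();
        std::hint::black_box(v);
    })?;
    println!("instructions: {insns}, cache misses: {misses}");
    Ok(())
}
```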
I have a small patch in my criterion fork that writes results to CARGO_TARGET_DIR, which was my main blocker to automating collection; I definitely want to preserve the benchmark results alongside the built artifacts.
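The idea is just to resolve the output directory from Cargo's environment instead of a hard-coded location; a rough sketch of that (not the actual patch, and the function name is made up):

```rust
use std::path::PathBuf;

/// Pick a directory for benchmark data, preferring the target directory Cargo
/// is actually using so results live next to the built artifacts.
fn benchmark_output_dir() -> PathBuf {
    std::env::var_os("CARGO_TARGET_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("target"))
        .join("criterion")
}

fn main() {
    println!(
        "writing benchmark data under {}",
        benchmark_output_dir().display()
    );
}
```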
There are a few tasks left:
set up a task to run this on every nightly and push the JSON files somewhere
figure out how/whether to reformat the JSON files that come directly from criterion (a rough sketch follows after this list)
maybe add more benchmarks
collect enough data that I can start figuring out how to display outliers on the various metrics
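For the second item above, one option is to flatten the per-benchmark files into a single summary per nightly. The sketch below assumes a `target/criterion/<bench>/new/estimates.json` layout and a `mean.point_estimate` field; both are guesses about an explicitly unstable format, so everything is treated as optional:

```rust
use serde_json::{json, Value};
use std::{fs, path::Path};

/// Collect a {benchmark name -> mean estimate} map from criterion's output
/// directory. The file layout and JSON field names are assumptions.
fn summarize(criterion_dir: &Path) -> std::io::Result<Value> {
    let mut summary = serde_json::Map::new();
    for entry in fs::read_dir(criterion_dir)? {
        let entry = entry?;
        let estimates = entry.path().join("new").join("estimates.json");
        if !estimates.is_file() {
            continue;
        }
        let parsed: Value = match serde_json::from_str(&fs::read_to_string(&estimates)?) {
            Ok(v) => v,
            Err(_) => continue, // tolerate format changes rather than failing the run
        };
        if let Some(mean) = parsed.pointer("/mean/point_estimate").cloned() {
            summary.insert(entry.file_name().to_string_lossy().into_owned(), mean);
        }
    }
    Ok(json!({ "benchmarks": summary }))
}

fn main() -> std::io::Result<()> {
    let summary = summarize(Path::new("target/criterion"))?;
    println!("{}", serde_json::to_string_pretty(&summary).unwrap());
    Ok(())
}
```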
That sounds great. We should definitely talk about bringing some of those patches back into criterion.rs. I'm particularly curious about your changes to track the PMUs. Do you have a link to where I can take a look at your fork?
As for further non-lolbench work, I also need to make criterion display PMU data when comparisons suggest a regression or improvement. Right now there is no comparison of PMU data between benchmark runs the way there is for benchmark times.
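The comparison itself doesn't have to be sophisticated to start with; a relative-change check per counter with a noise threshold would already be useful. A rough sketch (the 2% threshold is a made-up placeholder, not a tuned value):

```rust
/// Direction of a change in a PMU counter between two benchmark runs.
#[derive(Debug, PartialEq)]
enum Change {
    Improved,
    Regressed,
    Unchanged,
}

/// Compare one counter (e.g. instructions retired) between a baseline run and
/// a new run. `noise` is the relative change we ignore.
fn compare_counter(baseline: f64, new: f64, noise: f64) -> Change {
    let relative = (new - baseline) / baseline;
    if relative > noise {
        Change::Regressed // more instructions/misses than before
    } else if relative < -noise {
        Change::Improved
    } else {
        Change::Unchanged
    }
}

fn main() {
    // e.g. instruction counts from two runs of the same microbenchmark
    let verdict = compare_counter(1_250_000.0, 1_410_000.0, 0.02);
    println!("{verdict:?}"); // Regressed
}
```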
2018-02-02 is the first nightly date since 2018-01-01 on which all of the currently assembled benchmarks compile successfully, and I'm currently running a script on a spare machine to backfill data from there for the last couple of months. Once more of these have run, I'm planning to start exploring a few strategies for presenting the data. At a minimum, I think we need to identify some statistics that let us sort the benchmarks' graphs by an "interestingness" metric, surfacing the most interesting graphs to look at. Right now I am assuming we want to know about a) runtime performance regressions and b) runtime performance improvements, and to focus on recent (6 weeks old? 4?) changes like those.
I don't have much experience dealing with this kind of data, so I've begun a bit of research into what kinds of analysis might be appropriate for finding interesting benchmarks, keeping a few notes at https://github.com/anp/lolbench/issues/7. So far, I'm pretty sure that:
we want to be very confident that something is a regression/improvement, not just noise
we donāt need to take any automated action based on the metric, other than making it easy for humans to know which benchmarks to look at
we donāt want to have to define lots of parameters up front for different benchmarks (there are too many individual benchmarks)
a solution should be as simple as possible so it doesn't become a weird black box
I've collected a few ideas on that issue; it would be great to hear from anyone with more experience in statistics.
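As one example of the "simple and confident" end of the spectrum, an interestingness score could measure how far the median of the most recent nightlies sits from a robust baseline (median and median absolute deviation) of the older ones. The sketch below is only an illustration; the window size and any cutoff would still need tuning against real data:

```rust
/// Median of a sample (the slice is copied so the caller's order is preserved).
fn median(xs: &[f64]) -> f64 {
    let mut v = xs.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = v.len() / 2;
    if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    }
}

/// Score how "interesting" a benchmark's history is: how far the median of the
/// last `recent` nightlies sits from the older baseline, in units of the
/// baseline's median absolute deviation. Larger magnitude = more interesting;
/// positive means times went up (regression), negative means they went down.
fn interestingness(times_ns: &[f64], recent: usize) -> Option<f64> {
    if times_ns.len() <= recent || recent == 0 {
        return None;
    }
    let (baseline, latest) = times_ns.split_at(times_ns.len() - recent);
    let base_med = median(baseline);
    let deviations: Vec<f64> = baseline.iter().map(|x| (x - base_med).abs()).collect();
    let mad = median(&deviations).max(1e-9); // avoid dividing by zero on flat data
    Some((median(latest) - base_med) / mad)
}

fn main() {
    // Fake per-nightly mean times: stable around 100ns, then a jump to ~115ns.
    let history = [100.0, 101.0, 99.5, 100.4, 100.1, 99.9, 115.0, 114.2, 115.5];
    let score = interestingness(&history, 3).unwrap();
    // Sort benchmarks by |score| descending and look at the top of the list.
    println!("interestingness: {score:.1}");
}
```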
While the backfill benchmarks are running, I'm going to try to tackle:
Another, slightly more exciting update: I have run enough of the benchmarks on some of my hardware to have a little bit of data from nightlies cut during February and March.
I haven't done more than a cursory scan of a few of the benchmarks' results, but I already found one fun performance improvement:
Talked to @eddyb briefly on IRC, and it seems somewhat likely that something in 3bcda48...45fba43b caused the improvement here. That commit range includes the LLVM 6 upgrade, so that's a pretty likely candidate.
Hopefully we can have more automated detection for these changes in the near future!
I finally have the automated collection working reliably and running on a couple of cheap dedicated servers. The data is currently summarized at https://blog.anp.lol/lolbench-data/ if anyone wants to check it out!
EDIT: I am in the process of writing a blog post with more detail, should post that in a day or two.