Help Needed: corpus for measuring runtime performance of generated code


#21

Hey, Criterion.rs maintainer here. I think it’s great that you’re using Criterion.rs for this. I’m taking a bit of a break from active development for the moment but I’d be happy to assist with this work from the Criterion.rs side. Let me know if there’s anything I can help with.

integrate perf record into criterion runners (this may take a bit of effort) so that perf.rlo can display more granular metrics than just runtime

This seems like it would be useful for others as well. I don’t know much about perf specifically; can that be done in-process, or would it require running the benchmark in a sub-process?

sort out a stable JSON format (I don’t know if the criterion authors expect to keep the current files they write as a stable format)

The JSON formats are deliberately unspecified and open to change, unfortunately. They’ve already changed at least once since 0.1.0. It might be possible to support this use-case more cleanly by allowing the user to provide a custom report that would receive the data directly. There’s already something like that internally, but the API isn’t really ready for public use.

make criterion able to write data files and reports to a configurable directory

This one should be pretty easy; I think all of the necessary code is there for it, I just didn’t want to stabilize an API for it until I needed to.


#22

ping - @dikaiosune, I’m quite lost as to the status. Should we make a kind of tracking issue here? We’re hitting “Yet Another” case where it’d be really nice to have some insight into the effects of various changes on runtime performance (in this case, it is @michaelwoerister’s clever PR to reuse generics from dependencies, which offers a tidy win on compilation time, but comes at the cost of being able to do less inlining – for now, we are limiting to debug builds, but …).


#23

Sorry for the radio silence! Work’s been a bit intense.

There’s a bit more work to do before I think it would be a good idea to start collecting metrics on perf.rlo (and I still don’t have a clear idea of how to make the data more navigable). In the meantime, it would be pretty straightforward to clone the repository onto a benchmark machine (bare metal preferred) and run it. I would recommend the following:

  1. set a rustup override for the benchmark directory for the “base” toolchain (before the changes)
  2. run cargo bench, save the results
  3. set a rustup override for the benchmark directory with a custom toolchain from the PR
  4. run cargo bench again, and criterion should tell you on the command line if there’s a measurable difference compared to the base run
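
The four steps above can be sketched as a small driver. This is only an illustration: the toolchain names are placeholders supplied by the caller, and it assumes that `rustup` and `cargo` are on `PATH` and that the PR toolchain has already been registered with rustup under the given name.

```rust
use std::path::Path;
use std::process::Command;

/// One external command in the comparison plan: program name plus arguments.
#[derive(Debug, Clone, PartialEq)]
pub struct Step {
    pub program: String,
    pub args: Vec<String>,
}

fn step(program: &str, args: &[&str]) -> Step {
    Step {
        program: program.to_string(),
        args: args.iter().map(|a| a.to_string()).collect(),
    }
}

/// Build the plan: pin the base toolchain, bench, pin the PR toolchain, bench again.
/// `base` and `candidate` are rustup toolchain names (hypothetical, caller-supplied).
pub fn comparison_plan(base: &str, candidate: &str) -> Vec<Step> {
    vec![
        step("rustup", &["override", "set", base]),
        step("cargo", &["bench"]),
        step("rustup", &["override", "set", candidate]),
        step("cargo", &["bench"]),
    ]
}

/// Run every step with `bench_dir` as the working directory, so the override
/// applies to that directory. Returns Ok(false) at the first failing step.
pub fn run_plan(bench_dir: &Path, plan: &[Step]) -> std::io::Result<bool> {
    for s in plan {
        let status = Command::new(&s.program)
            .args(&s.args)
            .current_dir(bench_dir)
            .status()?;
        if !status.success() {
            return Ok(false);
        }
    }
    Ok(true)
}
```

Calling `run_plan(Path::new("path/to/benchmarks"), &comparison_plan("base-toolchain", "pr-toolchain"))` would run both passes in order. Because the second `cargo bench` runs in the same directory, criterion can compare it against the results saved by the first.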

I haven’t run the entire suite yet myself so I don’t know how long it’ll take, but I’d estimate a couple of hours. Happy to help here or on IRC if someone wants to help set this up.


#24

I don’t think we can afford a couple hours of runtime on the current perf collector, so we’d need to consider finding another dedicated server or narrowing down the quantity of benchmarks we run before we take that step.


#25

I’ll try to make progress on some PRs to criterion in the near future.

@Mark_Simulacrum I’ll try setting up a physical box, I have a spare with pretty solid performance.


#26

Wow, time flies! Small update on this from me:

  • I have a physical box we can use for lolbench for the time being. I’ve done a bunch of configuration to try to make microbenchmark performance more predictable.
  • I have a fork of criterion.rs (which I need to rebase after some changes from @bheisler) that records hardware PMU counters on Linux for each microbenchmark, so we should be able to describe benchmark performance much more granularly than just ns/iter.
  • I have a small patch in my criterion fork that writes results to CARGO_TARGET_DIR which was my main blocker to automating collection – definitely want to preserve the benchmark results alongside the built artifacts.
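
As a rough idea of what honoring CARGO_TARGET_DIR can look like, here is a minimal sketch. It is not the patch from the fork: the function names are hypothetical, and the `criterion` subdirectory is assumed to mirror criterion's default `target/criterion` output location.

```rust
use std::env;
use std::path::PathBuf;

/// Pick the directory for benchmark results given an optional target-dir setting.
/// Falls back to `target` (Cargo's default) when the setting is missing or empty,
/// then appends a `criterion` subdirectory.
pub fn results_dir(cargo_target_dir: Option<&str>) -> PathBuf {
    let base = match cargo_target_dir {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from("target"),
    };
    base.join("criterion")
}

/// Same lookup, reading `CARGO_TARGET_DIR` from the process environment.
pub fn results_dir_from_env() -> PathBuf {
    results_dir(env::var("CARGO_TARGET_DIR").ok().as_deref())
}
```

Keeping the environment lookup separate from the path logic makes the fallback behavior easy to check without touching the real environment.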

There are a few tasks left:

  • set up a task to run this on every nightly and push the JSON files somewhere
  • figure out how/whether to reformat the JSON files that come directly from criterion
  • maybe add more benchmarks
  • collect enough data that I can start figuring out how to display outliers on the various metrics
  • figure out how to detect/display outliers :stuck_out_tongue:

#27

That sounds great. We should definitely talk about bringing some of those patches back into criterion.rs. I’m particularly curious about your changes to track the PMUs. Do you have a link to where I can take a look at your fork?


#28

Yep, the commits in my fork are at https://github.com/anp/criterion.rs/commits/master (they should be possible to upstream without too much work). They rely on a crate I wrote at https://github.com/anp/perf_events to track PMUs. It only works on Linux at the moment but I’m considering trying to expand that.

For further work outside of lolbench, I also need to make criterion display PMU data when a comparison suggests that it regressed or improved. Right now there is no comparison between benchmark runs for PMU data as there is for benchmark times.


#29

Update from my end:

2018-02-02 is the first nightly date since 2018-01-01 where all of the currently assembled benchmarks compile successfully, and I’m currently running a script to backfill data from there for the last couple of months using a spare machine I have. Once more of these have run, I’m planning to start exploring a few strategies for how to present the data. At a minimum I think we need to identify some statistics which can allow us to sort graphs for the benchmarks by some sort of “interestingness” metric, surfacing the most interesting graphs to look at. Right now I am assuming we want to know about a) runtime performance regressions and b) runtime performance improvements, and to focus on recent (6 weeks old? 4?) changes like those.
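
A backfill like this needs the list of nightly toolchains to iterate over. The sketch below generates rustup-style names (`nightly-YYYY-MM-DD`) for a date range using only the standard library. The function names are hypothetical, this is not the actual backfill script, and a nightly is not guaranteed to exist for every calendar date, so a real script would need to skip dates that fail to install.

```rust
/// Days in a month, using Gregorian leap-year rules.
fn days_in_month(year: u32, month: u32) -> u32 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 => {
            if (year % 4 == 0 && year % 100 != 0) || year % 400 == 0 {
                29
            } else {
                28
            }
        }
        _ => panic!("month out of range"),
    }
}

/// The calendar day after (year, month, day).
fn next_day((y, m, d): (u32, u32, u32)) -> (u32, u32, u32) {
    if d < days_in_month(y, m) {
        (y, m, d + 1)
    } else if m < 12 {
        (y, m + 1, 1)
    } else {
        (y + 1, 1, 1)
    }
}

/// Toolchain names for every date from `start` through `end`, inclusive.
/// Dates are (year, month, day) tuples, which compare in calendar order.
pub fn nightly_toolchains(start: (u32, u32, u32), end: (u32, u32, u32)) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = start;
    while cur <= end {
        out.push(format!("nightly-{:04}-{:02}-{:02}", cur.0, cur.1, cur.2));
        cur = next_day(cur);
    }
    out
}
```

For example, `nightly_toolchains((2018, 2, 2), (2018, 3, 1))` starts at `nightly-2018-02-02` and ends at `nightly-2018-03-01`.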

I don’t have much experience dealing with this kind of data, so I’ve begun a bit of research on what kinds of analysis might be appropriate for finding interesting benchmarks, keeping a few notes at https://github.com/anp/lolbench/issues/7. So far, I’m pretty sure that:

  • we want to be very confident that something is a regression/improvement, not just noise
  • we don’t need to take any automated action based on the metric, other than making it easy for humans to know which benchmarks to look at
  • we don’t want to have to define lots of parameters up front for different benchmarks (there are too many individual benchmarks)
  • a solution should be as simple as possible so it doesn’t become a weird black box

I’ve collected a few ideas on that issue; it would be great to hear from anyone with more experience with statistics.
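
To make the constraints above concrete, here is one simple candidate for an “interestingness” score. It is an illustrative assumption, not a method chosen in this thread. It measures how far the median of recent runs has moved from the median of a baseline window, in units of the baseline’s median absolute deviation (MAD). It needs no per-benchmark parameters, and every step can be checked by hand.

```rust
use std::cmp::Ordering;

/// Median of a slice; None for empty input.
pub fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap_or(Ordering::Equal));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Median absolute deviation: the median distance of each value from the median.
pub fn mad(values: &[f64]) -> Option<f64> {
    let m = median(values)?;
    let deviations: Vec<f64> = values.iter().map(|v| (v - m).abs()).collect();
    median(&deviations)
}

/// Signed shift of the recent median from the baseline median, in baseline MADs.
/// Returns None if either window is empty or the baseline has zero spread.
pub fn interestingness(baseline: &[f64], recent: &[f64]) -> Option<f64> {
    let base_median = median(baseline)?;
    let recent_median = median(recent)?;
    let spread = mad(baseline)?;
    if spread == 0.0 {
        return None;
    }
    Some((recent_median - base_median) / spread)
}

/// Sort (name, baseline, recent) series by absolute score, largest first.
/// Series without a defined score are left out.
pub fn rank_by_interest<'a>(series: &[(&'a str, Vec<f64>, Vec<f64>)]) -> Vec<(&'a str, f64)> {
    let mut scored: Vec<(&'a str, f64)> = series
        .iter()
        .filter_map(|(name, baseline, recent)| {
            interestingness(baseline, recent).map(|score| (*name, score))
        })
        .collect();
    scored.sort_by(|a, b| b.1.abs().partial_cmp(&a.1.abs()).unwrap_or(Ordering::Equal));
    scored
}
```

If the values are ns/iter, a positive score means the recent runs are slower than the baseline and a negative score means they are faster. Sorting by absolute value surfaces both regressions and improvements. How many runs to put in each window is left to the caller.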

While backfill benchmarks are running I’m going to try to tackle:

  • polish perf_events and publish to crates.io
  • upstream PMU measurements to criterion
  • make it really easy to contribute new benchmarks

#30

Another, slightly more exciting update: I have run enough of the benchmarks on some of my hardware to have a little bit of data from nightlies cut during February and March.

I haven’t done more than a cursory scan of a few of the benchmarks’ results, but I already found one fun performance improvement:

Talked to @eddyb briefly on IRC and it seems somewhat likely that something in 3bcda48…45fba43b caused the improvement here. That commit range includes the LLVM 6 upgrade, so that’s a pretty likely candidate.

Hopefully we can have more automated detection for these changes in the near future!

EDIT: I pushed up the notebook I’m using and the data for it to a repo: https://github.com/anp/lolbench-analysis


#31

Last week I gave a talk about the work I’ve been doing on lolbench so far: https://youtu.be/gSFTbJKScU0