OK! I spent some time putting benchmarks together. The repository currently has benchmarks from 12 projects in it, and I’m tracking several more I’d like to add. Some of the benchmarks are commented out to keep them from eating up all of my battery, but if you uncomment them you can run cargo bench in the crate root and get some nice output.
I have a few things I’d like to do still:
- integrate perf record into criterion runners (this may take a bit of effort) so that perf.rlo can display more granular metrics than just runtime — a rough sketch of the idea is below this list
- sort out a stable JSON format (I don’t know whether the criterion authors intend the files it currently writes to be a stable format)
- make criterion able to write data files and reports to a configurable directory
- add more benchmarks (see tracking issue)
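To make the perf integration item a bit more concrete, here’s a rough sketch of the kind of wrapper I have in mind: shell out to perf around an already-built benchmark binary and capture the counter output for later reporting. Note the substitutions and assumptions: I’ve used perf stat rather than perf record for simplicity, and the binary path, event list, and helper name are placeholders for illustration, not anything criterion or perf.rlo provides today.

```rust
use std::process::Command;

/// Hypothetical helper: run an already-built benchmark binary under
/// `perf stat` and return the raw comma-separated output for later parsing.
/// The event list and `--bench` flag are illustrative assumptions.
fn perf_stat_bench(bench_binary: &str) -> std::io::Result<String> {
    let output = Command::new("perf")
        .args([
            "stat",
            "-x", ",",                   // machine-readable, comma-separated output
            "-e", "instructions,cycles", // example hardware events beyond wall time
            "--",
            bench_binary,
            "--bench",                   // run the harness in bench mode
        ])
        .output()?;

    // `perf stat` writes its counters to stderr, not stdout.
    Ok(String::from_utf8_lossy(&output.stderr).into_owned())
}

fn main() -> std::io::Result<()> {
    // Placeholder path; in practice this would be whatever cargo built.
    let counters = perf_stat_bench("target/release/deps/some_benchmark")?;
    println!("{counters}");
    Ok(())
}
```

Parsing that output would let perf.rlo report instruction and cycle counts per benchmark in addition to runtime.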
So far my goal has been to avoid enabling any nightly features. While I would expect these benchmarks to always be run with a nightly compiler, keeping everything compiling on the latest nightly without relying on unstable features seems like a worthy goal. In a few places that has required more than a copy/paste migration of the benchmark code, but I’m not worried about the maintenance cost of that, since I hope the benchmark suite will stay append-only as much as possible.
Once the data collection is in a good place, one open question is how to handle the volume of data. Over time, I hope we’ll be able to prune the benchmark functions down to a more actionable set, but in the meantime there are almost 400 of them, and that number will grow quite a bit if we cover as many types of crates as proposed. In the past I’ve tried using a geometric mean of each crate’s benchmark times to detect regressions for an entire crate, but that seems a bit coarse-grained for a tool that’s supposed to help detect codegen regressions.
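For reference, the per-crate aggregation I mean is just the geometric mean of the benchmark times, computed via the mean of logs so the product doesn’t overflow. This is a generic sketch with made-up numbers, not code from the repository.

```rust
/// Geometric mean of a crate's benchmark times (e.g. nanoseconds per
/// iteration), computed as exp(mean(ln(x))) to avoid overflowing a
/// plain product. Returns None for empty or non-positive input.
fn geometric_mean(times_ns: &[f64]) -> Option<f64> {
    if times_ns.is_empty() || times_ns.iter().any(|&t| t <= 0.0) {
        return None;
    }
    let mean_log = times_ns.iter().map(|t| t.ln()).sum::<f64>() / times_ns.len() as f64;
    Some(mean_log.exp())
}

fn main() {
    // Made-up numbers: per-benchmark times for one crate, before and after a change.
    let before = [120.0, 80.0, 1500.0];
    let after = [125.0, 82.0, 1480.0];
    let ratio = geometric_mean(&after).unwrap() / geometric_mean(&before).unwrap();
    // A ratio noticeably above 1.0 flags the whole crate as regressed,
    // which is exactly the coarseness I'm worried about: one slow benchmark
    // can hide in the average, or drag the whole crate's number around.
    println!("crate-level change: {:+.2}%", (ratio - 1.0) * 100.0);
}
```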
One last question regarding what benchmarks to add: does anyone have any idea of what constitutes “coverage” for this kind of benchmark suite? Given the complexity of optimization passes, I’m not sure how one would be confident that a benchmark suite covers all of the language/compiler/std/toolchain features we’d care about, but it would be good to at least try. If anyone has more informed opinions about coverage targets we could set, I’d be happy to look into adding more benchmarks in that direction.
EDIT: Maybe there’s inspiration to be drawn from SPEC or similar projects?
EDIT2: At the very least, maybe we should include things from the Benchmarks Game so we don’t regress on the PR front.