Help Needed: corpus for measuring runtime performance of generated code


At present, we measure compilation times on the website, but we do not measure the performance of generated code. This is problematic! We need to create a decent corpus of programs. This is particularly true as we start to move forward on MIR-level optimizations.

I made a start at this some time ago here:

and the issues on that repository captured a few more candidates.

However, that benchmark suite is rather dated by now, and anyway it never got “rounded out”. Is there anyone out there who would like to take charge of finishing that work? It would also be great to hear suggestions for nice Rust projects to include.

@Mark_Simulacrum has promised me (read: agreed in an off-hand comment on IRC) that if we get a nice suite of benchmarks (each run with cargo bench, presumably), they will take charge of getting the results onto perf. =)

cc @michaelwoerister @eddyb


I’m wondering if there is a way to make sure that the programs that are compiled and measured are diverse enough to tax/utilise a broad set of compiler/language features.

Of course, one can just try to measure a range of programs and hope that they will be representative. The idea I had was to use some sort of compiler introspection akin to code coverage analysis to see what parts of the compiler are used.

With that you could say which part of the compilation process was affected by a change (when measuring the performance of the compiler), and what language constructs were used by the sample programs (when measuring the performance of the compiler output).


This sounds like an interesting thing to do…after we’ve assembled our “first draft”. =) I would think it’d be a great start to just add in benchmarks from various common packages, like regex, nom, or rayon.


I’m wondering if there is some automated way to try the top N crates on ? Maybe we could selectively ask those crates to add benchmarks and then just automatically run cargo bench?


It will be really hard to infer much from that measurement. When new crates become popular, or benchmarks change, the measurement will change. I think it’d be better to have fixed benchmarks with lockfiles, so that the only variable is rustc itself.


A while ago I took a crack at this: At the time, finding a range of popular crates with maintained benchmarks that would also work across many nightly versions was a bit challenging. I ended up doing a bunch of other things to try to get consistent results across compiler versions, but sort of ran out of steam.

Is anyone going to pick up the call on this? I still think it’d be an interesting problem; the crates ecosystem has stabilized quite a bit since then, and I’ve learned a lot more since my last attempt :).


I haven’t seen anybody responding positively yet. I think the obvious next step is to create a kind of “quest issue”. This basically means:

  • Open up an issue on the rust repo (or perhaps rust-runtime-benchmarks) describing the problem
  • Write up a checklist of popular crates that we think would be good candidates for this benchmark suite. Here are some ideas of mine:
  • Write up clear instructions for how to create the benchmark PR.
    • For example, it should run with cargo bench
  • Doing one yourself to serve as a template is probably a good idea.
  • Advertise the heck out of this issue and see if people come to pick up the gauntlet!

If anyone is up for doing this, I will happily take the first step and move rust-runtime-benchmarks into the rust-lang-nursery, and make that person a collaborator. Otherwise, I’ll try to get to it, but I’m pretty slammed right now.

@dikaiosune, interested?


(I also like that existing doom benchmark, if for no other reason than the fond memories I have of playing Doom until the wee hours of the night.)


I have a few other crates (not all as widely used as regex) with solid benchmark suites as well:

  • snap — This uses the same benchmark suite as the reference Snappy implementation.
  • csv — Benchmarks of different ways to read CSV data across four distinct corpora.
  • aho-corasick — A specific component of the regex crate responsible for multiple pattern search.
  • byteorder — Lots of benchmarks on reading/writing numbers in various byte orders.

Of the above, I would think byteorder would be the most important to keep an eye on, because a lot of people use it and depend on it being as fast as possible.


I might have some time to spend on this!

@Mark_Simulacrum mentioned in the linked IRC log that they’d like to use additional features for cargo bench to control the number of bench runs, so that perf.rlo can control the runs in a more fine-grained way, like running only one iteration of a benchmark at a time. One reason I’m slightly skeptical of this approach is that it may be fairly common (I’ve certainly done it) to include expensive up-front computation in a bench function, but before the bencher.iter(|| ...) call.

It also seems like there’s not a lot of motivation to iterate on the bench harness’s interface right now, and getting any changes merged to support perf.rlo output is probably going to take a little while, which I assume would make it harder to add things like structured (JSON/XML/etc.) output in addition to the CLI flags mentioned on IRC.
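
To make the setup-cost point concrete, here’s a minimal sketch of a #[bench] function as cargo bench drives it today (the helper names are made up for illustration):

```rust
#![feature(test)]
extern crate test;

use test::Bencher;

// Hypothetical helpers standing in for whatever a real crate benchmarks.
fn expensive_setup() -> Vec<u64> {
    (0..10_000).collect()
}

fn work(data: &[u64]) -> u64 {
    data.iter().sum()
}

#[bench]
fn bench_work(b: &mut Bencher) {
    // Setup runs once, outside the measured closure...
    let data = expensive_setup();
    // ...so externally forcing "only one iteration per run" would re-pay
    // this setup cost on every run and skew the per-iteration numbers.
    b.iter(|| work(&data));
}
```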

I think it might be worth considering making an umbrella project which uses criterion, so that we can collect some more interesting stats than just ns/iter. I imagine it would certainly be easier to add structured output in a format that perf.rlo can consume easily. There are two big issues I see with this approach:

  1. It would require effort to wrap each bench suite in a criterion layer, although I assume at least some of the projects would be interested in having more robust statistics available and may be willing to directly expose criterion bench functions behind a cfg.
  2. I don’t entirely understand why perf.rlo’s stats collector would be preferable over any other, but if there’s something really useful or advantageous that perf.rlo does that criterion doesn’t, I could see that being a downside.

If we went this route, I think we’d have a project structure like so:

  • top level binary with a criterion_main!(foo)
    • facade crate with exported benchmarking functions, e.g. regex_0_11_0
      • exact-version-pinned dependency on regex@0.11.0 (version made up here, no idea what regex is on right now)

This pattern would repeat for each crate and crate version included. Since updating crate versions could drastically alter what benchmarks are available and the numbers for existing benchmarks, I think we’d want to make the benchmark suite append-only to start.
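
As a rough sketch of the facade piece of that layout (criterion API from memory; the crate name, version pin, and benchmark body are all invented):

```rust
// Sketch of one criterion bench target (e.g. benches/regex_0_11_0.rs, with
// `harness = false` for this target in Cargo.toml). In the layout described
// above, `bench_regex_0_11_0` would instead live in a facade crate that pins
// `regex = "=0.11.0"` exactly; the version here is made up.
use criterion::{criterion_group, criterion_main, Criterion};

pub fn bench_regex_0_11_0(c: &mut Criterion) {
    // Made-up benchmark body; a real facade would wrap or re-export the
    // upstream crate's own benchmark code instead.
    let re = regex::Regex::new(r"[a-z]+@[a-z]+\.com").unwrap();
    let haystack = "contact us at hello@example.com for details";
    c.bench_function("regex_0_11_0/is_match", |b| b.iter(|| re.is_match(haystack)));
}

criterion_group!(benches, bench_regex_0_11_0);
criterion_main!(benches);
```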

Happy to discuss the above, mostly just thinking out loud.

Some other ideas for potentially suitable crates (I haven’t checked whether these have benchmarks):

  • diesel – I don’t know if it has benchmarks today, but given that it stresses many areas of the type system I would expect it to be interesting fodder for optimization regressions
  • futures (or some consumer of futures) – another example of a heavily abstracted API that is likely to be performance critical for the ecosystem, albeit probably difficult to measure meaningfully in microbenchmarks
  • clap
  • some serde serializers and deserializers
  • an arena allocator, perhaps something like petgraph on top of one?
  • ndarray or some other pure-rust numerical compute library?
  • frequently used data structures like smallvec or vecmap
  • crossbeam or one of its offspring?
  • backtrace?

Some language/compiler features I can imagine not being caught by popular crates:

  • dynamic dispatch – generally considered a last resort for many Rust programmers (see the sketch after this list for what a targeted microbenchmark might look like)
  • C ABI calls – many/most libraries avoid taking on C deps but many applications use them
  • panicking – not sure if the performance of this matters much, but it’s definitely not present in very many popular libraries
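
To illustrate, a targeted dynamic-dispatch microbenchmark could be as small as the following criterion sketch; every name in it is invented, and the same shape would work for an extern "C" call or a catch_unwind round-trip:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// All names below are invented for illustration.
trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }
struct Square { s: f64 }

impl Shape for Circle { fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r } }
impl Shape for Square { fn area(&self) -> f64 { self.s * self.s } }

fn bench_dyn_dispatch(c: &mut Criterion) {
    // A heterogeneous collection means calls to `area` go through the vtable
    // rather than being statically dispatched and inlined.
    let shapes: Vec<Box<dyn Shape>> = (0..1_000)
        .map(|i| -> Box<dyn Shape> {
            if i % 2 == 0 {
                Box::new(Circle { r: i as f64 })
            } else {
                Box::new(Square { s: i as f64 })
            }
        })
        .collect();

    c.bench_function("dyn_dispatch/sum_areas", |b| {
        b.iter(|| shapes.iter().map(|s| black_box(s.area())).sum::<f64>())
    });
}

criterion_group!(benches, bench_dyn_dispatch);
criterion_main!(benches);
```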


cc @sgrif


This all sounds awesome to me. I don’t have a very strong opinion about the format of the benchmarks. I personally have found cargo bench-style benchmarks actually kind of hard to write – it often seems like they have to be so short that they come out quite unreliable, but maybe I’m doing it wrong in other ways. But I don’t think it represents “best practices” in any particular way.

I do agree though that having some way to hide “setup” costs and exclude them from the measurement would be useful.


Yes, I’m generally happy to take benchmarks in any format. perf.rlo today expects perf-style output, which is why I commented that it might be convenient to use its builtin runner (which runs perf stat with the right flags). But if we want to use criterion or some other runner and collect stats from it, that seems fine as well. I don’t feel too strongly either way. It does seem reasonable that microbenchmarks of the kind found with cargo bench don’t work well with perf stat-style runs, since the per-process overhead will probably be fairly high relative to the work being measured.

In general, I’m happy to see PRs modifying the collector if we think there’s a better approach than what we do today or need some new functionality.


Yup, we do have benchmarks today. bin/bench in the Diesel repo will run them.


It depends what you’re after, but there are a few floating around these days:

I’m sure there are more that I don’t know about :slight_smile:


OK! I spent some effort on putting some benchmarks together. The repository currently has benchmarks from 12 projects in it, and I’m tracking some more I’d like to add. Some of the benchmarks are commented out to keep from eating up all of my battery, but if you uncomment them you can run cargo bench in the crate root and get some nice output.

I have a few things I’d like to do still:

  1. integrate perf record into criterion runners (this may take a bit of effort) so that perf.rlo can display more granular metrics than just runtime
  2. sort out a stable JSON format (I don’t know if the criterion authors expect to keep the current files they write as a stable format) – see the sketch after this list for the kind of record I have in mind
  3. make criterion able to write data files and reports to a configurable directory
  4. add more benchmarks (see tracking issue)
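
On point 2, the per-benchmark record I have in mind might look something like this; the field names and units are entirely hypothetical and would need to be agreed with perf.rlo (serde and serde_json are assumed as dependencies):

```rust
use serde::Serialize;

/// Hypothetical shape for one benchmark result record; nothing here matches
/// what criterion or perf.rlo actually emit today.
#[derive(Serialize)]
struct BenchRecord {
    /// e.g. "regex_0_11_0"
    crate_id: String,
    /// e.g. "regex_0_11_0/is_match"
    benchmark: String,
    /// wall-clock time per iteration, in nanoseconds
    ns_per_iter: f64,
    /// spread reported by the harness (criterion reports confidence intervals)
    ns_per_iter_stddev: f64,
    /// rustc version the suite was compiled with
    rustc_version: String,
}

fn main() {
    let record = BenchRecord {
        crate_id: "regex_0_11_0".into(),
        benchmark: "regex_0_11_0/is_match".into(),
        ns_per_iter: 1234.5,
        ns_per_iter_stddev: 12.3,
        rustc_version: "rustc 1.x.0-nightly (invented)".into(),
    };
    println!("{}", serde_json::to_string_pretty(&record).unwrap());
}
```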

So far my goal has been to avoid enabling any nightly features. While I would expect these to always be run with a nightly compiler, keeping everything compiling on the latest nightly seems like a worthy goal. In a few places that has definitely required more than copy/paste migration of benchmark code, but I’m not worried about the maintenance cost of that as I would hope that the benchmark suite will be append-only as much as possible.

Once the data collection is in a good place, I think one open question is how to handle the volume of data. Over time, I hope we’ll be able to prune the number of benchmark functions to a more actionable set, but in the meantime there are currently almost 400 benchmark functions, and that will grow quite a bit if we cover as many types of crates as proposed. In the past I’ve tried using a geometric mean of each crate’s benchmark times to detect regressions for an entire crate, but that seems a bit coarse-grained for a tool that’s supposed to help detect codegen regressions.
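
For reference, the per-crate aggregate I mean is just the plain geometric mean of the individual benchmark times, something like this sketch:

```rust
/// Geometric mean of per-benchmark times (ns/iter) for one crate:
/// exp(average of ln(x)), which keeps a single large outlier from
/// dominating the way an arithmetic mean would.
fn geometric_mean(ns_per_iter: &[f64]) -> f64 {
    assert!(!ns_per_iter.is_empty());
    let log_sum: f64 = ns_per_iter.iter().map(|&x| x.ln()).sum();
    (log_sum / ns_per_iter.len() as f64).exp()
}

fn main() {
    // Made-up numbers for three benchmarks in one crate.
    let times = [120.0, 95.5, 4_300.0];
    println!("geometric mean: {:.1} ns/iter", geometric_mean(&times));
}
```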

One last question regarding what benchmarks to add: does anyone have any idea of what constitutes “coverage” for this kind of benchmark suite? Given the complexity of optimization passes, I’m not sure how one would be confident that a benchmark suite covers all of the language/compiler/std/toolchain features we’d care about, but it would be good to at least try. If anyone has more informed opinions about coverage targets we could set, I’d be happy to look into adding more benchmarks in that direction.

EDIT: Maybe there’s inspiration to be drawn from SPEC or similar projects?

EDIT2: At the very least maybe we should include things from the benchmarks game so we don’t regress on the PR front :stuck_out_tongue:


Enjarify ( has benchmarks. They’re not set up to run with cargo bench, but it shouldn’t be hard to change that. Right now it is set up so that you run the benchmark by just passing the argument hashtests to the main binary.

Enjarify is also somewhat unusual in that it deliberately panics and recovers, although this will only happen a couple dozen times per run, so it’s far from a bottleneck.
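
(For anyone unfamiliar with the pattern, the panic-and-recover path is essentially std::panic::catch_unwind; the sketch below is a generic illustration, not Enjarify’s actual code.)

```rust
use std::panic;

// A fallible step that signals failure by panicking rather than returning a
// Result, standing in for the handful of places Enjarify does this per run.
fn parse_chunk(data: &[u8]) -> u32 {
    if data.is_empty() {
        panic!("empty chunk");
    }
    data.iter().map(|&b| b as u32).sum()
}

fn main() {
    let inputs: Vec<&[u8]> = vec![&b"abc"[..], &b""[..], &b"hello"[..]];
    for input in inputs {
        // catch_unwind turns the panic into an Err instead of aborting,
        // which exercises the unwinding machinery a benchmark could measure.
        match panic::catch_unwind(|| parse_chunk(input)) {
            Ok(sum) => println!("ok: {}", sum),
            Err(_) => println!("recovered from panic"),
        }
    }
}
```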


What about the programs from ? It’s already a well-optimized, diverse set of small programs that do CPU-intensive stuff.


Claxon (a FLAC decoder) has benchmarks. I am not sure whether it would be a good target to measure; it has a few very hot loops and I don’t think there is much to gain through MIR optimisations. It might still be useful for detecting regressions, though.


This is great @dikaiosune! To me, the critical next question is how to get this data onto perf. I do feel like 400 data points is a bit much, though; that definitely wouldn’t scale with the current perf setup.