Pre-RFC: Stabilize `#[bench]`, `Bencher` and `black_box`

I’d welcome this!

I had thought that the sticky issue here was the stabilization of black_box, but I can’t remember the details.

Counter-proposal: can we macros 1.1 this? That is, maybe full on custom test frameworks need something huge, but is there something small we can do for now?

Also, will get you on stable for this stuff today.

I am also feelin’ that pressure of “let’s just stabilize some stuff”, but given how little attention this has gotten for years… i dunno.


Stabilizing basic testing and benching is an excellent usability idea and I’m glad it’s happening.

If building a #![no_std] binary, can’t you just test and bench it separately as an external crate and pull in the standard libs separately? I think conventional testing is already “different” enough on embedded devices that this neither helps nor hurts people who are already accustomed to it.

I make use of the test feature for benchmarks in several crates that I maintain, but I don’t feel any pressing need to stabilize these. It’s incredibly easy to make travis-ci run the benchmarks in the nightly version, so these being stable doesn’t really get me anything. If the feature genuinely isn’t ready, I’d rather it not be stable.
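For reference, a minimal sketch of that Travis setup (the keys are standard Travis CI configuration, but the exact matrix here is illustrative):

```yaml
language: rust
rust:
  - stable
  - nightly
script:
  - cargo test
  # benchmarks need the unstable test feature, so gate them on nightly
  - |
    if [ "$TRAVIS_RUST_VERSION" = "nightly" ]; then
      cargo bench
    fi
```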

At the same time, I’ve been using it as-is for a couple years now, and it’s worked perfectly fine. (I do miss the ability to ratchet metrics, actually.) If it works and hasn’t changed, perhaps by definition it is stable?


Thank you for pushing on this front, Brian. In the external test harnesses thread I proposed an alternative:

I’d like to (first introduce as unstable and then) stabilize building blocks for external test harnesses:

  • A simplified version of TestDescAndFn, with just enough to describe what rustc --test generates. (So probably without dynamic tests.)
  • Some way to override which crate/function is used instead of test::test_main_static. Maybe #[test_harness] extern crate fancy_test;? (If benchmarks are supported in this mechanism, TestDescAndFn would probably have to be generic over Bencher.)

Just to clarify:

There are only a few crates I see using test::TestDesc (and thus constructing their own test suites dynamically): the out-of-tree compiletest and syntax, and (interestingly) url.

The url, idna, and html5ever crates each use rustc-test, which is a copy of Rust’s src/libtest modified just enough to run on stable Rust. (The package name for Cargo is rustc-test, but the library name for rustc and extern crate is test.) They do so to dynamically generate many tests that share the same code but take different input data. But rustc-test can also be used with #[test] and #[bench] to run benchmarks on stable Rust today.
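The dynamic-generation pattern described above can be sketched in plain Rust. The types and names here are illustrative, not the real test-crate API:

```rust
// A stand-in for a dynamically constructed test case: each input datum
// becomes its own named test sharing the same body.
struct DynTest {
    name: String,
    run: Box<dyn Fn() -> bool>,
}

// Build one test per input pair; the predicate here is a placeholder
// for whatever the shared test body actually checks.
fn make_tests(inputs: Vec<(String, u32)>) -> Vec<DynTest> {
    inputs
        .into_iter()
        .map(|(name, n)| DynTest {
            name: format!("case_{}", name),
            run: Box::new(move || n % 2 == 0),
        })
        .collect()
}

fn main() {
    let tests = make_tests(vec![("a".to_string(), 2), ("b".to_string(), 3)]);
    for t in &tests {
        println!("{}: {}", t.name, if (t.run)() { "ok" } else { "FAILED" });
    }
}
```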

Woah #[bench] is stable? And rustc can be convinced to use an alternate test crate? How do you do that? That could change my opinions about the solution space considerably.

My concern with the alternate proposals so far is that they take us back to the drawing board. We will be going back to the drawing board to design custom test frameworks, and relatively soon, but anything involving design must take considerably longer than essentially rubber-stamping a de-facto solution.


Yes, that works for many cases, but there is e.g. embedded code that cannot link to std, and that might e.g. need tooling help to marshal itself over to an emulator before running.

In its own Cargo.toml, rustc-test uses:

```toml
[package]
name = "rustc-test"

[lib]
name = "test"
```

So having this in your Cargo.toml:

```toml
rustc-test = "0.1"
```

… makes Cargo pass this to rustc: --extern test=…/target/…/libtest-….rlib so that extern crate test; uses that crate instead of the standard library one. Since that crate doesn’t use #[unstable] (I removed them in this copy), this works fine on stable.

And yes, it looks like the #[bench] attribute itself is not feature-gated. Only (the standard library’s copy of) the test crate and test::Bencher are. But honestly this looks like an accident and I think nobody else has noticed so far. I typed my previous message from memory of almost a year ago without checking, and when reading your reaction I thought I was wrong until I tried it.

I’ve just checked: none of rustc-test’s reverse dependencies use #[bench]. I think it would be fine to feature-gate it today until we figure out a plan.

My concern with the alternate proposals so far is that they take us back to the drawing board.

For what it’s worth, the proposal I made in the other thread and quoted above is made to be minimal, just enough to support rustc --test, leaving everything else to external code.


I am strongly in favor of having some solution for #[bench] available on stable. I could easily get behind @brson’s original plan, but I have to think about @SimonSapin’s plan a bit. I think the high-order bit for me is that I want benchmarking to feel “built-in” in the same way that unit testing does. @brson’s plan seems to achieve that. @SimonSapin’s plan could well achieve that too, though I think it has to be paired with a rust-lang-nursery crate that is on the fast track to ubiquity. I also think we should standardize a way to write tests and report results (more on this later) so that people can write benchmarks once and then experiment with different runners.

As far as writing tests, I am not that worried about forward compatibility, in part because I feel like the current benchmarking interface is about as simple as it gets, and hence we will always want a mode that operates something like it, even if we eventually grow more knobs and whistles. (The need to sometimes use “black box” is a drag, admittedly, but I’m not sure if that is easily avoided?) I’m curious if anyone has any concrete thoughts about alternatives.
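For illustration, here is a sketch of why some black-box hint is needed at all. It uses std::hint::black_box, which behaves like test::black_box; the benchmark body itself is invented:

```rust
use std::hint::black_box;

// Without black_box, the optimizer can see that the loop's result is
// unused and may delete the work entirely, making the "benchmark"
// measure nothing.
fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}

fn main() {
    let mut last = 0;
    for _ in 0..100 {
        // black_box is an identity function that is opaque to the
        // optimizer, so the computation must actually be performed.
        last = black_box(sum_to(black_box(1_000)));
    }
    println!("{}", last);
}
```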

It is true, of course, that it is relatively easy to run benchmarks on nightly with travis, but it also means that even if your library works on stable, you often need nightly compilers to run benchmarks. This means you can’t benchmark with the compiler your users are using, and it is also kind of a pain. It gives off a “Rust is not ready for prime-time” aura to me.

Another aspect that hasn’t been much discussed here is that I think we should stabilize a standard output format for writing the results. It ought to be JSON, not text. It ought to be set up to give more data in the future (e.g., it’d be great to be able to get individual benchmarking results). We’ve already got tools building on the existing text format, but this is a shaky foundation. (For example, the cargo-chrono tool can do things like walk through various commits, run benchmarks, aggregate multiple runs, compute medians and normalize, and generate plots like this one or this one that summarize the results. Meanwhile @burntsushi has the cargo-benchcmp tool.)
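As a sketch of what such a format might look like, assuming a JSON-lines record per benchmark (every field name here is invented; nothing is an existing libtest format):

```rust
// One machine-readable record per benchmark run, with room to grow
// (e.g. individual sample timings).
struct BenchRecord {
    name: &'static str,
    median_ns: u64,
    deviation_ns: u64,
    samples: Vec<u64>, // individual results, for richer tooling later
}

impl BenchRecord {
    // Hand-rolled serialization to keep the sketch dependency-free.
    fn to_json_line(&self) -> String {
        let samples: Vec<String> = self.samples.iter().map(|s| s.to_string()).collect();
        format!(
            "{{\"name\":\"{}\",\"median_ns\":{},\"deviation_ns\":{},\"samples\":[{}]}}",
            self.name,
            self.median_ns,
            self.deviation_ns,
            samples.join(",")
        )
    }
}

fn main() {
    let rec = BenchRecord {
        name: "bench_parse",
        median_ns: 1_250,
        deviation_ns: 40,
        samples: vec![1_210, 1_250, 1_290],
    };
    println!("{}", rec.to_json_line());
}
```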

Finally, I feel like having a stable way to write benchmarks might encourage more people to investigate how to improve it! I know that for Rayon I’ve found the numbers to be a bit unreliable (which was part of my motivation in writing cargo-chrono). I think part of this is that the current runner basically forces you to have closures that execute very quickly, which means I can’t write benchmarks that process a large amount of data. This all seems eminently fixable by building on the existing APIs. (I could also be totally wrong about what the problem is; but benchmarks run by hand seem to yield more stable numbers in some cases (not all).)
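One hypothetical direction for the large-data problem is a setup phase excluded from timing, so benchmarks don’t pay input-construction cost inside the timed closure. A sketch, with an invented method name and no actual timing, not the real Bencher API:

```rust
// A stand-in for a benchmark runner with a configurable iteration count.
struct SketchBencher {
    iterations: u64,
}

impl SketchBencher {
    // Run `setup` then `body` per iteration; a real harness would pause
    // the clock around `setup` and only time `body`.
    fn iter_with_setup<S, T, Setup, Body>(&mut self, mut setup: Setup, mut body: Body)
    where
        Setup: FnMut() -> S,
        Body: FnMut(S) -> T,
    {
        for _ in 0..self.iterations {
            let input = setup(); // untimed: build the (possibly large) input
            let _ = body(input); // timed: process it
        }
    }
}

fn main() {
    let mut b = SketchBencher { iterations: 3 };
    let mut total = 0usize;
    b.iter_with_setup(
        || (0..10_000).collect::<Vec<u32>>(), // big input, rebuilt per run
        |v| total += v.len(),                 // only this would be timed
    );
    println!("processed {} elements", total);
}
```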


One thing I’d love to see with both tests and benches is easier parameterization. Macros are a solution, but are limited in expressiveness.


I have seen similar issues when benchmarking hashmap changes, and I think they could be related to this


Another unrelated point: The current benchmark runner is underequipped for micro benchmarks.

We should be looking at JMH which is probably the gold standard in microbenchmark harnesses.

While I am in favor of aiming to improve the benchmarking harness, I think we should focus on getting something stable and usable first, as long as we feel there is room to expand it. Taking a quick glance at the JMH examples, it seems like there would be plenty of room to do so – either by further configuring #[bench], offering new methods on the Bencher callback, or extending the binary to support alternate execution modes.

Stabilizing this is likely a measurable step towards reducing reliance on nightly (which IS a good thing judging by the 2017 roadmap). The current #[bench] and friends are simple enough that they could easily be customizable/expanded in the future.

Since procedural macros are finally slowly getting stable, alternative test harnesses or benchmarkers should probably simply be procedural macros.

In other words if you want to use the default bencher you use #[bench], and if you want to use a third party library you add a dependency to awesome_library and you use #[awesome_library_bench].


Last time Alex and I worked through a custom test framework design, we landed on punting test definitions entirely to procedural macros, likely compiling down to today’s #[test] fns. For benchmarks there would possibly need to be some extensions to the test crate’s APIs since it itself is responsible for running today’s benchmarks and isn’t extensible to other approaches.

Or perhaps

```rust
use awesome_library::bench; // shadow the default bench macro

#[bench]
fn foo() { ... }
```

though I’ve not thought that through at all =)


I wonder if we could future-proof the Bencher type to be more abstract and use the already-working trick of overriding the test crate to provide alternate benching strategies. At least if we do start thinking about stabilizing Bencher we should see how much of its interface we can get away with locking down. The current interface seems quite specific, and I don’t have much conception myself of requirements for generic benchmarking.
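As a sketch of what a more abstract Bencher might look like, a trait that the standard harness and alternate test crates could both implement (entirely hypothetical; today’s test::Bencher is a concrete struct):

```rust
// Abstract over the benchmarking strategy: each implementation decides
// how many times to run the body and what to measure.
trait AbstractBencher {
    fn iter<T, F: FnMut() -> T>(&mut self, body: F);
}

// A trivial strategy that just counts invocations, standing in for a
// real timing-based implementation.
struct CountingBencher {
    runs: u64,
}

impl AbstractBencher for CountingBencher {
    fn iter<T, F: FnMut() -> T>(&mut self, mut body: F) {
        for _ in 0..10 {
            let _ = body();
            self.runs += 1;
        }
    }
}

fn main() {
    let mut b = CountingBencher { runs: 0 };
    b.iter(|| 2 + 2);
    println!("ran {} times", b.runs);
}
```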

I’m not keen on stabilizing more hard-coded annotations in rustc. Sure they make things easier now, but they will amount to extra unneeded complexity that never goes away once we have a beautiful custom test harness solution.

It’s already awkward that #[test] and #[bench] are always defined regardless of what crates one links and how one is compiling; that means we can’t backwards-compatibly define them solely in some library. Ideally, all testing functionality (runtime and macros) would come from a single multi-phase test crate, or a test and test-macros pair of crates.

The plugins-based approach has the advantages that test harnesses are not treated in a special way by the compiler, and that you can use multiple test harnesses per crate.