Pre-RFC: Stabilize `#[bench]`, `Bencher` and `black_box`


#1

This is just a quick strawman proposal about the unstable test feature. With the recent uptick in concern about the nightly channel I’m feeling more urgency to knock down the unstable feature list.

When I look at that link the thing that stands out to me is the "test" feature. This feature tends to be used to get access to either black_box or #[bench] and Bencher. We’ve been reluctant to stabilize these because they aren’t as good as they could be, preferring to hold off for more extensive custom test frameworks.

In a crude grep through crates.io I see:

  • 645 crates using black_box or Bencher
  • 752 using extern crate test

The discrepancy between the two is curious, possibly because I’m missing e.g. glob-imported black boxes, possibly because of other minor features people are using in the test crate.

There are only a few crates I see using test::TestDesc (and thus constructing their own test suites dynamically): the out-of-tree compiletest and syntax, and (interestingly) url. I see a few using test::stats (I don’t even know what this is…), but that seems like it can trivially be moved out of tree.

I suggest we stabilize #[bench], Bencher, and black_box as-is or nearly as-is.

I think this is reasonable because:

  • Even in unstable form, these features are low maintenance and, being so widely used, likely to stay around forever as-is. Stabilizing them, while distasteful, has relatively little maintenance impact.
  • A stable, working implementation of custom test frameworks that can support benchmarking is far away. We don’t even have a design.

The primary reason for doing so is to quickly eliminate a major source of nightly dependence, not because we love the feature.

So here’s specifically what I suggest:

  • Move black_box and Bencher to std::test, leaving unstable reexports.
  • Actively migrate all existing crates from test::black_box to std::test::black_box.
  • Mark std::test and #[bench] stable.

That’s it. We can do it real fast. And I think it’s important that we actively do the ecosystem migration, not just wait for authors to get around to it. Let’s move the numbers.
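For readers who have not used the feature, here is a minimal sketch of the shape of a benchmark written against this kind of API. The `Bencher` below is a hypothetical stand-in written so the example compiles on its own; it is not the real `test::Bencher`, and its fixed iteration count is an assumption made for brevity. The sketch imports `black_box` from `std::hint`, where recent stable toolchains expose it, rather than from the `std::test` path proposed above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `test::Bencher` (the real type is unstable).
struct Bencher {
    iterations: u32,
    elapsed: Duration,
}

impl Bencher {
    fn new(iterations: u32) -> Self {
        Bencher { iterations, elapsed: Duration::ZERO }
    }

    /// Runs `routine` a fixed number of times and records the total time.
    /// Each result goes through `black_box` so it is not optimized away.
    fn iter<T, F: FnMut() -> T>(&mut self, mut routine: F) {
        let start = Instant::now();
        for _ in 0..self.iterations {
            let _ = black_box(routine());
        }
        self.elapsed = start.elapsed();
    }
}

/// The code under measurement.
fn sum_to(n: u64) -> u64 {
    (1..=n).sum()
}

// With the real harness this function would carry `#[bench]`.
fn bench_sum(b: &mut Bencher) {
    // `black_box` on the input keeps the compiler from folding the sum
    // into a constant at compile time.
    b.iter(|| sum_to(black_box(1_000)));
}

fn main() {
    let mut b = Bencher::new(10_000);
    bench_sum(&mut b);
    println!("{} iterations in {:?}", b.iterations, b.elapsed);
}
```

The point is only that the user-facing surface is small: one attribute, one `iter` method, and one optimization barrier.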

Someday we can deprecate these if the tides change.

Why put these in std::test instead of stabilizing extern crate test? This is a tricky question. Today the test crate is an implementation detail. No part of the hypothetical Rust spec requires it. So it makes sense to hold that line and just move the few tiny features people need into std. That said, I am not confident the test crate will always be an implementation detail, and I suspect in the fullness of time it will go from de-facto stable to specced. In particular, last I recall, the std-aware cargo RFC requires cargo to know the test crate exists, though it does not require the test crate to be importable by user crates.

The wrinkles to consider here are the impact on custom test frameworks and on no-std testing (which so far has received little thought). I have not thought through either of these implications much, but I suspect stabilizing a few benchmarking types does not cause significantly worse problems for either effort.

Let me know what ya think.


Past, present, and future for Rust testing
#2

Oh, one other thing to think about before doing this is the implementation of #[bench]. The validation of benchmark signatures is a bit funky. The test pass is basically ensuring that a benchmark fn returns () and takes some argument… The error behavior may need to be considered a little more carefully.


#3

That’s not enough crates to justify stabilizing something that shouldn’t be. Also, even if half of all crates used these half-baked things, I would prefer we wait for something better to come along, and if it’s so important, designing good replacements should simply be prioritized.


#4

I’d welcome this!

I had thought that the sticky issue here was the stabilization of black_box, but I can’t remember the details.


#5

Counter-proposal: can we macros 1.1 this? That is, maybe full on custom test frameworks need something huge, but is there something small we can do for now?

Also, https://crates.io/crates/bencher will get you on stable for this stuff today.

I am also feelin’ that pressure of “let’s just stabilize some stuff”, but given how little attention this has gotten for years… I dunno.


#6

Stabilizing basic testing and benching is an excellent usability idea and I’m glad it’s happening.

If building a #![no_std] binary, can’t you just test and bench it separately as an external crate and pull in the standard libs separately? I think conventional testing is already “different” enough on embedded devices that this neither helps nor hurts people who are already accustomed to it.


#7

I make use of the test feature for benchmarks in several crates that I maintain, but I don’t feel any pressing need to stabilize these. It’s incredibly easy to make travis-ci run the benchmarks in the nightly version, so these being stable doesn’t really get me anything. If the feature genuinely isn’t ready, I’d rather it not be stable.

At the same time, I’ve been using it as-is for a couple years now, and it’s worked perfectly fine. (I do miss the ability to ratchet metrics, actually.) If it works and hasn’t changed, perhaps by definition it is stable?


#8

Thank you for pushing on this front Brian. In the external test harnesses thread I proposed an alternative:

I’d like to (first introduce as unstable and then) stabilize building blocks for external test harnesses:

  • A simplified version of TestDescAndFn, with just enough to describe what rustc --test generates. (So probably without dynamic tests.)
  • Some way to override which crate/function is used instead of test::test_main_static. Maybe #[test_harness] extern crate fancy_test;? (If benchmarks are supported in this mechanism, TestDescAndFn would probably have to be generic over Bencher.)
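To make the first building block concrete, here is a rough sketch of what a pared-down descriptor and a static runner might look like. Everything in it is hypothetical: the field names, the `Summary` type, and `run_tests` are illustrative assumptions, not the actual libtest definitions and not a worked-out design.

```rust
use std::panic;

/// Hypothetical, simplified descriptor: static tests only, no dynamic ones.
struct TestDesc {
    name: &'static str,
    ignore: bool,
    should_panic: bool,
}

struct TestDescAndFn {
    desc: TestDesc,
    testfn: fn(),
}

#[derive(Debug, PartialEq)]
struct Summary {
    passed: usize,
    failed: usize,
    ignored: usize,
}

/// Hypothetical external harness entry point: runs each test, catching
/// panics, and compares the outcome with what the descriptor expects.
fn run_tests(tests: &[TestDescAndFn]) -> Summary {
    let mut summary = Summary { passed: 0, failed: 0, ignored: 0 };
    for t in tests {
        if t.desc.ignore {
            summary.ignored += 1;
            println!("test {} ... ignored", t.desc.name);
            continue;
        }
        let panicked = panic::catch_unwind(t.testfn).is_err();
        if panicked == t.desc.should_panic {
            summary.passed += 1;
            println!("test {} ... ok", t.desc.name);
        } else {
            summary.failed += 1;
            println!("test {} ... FAILED", t.desc.name);
        }
    }
    summary
}

fn passes() {
    assert_eq!(1 + 1, 2);
}

fn panics() {
    panic!("expected failure");
}

// Roughly the kind of table a compiler could generate from `#[test]` fns.
static TESTS: &[TestDescAndFn] = &[
    TestDescAndFn { desc: TestDesc { name: "passes", ignore: false, should_panic: false }, testfn: passes },
    TestDescAndFn { desc: TestDesc { name: "panics", ignore: false, should_panic: true }, testfn: panics },
    TestDescAndFn { desc: TestDesc { name: "skipped", ignore: true, should_panic: false }, testfn: passes },
];

fn main() {
    let summary = run_tests(TESTS);
    println!("{:?}", summary);
}
```

Supporting benchmarks would need more than a plain `fn()` in the table, which is where the question of being generic over Bencher comes in.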

Just to clarify:

There are only a few crates I see using test::TestDesc (and thus constructing their own test suites dynamically): the out-of-tree compiletest and syntax, and (interestingly) url.

The url, idna, and html5ever crates each use https://crates.io/crates/rustc-test which is a copy of Rust’s src/libtest modified just enough to run on stable Rust. (The package name for Cargo and crates.io is rustc-test, but the library name for rustc and extern crate is test.) They do so to dynamically generate many tests that share the same code but take different input data. But rustc-test can also be used with #[test] and #[bench] to run benchmarks on stable Rust today.


#9

Woah #[bench] is stable? And rustc can be convinced to use an alternate test crate? How do you do that? That could change my opinions about the solution space considerably.

My concern with the alternate proposals so far is that they take us back to the drawing board. We are going back to the drawing board to design custom test frameworks, and relatively soon, but anything involving design must take considerably longer than essentially rubber stamping a de-facto solution.


#10

Yes, that works for many cases, but there is e.g. embedded code that cannot link to std, and that might e.g. need tooling help to marshal itself over to an emulator before running.


#11

In its own Cargo.toml, https://crates.io/crates/rustc-test uses:

[package]
name = "rustc-test"

[lib]
name = "test"

So having this in your Cargo.toml:

[dev-dependencies]
rustc-test = "0.1"

… makes Cargo pass this to rustc: --extern test=…/target/…/libtest-….rlib so that extern crate test; uses that crate instead of the standard library one. Since that crate doesn’t use #[unstable] (I removed them in this copy), this works fine on stable.

And yes, it looks like the #[bench] attribute itself is not feature-gated. Only (the standard library’s copy of) the test crate and test::Bencher are. But honestly this looks like an accident and I think nobody else has noticed so far. I typed my previous message from memory of almost a year ago without checking, and when reading your reaction I thought I was wrong until I tried it.

I’ve just checked, none of rustc-test’s reverse dependencies use #[bench]. I think it would be fine to feature-gate it today until we figure out a plan.

My concern with the alternate proposals so far is that they take us back to the drawing board.

For what it’s worth, the proposal I made in the other thread and quoted above is made to be minimal, just enough to support rustc --test, leaving everything else to external code.


#12

I am strongly in favor of having some solution for #[bench] available on stable. I could easily get behind @brson’s original plan, but I have to think about @SimonSapin’s plan a bit. I think the high-order bit for me is that I want benchmarking to feel “built-in” in the same way that unit testing does. @brson’s plan seems to achieve that. @SimonSapin’s plan could well achieve that too, though I think it has to be paired with a rust-lang-nursery crate that is on the fast track to ubiquity. I also think we should standardize a way to write tests and report results (more on this later) so that people can write benchmarks once and then experiment with different runners.

As far as writing tests, I am not that worried about forward compatibility, in part because I feel like the current benchmarking interface is about as simple as it gets, and hence we will always want a mode that operates something like it, even if we eventually grow more knobs and whistles. (The need to sometimes use “black box” is a drag, admittedly, but I’m not sure if that is easily avoided?) I’m curious if anyone has any concrete thoughts about alternatives.

It is true of course that it is relatively easy to use nightly to run benchmarking and with travis, but it also means that even if your library works on stable, you often need nightly compilers to run benchmarks. This means you can’t benchmark the compiler your users are using and it is also kind of a pain. It gives off a “Rust is not ready for prime-time” aura to me.

Another aspect that hasn’t been much discussed here is that I think we should stabilize a standard output format for writing the results. It ought to be JSON, not text. It ought to be set up to give more data in the future (e.g., it’d be great to be able to get individual benchmarking results). We’ve already got tools building on the existing text format, but this is a shifty foundation. (For example, the cargo-chrono tool can do things like work through various commits, run benchmarks, aggregate multiple runs, compute medians and normalize, and generate plots like this one or this one that summarize the results. Meanwhile @burntsushi has the cargo-benchcmp tool.)
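As a strawman for what one line of such output could look like, here is a small sketch that emits one JSON object per benchmark. The record type and every key name (`type`, `name`, `median_ns`, `deviation_ns`) are hypothetical choices for illustration, not an existing or proposed format, and the hand-rolled formatting assumes benchmark names are plain Rust paths that need no JSON escaping.

```rust
/// Hypothetical record for one benchmark; field names are illustrative.
struct BenchResult {
    name: String,
    median_ns: u64,
    deviation_ns: u64,
}

impl BenchResult {
    /// Renders one JSON object per line so tools can stream results.
    /// Assumes `name` contains no characters that need JSON escaping.
    fn to_json_line(&self) -> String {
        format!(
            "{{\"type\":\"bench\",\"name\":\"{}\",\"median_ns\":{},\"deviation_ns\":{}}}",
            self.name, self.median_ns, self.deviation_ns
        )
    }
}

fn main() {
    let result = BenchResult {
        name: "tests::bench_sum".to_string(),
        median_ns: 312,
        deviation_ns: 14,
    };
    // prints {"type":"bench","name":"tests::bench_sum","median_ns":312,"deviation_ns":14}
    println!("{}", result.to_json_line());
}
```

A keyed format like this leaves room to add fields later, such as per-iteration samples, without breaking tools that only read the keys they know.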

Finally, I feel like having a stable way to write benchmarks might encourage more people to investigate how to improve it! I know that for Rayon I’ve found the numbers to be a bit unreliable (which was part of my motivation in writing cargo-chrono). I think part of this is that the current runner basically forces you to have closures that execute very quickly, which means I can’t write benchmarks that process a large amount of data. This all seems eminently fixable by building on the existing APIs. (I could also be totally wrong about what the problem is; but benchmarks run by hand seem to yield more stable numbers in some cases (not all).)


#13

One thing I’d love to see with both tests and benches is easier parameterization. Macros are a solution, but are limited in expressiveness.
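For context, the macro approach looks roughly like the sketch below: a `macro_rules!` macro stamps out one function per case. The macro name, the case names, and `is_even` are made up for illustration. In a real test module each generated function would carry `#[test]`; here they are plain functions called from `main` so the example runs on its own.

```rust
/// The code under test.
fn is_even(n: u32) -> bool {
    n % 2 == 0
}

/// Hypothetical helper macro: generates one function per `name: input => expected` case.
macro_rules! parity_cases {
    ($($name:ident: $input:expr => $expected:expr,)*) => {
        $(
            // In a real test module this function would carry `#[test]`.
            fn $name() {
                assert_eq!(is_even($input), $expected);
            }
        )*
    };
}

parity_cases! {
    zero_is_even: 0 => true,
    seven_is_odd: 7 => false,
    ten_is_even: 10 => true,
}

fn main() {
    zero_is_even();
    seven_is_odd();
    ten_is_even();
    println!("all cases passed");
}
```

The limitation shows up quickly: every case needs a hand-written name, and the inputs must be written out in the macro invocation rather than computed or loaded from data.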


#14

I have seen related issues benchmarking hashmap changes and I think it could be related to this https://github.com/rust-lang/rust/pull/38779


#15

Another unrelated point: The current benchmark runner is underequipped for micro benchmarks.

We should be looking at JMH which is probably the gold standard in microbenchmark harnesses.


#16

While I am in favor of aiming to improve the benchmarking harness, I think we should focus on getting something stable and usable first, as long as we feel there is room to expand it. Taking a quick glance at the JMH examples, it seems like there would be plenty of room to do so – either by further configuring #[bench], offering new methods on the Bencher callback, or extending the binary to support alternate execution modes.


#17

Stabilizing this is likely a measurable step towards reducing reliance on nightly (which IS a good thing judging by the 2017 roadmap). The current #[bench] and friends are simple enough that they could easily be customizable/expanded in the future.


#18

Since procedural macros are finally slowly getting stable, alternative test harnesses or benchmarkers should probably simply be procedural macros.

In other words if you want to use the default bencher you use #[bench], and if you want to use a third party library you add a dependency to awesome_library and you use #[awesome_library_bench].


#19

Last time Alex and I worked through a custom test framework design, we landed on punting test definitions entirely to procedural macros, likely compiling down to today’s #[test] fns. For benchmarks there would possibly need to be some extensions to the test crate’s APIs since it itself is responsible for running today’s benchmarks and isn’t extensible to other approaches.


#20

Or perhaps

use awesome_library::bench; // shadow the default bench macro 

#[bench]
fn foo() { ... }

though I’ve not thought that through at all =)