I am strongly in favor of having some solution for #[bench]
available on stable. I could easily get behind @brson’s original plan, but I have to think about @SimonSapin’s plan a bit. I think the high-order bit for me is that I want benchmarking to feel “built-in” in the same way that unit testing does. @brson’s plan seems to achieve that. @SimonSapin’s plan could well achieve that too, though I think it has to be paired with a rust-lang-nursery crate that is on the fast track to ubiquity. I also think we should standardize a way to write benchmarks and report results (more on this later) so that people can write benchmarks once and then experiment with different runners.
As far as writing benchmarks goes, I am not that worried about forward compatibility, in part because I feel like the current benchmarking interface is about as simple as it gets, and hence we will always want a mode that operates something like it, even if we eventually grow more knobs and whistles. (The need to sometimes use “black box” is a drag, admittedly, but I’m not sure if that is easily avoided?) I’m curious if anyone has any concrete thoughts about alternatives.
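For concreteness, this is roughly what the current nightly interface looks like (just a sketch; `compute_something` is a made-up stand-in for whatever work is being measured):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Hypothetical stand-in for the work being measured.
fn compute_something(input: u32) -> u32 {
    (0..input).fold(0, |acc, x| acc ^ x)
}

#[bench]
fn bench_compute(b: &mut Bencher) {
    b.iter(|| {
        // `black_box` keeps the optimizer from discarding the result --
        // this is the "drag" mentioned above.
        black_box(compute_something(black_box(1000)))
    });
}
```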
It is true of course that it is relatively easy to use nightly to run benchmarks, including on Travis, but it also means that even if your library works on stable, you often need a nightly compiler to run benchmarks. This means you can’t benchmark with the compiler your users are actually using, and it is also kind of a pain. It gives off a “Rust is not ready for prime-time” aura to me.
Another aspect that hasn’t been much discussed here is that I think we should stabilize a standard output format for reporting the results. It ought to be JSON, not text. It ought to be set up to give more data in the future (e.g., it’d be great to be able to get individual benchmarking results). We’ve already got tools building on the existing text format, but this is a shaky foundation. (For example, the cargo-chrono tool can do things like walk across various commits, run benchmarks, aggregate multiple runs, compute medians and normalize, and generate plots like this one or this one that summarize the results. Meanwhile @burntsushi has the cargo-benchcmp tool.)
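To give a flavor of what I mean (not a concrete proposal; every field name and value below is made up purely for illustration), the per-benchmark record could be something serde-serializable along these lines, with room to add things like individual samples later:

```rust
#[macro_use]
extern crate serde_derive;
extern crate serde_json;

// Purely illustrative sketch of a per-benchmark JSON record;
// none of these field names are a concrete proposal.
#[derive(Serialize)]
struct BenchResult {
    name: String,
    iterations: u64,
    median_ns_per_iter: f64,
    deviation_ns: f64,
    // Room to grow: individual samples, so external tools can do
    // their own aggregation instead of re-parsing text output.
    samples_ns: Vec<u64>,
}

fn main() {
    // Placeholder numbers, just to show the shape of the output.
    let result = BenchResult {
        name: "bench_compute".to_string(),
        iterations: 1000,
        median_ns_per_iter: 52.0,
        deviation_ns: 4.0,
        samples_ns: vec![50, 51, 52, 55, 60],
    };
    println!("{}", serde_json::to_string_pretty(&result).unwrap());
}
```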
Finally, I feel like having a stable way to write benchmarks might encourage more people to investigate how to improve it! I know that for Rayon I’ve found the numbers to be a bit unreliable (which was part of my motivation in writing cargo-chrono). I think part of this is that the current runner basically forces you to have closures that execute very quickly, which means I can’t write benchmarks that process a large amount of data. This all seems eminently fixable by building on the existing APIs. (I could also be totally wrong about what the problem is; but benchmarks run by hand seem to yield more stable numbers in some cases (not all).)
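To illustrate the “closures that execute very quickly” problem (again just a sketch, with a made-up workload): `Bencher::iter` runs the closure many times per sample, so a benchmark over a genuinely large input either takes a very long time or you shrink the input until it stops resembling the real workload:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Hypothetical workload: summing a large buffer.
fn process(data: &[u64]) -> u64 {
    data.iter().sum()
}

#[bench]
fn bench_process_large(b: &mut Bencher) {
    // Setup can at least live outside the closure...
    let data: Vec<u64> = (0..10_000_000).collect();
    // ...but the closure itself is still run many times per sample, so a
    // large per-iteration workload makes the run very slow, and shrinking
    // it changes what is actually being measured.
    b.iter(|| black_box(process(black_box(&data))));
}
```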