Moved from https://users.rust-lang.org/t/past-present-and-future-for-rust-testing/14293.
Support for a custom testing harness has been a long-standing item on the Rust wishlist. Many a bug and PR has been closed or punted on with the reasoning that "this will be fixed by custom test harness support". This post tries to summarize the areas in which a custom test harness would be useful, the discussion so far, what features it would need to support, open questions, and some thoughts on how it might be implemented. It does so in the hope that it can help lay a foundation for further discussion and clarify what needs to be resolved before moving forward.
What is this, and why now?
See the IRC discussion from #rust-libs starting here (it's not very long). Long story short: hopefully this "summary" of the current state of affairs and proposals is useful for a meeting that may happen at the work week next week. And if not, it might still be useful as a starting point for improving `libtest`.
Why a custom test harness?
- Configurable test output formats. This is probably the largest single category of complaints about the current standard test harness: everything from how to format the output of `assert!` to supporting standardized machine-readable output formats like TAP, mozlog, and xUnit XML. Several users have also requested support for JSON output (#2234, #45923, #46450) for IDE integration.
- Stabilizing benchmarking. The second most sought-after feature is the eventual stabilization of built-in benchmarking (i.e., `#[bench]`). This has seen a lot of discussion (summarized below), but overall the wish is that benchmark "tests" in one form or another should be supported on stable.
- Grouping tests. The test harness that ships with Rust by default considers all tests equal. Every function annotated with `#[test]` will be run as a test, no matter where it is, and independently of any surrounding code or tests. There is no support for setup/teardown code for tests (#18043), nor a notion of test "suites" that are related (and may have shared setup/teardown).
- Finer control over test execution. Tests can currently be ignored (`#[ignore]`), allowed to fail (`#[allow_fail]`), or skipped entirely by passing a test name filter to `cargo test`. While this suffices for simple use-cases, there are plenty of situations where better control over which tests run when, and how, would be desirable (e.g., #45765, #42684, #43155, #46408, and #46417).
- Test generation. The `#[test]` annotation only allows a static set of tests that are defined in code. However, there are several cases where dynamic generation of tests would be welcome (e.g., parameterized tests (1, 2), tests with multiple fixtures, or other kinds of dynamically generated tests).
There has been significant previous discussion on this topic, in terms of desired features, goals, obstacles, and implementation approach. This post represents a very condensed summary of those discussions. I encourage interested readers to go read the linked threads. At a high level, I'll refer to @alexcrichton's comment on cargo#2790:
Most of what we've been thinking has been along the lines of separating the concern of test frameworks and test harnesses. That is, a test framework is the syntax that you actually define a test. I believe the descriptor crate would fall in this category, but not the expector crate. Our thinking is that we don't actually deal with test frameworks at all for now and instead punt them to the eventual stability of plugins. The one interface for `--test` to the compiler would be the `#[test]` attribute. In other words, any test framework would in the end "compile down" to a bunch of no-argument functions with a `#[test]` attribute (like we have today).

On the other hand, though, the test harness is something we consider is responsible for running these functions. The test harness today for example runs them in parallel, captures output, etc. It should be the case that any test harness is capable of being plugged into any test framework, so we'd just need a standard interface between the two. We're also thinking that test harnesses encompass use cases like running in IDEs, producing XML/JUnit output, etc.
Relationship to benchmarks
Let me first point you to 1, 2, 3, 4, 5, and 6.
These threads are mostly about stabilizing `#[bench]` by pulling things out to separate crates. These have mostly been closed or punted on, but there seems to be general consensus that benchmarking will see progress with custom testing frameworks.

It is unclear what the full story is here: do we want to treat test and bench as separate, or as part of the same problem (and thus should share a solution)? Personally, I think benchmarking and testing are sufficiently different that they shouldn't be handled through the same mechanism (i.e., `#[test]` and `#[bench]` should be handled by different runners). That said, as observed below, it may very well be that we want their output to be unified. This suggests that we may want two interfaces for tests/benchmarks: one for running, and one for formatting.
Amidst a discussion on stabilizing `#[bench]`, @nikomatsakis observes:
Another aspect that hasn't been much discussed here is that I think we should stabilize a standard output format for writing the results. It ought to be JSON, not text. It ought to be setup to give more data in the future (e.g., it'd be great to be able to get individual benchmarking results). We've already got tools building on the existing text format, but this is a shifty foundation.
He continues by pointing out that the output of benchmarking is already used by other tools (like @BurntSushi's cargo-benchcmp), and it'd be neat to have a single format all of these tools could consume. This suggests that there should be at least some connection between testing and benchmarking (namely their output format).
On the topic of benchmarking, he also observes that:
I think part of this is that the current runner basically forces you to have closures that execute very quickly, which means I can't write benchmarks that process a large amount of data. This all seems eminently fixable by building on the existing APIs.
That same thread has pointers to JMH as the gold standard for microbenchmarks. We may want to draw some inspiration from that in designing a benchmarking interface.
@tomaka suggests that different benchmark types should just be different annotations:

Since procedural macros are finally slowly getting stable, alternative test harnesses or benchmarkers should probably simply be procedural macros.

In other words if you want to use the default bencher you use `#[bench]`, and if you want to use a third party library you add a dependency to `awesome_library` and you use `#[awesome_library_bench]`.
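A rough sketch of what that might look like from a user's perspective. The `awesome_library` crate, its `Bencher` type, and the `#[awesome_library_bench]` attribute are all hypothetical stand-ins from the quote above, with an API mirroring today's `test::Bencher`:

```rust
// Hypothetical: `awesome_library` is a [dev-dependencies] entry that exposes
// a procedural macro attribute instead of relying on the built-in #[bench].
extern crate awesome_library;

#[awesome_library_bench]
fn sum_small_range(b: &mut awesome_library::Bencher) {
    // The closure is what gets timed, just like with test::Bencher::iter.
    b.iter(|| (0..1000u64).sum::<u64>());
}
```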
@alkis lists a number of further requests for benchmarking here.
Implementation thoughts
Separating runners and formatters
PR#45923 and PR#46450 suggest providing JSON test output as a stream of test events. Specifically:
{ "type": "suite", "event": "started", "test_count": "1" }
{ "type": "test", "event": "started", "name": "f" }
{ "type": "test", "event": "failed", "name": "f" }
{ "type": "suite", "event": "failed", "passed": 0, "failed": 1, "allowed_fail": 0, "ignored": 0, "measured": 0, "filtered_out": "0" }
{ "type": "test_output", "name": "f", "output": "thread 'f' panicked at 'assertion failed: `(left == right)`
left: `3`,
right: `4`', f.rs:3:1
note: Run with `RUST_BACKTRACE=1` for a backtrace.
" }
Leaving the fact that it's JSON aside for a second, this kind of event streaming seems like a solid foundation on which to build both test runners (emit events as they occur), and output formatters (decide how you want to show each event/set of events). This could also be extended to benchmarks:
{ "type": "benchmark", "event": "started", "name": "b" }
{ "type": "benchmark", "event": "finished", "name": "b", "runtime": 300, "iterations": 100 }
By introducing this abstraction, test runners and output formatters can be nicely decoupled. This also addresses a concern that was raised in RFC1284: a user may want to change testing output at runtime with flags (for example based on whether the consumer is a human or a machine). This kind of design would let a crate pick its test runner (perhaps because it needs a particular feature), while the user can independently pick the formatter. @tomjakubowski suggests something like `--format=tap`. The thread also has more requests for pluggable output formats further down. This design also allows features that might otherwise be tricky, such as exposing test results in a streaming fashion.
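One way to picture this decoupling in code: an event type that roughly mirrors the JSON lines above, which runners produce and formatters consume. This is only a sketch of mine; none of these names exist in libtest today.

```rust
/// Sketch of the events a test runner might emit; each JSON line above
/// corresponds to one of these variants.
pub enum TestEvent {
    SuiteStarted { test_count: usize },
    TestStarted { name: String },
    TestOk { name: String },
    TestFailed { name: String },
    TestOutput { name: String, output: String },
    SuiteFinished { passed: usize, failed: usize, ignored: usize, filtered_out: usize },
    // The same stream can carry benchmark results:
    BenchStarted { name: String },
    BenchFinished { name: String, runtime_ns: u64, iterations: u64 },
}
```

A JSON formatter would serialize each event as one line, while the current human-readable output or a TAP formatter would render the very same stream differently; the runner never needs to know which one is attached.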
Choosing a test runner
Crates may want to pick a specific test runner because it provides a particular feature (like test setup/teardown), or perhaps because it integrates deeply with some other external system. This process should be relatively painless. To quote @alexcrichton:
my hope is that you just drop a `[dev-dependency]` in your `Cargo.toml` and that's almost all you need to do
This has much the same flavor as changing the global, default allocator (RFC1974, #27389), and we may be able to borrow the solution from there. For example, one could imagine an interface like:
```rust
use std::test::TestRunner;

struct MyTestRunner;

impl TestRunner for MyTestRunner {
    // ...
}

#[test_runner]
static RUNNER: MyTestRunner = MyTestRunner;
```
API for users
From https://internals.rust-lang.org/t/pre-rfc-stabilize-bench-bencher-and-black-box/4565/19:
Last time Alex and I worked through a custom test framework design, we landed on punting test definitions entirely to procedural macros, likely compiling down to today's `#[test]` fns. For benchmarks there would possibly need to be some extensions to the test crate's APIs since it itself is responsible for running today's benchmarks and isn't extensible to other approaches.
This seems pretty reasonable to me. In fact, using just `#[test]` as it exists today is probably sufficient. Additional ways of specifying tests (e.g., stainless) can use macros to produce `#[test]`-annotated functions, but this does not require special support in libtest.
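As a small illustration of "compile down to `#[test]`" that already works on stable today (my own sketch, not stainless's actual syntax), a declarative macro can expand a terse table of cases into ordinary test functions:

```rust
fn double(x: i32) -> i32 {
    x * 2
}

// Each entry expands to a plain no-argument #[test] function, so the
// default harness (or any future runner) can treat them like any other test.
macro_rules! double_tests {
    ($($name:ident: $input:expr => $expected:expr;)*) => {
        $(
            #[test]
            fn $name() {
                assert_eq!(double($input), $expected);
            }
        )*
    };
}

double_tests! {
    doubles_one: 1 => 2;
    doubles_two: 2 => 4;
    doubles_negative: -3 => -6;
}
```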
One addition that might be nice for supporting fixtures and parameterized tests is to introduce an annotation like `#[test_generator]`, which marks a function that dynamically generates tests to run. Something like:
```rust
#[test_generator]
fn test_many() -> impl Iterator<Item = (String, Box<FnOnce() -> ()>)> {
    (0..10).map(|i| {
        (
            format!("test_one_{}", i),
            Box::new(move || test_one(i)) as Box<FnOnce() -> ()>,
        )
    })
}
```
Currently, the calls to `test_one` would need to be made directly from `test_many`, but this both forces those tests to be serialized, and causes errors in one generated test to also fail all the others. With a test generator-like construct, these could be run independently by the runner.
Another feature that is often requested is support for test suites. Here, I'd like to echo @brson from this comment:
As to some of the requirements you mention, nesting of tests we see as being done through the existing module system. Rust tests already form a hierarchy through their module structure and naming, they just aren't represented or presented that way in the test runner. The way we've been envisioning this nesting working is as a feature of the test runner (or harness): they would take the flat list of test names and turn it into a tree based on their names, which already reflect the module hierarchy.

To support arbitrary test names I might suggest further test attributes like:

```rust
#[test_name = "some spec"]
mod {
    #[test]
    #[test_name = "some spec"]
    fn mytest() { }
}
```

If we end up accumulating a lot of new requirements for features that require communication between the frontend test frameworks and the backend test runner, and therefore a lot of new annotations to support them, then this plan could start looking pretty messy. But I'm still hopeful that most test framework features can be expressed either as syntactic sugar over a simple test definition that we mostly already have; or as features of the program actually executing the tests; without a whole lot of new information necessary to communicate between the two.
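To make the "flat list of names into a tree" idea concrete, here is a minimal sketch of mine (none of these types are proposed anywhere): a runner or formatter can rebuild the module hierarchy purely from the `path::to::test` names it already receives.

```rust
use std::collections::BTreeMap;

/// A hypothetical tree of tests, keyed by module path segment.
#[derive(Default)]
struct TestTree {
    children: BTreeMap<String, TestTree>,
    tests: Vec<String>,
}

impl TestTree {
    /// Insert a flat test name such as "net::tcp::connects_to_localhost".
    fn insert(&mut self, full_name: &str) {
        let mut segments: Vec<&str> = full_name.split("::").collect();
        let test = segments.pop().expect("test name is never empty").to_string();
        let mut node = self;
        for module in segments {
            node = node
                .children
                .entry(module.to_string())
                .or_insert_with(TestTree::default);
        }
        node.tests.push(test);
    }
}

fn main() {
    let mut tree = TestTree::default();
    for name in &["net::tcp::connects", "net::udp::sends", "parse::empty_input"] {
        tree.insert(name);
    }
    // `tree` now mirrors the crate's module hierarchy and can be rendered
    // as nested suites by whatever formatter is attached.
}
```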
API for the runner
This is where a lot of the trickiness (and bikeshedding) will arise. Some questions:
- How does the runner learn about command-line arguments?
- How are tests and suites communicated to the runner? An iterator? Consecutive method invocations?
- How does the runner asynchronously report test results? Futures? Channels? Some abstracted form of Writer?
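Purely as a strawman answer to these questions (every name here is mine and nothing is settled): arguments could arrive through a configure step, tests as a plain vector of descriptors, and results asynchronously over a channel carrying the event type sketched earlier.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical description of a single test as handed to the runner.
pub struct TestDescriptor {
    pub name: String,
    pub ignored: bool,
    pub allow_fail: bool,
    /// The compiled-down, no-argument test function.
    pub test_fn: Box<FnOnce() + Send>,
}

/// Strawman runner interface (TestEvent is the enum sketched in
/// "Separating runners and formatters" above).
pub trait TestRunner {
    /// Receive whatever arguments cargo forwarded after `--`.
    fn configure(&mut self, args: &[String]);

    /// Run the tests in whatever order/parallelism the runner chooses,
    /// reporting progress asynchronously through the channel.
    fn run(&mut self, tests: Vec<TestDescriptor>, events: Sender<TestEvent>);
}
```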
API for the formatter
The interface here is likely much more straightforward. Probably all that is needed is a number of methods that are called when an event occurs (e.g., suite starts, test starts, test finishes, etc.), and it's up to the formatter to eventually write its formatted results to some designated output. Some questions:
- Does the formatter always write to STDOUT? If not, how is it informed about the correct output? A `Write`, perhaps?
- How does the test runner actually use a formatter? If the user puts `tap-format` in their `[dev-dependencies]` and gives `--format=tap`, how does the test binary end up feeding events from the test runner to the formatter? Does it recompile to link against `tap-format` specifically, or does it always link against all formatters specified in `[dev-dependencies]` and dynamically switch between them depending on the provided `--format` flag?
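For comparison with the runner sketch above, a strawman formatter interface (again, all names are hypothetical): one callback per event, writing to a `Write` handed in by the harness rather than assuming stdout.

```rust
use std::io::{self, Write};

/// Strawman formatter: the harness calls one method per event and decides
/// where the output goes by passing in the `Write`.
pub trait TestFormatter {
    fn suite_started(&mut self, out: &mut Write, test_count: usize) -> io::Result<()>;
    fn test_started(&mut self, out: &mut Write, name: &str) -> io::Result<()>;
    fn test_passed(&mut self, out: &mut Write, name: &str) -> io::Result<()>;
    fn test_failed(&mut self, out: &mut Write, name: &str, output: &str) -> io::Result<()>;
    fn suite_finished(&mut self, out: &mut Write, passed: usize, failed: usize, ignored: usize) -> io::Result<()>;
}
```

Under this split, `--format=tap` would only select which formatter implementation gets instantiated; the runner does not change.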
Open questions:
- @brson points out that we'd like to support other types of tests from cargo in the same framework:
- We'll probably want support for setup/teardown for doctests.
- How do we distinguish between tests with the same name (potentially in different suites that run in parallel)? Including filename+line would help with this, but we probably also don't want to include that with every event.
- If the test runner dictates what command-line arguments are available, and how they're interpreted (e.g., regex test name filtering), users may be surprised when they move between crates and `cargo test` behaves differently. Do we need a third layer that chooses which tests to run? Should certain filters/flags be supported across all runners?
- Do we want additional annotations, such as whether some tests must run serially (1, 2), to be a feature of the test runner or a feature of libtest? If the former, how are annotations communicated to the runner? If the latter, how is this enforced when we don't have control over how tests are run? From the linked threads:

  Discussed at the dev-tools meeting today, this problem can be solved today using a mutex (I filed #43155 to document that). Given that this solution feels somewhat ad hoc and there is a workaround, we would prefer to punt on adding this in favour of custom test runners or some other long-term solution.

- How are command-line arguments to libtest/the test runner/the formatter handled and distributed? This comment, and the following discussion, brings up a good point about flags and options to the test runner through `cargo`. What should, e.g., `cargo test --verbose` do with a custom testing framework? It's likely that we'd want `cargo` to forward nearly all options and flags (especially unknown ones) to the test runner (currently realized by including them after `--`).
- @ekiwi brings up some interesting points about testing on embedded platforms:
  - the test harness should be able to be made up of at least two different binaries: one that runs on the host pc and another one that runs on the target
  - there needs to be a mechanism to be able to compile only a certain number of tests, so that we can make sure that they still fit on the target

  It's unclear that we actually want to solve this with this new custom test framework design, but the resulting utest-rs crate may be worth checking out.
- @jsgf asks for the ability to "generate a list of all the tests/benchmarks without actually running them (machine parseable)". How might we do this? Perhaps by having a "dummy" runner that just emits each test it is given without running it? (See the sketch just below this list.)
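Sticking with the strawman `TestRunner` interface sketched earlier, such a "dummy" runner is almost trivial; `ListOnlyRunner` is, of course, hypothetical:

```rust
/// A runner that lists every test it is handed, one name per line,
/// without executing anything. (Uses the hypothetical TestRunner,
/// TestDescriptor, and TestEvent types sketched earlier.)
struct ListOnlyRunner;

impl TestRunner for ListOnlyRunner {
    fn configure(&mut self, _args: &[String]) {}

    fn run(&mut self, tests: Vec<TestDescriptor>, _events: Sender<TestEvent>) {
        for test in tests {
            println!("{}", test.name);
        }
    }
}
```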
Other threads:
- Issues · rust-lang/cargo · GitHub: Add option for printing test output summary.
- PR#44813 and issue #43381:

  The testing framework should have a logging/reporting interface and the teamcity service messages could be one such listener (the console and crate cargo-test-junit being other implementations of test listeners). (@gilescope)

- crates.io: Rust Package Registry: Pulls out all of the Rust testing, including benchmarking.

  Passes `--extern test=…/target/…/libtest-….rlib` to `rustc` so that `extern crate test` uses that crate instead of the standard library one. Since that crate doesn't use `#[unstable]`, this works fine on stable.

  It's not clear that this solves anything, it just makes the same testing code that currently exists in `std` externally accessible. But it is cool that it works.
- PR#46417 and issue #46408: Test name filters do not allow for disambiguating tests with overlapping names. More generally, the current test name filtering is extremely limited, and would be something that a custom test runner would need to be able to augment.
- meeting-minutes/weekly-meetings/2015-03-24.md at 64c85df3b86d2c19183257892294d4ad80cdd0a8 · rust-lang/meeting-minutes · GitHub: Some short internal discussion about `libtest` and benchmarks:
- Issues · rust-lang/rust · GitHub: Request for support of "run test near this line" for editors.
/cc @dtolnay @alexcrichton @nrc @nikomatsakis @llogiq @QuietMisdreavus @steveklabnik