Past, present, and future for Rust testing

This seems like a great plan for tests, but I don’t think it will work for benchmarks. I suspect it won’t be great for fuzz tests either, but I’ll let Manishearth comment on that.

It makes sense to abstract over test runners because all unit tests produce pretty much the same output - name, pass/fail, optional error. This holds for quickcheck tests and cucumber tests and JUnit tests.

There’s no such common denominator for benchmarks. We can already see that - Criterion produces a number of statistics that libtest benchmarks don’t. Will you add a TestEvent variant that includes distributions for all of the various statistics? Now, you could add one that reports a series of iteration counts and times, sure. Will you then add another for measuring and reporting the amount of memory allocated per iteration (as has already been requested)? And another for reporting information from CPU performance counters (also requested)?

The output formatting of a benchmark harness has to know about the measurements that harness makes. It doesn’t make sense to have interchangeable formatters and runners for benchmarks like it does for tests.

It seems to me that everything you propose could be done as a crate on top of Manishearth’s proposal except for register_test_component! - and the goals for that could probably be accomplished in another way. It would be a great crate - by all means, please build that crate. For benchmarks, though, I’d really rather just generate my own main.

But register_test_component! is specifically the thing that's needed to support custom test frameworks. Without that, there is no way to discover functions/whatevers defined at compile time, which means you couldn't write your own test runner! In particular, it would mean that every testing framework is limited to functions annotated with the #[test] attribute, which is insufficient to express things like BDD, Catch C++ style declarative tests, pytest dependency injection (probably), etc.

If we wanted benchmarks to integrate nicely with formatters (or rather, if we want formatters to generally be able to show at least some limited benchmark output), we could do something like:

#[non_exhaustive]
enum Measurement {
    Time(usize),
    Ops(usize),
    Rss(usize),
    Other {
        name: String,
        value: String,
    }
}

#[non_exhaustive]
enum TestEvent {
    // ...
    BenchmarkResult {
        name: String,
        measurements: Vec<Measurement>,
    }
}
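
A benchmark runner could then report a coarse result without the formatter knowing anything harness-specific. Here is a minimal sketch, assuming the enums above; the `emit` callback is made up and just stands in for however events actually reach the formatter:

// A minimal sketch using the enums above; `emit` is a stand-in for
// however events actually reach the formatter.
fn report_bench(emit: &mut dyn FnMut(TestEvent), name: &str, time_per_iter: usize) {
    emit(TestEvent::BenchmarkResult {
        name: name.to_string(),
        // A simple harness might only report time; richer ones can push
        // additional Measurement values or fall back to Measurement::Other.
        measurements: vec![Measurement::Time(time_per_iter)],
    });
}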

If we don't, or if you have some very complex property, then it sounds like you'd want to simply bypass the formatter entirely (after all, if the formatter is, say, TAP, how would you even report benchmark results?). In that case, your benchmark output is very tightly connected to your test runner, and the test runner should simply produce its own output (e.g., by calling println! directly).

But that's effectively what my proposal lets you do? You can put whatever you want in your runner's run method, and that's called directly from the resulting main. And there is very little extra around that. And the register_test_component! stuff would let you easily find benchmarks declared with #[bench], or potentially even come up with a better/more elaborate syntax for them using other procedural macros that eventually call register_test_component!.
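To make that concrete, here is a rough sketch of how a #[bench]-style attribute might bottom out in register_test_component!. The macro's actual signature isn't spelled out in this thread, so the shape below (recording a kind, a name, and a function for the generated main to hand to the runner) is purely illustrative:

// Purely illustrative; the real register_test_component! interface is not
// specified here. Assume it records entries that the generated main()
// later passes to the runner's run() method.
//
// What the user writes:
//
//     #[bench]
//     fn push_one_thousand() { /* benchmark body elided */ }
//
// What the #[bench] procedural macro might expand that into:
fn push_one_thousand() { /* benchmark body elided */ }
register_test_component!(kind = "bench", name = "push_one_thousand", func = push_one_thousand);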

One place where things will get tricky is if a user wants to use two different runners — one for tests and one for benchmarks — within the same crate (keep in mind that files outside of src/ are in separate crates). Not sure how to solve that one yet… Though we may also want to punt on that for the time being, and simply require that in src/ the user can only use whatever benchmarking capabilities are provided by their src/ runner, and any benchmarks where they want a more specialized runner would have to go in, say, bench/whatever.rs.

Some info that property based testing frameworks may want to provide:

{ "type": "test", "runner": "proptest", "runner_type": "property_based_testing", "event": "shrinking", "name": "complicated" }
{ "type": "test", "runner": "proptest", "runner_type": "property_based_testing", "event": "shrinking", "name": "simplified" }
{ "type": "test", "runner": "proptest", "runner_type": "property_based_testing", "event": "shrinking", "name": "found_minimal" }
"details": {
  // The minimal failing input (or none if all tests passed)
  "minimal_input": "Point { x: 1, y: 0 }",
  // Denotes how many actual randomly generated test cases that passed.
  "passed": 78, 
  // Denotes how many randomly generated test cases that failed.
  "failed": 3,
  // Dito for the number that was filtered out by the test runner because of some  .filter(|x| predicate(x)) logic
  // See: https://docs.rs/proptest/0.3.2/proptest/strategy/trait.Strategy.html#method.prop_filter    
  "ignored": 10,
  // The number of generated cases that was required to pass the test.
  "required": 100,
  // The time it took in microseconds (or nanoseconds) to generate a test case on average:
  "avg_gen": 10,
  // The time it took to execute the function under test on average:
  "avg_exec": 55,
}

See: https://github.com/rust-lang/rfcs/pull/2234#issuecomment-350955070

PS: I think testing and benchmarking are so fundamentally different activities that it might be best to discuss them in two different threads.

I don’t think that designing a good test runner and designing the integration of testing with Rust should be part of the same conversation.

I strongly agree with some of the thoughts in this thread that improving testing in Rust should be primarily about making it possible to provide a library solution. Everything that needs to be baked into rustc is something that other kinds of (non-test) scenarios won’t be able to take advantage of.

In particular, I would be really happy if testing looked like:

#[cfg(test)]
extern crate my_test_runner;
// proc-macro-2, right?
#[cfg(test)]
my_test_runner::main!();

#[cfg(test)]
mod test {
    use my_test_runner::mtest;

    #[mtest]
    fn do_test() {}
}

And then cargo test works correctly.
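
For illustration only, my_test_runner::main!() could expand to something along these lines; registered_tests() is a made-up name standing in for whatever macro-to-macro registration mechanism ends up backing it (see the first bullet below):

// A purely hypothetical expansion of my_test_runner::main!(); the
// registered_tests() function is invented for illustration and assumed
// to return (name, fn) pairs collected by the #[mtest] attribute.
fn main() {
    for (name, test_fn) in my_test_runner::registered_tests() {
        // A real runner would also catch panics, apply command-line
        // filters, and hand results to an output formatter.
        println!("running {}", name);
        test_fn();
    }
}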

The primary things that I see being required for this are:

  • being able to register the results of one macro for use with another (I disagree that this is less general than being able to specify a test runner globally like a global allocator)
  • being able to tell cargo that there are specific mains that should be generated for specific configs

This suggests you would want something in Cargo.toml along these lines (very much a strawman):

[alternate_modes]
test = { profile = "dev", cfg = "test", is_executable = true }

But other than a registry and telling cargo to pass cfg flags, what else do you need?

To my mind this solves lots of problems.

  • You don’t need a single test package for the whole project! Libtest could expose the equivalent of my_test_runner::main!() above and you could do:

    #[cfg(test)]
    fn main() {
        rustc_test::main!();
        my_test_runner::main!();
    }
    
    #[cfg(test)]
    mod test {
        #[test]
        fn old_test() {}
    
        #[my_test_runner::mtest]
        fn new_test() {}
    }
    
  • Everything that is necessary for this to work would make Rocket happier

  • Test framework versioning

  • Providing a way to hook into cargo from procedural macros (opt-in, via Cargo.toml?) would maybe make some interesting things possible? I admit that testing is the only use case I can think of for this, but being able to say "install this library and add a flag to Cargo.toml and cargo fuzz will just work" seems kind of magical

  • rusttest 2.0 could be developed out of tree and versioned independently

There’s also the matter of:

  • filtering tests (can still be done via function path substring matching)
  • passing parameters to the test runner from cargo in a way that is ergonomic

Getting reporting right is also non-trivial - different runners may want to report different amounts of information, as I’ve shown in the snippet above with hypothetical richer info from proptest. There’s also richer information that can be provided on success.

What you describe looks mostly good, though.

While I agree with @quodlibetor that it'd be neat to be able to specify other mains beyond those of test, I think trying to tackle that in the same breath as custom test frameworks will cause us to make no progress. As far as I can tell, my proposal above could be extended to what @quodlibetor proposes, but I think we should deal with this problem in two phases. First get custom test frameworks to a good place, and get some experience from that process, and then see if we want to expand that to also include arbitrary compile "modes" where libraries can provide their own main()s.

I think this is a very important point, and probably something we all agree on. I believe my proposal also provides this.

I think filtering should be completely handled by the test runner. This does mean that crates with different runners will have different arguments to cargo test, which is a little unfortunate, but I think otherwise we'd be constraining them too much. I don't have any concrete examples, but I suspect that forcing all test frameworks to support the same flags would bring some pain. We might want to think about requiring that the first positional argument is a filter for which tests to run, but even then we have questions like "are regexes supported?".

This is related, and also very tricky. I think we'll simply want to stipulate that cargo test passes all of its flags on to the test runner, and the test runner decides what to do with them. Maybe it makes sense to specify some flags that must be supported to promote consistency, but I don't have a good sense of what those would be yet.
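
As a concrete (if simplistic) example of both points, a runner that treats the first positional argument as a substring filter needs nothing from cargo beyond having the arguments forwarded verbatim. A minimal sketch, with the matching policy entirely up to the runner:

// A minimal sketch of runner-side filtering, assuming cargo test forwards
// its arguments untouched: the first positional argument (if any) is
// treated as a substring filter over test names.
fn matching_tests(all: &[(&'static str, fn())]) -> Vec<(&'static str, fn())> {
    let filter = std::env::args().nth(1);
    all.iter()
        .filter(|(name, _)| filter.as_deref().map_or(true, |f| name.contains(f)))
        .copied()
        .collect()
}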

Absolutely. Though I am of two minds here about how we actually go about that. My thinking with TestEvent was that there should be some common set of events that we expect basically every test runner to want to report. And "standardized" formatters would deal only with those events. If a runner is very special/different, it probably doesn't make sense to try to fit it with any formatter the user might think of using. Instead, those test runners will likely bypass the output formatter, and instead produce their own output directly. However, I expect that to be relatively rare -- I think most testing frameworks can get away with a very narrow set of events.

I've been wrestling with this as well, and don't have a good, clean answer. I think it makes sense that the mechanism we use for introducing custom test frameworks into Rust should be general enough to also support benchmarking (e.g., both my suggestion and @quodlibetor's do that, I believe), even if the forms those runners take are very different. It's true that it'll probably be tricky to have test output formatters integrate with benchmarks, but that's fine I think.

I think at the very least, the formatter should be able to handle events from PBT-frameworks such as "complicated/simplified/found_minimal" because PBT is simply better in a lot of cases than unit testing and so more important. I think that the runner should be able to identify itself in the output and also specify what category of runner it belongs to (pbt/unit-testing/other...).

IMHO, runners should also be able to specify extra data in some free-form JSON-like object notation - that way, you can support both the minimal and the maximal. Depending on what category the runner belongs to, formatters can then format the extra info in some specific way.
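
One way to model that (a sketch only; none of these field names come from the proposals above) is to give the common result type a free-form details map next to the structured fields, so minimal runners leave it empty and proptest-style runners fill in things like minimal_input and shrinking statistics:

use std::collections::BTreeMap;

// Sketch only; field names are invented for illustration.
struct TestResult {
    name: String,
    passed: bool,
    // Which category of runner produced this (e.g. "property_based_testing"),
    // so a formatter can decide how to render the details.
    runner_type: String,
    // Free-form, JSON-like extra data: minimal runners leave it empty,
    // richer runners put shrinking stats, minimal_input, timings, etc. here.
    details: BTreeMap<String, String>,
}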

Bypassing the formatter comes at a high price - IDEs can no longer work with your runner.

@jonhoo I agree that your proposal works for developing out of tree, and I also agree that mine seems like a much bigger lift than just hooking up a couple more things.

My uneducated guess about how the current test framework works is that it’s somewhat similar to proc macros, but with more visibility into the internals of the compiler. I would maybe hope that, instead of blocking rusttest2 on as-yet unproposed enhancements to proc-macro2, the experience building rusttest2 could be used for a proposal for proc-macro3?

I don’t want to argue to slow down moving test out of tree, but it just seems clear that everything that is custom in-tree for test is something that libs can’t express, so we should be working to clear that out.

Just for the sake of posterity, there is also a decently long chain of discussion from IRC starting at https://mozilla.logbot.info/rust/20180116#c14136417 between @Manishearth, @Centril, and myself.

I’ve drafted an eRFC that tries to incorporate all the concerns brought up so far.

Please let me know your thoughts. I’ll probably be posting this on the RFCs repo in a week or two depending on what the dev tools team thinks.

I’ll try to sum up my thoughts from our IRC discussion + the posted eRFC.

  • I think there is value in trying to not aim for a fully general solution in the first instance, and instead come up with a targeted solution for testing and benchmarking which we believe can later be extended to a general mechanism.
  • I think the proposed eRFC focuses too much on the cargo integration (given my first point above), and too little on the needs of test runner authors. Specifically, I think it is insufficient to support iterating over only functions with a given attribute. I think it would be simpler to have a mechanism by which a procedural macro (which can already define a new attribute) can generate “points of interest” that the test runner can later identify. This is not incompatible with the proposed “test frameworks are full-crate macros” approach, but rather a different proposal for how one would write that macro.
  • There is some appeal to having the test runner fully generate its own main, rather than the somewhat “magical” autogenerated main I proposed further up in this thread.
  • I think we need to seriously think about how we are going to handle namespacing of attributes (or “points of interest” as suggested above) here. What if two testing frameworks both want to use the annotation #[test]? Or #[test::setup]? Should we disallow that? That seems unfortunate. Some kind of namespacing would be nice. Or maybe we require that they are under different #[cfg(..)] blocks?
  • The notion of a “helper crate” for writing testing frameworks strikes me as the wrong approach, and is what suggests to me that the proposed solution is too general/broad. I think we should only expose a restricted/specialized helper interface, without also committing to the general design, so that we can get some experience with how people use that, and what they use it for.
  • I worry a little about adding too many special Cargo.toml keywords and sections, but perhaps this is just me being overly concerned. I would rather see crates declaring in-source that they want to use a particular framework, as that means you can generally understand what’s going on from looking at just the given .rs file, instead of also needing to look up in Cargo.toml. I could be convinced otherwise though.

I mentioned this in IRC, but I find that exposing a general interface with a specific one as a library is actually better, since the general API is simpler (it’s not introducing anything new aside from one attribute). Additionally, folks can improve the helper API to address their use cases themselves once it’s possible to experiment with this. Stabilizing the helper functionality as core is stabilizing more things and may even involve stabilizing an AST repr which we don’t want to do.

This is the exact same model used for proc macros; the proc macro interface is pretty minimal and everything is farmed out to syn. The nice thing is that folks contribute back to syn to improve it for their use cases.

This also means we don’t have to worry about addressing everything even when stabilizing. If we miss something, it can always be added later to the helper crate even if it’s a breaking change. Not so for a rustc feature.

I’ve focused on the cargo integration because that’s the trickiest part IMO; you can already write custom test runners using tricks like those used by compiletest or cargo-fuzz, the problem is that these have to emulate what cargo does and it’s always terrible and hard to get right.

The cargo stuff is not something I expect to benefit from iteration in the experimental period – folks understand and contribute to libraries they use, but this is less true for the tools they use. But I expect the helper to get a lot of improvements. However discussing possible APIs now and adding them to the eRFC is fine by me too; I was attempting to come up with what I thought would be the minimum viable eRFC. “Collect points of interest” seems like a good kind of API to have. We can tweak the current one to work on all items.

Regarding clashes, you can namespace attrs already, and we should encourage that for custom ones. But it’s also fine if two frameworks wish to use the test attr, provided they don’t expect anything different from it.

I think testing frameworks are largely a cargo concern; it is already cargo that orchestrates the running of tests in a folder, etc.

I like the idea of having points of interest.

I think this is bound to lead to problems... Two frameworks might start similar, but diverge eventually. I think it is just simpler to only allow #[test] and #[bench] for the default runners we have today. I think people should use #[mytest::setup] instead of #[test::setup].

I'm a bit unsure how the eRFC as currently written allows you to use different runners, such as proptest and the normal test runner, inside modules in src/. Can you clarify that in the eRFC?

PS: no love for proptest in the eRFC, only for quickcheck? :wink:

Yes, the eRFC allows all runners to operate on lib, but gives them the ability to switch that off as well.

We could totally ask for scoping of attrs. I’m allowing for clashes in case someone wishes to define a “better version of the rustc framework”; folks shouldn’t have to sed the file to get something else. It’s unclear if this is a necessary goal of the RFC. (I need to clarify the kinds of cases I’m attempting to cover.) But either way I don’t think it’s a big deal for runners to clash on attributes.

FWIW, as a proc macro we can’t disallow anything; we can only politely request that frameworks not use the same attr. We can get them to not use the same folder, but this makes custom frameworks slightly second class.

As for points of interest, this can easily be done as a minor extension of the TestCollector API where we give it an attribute scope.

@nrc has now posted the minutes from the dev-tools meeting on this from a few days ago.

Seems like the main difference between that discussion and the eRFC is that it’s also considering stabilizing an output format, which I feel we may wish to do, but probably informally and over the course of the experimentation.

I agree with that – the formatter can come separately (and later). The discussion also happened ~5 days ago, so much of the discussion here/on IRC had not yet happened.