Pre-RFC: machine-readable test output


#1

Summary

Add a --format flag to rustc --test to select the test output format.

Motivation

Currently, testing in Rust provides only human-readable output, which is fine in most cases, as we are humans (or other humanoids). But from time to time we need to feed the results of our tests to a machine (e.g. a CI system that reports which tests failed), and there is no "civilised" way to do that: we have to parse output that was never meant to be machine-readable, which isn't nice. But wait: there are some nice, standardized and well-described formats for providing test output, such as TAP and its variants.

Use them for greater good!™

Detailed design

As described above, just add a flag to select the desired test output format, taking one of the following values:

  • default - the current, human-readable format
  • human - an alias for default
  • tap - Test Anything Protocol, a nice middle ground between human-readable and machine-readable
  • tap-y - a next-generation TAP variant that emits a YAML stream of test results, allowing more data to be provided for the sake of developer tooling
  • tap-j - like the above, but uses JSON instead of YAML

Drawbacks

This adds some complexity to the display of test results, but no other drawbacks are known at this time.

Alternatives

Leave things as they are; otherwise none.

Unresolved questions

The set of available output formats. For now they have been chosen based on my knowledge of well-known and popular test output formats, but I may have missed something.


#2

This should probably also work for benchmarks.


#3

I’ve assumed that benchmarks are also tests (as they are under the cfg(test) flag).


#4

I just thought I’d mention it, because the output is not quite the same. Also for machine-readable output, we may want to output the timings with all precision we have.

Also note that neither tap nor tap-y/j have a provision for incorporating benchmark results (though the extra section is an obvious candidate), so we need to define our own convention/format here.


#5

Great idea—this could be a nice way to reduce the complexity of the test framework.

Outputting TAP seems to allow easy conversion to other formats (either by us or 3rd party tools). We could also deprecate the current ‘human’ format in favor of piping the TAP output into smaller tools like faucet (rewriting this in Rust might be a nice beginner project; cargo could bundle such output filters).

I haven’t read much about TAP-Y/J. They seem to offer the developer some advantages (generating JSON is easy, and valid JSON is valid YAML, so you get both for free) and might contain more information. From what I can tell, converting between TAP-J and TAP looks easy as well (given a TAP and a JSON parser).

Let me tell you about one more radical idea: assuming that the internal representation of test results is already similar to the schema of TAP-J, offering only TAP-J could also be an option to further reduce complexity. External tools (that could be bundled with cargo, as above) could be used to convert it to anything. (The default test output would just be JSON.)

(All this would probably be a breaking change for users, though.)


#6

A “built-in” parser for TAP-J could be added for backwards compatibility. I don’t see how to do it otherwise without a breaking change.


#7

I just recategorized this to “tools and infrastructure” – but I think it’s a great idea! Code-wise it’d be nice to rewrite the current “front end” as a TAP consumer.


#8

I would think about using something like TAP-J with a public specification. I could try to write one when I find some time.


#9

I agree that this sounds like a pretty awesome idea! Semantically this probably wouldn’t be a flag to the compiler but rather to the test binary itself, but beyond that this is certainly one in a long list of items it’d be cool if our test infrastructure did by default.

I’ve long wanted the ability to plug in your own test harness instead of always having to use libtest as that would allow developing this kind of functionality on crates.io before moving it into the main distribution. Plus it’d even allow for faster iteration! This is pretty far out though, so just mentioning it as a passing thought.

I do like the idea of being as composable as possible, so we may wish to cut back the scope here to only have human and machine readable output so long as the latter can be convertible to basically anything else.


#10

What do you think about this:

Detailed design

Replace the current human-readable output with JSON-based output modeled on TAP-J. For compatibility with existing workflows, a built-in parser for that protocol should be added which renders it in the current format.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Protocol description

The output of a test suite should be a stream of JSON objects that can be parsed by additional tooling.

Structure

The output is a stream of lines, each containing one JSON object, separated by newlines. Each object MUST contain a type field whose value is one of suite, test, bench or final. Any object MAY have an extra field containing an open mapping for additional information.

Suite

{
  "type": "suite",
  "name": "Doc-Test",
  "build": "2015-08-21T10:03:20+0200",
  "count": 13,
  "rustc": "2a89bb6ba033b236c79a90486e2e3ee04d0e66f9"
}

Describes the test suite. It MUST appear exactly once, at the beginning of the stream.

Fields:

  • type MUST be suite.
  • build MUST be an ISO 8601 timestamp of the build date. This makes it possible to detect accidentally running a stale test suite at zero cost.
  • name SHOULD contain the current suite type (“Test”, “Doc-Test”, “Benchmark”).
  • count MUST be the count of all tests (including those ignored at runtime).
  • rustc MUST be the version of the Rust compiler used to build the test suite.

Test

{
  "type": "test",
  "subtype": "should_panic",
  "status": "ok",
  "label": "octavo::digest::md5::tests::test_md5",
  "file": "src/digest/md5.rs",
  "line": 684,
  "stdout": "",
  "stderr": "",
  "duration": 100
}

Each test MUST produce one and only one test object.

A test unit MUST have a status field whose value MUST be one of: ok, fail or ignore.

It is RECOMMENDED to add a subtype field containing either test, bench or should_panic.

A unit MUST also contain a label field giving the name of the test.

A test SHOULD contain file and line fields for the sake of debugging.

A test MAY contain stdout and stderr fields holding the output captured on those streams.

It is RECOMMENDED to include a duration field containing the test run time in nanoseconds.

Benchmark

{
  "type": "bench",
  "status": "ok",
  "label": "octavo::digest::md5::tests::bench_md5",
  "file": "src/digest/md5.rs",
  "line": 698,
  "iterations": 382,
  "duration": 300
}

The field descriptions are the same as for test, with two additional conditions:

  • the duration field MUST be present
  • an additional iterations field MUST be present, giving the number of iterations measured by the benchmark

Finally

{
  "type": "final",
  "results": {
    "ok": 10,
    "fail": 0,
    "ignore": 2
  }
}

This MUST terminate the stream, and a parser MUST reject any input that appears after this object.

The results object MUST include the fields ok, fail and ignore, which give how many tests passed, failed and were ignored, respectively.
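To make the shape of the stream concrete, here is a rough emitter sketch in plain Rust. The field values are illustrative, and the JSON is hand-formatted for brevity; a real implementation would use a proper JSON serializer and escape string contents.

```rust
// Sketch of an emitter for the TAP-J-style protocol described above.
// Hand-formatted JSON for illustration only; strings are not escaped.

fn suite_line(name: &str, build: &str, count: u32, rustc: &str) -> String {
    format!(
        r#"{{"type":"suite","name":"{}","build":"{}","count":{},"rustc":"{}"}}"#,
        name, build, count, rustc
    )
}

fn test_line(status: &str, label: &str, duration_ns: u64) -> String {
    format!(
        r#"{{"type":"test","status":"{}","label":"{}","duration":{}}}"#,
        status, label, duration_ns
    )
}

fn final_line(ok: u32, fail: u32, ignore: u32) -> String {
    format!(
        r#"{{"type":"final","results":{{"ok":{},"fail":{},"ignore":{}}}}}"#,
        ok, fail, ignore
    )
}

fn main() {
    // One JSON object per line; the `final` record terminates the stream.
    println!("{}", suite_line("Test", "2015-08-21T10:03:20+0200", 2, "1.2.0"));
    println!("{}", test_line("ok", "tests::test_md5", 100));
    println!("{}", test_line("fail", "tests::test_sha1", 250));
    println!("{}", final_line(1, 1, 0));
}
```

A consumer then only needs to split on newlines, parse each line as JSON, and dispatch on the type field.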


#11

Is this supposed to say duration? And is the benchmark time also in nanoseconds (0.01 ns in the example JSON seems odd)?


#12

Wow, thanks for putting so much effort into this, @hauleth!

In most specs I’ve read so far, this was always described as a ‘stream of JSON documents’, where the schema of the JSON documents was specified later as objects. Shouldn’t matter, though; just something I noticed.

To make your description of the test schema a bit more concise, I translated it to this struct:

use std::path::PathBuf;
use std::time::Duration;

enum TestSubType { Test, Bench, ShouldPanic }
enum TestStatus { Ok, Fail, Ignore }

struct Test {
    subtype: Option<TestSubType>,
    status: TestStatus,
    label: String,
    file: Option<PathBuf>,
    line: Option<u64>,
    stdout: Option<String>,
    stderr: Option<String>,
    duration: Option<Duration>,
    iterations: Option<u64>,
}

Concerning benchmarks:

I think you meant ‘duration’ instead of ‘time’.

Also, what about measuring ‘throughput’? The current benchmarks are able to do that, so we might want to model that.

Oh, and I noticed the suite structure contains a time stamp. Is there something in std (or libc for that matter) that can output this?


#13

Yeah, that was a typo due to my live work on the spec. Sorry, I’ll fix that.

To be honest, I assumed the bench type would keep changing, as the test crate isn’t stable yet. I would like, for example, to see an array nspi instead of a single duration number. It would make it possible to check for timing differences between calls (I’m working on a crypto crate named Octavo, where it would be really helpful) or to check cache reuse. It would also allow more subtle statistical analysis.

Now I see that I’ve also missed a description of failed tests, which could additionally provide a stack trace.


#14

I’ve finally committed the RFC: https://github.com/rust-lang/rfcs/pull/1284