Cargo idea: CLI option to only rerun tests in crates with changes

Some build systems like Bazel and Microsoft's Cloudbuild only run tests for projects that have a change in their transitive closure. Would it be simple to add an --incremental CLI flag to cargo test? This flag would only rerun test executables for crates with a change in their transitive dependencies or that previously failed.

This would be nice to have for reducing average CI times for projects in a workspace.

A lot of this depends on what you mean by a dependency having changed. We could track the fingerprint for a test binary since its last run and re-run if the fingerprint is now different.
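
As a crude illustration of what "fingerprint" could mean here (Cargo's real fingerprints cover much more, e.g. rustc flags, features, and dep-info; this only shows the shape of the idea):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// Crude stand-in for a fingerprint: hash the bytes of the built test binary.
/// Cargo's real fingerprints are richer, but the compare-and-skip idea is the same.
fn binary_fingerprint(test_binary: &Path) -> std::io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(test_binary)?.hash(&mut hasher);
    Ok(hasher.finish())
}
```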

If you want to get more granular than that, you have two problems

  • Cargo has no knowledge of tests
  • We'd likely need first-class coverage information and maybe VCS information

Currently, cargo test has no knowledge of tests and libtest is too low-level for this. As we work to stabilize the JSON output from libtest (my project goal for 2025h1), Cargo will gain knowledge about tests and be able to support higher-level functionality like this.
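
Once that output is available, a consumer could pick out per-test results. A rough sketch, assuming serde/serde_json and the current unstable event shape (which may change before stabilization):

```rust
use serde::Deserialize;

/// One line of libtest's JSON output (currently unstable; the shape may
/// change before stabilization). Only the fields used here are modeled;
/// unknown fields are ignored by serde's defaults.
#[derive(Deserialize)]
struct TestEvent {
    #[serde(rename = "type")]
    kind: String,
    event: Option<String>,
    name: Option<String>,
}

/// Collect the names of tests that failed, given the raw JSON lines from a
/// `--format json` run.
fn failed_tests(json_lines: &str) -> Vec<String> {
    json_lines
        .lines()
        .filter_map(|line| serde_json::from_str::<TestEvent>(line).ok())
        .filter(|ev| ev.kind == "test" && ev.event.as_deref() == Some("failed"))
        .filter_map(|ev| ev.name)
        .collect()
}
```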

The OP does indicate a granularity of the executable and crate dependency graph. This is not very granular, to be clear, but it would still be quite useful for larger workspaces with many crates in them to be able to automatically rerun only the tests in the crates being edited (“precise”) or dependent on them (“accurate”).

A slightly more interesting question is when Cargo considers a test (or its dependencies) to have changed, i.e. what the baseline is. Probably the simplest and most useful answer is roughly “if the last test run was fully successful and the fingerprint hasn't changed, cache the successful result (don't rerun).”
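
In code, that baseline rule might look roughly like this (the names are illustrative, not an actual Cargo interface):

```rust
/// What would be remembered about a test binary's most recent run.
struct LastRun {
    fingerprint: u64,
    fully_successful: bool, // every test in the binary passed
}

/// The baseline rule: skip re-running only when the previous run passed
/// completely and nothing the binary was built from has changed.
fn can_skip(current_fingerprint: u64, last: Option<&LastRun>) -> bool {
    match last {
        Some(run) => run.fully_successful && run.fingerprint == current_fingerprint,
        None => false, // no record: run it
    }
}
```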

To note separately, as a feature, this could alternatively be handled via an external runner (e.g. nextest) rather than in Cargo itself (although Cargo having the functionality would be more convenient).

If we did the binary or package level re-running, it would be interesting to see prior art in terms of

  • What is the flag called
  • What level of granularity does it operate on
  • What invalidates the result or creates a unique result (e.g. cargo test --test foo or cargo test -- foo run a subset of tests and should only count for that subset?)
  • What considerations for poisoning of the result are considered (e.g. untracked files or env variables changing test results)
  • Are only successful runs skipped or also unsuccessful?

This granularity is also likely to be useful because there are multiple existing reasons for expensive or often-irrelevant tests to be broken out into separate test targets or separate packages:

  • to configure them as skipped by default with <target>.test = false
  • to suggest to the user that they are a meaningful unit
  • to avoid building them, not just running them
  • to avoid building this-test-only dependencies (by splitting a package and not just a test)
  • to isolate process-wide state that shouldn't be allowed to affect other tests
  • for a custom test harness that can't run together with #[test]s

Prior art from Bazel:

  • What is the flag called:

--[no]cache_test_results, with a shorthand flag -t (enable) / -t- (disable). --cache_test_results is the default.

  • What level of granularity does it operate on.

The bytes of the test binary itself, plus any environment variables and flags that will be exposed to the test binary. Bazel also lets you declare files which will be read at runtime, so those are included in the cache key.

Depending on your sandboxing strategy, attempting to read undeclared files may succeed or fail, but either way they won't be included in the cache key.
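
As an illustration only (not Bazel's actual implementation), the inputs described above could be folded into a cache key roughly like so:

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::PathBuf;

/// Inputs a Bazel-style test cache hashes together, per the description above:
/// the test binary's bytes, the flags and env vars the test will see, and the
/// contents of files declared as runtime data. Undeclared files never enter the key.
struct TestCacheKeyInputs {
    test_binary: PathBuf,
    args: Vec<String>,
    env: BTreeMap<String, String>, // BTreeMap so iteration order is stable
    declared_runtime_files: Vec<PathBuf>,
}

fn cache_key(inputs: &TestCacheKeyInputs) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(&inputs.test_binary)?.hash(&mut hasher);
    inputs.args.hash(&mut hasher);
    for (k, v) in &inputs.env {
        (k, v).hash(&mut hasher);
    }
    for path in &inputs.declared_runtime_files {
        fs::read(path)?.hash(&mut hasher);
    }
    Ok(hasher.finish())
}
```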

  • What invalidates the result or creates a unique result (e.g. cargo test --test foo or cargo test -- foo run a subset of tests and should only count for that subset?)

cargo test --test foo is treated as completely independent from cargo test; there's no caching of subsets of results. For that matter, cargo test --test foo and cargo test -- foo would be cached independently, because the granularity is the literal flags.

  • What considerations for poisoning of the result are considered (e.g. untracked files or env variables changing test results)

Files which don't appear in the dependency graph are ignored.
Files which are present in the dependency graph but untracked by version control are considered.
Env vars are tracked (by default Bazel doesn't forward all host env vars to tests, but they can be forwarded on a per-variable basis).

  • Are only successful runs skipped or also unsuccessful?

Only successful runs are cached.

I guess one challenge in building such a thing is how to handle non-code dependencies. In the general case, tests can read files, access the internet, and so on. Bazel just outright disallows internet access and sandboxes tests such that they can only access declared files. It's not immediately obvious how to handle test file dependencies. For files that exist statically, I guess you can declare them in Cargo.toml somehow, but if they're dynamically generated things get tricky.

Does Rust have a testing philosophy that could narrow the scope of the problem to make things simpler? The docs state that tests run in parallel by default and that you can test internal functions if you want, but I've not seen any guidance on writing good tests (I'm not even sure it should be something Rust gives guidance on).

Bazel is probably a good place to start, but Bazel (and its brethren) also tend to say "good riddance" to unsupported cases; I don't think Rust should limit itself like these tools do. I mean, it's all fine for what it does, but I have no idea how to teach Bazel "I am sensitive to PAGE_SIZE of the running kernel"[1] or "I am sensitive to the runtime quota limits of the kernel for resource X".

This will miss effects of things like inventory-registered data. I know Google is super-allergic to static global data, but it is something to consider here. There may also be code sensitive to an impl SomeTrait for Type that isn't in the module hierarchy of Type (strange, but not impossible; probably low-risk without specialization).

How does that work with tests that do things like database work or (in some of my crates) extract the test data from .git? (The test data are stored in-history and merged via -s ours; the data are actual Git history topology, not just data files.)


My interest is in getting line-level (or at least function-level) coverage metrics per test. This would allow a tool to take a (commit_base, historical_results, patch) and spit out a set of tests to run for the patch. While this is probably not of widespread interest, as test suites tend to be fast enough not to care, it adds up when one wants to run cargo-mutants as part of CI. If I could get that to run a limited set of tests (with the intent of fast-failing on handled mutants given the mutated code diff; successful passes would follow up with the remaining tests to prevent false negatives), I think I could have it run as part of PR pipelines rather than nightly or weekly (depending on project size).
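
For what it's worth, the selection step itself is simple once the coverage data exists; a sketch with hypothetical types (nothing here is an existing tool's API):

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical historical coverage data: for each test, the set of
/// (file, line) pairs it executed when run against `commit_base`.
type CoverageMap = HashMap<String, BTreeSet<(String, u32)>>;

/// Lines touched by the patch, e.g. extracted from diff hunks.
type ChangedLines = BTreeSet<(String, u32)>;

/// Select the tests whose recorded coverage intersects the patch.
/// Tests with no recorded coverage are kept, to stay on the safe side.
fn tests_to_run(coverage: &CoverageMap, changed: &ChangedLines) -> Vec<String> {
    coverage
        .iter()
        .filter(|(_, lines)| lines.is_empty() || !lines.is_disjoint(changed))
        .map(|(test, _)| test.clone())
        .collect()
}
```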


  1. actually, just depend on a file that embeds it somehow

The standard mechanism here is "depfiles" (what gcc -MF will write). Basically, you intercept any file read and report it at the end as "these files matter". Note that the Makefile format is limited in that it can only express dependencies on files that exist, not "negative dependencies". As an example, say you have some file search procedure that looks for file F in directories A and B, and it lives in B. You first looked for A/F and didn't find it; if that path ever appears, you need to rerun your test. Generally, build systems just ignore this problem, but I don't know that test suites can.
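
For concreteness, a tiny helper in the spirit of depfiles: it records the files a test reads through it and writes them out in Make's `target: dep1 dep2` format (the hard part, intercepting reads it doesn't see, is not shown):

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Records every file a test explicitly reads through it, then writes a
/// Make-style depfile describing those inputs.
struct DepTracker {
    deps: Vec<PathBuf>,
}

impl DepTracker {
    fn new() -> Self {
        Self { deps: Vec::new() }
    }

    /// Read a file and remember it as an input of this test.
    fn read(&mut self, path: &Path) -> io::Result<Vec<u8>> {
        self.deps.push(path.to_path_buf());
        fs::read(path)
    }

    /// Write the collected dependencies in depfile syntax.
    fn write_depfile(&self, target: &str, out: &Path) -> io::Result<()> {
        let mut f = File::create(out)?;
        write!(f, "{}:", target)?;
        for dep in &self.deps {
            write!(f, " {}", dep.display())?;
        }
        writeln!(f)
    }
}
```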

As for non-file resources, you need some way to "snapshot" the relevant state to report as a "dependency". This might be the state of the database (but maybe not how you connect to it), the set of running processes (for something like a pgrep tool), X server metrics (DPI, resolution, etc.) for UI testing, etc. If there's a mechanism to do that (as well as say "I depend on thing X which is inherently not fingerprintable, so just always run me"), I think there's a decent base to build upon.
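
One shape such a mechanism could take, purely as an illustration:

```rust
/// A resource a test depends on that isn't a plain file on disk.
trait TestDependency {
    /// A stable digest of the relevant state, or `None` if the resource is
    /// inherently not fingerprintable and the test must always re-run.
    fn fingerprint(&self) -> Option<u64>;
}

struct DatabaseDump {
    digest_of_dump: u64, // e.g. a hash of a dump taken in test set-up
}

impl TestDependency for DatabaseDump {
    fn fingerprint(&self) -> Option<u64> {
        Some(self.digest_of_dump)
    }
}

/// "I depend on something I can't snapshot, so never cache me."
struct AlwaysRerun;

impl TestDependency for AlwaysRerun {
    fn fingerprint(&self) -> Option<u64> {
        None
    }
}
```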

Tangential, but I have personal experience with a project (built using traditional make) where this was a serious headache to the point where most of the devs completely gave up on incremental builds.

The core idea is: all of those things should be represented as a file you depend on. So if you depend on a .git directory, that entire directory should be an input to your test (or a filtered version that contains the commits you care about). Or if you depend on database state, probably you depend on a dump of the entire database (and potentially create the database as part of your test set-up based on the dump). Or as you described in the next post, e.g. a hash of the current state of a server or something.

(To be clear, I'm not suggesting these as things Rust should pick up, just giving the prior art of another system).

One way to report non-Cargo dependencies is through test harness JSON output, acting like build script rerun-if-changed directives. If we have pytest-like fixtures, some of these could be reported automatically.
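
To make that concrete, a harness hook could emit an extra event alongside the normal test events; the event shape and function below are invented, nothing like this exists today:

```rust
use std::path::Path;

/// Hypothetical: report a file this test read, in the spirit of a build
/// script's `cargo::rerun-if-changed=...` directive, as an extra JSON event
/// in the harness output. (A real implementation would JSON-escape the path.)
fn report_dependency(path: &Path) {
    println!(
        r#"{{ "type": "test-dependency", "kind": "file", "path": "{}" }}"#,
        path.display()
    );
}
```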

Of course there are

  • bugs with the external dependency declarations
  • the fact that this doesn't exist yet, but neither does the skipping logic. If it's opt-in somehow then that should be fine.

I was mulling the rerun-if-changed approach a bit and think it has a few problems. Namely, is adding the files a user's tests accessed to the JSON a manual or an automatic process? Automatically tracking this requires hooking into the file APIs for the transitive closure of a test's subprocesses (Cloudbuild does this using detours). Bazel's sandboxing approach seems like it would have a chicken-and-egg problem, in that tests don't report their accessed files until after they've run (let alone been declared).

Manual tracking requires the user to call some hook to report file accesses to the test harness. One would typically wrap this in a convenience method that opens a file and reports it in a single call. Unfortunately, it's not clear how to inject this behavior into subprocesses. I'd expect most people writing tests aren't spawning child processes, so it's not clear how big of an issue this is.

It is at least specifically supported practice to test binaries by spawning them.

For the idea I had, it would be manual but test support libraries would help.

For example,

  • A snapshotting library could register the snapshot as a dependency
  • I use a test library that reads files to set up a scratch pad directory for my tests to run in. That library could register those files that are read

As for spawning processes, I do that for

  • End-to-end testing of a binary
  • End-to-end testing of code related to binaries

In both cases, I know what files the binaries should be calling and can register them. I don't see much of a need for a spawned process to register files.