[Idea] Cargo Global Binary Cache

If it’s local to your machine (or even your physical network), you can (hopefully) trust anyone who would be putting a build in the cache, as they either have access to your machine or your physical network. Both are basically game over security-wise even without a binary cache for cargo.

If it’s on the internet, it’s a similar process to trusting the source code you’re downloading. Check a hash to make sure you have the correct thing, and check to see if people you trust trust that binary (or source code) using something like cargo crev.

If we assume cargo has reproducible builds, you could have a build bot that builds the crate locally and checks against the prebuilt, and if they’re the same, trusts it to be the binary resulting from building that source code.
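To make the build-bot idea concrete, here is a minimal sketch. It assumes reproducible builds, and the function names `artifacts_match` and `verify_prebuilt` are made up for illustration. A real bot would more likely compare cryptographic hashes than whole files, but the check is the same in spirit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns true when the locally built artifact is byte-for-byte identical
/// to the prebuilt one. Only meaningful if builds are reproducible.
pub fn artifacts_match(local: &[u8], prebuilt: &[u8]) -> bool {
    local == prebuilt
}

/// Hypothetical build-bot check: read both files and compare their contents.
pub fn verify_prebuilt(local: &Path, prebuilt: &Path) -> io::Result<bool> {
    let local_bytes = fs::read(local)?;
    let prebuilt_bytes = fs::read(prebuilt)?;
    Ok(artifacts_match(&local_bytes, &prebuilt_bytes))
}
```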

It’s all a trust problem. For this though, I hope you can trust yourself :sweat_smile:


I just published the PR https://github.com/rust-lang/cargo/pull/6437


I would love to see cargo sharing binary artifacts between builds by the same user on the same system by default.


Agreed, it would be a huge step forward and improve the default and overall experience for everyone!

This helps both developers who build lots of small Rust apps, where we currently rebuild common dependencies and waste disk space on the duplicates, and developers who work on just a few much bigger projects, where even 2-3x redundant compilation across the projects is a big hit.


A quick update on the progress. I have two PRs to improve cargo sweep. One uses the data stored by Cargo to clean files that belong to a version of Rust that is no longer installed. The other uses the hash files checked by Cargo to accurately determine the atime of the last Cargo invocation that used each file (this still relies on the file system maintaining atime). I also have a PR to Cargo to have Cargo maintain a last-used timestamp file in the folder that (my PR to) cargo-sweep is already checking, so we can get away from atime.

@haohou's PR to Cargo requires some work on a technical problem that is related to a design problem. The technical problem is how we can be sure that artifacts will only be reused if they are exactly equivalent. The design problem is what exactly should go in the Cached folder. The PR as it stands chooses to put only the non-primary file dep artifacts in the Cached folder, and was working on heuristics for making sure reuse was correct. Unfortunately, Alex was unconvinced that we would find and maintain a sufficient set of heuristics without somehow reusing the existing fingerprint module.
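As a rough illustration of why a last-used timestamp helps a cleaner, here is a sketch of the selection step. This is not cargo-sweep's actual code: the `CachedArtifact` type and `stale_artifacts` function are hypothetical, and the `last_used` field stands in for whatever the timestamp file records.

```rust
use std::time::{Duration, SystemTime};

/// One cached artifact together with the time Cargo last used it.
/// `last_used` stands in for the proposed timestamp file (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct CachedArtifact {
    pub name: String,
    pub last_used: SystemTime,
}

/// Returns the names of artifacts that were not used within `max_age` of `now`.
pub fn stale_artifacts(
    artifacts: &[CachedArtifact],
    now: SystemTime,
    max_age: Duration,
) -> Vec<String> {
    artifacts
        .iter()
        .filter(|a| match now.duration_since(a.last_used) {
            Ok(age) => age > max_age,
            // A timestamp in the future is treated as fresh.
            Err(_) => false,
        })
        .map(|a| a.name.clone())
        .collect()
}
```

Because the decision depends only on a value Cargo writes itself, it does not matter whether the file system maintains atime.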


So let me try to articulate an alternative answer to that design problem. Edit: to be clear I am not speaking for the Cargo Team.

Proposal

There are two folders Cargo uses for keeping track of and building your project and its dependencies.

  1. There is one set by the environment variable CARGO_TARGET_DIR and the target-dir field in .cargo/config and defaults to /target. The exact content and structure of this folder is an implementation detail of Cargo, and can be changed in the future, but this proposal does not suggest changing it. This folder is used for artifacts that Cargo thinks will only be useful to this project. The exact rules for what goes in this folder are implementation details of Cargo, and can be changed in the future, but it will include the bin and lib and test artifacts for the project being compiled.
  2. The other one is set by the environment variable CARGO_CACHE_DIR and the cache-dir field in .cargo/config and defaults to being the same as target-dir. The exact content and structure of this folder is an implementation detail of Cargo, and can be changed in the future, but this proposal suggests starting it the same as the target-dir. Cargo will always support using the same folder for both even if the default changes in the future. This folder is used for artifacts that Cargo thinks may be useful to other projects.
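The lookup order for the two folders could be sketched like this. It is only an illustration of the proposal: `DirSettings` and `resolve_dirs` are hypothetical names, `CARGO_CACHE_DIR` and `cache-dir` do not exist in Cargo today, and I am assuming the environment variable wins over the config field and that the default target folder sits under the workspace root.

```rust
use std::path::{Path, PathBuf};

/// Settings read from the environment and from `.cargo/config`.
/// `None` means "not set". The two cache fields are the proposed additions.
#[derive(Debug, Default, Clone)]
pub struct DirSettings {
    pub env_target_dir: Option<PathBuf>,    // CARGO_TARGET_DIR
    pub config_target_dir: Option<PathBuf>, // target-dir
    pub env_cache_dir: Option<PathBuf>,     // CARGO_CACHE_DIR (proposed)
    pub config_cache_dir: Option<PathBuf>,  // cache-dir (proposed)
}

/// Returns `(target_dir, cache_dir)`.
pub fn resolve_dirs(settings: &DirSettings, workspace_root: &Path) -> (PathBuf, PathBuf) {
    // Assumed precedence: environment variable, then config, then default.
    let target = settings
        .env_target_dir
        .clone()
        .or_else(|| settings.config_target_dir.clone())
        .unwrap_or_else(|| workspace_root.join("target"));
    // The cache folder falls back to the target folder, so nothing changes
    // for users who never set it.
    let cache = settings
        .env_cache_dir
        .clone()
        .or_else(|| settings.config_cache_dir.clone())
        .unwrap_or_else(|| target.clone());
    (target, cache)
}
```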

Why keep the same structure between the two folders?

Cargo does a good job of rebuilding/reusing only when needed. That “fingerprint” system is coupled to the folder structure. By keeping the structure we can keep the working system. Also tools built for cleaning one will work on the other.

Why does cache-dir default to target-dir?

This means that accepting this proposal will not break any existing users. Then, when this is working well, it is not hard to propose changing this default to /target/cacheable/ for easy CI caching, or to ~/.cargo/build_cache/ for good multi-project caching. The bikeshed of what is a good default, and whether it is worth breaking users, is left for a follow-up. Edit: This also means that if we are not caching in the way best optimized for this new use, we are not inflicting it on the community unless a user opts into it, or until that follow-up. User opt-in gives us time to debug this before we seriously suggest that follow-up. See the conversation below.

What exactly goes in the cache-dir?

I don’t know. That is why I want it to be an implementation detail we can change. I think making it “things from crates.io” as @matklad demonstrated, or “not primary” as @haohou was working on, would be a good place to start.
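To show how small the swappable part could be, here is a sketch of the two starting rules as one predicate. All of the names (`SourceKind`, `CachePolicy`, `Unit`, `goes_in_cache_dir`) are made up for illustration and are not Cargo internals.

```rust
/// Where a package's source comes from (simplified for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceKind {
    Registry, // e.g. crates.io
    Git,
    Path,
}

/// The two starting rules mentioned above, as interchangeable policies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CachePolicy {
    RegistryOnly, // "things from crates.io"
    NonPrimary,   // "not primary"
}

/// A unit of compilation, reduced to the two facts the policies look at.
#[derive(Debug, Clone, Copy)]
pub struct Unit {
    pub source: SourceKind,
    pub is_primary: bool,
}

/// Decides whether a unit's artifacts go in the cache folder.
pub fn goes_in_cache_dir(unit: &Unit, policy: CachePolicy) -> bool {
    match policy {
        CachePolicy::RegistryOnly => unit.source == SourceKind::Registry,
        CachePolicy::NonPrimary => !unit.is_primary,
    }
}
```

Keeping the rule behind a single function like this is what would let it stay an implementation detail that can change later.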


By keeping the structure we can keep the working system.

I am no longer sure that this’ll work out of the box: not all of the inputs are reflected in the file name hash. For example, changing RUSTFLAGS changes the way dependencies are compiled, but the file name hash stays the same. So just moving stuff to a separate folder might not work if, for example, you have different profile flags in different projects. In other words, the target/debug vs. target/release separation does not make sense for a shared cache, because there may be several debugs and releases.

That depends on what we mean by "work".

If "work" means "will build the correct final artifact" then I think it will just work. If we relied only on the file name hash and did not account for RUSTFLAGS we could, for example, end up linking code for the wrong CPU. So getting the correct output is not a small achievement.

If "work" means "setting the cache-dir to the same folder for more than one project will always save you time" then, at this time, it will definitely not work. Your example is excellent: one project "A-bin" has a config that sets RUSTFLAGS="a", another project "B-bin" has a config that sets RUSTFLAGS="b", and both use the same cache-dir. On building "A-bin" the cache will fill with deps that are built for RUSTFLAGS="a". Then on building "B-bin" the cache will be overwritten with deps that are built for RUSTFLAGS="b". Then "A-bin" will have to rebuild them to put them back. We have a kind of false sharing.

These kinds of issues can be reduced by storing more info in the folder structure, but it will take time to find them all and design the best solution for each. A more ambitious proposal could suggest a plan for how to encode all such info, and thus suggest using a shared cache-dir by default. I would very much welcome other proposals. In the meantime, I think my proposal, to add the infrastructure but have it off by default, allows us to start experimenting to find what needs to be improved without needing to get it all right before we can merge.
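The false sharing can be shown with a toy model. This is a sketch and not how Cargo stores anything: `ToyCache` is hypothetical, each slot holds one artifact, and the "fingerprint check" is reduced to comparing the flags the stored artifact was built with.

```rust
use std::collections::HashMap;

/// Toy cache: maps a slot key to the flags the stored artifact was built with.
pub struct ToyCache {
    include_flags_in_key: bool,
    slots: HashMap<String, String>,
    pub rebuilds: u32,
}

impl ToyCache {
    pub fn new(include_flags_in_key: bool) -> Self {
        ToyCache {
            include_flags_in_key,
            slots: HashMap::new(),
            rebuilds: 0,
        }
    }

    /// "Builds" `dep` with `rustflags`, reusing the stored artifact if it matches.
    pub fn build(&mut self, dep: &str, rustflags: &str) {
        let key = if self.include_flags_in_key {
            format!("{}-{}", dep, rustflags)
        } else {
            dep.to_string()
        };
        // Stand-in for the fingerprint check: reuse only on matching flags,
        // otherwise rebuild and overwrite whatever occupied the slot.
        if self.slots.get(&key).map(String::as_str) != Some(rustflags) {
            self.rebuilds += 1;
            self.slots.insert(key, rustflags.to_string());
        }
    }
}
```

Building the same dependency for "A-bin", then "B-bin", then "A-bin" again costs three builds when the flags are not part of the key, and two when they are. The output is correct either way; only the time saved differs.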

Profile flags are tracked in the filename hash. RUSTFLAGS are not. It's trivial to change what's in the hash. Here's a summary:

Value Fingerprint Metadata
rustc
Profile
cargo rustc extra args
CompileMode
Target Name
Target Kind (bin/lib/etc.)
Enabled Features
Immediate dependency's hashes ✓ (except build.rs and bins)
Target or Host mode
__CARGO_DEFAULT_LIB_METADATA (release channel for libstd)
manifest authors, description, homepage (exposed by CARGO_PKG_*)
package_id
Target src path
Target path relative to ws
Target flags (test/bench/for_host/edition)
Edition
-C incremental=... flag
mtime of sources
RUSTFLAGS/RUSTDOCFLAGS
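A small sketch of the difference between the two hashes follows. It is not Cargo's implementation: `UnitInputs`, `metadata_hash` and `fingerprint` are illustrative names, only three of the listed inputs are modelled, and placing `package_id` in the filename hash is my assumption. The point it shows is the one stated above: Profile feeds the filename hash, RUSTFLAGS only the fingerprint.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny subset of the inputs listed above (illustrative only).
pub struct UnitInputs {
    pub package_id: String,
    pub profile: String,
    pub rustflags: String,
}

/// Filename ("metadata") hash: decides where the artifact lives on disk.
/// RUSTFLAGS is deliberately left out, as described above.
pub fn metadata_hash(u: &UnitInputs) -> u64 {
    let mut h = DefaultHasher::new();
    u.package_id.hash(&mut h);
    u.profile.hash(&mut h);
    h.finish()
}

/// Fingerprint: decides whether the artifact at that location can be reused.
pub fn fingerprint(u: &UnitInputs) -> u64 {
    let mut h = DefaultHasher::new();
    metadata_hash(u).hash(&mut h);
    u.rustflags.hash(&mut h);
    h.finish()
}
```

Two units that differ only in RUSTFLAGS get the same filename but different fingerprints, so they compete for one slot. Adding RUSTFLAGS to the first function would give each its own file.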

Cargo does not fingerprint everything, so it is easy for parts of the environment to leak into the artifact without being tracked. I have an experimental branch of rustc that would track environment variables, but I'm not sure if I'll finish it. There's always going to be something that leaks without strict sandboxing.

I like the idea of having multiple target directories. And hopefully the implementation could lead naturally to network-based caching in the future. I think with decent gc support, it could be relatively aggressive about placing artifacts in the shared cache. It would be interesting to maybe think about how the user can control what's cached.

Thank you for that detailed summary. I wonder if that should be added to internal documentation somewhere?

I am a little unclear how you would recommend we change the proposal given your deeper understanding of the situation.

  • Should we be using this list to encode more info in the filename before proposing a split target/cache?
  • Does the fact that we can’t encode everything mean that we should not have a cache folder? ( I definitely think it means we need the ability to turn it off, but that is part of this proposal.)
  • I don’t see why we can’t teach sccache about the cache folder allowing it to get files from the internet for all projects using that folder. But I do not know the sccache internals.
  • I am working on making the GC ecosystem better. I totally agree that changing the default amount of caching is mostly a function of the GC quality. Should this proposal be on hold for GC improvements?
  • One day allowing control of what is cached sounds reasonable. Do you have a suggestion for what it could look like? Should we put this proposal on hold until we have a plan for that?

I don’t have too many concrete suggestions other than being cautious about the limitations of a shared cache (especially without sandboxing and environment tracking). Some ideas:

  • Consider adding RUSTFLAGS to the metadata hash.
  • I think it would be fine to start with an unstable experiment, but I think gc support might be a prerequisite for stabilization.
  • Consider the UI for how cargo clean would work. How to clean my local target vs the shared one? How will cargo clean evolve as it undertakes more tasks (like managing the registry caches)? Should there be separate commands, or could it take subcommands, or various flags?
  • Stretch goal: Figure out some way to make this work with rustdoc.

I don’t think user-level control is a prerequisite to get started. Just consider making the logic on what to share flexible. As an example, Bazel can control what’s cached per-target. I think for a local shared cache, the default of including non-path dependencies might be a good start?


LGTM! And this will immediately unlock the use-case I am mostly interested in: making it easy to cache the target directory without manually specifying before_cache: remove this and that and that.

