[Idea] Cargo Global Binary Cache


#1

Motivation

I currently use Rust for almost everything, including software development and data analysis, and it works pretty well. But Cargo builds the dependencies separately for every project, which is very annoying because:

  1. It slows down the build. Even a small project may take a long time to build, even with full parallelization.
  2. For my use case, I write a lot of data analysis scripts and ad-hoc projects in Rust. They blow up my disk space really quickly.
  3. When I need to clean the project or make a release build, Cargo starts from scratch again and again.

So I think it would be really useful to have a system that shares identical crate builds and speeds up the build.

ccache is a good example of a similar system for other programming languages. It checks whether two builds are identical by comparing a hash of the source code and a hash of the compiler parameters.
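As a rough illustration of that idea (not ccache’s actual algorithm), a cache key can be derived by hashing the source text together with the compiler arguments. The function name `cache_key` is hypothetical, and the standard library’s `DefaultHasher` is used only to keep the sketch dependency-free. Its output is not guaranteed to be stable across Rust releases, so a real cache would want a fixed algorithm such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical cache key: two builds are treated as identical when both
/// the source text and the compiler arguments hash to the same value.
/// Assumption: DefaultHasher stands in for a stable cryptographic hash.
pub fn cache_key(source: &str, compiler_args: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    // Mix in the source code...
    source.hash(&mut hasher);
    // ...and every compiler argument, in order.
    compiler_args.hash(&mut hasher);
    hasher.finish()
}
```

Changing either the source or any flag produces a different key, which is the property a compiler cache relies on.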

Supporting Data

The basic assumption behind this idea is that developers use a small set of crates, with similar configurations and features, far more often than others. It’s true that Rust’s build system lets the root crate pass configuration and feature settings down to its dependencies, which may make the dependency binaries differ. Even so, an infrastructure that caches the most commonly used <crate, version, cfg, feature…> combinations would still speed up the build.

I gathered some crude statistics on the crates on crates.io. It seems that 1) a few crates are used far more often than the rest, and 2) each of those crates has a few commonly used feature sets. For example (occurrence count first):

For libc

    472 "*","features" []
    480 "^0.2.21","features" []
    506 "^0.1","features" [""]
    711 "^0.1","features" []
   1004 "*","features" [""]
   8460 "^0.2","features" []

For serde

    309 ">= 0.3.0","features" [""]
    356 "^1.0","features" ["derive"]
    368 "^1.0.0","features" []
   1039 "^1.0.2","features" []
   1538 "^0.9","features" []
   1846 "^1","features" []
   1920 "^0.8","features" []
   8952 "^1.0","features" []

In both cases a handful of configurations dominate, so both libraries could use the cache efficiently.

Details

Distinguishing different binaries

In fact, Cargo is already able to tell different binaries apart by hash. The metadata hash for each unit seems to fit the needs of a binary cache perfectly: it mixes in the compilation parameters and the hashes of all dependencies. With the current Cargo infrastructure, it doesn’t seem like too much work to add a cache layer between the compiler and Cargo.

Some crates have a build.rs script, but the system will not cache any build script, and the metadata also seems to reflect changes in the build script output.

Cache Management

One challenging issue is that the cache must use a limited amount of disk space. It’s also generally bad to put everything in the cache when most of it won’t be used again. I suppose an LRU cache with a size limit will fit this compiler cache perfectly, because the most commonly used crate + cfg combinations will always stay fresh.
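To make the size-limited LRU idea concrete, here is a minimal in-memory sketch. The type `LruIndex` and its methods are hypothetical and not part of Cargo. The string keys stand in for Cargo’s per-unit hashes, and a logical counter replaces real timestamps so the behavior is deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical size-bounded LRU index for cached artifacts.
pub struct LruIndex {
    max_bytes: u64,
    used_bytes: u64,
    clock: u64, // logical time, bumped on every access
    entries: HashMap<String, (u64, u64)>, // key -> (size in bytes, last-used tick)
}

impl LruIndex {
    pub fn new(max_bytes: u64) -> Self {
        LruIndex { max_bytes, used_bytes: 0, clock: 0, entries: HashMap::new() }
    }

    /// Mark an artifact as used. Returns false if it is not in the cache.
    pub fn touch(&mut self, key: &str) -> bool {
        self.clock += 1;
        match self.entries.get_mut(key) {
            Some(entry) => {
                entry.1 = self.clock;
                true
            }
            None => false,
        }
    }

    /// Record an artifact, then evict least recently used entries until the
    /// total size fits the limit again. Returns the evicted keys.
    pub fn insert(&mut self, key: &str, size: u64) -> Vec<String> {
        self.clock += 1;
        if let Some((old_size, _)) = self.entries.insert(key.to_string(), (size, self.clock)) {
            self.used_bytes -= old_size;
        }
        self.used_bytes += size;

        let mut evicted = Vec::new();
        while self.used_bytes > self.max_bytes {
            // The entry with the smallest tick is the least recently used one.
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, v)| v.1)
                .map(|(k, _)| k.clone());
            match oldest {
                Some(k) => {
                    let (sz, _) = self.entries.remove(&k).unwrap();
                    self.used_bytes -= sz;
                    evicted.push(k);
                }
                None => break,
            }
        }
        evicted
    }
}
```

A real implementation would also delete the evicted files and persist the index on disk. The linear scan for the oldest entry is fine for a sketch but would need a better data structure for a large cache.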

Current State

I have implemented a prototype without any cache management features, and it seems to work pretty well for me. (It speeds up the build a lot once I have compiled a few different crates.)

I am not sure if there are any further issues with the change. I basically want to write a pre-RFC, but I don’t know if there are any guidelines for writing RFCs.

Any thoughts about this idea?


[update] I’ve opened this PR to Cargo.

It seems I am not the only one who really wants this. Both the PR and this thread have mentioned sccache and RUSTC_WRAPPER. My understanding is that sccache can avoid recompiling the same artifacts, but it still needs to copy them to the target dir, because this is how Cargo works today. (Tell me if I am wrong.)

According to the discussion below, I think it’s reasonable for the first version to share only the dependency binaries, controlled by the environment variable CARGO_SHARED_TARGET_DIR. If this variable is set, Cargo will try to use the shared dependencies (if the binary is already present) or produce them (if no binary is available).
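The use-or-produce decision can be sketched roughly as follows. This is not the code from the PR: `Plan` and `plan_dependency` are hypothetical names, and the artifact is identified only by a file name, where the real change would work on Cargo’s internal units.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical outcome of looking up one dependency artifact.
#[derive(Debug, PartialEq)]
pub enum Plan {
    /// The shared dir already holds the artifact: skip compilation.
    Reuse(PathBuf),
    /// Compile and write the artifact to this path.
    Build(PathBuf),
}

/// `shared_dir` is the value of CARGO_SHARED_TARGET_DIR, if it is set.
/// `local_dir` is the project's own deps dir, used when sharing is off.
pub fn plan_dependency(shared_dir: Option<&Path>, local_dir: &Path, artifact: &str) -> Plan {
    match shared_dir {
        Some(dir) => {
            let path = dir.join(artifact);
            if path.exists() {
                Plan::Reuse(path)
            } else {
                // Not cached yet: build it into the shared dir so that
                // later projects can reuse it.
                Plan::Build(path)
            }
        }
        None => Plan::Build(local_dir.join(artifact)),
    }
}
```

A caller would obtain `shared_dir` from `std::env::var_os("CARGO_SHARED_TARGET_DIR")`. A plain existence check ignores concurrent writers, so it would need to be combined with some form of locking.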

Some people want cache management/GC as well, but as @Eh2406 mentioned, it’s hard to design this part so that it works for most people.

If this works, I probably also want to make Cargo track some usage information for each artifact, which may be useful for a separate GC/cache management program, Cargo plugin, etc.


#2

Isn’t this just sccache?

If you have a number of related crates, you can also put them into a cargo workspace to reduce target directory duplication.


#3

Since I’ve looked at the relevant Cargo code recently, let me give a few hints. (On mobile, hectic formatting.)

Currently, Cargo stores all binary artifacts in the target dir. Its layout is complicated, but mostly private, so we can change that. The idea is to store all crates.io packages elsewhere (and share that folder across projects). See the link in this comment for how this could be achieved: https://github.com/rust-lang/cargo/issues/5885#issuecomment-445015842


#4

I did some hacking on this part, and it seems to work. (Tell me if anything is wrong.)

What I changed are the functions in CompilationFiles for the deps dir and the output dir. I also pass an additional parameter to rustc: -L dependency= It seems to work on my side: all the binaries are produced in a different location and the build just works.

But I am not sure if there’s anything wrong with the change.


#5

I think one additional required bit of logic here would be to add a file lock for this new shared directory.


#6

Yes, that’s true.

I am planning to take a flock when updating the cache index. In addition, there should be some mechanism that prevents a crate from being removed once Cargo decides to use the cached binary. That makes the cache mechanism a little bit complicated.
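A minimal sketch of guarding the cache index is shown below. Note the substitution: instead of a real flock, it uses a lock file created atomically with `create_new`, which keeps the example inside the standard library. `IndexLock` and the file name `index.lock` are hypothetical.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard for the cache index. The lock is released (the lock
/// file is removed) when the guard is dropped.
pub struct IndexLock {
    path: PathBuf,
}

impl IndexLock {
    /// Try to take the lock without blocking.
    /// Returns Ok(None) if another process currently holds it.
    pub fn try_acquire(cache_dir: &Path) -> io::Result<Option<IndexLock>> {
        let path = cache_dir.join("index.lock");
        // create_new fails with AlreadyExists if the file is already there,
        // so only one process can succeed at a time.
        match OpenOptions::new().write(true).create_new(true).open(&path) {
            Ok(_) => Ok(Some(IndexLock { path })),
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(None),
            Err(e) => Err(e),
        }
    }
}

impl Drop for IndexLock {
    fn drop(&mut self) {
        // Best effort: ignore errors while releasing the lock.
        let _ = fs::remove_file(&self.path);
    }
}
```

Unlike a flock, a lock file survives a crash of its owner, so a real implementation would need stale-lock detection or an OS-level advisory lock.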


#7

Sort of similar, but it would be really nice to have a cache built into Cargo. Plus, Cargo already has the perfect hash for this.

I think later we could even add a feature to pin some binaries in the cache, so that it works just like installing a lib.

The centralized binary cache will reduce the target directory size as well.

For ad-hoc projects, I would like to keep each project separate. They are really different things, and I just want to build them separately.


#8

Shared, machine-wide guts of target subdirectories would be great! In addition to saving lots of disk space and time, it would also fix my two main pet peeves about Cargo:

  • macOS wants caches in ~/Library/Caches. Large target subdirectories scattered all over the disk cause unwanted churn in backups (Time Machine) and search (Spotlight), and tools to free up disk space don’t know how to deal with them.

  • In Vagrant you want to place Rust source code on the network drive to share it with the host. You also never want to have cache on that slow network drive with unreliable locking and broken hardlinks.


#9

BTW, is there room to improve sccache integration? Currently sccache hooks in as RUSTC_WRAPPER, but it doesn’t really know what Cargo is doing, so it has to guess some details. For example, could Cargo give sccache some uniquely identifying hash of the operation instead of sccache guessing it?


#10

@haohou if you want to push this to a real implementation, I think it makes sense to open a PR to Cargo with a rough design, to get feedback from the team early.

Note that this is a pretty significant change to Cargo’s “API”, so, in addition to implementation work, some design/discussion/stabilization efforts will also be required.


#11

Yes, I am happy to do that.

Just want to make sure: do you mean a formal RFC document for people to discuss, rather than an implementation PR to Cargo?


#12

I don’t think a full RFC is needed here (though the Cargo team might disagree), as this is not a terribly large feature. However, when preparing the PR, it might be a good idea to write some kind of documentation for this feature, which explains what exactly it does and explicitly mentions the points where the design could go one way or another.

For example, how would the user configure this shared dir? Will it be a flag like --target-dir, an env var, a config option? Will Cargo use a shared dir by default, or do you need to opt into this feature?

The implementation side of things is “easy” in the sense that we can always adjust it later. The “interface” part is much harder, because you can’t change it later due to backwards compatibility.


#13

I love to see more work in this space. I also appreciate having more people getting involved. So thank you!

In addition to the excellent questions @matklad laid out, we would also need to describe how this cache would be different from setting a single shared target directory with the CARGO_TARGET_DIR environment variable or a config file.


#14

I think it would be better to have a cache, instead of a shared target dir.

Currently I have the code change for sharing the .rlib and .d files. But another problem is that a shared target folder consumes a huge amount of disk space, so I think it should have a cache management mechanism.

For example, if a dependency is compiled only once, Cargo writes the output to the package’s own target dir. Once Cargo realizes that the binary is being used several times, it caches the compilation result. We can also put a limit on the amount of disk space the cache can use. If a binary hasn’t been used for a while, we can probably remove it without hurting performance, which makes it an LRU cache.

So we would probably end up with a file like ~/.cargo/bin-cache.conf that defines the cache configuration.

At the same time, we probably need a manifest file tracking the status of the cached binaries: for example, the timestamp of the last use, the total size of the binary, and probably a flag that prevents the binary from being deleted for a while. I would call it a lease. Consider multiple Cargo instances running at once: when one Cargo decides to use a cached binary, it needs a way to pin the binary in the cache for a little while. The lease has an expiry timestamp in case Cargo crashes during the build.
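One possible shape for such a manifest entry is sketched below. The struct and method names are hypothetical, and timestamps are plain seconds passed in by the caller rather than read from the system clock, which keeps the logic easy to test.

```rust
/// Hypothetical manifest entry for one cached binary.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    pub size_bytes: u64,
    /// Timestamp of the last use, in seconds.
    pub last_used: u64,
    /// While `now` is before this timestamp, the binary must not be deleted.
    pub lease_expires: Option<u64>,
}

impl CacheEntry {
    pub fn new(size_bytes: u64, now: u64) -> Self {
        CacheEntry { size_bytes, last_used: now, lease_expires: None }
    }

    /// Pin the binary for `ttl_secs`. If Cargo crashes during the build,
    /// the lease simply runs out instead of pinning the entry forever.
    pub fn lease(&mut self, now: u64, ttl_secs: u64) {
        self.last_used = now;
        let expiry = now.saturating_add(ttl_secs);
        // Keep the later deadline if another Cargo instance already
        // holds a longer lease on the same binary.
        self.lease_expires = Some(self.lease_expires.map_or(expiry, |e| e.max(expiry)));
    }

    /// The GC may delete the binary only when no lease is active.
    pub fn is_evictable(&self, now: u64) -> bool {
        match self.lease_expires {
            Some(deadline) => now >= deadline,
            None => true,
        }
    }
}
```

A single expiry timestamp is enough here because leases from several Cargo instances collapse into the latest deadline. Explicit release on a successful build would need a per-holder record instead.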

Any ideas on this?


#15

Any ideas on this?

Many! I have shared some of them in previous discussions, including this, this, this, or the aforementioned this. The TL;DR is that CARGO_TARGET_DIR = ~/.cargo/build_cache/ is most of the way to a good solution to the technical problems, except for the problem of cleaning it out. (There are UI problems, like only sharing things that are likely to be reused, but I don’t see them as big blockers.) There are many good ideas for how a GC should work: LRU by size, LRU by number, a last-used cutoff, whether the compiler is still installed, whether the toolchain is still installed… and many, many more. Each one is perfect for someone’s use case and unacceptable for someone else’s. It is going to take a lot of design work to find something good enough to become part of Cargo itself. Luckily, most of these ideas can be experimented with as out-of-tree cargo subcommands like cargo-sweep, or as Cargo wrappers like sccache.

Go build a thing, and please tell us (the Cargo team) what we can do to make things easier. The Cargo team is open to making changes that make exploring this design space possible! We are working on improving the docs for the files that Cargo generates, because several of these experiments have reported back that they don’t know what the impact of cleaning each file is likely to be. We are definitely willing to consider PRs that add functionality to Cargo to unlock more exploration, for example a timestamp file recording the last time Cargo used an artifact.
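As an illustration of the timestamp-file idea, an out-of-tree tool could keep a small sidecar file next to each artifact. Everything here is an assumption made for the sketch: the `.last-used` suffix, the function names, and the plain-text seconds format are not something Cargo defines.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sidecar path: "<artifact>.last-used".
fn stamp_path(artifact: &Path) -> PathBuf {
    let mut name = artifact.as_os_str().to_owned();
    name.push(".last-used");
    PathBuf::from(name)
}

/// Record that the artifact was used at `now_secs` (seconds, caller-supplied).
pub fn record_use(artifact: &Path, now_secs: u64) -> io::Result<()> {
    fs::write(stamp_path(artifact), now_secs.to_string())
}

/// Read the last recorded use. Returns None if no stamp exists or if the
/// stamp cannot be parsed.
pub fn last_used(artifact: &Path) -> io::Result<Option<u64>> {
    match fs::read_to_string(stamp_path(artifact)) {
        Ok(text) => Ok(text.trim().parse::<u64>().ok()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

A GC tool could then compare `last_used` against a cutoff to decide what to delete, without having to understand the rest of the target dir layout.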


#16

A small problem with changing the actual Cargo target dir is that the final build results (binaries, dylibs, etc.) are also placed there. It’s convenient to have ./target/release/exe in the project’s own directory, not in some shared cache dir hidden somewhere.

Basically, Cargo target dir conflates concepts of build/temp dir and products dir. These need to be separated out.

BTW, Xcode nicely separates temp, intermediaries and products as separate dirs, and I had to work around Cargo to use it from Xcode.


#17

I see. I think that’s what I currently have:

Put all non-primary outputs into the dir specified by CARGO_SHARED_TARGET_DIR. At the same time, if the required binary is already in that directory, just don’t recompile the code.

Does this sound like a reasonable change? If yes, I can do some cleanup and publish a PR.


#18

What I am doing is only changing the output of non-primary units.

Yes, maybe this is another thing the change should handle.


#19

Yes! This is the kind of UX problem the current implementation has. We would definitely need to make that better before we can change Cargo’s default behavior.

BTW, does the unstable --output-path help with this in the meantime?

I do not have the authority to unilaterally merge such a change, but I would be happy to look at and comment on such a PR! I think it may spark a useful conversation on how to make incremental progress.


#20

A fair way down this thread, and no-one has mentioned security?

Rust’s safety and security depend on the compiler. Compiler output is assumed to have gone through all the borrow-checker (etc.) validation and to be trustworthy. We can read and validate the source code input to this process (at least in theory).

How would this scheme address the concern of untrustworthy code being uploaded to the cache?

[edit] I seem to have inferred too much from the word ‘global’ here; it seems the proposal is more per-user between projects. That’s a good idea and something I’ve wished for, too. My confusion happened because there was another similar conversation about downloading build products from crates.io recently, and I conflated the two somewhere.

I’m leaving this concern here, because it’s important to keep in mind if ever going beyond a single trust boundary.