Motivation
I currently use Rust for almost everything, including software development and data analysis, and it works pretty well. But Cargo builds the dependencies separately for every project, which is annoying because:
- It slows down the build. Even a small project can take a long time to build, even with full parallelization.
- For my use case, I write a lot of data analysis scripts and ad hoc projects in Rust, which eats up my disk space really quickly.
- When I need to clean the build or make a release build, Cargo starts from scratch again and again.
So I think it would be really useful to have a system that shares identical crate builds and speeds up the build.
ccache is a good example of a similar system for other programming languages: it checks whether two builds are identical by examining the hash of the source code and the hash of the compiler parameters.
Supporting Data
The basic assumption behind this idea is that developers use certain crates, with similar configurations and features, more often than others. It's true that the Rust build system lets the root crate pass configuration and feature settings down to its dependencies, which can make the dependency binaries differ. But a cache can still speed up builds as long as it holds the most commonly used <crate, version, cfg, feature…> combinations.
I did a crude count over the crates on crates.io. It seems that 1) some crates are used far more often than others, and 2) each commonly used crate has a few commonly used feature sets. For example:
For libc
472 "*","features" []
480 "^0.2.21","features" []
506 "^0.1","features" [""]
711 "^0.1","features" []
1004 "*","features" [""]
8460 "^0.2","features" []
For serde
309 ">= 0.3.0","features" [""]
356 "^1.0","features" ["derive"]
368 "^1.0.0","features" []
1039 "^1.0.2","features" []
1538 "^0.9","features" []
1846 "^1","features" []
1920 "^0.8","features" []
8952 "^1.0","features" []
Both libraries could use the cache efficiently.
Details
Distinguishing different binaries
In fact, cargo is already able to hash different binaries. The metadata for each unit seems to fit the needs of a binary cache perfectly: it mixes in the compilation parameters and the hashes of all the dependencies. With the current cargo infrastructure, it doesn't seem like too much work to add a cache layer between the compiler and cargo.
Some crates have a build.rs script, but the system will not cache any build script, and metadata also seems to reflect changes in the build script output.
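To make the idea of a per-unit cache key concrete, here is a minimal sketch in Rust. It is not how cargo computes its metadata hash; `UnitKey`, its fields, and `cache_key` are hypothetical names used only for illustration. It assumes the key is a hash over the crate name, version, enabled features, compiler arguments, and the keys of the dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical description of one compilation unit (not a Cargo type).
#[derive(Hash)]
struct UnitKey {
    name: String,
    version: String,
    features: Vec<String>,   // kept sorted so feature order does not matter
    rustc_args: Vec<String>, // order is preserved; it may be significant
    dep_keys: Vec<u64>,      // cache keys of the dependencies, kept sorted
}

impl UnitKey {
    fn new(
        name: &str,
        version: &str,
        features: &[&str],
        rustc_args: &[&str],
        dep_keys: &[u64],
    ) -> Self {
        // Normalize the feature list: sort it and drop duplicates.
        let mut features: Vec<String> = features.iter().map(|f| f.to_string()).collect();
        features.sort();
        features.dedup();
        // Normalize the dependency keys the same way.
        let mut dep_keys = dep_keys.to_vec();
        dep_keys.sort_unstable();
        UnitKey {
            name: name.to_string(),
            version: version.to_string(),
            features,
            rustc_args: rustc_args.iter().map(|a| a.to_string()).collect(),
            dep_keys,
        }
    }

    /// Hash every field into one key. DefaultHasher is used only for
    /// illustration: its algorithm is not guaranteed to stay the same across
    /// Rust releases, so an on-disk cache would need a stable hash instead.
    fn cache_key(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```

Two units that differ only in the order of their features get the same key, while enabling a different feature set produces a different key, so the cache would store them as separate binaries.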
Cache Management
One challenging issue is keeping the cache within a limited amount of disk space. It's generally bad to put everything in the cache when most of it won't be used again. I suppose an LRU cache with a limited size would fit this compiler cache perfectly, because the most commonly used crate + cfg combinations would always stay fresh.
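A size-limited LRU policy could look roughly like the sketch below. This is only an in-memory index under assumed names (`LruIndex`, `touch`, `insert` are hypothetical, and the prototype has no cache management at all); a real implementation would also have to delete the evicted files and persist the index. Eviction here is a simple linear scan for the least recently used entry.

```rust
use std::collections::HashMap;

/// Bookkeeping for one cached artifact.
struct Entry {
    size: u64,      // size of the artifact in bytes
    last_used: u64, // logical timestamp of the most recent use
}

/// Hypothetical index of cached artifacts with a total size limit.
struct LruIndex {
    capacity: u64, // maximum total size in bytes
    used: u64,     // current total size in bytes
    clock: u64,    // logical clock, bumped on every use
    entries: HashMap<String, Entry>,
}

impl LruIndex {
    fn new(capacity: u64) -> Self {
        LruIndex { capacity, used: 0, clock: 0, entries: HashMap::new() }
    }

    /// Record a cache hit. Returns false if the key is not cached.
    fn touch(&mut self, key: &str) -> bool {
        self.clock += 1;
        match self.entries.get_mut(key) {
            Some(entry) => {
                entry.last_used = self.clock;
                true
            }
            None => false,
        }
    }

    /// Add an artifact and return the keys evicted to make room for it.
    /// Artifacts larger than the whole cache are simply not cached.
    fn insert(&mut self, key: &str, size: u64) -> Vec<String> {
        let mut evicted = Vec::new();
        if size > self.capacity {
            return evicted;
        }
        // Replacing an existing entry must not double-count its size.
        if let Some(old) = self.entries.remove(key) {
            self.used -= old.size;
        }
        self.clock += 1;
        self.entries.insert(key.to_string(), Entry { size, last_used: self.clock });
        self.used += size;
        // Evict least recently used entries until the cache fits again.
        while self.used > self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, entry)| entry.last_used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    let entry = self.entries.remove(&k).unwrap();
                    self.used -= entry.size;
                    evicted.push(k);
                }
                None => break,
            }
        }
        evicted
    }
}
```

Because every hit refreshes `last_used`, the commonly used crate + cfg combinations keep moving to the fresh end and the rarely used ones are the first to be evicted.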
Current State
I have implemented a prototype without any cache management features, and it seems to work pretty well for me. (It speeds up the build a lot once I have compiled a few different crates.)
I am not sure whether there are any further issues with the change. I basically want to write a pre-RFC, but I don't know if there are any guidelines for writing RFCs.
Any thoughts about this idea?
[update] I've opened this PR to Cargo.
It seems I am not the only one who really wants this. Both the PR and this thread have mentioned sccache and RUSTC_WRAPPER. My understanding is that this approach can avoid recompiling the same artifacts, but it still needs to copy the artifacts into the target dir, because this is how cargo works today. (Tell me if I am wrong.)
According to the discussion below, I think it's reasonable for the first version to share only the dependency binaries, controlled by the environment variable CARGO_SHARED_TARGET_DIR. If this variable is passed to Cargo, Cargo will try to use the shared dependencies (if the binary is already present) or produce them (if no binary is available).
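The use-or-produce decision could be sketched as follows. This is only an illustration of the behavior described above, not code from the PR: `Plan`, `plan_dependency`, the `deps` subdirectory, and the flat `<shared dir>/<unit name>` layout are all assumptions. The existence check is passed in as a closure so the logic can be tried without touching the file system.

```rust
use std::path::{Path, PathBuf};

/// What to do with one dependency unit (hypothetical type).
#[derive(Debug, PartialEq)]
enum Plan {
    /// The shared binary already exists: use it as is.
    Reuse(PathBuf),
    /// Shared dir is set but the binary is missing: build it there.
    BuildShared(PathBuf),
    /// No shared dir configured: build into the project's own target dir.
    BuildLocal(PathBuf),
}

/// Read the proposed variable; None means the feature is switched off.
#[allow(dead_code)]
fn shared_dir_from_env() -> Option<PathBuf> {
    std::env::var_os("CARGO_SHARED_TARGET_DIR").map(PathBuf::from)
}

/// Decide where the artifact for `unit` (for example "libc-<hash>") lives.
fn plan_dependency(
    shared_dir: Option<&Path>,
    local_target: &Path,
    unit: &str,
    exists: impl Fn(&Path) -> bool,
) -> Plan {
    match shared_dir {
        Some(dir) => {
            // Assumed layout: one entry per unit directly under the shared dir.
            let artifact = dir.join(unit);
            if exists(&artifact) {
                Plan::Reuse(artifact)
            } else {
                Plan::BuildShared(artifact)
            }
        }
        None => Plan::BuildLocal(local_target.join("deps").join(unit)),
    }
}
```

In real use the two pieces would be combined as `plan_dependency(shared_dir_from_env().as_deref(), target, unit, |p| p.exists())`, so that leaving the variable unset keeps today's behavior.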
Some people also want cache management/GC, but as @Eh2406 mentioned, it's hard to design this part so that it works for most people.
If this works, I would probably also want Cargo to track some usage information for each artifact, which could be useful for a separate GC/cache management program, cargo plugin, etc.