Good morning everyone! I’d like to both ask for feedback and provide a heads-up about a change to Cargo’s crate index. If you’re a normal user of Cargo, everything will continue working and this can be ignored.
As a bit of background, Cargo’s index for crates.io crates lives in a git repository: https://github.com/rust-lang/crates.io-index. This repository receives a new commit for every crate published on crates.io; each commit adds a line to a file corresponding to the crate that was published. Cargo then leverages git to provide incremental updates to its local copy of the index, and that local copy in turn makes crate graph resolution much quicker by avoiding lots of network requests.
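To make that concrete, here’s a sketch of what the index looks like on disk. The entry below is hypothetical and heavily simplified (real entries carry full dependency lists and a real checksum, and the commit messages differ), but it shows the shape: one file per crate, one JSON line per published version, one commit per publish.

```shell
# Crates with names of 4+ characters live at <first two>/<next two>/<name>.
mkdir -p index/se/rd
cd index
git init -q .

# Publishing a version appends one JSON line to the crate's file...
# (simplified, hypothetical entry -- real entries have deps and a cksum)
printf '%s\n' \
  '{"name":"serde","vers":"1.0.0","deps":[],"cksum":"...","yanked":false}' \
  >> se/rd/serde

# ...and creates one commit, so the history grows with every publish.
git add se/rd/serde
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'Publishing serde v1.0.0'
git log --oneline
```

This is why the history length tracks the number of publishes so closely: every publish is exactly one commit.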
The crates.io index started a long time ago as an empty repository, and we’ve been growing it incrementally ever since, one commit at a time. At the time of this writing there are just over 100k commits in the index’s history. This does add up over time, though! Although git makes an incremental update of the index a cheap operation, cloning the index from scratch downloads this entire 100k-commit history, which can be quite sizable.
Let’s take a look at some numbers. I’ve prepared two repositories:

- https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot - a snapshot of what the index looked like when I ran this test, simply a fork/clone of the crates.io-index repository, history and all.
- https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed - similar to the snapshot, except the entire history now lives only on the `snapshot` branch. The `master` branch has squashed all 100k commits down to just one.
Let’s see how big these repositories are:
```
$ git init foo
$ cd foo
$ time git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot master
remote: Enumerating objects: 552033, done.
remote: Total 552033 (delta 0), reused 0 (delta 0), pack-reused 552033
Receiving objects: 100% (552033/552033), 86.90 MiB | 49.47 MiB/s, done.
Resolving deltas: 100% (363407/363407), done.
From https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot
 * branch            master     -> FETCH_HEAD
git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot  13.27s user 0.95s system 177% cpu 8.037 total
$ du -sh .
102M    .
```
Ok, not awful! On my (very fast) machine and network it takes just over 8 seconds to clone the index’s entire history, and the result takes up 102MB of space on disk. Note, however, that nearly 87MB of data was transferred over the network.
Next, let’s take a look at the squashed branch:
```
$ git init foo
$ cd foo
$ time git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed master
remote: Counting objects: 25812, done.
remote: Compressing objects: 100% (12468/12468), done.
remote: Total 25812 (delta 10958), reused 1814 (delta 1814), pack-reused 11530
Receiving objects: 100% (25812/25812), 9.56 MiB | 6.73 MiB/s, done.
Resolving deltas: 100% (11688/11688), done.
From https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed
 * branch            master     -> FETCH_HEAD
git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed  1.20s user 0.22s system 35% cpu 3.990 total
$ du -sh .
11M     .
```
That’s a huge improvement! Not only are we downloading nearly 10x less data, it was also twice as fast (again on my very fast machine and network), and it takes up nearly 10x less space on disk. Clearly that history is costing us!
Thankfully, Cargo was designed from the get-go with this problem in mind. We always knew that the index would get larger and larger, so we always wanted to keep the option of rewriting the history into one commit in our back pocket. To that end, all versions of Cargo have been ready for this change. Let’s take a look:
```
# First, let's see the real index
$ rm -rf $HOME/.cargo/registry
$ time cargo update
    Updating registry `https://github.com/rust-lang/crates.io-index`
cargo update  13.65s user 0.86s system 97% cpu 14.869 total

# Next, let's see our snapshot index
$ export CARGO_REGISTRY_INDEX=https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot
$ time cargo update
warning: custom registry support via the `registry.index` configuration is being removed, this functionality will not work in the future
    Updating registry `https://github.com/rust-lang/crates.io-index`
cargo update  15.33s user 1.11s system 99% cpu 16.610 total

# And finally, let's see the squashed index
$ export CARGO_REGISTRY_INDEX=https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed
$ time cargo update
warning: custom registry support via the `registry.index` configuration is being removed, this functionality will not work in the future
    Updating registry `https://github.com/rust-lang/crates.io-index`
cargo update  1.27s user 0.16s system 32% cpu 4.398 total
```
Here we’re again seeing some huge wins from using our squashed index in Cargo. Note that we didn’t `rm -rf` the index between each step, so Cargo is naturally re-updating the index when the histories diverge. In other words, Cargo easily handles disjoint histories (such as when we roll the index into one commit).
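For the curious, the effect is the same as what you’d do by hand: when the fetched history doesn’t connect to the local one, you simply fetch and hard-reset onto the new tip. Here’s a rough local sketch of that situation (the repository names and file contents are made up; this just demonstrates the disjoint-history case):

```shell
set -e
G="git -c user.name=demo -c user.email=demo@example.com"

# An "upstream" index with a couple of commits...
git init -q upstream && cd upstream
echo a > crate-a && $G add . && $G commit -qm 'add crate-a'
echo b > crate-b && $G add . && $G commit -qm 'add crate-b'
cd ..

# ...and a local checkout of it.
git clone -q upstream local

# Upstream squashes its whole history into one parentless commit.
cd upstream
$G reset -q $(git commit-tree HEAD^{tree} -m 'Roll index into one commit')
cd ../local

# The histories are now disjoint, but fetch + hard reset handles it fine.
git fetch -q origin
git reset -q --hard origin/HEAD
git log --oneline   # just the one squashed commit
```

This works because clones fetch with a force refspec (`+refs/heads/*:refs/remotes/origin/*`), so a non-fast-forward update of the remote-tracking branch is perfectly fine.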
Ok, so with all that background, I’d like to propose that soon (around next week) we roll the index flat into one commit. More precisely, I will execute the following:
```
$ git fetch --all
$ git reset --hard origin/master
$ git rev-parse HEAD
# make note of this commit
$ git push git@github.com:rust-lang/crates.io-index HEAD:snapshot-$date
$ git reset $(git commit-tree HEAD^{tree} -m "Roll index into one commit")
$ git push git@github.com:rust-lang/crates.io-index \
    HEAD:master \
    --force-with-lease=refs/heads/master:$the_earlier_commit
```
This should push the entire state of the current index into a branch on the same git repository (for archival purposes). Afterwards it’ll convert everything into one commit and then push that to the current master branch (using a compare-and-swap operation to make sure we don’t lose any published crates).
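One nice property worth noting: `git commit-tree HEAD^{tree}` reuses the current tree object verbatim, so the squashed commit is guaranteed to contain byte-for-byte the same files as before. You can check that property yourself on a throwaway repository (names and contents below are arbitrary):

```shell
set -e
G="git -c user.name=demo -c user.email=demo@example.com"

# Build a throwaway repo with a bit of history.
git init -q repo && cd repo
echo one > f && $G add . && $G commit -qm one
echo two > f && $G add . && $G commit -qm two

before=$(git rev-parse HEAD^{tree})

# The same squash as above: one new parentless commit over the same tree.
$G reset -q $(git commit-tree HEAD^{tree} -m 'Roll index into one commit')

after=$(git rev-parse HEAD^{tree})
test "$before" = "$after" && echo "trees identical"
git rev-list --count HEAD   # -> 1
```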
After this all new checkouts of the index should be much faster. Existing checkouts will pay the same one-time-cost as a fresh checkout the first time they’re updated (to download the new commit), but after that everyone will enjoy incremental updates again.
And… that’s it! Do others have thoughts on this? Concerns? Ideas? Happy to hear them!
As a side note, some may read this and ask “why not just use shallow checkouts?” This is a good question! A shallow checkout doesn’t download the full history, so it wouldn’t suffer from this slows-down-over-time problem. There are two primary problems with shallow checkouts, however:
- Performing an incremental update of a shallow checkout is very expensive as a server operation. The CocoaPods project on GitHub has historically run into problems with this strategy. Effectively, this isn’t a solution that scales.
- Furthermore, libgit2, the library that Cargo uses for git operations, doesn’t implement shallow clones.
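For reference, the shallow clone itself is easy to demonstrate; it’s the subsequent incremental fetches that push expensive work onto the server. A quick local sketch (using a `file://` URL, since git ignores `--depth` for plain local-path clones):

```shell
set -e
G="git -c user.name=demo -c user.email=demo@example.com"

# A repository with two commits of history...
git init -q full && cd full
echo one > f && $G add . && $G commit -qm one
echo two > f && $G add . && $G commit -qm two
cd ..

# ...shallow-cloned with only the most recent commit.
git clone -q --depth 1 "file://$PWD/full" shallow
cd shallow
git rev-list --count HEAD   # -> 1, despite 2 commits upstream
```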
Another question others may have is “why use git at all?” I won’t go too much into that here as it’s not really on topic for this discussion specifically. In short, though, it gives us incremental updates, is cross-platform, and is easy to integrate with.