Cargo's crate index: upcoming squash into one commit

alexcrichton · September 20, 2018, 7:03pm

Good morning everyone! I’d like to both ask for feedback and provide a heads-up about a change to Cargo’s crate index. If you’re a normal user of Cargo, everything will continue working and this can be ignored.

As a bit of background, Cargo’s index for crates.io crates lives in a git repository: https://github.com/rust-lang/crates.io-index. This git repository receives a new commit for every crate version published on crates.io. Each git commit adds a line to a file corresponding to the crate that was published. Cargo then leverages git to provide incremental updates to its copy of the index stored locally, and in turn the index stored locally is used to make crate graph resolution much quicker by avoiding lots of network requests.

The crates.io index started a long time ago with an empty repository, and we’ve been incrementally growing it ever since then (one commit at a time). At the time of this writing there are just over 100k commits in the history of the index. This does add up over time though! Although git makes an incremental update of the index a cheap operation, cloning the index from scratch downloads this entire 100k-commit-long history, which can be quite sizable!

Let’s take a look at some numbers. I’ve prepared two repositories:

https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot - a snapshot of what the index looked like when I ran this test, simply a fork/clone of the crates.io-index repository, history and all.
https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed - similar to the snapshot, except the entire history now only lives on the snapshot branch. The master branch has squashed all 100k commits down to just one.

Let’s see how big these repositories are:

$ git init foo
$ cd foo
$ time git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot master
remote: Enumerating objects: 552033, done.
remote: Total 552033 (delta 0), reused 0 (delta 0), pack-reused 552033
Receiving objects: 100% (552033/552033), 86.90 MiB | 49.47 MiB/s, done.
Resolving deltas: 100% (363407/363407), done.
From https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot
 * branch                  master     -> FETCH_HEAD
git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot  13.27s user 0.95s system 177% cpu 8.037 total
$ du -sh .
102M    .

Ok not awful! It takes (on my very fast machine and network) just over 8 seconds to clone the index’s entire history and it takes up 102MB of space on my machine. Note, however, that nearly 87MB of data was transferred over the network.

Next let’s take a look at the squashed branch:

$ git init foo
$ cd foo
$ time git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed master
remote: Counting objects: 25812, done.
remote: Compressing objects: 100% (12468/12468), done.
remote: Total 25812 (delta 10958), reused 1814 (delta 1814), pack-reused 11530
Receiving objects: 100% (25812/25812), 9.56 MiB | 6.73 MiB/s, done.
Resolving deltas: 100% (11688/11688), done.
From https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed
 * branch              master     -> FETCH_HEAD
git fetch https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed  1.20s user 0.22s system 35% cpu 3.990 total
$ du -sh .
11M    .

That’s a huge improvement! Not only are we downloading nearly 10x less data, it was also twice as fast (on my very fast machine with a very fast network) and takes up nearly 10x less space on disk. Clearly that history is costing us!

Thankfully, Cargo was designed from the get-go with this problem in mind. We always knew that the index was going to get larger and larger, so we always wanted to keep the option to rewrite the history into one commit in our back pocket. To that end, all versions of Cargo have been ready for this change. Let’s take a look:

# First, let's see the real index
$ rm -rf $HOME/.cargo/registry
$ time cargo update
    Updating registry `https://github.com/rust-lang/crates.io-index`
cargo update  13.65s user 0.86s system 97% cpu 14.869 total

# Next, let's see our snapshot index
$ export CARGO_REGISTRY_INDEX=https://github.com/alexcrichton/crates.io-index-2018-09-20-snapshot
$ time cargo update
warning: custom registry support via the `registry.index` configuration is being removed, this functionality will not work in the future
    Updating registry `https://github.com/rust-lang/crates.io-index`
cargo update  15.33s user 1.11s system 99% cpu 16.610 total

# And finally, let's see the squashed index
$ export CARGO_REGISTRY_INDEX=https://github.com/alexcrichton/crates.io-index-2018-09-20-squashed
$ time cargo update
warning: custom registry support via the `registry.index` configuration is being removed, this functionality will not work in the future
    Updating registry `https://github.com/rust-lang/crates.io-index`
cargo update  1.27s user 0.16s system 32% cpu 4.398 total

Here we’re seeing again some huge wins from using our squashed index in Cargo. Note that we didn’t rm -rf the index between each step, so Cargo’s naturally re-updating the index when the histories diverge. In other words, Cargo easily handles disjoint histories (such as when we roll the index into one commit).
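For anyone who wants to see the disjoint-history case in isolation, here is a minimal sketch using plain git against throwaway local repositories. It is not Cargo's own code path (Cargo goes through libgit2), and every path, crate name, and JSON line in it is made up for illustration. It keeps a single remote URL throughout, so the existing checkout really does see its branch replaced by an unrelated commit.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/index.git"

# "upstream" stands in for whoever maintains the index.
git init -q "$work/upstream"
cd "$work/upstream"
echo '{"name":"a","vers":"0.1.0"}' > a
git add a
git commit -q -m "Updating crate a#0.1.0"
echo '{"name":"a","vers":"0.2.0"}' >> a
git commit -q -am "Updating crate a#0.2.0"
git push -q "$work/index.git" HEAD:refs/heads/master

# An existing checkout, as a long-time user would have.
git clone -q -b master "$work/index.git" "$work/checkout"
old_tip=$(git -C "$work/checkout" rev-parse origin/master)

# Upstream replaces the branch with a single parentless commit.
git reset -q "$(git commit-tree 'HEAD^{tree}' -m 'Roll index into one commit')"
git push -q --force "$work/index.git" HEAD:refs/heads/master

# The checkout fetches with a forced refspec (leading "+"), so the
# non-fast-forward update is accepted, then moves to the new tip.
cd "$work/checkout"
git fetch -q origin '+refs/heads/master:refs/remotes/origin/master'
git reset -q --hard origin/master

new_tip=$(git rev-parse HEAD)
history_len=$(git rev-list --count HEAD)
content_lines=$(wc -l < a | tr -d ' ')
echo "history: $history_len commit, file a: $content_lines lines"
```

The checkout ends up on a one-commit history with the same file contents as before, which is the behaviour the squash relies on.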

Ok so with all that background, I’d like to propose that soon (around next week) we roll the index flat into one commit. More precisely I will execute the following:

$ git fetch --all
$ git reset --hard origin/master
$ git rev-parse HEAD
# make note of this commit
$ git push git@github.com:rust-lang/crates.io-index HEAD:snapshot-$date
$ git reset $(git commit-tree HEAD^{tree} -m "Roll index into one commit")
$ git push git@github.com:rust-lang/crates.io-index \
  HEAD:master \
  --force-with-lease=refs/heads/master:$the_earlier_commit

This should push the entire state of the current index into a branch on the same git repository (for archival purposes). Afterwards it’ll convert everything into one commit and then push that to the current master branch (using a compare-and-swap operation to make sure we don’t lose any published crates).
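The squash step itself can be tried safely in a scratch repository. The sketch below assumes nothing about the real index: the repository, the file named demo, and the simplified JSON lines are all hypothetical. It shows that `git commit-tree HEAD^{tree}` creates a parentless commit pointing at exactly the same tree, while a branch keeps the old history reachable.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Build a small history: one commit per "published" line.
for v in 0.1.0 0.2.0 0.3.0; do
    echo "{\"name\":\"demo\",\"vers\":\"$v\"}" >> demo
    git add demo
    git commit -q -m "Updating crate demo#$v"
done

commits_before=$(git rev-list --count HEAD)
tree_before=$(git rev-parse 'HEAD^{tree}')

# Keep the old history reachable, as the snapshot branch is meant to.
git branch snapshot

# Create a parentless commit for the same tree, then move the current
# branch to it (a mixed reset leaves the working tree untouched).
git reset -q "$(git commit-tree 'HEAD^{tree}' -m 'Roll index into one commit')"

commits_after=$(git rev-list --count HEAD)
tree_after=$(git rev-parse 'HEAD^{tree}')
snapshot_commits=$(git rev-list --count snapshot)

echo "before: $commits_before commits, after: $commits_after commit"
echo "snapshot branch still has $snapshot_commits commits"
```

Because the tree hash is unchanged, the contents of every index file are byte-for-byte the same before and after; only the commit history differs.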

After this, all new checkouts of the index should be much faster. Existing checkouts will pay the same one-time cost as a fresh checkout the first time they’re updated (to download the new commit), but after that everyone will enjoy incremental updates again.

And… that’s it! Do others have thoughts on this? Concerns? Ideas? Happy to hear them!

As a side note, some may read this and ask “why not just use shallow checkouts?” This is a good question! A shallow checkout doesn’t check out the full history and wouldn’t suffer from this slows-down-over-time problem. There are two primary problems with shallow checkouts, however:

Performing an incremental update of a shallow checkout is very expensive as a server operation. The CocoaPods project on GitHub has historically run into problems with this strategy. Effectively, this isn’t a solution that scales.
Furthermore, libgit2, the library that Cargo uses for git operations, doesn’t implement shallow clones.

Another question others may have is “why use git at all?” I won’t go too much into that here as it’s not really on topic for this discussion specifically. In short though it gives us incremental updates, is cross platform, and easy to integrate with.

sfackler · September 20, 2018, 7:10pm

Yes please!

jtgeibel · September 20, 2018, 7:18pm

In your sequence of 3 cargo updates above, do you end up with 1 or 3 hash directories under $HOME/.cargo/registry?

I’m not at a computer to test myself at the moment, but I’m assuming that environment variable results in 3 separate hashes.

I ask because I’m not certain that sequence demonstrates that cargo correctly handles history rewrites. Also, how far back, in terms of old cargo releases, do we want to test this to ensure old clients behave correctly?

cuviper · September 20, 2018, 7:36pm

I'm surprised by this, as I'd expect only that one commit object needs to be downloaded. Its tree object should be identical to the snapshot. But maybe git isn't as intelligent about this as I'd hope.

$ git cat-file -p FETCH_HEAD
tree 16376416fbe2cbf6576fbbe9e5bdb51f33c831f7
parent 1c6292da974cdd719eda68ad4cd5853f52033772
author bors <bors@rust-lang.org> 1537467333 +0000
committer bors <bors@rust-lang.org> 1537467333 +0000

Updating crate `bao#0.3.0`

vs

$ git cat-file -p FETCH_HEAD
tree 16376416fbe2cbf6576fbbe9e5bdb51f33c831f7
author Alex Crichton <alex@alexcrichton.com> 1537468080 -0700
committer Alex Crichton <alex@alexcrichton.com> 1537468080 -0700

Restart the index as of 2018-09-20

Eh2406 · September 20, 2018, 7:42pm

If we are going to do this, should we set a naming scheme for the archive branches and a tentative schedule for how often we will make them?

gnzlbg · September 20, 2018, 7:43pm

@alexcrichton did you consider using --depth 1 when cloning the index for the first time? If so, how does it compare with the approaches you are suggesting?

carols10cents · September 20, 2018, 7:44pm

Have you tested that crates.io handles this change okay? Also we should probably put crates.io in maintenance mode for the few minutes that you’re swapping the branches out; that would definitely prevent new crates from going missing.

cuviper · September 20, 2018, 7:56pm

That's a shallow clone, which he did address.

gnzlbg · September 20, 2018, 7:57pm

Duh, I did not know that that’s what --depth 1 is called. Thanks.

alexcrichton · September 20, 2018, 8:00pm

Ah yes good point, there are three URLs. Rest assured though that if you go through the exercise of updating the branches live (to keep the same hash) it works out. I've tested this historically and simply failed to have a good set of instructions above!

Oh I should clarify that I haven't actually tested the download impact times, you may be right! I know it works from historical testing, however.

Perhaps! The naming scheme I think is fine to do something like snapshot-YYYY-MM-DD, and for schedule I think we'll stick with an as-needed basis for now until we've done it once or twice.

I haven't explicitly tested the registry but I have written all the code related to this in both Cargo and in crates.io. Additionally the registry already has to handle this use case where two different servers are competing to update the index. Switching to a different commit will look exactly like a different registry has pushed a commit, which has already been exercised quite a bit.

The --force-with-lease operation is intended to be a compare-and-swap so we don't need to take downtime on crates.io
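To make the compare-and-swap behaviour concrete, here is a small local sketch, assuming only stock git and throwaway repositories (all paths and crate names are hypothetical). A "publisher" pushes a new version between the moment the expected commit is noted and the moment the squashed commit is pushed; the leased push is then rejected rather than overwriting the new publish, and a retry against the freshly observed commit succeeds.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/index.git"

# "squasher" plays the person rewriting history.
git init -q "$work/squasher"
cd "$work/squasher"
echo '{"name":"a","vers":"0.1.0"}' > a
git add a
git commit -q -m "Updating crate a#0.1.0"
git push -q "$work/index.git" HEAD:refs/heads/master
expected=$(git rev-parse HEAD)   # the commit we believe master points at

# Meanwhile, "publisher" pushes a new version to the same remote.
git clone -q -b master "$work/index.git" "$work/publisher"
cd "$work/publisher"
echo '{"name":"a","vers":"0.2.0"}' >> a
git commit -q -am "Updating crate a#0.2.0"
git push -q origin HEAD:refs/heads/master
published=$(git rev-parse HEAD)

# Squash and try to swap while still expecting the stale commit.
cd "$work/squasher"
git reset -q "$(git commit-tree 'HEAD^{tree}' -m 'Roll index into one commit')"
if git push -q "$work/index.git" HEAD:refs/heads/master \
    "--force-with-lease=refs/heads/master:$expected" 2>/dev/null; then
    stale_push=accepted
else
    stale_push=rejected
fi
remote_head=$(git --git-dir="$work/index.git" rev-parse refs/heads/master)

# Retry: fetch the new state, squash again, lease against what we just saw.
git fetch -q "$work/index.git" master
git reset -q --hard FETCH_HEAD
seen=$(git rev-parse HEAD)
git reset -q "$(git commit-tree 'HEAD^{tree}' -m 'Roll index into one commit')"
git push -q "$work/index.git" HEAD:refs/heads/master \
    "--force-with-lease=refs/heads/master:$seen"
final_count=$(git --git-dir="$work/index.git" rev-list --count refs/heads/master)

echo "stale push: $stale_push, final history: $final_count commit"
```

The rejected push leaves the remote exactly where the publisher put it, so no published version can be lost to the race.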

kornel · September 21, 2018, 1:53pm

That’s OK by me. It won’t break crates.rs, but I’m slightly worried it increases dependence on the unofficial crates.io API.

The index content doesn’t directly contain information about when each version was published. This is possible to infer from commit dates. If you keep all the history in some branches or tags, it’ll still be possible, but even more cumbersome. If you just force push, then publication dates won’t be available in the index any more.
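As a rough illustration of inferring a publication date from commit dates, the sketch below uses git's pickaxe search (`git log -S`) in a scratch repository. The crate name, the simplified JSON lines, and the dates are invented for the example; real index lines carry more fields, so the search string would need adjusting.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .

# publish <crate> <version> <date>: append one line and commit at <date>.
publish() {
    echo "{\"name\":\"$1\",\"vers\":\"$2\"}" >> "$1"
    git add "$1"
    GIT_AUTHOR_DATE="$3" GIT_COMMITTER_DATE="$3" \
        git commit -q -m "Updating crate $1#$2"
}
publish a 0.1.0 "2018-01-01 00:00:00 +0000"
publish a 0.2.0 "2018-09-01 00:00:00 +0000"

# With full history: the commit that introduced the 0.2.0 line carries
# the publication date.
published_at=$(git log -1 --format=%cd --date=short \
    -S'"vers":"0.2.0"' -- a)

# After a squash, the same query only finds the squash commit's date.
squashed=$(GIT_AUTHOR_DATE="2018-09-20 00:00:00 +0000" \
    GIT_COMMITTER_DATE="2018-09-20 00:00:00 +0000" \
    git commit-tree 'HEAD^{tree}' -m 'Roll index into one commit')
after_squash=$(git log -1 --format=%cd --date=short \
    -S'"vers":"0.2.0"' "$squashed" -- a)

echo "full history: $published_at, squashed: $after_squash"
```

Against the squashed commit every version appears to have been published on the day of the squash, which is why the dates are only recoverable from an archived branch.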

I’m worried about relying on crates.io API, since that’s borderline scraping of data, and it isn’t as easy to fully clone like a git repo.
The index wouldn’t be important and could be treated as a throw-away copy if there was another official source of truth for the complete crate data and its full history (and ownership BTW, which the crates index doesn’t have), e.g. data dumps from crates.io, or another git repo with whole history preserved.
Or if you plan on squashing the data, then it’d be good to add extra fields to the crates’ JSON (like publication date, and publisher’s github ID).

bill_myers · September 21, 2018, 3:10pm

Since the index is append-only, how about changing it to just put everything in a single file, appending new data at the end, and on the client requesting via HTTP only the tail part of the file since the last update position, and indexing that in a local SQLite database or any file-based key-value store?

The file could still be stored in a Git repository as long as the web interface supports HTTP byte range requests for raw files (which I guess GitHub does).

alexcrichton · September 21, 2018, 6:37pm

As a reminder the index is effectively an internal data structure of Cargo and will always remain so. It's critical to Cargo's performance so we will change it over time as we see fit to match Cargo's performance needs.

Note that for the immediate future, though, historical data will be preserved on a separate branch.

Perhaps! Like I mentioned in the OP though discussions about not using git I think are off-topic for this thread.

Eh2406 · September 21, 2018, 7:59pm

Can we add a readme to its repository that makes this clear?

alexcrichton · September 21, 2018, 8:51pm

Of course!

arielb1 · September 21, 2018, 11:04pm

How would this interact with the signing of the index introduced in RFC #2474 (https://github.com/rust-lang/rfcs/pull/2474)? The signature verification there wants to follow the entire principal line of commits to be able to follow the rotating key.

cuviper · September 21, 2018, 11:23pm

Maybe the squashed commit should mention the snapshot commit it came from, so you could still indirectly follow its history.

glandium · September 22, 2018, 1:38am

The git protocol only knows about commits. Without commits in common between the old and the new history, it's going to download the whole thing. One way to mitigate this somehow would be to keep the original root commit (a33de1c98898dc1baf541ee2c5162e7baea7c838). But if there were a lot of changes since then, it won't help much.
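A quick way to see the "no commits in common" situation locally is `git merge-base`, sketched here in a scratch repository with a made-up file. The old tip and the squashed commit share a tree, yet git reports no common ancestor between them.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
echo '{"name":"a","vers":"0.1.0"}' > a
git add a
git commit -q -m "Updating crate a#0.1.0"

old=$(git rev-parse HEAD)
new=$(git commit-tree 'HEAD^{tree}' -m 'Roll index into one commit')

# Same content: both commits point at one tree object.
if [ "$(git rev-parse "$old^{tree}")" = "$(git rev-parse "$new^{tree}")" ]; then
    same_tree=yes
else
    same_tree=no
fi

# No shared history: merge-base exits non-zero when there is no
# common ancestor.
if base=$(git merge-base "$old" "$new"); then
    common=$base
else
    common=none
fi

echo "same tree: $same_tree, common ancestor: $common"
```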

Edit: fetching the first commit of the original repo creates a .git of 136K, so it's not going to make a significant difference.

Edit 2: In fact, .git/objects is 36K.

Edit 3: Well, the first commit is just one config.json file, and nothing actually in the index, so...

cuviper · September 22, 2018, 6:58am

We could potentially stack the squashed commits each time we do this, so next time the squashed update can be just relative to the last.

But anyway, the full squashed commit is still small enough that it’s probably not worth much worry.

est31 · September 22, 2018, 10:39am

I’m a bit sad that this is done before data dumps of crates.io are being released. We got them promised already some time ago and now talk is about removing data, not adding it :/. I got a ton of useful information out of the git history, mostly about the origin of weird bugs in the index.

To continue my whining, right now you can just clone the index, change the master branch to an older commit, and point cargo at it. With this, you can fool cargo into believing that only those crates exist, thus altering its resolution behaviour to match the particular time of that commit. That’s an immensely powerful and useful feature. I proposed an automated way of doing this here. If your time stamp could be in different branches, or worse, different repos even, this could be a bit of an issue. And even worse if the historical data gets purged completely, and one needs to rely on third parties recording the history or construct it artificially from dumps (it’s a bit silly imo that you’d then have to create a fake git history while the real one was just tossed away).
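The "index as of a given date" trick can be sketched with `git rev-list --before`, shown here on a scratch repository. The crate names, dates, and branch name are hypothetical, and the sketch stops at creating the branch; how cargo is then pointed at it is left out.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .

# publish <crate> <version> <date>: append one line and commit at <date>.
publish() {
    echo "{\"name\":\"$1\",\"vers\":\"$2\"}" >> "$1"
    git add "$1"
    GIT_AUTHOR_DATE="$3" GIT_COMMITTER_DATE="$3" \
        git commit -q -m "Updating crate $1#$2"
}
publish a 0.1.0 "2018-01-01 00:00:00 +0000"
publish b 0.1.0 "2018-06-01 00:00:00 +0000"
publish a 0.2.0 "2018-09-01 00:00:00 +0000"

# Newest commit whose committer date is before the cutoff.
cutoff="2018-07-01 00:00:00 +0000"
at=$(git rev-list -1 --before="$cutoff" HEAD)
git branch as-of-cutoff "$at"

subject=$(git log -1 --format=%s as-of-cutoff)
a_versions=$(git show as-of-cutoff:a | wc -l | tr -d ' ')

echo "branch tip: $subject; versions of a visible: $a_versions"
```

This only works while the commits before the cutoff are still reachable, so after a squash the lookup would have to run against the archived snapshot branch instead of master.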

So in summary, let me say that I’m not a fan of this. However, I agree that wanting to make the “mainstream” cargo usage faster is a big concern and purging the history seems to achieve great things for little work, so I guess doing this is reasonable.

Similar to @kornel’s statement, I think that cargo-local-serve won’t immediately be impacted by this change.
