Re-squash the crates.io repository

In September 2018 we squashed the crates.io index history into a single commit (see "Cargo's crate index: upcoming squash into one commit").

Are there any plans to re-squash the index? It's at nearly 80k commits now, and I'm realizing how slow a fresh cargo build is when you're in a place with poor internet. Currently at 25% after 20 minutes.

Also, is this a tenable implementation? In a hypothetical future where crates.io is as big as npm, would we need to squash every week?

I think the root cause is that libgit2 doesn't support shallow clones. So... here we are.

I'm surprised that rustup doesn't bundle the index with the installation. I presume the Rust package could contain a well-compressed .git dir that would be smaller (lzma on top of git packs) and download faster (one file instead of the git protocol).

I've been meaning to test an alternative: it's possible to push and pull refs that point at git objects other than commits (e.g. in this repo I stored a blob object directly). So instead of pulling a ref that covers the entire history of the index, cargo could pull a ref pointing directly at the tree of the current head. That would keep the entire history in the repo without squashing, while cargo pulls only the objects needed for the current index rather than the objects needed to look at any point in that history.

EDIT: The main thing that needs testing is whether GitHub supports partial updates when you only have refs pointing at trees. It's been a while since I looked at the protocol, but I think it operates at the commit level, so it might not be able to skip the objects known to both sides if you're just requesting a tree. A probably simpler alternative would be to keep a ref pointing at a separate always-squashed commit, which should be trivial for crates.io to generate and update on each change.

The intent is to automate the squashing process with a cron job, and we are just waiting for someone on the crates.io team to set it up. More details can be found in "When should we next squash the index?" (Issue #47, rust-lang/crates-io-cargo-teams on GitHub).

From our current projections, it shouldn't be a serious issue for a long time.

AFAIK, this is not an option due to the way GitHub's servers work (though that is fairly old information).

GitHub's servers work fine, and the official git implementation can deal with that. The libgit2 library still doesn't support it, though. See:

It is capable of doing shallow clones, but the concern is performance. CocoaPods had to stop using it for that reason as it was causing CPU problems on GitHub's end. That was 2016, though, so I'm not sure if that's still the case.

That seems like a promising approach. git commit-tree can make such a squashed branch in a fraction of a second (it doesn't need to touch any content).
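
As a rough sketch of how that always-squashed ref could be maintained (not what crates.io actually runs): the Python below shells out to git, reuses the tree of the current head, and writes a parentless commit with git commit-tree. The helper names, the refs/heads/squashed ref name, the commit message, and the placeholder identity are all assumptions for illustration.

```python
import os
import subprocess


def git(repo, *args):
    """Run a git command inside `repo` and return its trimmed stdout."""
    # Placeholder identity so commits work without any global git config.
    env = dict(
        os.environ,
        GIT_AUTHOR_NAME="demo",
        GIT_AUTHOR_EMAIL="demo@example.invalid",
        GIT_COMMITTER_NAME="demo",
        GIT_COMMITTER_EMAIL="demo@example.invalid",
    )
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True,
        capture_output=True,
        text=True,
        env=env,
    )
    return result.stdout.strip()


def update_squashed_ref(repo, source="HEAD", ref="refs/heads/squashed"):
    """Point `ref` at a new parentless commit that reuses the tree of `source`."""
    # Resolve the tree object of the source commit; no file content is read.
    tree = git(repo, "rev-parse", f"{source}^{{tree}}")
    # commit-tree without -p creates a commit with no parents (a "squash").
    commit = git(repo, "commit-tree", "-m", "Squashed index snapshot", tree)
    # Move (or create) the squashed ref; the source branch is left untouched.
    git(repo, "update-ref", ref, commit)
    return commit
```

Calling update_squashed_ref("/path/to/index") after each index change would move the ref to a fresh one-commit snapshot, while the normal branch keeps its full history.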

Is squashing on every commit wanted? Don't we lose the ability to download small incremental updates then?

In the long run, I think it would be best to move the index off of Git entirely.

Even with a single commit's worth of history, a Git checkout has to store all files twice (once in .git, once as the working copy), roughly doubling disk space requirements, for no benefit in this use case. @Nemo157's idea of pulling down objects in a nonstandard way might be able to avoid that while still staying on Git... but even a single copy of the index is 70MB and growing. And that's basically a waste of both disk space and bandwidth.

Instead, why not have a server with an API that takes a list of crates as input, and returns index data for only those crates and their dependencies? It’s not like there’s any speed benefit in caching this data offline, if cargo update has to sync the index every time you run it.
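
To make that concrete, here is a minimal sketch of what such an endpoint would have to compute on the server side, assuming a hypothetical in-memory index that maps a crate name to the names of its direct dependencies. Versions, features, and the HTTP layer are all left out, and the crate names are made up.

```python
from collections import deque

# Hypothetical index: crate name -> names of its direct dependencies.
EXAMPLE_INDEX = {
    "my-app-dep": ["http-lib", "json-lib"],
    "http-lib": ["bytes-lib"],
    "json-lib": ["bytes-lib"],
    "bytes-lib": [],
    "unrelated-lib": ["json-lib"],
}


def resolve_subset(index, requested):
    """Return index entries for `requested` crates and their transitive deps."""
    subset = {}
    queue = deque(requested)
    while queue:
        name = queue.popleft()
        # Skip crates already collected and names the index does not know.
        if name in subset or name not in index:
            continue
        subset[name] = index[name]
        # Dependencies of this crate must be sent to the client as well.
        queue.extend(index[name])
    return subset
```

With the example data, resolve_subset(EXAMPLE_INDEX, ["my-app-dep"]) returns entries for four crates and leaves out unrelated-lib, which is the saving over cloning everything.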

Is that fundamental? I thought you could do a bare checkout that doesn't have a working copy and still read blobs out.
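
For what it's worth, plain git can do this: a bare repository has no checked-out files, but blobs can still be read from the object database by revision and path. Here is a small sketch using git cat-file from Python. The function name is made up, and this only shows the git CLI behaviour, not how Cargo itself reads the index.

```python
import subprocess


def read_blob(bare_repo, path, rev="HEAD"):
    """Return the bytes of `path` at `rev`, read from a bare repository."""
    # "<rev>:<path>" names the blob stored at that path in the given revision.
    result = subprocess.run(
        ["git", "--git-dir", bare_repo, "cat-file", "blob", f"{rev}:{path}"],
        check=True,
        capture_output=True,
    )
    return result.stdout
```

Nothing is written to disk outside the object store, so there is only one copy of each file's content.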

Cargo has used a bare repo without a working copy for a while now:

It needs testing, but I think that if each from-scratch commit is merged into a long-running branch, GitHub will still be able to run the algorithm that determines which objects are shared and send only a diff. (The issue may again be that this is only checked at the commit level, but fundamentally the protocol used for fetching is capable of this.)

As the index grows it seems inevitable that it will get replaced eventually with some kind of query API, but the current solution has some nice properties:

  • Cargo finally supports the --offline flag, and there's also -Z no-index-update, so it is possible to edit deps while offline.
  • The standard index format made it easier to have cargo-lts that can "rewind" the index to an older version and remove incompatible crates.
  • Having full history enables tools like cargo-tally.
  • The index is very helpful for https://lib.rs. The parts that have to use crates-io HTTP APIs are much slower, since crates-io wouldn't like getting 30000 requests per second :)

I'm slightly worried that if Cargo switches to a query API, the unused index will become very low priority for the crates-io team, and will either get shut down outright or limp along unmaintained.

I suspect the crates.io repository could utilize "replacement" refs: when they create a squash of commit-ASDF -> root as QWER, they also create a replacement ref that maps QWER/ASDF.

Then anything that elects to fetch the replacement refs (non-default) will get the full history, and git tools that opt in to viewing replacement refs will see the full history uninterrupted.

I think you would need to cache it so that it works with cargo build --offline, right?

Note that a query API isn't everything: NuGet has one of those, but using Azure DevOps's built-in dotnet restore task still somehow ends up running a 1-minute cache update in most of my builds.

Hmm, I pulled up WinDirStat, and that 70MB is completely irrelevant compared to the overall size of my .cargo folder:

[image: WinDirStat view of the .cargo folder]

Am I in some horrible state that I should be re-installing, or is this normal?

By the way, we started producing daily database dumps of crates.io (only with public data of course), and every use of the API should be covered by it: https://crates.io/data-access

That's wonderful! Thanks for this.