Re-squash the crates.io repository

In September 2018 we squashed the crates.io history into a single commit (Cargo's crate index: upcoming squash into one commit).

Are there any plans to re-squash the index? It's at nearly 80k commits now, and I'm realizing how slow a fresh cargo build is when you're in a place with poor internet. Currently at 25% after 20 minutes.

8 Likes

Also, is this a tenable implementation? In a hypothetical future where crates.io is as big as npm, would we need to squash every week?

2 Likes

I think the root cause is that libgit2 doesn't support shallow clones. So... here we are~

1 Like

I'm surprised that rustup doesn't bundle the index with the installation. I presume the Rust package could contain a well-compressed .git dir that would be smaller (LZMA on top of git packs) and download faster (one file instead of the git protocol).

1 Like

I've been meaning to test an alternative: it's possible to push and pull refs referring to git objects other than commits (e.g. in this repo I stored a blob object directly). So instead of pulling a ref referring to the entire history of the index, cargo could pull a ref pointing directly at the tree of the current head. This would allow keeping the entire history in the repo without squashing, while only pulling the objects necessary to reconstruct the current index, instead of the objects required to look at any point in that history.

EDIT: The main thing that needs testing is whether GitHub supports partial updates when you only have refs pointing at trees. It's been a while since I looked at the protocol, but I think it operated at the commit level, so it might not be capable of ignoring the objects known to both sides if you're just requesting a tree. A probably simpler alternative would be to keep a ref pointing at a separate always-squashed commit, which should be trivial for crates.io to generate and update on each change.
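A rough sketch of the tree-ref idea using plain git, with a local bare repo standing in for GitHub and made-up file contents (whether the real server thins such a fetch is exactly the open question above):

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"

# A bare repo standing in for the hosted index.
git init -q --bare remote.git

# An index repo with some history.
git init -q index && cd index
echo '{"name":"foo","vers":"1.0.0"}' > foo
git add foo
git -c user.name=x -c user.email=x@x commit -qm 'add foo'

# Push a ref that points at the *tree* of HEAD, not at the commit.
tree=$(git rev-parse 'HEAD^{tree}')
git push -q ../remote.git "$tree:refs/snapshots/latest"

# A consumer can fetch just that ref and read the tree directly,
# without ever downloading a commit object.
cd .. && git init -q consumer && cd consumer
git fetch -q ../remote.git refs/snapshots/latest
git cat-file -t FETCH_HEAD   # -> tree
```

The part this sketch can't answer is the pack negotiation: a local fetch will always transfer everything, so testing incremental updates of tree-tipped refs needs a run against GitHub itself.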

3 Likes

The intent is to automate the squashing process with a cron job, and we are just waiting for someone on the crates.io team to set it up. More details can be found at https://github.com/rust-lang/crates-io-cargo-teams/issues/47.

From our current projections, it shouldn't be a serious issue for a long time.

AFAIK, this is not an option due to the way GitHub's servers work (though that is fairly old information).

3 Likes

GitHub's servers handle shallow clones fine, and the official git implementation can deal with them. The libgit2 library still doesn't support them, though. See:

1 Like

It is capable of doing shallow clones, but the concern is performance. CocoaPods had to stop using it for that reason as it was causing CPU problems on GitHub's end. That was 2016, though, so I'm not sure if that's still the case.

4 Likes

That seems like a promising approach. git commit-tree can create such a squashed branch in a fraction of a second (it doesn't need to touch any content).
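For illustration, a sketch of deriving such an always-squashed ref from an existing history with git commit-tree, in a throwaway repo with made-up contents:

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q
export GIT_AUTHOR_NAME=x GIT_AUTHOR_EMAIL=x@x \
       GIT_COMMITTER_NAME=x GIT_COMMITTER_EMAIL=x@x

# Build a little history.
for i in 1 2 3; do echo "$i" > f; git add f; git commit -qm "commit $i"; done

# Create a single parentless commit reusing HEAD's tree --
# no file content is read or rewritten, so this is nearly instant.
squashed=$(git commit-tree -m 'squashed snapshot' 'HEAD^{tree}')
git update-ref refs/heads/squashed "$squashed"

git rev-list --count squashed    # -> 1
git diff --stat squashed HEAD    # -> no output: identical trees
```

Updating the ref on every index change would then just be a repeat of the last two commands.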

Is squashing on every commit desirable? Wouldn't we lose the ability to download small incremental updates?

In the long run, I think it would be best to move the index off of Git entirely.

Even with a single commit's worth of history, a Git checkout has to store all files twice (once in .git, once as the working copy), roughly doubling disk space requirements, for no benefit in this use case. @Nemo157's idea of pulling down objects in a nonstandard way might be able to avoid that while still staying on Git... but even a single copy of the index is 70MB and growing. And that's basically a waste of both disk space and bandwidth.

Instead, why not have a server with an API that takes a list of crates as input, and returns index data for only those crates and their dependencies? It’s not like there’s any speed benefit in caching this data offline, if cargo update has to sync the index every time you run it.

2 Likes

Is that fundamental? I thought you could do a bare checkout that doesn't have a working copy and still read blobs out.

2 Likes

Cargo has used a bare repo without a working copy for a while now:

7 Likes
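For anyone wanting to poke at this: a bare clone keeps only the object database, with no checked-out files, and blobs remain readable from it directly. A minimal sketch with made-up repo contents:

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"

# A source repo with one tracked file.
git init -q src && cd src
echo '{"name":"foo","vers":"1.0.0"}' > foo
git add foo
git -c user.name=x -c user.email=x@x commit -qm init
cd ..

# A bare clone: only the .git contents, no working copy on disk.
git clone -q --bare src bare.git
ls bare.git                              # HEAD, objects/, refs/, ...

# Blobs are still readable straight out of the object database.
git -C bare.git cat-file blob HEAD:foo   # -> {"name":"foo","vers":"1.0.0"}
```

So the "store everything twice" cost applies only to a normal clone; a bare repo pays for one copy (the packfiles).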

It needs testing, but I think that with a long-running branch that each from-scratch commit is merged into, GitHub will still be able to run the algorithm that determines which objects are shared, and only send a diff. (The issue may again be that this is only checked at the commit level, but the protocol used for fetching is fundamentally capable of this.)

As the index grows it seems inevitable that it will get replaced eventually with some kind of query API, but the current solution has some nice properties:

  • Cargo finally supports an --offline flag, and there's also -Z no-index-update, so it is possible to edit deps while offline.
  • The standard index format made it easier to build cargo-lts, which can "rewind" the index to an older version and remove incompatible crates.
  • Having the full history enables tools like cargo-tally.
  • The index is very helpful for https://lib.rs. The parts that have to use the crates.io HTTP APIs are much slower, since crates.io wouldn't like getting 30,000 requests per second 🙂
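As an aside on that "standard index format": the on-disk layout is simple enough to compute by hand. Files are sharded by crate-name length and prefix; this sketch ignores the index's case-folding rules:

```shell
# Compute the index path for a crate name:
#   1-char names under 1/, 2-char under 2/,
#   3-char under 3/<first char>/, everything else under <ab>/<cd>/.
index_path() {
  name=$1
  case ${#name} in
    1) echo "1/$name" ;;
    2) echo "2/$name" ;;
    3) echo "3/$(printf '%.1s' "$name")/$name" ;;
    *) echo "$(printf '%.2s' "$name")/$(echo "$name" | cut -c3-4)/$name" ;;
  esac
}

index_path a       # -> 1/a
index_path foo     # -> 3/f/foo
index_path serde   # -> se/rd/serde
```

Each such file holds one JSON line per published version, which is what makes the format easy for third-party tools to consume.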

I'm slightly worried that if Cargo switches to a query API, the unused index will become very low priority for the crates-io team, and will either get shut down outright, or limp unmaintained.

4 Likes

I suspect the crates.io repository could utilize "replacement" refs, wherein, when they squash commit ASDF into a new root QWER, they also create a replacement ref mapping QWER → ASDF.

Then anything that elects to fetch the replacement refs (non-default) will get the full history, and git tools that opt in to viewing replacement refs will see the full history uninterrupted.

I think you would need to cache it so that it works with cargo build --offline, right?

Note that a query API isn't everything -- NuGet has one of those, but using Azure DevOps's built-in dotnet restore task still somehow ends up running a 1-minute cache update in most of my builds.

Hmm, I pulled up WinDirStat, and that 70MB is completely irrelevant compared to the overall size of my .cargo folder:

(screenshot: WinDirStat breakdown of the .cargo folder)

Am I in some horrible state that I should be re-installing, or is this normal?

2 Likes

By the way, we started producing daily database dumps of crates.io (only with public data of course), and every use of the API should be covered by it: https://crates.io/data-access

5 Likes

That's wonderful! Thanks for this.