Are there any plans to re-squash the index? It's at nearly 80k commits now, and I'm realizing how slow a fresh cargo build is when you're in a place with poor internet. Currently at 25% after 20 minutes.
I'm surprised that rustup doesn't bundle the index with the installation. I presume the Rust package could contain a well-compressed .git dir that would be smaller (lzma on top of git packs) and download faster (one file instead of the git protocol).
I've been meaning to test an alternative: it's possible to push and pull refs referring to git objects other than commits (e.g. in this repo I stored a blob object directly). So instead of pulling a ref referring to the entire history of the index, cargo could pull a ref pointing directly at the tree of the current head. This would allow keeping the entire history in the repo without squashing, while only pulling the objects necessary to get the current index, instead of the objects required to look at any point in that history.
EDIT: The main thing that needs testing is whether GitHub supports partial updates when you only have refs pointing at trees. It's been a while since I looked at the protocol, but I think it operates at the commit level, so it might not be capable of ignoring the objects known to both sides if you're just requesting a tree. A probably simpler alternative would be to keep a ref pointing at a separate, always-squashed commit, which should be trivial for crates.io to generate and update on each change.
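To make the mechanics concrete, roughly what I have in mind (untested; the ref name refs/snapshots/index is made up, and whether GitHub's receive-pack accepts refs pointing at trees is exactly the open question above):

```sh
# Publisher: point a ref directly at the current tree, skipping the commit.
git push origin "$(git rev-parse 'HEAD^{tree}')":refs/snapshots/index

# Consumer: fetch only that ref and materialize it without any history.
git fetch origin refs/snapshots/index
git read-tree FETCH_HEAD
git checkout-index -a
```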
It is capable of doing shallow clones, but the concern is performance: CocoaPods had to stop using them for that reason, since they were causing CPU problems on GitHub's end. That was in 2016, though, so I'm not sure whether that's still the case.
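For reference, the kind of clone in question is simply:

```sh
# A depth-1 clone downloads only the objects reachable from the current tip;
# computing that cut server-side is what caused CocoaPods' CPU problems.
git clone --depth 1 https://github.com/rust-lang/crates.io-index.git
```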
In the long run, I think it would be best to move the index off of Git entirely.
Even with a single commit's worth of history, a Git checkout has to store all files twice (once in .git, once as the working copy), roughly doubling disk space requirements, for no benefit in this use case. @Nemo157's idea of pulling down objects in a nonstandard way might be able to avoid that while still staying on Git... but even a single copy of the index is 70MB and growing. And that's basically a waste of both disk space and bandwidth.
Instead, why not have a server with an API that takes a list of crates as input, and returns index data for only those crates and their dependencies? It’s not like there’s any speed benefit in caching this data offline, if cargo update has to sync the index every time you run it.
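Purely as a sketch of what I mean (the host and endpoint are invented; the response fields mirror what the index files already contain):

```sh
# Hypothetical: ask for index data for a set of crates and their dependency
# closure, instead of syncing the whole index.
curl -s 'https://index-api.crates.io/v1/resolve' \
  -H 'Content-Type: application/json' \
  -d '{"crates": ["serde", "rand"]}'
# -> index entries (name, vers, deps, cksum, yanked, features) for just the
#    requested crates and their transitive dependencies
```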
It needs testing, but I think that with a long-running branch that each from-scratch commit is merged into, GitHub will still be able to run the algorithm to determine which objects are shared and only send a diff. (The issue may again be that this is only checked at the commit level, but fundamentally the protocol used for fetching is capable of this.)
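Roughly, the server-side update step could look like this (hypothetical branch name, plain git plumbing):

```sh
tree=$(git rev-parse 'master^{tree}')

# A from-scratch commit: the full current index, with no parents.
squash=$(git commit-tree "$tree" -m "index snapshot")

# Merge it into a long-running branch; the merge links successive snapshots
# together, so the fetch negotiation can find objects both sides already
# have and send only the difference.
prev=$(git rev-parse refs/heads/snapshots)
merge=$(git commit-tree "$tree" -p "$prev" -p "$squash" -m "merge snapshot")
git update-ref refs/heads/snapshots "$merge"
```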
As the index grows it seems inevitable that it will eventually be replaced with some kind of query API, but the current solution has some nice properties:

- Cargo finally supports the --offline flag, and there's also -Z no-index-update, so it is possible to edit deps while offline (see the sketch after this list).
- The standard index format made it easier to build cargo-lts, which can "rewind" the index to an older version and remove incompatible crates.
- Having the full history enables tools like cargo-tally.
- The index is very helpful for https://lib.rs. The parts that have to use the crates.io HTTP APIs are much slower, since crates.io wouldn't like getting 30,000 requests per second.
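For example (assuming a lockfile and a cached copy of the index already exist from a previous online run):

```sh
# Resolve and build using only the local copy of the index, no network:
cargo build --offline

# Or, on nightly, skip just the index update step:
cargo build -Z no-index-update
```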
I'm slightly worried that if Cargo switches to a query API, the unused index will become very low priority for the crates.io team, and will either get shut down outright or limp along unmaintained.
I suspect the crates.io repository could utilize "replacement" refs, wherein, when they squash history at commit ASDF into a new root QWER, they also create a replacement ref that maps QWER to ASDF.
Then anything that elects to fetch the replacement refs (which is non-default) will get the full history, and git tools that opt in to viewing replacement refs will see the full history uninterrupted.
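A sketch of what that could look like, with QWER and ASDF as placeholder hashes for the squashed root and the old head, as above:

```sh
# Server side, after squashing: declare that ASDF stands in for QWER.
# This creates refs/replace/QWER pointing at ASDF.
git replace QWER ASDF
git push origin 'refs/replace/*:refs/replace/*'

# Client side: fetching replacement refs is opt-in...
git fetch origin 'refs/replace/*:refs/replace/*'
# ...but once present they are honored by default, so this shows the
# uninterrupted history:
git log
```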
Note that a query API isn't everything -- NuGet has one of those, but using Azure DevOps's built-in dotnet restore task still somehow ends up running a 1-minute cache update in most of my builds.
Hmm, I pulled up WinDirStat, and that 70MB is completely irrelevant compared to the overall size of my .cargo folder:
Am I in some horrible state that I should be re-installing, or is this normal?
By the way, we started producing daily database dumps of crates.io (only with public data, of course), and every use of the API should be covered by them: https://crates.io/data-access
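For example, instead of crawling the API (the dump URL is listed on that page):

```sh
# Fetch the latest dump and unpack it; it contains CSV exports of the
# public database tables plus a script for importing them into PostgreSQL.
curl -LO https://static.crates.io/db-dump.tar.gz
tar -xzf db-dump.tar.gz
```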