The crates.io registry has grown to the point that a crate is published or updated every minute. The git index grows by thousands of new commits per day.
Individual Cargo clients have moved to the sparse registry protocol, which can cope with this scale (99% of crates.io requests come from Cargo 1.70+, where it's the default). However, other consumers, such as mirrors of the registry, need to track all the updates. So far the git-based protocol could be used for tracking changes, but it's getting more and more painful to use.
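For context, both the git and sparse protocols share the same index file layout, so "tracking changes" means watching these per-crate files. A minimal sketch of the documented path scheme (short names get dedicated prefixes; longer names are bucketed by their first four characters):

```rust
/// Compute the index file path for a crate name, following the layout
/// shared by the git and sparse registry index formats. Assumes a
/// non-empty name, as crates.io enforces.
fn index_path(name: &str) -> String {
    let lower = name.to_lowercase();
    match lower.len() {
        1 => format!("1/{lower}"),
        2 => format!("2/{lower}"),
        3 => format!("3/{}/{}", &lower[..1], lower),
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    }
}

fn main() {
    // e.g. the sparse index serves serde's metadata at
    // https://index.crates.io/se/rd/serde
    assert_eq!(index_path("serde"), "se/rd/serde");
    assert_eq!(index_path("log"), "3/l/log");
}
```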
The git-based registry protocol is already close to a breaking point. I don't think it will survive in its current form for more than a couple more years. The registry already has to regularly squash the git history of the master branch to keep its size reasonable. Shallow fetch is becoming unusable: it churns through so many git objects that git's automatic garbage collection is starting to get stuck due to "too many unreachable loose objects".
At the same time, the ability to efficiently mirror and cache crates.io is going to become important. The crates.io traffic keeps doubling every year.
Caching of the sparse index is a unique challenge: most index URLs don't change for weeks or months, but when they do change, users would like to see the changes within seconds. The best way to implement that is to keep a long-lived cache and actively purge the URLs that changed. That's what crates.io itself does. Other caches/mirrors need to be able to track registry changes in near real time to implement this too.
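The purge-on-change approach can be sketched as follows. This is illustrative only, with hypothetical types, not how crates.io's CDN actually implements it: cached entries live indefinitely, and a change notification for a crate evicts just that crate's index file.

```rust
use std::collections::HashMap;

/// Minimal sketch of a purge-on-change index cache (hypothetical,
/// not crates.io's actual implementation): entries live indefinitely
/// until a change notification names their path.
struct IndexCache {
    entries: HashMap<String, Vec<u8>>, // index path -> cached response body
}

impl IndexCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Remember an upstream response for a given index path.
    fn store(&mut self, path: &str, body: Vec<u8>) {
        self.entries.insert(path.to_string(), body);
    }

    /// Serve from cache; `None` means the caller must refetch upstream.
    fn get(&self, path: &str) -> Option<&[u8]> {
        self.entries.get(path).map(Vec::as_slice)
    }

    /// Called when the mirror learns a crate changed: drop the stale
    /// entry so the next request goes back to the origin.
    fn purge(&mut self, path: &str) {
        self.entries.remove(path);
    }
}

fn main() {
    let mut cache = IndexCache::new();
    cache.store("se/rd/serde", b"old metadata".to_vec());
    assert!(cache.get("se/rd/serde").is_some());

    // A change event for serde arrives: purge, forcing a refetch.
    cache.purge("se/rd/serde");
    assert!(cache.get("se/rd/serde").is_none());
}
```

The hard part is not the eviction itself but learning *which* paths changed, which is why mirrors need a reliable change feed.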
Currently crates.io has RSS feeds with the latest registry changes. This can work, but it's not ideal:
- The feeds don't guarantee having history beyond the last 1 hour. If a mirror is offline for more than 1 hour, it has no way to catch up and needs to mark its entire cache as stale (e.g. this is a problem for CI runners that are woken up as needed).
- RSS is not the most compact format, and the feeds carry extra information for human consumption, so making them much longer would start wasting a significant amount of bandwidth.
- There's no way to enumerate all crate names and efficiently get additional metadata. Getting publication times for all versions requires lots of API requests (currently about 2 days while respecting the crates.io rate limit, or 6-8 days if also fetching owners/publishers).
- The feeds are not an official part of the registry protocol, so there's no solution for 3rd-party sparse registries.
I've proposed having an incremental changelog in the sparse registry protocol. It wasn't worth doing at the time, but after 6 doublings of the registry's traffic, it may be time to revisit the idea and prepare for the git index to be retired.
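To make the idea concrete, here is one possible shape for such a changelog. The format is entirely hypothetical, nothing like it exists in the protocol today: an append-only log with one event per line, "unix-timestamp crate-name". A mirror stores the timestamp of the last event it processed and, after any downtime, replays everything newer to purge exactly the stale entries:

```rust
/// Parse a hypothetical "unix-timestamp crate-name" changelog and
/// return the events newer than the mirror's last sync point.
/// Malformed lines are skipped.
fn changed_since(log: &str, last_seen: u64) -> Vec<(u64, String)> {
    log.lines()
        .filter_map(|line| {
            let (ts, name) = line.split_once(' ')?;
            let ts: u64 = ts.parse().ok()?;
            (ts > last_seen).then(|| (ts, name.to_string()))
        })
        .collect()
}

fn main() {
    let log = "1700000000 serde\n1700000060 rand\n1700000120 serde";
    // The mirror last synced at 1700000060, so only the later event is new.
    let new = changed_since(log, 1700000060);
    assert_eq!(new, vec![(1700000120, "serde".to_string())]);
}
```

Because the log is append-only, a mirror could fetch just its tail with an HTTP range request, which keeps catching up cheap even after long offline periods.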