Extending the sparse registry protocol for efficient caching/mirroring

The crates.io registry has grown to the point that a crate is published or updated every minute. The git index grows by thousands of new commits per day.

Individual Cargo clients have moved to the sparse registry protocol, which can cope with this scale (99% of crates.io requests come from Cargo 1.70+, where it's the default :partying_face:). However, other consumers, such as mirrors of the registry, need to track all the updates. So far the git-based protocol could be used for tracking changes, but it's getting more and more painful to use.

The git-based registry protocol is already close to a breaking point. I don't think it will survive in its current form for more than a couple more years. The registry already has to regularly squash the git history of the master branch to keep its size reasonable. Shallow fetch is becoming unusable: it churns through so many git objects that git's automatic garbage collection is starting to get stuck due to "too many unreachable loose objects".

At the same time, the ability to effectively mirror and cache crates.io is going to become important. crates.io traffic keeps doubling every year.

Caching of the sparse index is a unique challenge: most index URLs don't change for weeks or months, but when they do change, users would like to see the changes within seconds. The best way to implement that is to keep a long-lived cache and actively purge updated URLs. That's what crates.io itself does. Other caches/mirrors need to be able to track registry changes in near real time to implement this too.
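To illustrate the idea, here's a minimal sketch of the caching side, assuming the mirror receives a list of recently changed crate names from some change feed (the types and names below are hypothetical, not an existing API):

```rust
use std::collections::HashMap;

/// Long-lived cache of sparse-index files, keyed by crate name.
struct IndexCache {
    files: HashMap<String, Vec<u8>>,
}

impl IndexCache {
    /// Purge only the entries a change feed reports as updated;
    /// everything else stays cached until it actually changes.
    fn purge_updated(&mut self, changed_crates: &[String]) {
        for name in changed_crates {
            self.files.remove(name);
        }
    }
}
```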

Currently crates.io has RSS feeds with the latest registry changes. This can work, but it's not ideal:

  • The feeds don't guarantee history beyond the last hour. If a mirror is offline for more than an hour, it has no way to catch up and needs to mark all of its cache as stale (this is a problem for CI runners that are woken up on demand, for example).
  • RSS is not the most compact format, and the feeds carry extra information for human consumption, so making them much longer would start wasting a significant amount of bandwidth.
  • There's no way to enumerate all crate names and efficiently get additional metadata. Getting publication times for all versions requires lots of API requests (currently about 2 days when respecting the crates.io rate limit, 6-8 days if also fetching owners/publishers).
  • The feeds are not an official part of the registry protocol, so there's no solution for 3rd-party sparse registries.

I've proposed having an incremental changelog in the sparse registry protocol. It wasn't worth doing at the time, but after six doublings of the registry's traffic, it may be time to revisit it and prepare for the git index to be retired.
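To make that concrete, here's one possible shape such a changelog could take: an append-only, newline-delimited JSON log that a mirror reads incrementally, e.g. by requesting only the bytes past its last known offset with an HTTP `Range` request. This is purely a sketch under those assumptions; the field names and endpoint are illustrative, not the format from the original proposal.

```rust
use serde::Deserialize;

/// One line of a hypothetical append-only changelog, e.g.
/// {"epoch":12345,"name":"serde","timestamp":"2024-01-01T00:00:00Z"}
#[derive(Deserialize)]
struct ChangelogEntry {
    /// Monotonically increasing position, so a mirror can resume
    /// from wherever it left off.
    epoch: u64,
    /// Name of the crate whose index file changed.
    name: String,
    /// RFC 3339 timestamp of the change.
    timestamp: String,
}

/// Parse a chunk of the log (each line is one JSON object), such as
/// the tail fetched with an HTTP `Range` request.
fn parse_changelog(chunk: &str) -> Vec<ChangelogEntry> {
    chunk
        .lines()
        .filter_map(|line| serde_json::from_str(line).ok())
        .collect()
}
```

A mirror would remember the highest `epoch` it has processed, fetch only the newer entries on each poll, and purge exactly those index URLs from its cache.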


Thank you for the summary of the problem! I agree with your analysis. There has been some progress since your original changelog proposal. Most of the discussion of this situation has been happening as part of the TUF/registry-signatures work, mostly under the name "TUF-Snapshots" (Public view of rust-lang | Zulip team chat). However, that work has a lot of interconnected stakeholders. It would be reasonable for the Rust community to decouple the efforts, build something bespoke for the sparse registry protocol, and figure out TUF integration after the fact.

I'm honestly not sure whether the "changelog" should include any security features or not, so I haven't mentioned that aspect in the post.

Combining it with a security layer would complicate the feature and likely delay its rollout. OTOH, I don't want to suggest pushing that problem off indefinitely, nor adding something half-baked that would get in the way of proper registry signing.