There's a plan to make crates-io purge CDN cache when a crate is published. Alternatively, Cargo could be taught to request freshly-published crates with a cache-buster that bypasses CDN cache.
Accurate cache invalidation on the server side is obviously ideal, since then a dumb client (e.g. curl+sh) works.
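As a toy illustration of the cache-buster idea (nothing Cargo does today; the URL layout and parameter name below are made up), the client could append a throwaway query parameter so a CDN that keys its cache on the full URL serves a fresh copy:

```python
# Toy sketch of a client-side cache-buster, assuming the CDN includes the query
# string in its cache key. The index URL layout and the parameter name are
# invented for illustration only.
import time
import urllib.request

index_url = "https://index.example.org/se/rd/serde"        # hypothetical sparse-index entry
busted_url = f"{index_url}?cache-bust={int(time.time())}"  # hypothetical parameter

with urllib.request.urlopen(busted_url) as resp:
    entry = resp.read()  # freshly fetched entry, bypassing a stale cached copy
```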
I think there's a surprising amount of room for client cleverness, though, for client-side cache invalidation (which could ideally be used both for a local cache and to cache-bust the server).
A client could theoretically attempt resolution using the aggressive cache and only cache-bust for resolution failures in the aggressive cache. This would lead to the use of potentially outdated dependencies higher in the tree, but cannot[citation needed] result in any overall resolution failures[1], since dependency edges are acyclic[2].
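As a rough sketch of that idea (pseudocode-ish Python, not how Cargo's resolver actually works; the resolver and fetch callbacks are stand-ins supplied by the caller):

```python
# Rough sketch of "resolve from the aggressive local cache, and cache-bust only
# when resolution fails". resolve, load_cached_entry, and refetch_entry are
# caller-supplied stand-ins, not real Cargo APIs.
def resolve_with_lazy_cache_busting(root_deps, load_cached_entry, refetch_entry, resolve):
    entries = {dep: load_cached_entry(dep) for dep in root_deps}
    refreshed = set()
    while True:
        result = resolve(root_deps, entries)      # try with whatever is cached
        if result.ok:
            return result.lockfile                # possibly slightly stale, but consistent
        stale = [d for d in result.failed_deps if d not in refreshed]
        if not stale:
            return result                         # refreshing didn't help; give up
        for dep in stale:                         # refresh only the crates implicated
            entries[dep] = refetch_entry(dep)     # in the failure (a cache-busting fetch)
            refreshed.add(dep)
```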
However, this also seems like effectively just a worse version of applying the existing proposed scheme for an incremental index along with eager lookahead network requests that assume a no-(relevant)-change response. If network speed is low enough that getting a “no change” response for dependencies impacts the resolution time, you'll probably prefer using `--locked` anyway. (Disclaimer: I am not a web developer.)
Conclusion: caching is interesting and as fractally complex as you want it to be.
Optional dependencies allow for loops in normal dep edges too; try this set of crates and features:
> cargo add clap@2 textwrap@0.11 num-traits@0.2.11 libm@0.2 rand@0.6 packed_simd@0.3 --features 'textwrap@0.11/hyphenation num-traits@0.2.11/libm libm@0.2/rand rand@0.6/packed_simd packed_simd@0.3/sleef-sys'
(I ran into so much fun around these sorts of issues while trying to publish crates through IPFS, which requires a true DAG. I want to try using sparse-registry with this too, but my old script uses deprecated ipfs functionality, so I need to spend some time redoing it.)
Seems to be working fine in our CI environment!
Cut the build time by about a minute on average; seems like a great improvement, and I haven't found any issues under our conditions.
With the benefit of more runs, it looks to be saving about a minute per run -- so consistent with what other folks are seeing. I suspect the win would be even greater on Windows, which has worse filesystem performance.
One thing I noticed is that the fetching needs to go through several rounds now. Previously it just fetched until 100% and was done, but now when it reaches 100% it has to fetch another batch, and the total number of items to fetch keeps increasing (1 -> 10 -> 50 -> 100 -> 130), so the progress appears to go backwards, which feels weird.
I feel like the total should be hidden until there is a definite answer for how many items need to be fetched, and only shown then, rather than shown while it keeps increasing.
A way to improve that in the registry would be to provide, on publish, an optional `expected_dep_forest_size` derived from the number of transitive dependencies used at publish time. This number could be used to provide a reasonable progress estimate quickly.
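For example (a hypothetical sketch; `expected_dep_forest_size` is the proposed field, not something the index has today), the progress total could be seeded like this:

```python
# Hypothetical sketch: seed the progress bar's total from the proposed
# expected_dep_forest_size field (which does NOT exist in the index today),
# falling back to the count of dependencies discovered so far.
def progress_total(root_index_entry: dict, discovered_so_far: int) -> int:
    expected = root_index_entry.get("expected_dep_forest_size")  # proposed optional field
    return max(expected or 0, discovered_so_far)
```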
Well, it works!
I used the skyspell project for a somewhat realistic example.
`cargo +nightly -Z sparse-registry update` took 0.245s total
`cargo +nightly update` took 4.924s total
And my crates.io git index was pretty fresh; otherwise, the speed-up might have been even greater.
Nice
Thank you for this UX feedback! Would you open a ticket on the Cargo repo for us to track this?
Just tried this for the first time on a host where updating the traditional index was going really slow, e.g. `(34137/77894) resolving deltas`.
Wow! What a difference! What was previously taking 10+ minutes was done in less than 10 seconds.
Great work!
Are there any implementations (programs) that I can use with this option to host an offline copy of crates.io without using git?
Any webserver will do; `python -m http.server` can work for testing. In your clone of the index, you'll need to edit the `config.json` file to point to the URL of your local copy of `.crate` files.
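For example (host and port are placeholders, and this assumes the `.crate` files are laid out as `crates/{crate}/{crate}-{version}.crate`; the `{crate}`/`{version}` markers are the standard ones from the registry index format's `dl` template), the edited `config.json` could look something like:

```json
{
  "dl": "http://localhost:8000/crates/{crate}/{crate}-{version}.crate"
}
```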
If you're asking how to download all of the `.crate` files for mirroring, there are various tools for that; some that I have in my notes (though they may not be appropriate for this use case or may be outdated):
- GitHub - ChrisMacNaughton/cargo-cacher: A caching server for crates + cargo
- GitHub - C4K3/crates-ectype: Easily create a mirror of crates.io (crate downloads only, not the website)
- GitHub - weiznich/crates-mirror
- GitHub - panamax-rs/panamax: Mirror rustup and crates.io repositories, for offline Rust and cargo usage.

If you want to write your own tool, it is fairly trivial to walk the index and fetch every `https://static.crates.io/crates/{crate}/{crate}-{version}.crate`. Just beware that it is currently over 41 GB.
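If it helps, here's a rough sketch of such a walk (assuming a local clone of the index at `./crates.io-index`; no retries, rate limiting, or resume support, so treat it as an illustration rather than a finished tool):

```python
# Rough sketch: walk a local clone of the crates.io index and download every
# .crate file. No retries, rate limiting, or integrity checks -- illustration only.
import json
import pathlib
import urllib.request

INDEX = pathlib.Path("crates.io-index")   # local clone of the index
OUT = pathlib.Path("crates")              # where .crate files get written

for path in INDEX.rglob("*"):
    # Skip directories, the config file, and git metadata.
    if not path.is_file() or path.name == "config.json" or ".git" in path.parts:
        continue
    for line in path.read_text().splitlines():
        entry = json.loads(line)          # one JSON object per published version
        name, vers = entry["name"], entry["vers"]
        url = f"https://static.crates.io/crates/{name}/{name}-{vers}.crate"
        dest = OUT / name / f"{name}-{vers}.crate"
        if dest.exists():
            continue                      # already mirrored
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
```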
Hello everyone! Databend is a cloud data warehouse written in Rust.
We started a pull request, ci: Enable cargo sparse-registry, in response to this call for testing.
Here is our feedback:
- sparse-registry works (no build failures, no test failures, everything works)
- crate index update & download is much faster
  - Before this PR: we needed 75s (14:11:34 ~ 14:12:49) to update the index and download crates
  - With this PR: we needed 21s (13:15:30 ~ 13:15:51) to update the index and download crates
- `cargo-audit` seems not to work well with sparse-registry
Detailed CI Logs
Is there an issue open on cargo-audit for this? If not, opening one with a reasonably small reproduction (in number of packages) would be beneficial (but even a large one is still useful, so others can minimize).
Submitted as `cargo audit` doesn't work well with `sparse-registry` · Issue #604 · rustsec/rustsec · GitHub
I wrote a small utility to download only the crates needed by a project by parsing the output of `cargo vendor`. I was hoping that besides my created `crates` folder I could also have a `crates.io-index` folder, ending up with something like this:
$ ls offline-mirror/
crates/ crates.io-index/
I start up a simple Python server from within `offline-mirror`:
$ sudo python3 -m http.server 80
and then run `cargo +nightly build` from the same project with the following config:
$ cat .cargo/config
[unstable]
sparse-registry = true
[source.my-mirror]
registry = "http://192.168.42.64/crates.io-index"
[source.crates-io]
replace-with = "my-mirror"
However... it looks as if it is requesting files not included in the git clone?
192.168.42.64 - - [26/Jul/2022 02:22:25] code 404, message File not found
192.168.42.64 - - [26/Jul/2022 02:22:25] "GET /crates.io-index/info/refs?service=git-upload-pack HTTP/1.1" 404 -
192.168.42.64 - - [26/Jul/2022 02:22:25] code 404, message File not found
192.168.42.64 - - [26/Jul/2022 02:22:25] "GET /crates.io-index/info/refs?service=git-upload-pack HTTP/1.1" 404 -
192.168.42.64 - - [26/Jul/2022 02:22:25] code 404, message File not found
192.168.42.64 - - [26/Jul/2022 02:22:25] "GET /crates.io-index/info/refs?service=git-upload-pack HTTP/1.1" 404 -
Any tips, or an RFC describing the HTTP protocol, would be great. Thanks.
You're missing `sparse+` in the registry URL; that's what tells Cargo to use the new protocol. Otherwise it attempts to fetch it as git over HTTP:
registry = "sparse+http://192.168.42.64/crates.io-index"
The RFC is 2789, but it's a bit light on details.
Trying this with Miri, where I often do `./miri install --offline` to avoid losing time to the database updates.
Results of `hyperfine -w1 "./miri install"` with the default git index:
Benchmark 1: ./miri install
Time (mean ± σ): 718.3 ms ± 363.2 ms [User: 208.8 ms, System: 98.1 ms]
Range (min … max): 531.1 ms … 1657.5 ms 10 runs
With the sparse index:
Benchmark 1: ./miri install
Time (mean ± σ): 1.340 s ± 0.666 s [User: 0.258 s, System: 0.129 s]
Range (min … max): 0.646 s … 2.660 s 10 runs
So on average the sparse index is a lot slower, probably because I am currently on a fairly poor internet connection and it makes a lot more queries.