Crates.io bulk downloads

Would it be possible to provide dump files containing all crates on crates.io?

This would be very helpful for:

  • Scientists wanting to analyze the data
  • Future Mars colonists. Their ping times will definitely suck!
  • Archivists
  • Digital preppers preparing for the next apocalypse where hacked thermostats DDoS Level 3 or something
  • Generally anyone who doesn’t want to rely on the internet for everything, e.g. people who want to be able to use Rust while completely offline on planes, trains, etc.
  • People in remote places or behind restricted connections, like behind the Great Firewall of China, or on the Falkland Islands (the Falklands only have satellite internet, can you imagine that?). They obviously can’t download a dump themselves, but someone else can download it once and then use it to set up a mirror.

Other projects like Wikipedia offer similar dumps already.

I’ve recently used crates-mirror to get 7997 of the 8089 crates currently on crates.io. But this method is not perfect. You have to download each crate separately, which adds non-trivial overhead from HTTP/TLS negotiation. Also, obviously, it might run into trouble with anti-abuse systems.
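For anyone curious what the per-crate approach boils down to, here is a rough sketch in Rust (std only). It assumes the index directory layout described in the Cargo registry documentation and the `/api/v1/crates/{name}/{version}/download` endpoint. The function names `index_path` and `download_url` are made up for illustration, and crates-mirror may well do this differently.

```rust
/// Relative path of a crate's metadata file inside a checkout of the index.
/// Assumes ASCII crate names; returns None for empty or non-ASCII input.
fn index_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    // Index file names are lowercase.
    let lower = name.to_lowercase();
    let path = match lower.len() {
        1 => format!("1/{}", lower),
        2 => format!("2/{}", lower),
        3 => format!("3/{}/{}", &lower[..1], lower),
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    };
    Some(path)
}

/// Download URL for one specific version of a crate (assumed endpoint).
fn download_url(name: &str, version: &str) -> String {
    format!(
        "https://crates.io/api/v1/crates/{}/{}/download",
        name, version
    )
}
```

Every version of every crate means one such URL and one request, which is where the overhead comes from.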

My download of crates.io and the directory with the .crate files currently takes 6 GB of storage space. So it’s definitely of manageable size. In comparison, Wikipedia dumps take tens of gigabytes. And this is before very easy optimisations are applied, like only storing the diffs between .crate files.

Implementation-wise, it could be e.g. a weekly job that creates combined .tar.gz/zip files and uploads them to static.rlo or somewhere else. As crates.io is addition-only, it’s okay to delete old dumps.
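The "delete old dumps" part of such a job could look roughly like this in Rust. This is only a sketch: the `crates-YYYY-MM-DD.tar.gz` naming scheme and the `dumps_to_delete` helper are hypothetical, chosen because ISO dates sort chronologically as plain strings.

```rust
/// Given the file names in the dump directory, return the dump files that can
/// be deleted while keeping the `keep` newest ones.
/// Assumes a hypothetical naming scheme of `crates-YYYY-MM-DD.tar.gz`.
fn dumps_to_delete(names: &[&str], keep: usize) -> Vec<String> {
    let mut dumps: Vec<&str> = names
        .iter()
        .copied()
        // Ignore anything that does not look like a dump file.
        .filter(|n| n.starts_with("crates-") && n.ends_with(".tar.gz"))
        .collect();
    // Sort newest first; string order equals date order for ISO dates.
    dumps.sort_unstable_by(|a, b| b.cmp(a));
    dumps.into_iter().skip(keep).map(String::from).collect()
}
```

Since every newer dump contains everything the older ones did, keeping just the latest one or two should be enough.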


FWIW downloading each one individually should be fine, we’re not gonna start blocking your IP or anything like that :slight_smile:


What about crates.io metadata that isn’t part of the crates.io-index repository? AFAIK there’s no dataset for owners and download stats. Is it OK to scrape that data?

We will be publishing an official crawler policy soon. We’re also looking to start producing dumps of our dataset at some point in the future.


It would be interesting research to see if crates.io + GitHub had enough data to train an idiomatic Rust API snippet creator like this one for Java: http://www.askbayou.com/

You can get all of the code already. There’s a crate for downloading all code from crates.io: https://github.com/weiznich/crates-mirror or https://github.com/C4K3/crates-ectype

For analysis I suggest modifying it to download only the latest version of each crate rather than all versions.
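Picking the latest version could be sketched like this in Rust (std only). It is deliberately simplified: it only understands plain `major.minor.patch` strings and skips everything else, so pre-releases are ignored. The helper names are made up for illustration. A real tool would more likely use a proper semver parser and also check the `yanked` flag from the index.

```rust
/// Parse a plain `major.minor.patch` version. Returns None for anything else
/// (pre-release or build-metadata suffixes are deliberately not handled).
fn parse_plain_version(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Return the highest plain version from a list of version strings.
/// Comparison is numeric per component, so "0.10.0" beats "0.9.0".
fn latest_version<'a>(versions: &[&'a str]) -> Option<&'a str> {
    versions
        .iter()
        .copied()
        .filter_map(|v| parse_plain_version(v).map(|key| (key, v)))
        .max_by_key(|(key, _)| *key)
        .map(|(_, v)| v)
}
```

Note that a plain string comparison would get this wrong, since "0.9.0" sorts after "0.10.1" lexicographically.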


Personally I'm interested in download numbers & ownership, as I want to play with crate ranking algorithms and analyze e.g. how much popularity and authors' experience are correlated with various crate quality metrics.
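As a starting point for that kind of analysis, a plain Pearson correlation is easy to write with just std. This is a sketch only: the `pearson` function is hypothetical, and a rank-based measure might suit heavily skewed download counts better.

```rust
/// Pearson correlation coefficient between two equally long samples.
/// Returns None if the lengths differ, there are fewer than two points,
/// or either sample has zero variance.
fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    let mut cov = 0.0;
    let mut var_x = 0.0;
    let mut var_y = 0.0;
    for (x, y) in xs.iter().zip(ys) {
        let dx = x - mean_x;
        let dy = y - mean_y;
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x.sqrt() * var_y.sqrt()))
}
```

Here `xs` could be download counts per crate and `ys` some quality score per crate, in the same order.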


:arrow_forward: EDIT: it's here! https://crates.io/data-access :arrow_backward:

