Crates.io bulk downloads


#1

Would it be possible to provide dump files containing all crates on crates.io?

This would be very helpful for:

  • Scientists wanting to analyze the data
  • Future Mars colonists. Their ping times will definitely suck!
  • Archivists
  • Digital preppers preparing for the next apocalypse where hacked thermostats DDoS Level 3 or something
  • Generally anyone who doesn’t want to rely on the internet for everything, e.g. people who want to be able to use Rust while completely offline on planes, trains, etc.
  • People in remote places, like behind the Great Firewall of China or on the Falkland Islands (the Falklands only have satellite internet, can you imagine that?). They obviously can’t download a dump themselves, but someone can download it once and then use it to set up a mirror.

Other projects like Wikipedia offer similar dumps already.

I’ve recently used crates-mirror to get 7997 of the 8089 crates currently on crates.io. But this method is not perfect. You have to download each crate separately, which adds non-trivial overhead from HTTP/TLS negotiation. Also, obviously, it might run into trouble with anti-abuse systems.
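To illustrate the negotiation overhead: part of it goes away if all requests share one connection instead of opening a new one per crate. This is only a rough sketch, not what crates-mirror does. The function name `fetch_over_one_connection` is made up for illustration, and the host and paths you pass in are up to you (the `dl` field in the index’s config.json should say where the downloads actually live).

```python
import http.client


def fetch_over_one_connection(conn, paths):
    """Request each path over the same connection and return {path: body}.

    `conn` is an http.client.HTTPConnection or HTTPSConnection. Reusing it
    means the TCP (and TLS) handshake happens once rather than once per
    crate, as long as the server keeps the connection alive.
    """
    bodies = {}
    for path in paths:
        conn.request("GET", path)
        response = conn.getresponse()
        # The body has to be read fully before the next request can be sent.
        body = response.read()
        if response.status == 200:
            bodies[path] = body
    return bodies
```

You would create the connection once, e.g. `http.client.HTTPSConnection(host)`, pass it in together with the list of paths, and close it at the end. Anything that doesn’t answer with 200 (redirects included) is simply skipped in this sketch.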

My download of crates.io and the directory with the .crate files currently take 6 GB of storage space. So it’s definitely of manageable size. In comparison, Wikipedia dumps take tens of gigabytes. And this is before very easy optimisations are applied, like only storing the diffs between .crate files.

Implementation-wise it could be e.g. a weekly job that creates combined .tar.gz/zip files and uploads them to static.rlo or somewhere else. As crates.io is addition-only, it’s okay to delete old dumps.
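The packing step of such a job could be as small as the sketch below. It assumes the .crate files are already sitting in a local directory (like the one crates-mirror produces). The name `build_dump` and the directory layout are hypothetical, and scheduling and uploading are left out.

```python
import tarfile
from pathlib import Path


def build_dump(crate_dir, dump_path, mode="w:gz"):
    """Pack every .crate file under crate_dir into a single tar archive.

    Returns the number of files written. The default mode "w:gz" produces
    a .tar.gz; pass "w" for an uncompressed tar.
    """
    crate_dir = Path(crate_dir)
    count = 0
    with tarfile.open(dump_path, mode) as archive:
        # Sort so that the member order is stable between runs.
        for crate in sorted(crate_dir.rglob("*.crate")):
            archive.add(crate, arcname=crate.relative_to(crate_dir).as_posix())
            count += 1
    return count
```

Since .crate files are themselves gzip-compressed tarballs, the outer gzip layer probably saves very little, so a plain tar (`mode="w"`) may be just as good and faster to build.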


#2

FWIW downloading each one individually should be fine, we’re not gonna start blocking your IP or anything like that :)


#3

What about crates.io metadata that isn’t part of the crates.io-index repository? AFAIK there’s no dataset for owners and download stats. Is it OK to scrape that data?


#4

We will be publishing an official crawler policy soon. We’re also looking to start producing dumps of our dataset at some point in the future.


#5

It would be interesting research to see if crates.io + GitHub had enough data to train an idiomatic Rust API snippet creator like this one for Java: http://www.askbayou.com/


#6

You can get all of the code already. There are tools for downloading all code from crates.io: https://github.com/weiznich/crates-mirror or https://github.com/C4K3/crates-ectype

For analysis I suggest modifying it to download only the latest version of each crate rather than all versions.
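Picking the latest version could look roughly like this. It is a sketch and not how either tool works internally. It assumes each crate’s file in the crates.io-index repository has one JSON object per line with `vers` and `yanked` fields, and it treats the last non-yanked line as “latest”, i.e. the most recently published version, which is not necessarily the highest semver. The function name `latest_version` is made up.

```python
import json


def latest_version(index_lines):
    """Return the `vers` of the last non-yanked entry, or None if there is none.

    `index_lines` is an iterable of lines from one crate's index file,
    assumed to be one JSON object per line in publication order.
    """
    latest = None
    for line in index_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Skip yanked releases; a missing field is treated as not yanked.
        if not entry.get("yanked", False):
            latest = entry["vers"]
    return latest
```

You would open each file in a checkout of the index, pass the file object to `latest_version`, and download only that one version.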


#7

Personally I’m interested in download numbers & ownership, as I want to play with crate ranking algorithms and analyze e.g. how much popularity and authors’ experience are correlated with various crate quality metrics.