I analysed 5000 crates to find the most common standard library imports

Recently, there's been some discussion on whether to change the standard prelude for Rust in the 2021 edition. To find out which items might belong there, I analysed 5000 crates from crates.io and found the most common imports from the stdlib.

The Results

The results are on this spreadsheet. The main ones were std::fmt, HashMap, Arc, Path, std::io, std::io::Write, and Duration; all of these except std::io and std::io::Write (excluded because they have async counterparts) could probably be added to the prelude without much problem.

The Method

Each generated file is numbered according to its step. The commands were written and run in zsh on Linux; you may have to adapt them if you want to do this yourself or do something similar.

  1. Get crates index: git clone https://github.com/rust-lang/crates.io-index 1-index
  2. Get all crate JSON data: find 1-index -mindepth 2 -type f -not -path '*/\.*' -exec tail -n1 {} \; > 2-crate-json
  3. Generate all crate URLs: jq -r '"https://static.crates.io/crates/" + .name + "/" + .name + "-" + .vers + ".crate"' 2-crate-json > 3-crate-urls
  4. Download $crates crates:
    • Set $crates to the number of crates you want to download
    • mkdir 4-crate-tars
    • cd 4-crate-tars
    • head -n$crates ../3-crate-urls | sed -e 's/^/-O\n/' | xargs curl
    • Display a progress bar because it can take a while:
until [ "$(ls | wc -l)" -eq $crates ]
do
	echo -en "\r"

	num="$(ls | wc -l)"
	width="$(($(stty size | awk '{ print $2 }') - 20))"

	echo -n "$num/$crates"

	num_s="$(printf '%.0f' $(($num. / $crates. * $width.)))"
	rem_s=$(($width - $num_s))

	echo -n '['
	for i in {0..$num_s}
	do
		echo -n '='
	done
	for i in {0..$rem_s}
	do
		echo -n '_'
	done
	echo -n ']'

	sleep 0.2
done
  5. Extract the crates:
    • cd ..; mkdir 5-crate-srcs; cd 5-crate-srcs
    • for tar in ../4-crate-tars/*; do tar xzf "$tar" --wildcards '*.rs'; done
  6. Find uses:
    • Make a crate called use-finder, with this main.rs:
    use syn::{UseTree, Ident};
    
    fn main() -> Result<(), Box<dyn std::error::Error>> {
    	let file = syn::parse_file(&std::fs::read_to_string(
    		std::env::args().nth(1).unwrap()
    	)?)?;
    
    	// Stack of path segments leading to the current use tree.
    	struct Visitor<'ast>(Vec<&'ast Ident>);
    	impl<'ast> syn::visit::Visit<'ast> for Visitor<'ast> {
    		fn visit_use_tree(&mut self, tree: &'ast UseTree) {
    			match tree {
    				UseTree::Path(syn::UsePath { ident, tree, .. }) => {
    					self.0.push(ident);
    					self.visit_use_tree(tree);
    					self.0.pop();
    				}
    				UseTree::Name(syn::UseName { ident }) => {
    					for ident in &self.0 {
    						print!("{}::", ident);
    					}
    					println!("{}", ident);
    				}
    				// Ignore renames because they won't
    				// be affected by prelude changes
    				UseTree::Rename(_) => (),
    				UseTree::Glob(_) => {
    					for ident in &self.0 {
    						print!("{}::", ident);
    					}
    					println!("*");
    				}
    				UseTree::Group(syn::UseGroup { items, .. }) => {
    					for tree in items {
    						self.visit_use_tree(tree);
    					}
    				}
    			}
    		}
    	}
    
    	syn::visit::visit_file(&mut Visitor(Vec::new()), &file);
    
    	Ok(())
    }
    
    • Place syn = { version = "1.0.17", features = ["full", "visit"] } in the [dependencies] section of your Cargo.toml.
    • Run cargo build --release.
    • Back in the parent directory, run find 5-crate-srcs -type f -exec use-finder/target/release/use-finder {} \; > 6-uses
  7. Select only stdlib uses: sed -ne 's/^\(core\|alloc\|std\)::/std::/p' 6-uses > 7-std-uses
  8. Find most common: sort 7-std-uses | uniq -c | sort -n
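
If you only want the most common imports, reversing the final sort puts them first:

    sort 7-std-uses | uniq -c | sort -rn | head -n20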
23 Likes

Cool! std::vec::Vec is the highest entry I can see among items that are already in the prelude, with 198 counts.

1 Like

Very cool! One question: how did you choose which crates to analyze? By number of downloads, by how well maintained they are, or something else entirely? It would be interesting to see how the analysis changes based on which crates are chosen.

2 Likes

I just used the first 5000 crates from the index, sorted alphabetically. Because the index doesn't contain download numbers, I would have to use the database dumps or the crates.io API to get them, but I really don't imagine it would change much.

1 Like

Makes sense; I just downloaded and grepped through the database dump, but it doesn't have edition as a key, nor does it have any other interesting keys. Too bad; it would have been interesting to see how API usage changes over time.

Vec isn't in the core prelude though, of course. I think it's a common style to make a crate unconditionally #[no_std], and then import std-specific stuff manually behind a feature.
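
The pattern looks roughly like this (a minimal sketch; the type, function, and feature names are invented):

    // lib.rs of a hypothetical crate: no_std by default, with
    // std-only items imported manually behind a "std" feature.
    #![cfg_attr(not(feature = "std"), no_std)]
    
    use core::fmt; // always available, with or without std
    
    pub struct Id(pub u32);
    
    impl fmt::Display for Id {
    	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
    		write!(f, "#{}", self.0)
    	}
    }
    
    // std-specific imports are written out explicitly, gated on the feature.
    #[cfg(feature = "std")]
    use std::collections::HashMap;
    
    #[cfg(feature = "std")]
    pub fn index(ids: &[Id]) -> HashMap<u32, String> {
    	ids.iter().map(|id| (id.0, id.to_string())).collect()
    }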

3 Likes

Very interesting! One thing I can see in your spreadsheet is that use module and use module::self are treated as separate entries. I think it would make sense to consolidate these two, since they are effectively the same.
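
For illustration, both of these bring in the same module, yet the use-finder above records them under different names:

    // Both declarations import std::fmt; the use-finder records the
    // first as "std::fmt" and the second as "std::fmt::self".
    mod a {
    	pub use std::fmt;
    }
    mod b {
    	pub use std::fmt::{self};
    }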

2 Likes

Good point! Is there anything else that should be merged?

Come to think of it, this type of analysis would be useful for figuring out where the team needs to focus resources. Apple did something similar when they moved to the OS X kernel, but focused on the actual APIs: they asked everyone to run a tool that captured which Apple APIs programmers used and shipped the results back to Apple. Apple then knew what they had to get right to keep developers happy.

You can use https://lib.rs/sitemap.xml to get the list of the highest ranked Rust crates.
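
A quick way to pull that list, assuming the sitemap is a standard XML one with a <loc> entry per page:

    curl -s https://lib.rs/sitemap.xml |
    	grep -o '<loc>[^<]*</loc>' |
    	sed -e 's/<\/\?loc>//g' > ranked-crate-urls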

4 Likes

@ckaran Statistically speaking, it's better to pick the crates at random. Only looking at the most popular crates would be biased.
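
For anyone who wants to try that, step 4 of the method only needs its head swapped for a random sample, e.g. with GNU coreutils' shuf:

    shuf -n$crates ../3-crate-urls | sed -e 's/^/-O\n/' | xargs curl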

1 Like

Popularity at least helps filter out some noise, like squatted crates.

It's weird to call it "bias". I'd say it's more representative of the relevant crates that people actually use.

The full set of all crates on crates.io includes spam, hello ferris crates, some pre-1.0 crates, garbage uploaded for testing whether publish works, and lots of abandoned projects. An unbiased sample of all these crates includes this garbage (Sturgeon's law), and is biased against the kinds of crates that people currently maintain and want to use.

7 Likes

Yes, but it's not representative of code that people write. For example, commonly used crates include C bindings. Those are used often, but written rarely.

If we want to create a new prelude, it should be convenient to use for everybody, not just the authors of famous crates.

6 Likes

FYI, all versions of all crates take about 50 GB to download. The latest version of each, uncompressed, comes to 18 GB. I have that checked out locally; getting the crates is surprisingly easy because someone already wrote a tool to do so, complete with incremental sync. You can find the guide here.

5 Likes

I'm with @kornel on this; abandoned projects and other garbage shouldn't be a real factor in this. In fact, I prefer a bias in the sample. I want to know that serde is as fast as possible, which means that if it happens to be using APIs supplied by the standard library, then those need to be optimized. Conversely, if we analyze a bunch of projects from 4 years ago that all used APIs that are now deprecated and discouraged, they should be weighted less.

The real question becomes 'how do we choose our biasing factors?' This addresses @Aloso's comment.

Here are a few factors we can use (others could be added at any time):

  • How well maintained is the project?
  • How widely used is the project (@Aloso's point)?
  • What would the performance impact of improving that portion of the code be (Amdahl's Law)?

We can substitute how old the last release on crates.io is for the first point, use the number of dependent projects (including transitive dependencies), as pulled from the Cargo.toml files of all projects on crates.io, for the second, and work on improving crater so that it can profile code for the third. Once we have a definition of how to measure each of the above, it becomes pretty easy to define a fully automated metric that tells you where you're going to get the most bang for your buck, and that can feed into a priority queue of what tasks need to be done.

For what it's worth, my vote is to map each factor to the range [0.0, 1.0] and multiply them all together. This is the equivalent of calculating the volume of a hyperrectangle, and has the advantage of punishing efforts that improve one axis at the cost of a different one. Once you have the volume, divide it by the estimated effort to improve a given API. This forms the priority number (easy, high-impact stuff first; hard, low-impact stuff last). If there is enough processing power available, crater could be run on nightly for this, but I suspect it would be run on the latest stable instead, across a large swath of the ecosystem for several weeks, to gather data.
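
A minimal sketch of that scoring scheme, assuming three example factors (the names, values, and effort numbers are invented for illustration):

    /// Hypothetical per-API factors, each already mapped into [0.0, 1.0].
    struct Factors {
    	maintenance: f64, // e.g. derived from release recency
    	usage: f64,       // e.g. derived from dependent-crate counts
    	perf_impact: f64, // e.g. derived from crater profiling runs
    }
    
    /// Volume of the hyperrectangle spanned by the factors, divided by
    /// the estimated effort: higher values should be worked on sooner.
    fn priority(f: &Factors, estimated_effort: f64) -> f64 {
    	(f.maintenance * f.usage * f.perf_impact) / estimated_effort
    }
    
    fn main() {
    	let popular = Factors { maintenance: 0.9, usage: 0.95, perf_impact: 0.8 };
    	let toy = Factors { maintenance: 0.2, usage: 0.05, perf_impact: 0.3 };
    	// A weak score on any single axis drags the whole product down.
    	println!("{} vs {}", priority(&popular, 2.0), priority(&toy, 1.0));
    }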

That substitution only works if the crate continues to require maintenance. A heavily used crate that is correct as it stands, and thus needs no maintenance, would receive a low maintenance score that, given your proposal for multiplicative factors, would inherently produce a low-priority rating.

2 Likes

Good point. What would you suggest?

For the record, I'm not tied to any particular set of factors, or even how those factors are mapped to the range [0.0, 1.0]. What I feel strongly about is:

  • Factors should always map to the range [0.0, 1.0].
  • The mapping must be bijective and monotonically increasing (that is, 1.0 is better than 0.0).
  • The mapping function must be computable by a program, so that the values can be generated automatically (this conflicts with my 'level of effort' statement earlier, but there may be ways to work around it too).
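
One family of mappings with those properties, shown purely as an illustration (a saturating exponential with an arbitrary scale parameter):

    /// Maps a non-negative raw measurement (downloads, dependents, ...)
    /// into [0.0, 1.0): strictly increasing, computable, and `scale`
    /// controls how quickly large raw values saturate toward 1.0.
    fn saturate(raw: f64, scale: f64) -> f64 {
    	1.0 - (-raw / scale).exp()
    }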

I think that all that is sufficient to prove that as the volume of the hyperrectangle increases, the API is getting 'better', but I haven't done a complete analysis and may have missed something. YMMV.

I expect that if this was turned into a real proposal and implemented, then there would be a learning curve as to which factors are the ones to concentrate on, as well as to decide what the precise shape of their mapping functions should be.

Maybe a good compromise would be to check the latest version of all crates that had a release in the last 12 months and contain more than 500 lines of Rust code, excluding documentation. That rules out very outdated projects, as well as most garbage/test/toy projects.
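
The line-count half of that filter is easy to sketch against the directory layout from the original post; the 12-month cutoff would need the crates.io database dump, since the index carries no timestamps, and dropping blank/comment lines is only a rough stand-in for 'excluding documentation':

    for src in 5-crate-srcs/*; do
    	lines="$(find "$src" -name '*.rs' -exec cat {} + | grep -vc '^\s*\(//\|$\)')"
    	[ "$lines" -gt 500 ] && echo "$src"
    done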

I still don't think that crates with many dependents or downloads should be prioritized. The most popular crates simply aren't representative with respect to their usage of standard library imports.

3 Likes

No problem! Can you think of an alternative metric that can be automated and mapped the way I mentioned earlier? We can toss it into the pot and see if the resulting priorities make sense to us.

To solve this, I look at other factors such as:

  • Does the crate have out-of-date dependencies (with a smart definition of what actually counts as out of date)? Bumping deps once in a while is the best sign that a crate isn't dead, even if its own code is done.
  • Has the frequency of releases slowed down gradually and shifted to patch releases? (Crates that are abandoned tend to stop releasing abruptly; crates that become "done" tend to taper off with a slowing rate of patch releases.)
  • Is the crate still gaining users, or losing them? Crates switching away signal deprecation sooner than download numbers, which have lots of momentum.

This is why I've suggested using lib.rs-ranked crates: the ranking already includes lots of factors for a crate's relevance, and it filters out the stuff that people may have written but nobody wants, not even the authors.

2 Likes