Recently, there's been some discussion about whether to change the standard prelude for Rust in the 2021 edition. To find out which imports a new prelude should include, I analysed 5000 crates from crates.io and found the most common imports from the stdlib.
The Results
The results are on this spreadsheet. The most common imports were std::fmt, HashMap, Arc, Path, std::io, std::io::Write, and Duration. All of these except std::io and std::io::Write (which have async counterparts) could probably be added to the prelude without much problem.
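For concreteness, this is roughly what those additions would look like as plain imports. This is just a sketch of the findings above, not an actual accepted prelude:

```rust
// The most common stdlib imports found in the analysis; everything
// here except std::io and std::io::Write is a plausible 2021
// prelude addition according to the results above.
use std::fmt;
use std::collections::HashMap;
use std::sync::Arc;
use std::path::Path;
use std::time::Duration;

// Common, but problematic for a prelude because of async counterparts:
use std::io;
use std::io::Write;
```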
The Method
Each generated file is numbered according to its step. The commands were written and used with zsh on Linux; you may have to adapt them if you want to reproduce this or do something similar.
Get crates index: git clone https://github.com/rust-lang/crates.io-index 1-index
Get all crate JSON data: find 1-index -mindepth 2 -type f -not -path '*/\.*' -exec tail -n1 {} \; > 2-crate-json
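Each line of 2-crate-json is the index metadata for the newest version of one crate. As a sketch of how one might read those lines (my reconstruction using serde_json, not necessarily the author's next step; serde_json must be added to Cargo.toml):

```rust
// Sketch: read the JSON lines produced by step 2 and print crate
// name and version. "name" and "vers" are real fields in the
// crates.io index format; the rest is illustrative.
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let file = BufReader::new(File::open("2-crate-json")?);
    for line in file.lines() {
        let line = line?;
        let meta: serde_json::Value =
            serde_json::from_str(&line).expect("malformed index line");
        let name = meta["name"].as_str().unwrap_or("?");
        let vers = meta["vers"].as_str().unwrap_or("?");
        println!("{} {}", name, vers);
    }
    Ok(())
}
```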
Very cool! One question: how did you choose which crates to analyze? By number of downloads, how well maintained they are, or something else entirely? It would be interesting to see how the analysis changes based on which crates are chosen.
I just used the first 5000 crates from the index, sorted alphabetically. The index doesn't contain download numbers, so I would have to use the database dumps or the crates.io API to get them, but I don't imagine it would change much.
Makes sense; I just downloaded and grepped through the database dump, but it doesn't have edition as a key, nor does it have any other interesting keys. Too bad; it would have been interesting to see how API usage changes over time.
Vec isn't in the core prelude though, of course. I think it's a common style to make a crate unconditionally #![no_std], and then import std-specific items manually behind a feature.
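A minimal sketch of that pattern (the "std" feature name is conventional but illustrative, and would be declared in Cargo.toml):

```rust
// The crate is unconditionally no_std; Vec comes from alloc rather
// than the core prelude, and std-only items are feature-gated.
#![no_std]

extern crate alloc;
use alloc::vec::Vec;

// std-specific imports only exist when the "std" feature is enabled.
#[cfg(feature = "std")]
extern crate std;
#[cfg(feature = "std")]
use std::collections::HashMap;
```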
Very interesting. One thing I can see in your spreadsheet is that use module and use module::self are treated as separate entries. I think it would make sense to consolidate these two, since they are effectively the same.
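For reference, the two spellings in question (the module names here are just to show both forms in one file; in real code the `self` form appears inside braces):

```rust
// Both forms bring the `fmt` module itself into scope under the
// same name, so counting them separately double-counts the module.
mod plain {
    use std::fmt;
    pub fn debug<T: fmt::Debug>(t: &T) -> String {
        format!("{:?}", t)
    }
}

mod with_self {
    use std::fmt::{self};
    pub fn debug<T: fmt::Debug>(t: &T) -> String {
        format!("{:?}", t)
    }
}
```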
Good point! Is there anything else that should be merged?
Come to think of it, this type of analysis would be useful for figuring out where the team needs to focus resources. Apple did something similar when they switched kernels for OS X, but focused on the actual APIs: they asked everyone to run a tool that captured which Apple APIs programmers used, and then shipped that to Apple. Apple then knew what they had to get right to keep developers happy.
It's weird to call it "bias". I'd say it would be more representative of relevant crates that people use.
The full set of all crates on crates.io includes spam, hello-ferris crates, some pre-1.0 crates, garbage uploaded to test whether publish works, and lots of abandoned projects.
An unbiased sample of all these crates includes this garbage (Sturgeon's law), and is biased against the kinds of crates that people currently maintain and want to use.
Yes, but it's not representative of code that people write. For example, commonly used crates include C bindings. Those are used often, but written rarely.
If we want to create a new prelude, it should be convenient to use for everybody, not just the authors of famous crates.
FYI, all versions of all crates take about 50 GB to download. The latest version of each, uncompressed, is 18 GB. I have that checked out locally; getting them is surprisingly easy, because someone already wrote a tool to do so, complete with incremental sync. You can find the guide here.
I'm with @kornel on this; abandoned projects and other garbage shouldn't be a real factor in this. In fact, I prefer a biased sample. I want to know that serde is as fast as possible, which means that if it happens to be using APIs supplied by the standard library, then those need to be optimized. Conversely, if we analyze a bunch of projects from 4 years ago that all used APIs that are now deprecated and discouraged, they should be weighted less.
The real question becomes: how do we choose our biasing factors? This also addresses @Aloso's comment above.
Here are a few factors we can use (others could be added at any time):
How recently was the crate maintained?
How widely is the crate used?
What would the performance impact be of improving that portion of the code (Amdahl's Law)?
We can substitute how old the last release on crates.io is for the first point, use the number of dependent projects (including transitive dependencies) pulled from the Cargo.toml files of all projects on crates.io for the second, and work on improving crater so that it is able to profile code for the third. Once we have a definition of how to measure each of the above, it becomes pretty easy to define a fully automated metric that tells you where you're going to get the most bang for your buck, which can feed into a priority queue of what tasks need to be done.
For what it's worth, my vote is to multiply all the factors together; that is, each factor is mapped to the range [0.0, 1.0], and they are all multiplied together. This is the equivalent of calculating the volume of a hyperrectangle, and has the advantage of punishing efforts that improve one axis at the cost of another. Once you have the volume, divide it by the estimated effort to improve a given API. This forms the priority number (easy, high-impact stuff first; hard, low-impact stuff last). If there is enough processing power available, crater could be run on nightly for this, but I suspect it would be run on the latest stable instead, across a large swath of the ecosystem for several weeks to gather data.
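As a sketch, that metric is just a product divided by effort; the numbers below are hypothetical:

```rust
// Every factor is pre-mapped into [0.0, 1.0]; their product is the
// volume of the hyperrectangle, and dividing by estimated effort
// yields the priority (higher means do it sooner).
fn priority(factors: &[f64], estimated_effort: f64) -> f64 {
    debug_assert!(factors.iter().all(|f| (0.0..=1.0).contains(f)));
    let volume: f64 = factors.iter().product();
    volume / estimated_effort
}

fn main() {
    // Hypothetical crate: recently maintained (0.9), widely used (0.8),
    // hot code path (0.7), moderate effort estimate (2.0).
    println!("{:.3}", priority(&[0.9, 0.8, 0.7], 2.0)); // prints 0.252
}
```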
That substitution only works if the crate continues to require maintenance. A heavily-used crate that is correct as it stands, thus not needing maintenance, would receive a low score that, given your proposal for multiplicative factors, would inherently trigger a low-priority rating.
For the record, I'm not tied to any particular set of factors, or even how those factors are mapped to the range [0.0, 1.0]. What I feel strongly about is:
Factors should always map to the range [0.0, 1.0].
The mapping function must be computable by a program, so that the values can be generated automatically (this conflicts with my 'level of effort' statement earlier, but there may be ways to work around it too).
I think that all that is sufficient to prove that as the volume of the hyperrectangle increases, the API is getting 'better', but I haven't done a complete analysis and may have missed something. YMMV.
I expect that if this was turned into a real proposal and implemented, then there would be a learning curve as to which factors are the ones to concentrate on, as well as to decide what the precise shape of their mapping functions should be.
Maybe a good compromise would be to check the latest version of every crate that had a release in the last 12 months and that contains more than 500 lines of Rust code, excluding documentation. That rules out very outdated projects, as well as most garbage/test/toy projects.
I still don't think that crates with many dependencies or downloads should be prioritized. The most popular crates simply aren't representative with respect to their usage of standard library imports.
No problem. Can you think of an alternative metric that can be automated, and that can be mapped the way I mentioned earlier? We can toss it into the pot and see if the resulting priority makes sense to us.
Does this crate have out-of-date dependencies (with a smart definition of what actually counts as out of date)? Bumping deps once in a while is the best sign that a crate isn't dead, even if its own code is done.
Has the frequency of releases slowed down gradually and shifted to patch releases? (Crates that are abandoned tend to stop releasing abruptly; crates that become "done" tend to finish with a slowing rate of patch releases.)
Is the crate still gaining users, or losing them? Users switching away signal deprecation sooner than download numbers do, since downloads have a lot of momentum.
This is why I've suggested using lib.rs-ranked crates: the ranking already includes lots of factors for a crate's relevance, and filters out the stuff that people may have written but that nobody wants, including the authors.
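Two of these signals are straightforward to turn into the [0.0, 1.0] factors discussed earlier. A rough sketch with made-up inputs, shapes, and constants; none of this is an existing crates.io or lib.rs API:

```rust
// Dependency freshness: 1.0 when no dependencies are out of date.
fn dep_freshness(outdated_deps: usize, total_deps: usize) -> f64 {
    if total_deps == 0 {
        1.0
    } else {
        1.0 - outdated_deps as f64 / total_deps as f64
    }
}

// User trend: ratio of recent downloads to the previous period,
// capped at 1.0 so growing crates don't exceed the range.
fn user_trend(last_90_days: u64, previous_90_days: u64) -> f64 {
    if previous_90_days == 0 {
        1.0
    } else {
        (last_90_days as f64 / previous_90_days as f64).min(1.0)
    }
}

fn main() {
    // Hypothetical crate: 1 of 10 deps outdated, downloads holding steady.
    println!("{:.2}", dep_freshness(1, 10));      // 0.90
    println!("{:.2}", user_trend(9_000, 10_000)); // 0.90
}
```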