Pre-RFC: Unify dashes and underscores on crates.io

Type a crate name in Cargo.toml

I think this should go away. Typing (or copy-pasting) names and versions manually is a waste of users' time, and makes Cargo look clunky compared to npm and other package managers. Cargo should adopt cargo add instead. With cargo-edit this is already a solved problem:

cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`
9 Likes

FYI I cloned the crates.io index and these were the results:

$ rg --files -l -g "*-*" . | wc -l
19131
$ rg --files -l -g "*_*" . | wc -l
8557

So the number of crates using - somewhere in their name is more than double the number of crates using _.

As for total occurrences, this is the list of all characters (obtained with the command rg --files -l . | xargs -L1 basename | sed 's/./&\n/g' | sort | uniq -ic | sort -nr):

  46462 e
  38209 r
  36367 s
  33914 t
  33499 a
  31446 i
  31065 o
  24996 -
  24245 c
  23854 n
  23079 l
  17757 p
  16728 m
  16274 d
  14881 u
  12211 g
  10996 _
  10174 b
   8838 h
   8797 y
   8179 f
   6574 k
   5976 v
   4815 w
   4430 x
   2067 2
   1679 z
   1544 q
   1518 1
   1497 j
   1057 3
    788 0
    617 4
    503 5
    501 6
    444 8
    267 7
    238 9
7 Likes

Since _ is an identifier character, while - is a minus sign, it's pretty obvious (to me at least) that the canonical form needs to use underscores rather than minuses.

Note that typographical dashes, such as n-dash (U+2013) and m-dash (U+2014), have never been proposed as word separators in crate names, so the issue is not "underscore vs dash", it's "underscore vs minus".

1 Like

I 100% agree _ should be the canonical form.

But if we're going to get pedantic, - is named "Hyphen-Minus" in the Unicode spec and is described as "hyphen or minus sign". I agree it's not "underscore vs dash" but I also don't think it's "underscore vs minus". Rather, it's "underscore vs hyphen".

10 Likes

I think we can all agree that the current situation is terrible.

This RFC would be great because it allows Cargo.toml files to move to using underscores (the clear choice for the preferred form) even before/without crates renaming themselves to use underscores – thus preventing the "chicken or egg" problem between renaming crates and renaming dependency entries.

I think the ideal way forward would be to not only implement this RFC, but also (either together or separately),

  • State in the API Guidelines that underscores are the preferred form and hyphens are only allowed (as an interchangeable symbol) for backwards compatibility,
  • Allow renaming crates from hyphen to underscore on crates.io, and,
  • Make a simple script (perhaps cargo fix --hyphenated-crate-names?) to convert both the current crate name and the current crate's dependency entries to use the new preferred format
3 Likes

Thanks @dtolnay! I've updated the pre-rfc.

I don't actually think the brute force is a problem. It can be structured as:

  • Look for the name directly
  • If not, traverse the tree recursively, splitting whenever the dash and underscore variants are both extant in the tree

The brute force exponential growth is only a problem if in fact there are crates sharing prefixes with the same names modulo dashes/underscores that have a large number of separators. I.e. you'd need foo-bar-baz-quux-1, foo-bar-baz_quux-2, foo-bar_baz-quux-3 and so on to all exist. This is unlikely and an attempt to set something like this up would itself require an exponential number of automated publishes and would be a violation of the crates.io policy. Furthermore, such exponential explosions are already possible in Cargo if you allow a large number of automated publishes and create a deep dependency tree.
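The two-step lookup described above can be sketched roughly as follows. This is only an illustration, not Cargo's actual implementation: the trie layout and the names `build_trie`, `find_equivalent`, and `lookup` are assumptions made for the example. The search branches only at positions where the query has a separator, and only into children that actually exist in the tree, so the work is bounded by the names that were really published.

```python
SEPARATORS = "-_"

def build_trie(names):
    """Build a character trie; the empty-string key marks the end of a published name."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[""] = name
    return root

def find_equivalent(trie, query):
    """Return published names equal to `query` modulo '-' and '_'."""
    results = []

    def walk(node, i):
        if i == len(query):
            if "" in node:
                results.append(node[""])
            return
        ch = query[i]
        # Branch only at separator positions, and only into existing children.
        candidates = SEPARATORS if ch in SEPARATORS else ch
        for c in candidates:
            child = node.get(c)
            if child is not None:
                walk(child, i + 1)

    walk(trie, 0)
    return results

def lookup(names, trie, query):
    """Step 1: look for the name directly. Step 2: fall back to the tree walk."""
    if query in names:
        return query
    matches = find_equivalent(trie, query)
    return matches[0] if matches else None

names = {"serde_json", "foo-bar_baz", "foo"}
trie = build_trie(names)
print(lookup(names, trie, "serde-json"))
print(lookup(names, trie, "foo_bar-baz"))
```

The exponential case only appears if many separator variants of the same name are all published, which is the scenario the paragraph above argues is unlikely.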

1 Like

I would like to request people to please avoid getting into "what the idiomatic separator is" debates. Please open a separate thread if you would like to establish one.

This RFC defines the canonical form as whatever the user uses to publish it.

8 Likes

Given https://github.com/rust-lang/cargo/issues/2775 it does seem to me that this doesn't need an RFC. I'll let this discussion happen anyway, however, and maybe open an RFC that gets approved pretty quickly as a way to get more eyeballs on it.

Just to poke at this a bit, the size of the equivalence class for any given crate name is exponential in the number of hyphens/underscores in the name. I think you've given an argument for why this isn't a problem for a particular implementation/situation. And that might be enough. But is there ever a need to enumerate the equivalence class? (I don't know, I'm not familiar with Cargo/crates.io workings, but figured I'd ask.)
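To make the size of the equivalence class concrete, here is a small sketch that enumerates every hyphen/underscore variant of a name. The function name `equivalence_class` is made up for this example; it simply shows that a name with n separators has 2^n variants.

```python
from itertools import product

def equivalence_class(name):
    """Enumerate all variants of `name` that differ only in '-' versus '_'."""
    parts = name.replace("_", "-").split("-")
    n_separators = len(parts) - 1
    variants = []
    for seps in product("-_", repeat=n_separators):
        out = parts[0]
        for sep, part in zip(seps, parts[1:]):
            out += sep + part
        variants.append(out)
    return variants

for n in range(1, 6):
    name = "-".join(["x"] * (n + 1))  # a name with n separators
    print(n, len(equivalence_class(name)))
```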

Given that crates.io and docs.rs already handle it, I don't think so. From cargo's standpoint I'm only aware of the registry where you actually need to enumerate things.

1 Like

The crates.io and docs.rs situations are a bit different from Cargo's. Those avoid enumerating an equivalence class by using their own private index representation which, for example, might key by a normalized form of the package name with all hyphens/underscores replaced by one or the other. But Cargo's index is tied to a public and documented format for other tools and registries to build on top of (https://doc.rust-lang.org/1.47.0/cargo/reference/registries.html#index-format), so modifying the representation to support efficient querying is going to require far more coordination.
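As an illustration of the "key by a normalized form" idea, here is a minimal sketch of a private index. It is not how crates.io or docs.rs are actually implemented; the class `NormalizedIndex` and its methods are hypothetical, and the choice of underscore as the normalized separator is arbitrary.

```python
def normalize(name):
    """Map every separator to one form so equivalent names share a key."""
    return name.replace("-", "_")

class NormalizedIndex:
    """Hypothetical private index keyed by normalized package name."""

    def __init__(self):
        self._published = {}  # normalized key -> name as published

    def publish(self, name):
        key = normalize(name)
        existing = self._published.get(key)
        if existing is not None and existing != name:
            raise ValueError(f"`{name}` conflicts with existing crate `{existing}`")
        self._published[key] = name

    def resolve(self, query):
        """Single dictionary lookup; no enumeration of variants is needed."""
        return self._published.get(normalize(query))

index = NormalizedIndex()
index.publish("serde_json")
print(index.resolve("serde-json"))
```

A lookup is a single hash access regardless of how many separators the name has, which is exactly what a file-per-name public index cannot offer without changing its layout.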

4 Likes

Currently Cargo and local Cargo tools can cheaply search crates for all hyphen/underscore variants, because the entire index is kept locally on the user's machine. However, this approach doesn't scale well and will fail before crates.io reaches npm's size.

To fix the scalability problem, I've proposed a new registry interface based on static HTTP files that allows downloading only as little of the index as needed. However, that partial access means it's no longer possible to do brute-force search.

So it'd be nice to take this into consideration. For example, the file paths in the index could be changed to always use underscores.
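A rough sketch of what "file paths always use underscores" could look like. The directory scheme below (1/, 2/, 3/{first char}/, otherwise {first two}/{next two}/, with lowercased names) follows the documented index format linked earlier in the thread; the underscore canonicalization step is the hypothetical change being suggested, and `index_path` is a name invented for this example.

```python
def index_path(name):
    """Compute an index file path, with separators canonicalized to '_' (hypothetical)."""
    canonical = name.lower().replace("-", "_")
    n = len(canonical)
    if n == 0:
        raise ValueError("empty package name")
    if n == 1:
        return f"1/{canonical}"
    if n == 2:
        return f"2/{canonical}"
    if n == 3:
        return f"3/{canonical[0]}/{canonical}"
    return f"{canonical[:2]}/{canonical[2:4]}/{canonical}"

# Both spellings resolve to the same file, so one request is enough.
print(index_path("serde-json"))
print(index_path("serde_json"))
```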

2 Likes

If we're using direct HTTP, it seems easy enough for the server to add a redirect from a canonicalized name, so that the client only needs to make a single request.

5 Likes

I'm a bit concerned because package names are used in a large number of places (many more than just the dependencies table in Cargo.toml). In particular, all external extensions and tools may need to deal with package name canonicalization, and most tools would likely implement it incorrectly. Not to mention covering every part of Cargo's command-line options, features, and internals, which I think would be a large effort.

As for the combinatorial issue, I think it is a concerning problem. There are a variety of situations that are not able to easily query for all possible permutations. https://github.com/rust-lang/rfcs/pull/2789 mentioned above is a good example. In practice, I suspect it would usually only require two checks (all hyphens or all underscores), but that seems like it could be a problem in some cases.

Would it be acceptable to try to add some kind of best-effort error messages that try to tell you if there is an equivalent name when possible?
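A best-effort check along these lines might look like the sketch below. It only tries the two uniform spellings (all hyphens, all underscores), so it needs at most two extra queries and will miss mixed-separator names. The function `suggest_equivalent`, the `exists` callback, and the message wording are all hypothetical, not Cargo's actual output.

```python
def suggest_equivalent(query, exists):
    """Return a hint if a uniform-separator variant of `query` exists, else None.

    `exists` is any callable that answers whether an exact name is published,
    for example a single registry request.
    """
    candidates = [query.replace("_", "-"), query.replace("-", "_")]
    for candidate in candidates:
        if candidate != query and exists(candidate):
            return f"no crate named `{query}` found; did you mean `{candidate}`?"
    return None

published = {"serde_json", "proc-macro2"}
print(suggest_equivalent("serde-json", published.__contains__))
print(suggest_equivalent("proc_macro2", published.__contains__))
```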

3 Likes

I would love a solution to this. At this point I'd honestly just flip a coin & pick one to be the "way".

We can probably figure out how to migrate existing names. I don't think anyone enjoys the inconsistency in naming we have currently.

2 Likes

I implemented that years ago: https://github.com/rust-lang/cargo/pull/5691. It uses an exponential brute-force algorithm, which is not a problem in practice.

What has prevented us from doing the rest is: https://github.com/rust-lang/cargo/issues/2775#issuecomment-557188111. TLDR, I don't know how to get the implementation to work given the existence of alternative registries.

@Eh2406 Yeah so I'm wondering: Perhaps this is an acceptable breaking change to make? There are not many users of alternative registries, and I can't imagine this is behavior anyone would want.

10 Likes

The last time I checked a year ago there were 292 crates using a combination of both - and _.
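For anyone who wants to re-check that figure against a fresh clone of the index, a count like this can be reproduced with a few lines. This is only a sketch: `separator_stats` is a made-up helper, and the sample names are illustrative rather than taken from the real index.

```python
def separator_stats(names):
    """Count names by which separators they contain."""
    stats = {"hyphen_only": 0, "underscore_only": 0, "mixed": 0, "none": 0}
    for name in names:
        has_hyphen = "-" in name
        has_underscore = "_" in name
        if has_hyphen and has_underscore:
            stats["mixed"] += 1
        elif has_hyphen:
            stats["hyphen_only"] += 1
        elif has_underscore:
            stats["underscore_only"] += 1
        else:
            stats["none"] += 1
    return stats

sample = ["serde", "serde_json", "proc-macro2", "foo-bar_baz"]
print(separator_stats(sample))
```

In practice the `names` list would come from the file names in the cloned index.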

2 Likes

I appreciate your proposal, and think it's a clear step in a positive direction. It feels, though, that if we're going to end up making a breaking change, and changing the contract with alternative registries, we might as well take the opportunity to pick one separator character and canonicalise to that, rather than to try to canonicalise to the uploaded crate's choice?

5 Likes

I do not think the Rust community has a clear preference here and do not think it would be productive to try and pick whether dashes or underscores are idiomatic. If you'd like to try doing so, please open a separate RFC.

The breaking change is very minor.

2 Likes