Pre-RFC: Unify dashes and underscores on crates.io

Recently, the icu4x project had a bit of a debate about whether our crates should be named icu_foo or icu-foo. The primary motivating factor for both sides was not aesthetics but what people will default to. The argument for icu-foo was that dashes are what most people typically default to, whereas the argument for icu_foo was that newcomers may not be 100% familiar with the dash-to-underscore conversion that happens before Rust code sees the crate name, so it's better to be consistent for them. Rather belatedly, we realized that we had already published an icu-locale crate, effectively locking us in to dashes (or to picking a new name for the locale crate) if we want internal consistency.

I'd been thinking about this for a while, but this incident really motivated me to finally post this pre-RFC.

Summary

Crates.io, cargo, and docs.rs will treat crate names as identical under transformations where dashes and underscores are replaced with one another.

Crates will still have a canonical name that uses dashes or underscores, but it only matters for presentation.

Motivation

Crates.io already prevents you from publishing foo_bar when foo-bar already exists (and vice versa). The equivalence class of crate names under replacement between dashes and underscores already uniquely defines a single crate. Crates.io and docs.rs already perform redirects.

However, every time I type a crate name in Cargo.toml I need to remember whether the crate uses dashes or underscores. This is annoying and rather unnecessary.

New projects are also forced to make a choice between dashes and underscores, and most of the tradeoffs there have to do with the choice that people will pick first (to minimize friction when working with this crate). Dashes are more of a typical default pick, but new Rustaceans not aware of the dash-to-underscore conversion may first try underscores.

The Rust project has so far not taken a stance strongly preferring one or the other, nor does it seem likely to, so this problem isn't going anywhere.

It seems to me we can make all of this a moot point by treating them as equivalent in the backends.

Guide-level explanation

Crates with underscores or dashes in their names can be referred to with any name that is equivalent to the original under the replacement of one or more dash/underscore with the other separator. This applies to Cargo.toml, crates.io, and docs.rs.

Reference-level explanation

When published, crates have a canonical name which is their name when published. This crate will have an equivalence class equal to all names that can be formed by replacing one or more - with _ or vice versa in the crate.
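A minimal sketch of the equivalence check described above, assuming we pick `_` as the internal representative (the choice is arbitrary and says nothing about the canonical name). The function names `normalize` and `same_crate` are hypothetical and are not part of Cargo or crates.io:

```rust
/// Map a crate name to one representative of its equivalence class by
/// replacing every `-` with `_`. The canonical (as-published) name is
/// tracked separately; this key is only used for comparison.
fn normalize(name: &str) -> String {
    name.replace('-', "_")
}

/// Two names refer to the same crate if they differ only in which
/// separator (`-` or `_`) appears at each separator position.
fn same_crate(a: &str, b: &str) -> bool {
    normalize(a) == normalize(b)
}
```

Under this sketch `same_crate("foo-bar", "foo_bar")` holds, while `foobar` stays distinct from both because it has no separator at that position.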

crates.io and docs.rs will perform redirects when you visit any name within the equivalence class that is not the canonical name. (This is already the case.)

Cargo will also treat these crates as equivalent. When traversing the registry trie, it will traverse both underscore and dash options for the crate, picking up the first matching crate it finds. This is technically a breaking change for custom registries (see below), though I'm not sure if people would actually care about that.

Cargo will also treat names within this equivalence class as equivalent when looking for path or git dependencies, i.e. the following is okay:

# ./Cargo.toml
[dependencies]
foo-bar = { path = "../foobar" }

# ../foobar/Cargo.toml
[package]
name = "foo_bar"

Cargo will, in its user interface, report the canonical name of the crate.

Drawbacks

It being forbidden to upload an underscore-crate when a dash-crate exists (and vice versa) is not a Cargo feature, it is a crates.io feature. It does not apply to custom registries. Any solution that makes the Cargo codebase itself aware of dashes and underscores may be a breaking change for custom registries.

We could potentially add support for "renames" to the registry format, however this will bloat the set of crates in the registry. Such a feature may eventually be useful for folks wishing to migrate to optional namespacing.

Rationale and alternatives

We could simply not do this, however this confusion seems to crop up a lot.

We could also "solve" this by, as a community, determining that either dashes or underscores are the accepted idiomatic style, and using that style for newer crates. Over time this would diminish the problem, and if we ever add support for renames it could potentially get rid of the problem entirely.

Prior art

This has been discussed in the past at https://github.com/rust-lang/cargo/issues/2775

Unresolved questions

None so far

Future possibilities

It's worth considering the interaction of this feature with Pre-RFC: Packages as Optional Namespaces or whatever namespacing solution we pick. So far it does not clash.

Any solution that involves teaching Cargo about renames may also be useful for supporting renames for smoother migration to namespaced packages.


I'd be entirely in favor of this, particularly since we already enforce an absence of such name conflicts on crates.io.

I also think it would make sense for cargo new and similar tools to gently steer people towards one or the other, but that isn't in any way a prerequisite or blocker for this.


Every time I:

  • Type a crate name in Cargo.toml
  • Type https://docs.rs/<crate>
  • Type https://crates.io/crates/<crate>

I need to remember whether the crate uses dashes or underscores. This is annoying and rather unnecessary.

The second and third of those are not the case today. They already take you to the intended place. See e.g. https://docs.rs/serde-json & https://crates.io/crates/serde-json (the actual name of serde_json is with an underscore).

For crate names in Cargo.toml: I am strongly in favor and have been asking for this change for >4 years. rust-lang/cargo#2775 has a bunch of discussion including some reasons that it is not straightforward to fix.


I've recently published my first crate and spent way too long thinking about the stupid - vs _ question. I also find it unfortunate that there seems to be no official recommendation on this question. I like this proposal a lot.

The only counterpoints I've found around here and in the linked issue 2775 are

  • potential problems if the index needs to be restructured - or need for "brute-force" search through 2^n possible _/- combinations
  • breaking change for custom registries

to which I'd reply

  • keep the index format unchanged and limit the "brute-force" search to e.g. names with up to 5 or 6 underscores+hyphens, requiring users with extremely long crate names to get their _s and -s right manually.
  • if this potential breakage is actually a problem, a possible solution would be to have a new option in Cargo.toml that controls whether dependencies are interpreted underscore/hyphen-agnostically or with the two differentiated. The agnostic version would always check for all name variants in the index and give an error if more than one matching result is found. The new option would have a default, and that default would depend on the edition option in Cargo.toml and change from status quo to agnostic with edition 2021. This way there's no breaking change.
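A rough sketch of the capped, agnostic lookup suggested in the replies above, assuming the index is just a set of published names. `MAX_SEPARATORS`, `Lookup`, `variants`, and `lookup` are hypothetical names made up for illustration; none of this is Cargo's actual code:

```rust
use std::collections::HashSet;

/// Hypothetical cap: names with more separators than this must match exactly.
const MAX_SEPARATORS: usize = 6;

#[derive(Debug, PartialEq)]
enum Lookup {
    Found(String),
    NotFound,
    Ambiguous(Vec<String>),
}

fn is_separator(c: char) -> bool {
    c == '-' || c == '_'
}

/// Enumerate all 2^n spellings of `name`, where n is its separator count.
fn variants(name: &str) -> Vec<String> {
    let positions: Vec<usize> = name
        .char_indices()
        .filter(|&(_, c)| is_separator(c))
        .map(|(i, _)| i)
        .collect();
    assert!(positions.len() <= MAX_SEPARATORS, "caller must enforce the cap");
    let mut out = Vec::new();
    for mask in 0u32..(1u32 << positions.len()) {
        let mut bytes = name.as_bytes().to_vec();
        for (bit, &pos) in positions.iter().enumerate() {
            // Bit `bit` of the mask picks the separator at that position.
            bytes[pos] = if ((mask >> bit) & 1) == 1 { b'_' } else { b'-' };
        }
        out.push(String::from_utf8(bytes).expect("swapping ASCII separators keeps valid UTF-8"));
    }
    out
}

/// Agnostic lookup: error out if more than one variant exists in the index.
fn lookup(index: &HashSet<String>, requested: &str) -> Lookup {
    let separators = requested.chars().filter(|&c| is_separator(c)).count();
    if separators > MAX_SEPARATORS {
        // Over the cap: only the exact spelling is tried.
        return if index.contains(requested) {
            Lookup::Found(requested.to_string())
        } else {
            Lookup::NotFound
        };
    }
    let mut matches: Vec<String> = variants(requested)
        .into_iter()
        .filter(|v| index.contains(v))
        .collect();
    match matches.len() {
        0 => Lookup::NotFound,
        1 => Lookup::Found(matches.remove(0)),
        _ => Lookup::Ambiguous(matches),
    }
}
```

With a cap of 6 the search tries at most 64 spellings per dependency. The `Ambiguous` case can only arise on registries that, unlike crates.io, allow both spellings to be published.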

The bigger problem for me is that the docs.rs search function doesn't always unify dashes and underscores right now. For example, a search for serde-j yields crates like serde-json-core or tokio-serde-json-mirror, but not the serde_json crate. The search on crates.io does not seem to have this problem, even though the results could be better (why is alt_serde_json listed before serde_json?).


This seems pretty easy to fix; it'd be great if you could open an issue for it. (I don't think any of the docs.rs maintainers were aware of it.)


Done in https://github.com/rust-lang/docs.rs/issues/1101.


Another option to avoid the brute force solution: save every name internally with the same symbol, say '_'. Then instead of a 2^n search it would just be changing '-' to '_' before doing a lookup.
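To illustrate the idea, here is a small sketch of a registry keyed by the normalized name, assuming it also remembers the as-published spelling for display. The `Registry` type and its methods are hypothetical and do not reflect how crates.io or Cargo store things:

```rust
use std::collections::HashMap;

/// Toy registry: maps the normalized key to the canonical (as-published) name.
#[derive(Default)]
struct Registry {
    by_key: HashMap<String, String>,
}

impl Registry {
    /// Store every name with the same symbol, here `_`.
    fn key(name: &str) -> String {
        name.replace('-', "_")
    }

    /// Register a new crate name; refuse it if an equivalent name exists.
    fn register(&mut self, name: &str) -> Result<(), String> {
        let key = Self::key(name);
        if let Some(existing) = self.by_key.get(&key) {
            return Err(format!("`{}` conflicts with existing crate `{}`", name, existing));
        }
        self.by_key.insert(key, name.to_string());
        Ok(())
    }

    /// A single map lookup replaces the 2^n search; returns the canonical name.
    fn resolve(&self, requested: &str) -> Option<&str> {
        self.by_key.get(&Self::key(requested)).map(String::as_str)
    }
}
```

Lookup cost no longer depends on the number of separators, and the canonical spelling is still available for presentation.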

Personally I would also support picking one symbol and changing the docs to always display that symbol.


Type a crate name in Cargo.toml

I think this should go away. Typing (or copy-pasting) names and versions manually is a waste of users' time, and makes Cargo look clunky compared to npm and other package managers. Cargo should adopt cargo add instead. With cargo-edit this is already a solved problem:

cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`

FYI I cloned the crates.io index and these were the results:

$ rg --files -l -g "*-*" . | wc -l
19131
$ rg --files -l -g "*_*" . | wc -l
8557

So the number of crates using - somewhere in their name is more than double the number of crates using _.

As for total occurrences, this is the list of all characters (obtained with the command rg --files -l . | xargs -L1 basename | sed 's/./&\n/g' | sort | uniq -ic | sort -nr):

  46462 e
  38209 r
  36367 s
  33914 t
  33499 a
  31446 i
  31065 o
  24996 -
  24245 c
  23854 n
  23079 l
  17757 p
  16728 m
  16274 d
  14881 u
  12211 g
  10996 _
  10174 b
   8838 h
   8797 y
   8179 f
   6574 k
   5976 v
   4815 w
   4430 x
   2067 2
   1679 z
   1544 q
   1518 1
   1497 j
   1057 3
    788 0
    617 4
    503 5
    501 6
    444 8
    267 7
    238 9

Since _ is an identifier character, while - is a minus sign, it's pretty obvious (to me at least) that the canonical form needs to use underscores rather than minuses.

Note that typographical dashes, such as n-dash (U+2013) and m-dash (U+2014), have never been proposed as word separators in crate names, so the issue is not "underscore vs dash", it's "underscore vs minus".

I 100% agree _ should be the canonical form.

But if we're going to get pedantic, - is named "Hyphen-Minus" in the Unicode spec and is described as "hyphen or minus sign". I agree it's not "underscore vs dash" but I also don't think it's "underscore vs minus". Rather, it's "underscore vs hyphen".


I think we can all agree that the current situation is terrible.

This RFC would be great because it allows Cargo.toml files to move to using underscores (the clear choice for the preferred form) even before/without crates renaming themselves to use underscores – thus preventing the "chicken or egg" problem between renaming crates and renaming dependency entries.

I think the ideal way forward would be to not only implement this RFC, but also (either together or separately),

  • State in the API Guidelines that underscores are the preferred form and hyphens are only allowed (as an interchangeable symbol) for backwards compatibility,
  • Allow renaming crates from hyphen to underscore on crates.io, and,
  • Make a simple script (perhaps cargo fix --hyphenated-crate-names?) to convert both the current crate name and the current crate's dependency entries to use the new preferred format

Thanks @dtolnay! I've updated the pre-rfc.

I don't actually think the brute force is a problem: It can be structured as:

  • Look for the name directly
  • If not, traverse the tree recursively, splitting whenever the dash and underscore are both extant in the tree

The brute force exponential growth is only a problem if in fact there are crates sharing prefixes with the same names modulo dashes/underscores that have a large number of separators. I.e. you'd need foo-bar-baz-quux-1, foo-bar-baz_quux-2, foo-bar_baz-quux-3 and so on to all exist. This is unlikely and an attempt to set something like this up would itself require an exponential number of automated publishes and would be a violation of the crates.io policy. Furthermore, such exponential explosions are already possible in Cargo if you allow a large number of automated publishes and create a deep dependency tree.
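The traversal can be sketched with a toy in-memory character trie. This is an assumption for illustration only (Cargo's real index is a directory layout, not this structure), and `Node`, `insert`, `find`, and `walk` are hypothetical names. The direct-lookup shortcut from the first step is omitted, since the walk also finds an exact match:

```rust
use std::collections::HashMap;

/// Toy trie node; `crate_name` is set where a published name ends.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    crate_name: Option<String>,
}

impl Node {
    fn insert(&mut self, name: &str) {
        let mut node = self;
        for c in name.chars() {
            node = node.children.entry(c).or_default();
        }
        node.crate_name = Some(name.to_string());
    }

    /// Return every published name equivalent to `requested`.
    fn find(&self, requested: &str) -> Vec<&str> {
        let chars: Vec<char> = requested.chars().collect();
        let mut found = Vec::new();
        self.walk(&chars, &mut found);
        found
    }

    fn walk<'a>(&'a self, rest: &[char], found: &mut Vec<&'a str>) {
        match rest.split_first() {
            None => {
                if let Some(name) = &self.crate_name {
                    found.push(name.as_str());
                }
            }
            Some((&c, tail)) => {
                // At a separator, try both spellings. The search only
                // actually splits when both children exist in the trie.
                let candidates = if c == '-' || c == '_' { vec!['-', '_'] } else { vec![c] };
                for candidate in candidates {
                    if let Some(child) = self.children.get(&candidate) {
                        child.walk(tail, found);
                    }
                }
            }
        }
    }
}
```

The work done is proportional to the number of stored names sharing the relevant prefixes, not to 2^n, which is the point made above: the blowup needs an exponential number of near-identical crates to already exist.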


I would like to request people to please avoid getting into "what the idiomatic separator is" debates. Please open a separate thread if you would like to establish one.

This RFC defines the canonical form as whatever the user uses to publish it.


Given https://github.com/rust-lang/cargo/issues/2775 it does seem to me that this doesn't need an RFC. I'll let this discussion happen anyway, however, and maybe open an RFC that gets approved pretty quickly as a way to get more eyeballs on it.

Just to poke at this a bit, the size of the equivalence class for any given crate name is exponential in the number of hyphens/underscores in the name. I think you've given an argument for why this isn't a problem for a particular implementation/situation. And that might be enough. But is there ever a need to enumerate the equivalence class? (I don't know, I'm not familiar with Cargo/crates.io workings, but figured I'd ask.)

Given that crates.io and docs.rs already handle it, I don't think so. From cargo's standpoint I'm only aware of the registry where you actually need to enumerate things.


The crates.io and docs.rs situations are a bit different from Cargo's. Those avoid enumerating an equivalence class by using their own private index representation which, for example, might key by a normalized form of the package name with all hyphens/underscores replaced by one or the other. But Cargo's index is tied to a public and documented format for other tools and registries to build on top of (https://doc.rust-lang.org/1.47.0/cargo/reference/registries.html#index-format), so modifying the representation to support efficient querying is going to require far more coordination.


Currently Cargo and local Cargo tools can cheaply search crates for all hyphen/underscore variants, because the entire index is kept locally on the user's machine. However, this approach doesn't scale well and will fail before crates.io reaches npm's size.

To fix the scalability problem, I've proposed a new registry interface based on static HTTP files that allows downloading only as little of the index as needed. However, that partial access means it's no longer possible to do brute-force search.

So it'd be nice to take this into consideration. For example, the file paths in the index could be changed to always use underscores.
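As a sketch of what that could look like: the directory layout below follows the documented index format (1, 2, 3/{first-character}, {first-two}/{second-two}, lowercased file name), while the underscore normalization step is the hypothetical change and is not something Cargo does today. `index_path` is a made-up helper name:

```rust
/// Compute the index file path for a package name, with separators
/// collapsed to `_` so that all spellings map to the same file.
/// Returns `None` for empty or non-ASCII names.
fn index_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    // Hypothetical step: normalize before computing the path.
    let key = name.to_ascii_lowercase().replace('-', "_");
    let path = match key.len() {
        1 => format!("1/{}", key),
        2 => format!("2/{}", key),
        3 => format!("3/{}/{}", &key[..1], key),
        _ => format!("{}/{}/{}", &key[..2], &key[2..4], key),
    };
    Some(path)
}
```

A client fetching only individual files over HTTP could then request one path for `serde-json` and `serde_json` alike, with the canonical name stored inside the file.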
