Pre-RFC: Unify dashes and underscores on crates.io

If we're using direct HTTP, it seems easy enough for the server to add a redirect from a canonicalized name, so that the client only needs to make a single request.


I'm a bit concerned because package names are used in a large number of places (many more than the dependencies table in Cargo.toml). In particular, all external extensions and tools may need to deal with package name canonicalization, which most tools would likely implement incorrectly. Not to mention covering every part of Cargo's command-line options, features, and internals, which I think would be a large effort.

As for the combinatorial issue, I think it is a concerning problem. There are a variety of situations in which it is not easy to query for all possible permutations. https://github.com/rust-lang/rfcs/pull/2789 mentioned above is a good example. In practice, I suspect it would usually only require two checks (all hyphens or all underscores), but that seems like it could be a problem in some cases.

Would it be acceptable to add some kind of best-effort error message that tells you, when possible, if there is an equivalent name?
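To make the best-effort idea concrete, here is a minimal sketch. It assumes the tool already has a list of known crate names at hand; the function names and the message wording are hypothetical and are not taken from Cargo.

```rust
/// Treats `-` and `_` as the same character (hypothetical helper).
fn normalize(name: &str) -> String {
    name.replace('-', "_")
}

/// Returns a hint if some known crate differs from `requested`
/// only in its choice of separators. The message text is illustrative.
fn equivalent_name_hint(requested: &str, known: &[&str]) -> Option<String> {
    let wanted = normalize(requested);
    known
        .iter()
        .find(|&&k| k != requested && normalize(k) == wanted)
        .map(|k| format!("no crate named `{}` found; did you mean `{}`?", requested, k))
}
```

This only helps where a list of candidate names is available locally, which is exactly the limitation raised above.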


I would love a solution to this. At this point I'd honestly just flip a coin & pick one to be the "way".

We can probably figure out how to migrate existing names. I don't think anyone enjoys the inconsistency in naming we have currently.


I implemented that years ago: https://github.com/rust-lang/cargo/pull/5691. It uses an exponential brute-force algorithm, which is not a problem in practice.
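For readers unfamiliar with the idea, a brute-force search of this kind can be sketched as below. This is an illustration only, not the code from the linked PR; `separator_variants` is a hypothetical name.

```rust
/// Returns every spelling of `name` obtained by independently choosing
/// `-` or `_` at each separator position: 2^k candidates for k separators.
fn separator_variants(name: &str) -> Vec<String> {
    // Byte positions of the separators; both are single-byte ASCII.
    let positions: Vec<usize> = name
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == b'-' || b == b'_')
        .map(|(i, _)| i)
        .collect();
    // Guard against shift overflow; real names have far fewer separators.
    assert!(positions.len() < 32, "too many separators to enumerate");

    let mut variants = Vec::with_capacity(1usize << positions.len());
    // Each bit of `mask` selects the separator used at one position.
    for mask in 0u64..(1u64 << positions.len()) {
        let mut bytes = name.as_bytes().to_vec();
        for (bit, &pos) in positions.iter().enumerate() {
            bytes[pos] = if (mask >> bit) & 1 == 1 { b'_' } else { b'-' };
        }
        // ASCII bytes were swapped for ASCII bytes, so this stays valid UTF-8.
        variants.push(String::from_utf8(bytes).expect("valid UTF-8"));
    }
    variants
}
```

A name with two separators yields 4 candidates and one with eight yields 256, so the cost is driven by the separator count rather than by the length of the name.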

What has prevented us from doing the rest is: https://github.com/rust-lang/cargo/issues/2775#issuecomment-557188111. TLDR, I don't know how to get the implementation to work given the existence of alternative registries.

@Eh2406 Yeah so I'm wondering: Perhaps this is an acceptable breaking change to make? There are not many users of alternative registries, and I can't imagine this is behavior anyone would want.


The last time I checked a year ago there were 292 crates using a combination of both - and _.


I appreciate your proposal, and think it's a clear step in a positive direction. It feels, though, that if we're going to end up making a breaking change, and changing the contract with alternative registries, we might as well take the opportunity to pick one separator character and canonicalise to that, rather than to try to canonicalise to the uploaded crate's choice?


I do not think the Rust community has a clear preference here and do not think it would be productive to try and pick whether dashes or underscores are idiomatic. If you'd like to try doing so, please open a separate RFC.

The breaking change is very minor.


In Rust and most computer languages underscores (_) are identifier characters, whereas the - (minus, hyphen) character serves as an arithmetic operator. IMO on that basis the choice is very clear.


It's not that clear to everyone. The discrepancy between identifiers and crate names has been there since the very beginning, and yet the majority of crate authors have chosen to use hyphens anyway.

It's a very bikesheddable topic like tabs vs spaces, so I hope the cargo and crates-io teams will just toss a coin and pick one :slight_smile:


I have already asked you once, please let us not have this debate here. It is not at all as clear cut as you make it, I personally lean the other way, as do the majority of published crates.

Feel free to open a separate thread to try and form a consensus on it if you'd like.


To me this seems like an acceptable breakage, even without a warning period. I really doubt there is an alternative registry in existence which contains two crates differentiated by - vs _. If a warning period is reasonable to implement that would be ideal.


FWIW, given that I may very well be one of those rare people to have published in that fashion, I've done that in order to have a hierarchy of separators: when using the proc-macro backend + frontend crate pattern, I have used crate_name-proc_macros as the name of the backend crate.

Why am I saying this? Just to say that, in my case, sacrificing that flexibility going forward in exchange for an O(1) crate name lookup (either all dashes or all underscores, so as not to bikeshed over which of the two is the better choice) would definitely be worth it :slightly_smiling_face:


Btw, the largest number of _ and - characters in a name on crates.io is 8. This means there are at most 2^8 = 256 combinations to try out, which is not that many. The limit on a crates.io name is 64 characters. Maybe it would make sense to limit the number of _ and - as well, say to 6 (and allow the existing crates to be grandfathered in)? Then the problem would still be limited.

$ rg --files -l . | xargs -L1 basename | sed 's/[^-_]//g' | sort | uniq -ic | sort -nr
  20886 
  14084 -
   6218 _
   3845 --
   1606 __
    656 ---
    290 ___
    139 -_
    136 ----
     73 -_-
     59 _-
     44 ____
     31 --_
     15 -----
     14 -__
     12 ---_
      9 __-
      9 _--
      8 _____
      6 --__
      5 _--_
      5 ------
      3 ___-
      3 _-_
      3 -_---
      3 ---__
      2 _______
      2 __--
      2 _-_-_
      2 -___
      2 -__-
      2 ---_---
      2 -------
      1 ________
      1 ______
      1 __-_
      1 __----
      1 _--__
      1 _-----
      1 -___-
      1 -_--
      1 --_-
      1 ---_--
      1 --------
$ rg --files -l . | xargs -L1 basename | sed 's/[^-_]//g' | sed 's/_/-/g' | sort | uniq -ic | sort -nr
  20886 
  20302 -
   5649 --
   1085 ---
    215 ----
     33 -----
      9 ------
      6 -------
      2 --------

This is roughly my stance as well, I can't imagine this is actually desired behavior, and there are not many consumers of alternative registries for this to have accidentally happened.

We could, but as I listed earlier the problem of exponential growth doesn't actually exist unless you maliciously publish prefixes with all the combinations. Otherwise most of your trie traversals will be pretty focused and linear. Exponential growth is scary but I do not consider it to be a problem here because you need to act maliciously to exploit it (and autopublishing an exponentially large number of crates for this purpose would be against the ToS).

I can imagine a large number of dashes being useful for crates autogenerated from some API. For example, the Google API crates use the format google-foobarbaz but they could easily have picked google-foo-bar-baz, and the google-cloudprivatecatalogproducer1_beta1-cli could easily have been google-cloud-private-catalog-producer-1_beta1-cli, which reaches 7 separators.


I don't think this will ever go away completely. If I want to specify some features of the dependency or use it from a local path, I would rather type it in Cargo.toml than try to figure out what args I should pass to cargo add to get the desired result. I already need to remember the manifest keys because I need to read Cargo.toml, and I don't want to remember another interface for the same thing. That said, smarter auto-completion of crate names in IDEs should help a lot.

Another problem is that cargo creates an implicit feature for every dependency, so the dependency's name may appear in cfg attributes or in the --features command-line argument. Should cargo allow all variants of the name in these places as well?

The current situation is very unfortunate, and I would love to see even a small improvement like this RFC. And I agree that it's not worth trying to overcome the disagreement about the proper canonical separator right now.


The crates.io trie is implemented in a limited way. It has only 2 levels of directories, distinguishing the first 4 characters (with special cases for crate names of 3 characters or fewer). That means you can't use the trie structure beyond the first 4 characters to speed up your search: after that you can only list all crates sharing the same first 4 characters as the one you are inquiring about, or you can test each possible spelling. Both are non-scalable approaches.
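The layout described here can be sketched as a path function. This is a simplified illustration of the index directory scheme, assuming ASCII names and lowercased paths; `index_path` is a hypothetical name, not Cargo's actual function.

```rust
/// Maps a crate name to its relative path in a two-level index layout.
/// Returns `None` for empty or non-ASCII names, which this sketch does not handle.
fn index_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    let name = name.to_ascii_lowercase();
    Some(match name.len() {
        // Short names get special-cased directories.
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        3 => format!("3/{}/{}", &name[..1], name),
        // Everything else is bucketed by characters 1-2, then 3-4.
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    })
}
```

Under this scheme `serde-json` and `serde_json` land in the same `se/rd/` directory, while `a-bc` and `a_bc` land in different ones (`a-/bc/` and `a_/bc/`). So the directory only narrows the search when the separators fall after the fourth character.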

Yes, if the trie were extended to go deeper than just 2 levels the problem would be solved. That's a good point and I hadn't thought about it. But note that the 2-level format is currently hardcoded in cargo, so there are backwards compatibility concerns.

It is a problem, because it is possible that there will be malicious actors who will exploit this, for example for DoS attacks.

The only real solution is to get rid of the brute force search: Create an internal index where all hyphens and underscores are substituted with a single character. This will allow constant-time lookup.

The problem is that this is incompatible with the current crate index format, and I agree with @kornel that this format should be deprecated and replaced with something that scales better:

I do not believe that real users will ever have enough underscores and hyphens in their crate names that the exponential look up could matter. This discussion is a misapplication of resources.


Users can already perform DoS attacks by sticking infinite loops in build.rs. If you don't trust your dependencies, this is not a new problem. The only real concern is whether users can cause problems for non-malicious crates by publishing similarly named crates.

The crates.io database can key on a normalized name and store the canonical name alongside it.
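A minimal sketch of that idea follows, assuming an in-memory map stands in for the database. `NameIndex`, `publish`, and `canonical` are hypothetical names for illustration, not crates.io code.

```rust
use std::collections::HashMap;

/// Collapses both separators to one character to form the lookup key.
fn normalized_key(name: &str) -> String {
    name.replace('-', "_")
}

/// Maps each normalized key to the canonical (as-published) crate name.
#[derive(Default)]
struct NameIndex {
    by_key: HashMap<String, String>,
}

impl NameIndex {
    /// Registers `name`. If an equivalent name is already taken,
    /// returns that existing canonical name as the error.
    fn publish(&mut self, name: &str) -> Result<(), String> {
        let key = normalized_key(name);
        if let Some(existing) = self.by_key.get(&key) {
            return Err(existing.clone());
        }
        self.by_key.insert(key, name.to_string());
        Ok(())
    }

    /// Resolves any separator spelling to the canonical name in one lookup.
    fn canonical(&self, requested: &str) -> Option<&str> {
        self.by_key.get(&normalized_key(requested)).map(String::as_str)
    }
}
```

With a key like this, resolving `serde-json` to a published `serde_json` takes a single hash lookup regardless of how many separators the name contains.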

The only potential problem is people publishing crates that have a similar normalized name to existing crates, and that's only a problem if the following two things are true:

  • the registry index trie is deep (it's not, though it could be)
  • the user is allowed to autopublish an exponential number of crates (they're not)

In the current universe the worst case is "you have to search through all crates in a trie leaf", and if the trie ever gets into a state where that is actually a large number, we can always add more levels.

And this whole thing is predicated on non-malicious crates existing that have more than six separators.

As boats said, this is a non-problem; let's move along.