Pre-Pre-RFC: Prevent abnormal release of crates

From here: Hashsum crate names · Discussion #4227 · rust-lang/crates.io · GitHub

Summary

crates.io should apply blocking or warning strategies at publish time for test-only or inappropriate publishing behavior.

Motivation

When users run the publish command for crates.io, they sometimes publish crates that contain no real content. (Such unused crates are released for testing or just to try out publishing. Examples include crates with excessively long names and crates whose names are random character strings produced by a hash function.)

In addition, a crate's name or content may contain insulting words, which may be a malicious attack. We may need to detect such crates at publish time and raise alerts.

Publishing many meaningless crates does not help the community and occupies service resources. Insulting words may encourage unhealthy tendencies. Neither contributes to the healthy development of the community's technology and ecosystem.

We need to warn against, and even reject, these discouraged behaviors.

Guide-level explanation

The publishing process should check the crate name and the readme file for sensitive words.

  • Limit very long crate names, which discourage adding the crate as a dependency: reject crate names that exceed 20 characters. (20 is a provisional number; the exact limit should be determined later.)
  • Reject irregular, meaningless crate names generated by methods such as hashing or base64 encoding.
  • Search for insulting and unfriendly words, warn the publisher when they are found, ask for confirmation before publishing, and provide a way to change them later.
  • Non-compliant crates that have already been uploaded are not deleted. However, newly uploaded crates trigger reminders, a channel for explanation is provided, and the right to delete or change them is reserved.

For example, a crate named [aA1ae777f650d92b903634047b1adaf0a26ece4125e7af5b75ef0d8709] or [aa2d74507d184541926d670b753d6843c0a5ea74dede83ab2d43260fa1] both exceeds the length limit and has a meaningless name. In this case, a notification or denial of service is required.

Reference-level explanation

  • The system checks whether the length of the crate name meets the release criteria. If it does not, the system prompts the user to change the name and publish again. (For example, "The crate name you are publishing contains more than 20 characters. Please modify it before publishing.")
  • Scan the publication for sensitive words and set up reminders. For example, the system notifies the publisher when the crate contains words such as bitch and fuck. (For example, "The crate you are about to publish contains unfriendly words.")
  • Detect crate names generated by methods such as hashing or base64 encoding and refuse to publish them. (How can this type of crate name be identified accurately?)
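
The first and third checks above can be sketched roughly as follows. This is only an illustration, not how crates.io works: the 20-character limit is the provisional number from this proposal, and `MIN_DIGEST_LEN`, `NameProblem`, `looks_like_digest` and `check_name` are hypothetical names invented for the example. The "digest" test is a crude heuristic (a long run of hex digits); it does not catch base64 and it can misfire on legitimate names, which is exactly the open question in the last bullet.

```rust
/// Provisional limit taken from the proposal text; the exact number is undecided.
const MAX_NAME_LEN: usize = 20;
/// Hypothetical threshold: hex-only names at least this long are treated as digests.
const MIN_DIGEST_LEN: usize = 16;

#[derive(Debug, PartialEq, Eq)]
enum NameProblem {
    TooLong { len: usize, max: usize },
    LooksLikeDigest,
}

/// Returns true when the name, ignoring `-` and `_`, is a long run of hex digits.
/// This is a heuristic only: it misses base64 and may flag legitimate names.
fn looks_like_digest(name: &str) -> bool {
    let stripped: Vec<char> = name.chars().filter(|c| *c != '-' && *c != '_').collect();
    stripped.len() >= MIN_DIGEST_LEN && stripped.iter().all(|c| c.is_ascii_hexdigit())
}

/// Collects every problem found, so the publisher sees all of them at once.
fn check_name(name: &str) -> Vec<NameProblem> {
    let mut problems = Vec::new();
    let len = name.chars().count();
    if len > MAX_NAME_LEN {
        problems.push(NameProblem::TooLong { len, max: MAX_NAME_LEN });
    }
    if looks_like_digest(name) {
        problems.push(NameProblem::LooksLikeDigest);
    }
    problems
}
```

With these assumptions, `check_name` reports both problems for the two example names given earlier, and nothing for a short ordinary name. Whether a reported problem should lead to a warning or a rejection is a policy decision that the sketch leaves open.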

Drawbacks

Publishing a crate may take longer because of the added pre-release checks.

Rationale and alternatives

The current design performs the check on crates.io. The check could instead run in cargo at publish time. Since the crate is ultimately published to crates.io, I think it is appropriate for the final check to happen on crates.io.

Prior art

I have not had time to look into how other communities handle this; to be added.

Unresolved questions

If temporary test releases are needed, consider whether a crate should be able to be destroyed automatically, for example within 30 minutes. When releasing the crate, a parameter could set its lifetime, for example, cargo publish --time 30min.
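
As a rough sketch of what such a lifetime parameter might involve, the value could be parsed into a duration and compared with the crate's age. Note that cargo publish has no --time flag today; the flag, the accepted suffixes and the helper names `parse_lifetime` and `is_expired` are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Parses a hypothetical lifetime value such as "30min" or "2h" into a Duration.
/// The accepted suffixes are assumptions; the proposal only shows "30min".
fn parse_lifetime(s: &str) -> Option<Duration> {
    // Index of the first non-digit character; None if there is no unit at all.
    let split = s.find(|c: char| !c.is_ascii_digit())?;
    let (number, unit) = s.split_at(split);
    let value: u64 = number.parse().ok()?;
    let seconds_per_unit = match unit {
        "s" => 1,
        "min" => 60,
        "h" => 3600,
        _ => return None,
    };
    value.checked_mul(seconds_per_unit).map(Duration::from_secs)
}

/// True once `age` (time since publication) has reached the requested lifetime.
fn is_expired(age: Duration, lifetime: Duration) -> bool {
    age >= lifetime
}
```

Here `parse_lifetime("30min")` yields a 1800-second duration, while a value without a number or without a known unit is rejected. Deleting an expired crate would still conflict with the expectation that published versions stay available, so this only illustrates the parameter, not the policy.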

It's not entirely clear to me whether the community disallows including words such as bitch, fuck, **, **, etc.

Future possibilities

I haven't thought of any for now.

FYI, if this really is supposed to be an actual "Pre-RFC", it would usually already have to be an (almost) complete RFC, and the goal would be to get more feedback before the final "serious" proposal for something that's supposed to be accepted as an RFC as-is.

In particular, it would have to be self-contained and a bit more detailed. As it stands right now, it is not possible to understand at all what you're proposing without clicking the link (i.e. it's not self-contained) and it doesn't explain at all any criteria / characterization on what kind of behavior exactly should be considered inappropriate, and also what the proposed measures are that should be used against such behavior (i.e. this is where I think more details are necessary).

Of course it's also a good idea to put exactly these questions (about how to fill in the details) up for discussion on this forum, and it's also legitimate to post unfinished ideas in a non-self-contained manner. It's just not really a "Pre-RFC" yet. Let me update the title for you :wink:

5 Likes

I.e., is this just about the specific case of a crate name being a hash? What if such a crate is an internal dependency of a usefully-named crate? Is this about some kind of usefulness criteria for crates in general that should be established? Is the goal to automatically detect or manually detect something like this? Or semi-automatically?

Should automatically detected instances be taken down? Or just new ones prevented from being uploaded? Is there supposed to be a warning first / giving the owner the opportunity to explain their practices?

How many resources are wasted? Are such crates commonly downloaded? What do they contain? If they aren't really used, it's just the resources of storing a few text files, right? What do you mean by "crates information statistics"?

3 Likes

I agree that there are crates out there that aren't useful. This issue has come up before in the context of squatting.

To move forward with this, you need to solve two problems:

  • A reasonably clear definition of what is allowed and what isn't, one that covers edge cases. "I'll know when I see it" is not good enough. You'll need to answer questions like "how are people supposed to test whether their deployment pipeline works?" or "what if I just need a temporary fork for myself" or "will I be punished if I run cargo publish by accident?" or "what if my crate is named after Elon's new child?"

  • A way to enforce the rules without burdening the crates.io team. In the past they've been very reluctant to promise even the smallest manual interventions, out of fear that it could grow into a big burden, especially if crates.io grows to npm's size.

6 Likes

Thank you for your reply. I've updated the Pre-Pre RFC, which may address some of your questions.

My idea is to automatically detect names that violate the rules, and then decide whether to issue an alert or continue to provide the service. I agree with keeping crates that are already uploaded, while warning about or blocking newly uploaded ones. I can't count how many existing crates are meaningless, but the total number of crates is huge.

Thank you for your reply; I've described the problem in more detail. I updated the Pre-Pre RFC and welcome comments. Accidentally releasing a crate should be fine, I think, and crates.io could provide a way to withdraw it within 30 minutes. Probably the hardest part for me is judging whether a crate is meaningless or not.

If I'm not mistaken, the current limit for crate names is 64 chars; this has been discussed here: Investigate long names · Issue #696 · rust-lang/crates.io (github.com).

This issue indicates that back in 2017, the longest named crate was google-gamesconfiguration1_configuration-cli (44 chars).

By looking at the latest dump of crates.io, the longest named crate with more than 1000 downloads as of now is lock-free-multi-producer-single-consumer-ring-buffer (57 chars). I want to make it clear that I am not saying that the name of this crate is appropriate, just that it exists.

How would you go about implementing this? Do you only consider English words? If so, which words? Who gets to decide which words are forbidden? Otherwise, we would have to answer the same two questions for each language, which I don't think is feasible. I have come across some systems that take the route of blacklisting words using one giant list of "bad words" per supported language, but it never worked flawlessly: what can be (or is) considered a bad word in one language could be perfectly fine in another language, or in the same language but in a specific context (what about the brainfuck crate, for example?).

I too think there should be better rules for moderating crates posted on crates.io but it's not a trivial task at all.

It's very difficult to ban offensive words. You will get false positives (see the Scunthorpe problem) and false negatives from intentional misspelling or creative insults. I suggest leaving that feature out, and just relying on manual CoC enforcement if such an incident happens.
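
To make both failure modes concrete, here is a small sketch comparing a naive substring filter with a whole-token filter. The one-word blocklist is deliberately mild and hypothetical, as are the function names; neither function is proposed as something crates.io should adopt.

```rust
/// Hypothetical one-entry blocklist, chosen only to keep the example mild.
const BLOCKLIST: &[&str] = &["hell"];

/// Naive check: flags a name if any blocked word appears anywhere inside it.
fn flagged_by_substring(name: &str) -> bool {
    let lower = name.to_ascii_lowercase();
    BLOCKLIST.iter().any(|word| lower.contains(*word))
}

/// Stricter check: splits on `-` and `_` and flags only whole-token matches.
fn flagged_by_token(name: &str) -> bool {
    let lower = name.to_ascii_lowercase();
    lower
        .split(|c: char| c == '-' || c == '_')
        .any(|token| BLOCKLIST.iter().any(|word| *word == token))
}
```

The substring check flags an ordinary name such as shellexpand (a false positive), while the token check lets it through. Both checks miss a trivially respelled name such as h3ll-world (a false negative), which is the part no word list can fix.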

3 Likes

Creative insults can be very creative, especially when you have access to the full power of Unicode. Spam filters have to deal with that problem already, with messages using characters from multiple code planes that happen to look like each other and can therefore get around the filters. Then there are emojis, and creative pictographs (I will not provide links to any examples). Finally, there's a psychology problem; once you implement a filter like this, a certain subset of people will decide that it's a fun game to try to 'beat' the system, just to show that they can. The end result will be the opposite of what you want.

Instead of putting all of this on crates.io, why not work to enhance a tool like cargo-release [1]? Advertise this as a way of making your releases more professional, and therefore less embarrassing when someone points out what you just accidentally fat-fingered.

Which brings up my personal wish, that publish defaulted to false, so if you fat-finger something it doesn't suddenly get published. I don't know if that can be done though, I suspect it would cause a lot of headaches if anyone tried to implement that.

[1] I'm not associated with them in any way, I just use the tool, YMMV.

3 Likes