Pre-RFC: bless the `type FastHashMap` pattern

I rather like that, actually. Fits well the whole "great for the easy cases but has some potential issues in the corners" that this is going for.

5 Likes

To try to characterize the situation from a higher level, I think we want three attributes in a name – short, precisely descriptive, and self-evident – and since nobody seems to be able to come up with any which satisfies all three, we are likely forced to play the “pick any two” game.

If we decide that it’s okay if it’s not short, I personally like CollisionProneHashMap.

If we choose to sacrifice precision, another option in addition to those already mentioned might be RiskyHashMap. (Or ReadTheDocsHashMap, but that’s tongue-in-cheek, and only one-of-three.)

Finally, if we can accept the name being a bit cryptic, there is NcrHashMap, which would be short for “non-collision-resistant”.

I think the last one in particular might be a reasonable dark horse candidate. We have Arc, after all. And “Ncr” does a good job of incentivizing people to read the docs to find out what it means.

4 Likes

I like naive

I don’t like CollisionProneHashMap since the user may think there are more hash collisions which is neither necessarily true nor describes the true problem of using HashMap.

2 Likes

Maybe give it an opaque name that does not indicate “good” or “bad”. E.g. FxHashMap when it uses the FxHash, then someone has to read the docs to figure out what the difference is. Or DSHashMap (deterministic, simple). Similar opaque names can be used for constructor methods instead of type aliases.

I think “linear” and “naive” are pretty good for 0-hash and fast hash respectively.

The problem with FxHashMap is that the stdlib specifically doesn’t want to guarantee a specific hasher implementation, but rather just the attributes thereof.

If you want a guaranteed hasher, you can still specify it manually yourself as you do today, and you still should.

That was just an example for a more general concept.

I really like your two-out-of-three categorization, and I really like Risky as an option. I think short(ish) is important, and I think cryptic has the same problem as other non-negative prefixes: it doesn’t signal immediately that the user is required to spend more time thinking about which one they’re using.

What I think we should be trying to do here is sort of grounded in Kahneman’s System 1 vs 2 research, where we want to engage the users enough that they don’t intuitively go with the easy/attractive option, but that they will either pick the safe option out of convenience or only pick the risky option after spending enough time with the documentation to know that it’s safe for their use case.

Contrary to some other people in this thread, I think we can not at all trust all users to read the documentation. It might be nice if some security-minded folks (like @bascule?) spoke up here.

3 Likes

While I’m not a fan of FastHashMap either, I’d like to frame the argument against it a little more generously than “we don’t trust users”.

IMO, when a library (std or otherwise) exports public types named Foo and FastFoo, that very strongly implies that users can decide which one they want based on whether “fast is good” for their use case, without necessarily reading the docs. The same way that you can decide between Arc and Rc based on whether you need it to be atomic (after reading just enough of the docs to learn the A means atomic), or between Vec and VecDeque based on whether you need it to be double-ended. The FastHashMap name would be dangerous not because users are untrustworthy, but because it’s just a very misleading name (I kinda want to say “dishonest” but that’s probably too far).


I also agree that we’re in a “pick any two” situation, but I don’t think “self-evident” is actually an option available to us. I think the intent was that imprecise names like FastHashMap and InsecureHashMap are “self-evident” because everyone knows what “fast”, “insecure”, “risky”, etc mean in regular English. But those totally fail to convey “does not use a hashing algorithm that’s resistant to collision-based DoS attacks”, which is not risky or insecure at all in the use cases where it should be used. I don’t think a word can be called “self-evident” when its usage would be just plain incorrect. In contrast, ThreadSafeRC would have been less precise than AtomicRC/Arc, but still accurate enough to qualify as self-evident for anyone familiar with thread-safety issues.

However, IMO a long, precise name like NonCollisionResistantHashMap is only “self-evident” if you’re already familiar with the concept of collision resistance and how it prevents certain kinds of DoS attacks. Since no other language (that I know of) provides this by default, many newcomers to Rust will have simply never heard of it at all (like me), or will never realize it applies to this type before they read the docs, because there simply is no standard, concise, cross-language term of art for this property yet (maybe there is among security experts, but I don’t think there is among programmers). That’s why I don’t think it’s possible to achieve “self-evident” with any name; not enough programmers are familiar with the concept we’re trying to name.

Which means the best “pick two” we can hope to practically achieve is “short” and “precisely descriptive”. Therefore, I’m still a fan of the “cryptic” options like NCRHashMap (unless it turns out there’s a different concise term of art for this which is universally recognizable in the security community).

2 Likes

It sounds a bit too much hard to remember. NaiveHash sounds like a better choice to me.

I do think Naive is an improvement over Fast/Insecure/etc because it’s at least accurate (if I understand the whole collision resistance property correctly; maybe I don’t). But I don’t think any novice would see naive and non-naive versions of a type and have any idea whether they wanted “naivety” without reading the docs. They’d still need to learn and remember that “naive” in this context means “non-collision resistant”, so I’m not sure just calling it NCR would make things any harder to remember.

How about HastyHashMap? It implies both that it is fast, and that it might be a little careless or rash. (Bonus for the alliteration.)

Or RashHashMap, which has a rhyme bonus.

10 Likes

Yes, my intent by this was just that without prior knowledge you can read the name and parse it into words whose independent meanings you understand. I don't know what a better word for this is, do you?

"Legible" maybe?

Rather than bikeshedding the name for a “faster but insecure” hash, could we revisit the question of whether we need one?

Suppose we have a hash table that switches from “fast but insecure hash” to “reasonably fast and secure hash” if and only if it detects collision attacks (based on the length of the longest chain compared to the number of elements in the table)? If we had that, under what circumstances would we ever need the one that stays “fast but insecure” forever? Bear in mind that the two would only differ in cases where the hash table algorithm is being attacked to produce collisions and the hash table is devolving into a linked list, in which case your performance is tanking anyway.

4 Likes

That may be a good compromise between fast and secure, but will it guarantee deterministic behaviour? The first paragraph of the original post does list non-deterministic behaviour as one of the two costs of the default hashing algorithm.

How feasible is this to implement? Presumably you want this to be the behavior for the default S = RandomState which produces DefaultHasher. I suppose you'd need some internal trait which no-ops this logic for most BuildHasher types, and specializes RandomState to upgrade to a secure hasher?

It’s a pain to do in a way that’s compatible with the existing hash map types (with their pluggable hashing), though it should still be quite possible. I’d love to see a patch implementing that.

It’s super easy to add adaptive hashing to the current system. You just toss a default-implemented try_upgrade(&mut self) -> Result<(), ()> { Err(()) } method on BuildHasher (or maybe Hasher? same diff design-wise) that HashMap calls when it “detects” degeneracy. If Ok() is yielded it rebuilds and rehashes everything.

Since all impls of try_upgrade will statically always yield Ok or Err, in practice this should be statically compiled out for any hashing implementation that doesn’t support the feature.

2 Likes

Followup that became too long for a reasonable edit:

The real burden of adaptive hashing is proving out heuristics and benchmarking everything to make a convincing argument for its adoption. The current system, and the proposed FastHashMap type, has obvious characteristics, so it’s a simple change we can make to make writing performant/deterministic Rust code a little bit better out of the box.

Also regardless of adaptive hashing there is still a desire for non-deterministic/deterministic dichotomy. Deterministic hashing has perfectly valid usecases (see: rustc and reproducible builds). But having non-deterministic hashing by default avoids users accidentally relying on things that we don’t guarantee (see: JS was forced to guarantee objects behave like linked hashmaps for iteration because people were just dumping them into ordered drop downs).

edit: also for full information pczarn is interested in re-investigating an adaptive hashing impl

4 Likes