I rather like that, actually. Fits well the whole "great for the easy cases but has some potential issues in the corners" that this is going for.
To try to characterize the situation from a higher level, I think we want three attributes in a name: short, precisely descriptive, and self-evident. Since nobody seems to be able to come up with one which satisfies all three, we are likely forced to play the "pick any two" game.
If we decide that it's okay if it's not short, I personally like CollisionProneHashMap.
If we choose to sacrifice precision, another option in addition to those already mentioned might be RiskyHashMap. (Or ReadTheDocsHashMap, but that's tongue-in-cheek, and only one-of-three.)
Finally, if we can accept the name being a bit cryptic, there is NcrHashMap, which would be short for "non-collision-resistant".
I think the last one in particular might be a reasonable dark horse candidate. We have Arc, after all. And "Ncr" does a good job of incentivizing people to read the docs to find out what it means.
I like naive
I don't like CollisionProneHashMap, since the user may think there are more hash collisions, which is not necessarily true and does not describe the real problem with using HashMap.
Maybe give it an opaque name that does not indicate "good" or "bad". E.g. FxHashMap when it uses FxHash; then someone has to read the docs to figure out what the difference is. Or DSHashMap (deterministic, simple). Similar opaque names can be used for constructor methods instead of type aliases.
I think "linear" and "naive" are pretty good for 0-hash and fast hash respectively.
The problem with FxHashMap is that the stdlib specifically doesn't want to guarantee a specific hasher implementation, but rather just the attributes thereof.
If you want a guaranteed hasher, you can still specify it manually yourself as you do today, and you still should.
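As a concrete illustration of what that opt-in looks like today, here is a minimal sketch using the real std::collections::HashMap::default / BuildHasherDefault machinery. NaiveHasher is a made-up stand-in for a fast, non-collision-resistant hasher (FxHash-style), not a real type:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical stand-in for a fast, non-collision-resistant hasher.
#[derive(Default)]
struct NaiveHasher(u64);

impl Hasher for NaiveHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        // Simple multiply-and-add mixing; deterministic, not DoS-resistant.
        for &b in bytes {
            self.0 = self.0.wrapping_mul(31).wrapping_add(b as u64);
        }
    }
}

fn main() {
    // Explicitly opting into a specific hasher, as you can today:
    let mut map: HashMap<&str, i32, BuildHasherDefault<NaiveHasher>> =
        HashMap::default();
    map.insert("one", 1);
    map.insert("two", 2);
    assert_eq!(map.len(), 2);
    assert_eq!(map["one"], 1);
}
```

The type parameter makes the choice visible at the use site, which is part of why "just plug in your own BuildHasher" already covers the guaranteed-hasher case.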
That was just an example for a more general concept.
I really like your two-out-of-three categorization, and I really like Risky as an option. I think short(ish) is important, and I think cryptic has the same problem as other non-negative prefixes: it doesn't signal immediately that the user is required to spend more time thinking about which one they're using.
What I think we should be trying to do here is sort of grounded in Kahneman's System 1 vs 2 research, where we want to engage the users enough that they don't intuitively go with the easy/attractive option, but that they will either pick the safe option out of convenience or only pick the risky option after spending enough time with the documentation to know that it's safe for their use case.
Contrary to some other people in this thread, I don't think we can trust all users to read the documentation. It might be nice if some security-minded folks (like @bascule?) spoke up here.
While I'm not a fan of FastHashMap either, I'd like to frame the argument against it a little more generously than "we don't trust users".
IMO, when a library (std or otherwise) exports public types named Foo and FastFoo, that very strongly implies that users can decide which one they want based on whether "fast is good" for their use case, without necessarily reading the docs. The same way that you can decide between Arc and Rc based on whether you need it to be atomic (after reading just enough of the docs to learn that the A means atomic), or between Vec and VecDeque based on whether you need it to be double-ended. The FastHashMap name would be dangerous not because users are untrustworthy, but because it's just a very misleading name (I kinda want to say "dishonest" but that's probably too far).
I also agree that we're in a "pick any two" situation, but I don't think "self-evident" is actually an option available to us. I think the intent was that imprecise names like FastHashMap and InsecureHashMap are "self-evident" because everyone knows what "fast", "insecure", "risky", etc. mean in regular English. But those totally fail to convey "does not use a hashing algorithm that's resistant to collision-based DoS attacks", which is not risky or insecure at all in the use cases where it should be used. I don't think a word can be called "self-evident" when its usage would be just plain incorrect. In contrast, ThreadSafeRC would have been less precise than AtomicRC/Arc, but still accurate enough to qualify as self-evident for anyone familiar with thread-safety issues.
However, IMO a long, precise name like NonCollisionResistantHashMap is only "self-evident" if you're already familiar with the concept of collision resistance and how it prevents certain kinds of DoS attacks. Since no other language (that I know of) provides this by default, many newcomers to Rust will have simply never heard of it at all (like me), or will never realize it applies to this type before they read the docs, because there simply is no standard, concise, cross-language term of art for this property yet (maybe there is among security experts, but I don't think there is among programmers). That's why I don't think it's possible to achieve "self-evident" with any name; not enough programmers are familiar with the concept we're trying to name.
Which means the best "pick two" we can hope to practically achieve is "short" and "precisely descriptive". Therefore, I'm still a fan of the "cryptic" options like NCRHashMap (unless it turns out there's a different concise term of art for this which is universally recognizable in the security community).
It sounds a bit too hard to remember. NaiveHash sounds like a better choice to me.
I do think Naive is an improvement over Fast/Insecure/etc. because it's at least accurate (if I understand the whole collision resistance property correctly; maybe I don't). But I don't think any novice would see naive and non-naive versions of a type and have any idea whether they wanted "naivety" without reading the docs. They'd still need to learn and remember that "naive" in this context means "non-collision-resistant", so I'm not sure just calling it NCR would make things any harder to remember.
How about HastyHashMap? It implies both that it is fast and that it might be a little careless or rash. (Bonus for the alliteration.)
Or RashHashMap, which has a rhyme bonus.
Yes, my intent with this was just that, without prior knowledge, you can read the name and parse it into words whose independent meanings you understand. I don't know what a better word for this is; do you?
"Legible" maybe?
Rather than bikeshedding the name for a "faster but insecure" hash, could we revisit the question of whether we need one?
Suppose we have a hash table that switches from "fast but insecure hash" to "reasonably fast and secure hash" if and only if it detects collision attacks (based on the length of the longest chain compared to the number of elements in the table)? If we had that, under what circumstances would we ever need the one that stays "fast but insecure" forever? Bear in mind that the two would only differ in cases where the hash table algorithm is being attacked to produce collisions and the hash table is devolving into a linked list, in which case your performance is tanking anyway.
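To make the detection idea concrete, here is a rough sketch of what such a heuristic might look like. The function name and the threshold formula are purely illustrative assumptions (a real implementation would need careful benchmarking, as noted later in the thread), but the shape is "compare the longest chain to what a healthy table of this size should produce":

```rust
// Hypothetical heuristic: with a well-distributed hash, the longest chain
// in a table of n elements is expected to stay around O(log n). A chain
// wildly longer than that suggests a collision attack.
fn looks_attacked(longest_chain: usize, len: usize) -> bool {
    let expected = (len.max(2) as f64).log2().ceil() as usize;
    // The factor and slack here are made-up tuning constants.
    longest_chain > 4 * expected + 8
}

fn main() {
    // A healthy table: short chains relative to its size.
    assert!(!looks_attacked(5, 1000));
    // A degenerate table devolving toward a linked list.
    assert!(looks_attacked(500, 1000));
    println!("ok");
}
```

The point of the proposal is that only tables tripping this check would pay for the switch to a secure hasher; everyone else keeps the fast path.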
That may be a good compromise between fast and secure, but will it guarantee deterministic behaviour? The first paragraph of the original post does list non-deterministic behaviour as one of the two costs of the default hashing algorithm.
How feasible is this to implement? Presumably you want this to be the behavior for the default S = RandomState, which produces DefaultHasher. I suppose you'd need some internal trait which no-ops this logic for most BuildHasher types, and specializes RandomState to upgrade to a secure hasher?
It's a pain to do in a way that's compatible with the existing hash map types (with their pluggable hashing), though it should still be quite possible. I'd love to see a patch implementing that.
It's super easy to add adaptive hashing to the current system. You just toss a default-implemented try_upgrade(&mut self) -> Result<(), ()> { Err(()) } method on BuildHasher (or maybe Hasher? same diff design-wise) that HashMap calls when it "detects" degeneracy. If Ok(()) is yielded, it rebuilds and rehashes everything. Since all impls of try_upgrade will statically always yield Ok or Err, in practice this should be statically compiled out for any hashing implementation that doesn't support the feature.
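For concreteness, here is a self-contained sketch of how that default method could work and compile away. The trait, types, and on_degeneracy hook below are hypothetical stand-ins for the proposed change, not the actual std BuildHasher/HashMap machinery:

```rust
// Hypothetical version of the proposed hook on BuildHasher/Hasher.
trait AdaptiveBuildHasher {
    // Default impl: this hasher cannot upgrade. Since it always returns
    // Err, a map's upgrade path can be optimized out entirely for it.
    fn try_upgrade(&mut self) -> Result<(), ()> {
        Err(())
    }
}

// A fixed fast hasher: keeps the default no-op try_upgrade.
struct FixedHasher;
impl AdaptiveBuildHasher for FixedHasher {}

// An adaptive state that can switch to a secure algorithm on demand.
struct AdaptiveState {
    secure: bool,
}
impl AdaptiveBuildHasher for AdaptiveState {
    fn try_upgrade(&mut self) -> Result<(), ()> {
        self.secure = true; // switch algorithms; caller must rehash
        Ok(())
    }
}

// What a map might do when it detects degenerate chains: ask the hasher
// to upgrade, and rebuild/rehash everything if that succeeds.
fn on_degeneracy<H: AdaptiveBuildHasher>(h: &mut H) -> bool {
    h.try_upgrade().is_ok()
}

fn main() {
    let mut fixed = FixedHasher;
    assert!(!on_degeneracy(&mut fixed)); // statically Err: no-op

    let mut adaptive = AdaptiveState { secure: false };
    assert!(on_degeneracy(&mut adaptive)); // upgraded; map would rehash
    assert!(adaptive.secure);
}
```

Because each impl's try_upgrade is statically always Ok or always Err, monomorphization lets the compiler eliminate the dead branch, which is the "compiled out" property claimed above.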
Followup that became too long for a reasonable edit:
The real burden of adaptive hashing is proving out heuristics and benchmarking everything to make a convincing argument for its adoption. The current system and the proposed FastHashMap type have obvious characteristics, so it's a simple change we can make to make writing performant/deterministic Rust code a little bit better out of the box.
Also, regardless of adaptive hashing, there is still a desire for a non-deterministic/deterministic dichotomy. Deterministic hashing has perfectly valid use cases (see: rustc and reproducible builds). But having non-deterministic hashing by default avoids users accidentally relying on things that we don't guarantee (see: JS was forced to guarantee objects behave like linked hashmaps for iteration because people were just dumping them into ordered drop downs).
edit: also, for full information, pczarn is interested in re-investigating an adaptive hashing impl