Wouldn’t it be nice if Rust hash maps were faster? Everybody says so. The libs
team generally agrees that std should provide hash maps that are fast in cases
where today’s HashMaps with the default hasher (SipHash 1-3) are not, but the
libs team does not have the capacity to work on this directly right now. This is
your invitation to help make that happen.
A suitable design for this is not clear, and this is a contentious issue, so a
motivated person will need to work with interested parties to come up with a
viable design, then take it through the RFC process.
Some likely constraints:
- A solution that removes Rust's long-established strong protection against collision-based attacks from the default `HashMap` is unlikely to be accepted, for backwards-compatibility reasons if nothing else, so the most likely path forward is to add additional hashers to std in some form.
- The old plan for this depended on default type parameter fallback, so that e.g. `let map: HashMap<_, _, FnvState> = HashMap::new()` would create FNV hash maps while `let map: HashMap<_, _> = HashMap::new()` would continue to create SipHash hash maps. Unfortunately, today `HashMap::new` is only implemented for the default hasher, and changing that appears to be massively breaking, so this is unlikely to be workable; not to mention that default type parameter fallback is unstable. As a result it will only be possible to instantiate `HashMap<_, _, CustomState>` with the `default` method or some other custom constructor.
- It's desirable to have the freedom to change the hashing algorithms used in std, so naming should be semantic rather than based on the algorithm. Today's hasher is called `DefaultHasher`; a new one might be called e.g. `FastHasher` or `FastShortKeyHasher`.
- It is primarily short keys (< 20 bytes) that are slow with SipHash 1-3. SipHash 1-3 performance with medium-length keys is competitive. With long keys SipHash trails algorithms like xxHash, but long keys may not be a common case in hash maps.
- Previous investigations into the characteristics of various hash algorithms can be found here.
- Hybrid implementations that switch algorithms based on key length sound attractive, but they seem to be firmly in the realm of research, so they don't seem prudent right now.
Solutions might entail a new `FastHasher` type and possibly new `FastHashMap` and `FastHashSet` types or typedefs.
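As a concrete illustration of the constructor constraint above, here is a sketch of building a map with a non-default state via `Default` rather than `new`. The `Fnv`/`FnvState` names and the inline FNV-1a implementation are only for illustration (the constants are the published 64-bit FNV-1a parameters); a real version would live in a crate or, per this proposal, in std under a semantic name.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A toy FNV-1a hasher, spelled out only for illustration.
struct Fnv(u64);

impl Default for Fnv {
    fn default() -> Self { Fnv(0xcbf29ce484222325) } // FNV-1a offset basis
}

impl Hasher for Fnv {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV-1a prime
        }
    }
}

type FnvState = BuildHasherDefault<Fnv>;

fn main() {
    // HashMap::new is tied to the default hasher, so a custom state goes
    // through Default (or HashMap::with_hasher) instead:
    let mut map: HashMap<&str, i32, FnvState> = HashMap::default();
    map.insert("foo", 1);
    assert_eq!(map["foo"], 1);
}
```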
So that’s it. The door is open. We just need to solve some design problems to
step through. Be the change you want to see in the std.
In parallel, would the libs team be amenable to exploring a change to the implementations of `Hash` for `str` and `[u8]`? They currently both hash their contents plus a sentinel value, so that hashes of tuples or structs containing different strings do not collide.
For example, the current impls make it so that, with a `struct Thing(&'static str, &'static str)`, a `Thing("foobar", "baz")` and a `Thing("foo", "barbaz")` do not have the same hash output. Unfortunately, that means hashing `"foo"` results in 2 calls to `Hasher::write`, slowing it down.
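The two calls are easy to observe with a hasher that merely counts what it is fed (a throwaway type written here for illustration):

```rust
use std::hash::{Hash, Hasher};

// A counting "hasher": records what the Hash impl feeds it
// instead of actually mixing anything.
#[derive(Default)]
struct Counter { bytes: usize, writes: usize }

impl Hasher for Counter {
    fn finish(&self) -> u64 { 0 }
    fn write(&mut self, buf: &[u8]) {
        self.bytes += buf.len();
        self.writes += 1;
    }
}

fn main() {
    let mut c = Counter::default();
    "foo".hash(&mut c);
    // 3 content bytes plus the sentinel byte, delivered in two writes.
    assert_eq!(c.bytes, 4);
    assert_eq!(c.writes, 2);
}
```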
Changing this behavior would mean a single string calls the hasher only once, making it faster. Structs and tuples with multiple string fields would then have the possibility of a hash collision, but they would still not compare equal. It would shift the cost from "every single string, all the time" to "handle the collision when a struct with multiple strings has one". This change appears to prioritize the common case, leaving the slow path to the quite-unlikely case.
Unfortunately, that "unlikely case" is trivially exploitable. That is, it would be trivial to construct a massive set of structs that all have the same hash even if a secure hash algorithm were used.
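A sketch of that exploit shape, assuming the sentinel were dropped: every way of splitting one byte string across two fields would feed the hasher an identical stream, no matter which algorithm runs afterward.

```rust
// With the sentinel gone, a struct's two string fields would contribute
// only their concatenated bytes to the hasher.
fn stream_without_sentinel(a: &str, b: &str) -> Vec<u8> {
    let mut out = a.as_bytes().to_vec();
    out.extend_from_slice(b.as_bytes());
    out
}

fn main() {
    let s = "a".repeat(32);
    // 33 distinct Thing-like pairs, all producing the same input stream.
    let streams: Vec<Vec<u8>> = (0..=s.len())
        .map(|i| stream_without_sentinel(&s[..i], &s[i..]))
        .collect();
    assert!(streams.windows(2).all(|w| w[0] == w[1]));
}
```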
What sort of hash functions adhere to this requirement? I assume `FnvHasher` is out, by virtue of not being a keyed cryptographic function (list of what I mean).
I think we can fix this in a “perfect” way by having the hash map automatically use one-shot hashing when possible.
For example if the key is String, it’s hashed in one go without terminator. If it’s a compound type (containing Strings etc), it uses the current method.
Hashing in one go has more upsides than just skipping the terminator; it also means the handling for partial blocks can be omitted in SipHash for that case.
Edit: Although, the above has been attempted and no good solution was found yet.
We unfortunately can't use oneshot hashing for any type where the oneshot version would differ from the non-oneshot version. Notably this includes strings, since we append the 0xff byte, and primitive slices, since we prepend the length.
You run into breakage when a user has defined their own type which can borrow to a &str (think InternedString). If we add a oneshot hash implementation to &str and use it in HashMap and HashSet, the user’s code would suddenly fail to successfully look up things if InternedString is missing an identical oneshot implementation.
We also can't realistically change it, for the same reason we run into with oneshot: user-defined types that have to hash the same way as a `&str` because they're linked via `Borrow`.
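That `Borrow` link can be sketched like this, with a minimal stand-in for the `InternedString` mentioned above (the type and its name are hypothetical):

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(PartialEq, Eq)]
struct Interned(String);

impl Borrow<str> for Interned {
    fn borrow(&self) -> &str { &self.0 }
}

impl Hash for Interned {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // The Borrow contract: this must feed the hasher exactly the same
        // stream as str's Hash impl, or lookups by &str silently fail.
        self.0.as_str().hash(state)
    }
}

fn main() {
    let mut m: HashMap<Interned, i32> = HashMap::new();
    m.insert(Interned("foo".to_string()), 1);
    // get::<str> works only because the two Hash impls agree byte-for-byte.
    assert_eq!(m.get("foo"), Some(&1));
}
```

If `HashMap` switched its `&str` keys to a terminator-free one-shot hash, this lookup would stop finding the entry unless `Interned` changed in lockstep.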
But then you still have collisions between e.g. `([0xff], [])` vs `([], [0xff])`. 0xff works specifically for strings because it is not a valid UTF-8 byte.
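A quick demonstration of that collision, modeling the hypothetical terminated byte-slice stream directly:

```rust
// If [u8] borrowed str's trick (a 0xff terminator instead of a length
// prefix), these two tuples would feed the hasher the same stream,
// because 0xff is a perfectly valid byte-slice element.
fn terminated(parts: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for p in parts {
        out.extend_from_slice(p);
        out.push(0xff);
    }
    out
}

fn main() {
    let a: Vec<&[u8]> = vec![&[0xff], &[]];
    let b: Vec<&[u8]> = vec![&[], &[0xff]];
    assert_eq!(terminated(&a), terminated(&b)); // both are [0xff, 0xff, 0xff]
}
```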
Oneshot hashing is when you have a single byte buffer you need to hash and can give it to the hasher all at once. It can be more efficient than iterative hashing since you don’t need to carry a bunch of state around. You can also pick an algorithm based on the input size to maximize throughput.
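To illustrate the "pick an algorithm based on the input size" point, here is a sketch of a hypothetical one-shot entry point that dispatches on length. The FNV-1a constants are the published 64-bit parameters, and the 20-byte threshold echoes the cutoff mentioned earlier; as noted above, actually shipping such a hybrid is still research territory.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// FNV-1a (64-bit), written inline for illustration.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// Hypothetical one-shot entry point: with the whole buffer in hand, the
// implementation is free to choose a strategy by length, which a
// streaming Hasher cannot do.
fn hash_oneshot(bytes: &[u8]) -> u64 {
    if bytes.len() < 20 {
        fnv1a(bytes)
    } else {
        let mut h = DefaultHasher::new();
        h.write(bytes);
        h.finish()
    }
}

fn main() {
    // Short input takes the cheap path, long input the SipHash path.
    assert_eq!(hash_oneshot(b"foo"), fnv1a(b"foo"));
    let long = [0u8; 64];
    let mut h = DefaultHasher::new();
    h.write(&long);
    assert_eq!(hash_oneshot(&long), h.finish());
}
```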
For tuples, how about first hashing the lengths of all the variable-length sub-components together, before any of their contents? Wouldn't this defeat any attempts to trigger a collision?
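Modeled on raw byte streams, that lengths-first scheme would look like this (a sketch, not an actual `Hash` impl):

```rust
// Emit every part's length first, then every part's contents. Any two
// distinct tuples then produce distinct streams, because the length
// block pins down how the content block is partitioned.
fn lengths_first(parts: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for p in parts {
        out.extend_from_slice(&(p.len() as u64).to_le_bytes());
    }
    for p in parts {
        out.extend_from_slice(p);
    }
    out
}

fn main() {
    // The classic prefix-shift collision is gone: the length blocks
    // (6, 3) vs (3, 6) diverge before any contents are hashed.
    let x: [&[u8]; 2] = [b"foobar", b"baz"];
    let y: [&[u8]; 2] = [b"foo", b"barbaz"];
    assert_ne!(lengths_first(&x), lengths_first(&y));
}
```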
(The tuple collision problem reminds me of the issue of unambiguously serializing data. Perhaps some techniques in this space could be useful?)