Help wanted: fast hash maps in std


#1

Wouldn’t it be nice if Rust hash maps were faster? Everybody says so. The libs team generally agrees that std should provide hash maps that are fast in cases where today’s HashMaps with the default hasher (SipHash 1-3) are not, but the libs team does not have the capacity to work on this directly right now. This is your invitation to help make that happen.

A suitable design for this is not clear, and this is a contentious issue, so a motivated person will need to work with interested parties to come up with a viable design, then take it through the RFC process.

Some likely constraints:

  • A solution that removes from the default HashMap Rust’s long-established strong protection against collision-based attacks is unlikely to be accepted, for backwards-compatibility reasons if nothing else, so the most likely path forward is to add additional hashers in some form to std.
  • The old plan for this depended on default type parameter fallback, so that e.g. let map: HashMap<_, _, FnvState> = HashMap::new() would create FNV hashmaps while let map: HashMap<_, _> = HashMap::new() would continue to create SipHash hashmaps. Unfortunately, today HashMap::new is only implemented for the default hasher and changing that appears to be massively breaking, so this is unlikely to be workable; not to mention that default type parameter fallback is unstable. As a result it will only be possible to instantiate HashMap<_, _, CustomState> with the default method or some other custom constructor.
  • It’s desirable to have freedom to change the hashing algorithms used in std, so naming should be semantic instead of based on the algorithm. Today’s hasher is called DefaultHasher, a new one might be called e.g. FastHasher or FastShortKeyHasher.
  • It is primarily short keys (< 20 bytes) that are slow with SipHash 1-3. SipHash 1-3 performance with medium-length keys is competitive. With long keys SipHash trails algorithms like xxHash, but long keys may not be a common case in hash maps.
  • Previous investigations into the characteristics of various hash algorithms can be found here.
  • Hybrid implementations that switch algorithms based on key length sound attractive but seem to be firmly in the realm of research so don’t seem prudent right now.

Solutions might entail a new FastHasher type and possibly new FastHashMap and FastHashSet types or typedefs.
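To make the "additional hashers alongside the default" path concrete, here is a minimal sketch. FnvHasher here is a hand-written FNV-1a stand-in for whatever fast hasher std might actually grow; none of these names are a proposed API.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A minimal FNV-1a hasher, purely for illustration.
struct FnvHasher(u64);

impl Default for FnvHasher {
    fn default() -> Self {
        FnvHasher(0xcbf2_9ce4_8422_2325) // FNV offset basis
    }
}

impl Hasher for FnvHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

type FnvState = BuildHasherDefault<FnvHasher>;

fn main() {
    // `HashMap::new` is tied to the default hasher, so a custom
    // hasher goes through `default` (or `with_hasher`) instead:
    let mut map: HashMap<&str, u32, FnvState> = HashMap::default();
    map.insert("key", 1);
    assert_eq!(map["key"], 1);
}
```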

So that’s it. The door is open. We just need to solve some design problems to step through. Be the change you want to see in the std.

cc @library_subteam


#2

In parallel, would the libs team be amenable to exploring a change to the implementations of Hash for str and [u8]? They currently both hash their contents plus a sentinel value, so that hashes of tuples or structs containing different strings do not collide.

For example, the current impls make it so that, with a struct Thing(&'static str, &'static str), a Thing("foobar", "baz") and a Thing("foo", "barbaz") do not have the same hash output. Unfortunately, that means hashing "foo" makes two calls into the Hasher (one for the bytes, one for the sentinel), slowing it down.

Changing this behavior would mean single strings would only call hash once, and be faster. Structs and tuples that include multiple string fields would have the possibility of hitting a hash collision, but they still should not eq each other. It would shift the cost from “every single string, all the time”, to “when a struct with multiple strings has a collision, handle the collision”. This change appears to prioritize the common case, leaving the slow case to the “quite unlikely case”.
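A small sketch of the trade-off using std's DefaultHasher. Without the sentinel the two tuples produce identical byte streams, which is a guaranteed collision; with the current tuple + str impls the streams differ and the hashes differ in practice.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Concatenating the raw bytes with no sentinel between parts makes
// ("foobar", "baz") and ("foo", "barbaz") feed identical streams:
fn hash_concat(parts: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts {
        h.write(p.as_bytes()); // no sentinel between parts
    }
    h.finish()
}

// The current std impls: each &str writes its bytes plus a 0xff sentinel.
fn hash_tuple<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

fn main() {
    // Identical byte streams: a guaranteed collision.
    assert_eq!(hash_concat(&["foobar", "baz"]), hash_concat(&["foo", "barbaz"]));

    // The sentinel keeps the streams (and, in practice, the hashes) distinct.
    assert_ne!(hash_tuple(&("foobar", "baz")), hash_tuple(&("foo", "barbaz")));
}
```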


#3

This change appears to prioritize the common case, leaving the slow case to the “quite unlikely case”.

Unfortunately, that “unlikely case” is trivially exploitable. That is, it would be trivial to construct a massive set of structs that all have the same hash even if a secure hash algorithm were used.


#5

So it’s easy to write and use something like this:

    fn hashmap_with_default_hasher<K: Eq + Hash, V, H: Hasher + Default>()
        -> HashMap<K, V, BuildHasherDefault<H>>
    {
        HashMap::default()
    }

but unfortunately this can’t be implemented directly on HashMap because that won’t let you specify H directly as a type parameter.
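A self-contained version of the helper with a usage example. DefaultHasher stands in for a third-party hasher here, since it is the only concrete Hasher + Default type available in std.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

fn hashmap_with_default_hasher<K: Eq + Hash, V, H: Hasher + Default>()
    -> HashMap<K, V, BuildHasherDefault<H>>
{
    HashMap::default()
}

fn main() {
    // H is named explicitly; K and V are inferred from use.
    let mut map = hashmap_with_default_hasher::<&str, u32, DefaultHasher>();
    map.insert("answer", 42);
    assert_eq!(map["answer"], 42);
}
```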


#6

I actually don’t find this too bad:

    let mut map = HashMap::<_, _, FnvState>::default();

The fnv crate (and every other crate providing a hasher) just needs to expose a pub type FnvState = BuildHasherDefault<FnvHasher>;
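Spelled out with a stand-in (DefaultHasher here, since the fnv crate isn't available in std), the pattern looks like this:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// Standing in for `pub type FnvState = BuildHasherDefault<FnvHasher>;`
// from the fnv crate.
pub type FnvState = BuildHasherDefault<DefaultHasher>;

fn main() {
    let mut map = HashMap::<_, _, FnvState>::default();
    map.insert(1u8, "one");
    assert_eq!(map[&1], "one");
}
```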


#7

What about just a type alias FnvHashMap<K, V> = ... ?
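One way the elided alias could look, again with DefaultHasher standing in for the fnv crate's FnvHasher:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// Hypothetical alias; DefaultHasher stands in for FnvHasher.
pub type FnvHashMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

fn main() {
    // The alias hides the third type parameter entirely:
    let mut map: FnvHashMap<&str, i32> = FnvHashMap::default();
    map.insert("a", 1);
    assert_eq!(map["a"], 1);
}
```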


#8

What sort of hash functions adhere to this requirement? I assume FnvHasher is out, by virtue of not being a keyed cryptographic function (list of what I mean).


#9

I think we can fix this in a “perfect” way by having the hash map automatically use one-shot hashing when possible.

For example if the key is String, it’s hashed in one go without terminator. If it’s a compound type (containing Strings etc), it uses the current method.

Hashing in one go has more upsides than just skipping the terminator: it also means that SipHash's handling of partial blocks can be omitted for that case.

Edit: Although, the above has been attempted and no good solution was found yet.


#10

What if the default Hasher would stay the same and with a builder, the user could change the hasher?

I think an elegant way would be an optional type parameter, or a new version of new.

For example to create a fast hash table for u8:

let map = HashMap::with_hasher(|e: u8| e % 64)
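For reference, the snippet above is pseudocode: std's existing with_hasher takes a BuildHasher, not a closure over the key. A sketch of the real API, with DefaultHasher standing in for a user-chosen hasher:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

fn main() {
    // `with_hasher` accepts any `BuildHasher`, which is how the user
    // swaps in a different algorithm today.
    let mut map = HashMap::with_hasher(BuildHasherDefault::<DefaultHasher>::default());
    map.insert(1u8, "one");
    assert_eq!(map[&1], "one");
}
```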

#11

@bluss

We unfortunately can’t use oneshot hashing for any type where the oneshot version would be different than the non-oneshot version. Notably this includes strings since we add the 0xff byte and primitive slices since we prepend with the length.

You run into breakage when a user has defined their own type which can borrow to a &str (think InternedString). If we add a oneshot hash implementation to &str and use it in HashMap and HashSet, the user’s code would suddenly fail to successfully look up things if InternedString is missing an identical oneshot implementation.
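The contract in question shows up in ordinary lookups: HashMap::get hashes the borrowed form, so the lookup only works because String's Hash impl defers to str's. A small sketch:

```rust
use std::collections::HashMap;

fn main() {
    let mut map: HashMap<String, i32> = HashMap::new();
    map.insert("key".to_string(), 1);

    // `get` takes &str here and hashes it directly. A user type that
    // borrows to &str but hashes differently (e.g. an InternedString
    // without a matching oneshot impl) would silently fail this lookup.
    assert_eq!(map.get("key"), Some(&1));
}
```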

@dns2utf8

I think this is what you’re looking for? https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.with_hasher


#12

Why is it important to continue doing those things?


#13

It avoids trivial collisions between things like ("foo", "bar") and ("foob", "ar").


#14

We also can’t realistically change it, for the same reason we run into with oneshot: user-defined types have to hash the same way as a &str because they’re linked via Borrow.


#15

I wanted to point out a relevant idea and commentary from the 2017 planning thread:

@comex, @wycats, maybe you can continue discussion here?


#16

Wouldn’t the solution be more generic if it was implemented as part of the hashing of a tuple rather than as part of String?

For example:

    // (A sketch of a tuple Hash impl writing the sentinel itself:)
    fn hash<S: Hasher>(&self, state: &mut S) {
        let (ref a, ref b) = *self;
        a.hash(state);
        state.write_u8(0xff);
        b.hash(state);
        state.write_u8(0xff);
    }

#17

Just curious, what’s meant by ‘oneshot’?


#18

But then you still have collisions between e.g. ([0xff], []) vs ([], [0xff]). 0xff works specifically for strings because that’s not a valid UTF-8 byte.
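A sketch of that collision: writing a 0xff sentinel after each byte-slice part still produces identical byte streams for ([0xff], []) and ([], [0xff]), and DefaultHasher is deterministic, so identical streams are a guaranteed collision.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

fn hash_parts(parts: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts {
        h.write(p);
        h.write_u8(0xff); // sentinel after each part
    }
    h.finish()
}

fn main() {
    // Both sides feed the stream [0xff, 0xff, 0xff]:
    assert_eq!(hash_parts(&[&[0xff], &[]]), hash_parts(&[&[], &[0xff]]));
}
```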


#19

Oneshot hashing is when you have a single byte buffer you need to hash and can give it to the hasher all at once. It can be more efficient than iterative hashing since you don’t need to carry a bunch of state around. You can also pick an algorithm based on the input size to maximize throughput.
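The distinction, made concrete with std's DefaultHasher. Note the Hasher trait itself does not promise that different splittings of the input hash the same, but std's SipHash-based DefaultHasher is chunk-independent in practice:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

fn main() {
    // Streaming: the hasher must buffer partial blocks between calls.
    let mut streaming = DefaultHasher::new();
    streaming.write(b"hello ");
    streaming.write(b"world");

    // One-shot: the whole buffer is handed over at once, so the
    // implementation can skip the partial-block bookkeeping (and
    // could even pick an algorithm based on the input size).
    let mut oneshot = DefaultHasher::new();
    oneshot.write(b"hello world");

    assert_eq!(streaming.finish(), oneshot.finish());
}
```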


#20

I hadn’t considered that, but that’s because my initial suggestion was something that the current API can’t do, for example:

    fn hash<S: Hasher>(&self, state: &mut S) {
        let (ref a, ref b) = *self;
        let mut a_s = S::new(); // `Hasher` has no `new`; this doesn't exist currently.
        a.hash(&mut a_s);
        state.write_u64(a_s.finish());
        let mut b_s = S::new(); // This doesn't exist currently.
        b.hash(&mut b_s);
        state.write_u64(b_s.finish());
    }

There are probably other implications of this that I’m not aware of, but it does solve the problem.


#21

For tuples, how about first hashing the lengths of all the variable-length subcomponents together, before any of their contents? Wouldn’t this defeat any attempt to trigger a collision?

(The tuple collision problem reminds me of the issue of unambiguously serializing data. Perhaps some techniques in this space could be useful?)
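A sketch of the length-prefix idea (a hypothetical scheme, not std's current behavior), using Hasher::write_usize for the lengths. It disambiguates the ([0xff], []) vs ([], [0xff]) case that the sentinel alone could not:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hash the lengths of all variable-length parts first, then their contents.
fn hash_with_len_prefix(parts: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts {
        h.write_usize(p.len());
    }
    for p in parts {
        h.write(p);
    }
    h.finish()
}

fn main() {
    // The length prefixes (1, 0) vs (0, 1) make the streams differ,
    // so the hashes differ in practice:
    assert_ne!(
        hash_with_len_prefix(&[&[0xff], &[]]),
        hash_with_len_prefix(&[&[], &[0xff]])
    );
}
```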