Unstable hash architecture


#1

Hashing in Rust is slow. The current hashing scheme tends to produce very small chunks, which results in a lot of overhead compared to the actual hashing, which leads to a performance hit on the order of 25x for normal usage. A faster hashing algorithm makes no difference:

test hash::xxh::tests::bench_64k_oneshot                   ... bench:     16567 ns/iter (+/- 973) = 3955 MB/s

test hash::xxh::tests::bench_long_str                      ... bench:       213 ns/iter (+/- 6) = 2093 MB/s
test hash::sip::tests::bench_long_str                      ... bench:       905 ns/iter (+/- 12)

test hash::xxh::tests::bench_str_under_8_bytes             ... bench:        53 ns/iter (+/- 3) = 56 MB/s
test hash::sip::tests::bench_str_under_8_bytes             ... bench:        48 ns/iter (+/- 2)

But what if we relaxed the requirements to allow large chunks more often? Hash the binary representation instead of the semantics. This would make the hash result depend on architecture (endianness), compiler decisions (order of fields, padding) as well as unsafe shenanigans (assume padding bytes are zero), but it could make common cases a whole lot faster.

For a motivating example, right now a Vec<u16> can only be hashed one u16 at a time because the bytes might need to be swapped on big endian, which forces the worst possible code path we can take, even though to_le() is a no-op on little endian (i.e. most Rust installations). The compiler cannot be expected to optimize this away without arbitrary rewrite rules. Under the new scheme, the data would be hashed as a single &[u8], allowing the hasher to do its job with no overhead.

Security would not suffer because we still use secure algorithms, in fact it would force an attacker to gather additional data.

Obviously, the stable hash would remain as either a prominent option or the default, but I expect most people would take the speed if the docs gave them the choice.


#2

Padding bytes aren’t zero in real-world Rust AFAIK.


#3

Padding bytes are undefined memory.


#4

The drafts I’ve been doing for output iterators could help performance a lot with the hashers, especially in the Vec scenario as it could lead to better code generation

Doing what you’re talking would violate everything Rust stands for


#5

It could theoretically be useful to have a way to indicate that a particular struct’s padding bytes should be zeroed, in order to allow this.


#6

What kind of types does this affect? Why doesn’t &[primitive_type], str and String hash in one shot? There’s no alignment or padding involved as far as I’m aware.


#7
  • All integers including u8 observe endianness, so they must be hashed individually.
  • strings are hashed in one shot, plus a trailing canary.
  • structs and tuples are hashed by individual fields
  • Box, Rc and references are hashed by the contained value: (**self).hash(state);
  • &[T] is hashed as the length as canary, followed by the individual elements. There isn’t even a special case for &[u8].
  • Option and Result do canary + contained value, there’s no generic impl for enums
  • raw pointers are not dereferenced.

This is hardcoded in HashMap, we’d need higher-kinded types to parametrize over the hashing scheme. A faster solution will need to fork the entire thing into a separate library and tell people to implement the custom hashing scheme on their own. Maybe provide some macros, but #[deriving] seems out of the question.


#8

What do you mean by this? AFAIK, the hashmap just stores a hasher and feeds any hashing requests directly to that, hashmap doesn’t have any particular knowledge about how hashing works.


We could do a similar trick to the one that Haskell does with Show, and have a hash_slice(&[Self], &mut S) type on Hash; this would have a default impl that’s just iterating, but types like u8 and other primitives can overload it to achieve highest performance. The Hash impl for types like &[T] and Vec<T> would then call hash_slice rather than iterating themselves.

This is slightly restrictive in that it’s only providing extra performance for &[] but I don’t think it’s so bad: most hashers will only get a speed-up with contiguous data, and data is contiguous if and only if getting a &[] is possible.

(I suppose we could have hash_iter (in addition to, or replacing, hash_slice, I don’t know).)


#9

You can always provide a wrapper struct around your data, and impl Show the way you want. Such as `struct HashVec(pub Vec<u16>)`.


#10

What I meant was that HashMap has the knowledge that hashing works through std::hash::Hash:

use hash::{Hash, Hasher, RandomSipHasher};

impl<K: Eq + Hash<S>, V, S, H: Hasher<S>> HashMap<K, V, H> {

and Hash is the bad guy here because I can’t say the keys should be K: my::UnstableHash<S>.

But now I realized I’m dumb because I could simply implement non-recursive hashing for any key type if we had some compiler help with padding. Is it guaranteed that rustc will always choose an unpadded ordering for structs when one exists?


#11

No, there’s no guarantees at all, and, in practice (for now), Rust just uses the declaration order.

That doesn’t help with a general scheme for hashing Vec<u8>, Vec<u32> etc. efficiently.


#12

Why rust requires the hash code to be portable across little and big endian? If you put that aside all primitive slices (thus Vec, str and String) can be hashed in one call. In fact anything else will impose a huge slowdown (as you already stated).


Stability of hash values
#13

There could be a method implemented alongside hash for zeroing all padding, if the key type has any. However, hashing should work for immutable structs.