I've been working on optimizing hash-based data structures recently, and I think I've found an important pattern that Rust's current Hash/Hasher traits cannot express or support. We have good support for performant and cryptographic hashing of large objects, but very little for low-latency hashing of small objects, which is what, in my experience, hashmaps are most often used for.
I've written a post covering the problem: Thoughts on Rust hashing | purplesyringa's blog. If this reads like an offensive critique, I'm sorry; I can assure you that wasn't my intention.
TL;DR: I've found that the current design forces the Hasher to consume objects of arbitrary structure without knowing that structure in advance, leading to suboptimal behavior on tasks as "simple" as hashing a struct containing three integers (or, indeed, hashing arrays of small objects).
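To make the TL;DR concrete, here's a minimal illustration (`Triple` and `CountingHasher` are names I made up for this sketch): the derived Hash impl for a three-integer struct hands the Hasher three separate 4-byte writes, and the Hasher has no way to know up front that the whole input is just 12 contiguous bytes.

```rust
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct Triple {
    a: u32,
    b: u32,
    c: u32,
}

// A toy Hasher that only counts write calls. It shows what the derived
// impl feeds the hasher: one small write per field (via the default
// write_u32, which forwards to write), never a single 12-byte block.
#[derive(Default)]
struct CountingHasher {
    writes: usize,
}

impl Hasher for CountingHasher {
    fn write(&mut self, _bytes: &[u8]) {
        self.writes += 1;
    }
    fn finish(&self) -> u64 {
        self.writes as u64
    }
}
```

A streaming Hasher has to treat each of those writes as an independent chunk, which is exactly the overhead the blog post is about.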
The bummer here is that the Java-style hashCode-like API would kind of support this use case better, but it has its own share of problems. I didn't manage to find a new architectural design (let alone something backwards-compatible) to fix this, so I'd be interested to hear whether anyone knows a better approach.
Then you can impl MyHasher specifically for your type. Unfortunately, I don't believe replacing Hasher with MyHasher would be backwards-compatible for collections, though I can't pinpoint exactly why.
It seems to me that if you had comprehensive compile-time reflection, you could get all the benefits (except probably compile times). You could then introspect the data layout to determine where the padding is and whether it is safe to hash a block of data in one go or not.
Maybe we could go part of the way by improving the derive(Hash) macro to do this reflection in const context, using size_of to check whether we're just hashing a bunch of primitives whose sizes add up to the size of the overall struct.
But that too likely is going to drive up compile times.
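A rough sketch of what that const check could look like (`Point`, `HAS_NO_PADDING`, and `hash_point` are hypothetical names, not an existing API): if the field sizes sum to the struct size, there is no padding, and every byte of the struct is initialized, so it can be hashed as one block.

```rust
use std::mem::size_of;

#[repr(C)]
struct Point {
    x: u32,
    y: u32,
    z: u32,
}

// No padding iff the field sizes add up to the overall struct size.
// This is the kind of check a derive(Hash) could evaluate in const.
const HAS_NO_PADDING: bool =
    size_of::<Point>() == size_of::<u32>() + size_of::<u32>() + size_of::<u32>();

fn hash_point(p: &Point, hasher: &mut impl std::hash::Hasher) {
    if HAS_NO_PADDING {
        // Sound here because the check above proved there are no padding
        // bytes, so every byte of the struct is initialized integer data.
        let bytes = unsafe {
            std::slice::from_raw_parts(p as *const Point as *const u8, size_of::<Point>())
        };
        hasher.write(bytes); // one 12-byte write instead of three 4-byte ones
    } else {
        use std::hash::Hash;
        p.x.hash(hasher);
        p.y.hash(hasher);
        p.z.hash(hasher);
    }
}
```

Note the branch is resolved at compile time, so the fallback path costs nothing when the struct is padding-free.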
I got to thinking a bit more about this. One point I don't see addressed here or in the blog is enums.
It seems to me that this could quickly get extremely complicated and convoluted to do optimally with enums. Any proposed solution should consider this case, as enums are (IMO) one of the flagship features of Rust[1], and we don't want to discourage using them for performance reasons.
I have a C++ background, and the type safety and modelling power offered by enums is truly astounding compared to what you can do in C++. To me it is the main feature (together with borrow checking/ownership) that makes Rust safe and ergonomic. ↩︎
The obvious way to hash sum types is to emit the discriminant and then the variant, padding it with zeroes (or whatever static garbage) such that all variants take the same amount of space.
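A minimal sketch of that scheme (`Shape` and `hash_shape` are illustrative names, not a proposed API): every variant writes the same number of bytes, with the smaller variant zero-padded up to the size of the largest.

```rust
use std::hash::Hasher;

enum Shape {
    Circle(f32),
    Rect(f32, f32),
}

// Fixed-width hashing: emit the discriminant, then the payload, padded
// with zeroes so every variant contributes exactly one u8 + two u32s.
fn hash_shape(s: &Shape, h: &mut impl Hasher) {
    match s {
        Shape::Circle(r) => {
            h.write_u8(0);            // discriminant
            h.write_u32(r.to_bits()); // payload
            h.write_u32(0);           // zero padding up to the largest variant
        }
        Shape::Rect(w, hgt) => {
            h.write_u8(1);
            h.write_u32(w.to_bits());
            h.write_u32(hgt.to_bits());
        }
    }
}
```

The upside is that the hasher always sees the same input length regardless of the variant, which is friendly to block-oriented hashes.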
Mixing a small discriminant into the state can also be achieved by simply XORing pseudo-random data into the state instead of running the block hash (just like hashing bool could return either 0 or a random integer). How good of an idea this is depends heavily on the underlying hash, but IMO fast hashes might like this. Ideally this possibility would be supported by the API.
Yes, but consider Option<NonZeroU32>: there is no reason it shouldn't hash as quickly as a u32, since thanks to the niche optimization the layout is the same. Hashing both should ideally be the same speed.
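The layout claim is easy to check: None occupies the niche (the all-zero bit pattern that NonZeroU32 can never hold), so the Option needs no separate discriminant byte.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

// Niche optimization: Option<NonZeroU32> is exactly as large as u32,
// because None reuses the forbidden zero bit pattern as its encoding.
const _: () = assert!(size_of::<Option<NonZeroU32>>() == size_of::<u32>());
```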
The obvious way to hash sum types is to emit the discriminant and then the variant
I don't think this works for Option<String>, which 1. doesn't have a discriminant byte (as Vorpal pointed out for Option<NonZeroU32>) and 2. can't be hashed by looking at just its stack-allocated data.
Also, if an enum has one variant much larger than the rest, hashing a smaller variant shouldn't pay the cost of hashing the padded-out bytes, at least not when the cost of branching is smaller than the savings from skipping them; where that trade-off lands probably depends on the size of the enum and the size difference between the variants. (But see above: in the case of something like Result<T, String>, you have to branch anyway unless T is also String.)
Maybe the compiler could choose what kind of hashing code to emit based on whether the payload types are Copy or not? I think there is a lot of room in the design space to explore here.
Padding bytes can never be hashed. Reading them is UB. They have undefined contents, and LLVM is permitted to change them at its whims (e.g. by not copying the padding bytes when moving a struct around).
They are also poison, which means that LLVM can assume you never read them and optimise accordingly.
EDIT: Now that I think about it, they might be undef rather than poison. Unsure. Either way, you can't hash them.