[Idea] Faster derive(Hash) for types without padding

Oh, interesting! So does bytemuck: ByteHash in bytemuck - Rust

I tried adding benchmarks to bytemuck to see if their impl could be improved further, and how it compares to mine. I'm getting rather nonsensical results: changing the bytemuck impl also changes the timings for #[derive(Hash)] rather dramatically. The byte-based approach is still faster than #[derive(Hash)], but I'm starting to doubt whether I can really trust these micro-benchmarks.

@Shnatsel:

> changing the bytemuck impl also changes the timings for #[derive(Hash)]

This sounds like you might need to set codegen-units = 1 and perhaps enable lto in your release/benchmark profiles for consistent benchmarking results? Inlining across the hash traits can be a little unreliable.
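
For reference, that would look something like this in Cargo.toml (a sketch; `[profile.bench]` inherits from `[profile.release]`, so setting the release profile usually covers criterion-style benchmarks too):

```toml
[profile.release]
# Compile the crate as a single codegen unit, so inlining
# decisions are consistent between runs.
codegen-units = 1
# Enable (fat) link-time optimization for cross-crate inlining.
lto = true
```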


I've also drafted an alternative FastHash trait that avoids the external dependencies, and instead determines if a type is safe to hash as raw bytes via a HASH_AS_BYTES associated const in the FastHash trait. If size of type == sum(size of subtypes), and all subtypes are HASH_AS_BYTES, then we can safely assume there are no padding bytes, references, or pointers.

It's an experimental draft based on portable-hash, definitely not complete, but the approach is hopefully clear: GitHub - hoxxep/fast-hash: Experimental faster hash traits for rust

I think it would still benefit from a separate AsBytes trait to avoid deriving unsafe code in end-user crates, though.
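
For concreteness, here's a minimal sketch of how I read the HASH_AS_BYTES idea; the trait shape and the hand-expanded Point impl below are hypothetical stand-ins for what a derive would generate, not the actual fast-hash code:

```rust
use std::hash::Hasher;

/// Hypothetical stand-in for the FastHash trait described above.
trait FastHash {
    /// True only if the type contains no padding bytes, references, or pointers.
    const HASH_AS_BYTES: bool;

    fn fast_hash<H: Hasher>(&self, state: &mut H);
}

impl FastHash for u32 {
    const HASH_AS_BYTES: bool = true;

    fn fast_hash<H: Hasher>(&self, state: &mut H) {
        state.write_u32(*self);
    }
}

struct Point {
    x: u32,
    y: u32,
}

// What a derive might expand to for `Point`.
impl FastHash for Point {
    // No padding iff every field is HASH_AS_BYTES and the field
    // sizes sum to the size of the whole struct.
    const HASH_AS_BYTES: bool = <u32 as FastHash>::HASH_AS_BYTES
        && <u32 as FastHash>::HASH_AS_BYTES
        && std::mem::size_of::<Point>()
            == std::mem::size_of::<u32>() + std::mem::size_of::<u32>();

    fn fast_hash<H: Hasher>(&self, state: &mut H) {
        if Self::HASH_AS_BYTES {
            // Sound only because HASH_AS_BYTES rules out padding,
            // references, and pointers.
            let bytes = unsafe {
                std::slice::from_raw_parts(
                    (self as *const Self).cast::<u8>(),
                    std::mem::size_of::<Self>(),
                )
            };
            state.write(bytes);
        } else {
            // Fall back to field-by-field hashing.
            self.x.fast_hash(state);
            self.y.fast_hash(state);
        }
    }
}
```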

This approach as a whole seems slightly limited, since it's restricted to unpadded types. For example, std::time::Duration is a struct { u64, u32 }, which is padded from 12 bytes of data out to 16, so it wouldn't benefit from this and is likely better served by the foldhash-style integer sponge/buffering.

I'm not sure the added complexity would be acceptable for merging into the standard library, and in the niche cases where utmost performance is necessary, users can implement Hash themselves by viewing the type as a byte slice?
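
For example, a hand-written byte-based impl could look roughly like this; Pixel is a made-up type, and this uses bytemuck's bytes_of (with its derive feature) rather than a raw transmute, so the no-padding requirement is enforced by the Pod derive at compile time:

```rust
use std::hash::{Hash, Hasher};

// Hypothetical example type: four u8 fields, so no padding.
// The Pod derive refuses to compile for types with padding,
// references, or pointers.
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
#[repr(C)]
struct Pixel {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

impl Hash for Pixel {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // One Hasher::write call over the whole struct instead of
        // four separate u8 writes.
        state.write(bytemuck::bytes_of(self));
    }
}
```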

I think you posted the wrong link; that one goes to the bytemuck docs. In any case, I took a look at your currently posted benchmark code and, though I don't think this is the problem, I think you're using black_box wrong; you have

```rust
fn hash_it(value: impl Hash, mut hasher: impl Hasher) -> u64 {
    black_box(&value);
    black_box(&hasher);
    // actually hash the value
    value.hash(&mut hasher);
    let result = hasher.finish();
    // black-box the output for accurate measurement
    black_box(result);
    result
}
```

but what you should be doing is passing the values of interest through the black_box function, like this:

```rust
fn hash_it(value: impl Hash, hasher: impl Hasher) -> u64 {
    // pass both inputs through black_box so the optimizer
    // can't exploit knowledge of their values
    let mut hasher = black_box(hasher);
    black_box(value).hash(&mut hasher);
    black_box(hasher.finish())
}
```

It might be that this doesn’t matter in the current implementation, but doing it this way is important in principle to avoid the optimizer making use of knowledge about the value. It's also shorter!


Right, I meant to post this link: GitHub - Shnatsel/bytemuck at hash-benchmarks

I've applied the black_box change you proposed and pushed it to derive_hash_fast. A handful of benchmarks regressed, but it didn't change the overall picture much.

Well, I don't have further suggestions, other than a general warning from experience that microbenchmarks can be extremely “noisy”, not in the measurement but in what is being measured: changes can affect how the benchmark code runs in ways that are essentially uncorrelated with what you meant to do, or with how the change will affect performance when part of an actual application.

To make your benchmark less “micro”, you could try benchmarking the use of a HashMap with and without your derive, as sketched below. This will reduce the size of any effect, but also make the results more “realistic” by not giving the compiler or CPU the unrealistic situation of doing nothing but hashing. (Make sure to call HashMap::with_capacity so you're not measuring the allocator too.)
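
Something along these lines, as a sketch (assuming criterion; `Key` is a made-up stand-in for the type whose Hash impl you're comparing):

```rust
use std::collections::HashMap;
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

// Stand-in key type; substitute the type under test.
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct Key {
    a: u64,
    b: u64,
}

fn bench_map(c: &mut Criterion) {
    let keys: Vec<Key> = (0u64..1000).map(|i| Key { a: i, b: i ^ 0xdead }).collect();

    c.bench_function("hashmap_insert_lookup", |b| {
        b.iter(|| {
            // Pre-size the map so the allocator isn't part of the measurement.
            let mut map = HashMap::with_capacity(keys.len());
            for k in &keys {
                map.insert(*k, ());
            }
            for k in &keys {
                black_box(map.get(k));
            }
        })
    });
}

criterion_group!(benches, bench_map);
criterion_main!(benches);
```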

By the way, in case you missed it, I also sent you a PR with some small improvements to the library.


Yep, I added a whole other benchmark suite that measures operations on a HashSet instead of raw hashing performance.

And it shows that maybe my padding optimizations weren't beneficial after all.
