[Idea] Faster derive(Hash) for types without padding

Hi everyone, I am new here.

After reading a blog post about hashing in Rust, I was thinking about how the situation could be improved. My first thought was that if Rust had an equivalent of C++'s std::has_unique_object_representations, then derive(Hash) could use it to generate potentially faster implementations for types that do not have padding. Something like this code:

use std::hash::{Hash, Hasher};

/// Reinterpret an object as a byte slice.
///
/// SAFETY: the caller must ensure that T has no padding or other
/// uninitialized bytes, so that every byte of the object is initialized.
pub unsafe fn obj_as_bytes<T>(obj: &T) -> &[u8] {
    let byte_ptr = std::ptr::from_ref(obj).cast::<u8>();
    let byte_len = std::mem::size_of_val(obj);
    unsafe { std::slice::from_raw_parts(byte_ptr, byte_len) }
}

/// SAFETY: same requirement as obj_as_bytes, for every element of the slice.
pub unsafe fn obj_slice_as_bytes<T>(obj_slice: &[T]) -> &[u8] {
    let byte_ptr = obj_slice.as_ptr().cast::<u8>();
    let byte_len = std::mem::size_of_val(obj_slice);
    unsafe { std::slice::from_raw_parts(byte_ptr, byte_len) }
}

struct Foo(u16, u16, char);

impl Hash for Foo {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // SAFETY: Foo is (u16, u16, char), 8 bytes with no padding and no pointers.
        state.write(unsafe { obj_as_bytes(self) });
    }

    fn hash_slice<H: Hasher>(data: &[Self], state: &mut H)
    where
        Self: Sized,
    {
        state.write(unsafe { obj_slice_as_bytes(data) });
    }
}

Would it be possible? Does it break any Rust rules? Am I missing something?

I tested the idea using different hashers and the results look promising. Below are bench results for the std DefaultHasher, with Foo being struct Foo(u16, u16, char).

test bool_hash_slice_custom_unsafe  ... bench:          13.72 ns/iter (+/- 0.05)
test bool_hash_slice_derived        ... bench:          36.19 ns/iter (+/- 0.05)
test char_hash_slice_custom_unsafe  ... bench:          17.92 ns/iter (+/- 0.29)
test char_hash_slice_derived        ... bench:          38.66 ns/iter (+/- 0.11)
test foo_hash_one_custom_unsafe     ... bench:           9.41 ns/iter (+/- 0.03)
test foo_hash_one_derived           ... bench:          16.88 ns/iter (+/- 0.17)
test foo_hash_slice_1_custom_unsafe ... bench:          13.84 ns/iter (+/- 0.04)
test foo_hash_slice_1_derived       ... bench:          21.14 ns/iter (+/- 0.06)
test foo_hash_slice_2_custom_unsafe ... bench:          15.24 ns/iter (+/- 0.02)
test foo_hash_slice_2_derived       ... bench:          32.12 ns/iter (+/- 0.34)
test foo_hash_slice_4_custom_unsafe ... bench:          19.66 ns/iter (+/- 0.02)
test foo_hash_slice_4_derived       ... bench:          54.43 ns/iter (+/- 1.14)
test foo_hash_slice_8_custom_unsafe ... bench:          24.08 ns/iter (+/- 0.08)
test foo_hash_slice_8_derived       ... bench:         103.34 ns/iter (+/- 1.87)
5 Likes

The types would also have to contain no pointers (so no hashing Vec<T> or &str) and no custom Hash logic.

Those are some nice performance wins. It would be great if this could be done. But it does seem to require a lot from the compiler. Reflection almost?

Yes, that would need to be implemented by means of compiler magic.

It looks like the zerocopy crate has implemented its own mechanism for detecting whether it is safe to interpret an object as a byte slice: zerocopy::IntoBytes. Only a small subset of it would be needed for the purpose of derive(Hash), but it's cool to know that it would also be useful for serialization.
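
For instance, I think a manual impl on top of it could look roughly like this today (a sketch, assuming zerocopy 0.8's derives, where as_bytes also requires the Immutable bound):

use std::hash::{Hash, Hasher};
use zerocopy::{Immutable, IntoBytes};

// Under repr(C), (u16, u16, char) is 2 + 2 + 4 bytes with no padding,
// so the IntoBytes derive accepts it.
#[derive(IntoBytes, Immutable)]
#[repr(C)]
struct Foo(u16, u16, char);

impl Hash for Foo {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // as_bytes() comes from zerocopy::IntoBytes; no unsafe needed here.
        state.write(self.as_bytes());
    }
}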

I did not think about interior mutability. So there would need to be two traits automatically implemented by the compiler:

trait NoInteriorMutability {}

trait SafeToInterpretAsBytes : NoInteriorMutability {
    fn as_bytes(&self) -> &[u8];
}
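
With those, the derived impl could (hypothetically, this is just how I imagine the generated code looking) become:

impl Hash for Foo {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // No unsafe needed: the compiler would only implement
        // SafeToInterpretAsBytes for types where this is sound.
        state.write(self.as_bytes());
    }
}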

I know next to nothing about Rust's internals, so I have no idea whether this would be difficult to implement, but it definitely seems possible considering zerocopy already has it in the form of an automatically derived unsafe trait.

Isn't the core requirement here basically impl Copy and no (or being able to skip) padding?

Almost. Copy covers a broader set of types: pointers, references, and f{32,64,128} are Copy, but they do not qualify for what I am describing.

Good question, what are the requirements? For the purpose of derive(Hash) I think the requirements are the following (a couple of examples follow the list):

  • no interior mutability
  • flat object representation (no indirection of any kind)
  • contiguous object representation (no padding)
  • unique object representations (for all values of T: (a == b) ↔ (a.as_bytes() == b.as_bytes()))
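
For example (hypothetical types, just to illustrate the last two points):

// Qualifies: 2 + 2 + 4 bytes, no padding, and equality is byte-for-byte equality.
struct Qualifies(u16, u16, char);

// Fails "contiguous": under repr(C) there are 3 padding bytes after the u8,
// so two equal values may still differ in their (uninitialized) padding bytes.
#[repr(C)]
struct Padded(u8, u32);

// Fails "unique": 0.0_f32 == -0.0_f32 but their bytes differ, and NaN != NaN
// despite identical bytes, so (a == b) ↔ (bytes equal) does not hold.
struct HasFloat(f32);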

This would presumably be good to have for Eq too? And for Ord, for that matter, except that, uff, little-endianness kind of ruins it, unless the fields are also rearranged from least to most significant.

Another idea: if there are padding bytes, AND them to zero. This would require LLVM magic (freeze?) to strictly avoid reading uninit memory even though the program would never see the uninit bytes in practice.

Being able to SIMDify Eq (and maybe Ord) would be super awesome. On x86-64s without AVX-512 masked loads it could only be easily done for vector-width types, unfortunately. Maybe other architectures have better support for loading arbitrary-width vectors?

3 Likes

You are right, this would benefit Eq too, but by how much would need to be benchmarked.

For Hash it seems to improve performance not only for slices but also for a single object (well, at least for a single instance of struct { u16, u16, char }).

More results from my computer for various hashers:

// ahash::AHasher
test bool_hash_slice_custom_unsafe  ... bench:           4.10 ns/iter (+/- 0.06)
test bool_hash_slice_derived        ... bench:           7.90 ns/iter (+/- 0.24)
test char_hash_slice_custom_unsafe  ... bench:           5.06 ns/iter (+/- 0.12)
test char_hash_slice_derived        ... bench:           8.19 ns/iter (+/- 0.23)
test foo_hash_one_custom_unsafe     ... bench:           2.20 ns/iter (+/- 0.03)
test foo_hash_one_derived           ... bench:           2.82 ns/iter (+/- 0.06)
test foo_hash_slice_1_custom_unsafe ... bench:           4.21 ns/iter (+/- 0.02)
test foo_hash_slice_1_derived       ... bench:           3.78 ns/iter (+/- 0.13)
test foo_hash_slice_2_custom_unsafe ... bench:           4.10 ns/iter (+/- 0.11)
test foo_hash_slice_2_derived       ... bench:           6.01 ns/iter (+/- 0.09)
test foo_hash_slice_4_custom_unsafe ... bench:           5.12 ns/iter (+/- 1.04)
test foo_hash_slice_4_derived       ... bench:          11.51 ns/iter (+/- 0.13)
test foo_hash_slice_8_custom_unsafe ... bench:           7.17 ns/iter (+/- 0.07)
test foo_hash_slice_8_derived       ... bench:          25.33 ns/iter (+/- 0.14)
// rustc_hash::FxHasher
test bool_hash_slice_custom_unsafe  ... bench:           1.97 ns/iter (+/- 0.01)
test bool_hash_slice_derived        ... bench:           3.83 ns/iter (+/- 0.05)
test char_hash_slice_custom_unsafe  ... bench:           2.97 ns/iter (+/- 0.05)
test char_hash_slice_derived        ... bench:           4.68 ns/iter (+/- 0.03)
test foo_hash_one_custom_unsafe     ... bench:           1.07 ns/iter (+/- 0.00)
test foo_hash_one_derived           ... bench:           1.04 ns/iter (+/- 0.00)
test foo_hash_slice_1_custom_unsafe ... bench:           2.01 ns/iter (+/- 0.00)
test foo_hash_slice_1_derived       ... bench:           1.80 ns/iter (+/- 0.00)
test foo_hash_slice_2_custom_unsafe ... bench:           2.01 ns/iter (+/- 0.01)
test foo_hash_slice_2_derived       ... bench:           2.86 ns/iter (+/- 0.01)
test foo_hash_slice_4_custom_unsafe ... bench:           2.86 ns/iter (+/- 0.05)
test foo_hash_slice_4_derived       ... bench:           5.45 ns/iter (+/- 0.01)
test foo_hash_slice_8_custom_unsafe ... bench:           4.54 ns/iter (+/- 0.01)
test foo_hash_slice_8_derived       ... bench:          11.80 ns/iter (+/- 0.03)
// rapidhash::RapidHasher
test bool_hash_slice_custom_unsafe  ... bench:           5.40 ns/iter (+/- 0.01)
test bool_hash_slice_derived        ... bench:          16.04 ns/iter (+/- 0.01)
test char_hash_slice_custom_unsafe  ... bench:           6.52 ns/iter (+/- 0.05)
test char_hash_slice_derived        ... bench:          17.47 ns/iter (+/- 0.07)
test foo_hash_one_custom_unsafe     ... bench:           1.59 ns/iter (+/- 0.01)
test foo_hash_one_derived           ... bench:           3.49 ns/iter (+/- 0.02)
test foo_hash_slice_1_custom_unsafe ... bench:           5.50 ns/iter (+/- 0.01)
test foo_hash_slice_1_derived       ... bench:           7.52 ns/iter (+/- 0.03)
test foo_hash_slice_2_custom_unsafe ... bench:           5.49 ns/iter (+/- 0.02)
test foo_hash_slice_2_derived       ... bench:          12.96 ns/iter (+/- 2.31)
test foo_hash_slice_4_custom_unsafe ... bench:           6.53 ns/iter (+/- 0.06)
test foo_hash_slice_4_derived       ... bench:          26.10 ns/iter (+/- 0.46)
test foo_hash_slice_8_custom_unsafe ... bench:           7.47 ns/iter (+/- 0.10)
test foo_hash_slice_8_derived       ... bench:          53.30 ns/iter (+/- 0.38)

// xxhash_rust::xxh3::Xxh3Default
test bool_hash_slice_custom_unsafe  ... bench:          10.53 ns/iter (+/- 0.04)
test bool_hash_slice_derived        ... bench:          42.40 ns/iter (+/- 0.13)
test char_hash_slice_custom_unsafe  ... bench:          12.84 ns/iter (+/- 0.14)
test char_hash_slice_derived        ... bench:          43.04 ns/iter (+/- 0.24)
test foo_hash_one_custom_unsafe     ... bench:           6.88 ns/iter (+/- 0.05)
test foo_hash_one_derived           ... bench:          15.94 ns/iter (+/- 0.04)
test foo_hash_slice_1_custom_unsafe ... bench:          10.70 ns/iter (+/- 0.08)
test foo_hash_slice_1_derived       ... bench:          22.54 ns/iter (+/- 0.04)
test foo_hash_slice_2_custom_unsafe ... bench:          11.49 ns/iter (+/- 0.10)
test foo_hash_slice_2_derived       ... bench:          34.15 ns/iter (+/- 0.04)
test foo_hash_slice_4_custom_unsafe ... bench:          12.98 ns/iter (+/- 0.09)
test foo_hash_slice_4_derived       ... bench:          59.19 ns/iter (+/- 0.05)
test foo_hash_slice_8_custom_unsafe ... bench:          14.80 ns/iter (+/- 0.09)
test foo_hash_slice_8_derived       ... bench:         117.02 ns/iter (+/- 0.45)
// foldhash::fast::RandomState
test bool_hash_slice_custom_unsafe  ... bench:           2.06 ns/iter (+/- 0.01)
test bool_hash_slice_derived        ... bench:          11.62 ns/iter (+/- 0.18)
test char_hash_slice_custom_unsafe  ... bench:           4.36 ns/iter (+/- 0.03)
test char_hash_slice_derived        ... bench:          10.55 ns/iter (+/- 0.16)
test foo_hash_one_custom_unsafe     ... bench:           0.85 ns/iter (+/- 0.00)
test foo_hash_one_derived           ... bench:           0.96 ns/iter (+/- 0.00)
test foo_hash_slice_1_custom_unsafe ... bench:           2.20 ns/iter (+/- 0.00)
test foo_hash_slice_1_derived       ... bench:           6.20 ns/iter (+/- 0.03)
test foo_hash_slice_2_custom_unsafe ... bench:           2.20 ns/iter (+/- 0.00)
test foo_hash_slice_2_derived       ... bench:           9.09 ns/iter (+/- 0.06)
test foo_hash_slice_4_custom_unsafe ... bench:           4.36 ns/iter (+/- 0.02)
test foo_hash_slice_4_derived       ... bench:          15.85 ns/iter (+/- 0.11)
test foo_hash_slice_8_custom_unsafe ... bench:           5.82 ns/iter (+/- 0.03)
test foo_hash_slice_8_derived       ... bench:          29.17 ns/iter (+/- 0.15)

That exists (unstable): Freeze in std::marker - Rust

From what I read it will likely not be stabilised under that name.

One of the bytemuck traits seems like it might be a fit here? Probably NoUninit in bytemuck - Rust
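
A rough sketch of how that could be used manually (assuming a recent bytemuck with the derive feature, where bytes_of and cast_slice are bounded on NoUninit; u32 stands in for char here to keep to the basic impls):

use std::hash::{Hash, Hasher};
use bytemuck::NoUninit;

// The NoUninit derive checks at compile time that there are no padding bytes.
#[derive(Clone, Copy, NoUninit)]
#[repr(C)]
struct Foo(u16, u16, u32);

impl Hash for Foo {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(bytemuck::bytes_of(self));
    }

    fn hash_slice<H: Hasher>(data: &[Self], state: &mut H)
    where
        Self: Sized,
    {
        state.write(bytemuck::cast_slice(data));
    }
}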

1 Like

This idea has been floated around a few times, but one of the issues is that the compiler doesn't actually know anything about padding or pointers at the time it is expanding this derive macro. This would be useful for other things (see "perfect derive"), but it is a very hard compiler change.

It could maybe be done as a specialized macro with an unsafe trait; I wanted to do something like this for rustc's StableHash trait.

Could you share the code you used for benchmarking?

I want to see if I can come up with a macro that would expand into these optimized implementations by leveraging bytemuck and/or zerocopy and publish it as a crate.

1 Like

The initial prototype is up:

https://crates.io/crates/derive_hash_fast

Should be fully functional for Hash. Supports both zerocopy and bytemuck, at your option.

Completely undocumented so far. TODO: documentation and benchmarks.

Sure, here's the code I used to benchmark the idea: GitHub - Sawyer47/unsafe_hash_test

I imagined it being implemented so that derive(Hash) expands into code that somehow dispatches at compile time into an appropriate implementation. An equivalent of this C++ code:

if constexpr (std::has_unique_object_representations_v<T>) {
    // impl1
} else {
    // impl2
}   

Similar to how the println! macro expands into its internal implementation that cannot be used directly.

2 Likes

A quite ugly but functional (and soon maybe sound) way could be something like this:

Have a PerfectHashAligned (bikesheddable ofc) marker trait that is implemented by the compiler. (Alternatively this could also be Align<A = 0> or a bytemuck Uninitialized equivalent etc.)

Then for every single derive we do

impl DerivenHash for .. {
    // the normal field-by-field impl goes here
    fn hash<H: Hasher>(&self, state: &mut H) { /* .. */ }
}

And then have a specialized blanket impl:

impl<T: DerivenHash> Hash for T {
    default fn hash<H: Hasher>(&self, state: &mut H) {
        <T as DerivenHash>::hash(self, state)
    }
}

impl<T: DerivenHash + PerfectHashAligned> Hash for T {
    fn hash<H: Hasher>(&self, state: &mut H) { /* hash the raw bytes */ }
}

This is a bit verbose and goes through some indirection, and it is currently unsound when lifetimes are around, but I am working on at least fixing the soundness holes of specialization when used in a context such as the above.

That looks wrong to me. It won't properly handle pointers and other indirection, where you need to hash the pointed-to values.

The existing Hash impls for raw pointers only hash the address and metadata; if your type contains pointers and wants deep hashing, it has to impl Hash manually. References are deep-hashed, however, so the Rust equivalent of has_unique_object_representations would have to return false for types containing references.

2 Likes

Okay, the first shippable version is up: derive_hash_fast

I've added another optimization: instead of always using state.write(), which is a slow codepath that needs to be able to handle data of any size, I dispatch to the appropriate fixed-size impl such as write_u64. Structs whose size doesn't match a specific integer width are padded up to the next larger one.

To avoid a performance cliff for structs above 128 bits (the maximum fixed-size write is u128), I turn structs of up to 512 bits into a fixed number of u128 writes, with the last one padded.
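
Roughly, the chunking works like this (an illustrative sketch of the idea, not the crate's actual generated code):

use std::hash::Hasher;

// Feed a no-padding byte view to the hasher as fixed-size writes,
// zero-padding the final chunk, instead of one variable-length write().
fn write_as_u128_chunks<H: Hasher>(state: &mut H, bytes: &[u8]) {
    let mut chunks = bytes.chunks_exact(16);
    for chunk in &mut chunks {
        state.write_u128(u128::from_ne_bytes(chunk.try_into().unwrap()));
    }
    let rest = chunks.remainder();
    if !rest.is_empty() {
        let mut last = [0u8; 16];
        last[..rest.len()].copy_from_slice(rest);
        state.write_u128(u128::from_ne_bytes(last));
    }
}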

You can clone the repo and run cargo bench; I think the benchmarks speak for themselves.

I'll formally announce this crate soon, probably tomorrow.

2 Likes

There is ByteHash in zerocopy - Rust, which covers this, except maybe for the write_* optimization (at least it's not mentioned in the docs).