Make Hasher portable? (mini RFC)


#1

std::hash::Hasher (code) is not currently portable: it doesn’t fix the Endianness of input. This is fine for HashMap but useless for anything needing portable output (cryptographic hashes, checksums, file systems, etc. (in my case: a reproducible mechanism for seeding PRNGs from arbitrary input)).

Making it portable (with some fixed “personality” i.e. arbitrary Endianness) would only require adding .to_le() or .to_be() on the input (although of course the implementation is still responsible for making sure finish() returns a portable result). I don’t think this would be considered a breaking change because the results are not currently portable and this is currently only useful consistent results are not required anyway (excepting possibly platform specific stuff).

Would there be much performance impact? .to_le() is a no-op on x86 anyway. ARM is normally also LE. There are some BE architectures; I have little idea what overhead an extra byte-swap would have in practice.

Limitation: portable hash functions must specify expected Endianness of input. So choosing an arbitrary Endianness here would not make the trait compatible with all hash functions.

Motivation: make the Hasher trait more widely useful.

Note: the output type is still limited to u64. This isn’t necessarily a problem though since alternative output functions can be added, as part of the trait or not. The following would be a very flexible option:

/// Size of state in bytes (i.e. optimal output size)
fn state_size(&self) -> u32;
/// Fill the output buffer from the state. If the output is longer
/// than the state, then the hash function should be used to mutate
/// the state as many times as necessary to generate the required
/// length of data, as with a random number generator.
fn fill_output(&mut self, buf: &mut [u8]);

#2

BE for short-length multi-byte integers is an aberration; all architectures become LE at some bigint granularity.

To understand why this is the case, imagine a multi-precision increment of [0xff; 256] performed BE, where outputting the result for each byte from high to low has to be suspended until the least-significant 0xff byte is incremented and its carry-out status known, with the awaited carry then propagating back across all the higher-order 0xff bytes, low to high.

As @hsivonen has pointed out, BE is going away. Thus the endianness for any portable hasher needs to be LE.


#3

How about usize? How can we hash it portably?


#4

I think the short version is: you can’t.

With LE, unused high bytes will result in trailing zeros. But those zeros should still affect the hash value.


#5

You can’t, because the size of a usize value varies by architecture and specific implementation options. For example, a usize value on RISC-V can be 4 B, 8 B or 16 B, depending on the selected ISA base option. See https://en.wikipedia.org/wiki/RISC-V#ISA_Base_and_Extensions.

A usize value needs to be cast to an agreed fixed size. Given the above example, the minimum longer-term cast would be to u128. However, for the common case where usize is used as an index, u64 would probably be sufficient.


#6

Actually, I believe I made a mistake here. Hash functions are usually defined in terms of an input message as a byte slice, and output another byte slice. They may use multi-byte integers internally, but Endianness of these conversions is an internal detail, not part of the API.

Hasher converts various integer types to byte slices and this job is external to the hash function and language-specific.


#7

What’s the precise change you want to make to the code? Changing the default implementations of write_u16 etc. would not be enough to achieve your goal, because an impl can override those defaults (and indeed, some like int-hash already do). Adding a documented requirement to the trait would make existing impls retroactively incorrect, but would not actually fix them.

Currently it is possible to create both portable and non-portable hashers; portable ones have to override the write_* methods, and may be slower on some architectures. This seems like a reasonable situation. To make portable hashers easier to write and more usable in generic code, perhaps we could have a PortableHasher trait that works only on bytes, and an impl<T: PortableHasher> Hasher for T that implements write_u16 etc. in the portable fashion.

[edited to tweak the PortableHasher idea]


#8

I don’t mean to be dismissive but… is this even the right trait to be using for this? I’ve always throught of std::hash as being for the same purpose Java’s Object.hashcode(). I’ve always mentally separated explicitly non-portable “ephemeral” hashes and portable digests. I think this distinction is valuable.

Until convinced otherwise, I think the right solution is to note in the documentation that std::hash is not meant for cryptographic digests and checksums, and instead point the user to an external crate with a Digest trait that features all of the knobs such code would require (far more than any implementor of Hash cares about).


#9

Well, there is some rationale:

  1. Having to re-implement all the default methods is tedious
  2. It isn’t at all obvious from the documentation whether an implementation is portable or not
  3. It’s also not obvious whether the overhead is actually significant — i.e. whether it matters that this is “slower on some architectures” — if you have any actual data on that it would be really nice

I don’t know. As you say, if it shouldn’t be used for portable hashing then the documentation should say so.


#10

Yes, I agree with all three points, and trait PortableHasher is one idea to address them. I’m still not sure what your proposed change is, though. How else would you make it clear, without breaking backward compatibility, which hashers are portable?


#11

Possibly relevant: I have a project where I need portable hashes, so instead of using Hash, I call bincode::serialize_into() using a custom Write type that’s a wrapper for SipHasher. https://github.com/elidupree/time-steward/blob/master/src/deterministic_random_id.rs#L15-L66


#12

Change all default implementations for multi-byte integers to call .to_le() on the input value before conversion to bytes.


#13

Didn’t realise I hadn’t linked digest-hash (doc) which is roughly what we want.


#14

Change all default implementations for multi-byte integers to call .to_le() on the input value before conversion to bytes.

This will cause some non-portable hashers to become portable, but not all (because not all of them use the default implementations). So consumers will still need to check the docs or read the source to determine if a given hasher is portable.


#15

True, implementations can deliberately be non-portable. But if the default is to be portable then non-portable implementations (which don’t clearly say they are deliberately so) can be regarded as buggy.

I still don’t know whether making Hasher portable is the right option, but it should be documented. And if making it portable turns out to be cheap, then IMO it’s worthwhile just to avoid potential headaches.

The cost of not doing so is that crates like metrohash, twox-hash and fxhash area not portable. Interestingly only the first of these mentions the fact, and the last one overrides many of the Hasher methods but without making them portable.

I suppose for a build tool, SQL statement caching, and reqwest replay, non-portable hashes are probably acceptable. For partitioning in Kafka I guess they just happen not to hit portability problems.


#16

Note that Hash not being portable is a good thing – .Net even took a breaking change to make .GetHashCode non-portable for many core types as part of security improvements.

If the docs are insufficiently clear, I’m sure PRs would be welcome to state the nonportability directly.


#17

Is there a link you could post about this? It sounds very interesting but Google is completely failing to find it for me.


#18

I can’t find any blogs about the change, but you can see the intentional non-portability in the reference source around the FEATURE_RANDOMIZED_STRING_HASHING define: https://referencesource.microsoft.com/#mscorlib/system/string.cs,836

Was probably around the time of CVE-2011-3414, an instance of “Collisions in HashTable May Cause DoS Vulnerability” in ASP.Net.

The docs for GetHashCode current explicitly mention

You should never persist or use a hash code outside the application domain in which it was created, because the same object may hash across application domains, processes, and platforms.

(though the version picker is broken so I can’t check old ones.)


#19

There’s keyed hashing (as our HashMap already uses) and there’s portable hashing. I’m not sure that there’s actually a real use case for anything in between though? It sounds like what you’re talking about is using keyed hashing.


#20

We are using the Hasher trait in the compiler for platform independent hashes for quite a while now: https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/librustc_data_structures/stable_hasher.rs#L78-L166