Stability of hash values

The Rust stdlib documentation for the Hash trait is a high-level overview. I think it’d be useful to have a few more details on what is (and isn’t) guaranteed.

Specifically, I’m interested in the stability of hash values across a) executions of the same binary, b) executions of the same code compiled with different versions of Rust, and c) executions of the same code compiled on different architectures.

My previous assumption was that if I create a value in exactly the same way, it will hash to the same value in a, b and c above. However, this is clearly not a safe universal assumption: as a data point, Python has a ‘hash randomisation’ option to make hash values unpredictable between executions.

According to this post, Rust seems to try to be consistent at present(?)

I don’t mind if the final documentation says “you should only rely on hash values to be the same within a single execution of a binary” (though I might raise an RFC for ‘ConsistentHash’ or similar); I just think it’d be useful if it were explicit.


Hashers generally expose enough control to produce identical hashes between program executions, but this may not be their default behaviour. In particular, the default Hasher’s (SipHash) default constructor will produce a random state that varies between executions, but it also exposes a constructor that lets you seed the state in a deterministic way.

XxHash (as used by rustc internally), conversely, is always deterministic. As far as I know, neither of these hashers has any architecture- or language-specific behaviour; they’re entirely deterministic with respect to their seed.

Basically, this is implementation-defined behaviour, and you should check the documentation of the particular Hasher for what it does. I don’t believe the trait can or should care about it, as it’s largely just there for HashMap to consume. It seems reasonable for individual hashers to document this, as the algorithms they implement are usually well specified.
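To make that concrete, here’s a minimal sketch of feeding a value’s Hash impl into std’s DefaultHasher, whose algorithm is deliberately left unspecified (hash_with_default is a made-up helper; a seedable hasher, like the SipHash constructor mentioned above, would additionally let you pin the state explicitly):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// hash_with_default is a made-up helper: it runs a value's Hash impl
// through a fresh instance of std's (unspecified) default hasher.
fn hash_with_default<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Equal values always hash equally within one execution.
    assert_eq!(hash_with_default(&(1u32, "abc")), hash_with_default(&(1u32, "abc")));

    // Whether this number survives a recompile, a new compiler version or a
    // different architecture is up to the hasher (and the Hash impl), not
    // something the Hash trait itself promises.
    println!("{:016x}", hash_with_default(&(1u32, "abc")));
}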


What happened in this particular case was that a type changed which bytes it emitted to the hashing API. Because of this, any hash algorithm would (with high probability) have produced a different hash.

I think if we focus on the question “which bytes do I represent this value by for hashing?”, it’s pretty clear that changing the answer is a breaking change, and depending on the type it’s an important one. I can’t myself disregard it as unimportant, but it does depend on the library and type affected.

Link: brief rust-libs chat about this

Going off topic a bit, but this surprised me. So I've just read the documentation and nightly code and also tried a simple program...but I can't make a hash vary between executions! The docs say the default constructor uses constant keys? I assume I've misunderstood your message.

This seems reasonable. However, there are some decisions that rustc itself makes that are relevant. For example:

#[derive(Hash)]
struct FooBar {
    pub i: i32,
    pub j: i32,
}

The hash of this value depends on the order in which the two members are hashed. A particularly unfriendly Sufficiently Smart Compiler could switch them around for optimisation purposes if there were no guarantee of ordering, sabotaging repeatable hashing unless you manually implement the trait for every struct you might want to hash.
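For what it’s worth, the manual escape hatch mentioned above would look roughly like this: a hand-written impl pins the order in which the fields reach the hasher, whatever the derive might do.

use std::hash::{Hash, Hasher};

struct FooBar {
    pub i: i32,
    pub j: i32,
}

// Hand-written impl: fields reach the hasher in exactly this order,
// regardless of how #[derive(Hash)] might choose to traverse them.
impl Hash for FooBar {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.i.hash(state);
        self.j.hash(state);
    }
}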

Would a sentence like (roughly) “Rust will always pass the fields of a struct to the hasher in the same order; how values are then handled across different architectures depends on the implementation of the hasher” be uncontentious?

Whoops, sorry! Got the defaults backwards.

I am quite sure the hasher calls that #[derive(Hash)] generates are unstable and implementation-defined (i.e. they may change between minor versions).

FWIW, std::hash::hash is marked unstable, so that is explicit.

What changed here was the impl Hash for ParticularType. Whether that is stable or not is entirely a library issue. I think you should treat any change to the Hash impl as breaking.

Edit: Remember that the Hash trait is not a trait for hashing, but for deciding which bytes to pass to an arbitrary hash algorithm.
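One way to picture that split (ByteLog is made up purely for illustration): a “hasher” that merely records what it is given shows exactly which bytes a Hash impl emits, and those bytes are what changed in this case.

use std::hash::{Hash, Hasher};

// Not a real hasher: it just records the byte stream a Hash impl emits.
#[derive(Default)]
struct ByteLog(Vec<u8>);

impl Hasher for ByteLog {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // irrelevant here; we only care about the recorded bytes
    }
}

fn main() {
    let mut log = ByteLog::default();
    (42u16, "hi").hash(&mut log);
    // The exact bytes (and their order) are what a Hash impl defines;
    // note that integer values are fed in as native-endian bytes by the
    // default write_uXX forwarding.
    println!("{:?}", log.0);
}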

Thanks for the comments. So to summarise my understanding of the two faces of this:

The first face concerns which situations a Hasher impl is deterministic in (and therefore how freely algorithm changes may be applied). I (now!) accept that this is defined by each hasher and should be documented there. E.g. you could come up with a hasher that’s only deterministic within a particular run, or within a particular thread.

The second face concerns which situations a Hash impl (i.e. the generation of the bytes) is deterministic in. This is defined by Rust (when implemented automatically) or by the creator of a struct. There is a documented requirement for Eq types (k1 == k2 -> hash(k1) == hash(k2)). I think rustc sets the tone here: if (once it’s stabilised) an automatically generated Hash impl is documented as potentially changing between minor versions, libraries are likely to mimic that and not treat it as a breaking change. If Hash is only really intended for HashMap, then this route seems like a good way to discourage other uses?
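To tie the two faces together, here’s a sketch with the hashing side fully pinned (a toy 64-bit FNV-1a, purely illustrative and not anything std or rustc actually uses), so that any remaining variation can only come from the Hash side, i.e. the bytes the derived impl emits:

use std::hash::{Hash, Hasher};

// A deliberately simple, fully deterministic hasher (64-bit FNV-1a),
// used only to illustrate the second "face": with the algorithm pinned,
// the only remaining source of change is the bytes the Hash impl emits.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325)
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

#[derive(Hash)]
struct Key {
    id: u32,
    name: String,
}

fn main() {
    let mut h = Fnv1a::new();
    Key { id: 7, name: "seven".to_string() }.hash(&mut h);
    // Stable as long as neither the derived Hash impl (field order, byte
    // encoding) nor this hasher changes. Note that integer fields reach
    // write() in native endianness via the default write_u32 forwarding,
    // so even a fixed algorithm can vary across architectures through
    // the Hash side.
    println!("{:016x}", h.finish());
}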
