Hashes for caches: is it dirty?

A couple of times I have come across a situation where I want to test whether some set of data is dirty, implying something else needs an update. Easy: just generate a hash-sum, and check later whether the new hash is different. For this type of thing it may even be acceptable if the hash-test isn't completely accurate.

I'm not even going to get into performance here (libstd's hash function being more secure than required). The problems I have:

  • f32 and f64 don't implement hash. Understandable, since there are distinct binary values which compare equal, and equal binary values which compare different. But still, for my uses neither of these cases really matter (simply casting the binary value should be fine).
  • HashMap doesn't implement Hash. I'm guessing this is because to generate a stable hash value one would have to sort all entries first, thus there is a significant performance cost. All the same, if I need the hash I either need to do this or switch to a different collection (BTreeMap is a perfectly reasonable alternative much of the time).

The point is: it feels like there is scope for a separate hasher system here (and/or tweaks to the existing hasher, such as support for HashMap), and I don't see one (after a quick search through crates.io).

1 Like

A second point is that it is tempting to work around the "issue" of MyType not being able to derive Hash because it contains a f32 value with a manual impl of Hash, but this ignores the rationale behind not implementing f32 for Hash in the first place: if, later, MyType ends up being used as a HashMap key, there could be problems.

So: the ecosystem should have something like a permissive_hash crate?

There is a partially-acceptable alternative for the lazy: derive(serde::Serialize), then create a hash from the serialised value.

Note that this strategy will miss some updates: Hash can quickly tell you that a value has definitely changed, but you always have to double-check with Eq if the two hashes are the same.

In particular, this is always a conforming Hash implementation:

impl Hash for SomeType {
    fn hash<H: Hasher>(&self, state: &mut H) {}
}

While that’s an extreme example, it’s sometimes useful to trade collision rate for hash calculation speed. For instance, if your object contains the full text of a book, it could make sense to consider only the first few hundred bytes when calculating the hash.

4 Likes

std Hash is by definition only an approximation, and it's only 64 bits, so it's not suitable for telling whether something has definitely changed.

You could define your own Hash-like trait, and use a cryptographically secure hash with it. Then you could be pretty certain that it accurately reflects the content (with >=128 bits and non-broken cryptographic hash collisions you could bet your life on not ever seeing a collision).

Hashing bits of a float is fine. If you're worried about NaN, you could have a special case for it. But generally, if the bits have changed, then something must have touched the value. At worst you'll invalidate the cache unnecessarily, but it won't affect correctness. If you only care about changes to a float with some tolerance, you can convert it to a fixed-point integer for hashing.

But it depends how you're caching, and how much you're modifying the data. Consider just using a "dirty" flag. It can be much faster than hashing. You could then use hashing as a double-check in tests/debug mode to ensure that the dirty flag is always set whenever hash changes.

Since HashMap relies on Q: Hash + Eq, it's still impossible to use a float in a key, so one might get away with implementing Hash for f32 if that's what you're proposing — though I suspect it wouldn't be accepted.

But I'm not trying to make changes like that. I'm also not trying to impl Hash for my own types (that's easy when I don't care about the precise semantics or downstream implications). I'm asking: should we have a separate type of hasher for this?

And should we make HashMap impl Hash (despite the cost of sorting to make the result stable)?

I think the core problem you have here is that you want the complement of what Hash actually does. You want "definitely changes if it's different (but might also spuriously change)", but Hash is "definitely the same if it's the same (but might also spuriously be the same)".

Thus you fundamentally cannot do what you're trying to do using Hash.

I don't see a reason to have it in std. You can certainly have one in a crate if you find it useful.

I would consider sorting unacceptable. Even Vec: Hash is of questionable usefulness, because linear time to go over the whole thing is often wasteful -- if one is keying by a Vec, then perhaps it should be using a trie or something. But being worse than linear (and perhaps needing to allocate space in which to sort it?) seems even worse than that, and thus intolerable. (Maybe it could xor the hashes of the elements together, as that's order-invariant so could be done linearly? But it also has a bunch of known problems, so it's not clear that it's worth doing either.) That said, I think it would be fine to implement DhardyHash (where spurious differences are fine) for HashMap, since you could just iterate in HashMap order.

But really, the better answer to all this, as far as I'm concerned, is that if you want to hash an associative container, just use a BTreeMap instead of a HashMap.

3 Likes

Yes — though "might also spuriously be the same" is a fundamentally unavoidable property of any hash/checksum.

Perhaps I'm over-thinking things and should just set a "dirty" bit whenever the data gets updated — this design has a significantly higher "false positive" rate, but is simple and avoids false negatives.

1 Like

Are you pushing to the 'something else', or lazily pulling? That can have an effect on your design as well.

Also, dirty bits can suffer from an ABA problem if there is more than one process that is able to reset the dirty bit (depends on your situation, of course). In that case, just use an atomic u64 counter that you increment each time you write the main value, and cache the current value along with any other cached data. If the cached counter and dirty counter don't match, then there was an update that needs to be dealt with, which solves the ABA problem.

This LWN article has information on using even/odd status to also indicate whether an update is "in progress" (for batched updates) (see the Seqcounts section). Depending on how fine-grained your dirty status and locking mechanisms are, this may be useful to take into consideration as well.

1 Like

There are various academic papers which discuss hashing unordered lists. The short answer is adding the hashes up works well as long as your underlying hash function is good. I use this all the time, with success.

Also, I find in many circumstances a hashmap of vecs is much faster and memory efficient than a trie. Tries are really only useful if there are a few elements the vecs can take (like 8 bit chars).