Implementing Hash for HashSet/HashMap?

Is there any intrinsic reason HashSet/HashMap can’t implement Hash? Right now, its not possible to have a set of maps, or sets as keys, etc.

Are there any other extended collection crates which do add this extension?

This has come up for me because I’m working on Rust support for Thrift, and its tests have a case which creates such a structure. It’s possible its also used in real examples (particularly when interoperating with python/php/etc where maps are used much more commonly).

Thanks, J

1 Like

BTreeSet/BTreeMap implement Hash. You can use them without worries if your keys implement Ord.

Of course it is possible to make a naive implementation for HashSet/HashMap. The hard part is improving its unacceptable performance. We want Hash to work consistently for all HashMaps of same type. We want it to run in O(n) time and in O(1) space.

A naive implementation would convert maps to sorted Vecs, or to BTreeMaps. It would run in O(n log n) time and O(n) space. An improved implementation would somehow use hashes that are already stored in HashMaps to its advantage.

Yes, I considered that, but it only allows two levels of nesting, and supplying an Ord implementation isn't much easier than Hash.

The naive implementation would be O(n), and could be memoized between changes. BTreeMap's implementation is already the simplest possible:

    fn hash<H: Hasher>(&self, state: &mut H) {
        for elt in self {
            elt.hash(state);
        }
    }

Couldn't it be the same for HashMap? I guess the concern is that it would depend on element ordering, and whether logically equal maps have the same element ordering - but the implementation of PartialEq already assumes this, so that's settled.

The PartialEq implementation doesn’t depend on ordering – for each key in the first map, it checks if the second map contains it and has the same value: https://github.com/rust-lang/rust/blob/master/src/libstd/collections/hash/map.rs#L1176-L1182

Oh, right. I must have been looking at the BTreeMap implementation.

WRT Thrift, you may want to look here for some in progress support: https://github.com/terminalcloud/thrift

Yes, that effort has stalled, and I’ve picked it up to get it into an upstreamable form. The main missing piece is getting the test suite working.

Ah, great!

I forgot to mention that BTrees implement Ord. You can nest them in any number of levels, as long as all keys and values other than the values of the topmost BTreeMap implement Ord.

To have HashMaps implement Hash, we need to iterate through their contained hashes in a consistent order. We only need to access hashes to compute a hash. (All hashes are approximate, anyway.) Most of the time, logically equal maps have the same element ordering. There are two obstacles to making sure our iteration order is consistent regardless of element ordering.

First, whenever a map contains groups of similar hashes, the element ordering depends on the original order of insertion. Fortunately, in other cases, a consistent order is guaranteed by Robin Hood hashing. To consistently iterate through all hashes, we need to detect groups of similar hashes and grab them in an order that depends only on their exact values. By "similar" hashes, I mean hashes that are congruent modulo capacity -- in other words, some of their least significant bits are equal.

Second, logically equal maps can have different capacities. For maps with "inflated" capacities, we need to divide tables into parts that are equal in size to the smallest map's table. To iterate through a map in the right order, we must zip all its parts to get an iterator that visits positions that correspond to a single position in the smallest map. Collisions still must be dealt with somehow. Even with such complex iteration, there's an issue with elements that were displaced across parts. In the smallest possible map, they are displaced to the beginning of the table, just like in a circular buffer.

As you can see, this is a difficult but certainly solvable problem. In case you need help understanding Robin Hood hashing, here's a visualization: Visualization of Robin Hood hashing edit: I tried to make the description better.

2 Likes

I was going to propose something like this:

impl<K, V, S> Hash for HashMap<K, V, S>
    where K: Hash, V: Hash, S: BuildHasher
{
    fn hash<H: Hasher>(&self, state: &mut H) {
        u64 mut h = 0;

        for elt in self {
            let mut hasher = self.build_hasher();
            elt.hash(&mut hasher);
            h ^= hasher.finish();
        }

        state.write_u64(h);
    }
}

to create an order-independent hash function. I haven't even compiled this yet, let alone tested it, so maybe I've overlooked something. My main concern is whether this allows for a hash-collision attack that wasn't otherwise possible, but I haven't come up with anything.

Edit: Hm, do different instances of HashMap have different hash keys? In which case, two instances will have not have the same key ordering, since they'll effectively have different hash functions, so hashing based on the properties of Robin Hood Hashing will not work. Nor will my proposal, if the two/N HashMaps have different hash functions/hash keys in their BuildHasher instances.

@jsgf, you're right about using XOR to treat maps as sets rather than as lists of entries. I think there's no additional risk of hash-collision attacks, since maps and hashers already prevent such attacks for their keys, and your solution is based on the outer map's hasher and the inner map's key-value pairs. Since all individual hashes of key-value pairs are pseudorandom, XORing them will give a pseudorandom hash. Further, you can rely on the outer map's hasher to use the right seed, just like for any other key type, regardless of the inner map's random seed. Your proposal should compile and work when you change it to call build_hasher() on self.hasher(), and fix the syntax. (I removed my previous response because I misunderstood the gist of your solution. Sorry.)

However, this is not the most efficient solution. For one, creating and finalizing hashers adds some overhead. For keys and values of average size, hashed with either variant of SipHash, this overhead will likely be below 50% of total work, so there's no need to worry. To make it optimal, we would need all changes that I described before, which would take lots of code. Furthermore, an optimal solution would conflict with my proposal for adaptive hashing.

Hm, do different instances of HashMap have different hash keys?

I'll put it in other words: Yes, different instances of the outer map (self) can have different hash seeds, but that's a part of attack-resistant hashing. A hash computed with a specific instance of HashMap in mind is never used for anything besides that instance... we can ignore clones of that instance, because their hasher states are the same.

Right now, two distinct instances of HashMap will get distinct instances of BuildHasher, and will (in general) be at least keyed differently, or perhaps an entirely different hash function, so a given (K, V) pair will hash differently. This would result in logically identical HashMaps having different hashes, despite the order independence of HashMap::hash()'s implementation.

The only solution I can think of for this is to combine BuildHasher into Hasher so that you can create a new Hasher instance from an existing one to allow each (K, V) to be hashed independently using the outer Hasher for each, making it independent from the HashMap's own internal BuildHasher. (I prototyped this by introducing a new MakeHashable trait to create an auxiliary hashable object from a HashMap rather than making HashMap directly implement Hash so that it has access to a BuildHasher, but it was all rather messy and doesn't make it compose any better with the rest of the world.)

If we can't rely on the hash function applied to each (K,V) being commutative (add, xor, etc), then that implies that HashMap can only implement Hash when (K,V) is Ord to define an ordering, and as a result the hash() implementation must contain a gather/sort component, with O(n log n) complexity. This makes the whole thing look unappealing compared to simply using a BTreeMap which is already hashable, though I guess it depends on how much you're using it for normal map operations vs as a hashable thing.

Yes, I see this problem now. It's clearer when I think of nested maps in terms of their layers. Or when I think of Hash as an example of the visitor pattern. Your proposal visits maps and copies their hashers to pass them to the inner layer as state. So the issue is that self.hasher().build_hasher() becomes the state argument for immediate recursive calls. Hashing ultimately yields the root of a Merkle tree with each level set of siblings computed by a different hasher.

So I can think of a few solutions... a) a new MakeHashable trait as you said. I'm not sure how it works, though. b) a new variant of the Hash trait that requires hashers to be cloneable and resettable, so that we don't need any builders. c) a naive solution with O(n log n) complexity. d) my first complex solution, which reduces complexity to O(n) or more precisely O(n log n log log n) (where n is the length of an entry group). It still requires V to be Ord. e) just making all RandomStates equal. I think this may be the best one, so I'll describe it below.

RandomState has been recently changed to share a unique random seed within each system thread. Seeds are stored in TLS. Right now, your proposed code may only break for RandomState when a HashMap is sent between threads. I think the seed should be process-wide. Most languages that hash a seed made the seed process-wide.[1][2] Probably for simplicity.

Then, Hash can be implemented for HashMap<K, V, RandomState>.

Having a common seed also makes HashMaps leaner by 16 bytes.


[1] PEP 456 – Secure and interchangeable hash algorithm | peps.python.org

_Py_HashSecret_t is initialized in Python/random.c:_PyRandom_Init() exactly once at startup.

[2] Go has a global variable for the seed. runtime: use some startup randomness in the fallback hashes · golang/go@ce5cb03 · GitHub

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.