A new default Hasher for HashMap?

tkaitchuck · March 31, 2019, 6:07pm

I have been working on aHash for a while now. It’s performance has steadily improved. As a hashing algorithm for a HashMap I believe it to be significantly superior to SipHash-1-3 which is the current default.

This is my first time working on such a change so It would be good to get some advice. What is the best next step? Is there a good way to test our the real world performance impact? (Like maybe a compiler benchmark)

Does anyone have any ideas for tests, benchmarks, features, or integrations they would like to see?

Dushistov · March 31, 2019, 7:37pm

The HashMap by it self can be replaced in the near future: https://github.com/rust-lang/rust/pull/58623 , so I guess you can look at benchmarks that were used, and do the same.

bascule · March 31, 2019, 8:20pm

The README is a bit unclear about the properties you’re trying to achieve.

It’s described as “non-cryptographic hashing algorithm”, which seems to imply that it is not collision resistant, but it also simultaneously makes claims that it is “DOS resistant”. To me these claims are conflicting.

The silver bullet for preventing hashDoS is a cryptographically secure PRF with a random key. Since this construction does not fall into that category, I’m curious what the basis is for claiming it resists hashDoS. Can you provide a rationale for why you believe this is the case?

Have you run this construction by any of the authors of the previous hashDoS papers and gotten their opinion on it? I’m worried because there’s quite a history of constructions put forward as being “hashDoS resistant” which were subsequently broken when given proper scrutiny.

scottmcm · March 31, 2019, 8:27pm

Note that the compiler already doesn't use the default hasher in most places. You could swap the use of fnv for yours and see if the perf suite comes out better, but that would just be for changing rustc -- it wouldn't say much about whether the default should change, since there are many hashers faster than the default.

Centril · March 31, 2019, 9:25pm

tkaitchuck · March 31, 2019, 9:42pm

Not at all. Consider SHA-1 It's a cryptographic hash, but it's not DOS resistant, because it's trivial to brute force a collision of say the lower 20bits. So if you used SHA as your hash map's hashcode algorithm an malicious party could collect a bunch of strings that cause a partial collision and turn your hashmap into a list, effectively DOSing the CPU on your server.

So the criteria are different. DOS resistance implies that even if you know the algorithm, you can't choose in advance a bunch of keys that will have partial collisions. A cryptographic hash guarentees for example that you ought not to be able to 'undo' the hash, and work out what data produced it. AHash does NOT provide this feature. Obviously AES is a block cipher, so it can be easily reversed (knowing the key).

tkaitchuck · March 31, 2019, 10:02pm

For the AES version, the is based on the assumption that AES is strong. In particular it rests on the assumption that someone not knowing the key won't be able to generate data that will result in a high probability of collision after two rounds of AES.

Looking at the structure of AES you can immediately see why this is highly likely. The first step is an S-box lookup. If the incoming data is xored with a key the attacker cannot control the lookup path and resulting data will be totally different. This is only amplified after the shift/mix step, and a second round repeats this process.

AES has been sufficiently analyzed that I don't think any new predictable patterns will come up that will allow for exploit.

For the fallback algorithm the case is harder to make, as it is operating on primitives. So what I can say is this: I have taken the time to understand how each of the hashes was broken and did not repeat the same mistake. (I can't promise I haven't invented a whole new way of being wrong.) But I do think any attack against it will require a significantly higher level of sophistication.

One reason for this is a very simple test I wrote: I assert that every single bit in the input will end up changing a large enough number of bits, nibbles, and bytes in the output to look like a totally different value. I make a similar assertion about changing every single bit in the key. Every hasher that has been broken that I am aware of, would have failed this simple test. It's easy for an attacker to cause a collision when one bit only affects one other bit, they just have to find a way to cancel it out. But if a single bit flip affects many other bits and which ones depends on a key they cannot observe, it is much harder to attack.

bascule · March 31, 2019, 10:08pm

You're describing the properties of collision resistance and preimage resistance respecively. So I guess your claim is your construction is collision resistant, but not preimage resistant. I'd agree that would be sufficient for this particular use case, however...

For substantiating your security claims, the modern bar in cryptographic research is generally something like a security proof that your hash function is collision resistant so long as AES is a secure PRP.

tkaitchuck · March 31, 2019, 10:49pm

Oh, it absolutely holds that if a single round of AES is a secure PRP then AHash is provably collision resistant. That part is trivial. The problem is that a single round of AES is NOT a secure PRP! It is fairly distinguishable, even just visually.

bascule · March 31, 2019, 11:08pm

I'd agree that it should be fairly trivial to provide a proof your construction is secure if it's based on a PRP, as a PRP is a powerful building block for avoiding collisions. However...

...you are using AES in a particularly unusual way, which is worrisome from a security perspective.

I would suggest looking through the academic literature for a construction which already has a security proof you can build on or adapt.

Encrypting the output of a universal hash function is a fairly well-understood technique.

tkaitchuck · April 1, 2019, 5:49am

@bascule I’ve written up some of my thoughts here: https://github.com/tkaitchuck/aHash/wiki/Attacking-aHash-or-why-it’s-good-enough-for-a-hashmap Treating AES a 4 32bit SPRPs is plenty secure. As unpredictability required to prevent DOS attack is many orders of magnitude lower than what is needed in a cryptographically secure hash.

RalfJung · April 8, 2019, 2:02pm

If I understand correctly, the hash function will differ depending on whether AES is available? Are there situations where this might be a problem, such serializing a data structure on one machine and deserializing on another?

Nemo157 · April 8, 2019, 2:27pm

The default Hasher used in an instance of HashMap is already randomized through RandomState so not even two HashMaps created within the same run of a program can have hashes directly compared with each other (unless they have their BuildHasher overridden). Also RandomState provides no way to seed with a specific value, so you cannot even attempt to construct two HashMaps using the same Hasher across multiple program runs via it.

If you’re using a Hasher instance for serialization purposes you already need to be choosing a concrete implementation that guarantees reproducibility, changing the default for HashMap shouldn’t affect that.

tkaitchuck · July 2, 2019, 9:29pm

Ok, there hasn’t been any replies here in a while. Would someone on the libs team like to share their views and any advice for what would be a good next step?

jschievink · July 2, 2019, 9:47pm

I’m not on the libs team, but since this is a change potentially affecting DoS resistance of the default HashMap, perhaps an RFC would be in order?

PS: I doubt that the libs team would accept such a change given that aHash has not yet received a formal review by a cryptographer, and unlike SipHash is not established as a collision resistant hash.

tkaitchuck · July 2, 2019, 9:57pm

To be clear I am not proposing to change the semantic of DoS resistance. It is fully the intention to retain the same level of DoS resistance Sip offers. If for some edge case that isn't true, I would consider it a bug.

You are correct though that the code hasn't been reviewed by anyone other than me. Obviously to be accepted I assume it will have to go through a good amount of scrutiny, and given that we are talking about crypto it would be good for someone who has a background in it to help with review.

If anyone here would like to do a review I would be more than happy to help walk through the code and thinking behind it.

withoutboats · July 2, 2019, 9:57pm

I’m on the libs team but I’ve never been involved in the maintenance of our hashmap type, nor have I been very involved in the recent work to bring high quality crates like hashbrown and parking_lot into libstd. So my opinion is only worth so much. That said:

I think this is great and interesting work! It certainly seems to me like aHash has a lot of potential, and if everything plays out well it could end up as the default hash. But I think that it will be a slow process and involve more “proving out” before it would be accepted into std. I would recommend:

Publicize and advertise this project, establishing the properties you claim it has, and encouraging people who trust you to use it.
Encourage review of your algorithm by other people with relevant expertise, attempting to find flaws.
Publish additional benchmarks (like holistic benchmarks of lookups and inserts in std’s hash table implementation).

As confidence naturally builds over time, it becomes easier to integrate this code into std. But right now this project isn’t being used or reviewed by other people to a significant extent, making it a big leap of faith for std to start using it.

I also note that this is in essence not one but two hash algorithms: that effectively doubles the level of confidence needed for its adoption. The AES implementation seems easier to be confident in, because it builds on the established work of the AES algorithm. It seems potentially easier to get that path into std before the novel fallback algorithm.

I do want to say that this work looks awesome, and the possibility of adopting an algorithm which is DoS resistant but competitive with FxHash is exciting!

Amanieu · July 2, 2019, 10:31pm

I think a good start would be to replace the default hasher in hashbrown with aHash, as proposed in this issue. Unlike the libstd HashMap, hashbrown makes no guarantees about DoS resistance so the replacement should be perfectly safe.

bascule · September 4, 2019, 4:39am

Please, don't replace SipHash with aHash (or any PRF which has not received proper peer review)

"hashDoS reloaded" applied cryptanalysis techniques (i.e. differential cryptanalsysis) to non-cryptographic hash functions [1]. The conclusion of the cryptographers who carried out these attacks is the bar for preventing the sort of multi-collisions they managed to create is a secure PRF. Generally all non-cryptographic hash functions are vulnerable to this class of attack, and I would count aHash in the vulnerable class, especially because...
Golang already shipped something similar to "aHash" called "aeshash". This construction was not vetted by cryptographers and was vulnerable to key recovery attacks, failing to achieve the goal of hashDoS resistance. Subsequent attempts to increase round depth may still be vulnerable to such attacks. [2]
There are some promising constructions in recent peer-reviewed cryptographic literature for faster, higher-performance alternatives to SipHash. I think those should receive due consideration when looking at SipHash alternatives.

KellerFuchs · September 4, 2019, 9:39am

I strongly agree with @bascule here, and most of the relevant points have been covered in a Github issue.

In particular, @tkaitchuck claims aHash fullfills the avalanche criteria; those are statistical properties, and do not say anything meaningful about adversarial settings like hashDoS: CityHash and MurmurHash fulfilled them and yet were susceptible to universal multicollisions (as shown in hashDoS).

Given that aHash doesn't make security claims, it isn't clear it can be meaningfully cryptanalysed or how can it be used securely; as such, it shouldn't be made the default, especially in std.

Topic		Replies	Views
Help wanted: fast hash maps in std	29	6094	March 25, 2019
Global default hash function override for HashMap and HashSet libs	26	1202	August 20, 2024
Pre-RFC: bless the `type FastHashMap` pattern	92	3977	March 25, 2019
Help harden HashMap in libstd! libs	27	4393	March 25, 2019
Pre-RFC: The amortized hashing strategy ideas (deprecated)	18	4080	March 25, 2019

A new default Hasher for HashMap?

Related topics