Help harden HashMap in libstd!

It was recently discovered that the implementation of HashMap in the standard library can exhibit the kind of pathological performance typically seen in HashDoS attacks. The default implementation of HashMap is specifically designed to be resistant to this style of attack, so this is definitely a bug that needs fixing!

The specific issue here is currently purely map.extend(other_map); it isn’t actually related to SipHash at all, but rather to the fact that we expose key insertion order when iterating over a map. The pathological performance seems to be tied to the linear probing currently done by our robin hood hash maps.
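As a rough illustration of the shape of the problem (sizes and timings here are purely illustrative, not a benchmark), the reproduction looks something like this: both maps share the same hash keys, so the second map receives keys arriving roughly in bucket order.

    use std::collections::HashMap;
    use std::time::Instant;

    fn main() {
        // Build one map, then feed its contents into a fresh map in iteration
        // order. Because both maps use the same SipHash keys, the second map
        // sees keys arriving in (roughly) bucket order, which is what provokes
        // the long probe/displacement chains discussed in the issue.
        let first: HashMap<u64, ()> = (0..1_000_000u64).map(|k| (k, ())).collect();

        let start = Instant::now();
        let mut second: HashMap<u64, ()> = HashMap::new();
        second.extend(&first);
        println!("extend of 1M entries took {:?}", start.elapsed());
    }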

There are some possible mitigation strategies outlined in the issue itself at the top, some of the more promising being:

  • Go back to having per-hashmap SipHash keys so all iteration order is random (see the sketch after this list)
  • Otherwise try to randomize iteration order
  • Explore alternate implementations which aren’t susceptible to this problem (as proposed by @bluss)
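For the first option, here's a minimal sketch of what per-map keys could look like from the outside, using the deprecated-but-still-present std::hash::SipHasher and assuming the rand crate for key material (std's actual fix would live inside RandomState rather than in user code like this):

    extern crate rand; // assumption: rand supplies the per-map key material

    use std::collections::HashMap;
    use std::hash::{BuildHasher, SipHasher};

    /// Build-hasher that draws fresh SipHash keys for every map it is attached
    /// to, so no two maps share an iteration order.
    struct PerMapSipState {
        k0: u64,
        k1: u64,
    }

    impl PerMapSipState {
        fn new() -> PerMapSipState {
            PerMapSipState {
                k0: rand::random(),
                k1: rand::random(),
            }
        }
    }

    impl BuildHasher for PerMapSipState {
        type Hasher = SipHasher; // SipHasher is deprecated in newer std; used here only for illustration
        fn build_hasher(&self) -> SipHasher {
            SipHasher::new_with_keys(self.k0, self.k1)
        }
    }

    fn main() {
        let mut map = HashMap::with_hasher(PerMapSipState::new());
        map.insert("key", 1);
    }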

Are you interested in helping out here? Got some ideas of what we might be able to do to solve this? Post here!


The specific issue here is currently purely map.extend(other_map)

Not sure I agree with that. The second example I give never explicitly makes two maps interact; the second is just created from a specific ordering of the first. Your comment makes it seem like extend is the culprit, rather than a more general flaw with the open addressing scheme.


My suggestion is that in the very short term you should use per-hashmap SipHash keys. Given Rust's bravado with SipHash, it's not unlikely that people have been relying on it, so a quick fix seems essential. The change is trivial and massively reduces the attack surface.

The fancier options can happen after that at a more leisurely pace. @bluss's ordered dictionary seems like a great solution and even solves quadratic blowup with repeated naïvely implemented pops.


On a tangential note, there are other issues with the current implementation. Extremely slow probes occur even with randomly distributed hashes. Rust's HashMap fares extremely poorly with badly randomizing hashes. Secure hashing is itself slow. These are things that would be very nice to improve when investigating options. It's possible that quadratic probing or other schemes yield speed benefits even on nice workloads. A method that merely mitigates pathological behaviour may cause problems if we want adaptive hashing. Etc.

That's true for any non-prime-number capacity hashmap.

Please explain.

RH hashing and linear probing is really fast because of the predictable/controlled memory access patterns, jumping around the table will definitely hurt.

That's true for any non-prime-number capacity hashmap.

The modulo isn't that important; it removes a couple of really basic patterns but doesn't solve fundamental problems. Python's dict uses power-of-two buckets, but works well with nonrandom keys. You can read about Python's approach here.
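For reference, CPython's probe sequence (as described in the comments of dictobject.c) looks roughly like the following when transcribed to Rust. The names and table representation are illustrative, not anything libstd does:

    /// CPython-style open addressing: the `perturb` term gradually feeds the
    /// high bits of the hash into the probe sequence, so keys that collide in
    /// the low bits still spread across the (power-of-two sized) table.
    /// Assumes the table always has at least one free slot.
    fn find_slot(occupied: &[bool], hash: u64) -> usize {
        let mask = (occupied.len() as u64) - 1; // table length must be a power of two
        let mut perturb = hash;
        let mut i = hash & mask;
        while occupied[i as usize] {
            i = (i.wrapping_mul(5).wrapping_add(1).wrapping_add(perturb)) & mask;
            perturb >>= 5; // PERTURB_SHIFT in CPython
        }
        i as usize
    }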

RH hashing and linear probing is really fast because of the predictable/controlled memory access patterns, jumping around the table will definitely hurt.

Indeed, but long, cascading insertions also hurt. When I tested the current HashMap, I found that insertions into a million-long map could end up displacing 1k elements, even when testing on a random sample. Lookups often reached maximal probe counts of nearly 50 elements.

A half-way solution, like using a Python-style rehash only when the index is divisible by 16, means most probes will still get a short linear search but the values won't end up bunching up pathologically.
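A sketch of that hybrid, just to make the idea concrete (the divisor 16 and the perturbed jump mirror the description above; this is not a proposal for the actual libstd code):

    /// One step of a hybrid probe: mostly linear (cache-friendly short scans),
    /// but whenever the index lands on a multiple of 16, take a CPython-style
    /// perturbed jump so collisions cannot pile up into one long run.
    fn next_index(i: u64, perturb: &mut u64, mask: u64) -> u64 {
        if i % 16 == 0 {
            *perturb >>= 5;
            (i.wrapping_mul(5).wrapping_add(1).wrapping_add(*perturb)) & mask
        } else {
            (i + 1) & mask
        }
    }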


[quote="Veedrac, post:4, topic:4138"]Python's dict uses power of two buckets, but works well with nonrandom keys. You can read about Python's approach here. [/quote] Yeah, that strategy works well for them because the cache pressure is already very high due to other things.

I find these numbers a bit high. Worth investigating.

Everything is a trade-off. Doing that, for example, will make the resizing algorithm slower, as it won't be able to make some assumptions.

It's not very accurate, but one could say that RH favors memory/lookups/resizes over inserts.

I collected some interesting data. I tested 2M and 10M element hashmaps and also the known worst case.

Lookup hash probing is very small and the key probing didn't go above 1.

The insert displacements had big outliers (max 735) but were very controlled (5±15) for the 10M.

Code: collect statistics · arthurprs/hashmap2@c5250d4 · GitHub

hist_h = hash array probe length
hist_k = key probe length
hist_d = robin hood displacement on inserts
HM len 2000000, cap3813004
Histogram hist_h: Min: Ok(1) Avg: Ok(2) Max: Ok(13) StdDev: Some(1) Median: Ok(1) 95pct: Ok(3) 99pct: Ok(4)
Histogram hist_k: Min: Ok(1) Avg: Ok(1) Max: Ok(1) StdDev: Some(0) Median: Ok(1) 95pct: Ok(1) 99pct: Ok(1)
Histogram hist_d: Min: Ok(1) Avg: Ok(6) Max: Ok(969) StdDev: Some(17) Median: Ok(1) 95pct: Ok(24) 99pct: Ok(73)
HM len 10000000, cap15252015
Histogram hist_h: Min: Ok(1) Avg: Ok(2) Max: Ok(13) StdDev: Some(2) Median: Ok(1) 95pct: Ok(4) 99pct: Ok(5)
Histogram hist_k: Min: Ok(1) Avg: Ok(1) Max: Ok(1) StdDev: Some(0) Median: Ok(1) 95pct: Ok(1) 99pct: Ok(1)
Histogram hist_d: Min: Ok(1) Avg: Ok(5) Max: Ok(735) StdDev: Some(15) Median: Ok(1) 95pct: Ok(19) 99pct: Ok(66)
Merged map len 20000, cap29789
Histogram hist_h: Min: Ok(1) Avg: Ok(413) Max: Ok(1658) StdDev: Some(539) Median: Ok(5) 95pct: Ok(1525) 99pct: Ok(1618)
Histogram hist_d: Min: Ok(1) Avg: Ok(480) Max: Ok(2052) StdDev: Some(635) Median: Ok(3) 95pct: Ok(1817) 99pct: Ok(1997)

By the way, that is exactly what is to be expected from robin hood hashing — very low variance on these metrics is the raison d’être for the whole algorithm. That doesn’t preclude extremely large outliers, it just makes them rarer.

Everything is a trade-off.

Indeed. I've not tried alternatives, so it's all theories from me. (I feel this is sidetracking things a bit, so maybe we should move this to another thread.)

Lookup hash probing is very small

Because you only test lookups at extremely low occupancies: 52% and 66%. The occupancy gets as high as ~90%. If you test with sizes 1.9m and 7.6m, you'll see the worst case gets way higher. The average is still only ~6, but it's a long-tail distribution.

Histogram hist_h: Min: Ok(1) Avg: Ok(6) Max: Ok(52) StdDev: Some(6) Median: Ok(4) 95pct: Ok(16) 99pct: Ok(25)

Lookups are the strongest point of Robin Hood, though, and a probe length of 12 being within one standard deviation isn't particularly worrying.

The insert displacements had big outliers (max 735) but were very controlled (5±15) for the 10M.

The average is heavily compensated for by the fact that your benchmark measures inserts over the full range of occupancies. Measuring only over higher occupancies, I get numbers more like:

1.5m → 1.9m
Histogram hist_d: Min: Ok(1) Avg: Ok(11) Max: Ok(767) StdDev: Some(25) Median: Ok(2) 95pct: Ok(48) 99pct: Ok(118)

7.0m → 7.6m
Histogram hist_d: Min: Ok(1) Avg: Ok(19) Max: Ok(1073) StdDev: Some(36) Median: Ok(5) 95pct: Ok(81) 99pct: Ok(169)

19±36 no longer seems very controlled.

key probing didn't go above 1

Indeed; key probing only happens on a full 64-bit collision, which is vanishingly unlikely.

very low variance on these metrics is the raison d'être for the whole algorithm

The original paper actually uses double hashing, FWIW.


Ah yes, indeed, and a very good point that it's more general!

True, yeah, but it won't hit stable any sooner whether we land it today or a few weeks from now. That, coupled with the fact that this doesn't seem to be actively exploited, led the libs team to the conclusion that we can regroup for a bit to determine the best long-term solution here.

The downside of going back to per-map keys is that it brings quite a bit of randomness machinery back into the standard library. Additionally, it can drastically slow down the creation of a large number of hash maps (the motivation for the original PR). Just to mention that it's not a free revert!

The downside of going back to per-map keys is that it brings quite a bit of randomness machinery back into the standard library. Additionally, it can drastically slow down the creation of a large number of hash maps (the motivation for the original PR).

I should have been clearer; I'm just referring to incrementing the keys for each hash map. This was the original idea of PR #33318, but that part was removed after some discussion.

Well, for a 7M+ hashmap approaching an occupancy of 91%, I'd say it is.

Have @briansmith’s concerns with that approach been addressed?

That seems unavoidable.

We could go halfway back and update only one of the seeds on each hashmap creation; this cuts the rng overhead in half. The other half can be generated by a simpler rng that is itself seeded from the OS.
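A rough sketch of that halfway scheme, just to make it concrete. All names are illustrative, the xorshift step stands in for "a simpler rng", and rand::random stands in for seeding from the OS:

    extern crate rand; // stand-in for seeding from the OS once per thread

    use std::cell::Cell;

    thread_local! {
        // One key half stays fixed for the thread; the other is stepped per map.
        static FIXED_K0: u64 = rand::random();
        static NEXT_K1: Cell<u64> = Cell::new(rand::random());
    }

    /// Returns the SipHash keys for a newly created map: only the second half
    /// changes per map, via a cheap xorshift step instead of a fresh OS read.
    fn keys_for_new_map() -> (u64, u64) {
        let k0 = FIXED_K0.with(|k| *k);
        let k1 = NEXT_K1.with(|c| {
            let mut x = c.get();
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            c.set(x);
            x
        });
        (k0, k1)
    }

    fn main() {
        let (a, b) = keys_for_new_map();
        let (c, d) = keys_for_new_map();
        assert_eq!(a, c); // fixed half is shared between maps
        assert!(b != d);  // per-map half differs
    }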

@briansmith's concerns are basically

The hash function isn't necessarily SipHash.

I'm not sure this matters because this part of the code is explicitly tied to SipHash.

I don't find "we don't currently know of a related-key attack on SipHash" very encouraging.

It's strictly harder than an attack when the keys are already equal.

Further, hash maps don't care about just any hypothetical related-key attack; the nature of the problem means the attack has to be both extremely fast (else it's useless for amplification) and extremely effective, as a HashMap only leaks the bottom few bits of its hashes (and even then imperfectly).

We assume that the attacker can add or remove arbitrary (key, value) entries from any hash table used in the program.

Though it would be useful to defend against this kind of "anything goes" attack, right now we're vulnerable to attackers with extremely basic capabilities.

More generally, crypto people never generate a secret key by adding a constant value to another secret key.

This isn't the same thing as deriving a key by adding one to the previous key.

This isn't the same thing as deriving a key by adding one to the previous key.

Assuming you're combining it additively with the IV, it's a randomly chosen key that's incremented by one for each block you encrypt.

I'll admit it's a very different domain with a different security model.

No. Counter mode will generate a keystream like this: E(K,IV+0), E(K,IV+1), ..., E(K,IV+n). This is very different from E(K,IV)+0, E(K,IV)+1, ..., E(K,IV)+n. If you were to use the former (which is basically NIST SP 800-90A's CTR DRBG) to initialize successive HashMaps, that would be totally fine.


IV+n corresponds to the seed to SipHash, not E(K,IV+n). (Not that it matters; “what do completely different cryptographic schemes do” is a poor heuristic anyway.)

@briansmith please check this, but I think that incrementing the seed to SipHash (not the key) would solve the problem, inasmuch as SipHash is considered to be a strong PRF (pseudorandom function) and is not affected by any of these attacks. Since SipHash is a PRF, changing even one bit of the input completely changes the output. By “seed”, I mean the common prefix that is used by all hasher invocations.
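In code, that "common prefix" idea amounts to feeding a per-map value into the hasher before the key. A minimal sketch, with illustrative names and using the deprecated std SipHasher purely for concreteness:

    use std::hash::{Hash, Hasher, SipHasher};

    /// Hashes `key` under keys (k0, k1), prepending a per-map `seed` value.
    /// Because SipHash behaves like a PRF, maps with different seeds see
    /// unrelated hash outputs even though they share (k0, k1).
    fn hash_with_seed<K: Hash>(seed: u64, k0: u64, k1: u64, key: &K) -> u64 {
        let mut h = SipHasher::new_with_keys(k0, k1);
        seed.hash(&mut h); // per-map prefix; could simply be incremented per map
        key.hash(&mut h);
        h.finish()
    }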

Also, here is another point to consider: We aren’t encrypting anything. We don’t need to defend against offline attacks by attackers with NSA-level computing power. As long as the attacker requires significantly more computing power to launch the attack than we need to defend against it, we win.

That means that we can use PRFs and PRNGs for hash tables that would be very ill-advised in any other context. I am thinking of 7-round AES-CTR on machines with AES-NI, or 8-round ChaCha20 otherwise.

Additionally, I think we need to look into optimizing the SipHash implementation. It may very well be worthwhile to use hand-tuned assembler here, or at least SIMD intrinsics.

Finally, I very much like the idea of adaptive hashing. There are hash functions (universal hash functions) that offer unconditional guarantees regarding the maximum probability of collisions, and which are often very fast, but which lose all strength if the actual hash codes are revealed. This means that those would need to be somehow protected.
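As one concrete example of such a family (an illustration, not something proposed in this thread's code): multiply-shift hashing picks a random odd 64-bit multiplier and has a provable collision bound, but an attacker who can observe the outputs can work back toward the multiplier, which is why the hash codes would need protecting.

    /// Multiply-shift hashing: `a` is a random odd 64-bit multiplier and `bits`
    /// is the number of output bits. Distinct keys collide with probability at
    /// most about 2 / 2^bits regardless of the key distribution, but the scheme
    /// is only safe while the hash values stay hidden from an attacker.
    fn multiply_shift(a: u64, bits: u32, x: u64) -> u64 {
        debug_assert!(a & 1 == 1, "multiplier must be odd");
        debug_assert!(bits >= 1 && bits <= 63);
        a.wrapping_mul(x) >> (64 - bits)
    }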

@dobenour if you'd really like to improve the performance (which seems to be a perpetual point of concern), I'd suggest taking a look at HighwayHash, which is AVX-2 optimized (with a fallback SSE4.1 implementation) and performs much better than SipHash for larger inputs: