Summary
Implement the amortized hashing strategy for HashMap. That is, use the fastest available hash function initially, but keep SipHash as a secure fallback by default.
Thanks due to @bill-myers, who characterized the general idea in the context of Rust. Also to @Jurily, who provided an implementation of xxh32.
Motivation
SipHash-2-4 with a random seed guarantees security against algorithmic complexity attacks. The positions of entries are unpredictable.
SipHash’s performance is satisfactory for large streams. However, it must perform at least 8 compression rounds that consist of 14 simple bitwise and addition operations each, even for the smallest values.
In the classic DoS scenario, many thousands of key collisions are precomputed. The complexity of creating a hash table with these keys is O(n^2). Inserting a key requires a comparison with all previously inserted keys. It causes many L2 or even L3 cache misses on modern processor cores.
XxHash is a fast function designed for calculating checksums. Aside from security concerns, it has all properties that can be expected from a function used for hash tables.
Ideally, HashMap would use a fast function initially, and when the length of any probe sequence surpasses a threshold, the map switches to one-way randomized hashing.
Perl has been using a similar strategy since 2003 for a hash table implemented by chaining.
Detailed design
Add a new cold
private method for rehashing the map with the following
behavior:
if load_factor >= 0.625 {
self.resize(capacity * 2);
} else {
self.hasher.reseed();
let old_table = replace(&mut self.table, RawTable::new(capacity));
for (_, k, v) in old_table.into_iter() {
self.insert_hashed_nocheck(self.make_hash(&k), k, v);
}
}
Choose a constant threshold, such as 64. When an attempt to insert an entry with a probe count higher than the threshold is made, this strategy is used to resolve the situation.
With very high factors, when probe distances are longer, it’s possible that several millions of insertions would trigger this condition. Thus the map is resized to reduce probe distances. Here, the load factor of 0.625 is chosen to thwart the attacker’s ability to repeatedly double hashmap’s capacity[1]. Furthermore, the map can’t be resized repeatedly because the map will have a factor of at least 0.625/2 = 0.3125 once it’s grown. Shrinking happens at the load of 0.25. It means it will take 6.25% of elements to be removed to shrink back. Such behavior is unlikely.
Reaching the other branch is enormously unlikely during normal operation on hashmaps of any size[2].
The method above is called only during insertion:
// fn insert_or_replace_with
while probe.index() < ib + 64 || !self.hasher.is_adaptive() {
// ... looking for a spot to insert ...
}
self.adapt_table();
// ...
The reseed
and is_adaptive
methods are provided by Hasher
:
pub trait Hasher<S> {
/// Compute the hash of a value.
fn hash<T: Hash<S>>(&self, value: &T) -> u64;
/// Re-randomize the seed from a task-local rng, if applicable.
fn reseed(&mut self) {}
#[inline]
fn is_adaptive(&self) -> bool { false }
}
Implement an adaptive hasher XxHasherOrRandomSipHasher
:
impl XxHasherOrRandomSipHasher {
/// Construct a new `XxHasherOrRandomSipHasher` that uses xxhash initially.
#[inline]
pub fn new() -> XxHasherOrRandomSipHasher {
XxHasherOrRandomSipHasher { k0: 0, k1: 0 }
}
}
impl Hasher<XxStateOrRandomSipState> for XxHasherOrRandomSipHasher {
#[inline]
fn hash<T: Hash<XxStateOrRandomSipState>>(&self, value: &T) -> u64 {
let mut state = if self.k0 == 0 {
UseXx(XxState::new(0))
} else {
UseSip(sip::SipState::new_with_keys(self.k0, self.k1))
};
value.hash(&mut state);
state.result()
}
fn reseed(&mut self) { /* ... */ }
#[inline]
fn is_adaptive(&self) -> bool { true }
}
Use it by default for HashMap
.
Implement both variants of XxHash, namely XXH32 and XXH64, in the standard
library. Export them both, but conditionally compile XxHashOrRandomSipHasher
to use the faster one.
Conditionally compile RawTable
to store hashes as u64 on 64-bit platforms,
and as u32 in all other cases. Keep in mind that most significant bits of the
hash are truncated for all operations where an index into the table is needed:
index = hash(key) modulo capacity
Drawbacks
- Yet another hashing function is implemented and included in every binary that uses the default HashMap.
- The logic for map.insert() has additional complexity in the cold path.
- An additional layer of security is removed. Randomness is an obstacle in the way of useful exploitation of potential memory unsafety in the implementation of HashMap.
- Determinism is bad. Users might be prone to think that the order of the iteration can be relied upon, since it’s the same on every program run. Moreover, it can expose some information.
Alternatives
- Keep SipHash by default. (no change)
- Keep using 64-bit hashes on all platforms, but change the strategy.
- Variant 1a. Use XxHash with the
type_id
ofHashMap<K, V, H>
as the seed. - Variant 1b. Use XxHash with a randomized task-local seed.
- Variant 2. Use only the Murmur3/XxHash’s finalizer for word-sized or shorter constant-length keys. The current API makes this very difficult.
- Variant 3a. Put rarely used 16 bytes of
SipHasher
behind anOption<Box<SipHasher>>
. - Variant 3b. Put
SipHasher
behindmap.table.ptr
. This means the HashMap struct can be as small as Vec. - Choose another fast hashing function, such as lookup3.
Unresolved questions
- Should SipHash be enabled by default for debug builds?
- What’s the soundness of the algorithm presented here?
- Would it be desirable to integrate parts the algorithm with
ResizePolicy
and/orHasher
?
References
- Yves Orton, 2013. Hardening Perl’s Hash Function
- 99.999th percentile of probe count