Pre-RFC: The amortized hashing strategy

pczarn · September 21, 2014, 9:41am

Summary

Implement the amortized hashing strategy for HashMap. That is, use the fastest available hash function initially, but keep SipHash as a secure fallback by default.

Thanks due to @bill-myers, who characterized the general idea in the context of Rust. Also to @Jurily, who provided an implementation of xxh32.

Motivation

SipHash-2-4 with a random seed guarantees security against algorithmic complexity attacks. The positions of entries are unpredictable.

SipHash’s performance is satisfactory for large streams. However, it must perform at least 8 compression rounds that consist of 14 simple bitwise and addition operations each, even for the smallest values.

In the classic DoS scenario, many thousands of key collisions are precomputed. The complexity of creating a hash table with these keys is O(n^2). Inserting a key requires a comparison with all previously inserted keys. It causes many L2 or even L3 cache misses on modern processor cores.

XxHash is a fast function designed for calculating checksums. Aside from security concerns, it has all properties that can be expected from a function used for hash tables.

Ideally, HashMap would use a fast function initially, and when the length of any probe sequence surpasses a threshold, the map switches to one-way randomized hashing.

Perl has been using a similar strategy since 2003 for a hash table implemented by chaining.

Detailed design

Add a new cold private method for rehashing the map with the following behavior:

if load_factor >= 0.625 {
    self.resize(capacity * 2);
} else {
    self.hasher.reseed();
    let old_table = replace(&mut self.table, RawTable::new(capacity));
    for (_, k, v) in old_table.into_iter() {
        self.insert_hashed_nocheck(self.make_hash(&k), k, v);
    }
}

Choose a constant threshold, such as 64. When an attempt to insert an entry with a probe count higher than the threshold is made, this strategy is used to resolve the situation.

With very high factors, when probe distances are longer, it’s possible that several millions of insertions would trigger this condition. Thus the map is resized to reduce probe distances. Here, the load factor of 0.625 is chosen to thwart the attacker’s ability to repeatedly double hashmap’s capacity[1]. Furthermore, the map can’t be resized repeatedly because the map will have a factor of at least 0.625/2 = 0.3125 once it’s grown. Shrinking happens at the load of 0.25. It means it will take 6.25% of elements to be removed to shrink back. Such behavior is unlikely.

Reaching the other branch is enormously unlikely during normal operation on hashmaps of any size[2].

The method above is called only during insertion:

// fn insert_or_replace_with
    while probe.index() < ib + 64 || !self.hasher.is_adaptive() {
        // ... looking for a spot to insert ...
    }
    self.adapt_table();
    // ...

The reseed and is_adaptive methods are provided by Hasher:

pub trait Hasher<S> {
    /// Compute the hash of a value.
    fn hash<T: Hash<S>>(&self, value: &T) -> u64;

    /// Re-randomize the seed from a task-local rng, if applicable.
    fn reseed(&mut self) {}

    #[inline]
    fn is_adaptive(&self) -> bool { false }
}

Implement an adaptive hasher XxHasherOrRandomSipHasher:

impl XxHasherOrRandomSipHasher {
    /// Construct a new `XxHasherOrRandomSipHasher` that uses xxhash initially.
    #[inline]
    pub fn new() -> XxHasherOrRandomSipHasher {
        XxHasherOrRandomSipHasher { k0: 0, k1: 0 }
    }
}

impl Hasher<XxStateOrRandomSipState> for XxHasherOrRandomSipHasher {
    #[inline]
    fn hash<T: Hash<XxStateOrRandomSipState>>(&self, value: &T) -> u64 {
        let mut state = if self.k0 == 0 {
            UseXx(XxState::new(0))
        } else {
            UseSip(sip::SipState::new_with_keys(self.k0, self.k1))
        };
        value.hash(&mut state);
        state.result()
    }

    fn reseed(&mut self) { /* ... */ }

    #[inline]
    fn is_adaptive(&self) -> bool { true }
}

Use it by default for HashMap.

Implement both variants of XxHash, namely XXH32 and XXH64, in the standard library. Export them both, but conditionally compile XxHashOrRandomSipHasher to use the faster one.

Conditionally compile RawTable to store hashes as u64 on 64-bit platforms, and as u32 in all other cases. Keep in mind that most significant bits of the hash are truncated for all operations where an index into the table is needed:

index = hash(key) modulo capacity

Drawbacks

Yet another hashing function is implemented and included in every binary that uses the default HashMap.
The logic for map.insert() has additional complexity in the cold path.
An additional layer of security is removed. Randomness is an obstacle in the way of useful exploitation of potential memory unsafety in the implementation of HashMap.
Determinism is bad. Users might be prone to think that the order of the iteration can be relied upon, since it’s the same on every program run. Moreover, it can expose some information.

Alternatives

Keep SipHash by default. (no change)
Keep using 64-bit hashes on all platforms, but change the strategy.
Variant 1a. Use XxHash with the type_id of HashMap<K, V, H> as the seed.
Variant 1b. Use XxHash with a randomized task-local seed.
Variant 2. Use only the Murmur3/XxHash’s finalizer for word-sized or shorter constant-length keys. The current API makes this very difficult.
Variant 3a. Put rarely used 16 bytes of SipHasher behind an Option<Box<SipHasher>>.
Variant 3b. Put SipHasher behind map.table.ptr. This means the HashMap struct can be as small as Vec.
Choose another fast hashing function, such as lookup3.

Unresolved questions

Should SipHash be enabled by default for debug builds?
What’s the soundness of the algorithm presented here?
Would it be desirable to integrate parts the algorithm with ResizePolicy and/or Hasher?

References

Yves Orton, 2013. Hardening Perl’s Hash Function
99.999th percentile of probe count

huon · September 21, 2014, 9:47am

How did you compute these values? In particular, what sort of keys were used?

pczarn · September 21, 2014, 9:56am

u64 keys generated with xorshift rng, then hashed by the HashMap with xxhash. The code is here

pczarn · September 22, 2014, 8:51am

If my math is right, the cumulative distribution of probe counts is exponential.

In a Robin Hood hash table with a load factor of α = 0.625, the probability that an insertion will take k or more probes is

Pr(psl >= k) ≈ 0.4166^k

The value of Pr(psl >= 32) is 6.78e-13 for α = 0.625. It follows that after 31 hours of continuous insertions to a map with a threshold of 32, the map would switch to SipHash with the probability of approximately 50%. With a threshold of 64, it would take 10^12 times longer. (Assuming an insertion and cleanup takes 110ns.)

All benchmarks hash u64.

before
find_existing         51780 ns/iter (+/- 4616)
find_nonexisting      54130 ns/iter (+/- 3198)
find_pop_insert         256 ns/iter (+/- 2)
grow_by_insertion       177 ns/iter (+/- 7)
hashmap_as_queue        141 ns/iter (+/- 3)
new_drop                131 ns/iter (+/- 41)
new_insert_drop         230 ns/iter (+/- 97)

after
find_existing         41050 ns/iter (+/- 3470)
find_nonexisting      41080 ns/iter (+/- 3970)
find_pop_insert         163 ns/iter (+/- 23)
grow_by_insertion       146 ns/iter (+/- 23)
hashmap_as_queue        122 ns/iter (+/- 10)
new_drop                  1 ns/iter (+/- 0)
new_insert_drop          74 ns/iter (+/- 4)

Gankra · September 22, 2014, 12:14pm

Great work! If you have the raw data, and the code you used in your analysis, I’d love to see it to confirm your methodology.

Also: How does XxOrSip compare to just using Xx, is there much overhead?

arthurprs · September 25, 2014, 4:45pm

64bit hashes in a general purpose hashtable are wasteful, even in a 64bit environment in which u32 are still 4 byte aligned. Even if they’re usually 1.5~2x faster than the 32bit variants, it’s not worth it for the common case of small to medium keys.

I also believe that using a cryptographically secure hash (like SipHash) for a general purpose hashmap of a systems language is a bad decision. It’s expected to be as fast an lean as possible and right now it’s neither. But I guess this is disputed.

Does task local seed works? Doesn’t that makes global hashmaps (not recommended but must be possible) and Arc<HashMap<?>> unusable.

Isn’t a process wide seed for xxhash sufficient for the stdlib hashmap?

As an alternative. Isn’t a process wide SipHash keys sufficient? Python and Ruby implementations went this way, this will save 16 bytes in each HashMap and shave some time on initialization.

pczarn · September 26, 2014, 10:07am

Thanks!

That's right, but it would take 64GB of memory to overflow capacity. It's possible with swap and on some processors (Xeon).

FWIW, I don't like the idea of changing the hash size dynamically or failing.

I'd argue that the default shouldn't be exploitable. See rust#11783 "Implement a fast HashMap / hash function"

Task local seed would make some collections not Send.

u32 seed is surely insecure in practice. SipHash is required for theoretical security.

Perhaps it would be sufficient, if there was no way for keys to be leaked. However, initialization before main or when calling from C is not an option, I suppose. But Once.doit could be used.

arthurprs · September 26, 2014, 3:44pm

I'm not sure if I follow. Are you talking about a 68GB+hashmap? Is that a problem?

Well, theoretically...

Just to complement my previous comment.

Golang use a u32 seed per hashmap. There was some golang-dev talks about switching to SipHash but they were dropped it in favor of an special hash based on AES instruction on x64 and FVN as a fallback (both still using the u32 per map seed)

Java: u32 seed along Murmur3 per JVM.

Python and Ruby: per process 128bit SipHash keys.

--

Maybe we should prepare some benchmarks before taking any decisions.

Gankra · September 26, 2014, 4:30pm

@arthurprs This isn’t out of nowhere. This has been part of HashMap for a while. HashMap is parameterized over the hash function. The current default is SipHash for security reasons. Meanwhile, the compiler overrides this with XxHash because DoS attacks against the compiler aren’t a concern, and SipHash was deemed too expensive. This propsal suggests changing the default to a special hash function that starts with XxHash, and falls back to SipHash if statistically improbable (on random input) bad behaviour occurs.

arthurprs · September 26, 2014, 4:54pm

I know. I’m just exercising the discussion providing personal thoughts and decisions taken by other languages. Fell free to disagree with any arguments.

pczarn · September 26, 2014, 10:24pm

Correction: hashmaps are expected to use FNV in the compiler.

Another correction: It would take a 128GB+ hashmap to run out of bits; all hashes should have at least log_2 capacity meaningful bits. If there was a rehash from u32 to u64, it would happen with a capacity increase from 2^32 to 2^33. The smallest, yet useful hashmap is e.g. HashSet<u64> with 8+8=16 bytes per entry. This gives us 2^33 * 16 bytes = 128GB required to represent a map with capacity>2^32 without running out of memory.

(HashSet<()>, HashSet<u8> etc. are smaller, but presumably not functional.)

If smaller memory consumption is important, consider compiling for Linux x32.

arthurprs · September 27, 2014, 4:40am

I don’t see the point in having a theoretical capacity bigger than billions. The implementation isn’t aimed for such extreme cases and becomes pretty much unusable long long before that.

Jurily · September 28, 2014, 1:17pm

This is a really good post about xxHash in a security context.

arthurprs · October 1, 2014, 7:32pm

@pczarn Do you have benchmarks using only xxHash? As @Gankra suggested it’d be useful to check the overhead.

Personally I think all stdlib collections should be Send’able (ex: return results from a task). On the other hand, one must be able to access a global hashmap from any task. Based on those two I’d rule the task-locals option out.

pczarn · October 4, 2014, 10:04am

With -O, XxOrSip vs Xx:

find_existing                ... bench:     42094 ns/iter (+/- 98)
find_existing_with_xxh64     ... bench:     37813 ns/iter (+/- 209)

find_nonexisting             ... bench:     42706 ns/iter (+/- 151)
find_nonexisting_with_xxh64  ... bench:     41284 ns/iter (+/- 105)

find_pop_insert              ... bench:       168 ns/iter (+/- 30)
find_pop_insert_with_xxh64   ... bench:       206 ns/iter (+/- 0)

grow_by_insertion            ... bench:       143 ns/iter (+/- 4)
grow_by_insertion_with_xxh64 ... bench:       131 ns/iter (+/- 3)

hashmap_as_queue             ... bench:       135 ns/iter (+/- 0)
hashmap_as_queue_with_xxh64  ... bench:       117 ns/iter (+/- 0)

new_drop                     ... bench:         1 ns/iter (+/- 0)
new_drop_with_xxh64          ... bench:         0 ns/iter (+/- 0)

new_insert_drop              ... bench:        80 ns/iter (+/- 0)
new_insert_drop_with_xxh64   ... bench:        72 ns/iter (+/- 0)

With good optimization, --opt-level=3 -C lto:

find_existing                ... bench:     32026 ns/iter (+/- 8082)
find_existing_with_xxh64     ... bench:     29026 ns/iter (+/- 1609)

find_nonexisting             ... bench:     36371 ns/iter (+/- 2186)
find_nonexisting_with_xxh64  ... bench:     37102 ns/iter (+/- 1897)

find_pop_insert              ... bench:       189 ns/iter (+/- 5)
find_pop_insert_with_xxh64   ... bench:       179 ns/iter (+/- 2)

grow_by_insertion            ... bench:       137 ns/iter (+/- 11)
grow_by_insertion_with_xxh64 ... bench:       136 ns/iter (+/- 11)

hashmap_as_queue             ... bench:       113 ns/iter (+/- 2)
hashmap_as_queue_with_xxh64  ... bench:       106 ns/iter (+/- 1)

new_drop                     ... bench:         1 ns/iter (+/- 0)
new_drop_with_xxh64          ... bench:         0 ns/iter (+/- 0)

new_insert_drop              ... bench:        72 ns/iter (+/- 1)
new_insert_drop_with_xxh64   ... bench:        68 ns/iter (+/- 19)

With a high opt level, find is mostly getting inlined in find_existing, only hasher.write and hasher.result are not being inlined. Ideally, everything would be inlined except perhaps table.make_hash.

Because of the way Hasher currently works, the state of XxOrSip must be an enum. Writes to that enum are slower and harder to optimize.

arthurprs · October 7, 2014, 1:15am

Nice results, the overhead is minimal.

Although I’m concerned with the adaptive algorithm overall since it adds a lot of complexity just to solve SipHash slowness (isAdaptive method on the trait makes my brain hurt).

Also, reducing the growth threshold to 62.5% seems a big step backwards in my opinion (it’s 90%+ right now). Since it goes into HashMap it’ll be the same for any Hasher, right?

Gankra · October 7, 2014, 1:25am

The 62.5% load factor isn’t used most of the time, it’s only used for deciding what to do when a long probe sequence is detected. If the load factor is high at the time of the detection, we assume that this was just bad luck and grow in the hopes that this will resolve the pressure. If the load factor is low, we assume that we are being attacked, and change to secure mode.

arthurprs · October 7, 2014, 1:47am

Thanks @Gankro. I fell stupid, I should have read more carefully.

system · March 25, 2019, 8:22am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Help harden HashMap in libstd! libs	27	4395	March 25, 2019
Help wanted: fast hash maps in std	29	6098	March 25, 2019
A new default Hasher for HashMap? libs	26	6880	December 8, 2019
Global default hash function override for HashMap and HashSet libs	26	1210	August 20, 2024
Pre-RFC: bless the `type FastHashMap` pattern	92	3981	March 25, 2019