Pre-rfc: adaptive sets and deduplication functions


#1

Motivation

Often enough, one has to deal with sets that one expects to have less than 4 elements.

If you want to do that in Rust, you either use a full HashSet (and face high performance overhead) or use a Vec (and risk an O(n^2) performance explosion). Both options are non-ideal.

We want to handle sets that are almost always small, but sometimes grow large.

Design

First, HashMap should use O(n^2) insertions and removals on sets of less than 8 elements.

Second, for extra performance, we can introduce a full_dedup function for Vec:

    /// Remove all duplicate items from this `Vec`.
    fn full_dedup(&mut self) where T: Ord {
        // <snip>
    }

Using Hash would be problematic for Vec, as that would require a hasher, which is not available at libcollections.


#2

What?


#3

There’s no explosion at all, it’s a tiny sequence, so even a linear scan is very efficient.

Since Ada 2012 there are also Bounded Containers: http://www.ada-auth.org/standards/12rat/html/Rat12-8-2.html

So you can stack-allocate a not too much large HashMap and HashSet on the stack, this is handy.

But you seem to want something different, a set like this, so a “SmallSet”: https://crates.io/crates/smallvec

Small sets could be implemented as bit set or unsorted vectors or sorted vectors (the unsorted ones have less code cache pressure, but searching in them is a little slower if their length is more than 40-200 values).

What about a sort + dedup?

The problem is that sometimes I’d like to do a dedup on a slice (and return a new possibly shorter slice).


#4

I have often wanted to have a hashmap that begins as a vector of pairs and only gets converted to a hashmap at a certain length (and I want this to be hidden from me, naturally). I believe that growing and looking data up in such a hashmap can be substantially faster. Doing a quick google search yielded this answer, which is talking about C++, but direct comparisons are difficult (in part because C++'s usual map is actually a btree).

Particularly given that hashing in Rust is not all that cheap, because it is secure by default, I am confident that Vec<(K,V)> will definitely beat a default HashMap<K,V> for some number, but I don’t know what that number is (and it will of course depend on the K, too).

As a first step here, it’d be interesting to gather some statistics about hashmaps in practice (the compiler is a good case study, though not necessarily representative). How big are they and how are the sizes clustered? How many lookups are done on maps of different sizes? That sort of thing.


#5

Even better (for some cases) is to have that vector allocate in-place, in a SmallVec.

Edit: there is already a SmallSet that uses that (it keeps the keys unordered, so you can’t binary search them): https://crates.io/crates/smallset

Since C++11 the situation is different, now people are using hash maps often enough (and even before that, Boost had hash maps. But not everybody likes and uses Boost).

Them being secure by default is kind of acceptable. But I still believe Rust standard library should offer a really easy way to use unsecure HashMaps/HashSets in a really simple, easy and succinct way. I think the current situation is not good enough, there’s space for improvements… In many situations I prefer faster hashes.

For HashSets it’s even better because there are no Values, so in that array there are less cache misses. In D language the threshold is about 40-100 integral keys, past that number a HashMap is faster… but there are many other considerations. Iterating on a vector is faster… So it depends on many factors.

Regarding HashMaps, I am still missing ergonomy of a hmap!{} macro to create HashMap literarls. I will wait for language improvements…


#6

I recently switched the backing of hyper Headers from a HashMap to a VecMap, and saw noticeable improvements. Before, siphash was high in profiles. While the linked PR also includes comparing against literals, I do recall the change being worth it. Especially since an average request has 10-15 headers (from a browser).

In contrast, I also tried using a SmallVec, and saw no noticeable affect, even with requests not causing a heap allocation. My guess is that the additional branch of the internal enum offset not calling malloc.


#7

Is it really necessary to decide between secure and fast in the first place? Why not start with a fast hash and dynamically switch to a secure one if chain lengths get too high? (This was proposed two years ago but I think never went anywhere.)


#8

I don’t disagree, but it seems at least somewhat orthogonal – that is, hashing will always have cost, secure or not. But I agree we should have an easier way to ask for insecure hashing.

Sure, but using unsafe code you could organize the vector to colocate the keys. HashMap already does this internally. So it’s only logically a Vec<(K,V)>, doesn’t have to be stored that way.

We should revisit this question – though this is easy enough to add in your own crate, for sure. Probably there even exists one?


#9

In my opinion if you put associative arrays in the standard library, then it’s reasonable and good to have a rather small helper macro to create their literals. But I seem to remember that people have asked to delay the addition of such macro in the standard library to wait for language improvements… I don’t remember the details and the desired improvements.


#10

From my experience usage, SmallVec is useful in some cases, and not faster in other. So you have to test and benchmark.

Also, the implementation of SmallVec is memory-wasteful, perhaps the new way dropping is implemented allows to make SmallVec more efficient (in memory used and perhaps performance).


#11

I am not expert on hashing, I assumed max performance and full safety are not compatible.

I’d like to use a HashSet/HashMap that keeps being fast at all its sizes.


#12

It wouldn’t switch to a secure hash merely because the map got big, but only if there were far more entries with the same or similar hash-modulo-table-size than would be expected (forming a long ‘chain’ that has to be linearly traversed), and growing the table was unable to solve the problem (the thread I linked mentions a check for load factor in addition to chain length). Thus it would only switch in cases where the insecure hash would be unable to provide adequate performance anyway, whether because of an attack or because the table got a lot of similar keys inserted into it for some other reason.

The main drawbacks would be (1) any direct runtime cost incurred from the extra checks and (2) code size. I don’t think the former would be significant if done properly (and Rust’s current HashMap implementation has some low-hanging fruit for performance - why store the keys, values, and hashes in separate contiguous buffers, requiring at least three different cache lines to be loaded if the table is cold, when one would suffice!?), and the latter probably isn’t a big deal (there’s low-hanging fruit there too).


#13

This pre-RFC has a tradeoff. Either keep high overhead for small maps, or make hash map’s internals more complex – pick one.

I think having a special case for small maps goes against the philosophy of doing the most straightforward thing in the standard library.

To be clear, do you propose reallocating tables for maps that grow from 8 to 16 to 32 entries, to save some memory? Currently, all empty maps go directly from 0 to 32 entries.

I made an experimental implementation of adaptive hashing when specialization became available: https://github.com/contain-rs/hashmap2/pull/5


#14

Note that also Java abandoned the whole “strong hash function” approach due to performance concerns, in favor of a totally different way of structuring the hash table buckets in the face of many collisions: https://bugs.openjdk.java.net/browse/JDK-8046170


#15

This is how D language associative arrays were designed originally. Later this design was abandoned for a simpler and vulnerable but faster design.


#16

I like the idea, and it definitely seems like less complexity than entirely swapping out the data structure behind the scenes. But as an initial implementation, the simple switch-to-SipHash-when-the-table-gets-big approach would be a good-enough first step, and have less potential performance fallout elsewhere since it only imposes a branch on table resize rather than a branch on each insertion.


#17

I don’t agree that this is a good goal for the standard library. I think the primary goals for the standard library should be to be easy to use and to do be performant and correct; if the implementation contains complex optimizations, that’s just a gain for everyone.

I also think if this goal is followed through, someone else will reimplement std types with more optimizations, and then these alternative collections could become de facto standards, making the experience worse for users.

Lastly, I think the std collections already contains a number of internal optimizations which aren’t necessarily the most straightforward thing.