BTreeMap methods .insert(), .append() to combine values instead of overwriting

Scenario

I'm currently working on a project that makes heavy use of the BTreeMap of the standard library. One of my needs is to merge two maps into one where values of the same key are combined using a user-specified function f:

// pseudo-code
m1 = {"key1": "foo", "key2": "baz", ...}
m2 = {"key1": "bar", "key3": "dog", ...}

mergeWith(f, m1, m2) = {
  "key1": f("foo", "bar"),
  "key2": "baz",
  "key3": "dog",
  ...
}

instead of one of the values being dropped. However, I failed to find such a function in the doc.

If m2.len() << m1.len(), there is an efficient way to merge two just like what insert() does using m1.entry(k2). However, when m1 and m2 are comparable in size, this implementation is not optimal. The standard library implements a faster m1.append(&mut m2) function but this function does not support a value-combining function f for the moment.

IMO it is a frequent use case to combine values and some languages do have analogous functions. Haskell implements a unionWith function and Mathematica has Merge[] for its associations (although the inner data structures are different).

Benchmarks (append() vs. extend())

This post also made a rough comparison of the efficiency between merging two sorted iterators (append()) and successively inserting elements from one BTree into the other (extend()).

For simplicity and clarity I merge two BTreeSets (s1 of length n-k and s2 of length k) of random BigUints below 10^1000.

  • Platform: MacBook Pro 2021 M1 Pro.

  • Rust 1.80.0 (cargo run --release); Haskell GHC 9.4.8 with containers-0.6.7 (cabal run -O2).

  • Time measurement:

    \\ Rust
    let mut delta_t = 0;
    \\ ...
    let now = Instant::now();
    \\ ...
    delta_t += now.elapsed().as_nanos();
    
    -- Haskell
    start <- getCPUTime
    -- ... strict code
    end <- getCPUTime
    

The statistics of time is showed in the figures.

From the figures one can see that, in practice, when the length of s2: k < n/10, successive insert() (extend()) outperforms append(), probably more than an order of magnitude.

As a side note, what surprises me is that Haskell is actually not that slow for set operations even compared with Rust! Although the total running time of the Haskell program is much longer, which I think is wasted in the generation of random big integers. The Rust implementations here also have a drawback that they consume the input sets. Making clones of the input will slightly increase the time complexity further.

Request

  • Generalize functions like insert() and append() to allow custom value-combining behavior.
  • Optimize append() for merging large set/map with small set/map.
5 Likes

I too have a use case for being able to efficiently union BTreeMaps while combining rather than dropping values with duplicate keys, and would like to see this kind of functionality.

There are two ways to help make this happen:

  • Actually write the implementation and send a pull request.
  • Write an API change proposal precisely describing the new methods to be added. (Optimizations don't need this.)
5 Likes

Related, but not exactly this (I needed to pair them up and iterate over to produce a diff): itertoools merge_join_by

Thanks for your really helpful advice. If many people agree that this is a common use case, I (or someone else) can write an ACP sometime.

Besides, I found that the implementation should be very straightforward. Just find a way to pass the custom function f into an analogue of struct MergeIter and modify its Iterator trait to use f instead of "returns the key-value pair from the right source". The upstream functions append_from_sorted_iters() and append() just need a tiny change for passing f.

And also,

This reminds me that a more general use case may be like

fn f(v1: Option<V>, v2: Option<V>) -> V

where f will receive a None if one of the maps doesn't contain the key of the other.

Code

// In ".../btree/map.rs"

pub fn append_with<F>(&mut self, f: F, other: &mut Self)
where
    F: FnMut(Option<V>, Option<V>) -> V,
    K: Ord,
    A: Clone,
{
    // ... Same as append()
    root.append_from_sorted_iters_with(
        f,
        self_iter,
        other_iter,
        &mut self.length,
        (*self.alloc).clone(),
    )
}
// In ".../btree/append.rs"

pub fn append_from_sorted_iters_with<F, I, A: Allocator + Clone>(
    &mut self,
    f: F,
    // ...
) where
    F: FnMut(Option<V>, Option<V>) -> V,
    // ...
{
    // We prepare to merge `left` and `right` into a sorted sequence in linear time.
    let iter: MergeWithIter<F, K, V, I> = MergeWithIter{
        inner: MergeIterInner::new(left, right),
        f,
    };

    // ...
}

// ...

struct MergeWithIter<F, K, V, I: Iterator<Item = (K, V)>>{
    inner: MergeIterInner<I>,
    f: F,
}

impl<F, K: Ord, V, I> Iterator for MergeWithIter<F, K, V, I>
where
    F: FnMut(Option<V>, Option<V>) -> V,
    I: Iterator<Item = (K, V)> + FusedIterator,
{
    type Item = (K, V);

    fn next(&mut self) -> Option<(K, V)> {
        let (a_next, b_next) = self.inner.nexts(|a: &(K, V), b: &(K, V)| K::cmp(&a.0, &b.0));
        match (a_next, b_next) {
            (Some((ka, va)), Some((kb, vb))) => Some((kb, (self.f)(Some(va), Some(vb)))),
            (Some((ka, va)), None) => Some((ka, (self.f)(Some(va), None))),
            (None, Some((kb, vb))) => Some((kb, (self.f)(None, Some(vb)))),
            (None, None) => None,
        }
    }
}
1 Like

Is this perhaps related to C++ multimap containers? Eg is it faster to collect all the values first with a dedicated type then process them on the way out rather than reducing the existing entry? (As with all performance, I expect the answer is "it depends")

I've definitely wanted to reach for a multimap several times, so two birds and all...

Not in my use case: I needed to know specifically if things existed in the left, right or both map (and then also compare the values to generate diffs of those values describing what needs to be done to go from left to right).

I fail to see how multimap would help there.

Also, do you have any examples of where multimap would be useful? I never used it when coding C++ and can't really think of any scenarios for it. (It seems to be a map that allows duplicate keys.)

Boost multi index though I have used (where you can have multiple indexed columns, allowing lookup on either key or value), but multimap doesn't seem to be that.

The most straightforward use is a group-by result.

If the key and value are the same id type, or the value is a pair of the id and a weight value, multimap is the edges in a directed graph, which is often all you need, so that's neat.

It's not critical, Map<K, Set<T>> has good enough ergonomics with the entry API that for the small scale uses I've had I haven't needed to even look for a crate, but I do often enough find myself missing a standard C++ type, which should never happen :smile:

3 Likes

My use case is much the same, except that my maps are already patches — or rather, “transactions” — and the goal is to produce a single transaction which has the simultaneous effects of both inputs. In the general case, this requires computing a new value using application-specific rules (such as, for example, doing the same thing to a nested map). Also, that computation may fail (if the transactions’ goals conflict with each other), so I'd ideally like a try_* style operation to allow exiting early.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.