Scenario
I'm currently working on a project that makes heavy use of the `BTreeMap` of the standard library. One of my needs is to merge two maps into one, where values of the same key are combined using a user-specified function `f`:
```
// pseudo-code
m1 = {"key1": "foo", "key2": "baz", ...}
m2 = {"key1": "bar", "key3": "dog", ...}
mergeWith(f, m1, m2) = {
    "key1": f("foo", "bar"),
    "key2": "baz",
    "key3": "dog",
    ...
}
```
instead of one of the values being dropped. However, I failed to find such a function in the docs.
If `m2.len() << m1.len()`, there is an efficient way to merge the two, analogous to what `insert()` does, using `m1.entry(k2)`. However, when `m1` and `m2` are comparable in size, this implementation is not optimal. The standard library implements a faster `m1.append(&mut m2)` function, but it does not support a value-combining function `f` for the moment.
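For concreteness, here is a minimal sketch of the entry-based approach; the free function `merge_with` and the in-place combining signature `FnMut(&mut V, V)` are my choices for illustration, not anything that exists in `std`:

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Merge `m2` into `m1`, combining values of duplicate keys with `f`.
/// Runs in O(k log n), where k = m2.len() and n = m1.len().
fn merge_with<K: Ord, V, F>(m1: &mut BTreeMap<K, V>, m2: BTreeMap<K, V>, mut f: F)
where
    F: FnMut(&mut V, V), // combine the incoming value into the existing one
{
    for (k, v2) in m2 {
        match m1.entry(k) {
            // Key only in m2: plain insert.
            Entry::Vacant(e) => {
                e.insert(v2);
            }
            // Key in both maps: combine in place.
            Entry::Occupied(mut e) => f(e.get_mut(), v2),
        }
    }
}

fn main() {
    let mut m1 = BTreeMap::from([("key1", "foo".to_string()), ("key2", "baz".to_string())]);
    let m2 = BTreeMap::from([("key1", "bar".to_string()), ("key3", "dog".to_string())]);
    merge_with(&mut m1, m2, |v1, v2| v1.push_str(&v2));
    assert_eq!(m1["key1"], "foobar");
    assert_eq!(m1["key2"], "baz");
    assert_eq!(m1["key3"], "dog");
}
```

This is exactly the k-successive-inserts strategy, so it inherits the problem above: it is only efficient when `m2` is much smaller than `m1`.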
IMO it is a frequent use case to combine values, and some languages do have analogous functions: Haskell implements a `unionWith` function, and Mathematica has `Merge[]` for its associations (although the underlying data structures are different).
Benchmarks (`append()` vs. `extend()`)
This post also makes a rough comparison of the efficiency of merging two sorted iterators (`append()`) versus successively inserting elements from one B-tree into the other (`extend()`).
For simplicity and clarity, I merge two `BTreeSet`s (`s1` of length `n - k` and `s2` of length `k`) of random `BigUint`s below 10^1000.
- Platform: MacBook Pro 2021 M1 Pro.
- Rust 1.80.0 (`cargo run --release`); Haskell GHC 9.4.8 with containers-0.6.7 (`cabal run -O2`).
- Time measurement:

```rust
// Rust
let mut delta_t = 0;
// ...
let now = Instant::now();
// ...
delta_t += now.elapsed().as_nanos();
```

```haskell
-- Haskell
start <- getCPUTime
-- ... strict code
end <- getCPUTime
```
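For reference, a self-contained sketch of the Rust side of the benchmark could look like the following. It assumes the `num-bigint` crate with its `rand` feature (plus the `rand` crate); `random_set` and `bench_pair` are my names, and 3322 bits approximates 10^1000:

```rust
use std::collections::BTreeSet;
use std::time::Instant;

use num_bigint::{BigUint, RandBigInt};

/// Build a set of `len` distinct random BigUints below ~10^1000 (3322 bits).
fn random_set(len: usize) -> BTreeSet<BigUint> {
    let mut rng = rand::thread_rng();
    let mut s = BTreeSet::new();
    while s.len() < len {
        s.insert(rng.gen_biguint(3322));
    }
    s
}

fn bench_pair(n: usize, k: usize) {
    let s1 = random_set(n - k);
    let s2 = random_set(k);

    // append(): splice the two sorted sequences together.
    let (mut a1, mut a2) = (s1.clone(), s2.clone());
    let now = Instant::now();
    a1.append(&mut a2);
    let t_append = now.elapsed().as_nanos();

    // extend(): insert the elements of s2 one by one.
    let (mut e1, e2) = (s1.clone(), s2.clone());
    let now = Instant::now();
    e1.extend(e2);
    let t_extend = now.elapsed().as_nanos();

    println!("n = {n}, k = {k}: append {t_append} ns, extend {t_extend} ns");
}

fn main() {
    for k in [10, 100, 1_000, 10_000] {
        bench_pair(100_000, k);
    }
}
```

The clones are made outside the timed region, so only the merge itself is measured.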
The timing statistics are shown in the figures.
From the figures one can see that, in practice, when the length `k` of `s2` satisfies `k < n/10`, successive `insert()` (`extend()`) outperforms `append()`, probably by more than an order of magnitude.
As a side note, what surprises me is that Haskell is actually not that slow for set operations, even compared with Rust! The total running time of the Haskell program is much longer, but I think that time is mostly spent generating the random big integers. The Rust implementations here also have a drawback: they consume the input sets. Making clones of the input would increase the running time further.
Request
- Generalize functions like `insert()` and `append()` to allow custom value-combining behavior.
- Optimize `append()` for merging a large set/map with a small one (see the sketch below).
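As an illustration of the second point, a user-side workaround today could dispatch on the relative sizes. This is only a sketch: `merge_adaptive` is a hypothetical name, and the `n/10` threshold is taken from the benchmark above, not from any std guarantee:

```rust
use std::collections::BTreeSet;

/// Merge `s2` into `s1`, picking the strategy by relative size.
/// Hypothetical user-side helper, not part of std.
fn merge_adaptive<T: Ord>(s1: &mut BTreeSet<T>, mut s2: BTreeSet<T>) {
    if s2.len() * 10 < s1.len() {
        // s2 is small: k successive inserts, O(k log n).
        s1.extend(s2);
    } else {
        // Comparable sizes: splice the sorted sequences, O(n + k).
        s1.append(&mut s2);
    }
}
```

A built-in `append()` could perform this dispatch internally, since it already knows both lengths.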