# `BTreeMap::split_off` implementation

Hello! I’ve started to implement `split_off` for `BTreeMap` and `BTreeSet` (a part of this issue), but a decision needs to be made first.

There is an algorithm for splitting a B-tree at a value using only O(log n) operations (a description is here), but to recalculate the sizes of the trees after `split_off`, it’s necessary to know the exact size of each subtree. So there are several possible solutions:

1. Add a `size` field to the node structure.

I think the benefits are great.

First, it lets us do very flexible splits and merges with only O(log n) operations. It’s already possible to merge trees in O(log n) operations, but that is not so interesting without split.

Second, some other interesting and useful functions need it, for example `kth_element` or “get the number of elements between two values”. Both take O(log n) operations.
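For illustration, here is a rough sketch (on a toy node type, not the real `btree::node` internals) of how cached subtree sizes make an O(log n) `kth_element` possible: at each internal node, the cached child sizes tell us which subtree holds the answer without visiting it.

```rust
// Toy stand-in for a B-tree node; `sizes[i]` caches the element count
// of `children[i]`'s whole subtree.
enum Node<K> {
    Leaf { keys: Vec<K> },
    Internal { keys: Vec<K>, children: Vec<Node<K>>, sizes: Vec<usize> },
}

impl<K> Node<K> {
    // Total number of keys in this subtree, from the cached sizes.
    fn len(&self) -> usize {
        match self {
            Node::Leaf { keys } => keys.len(),
            Node::Internal { keys, sizes, .. } => keys.len() + sizes.iter().sum::<usize>(),
        }
    }

    // 0-based k-th smallest key, descending into exactly one child per level.
    fn kth(&self, mut k: usize) -> Option<&K> {
        match self {
            Node::Leaf { keys } => keys.get(k),
            Node::Internal { keys, children, sizes } => {
                for (i, &sz) in sizes.iter().enumerate() {
                    if k < sz {
                        return children[i].kth(k); // answer lies in child i
                    }
                    k -= sz;
                    if i < keys.len() {
                        if k == 0 {
                            return Some(&keys[i]); // the separator key itself
                        }
                        k -= 1;
                    }
                }
                None
            }
        }
    }
}

fn main() {
    // Keys in order: 1, 5, 10, 20, 30.
    let tree = Node::Internal {
        keys: vec![10],
        sizes: vec![2, 2],
        children: vec![
            Node::Leaf { keys: vec![1, 5] },
            Node::Leaf { keys: vec![20, 30] },
        ],
    };
    assert_eq!(tree.len(), 5);
    assert_eq!(tree.kth(0), Some(&1));
    assert_eq!(tree.kth(2), Some(&10));
    assert_eq!(tree.kth(4), Some(&30));
    println!("kth queries ok");
}
```

“Number of elements between two values” works the same way: descend toward each bound once, summing the cached sizes of the subtrees that fall entirely inside the range.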

Third, as I understand it (but I am not totally sure), it lets us implement a rope in an easy way, just as `BTreeMap<(), V>`.

It would also enable some optimizations of the good case in `append`.

But the overhead is notable: 4 or 8 bytes per node, depending on the machine word size (there are about n / (1.5B) = n / 9 nodes, where n is the number of elements in the container). I think it is possible to embed these sizes without memory overhead on 64-bit machines (due to alignment).
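A back-of-the-envelope check of that estimate (a sketch; it assumes B = 6, so roughly n / 9 nodes on average, and an 8-byte size field per node):

```rust
fn main() {
    let n = 1_000_000.0_f64;        // elements
    let nodes = n / 9.0;            // ~n / (1.5 * B) nodes with B = 6
    let extra_bytes = nodes * 8.0;  // one usize per node on a 64-bit machine
    let bytes_per_elem = extra_bytes / n;
    // 8/9 of a byte, i.e. under one byte of overhead per element.
    println!("{:.2} extra bytes per element", bytes_per_elem);
    assert!((bytes_per_elem - 8.0 / 9.0).abs() < 1e-9);
}
```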

2. Implement `split_off` in O(log n + “size of one of the results”).

2.1) Split in O(log n) operations and then calculate the size of one of the results.

2.2) There is another algorithm which doesn’t use the fast split but has the same complexity, though (as I understand it) with a greater constant.

3. Implement `split_off` in O(log n + “size of the smaller result”).

Split using O(log n) operations and then calculate the sizes of the trees via two alternating iterators, stopping once one of the sizes is fully counted.
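The counting step of option 3 can be sketched over the already-split halves (using the public iterator API here for illustration; a real implementation would walk the internal nodes). `recount` is a hypothetical helper: it advances one iterator from each half in lockstep and stops as soon as either side is exhausted, so it costs O(min(|left|, |right|)); the other length falls out of the total, which is known from before the split.

```rust
use std::collections::BTreeMap;

// Recover both lengths after a structural split, touching only
// min(|left|, |right|) elements of each side.
fn recount(left: &BTreeMap<i32, ()>, right: &BTreeMap<i32, ()>, total: usize) -> (usize, usize) {
    let mut l = left.iter();
    let mut r = right.iter();
    let (mut nl, mut nr) = (0usize, 0usize);
    loop {
        match l.next() {
            Some(_) => nl += 1,
            None => return (nl, total - nl), // left exhausted first
        }
        match r.next() {
            Some(_) => nr += 1,
            None => return (total - nr, nr), // right exhausted first
        }
    }
}

fn main() {
    let left: BTreeMap<i32, ()> = (0..3).map(|k| (k, ())).collect();
    let right: BTreeMap<i32, ()> = (3..10).map(|k| (k, ())).collect();
    assert_eq!(recount(&left, &right, 10), (3, 7));
    println!("ok");
}
```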

So, which solution is better?

I personally prefer the first one.

What do you think? Is this overhead okay or not?

Copied from GitHub:

First, note that option one can be slightly optimized by only storing composite sizes in internal nodes.

I’m not a big fan of option one - I expect split_off to be a fairly rare operation, and slowing down everything else and increasing the memory footprint solely for its benefit does not seem like a good idea. Since we explicitly support large collections (as large as can fit in memory, basically), the size field would have to be a usize, which cannot be combined with other fields. The application to ropes is interesting, but really wants a separate implementation - the BTreeMap interface is totally wrong for the job (though the node module might be useful). kth_element and measuring the length of a range seem useful, but again, I see them as niche use cases that really want a specialized sequence data structure.

I don’t have strong feelings between options two and three. I think that a variation on option two, where the choice of which side to count is made based on which edge was descended at the root node, would probably be the best. It doesn’t have the overhead of option three, but still won’t take long if you split near the edges.

There is also a fourth option, namely dropping support for constant time calls to len(). This is a somewhat more radical change, but it would easily allow for split_off in O(log n) and avoid the question of how to calculate size completely. Moreover, while it might inconvenience users, it is easy for users to wrap a BTreeMap with something that does track length, recovering the behavior for anyone who actually needs accurate lengths. I suspect that there is much higher demand for is_empty than there is for len.
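The wrapper idea can be sketched as follows. `CountedMap` is a hypothetical name, not an actual std type; it simply pairs a `BTreeMap` with an externally tracked count, recovering O(1) `len()` for anyone who needs it even if the underlying map stopped tracking length.

```rust
use std::collections::BTreeMap;

// A BTreeMap plus an externally maintained element count.
struct CountedMap<K: Ord, V> {
    inner: BTreeMap<K, V>,
    len: usize,
}

impl<K: Ord, V> CountedMap<K, V> {
    fn new() -> Self {
        CountedMap { inner: BTreeMap::new(), len: 0 }
    }

    fn insert(&mut self, k: K, v: V) -> Option<V> {
        let old = self.inner.insert(k, v);
        if old.is_none() {
            self.len += 1; // only a fresh key grows the map
        }
        old
    }

    fn remove(&mut self, k: &K) -> Option<V> {
        let old = self.inner.remove(k);
        if old.is_some() {
            self.len -= 1;
        }
        old
    }

    fn len(&self) -> usize {
        self.len // O(1) regardless of what the inner map tracks
    }

    fn is_empty(&self) -> bool {
        self.inner.is_empty() // cheap even without a length field
    }
}

fn main() {
    let mut m = CountedMap::new();
    m.insert("a", 1);
    m.insert("b", 2);
    m.insert("a", 3); // overwrite: len unchanged
    assert_eq!(m.len(), 2);
    m.remove(&"b");
    assert_eq!(m.len(), 1);
    assert!(!m.is_empty());
    println!("ok");
}
```

Note that `is_empty` needs no count at all, which supports the point that demand for it is higher than for an exact `len`.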

From GitHub:

About the rope: it can be implemented as a B-tree with an implicit key. It needs just two more useful methods on `BTreeMap`: `split_nth` (splits off a given number of leading elements) and `erase_nth`.

"First, note that option one can be slightly optimized by only storing composite sizes in internal nodes." Yes, that’s a really great point! About 6/7 of the nodes are leaves (a rough estimate; in fact, on average, more), so only 1/7 of all nodes would need to carry one extra `usize`, which comes to only about n/63 extra `usize`s. That’s only about 1 bit per element on a 64-bit machine and half a bit on a 32-bit machine. It’s nothing, isn’t it? Currently the size of `InternalNode` is about `11 * (size_of::<K>() + size_of::<V>()) + 112` bytes on a 64-bit architecture, so 8 extra bytes would be okay, I think.

Those were average values. The worst case is `(1/5)n` nodes, of which `1/6` are internal, so `(1/30)n` extra `usize`s: about 2 extra bits per key on a 64-bit architecture and about 1 extra bit per key on a 32-bit architecture. That’s also nothing.
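A quick check of that worst-case estimate (a sketch; the fractions are taken from the discussion above):

```rust
fn main() {
    // Worst case: n/5 nodes, 1/6 of them internal, one extra usize each.
    let internal_frac = (1.0_f64 / 5.0) * (1.0 / 6.0); // = 1/30 usizes per key
    let bits_64 = internal_frac * 64.0; // 64-bit usize
    let bits_32 = internal_frac * 32.0; // 32-bit usize
    // ~2.1 bits/key on 64-bit, ~1.1 bits/key on 32-bit.
    println!("{:.1} bits/key (64-bit), {:.1} bits/key (32-bit)", bits_64, bits_32);
    assert!((bits_64 - 64.0 / 30.0).abs() < 1e-9);
    assert!((bits_32 - 32.0 / 30.0).abs() < 1e-9);
}
```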

If I haven’t made a blunder somewhere, I am confident that each internal node should contain its subtree’s size.

AFAIK `()` does not implement `Ord` so it cannot be the key of a BTreeMap. Maybe the node module could be reused to make a new rope type, but I don't think it will be as easy as `BTreeMap<(), V>`.

`()` does implement `Ord` (trivially returning `Ordering::Equal` from `Ord::cmp`), but the implication is that only one `()` can be in a `BTreeMap` at any point.

Maybe it won’t be so straightforward.

If we add the `size` field, we get many functions which don’t use comparisons (i.e. the `Ord` trait). For example, `BTreeMap::insert_kth` would become `Rope::insert` and `BTreeMap::erase_kth` would become `Rope::erase`.

Maybe the current interface has some restrictions, but in theory they are not necessary, so they could apply only to external use.

I’m more worried about runtime overhead than memory overhead. Currently, `insert` and `remove` go down the tree and then only come back up if a node overfills or underfills. Even when they do need to `split` or `merge` nodes, they rarely have to climb back up the whole tree.

With size fields in every internal node, they would have to climb back up the entire tree every single time. Note that they can’t adjust the sizes on the way down, since they don’t know whether the key exists or not.

`split_off` is a very niche use case, and I don’t want to slow down the most important and fundamental methods on `BTreeMap` just so that a method that very few people will ever use can have the right asymptotic behavior.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.