Hello! I’ve started to implement `split_off` for `BTreeMap` and `BTreeSet` (part of this issue), but there is a decision to make first.
There is an algorithm for splitting a B-tree by value using only O(log n) operations (a description is here), but to recalculate the sizes of the trees after `split_off`, the exact size of each subtree must be known.
So, there are several possible solutions:
- Add a `size` field to the node structure.
I think the benefits are great.
First, it lets us do very flexible splits and merges with only O(log n) operations. Right now it is possible to merge trees in O(log n) operations, but that is not so interesting without split.
Second, some other interesting and useful functions need it, for example `kth_element` or “get the number of elements between two values”, both in O(log n) operations.
Third, as I understand it (but I am not totally sure), it lets us implement a rope easily, just as `BTreeMap<(), V>`.
It is also needed for some good-case optimizations in `append`.
But the overhead is notable: 4 or 8 bytes per node, depending on machine word size (there are about n / (1.5B) ≈ n / 9 nodes, where n is the number of elements in the container). I think it is possible to embed these sizes without memory overhead on 64-bit machines (due to alignment).
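For illustration, here is a minimal sketch of how a per-node `size` field enables `kth_element` in O(height). It uses a plain binary search tree rather than a B-tree (a real B-tree node would carry one size per child), and all names are made up, but the order-statistics idea is the same:

```rust
// Sketch only: a BST augmented with a subtree `size` field.
// All names here are illustrative, not the real BTreeMap internals.
struct Node {
    key: i32,
    size: usize, // number of keys in the subtree rooted here
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn size(n: &Option<Box<Node>>) -> usize {
    n.as_ref().map_or(0, |n| n.size)
}

fn insert(n: &mut Option<Box<Node>>, key: i32) {
    match n {
        None => {
            *n = Some(Box::new(Node { key, size: 1, left: None, right: None }))
        }
        Some(node) => {
            node.size += 1; // maintain the augmentation on the way down
            if key < node.key {
                insert(&mut node.left, key)
            } else {
                insert(&mut node.right, key)
            }
        }
    }
}

/// k-th smallest key (0-based) in O(height), thanks to `size`.
fn kth(n: &Option<Box<Node>>, k: usize) -> Option<i32> {
    let node = n.as_ref()?;
    let left = size(&node.left);
    if k < left {
        kth(&node.left, k)
    } else if k == left {
        Some(node.key)
    } else {
        kth(&node.right, k - left - 1)
    }
}

fn main() {
    let mut root = None;
    for k in [5, 1, 9, 3, 7] {
        insert(&mut root, k);
    }
    assert_eq!(kth(&root, 0), Some(1)); // smallest
    assert_eq!(kth(&root, 2), Some(5)); // median
    assert_eq!(kth(&root, 4), Some(9)); // largest
    println!("kth queries ok");
}
```

“Number of elements between two values” works the same way: descend once per bound, summing the `size` of the subtrees skipped on each side.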
- Implement `split_off` in O(log n + size of one of the results).
2.1) Split in O(log n) operations and then compute the size of one of the results.
2.2) There is another algorithm that does not use the fast split algorithm but has the same complexity, though (as I understand) with a larger constant factor.
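Variant 2.1 could look like this sketch, using the standard `BTreeMap` as a stand-in for the internal post-split trees (the helper name is made up):

```rust
use std::collections::BTreeMap;

// Sketch of 2.1 (hypothetical helper, not the real implementation):
// after an O(log n) structural split, walk just one of the results to
// count it; the other size follows from the total known before the split.
fn sizes_by_counting_one<K, V>(right: &BTreeMap<K, V>, total: usize) -> (usize, usize) {
    let right_len = right.iter().count(); // O(size of `right`)
    (total - right_len, right_len)
}

fn main() {
    let mut map: BTreeMap<i32, i32> = (0..10).map(|i| (i, i * i)).collect();
    let total = map.len();
    let right = map.split_off(&7); // std's split_off as a stand-in
    let (left_len, right_len) = sizes_by_counting_one(&right, total);
    assert_eq!((left_len, right_len), (7, 3));
    println!("left={left_len}, right={right_len}");
}
```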
- Implement `split_off` in O(log n + size of the smaller result).
Split in O(log n) operations and then compute the sizes of the trees via two alternating iterators, stopping as soon as one of the sizes is determined.
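The counting step of this variant could be sketched as follows (again a hypothetical helper over the standard `BTreeMap`, standing in for the internal post-split trees):

```rust
use std::collections::BTreeMap;

// Sketch only (names are made up): after the O(log n) split, advance an
// iterator over each half alternately and stop as soon as one half is
// exhausted; the other size is `total` minus the counted one.
// Cost: O(size of the smaller half).
fn sizes_alternating<K, V>(
    left: &BTreeMap<K, V>,
    right: &BTreeMap<K, V>,
    total: usize,
) -> (usize, usize) {
    let (mut a, mut b) = (left.iter(), right.iter());
    let mut count = 0;
    loop {
        if a.next().is_none() {
            return (count, total - count); // left half exhausted first
        }
        if b.next().is_none() {
            return (total - count, count); // right half exhausted first
        }
        count += 1;
    }
}

fn main() {
    let mut map: BTreeMap<i32, ()> = (0..10).map(|i| (i, ())).collect();
    let right = map.split_off(&3); // std's split_off as a stand-in
    assert_eq!(sizes_alternating(&map, &right, 10), (3, 7));
    println!("sizes ok");
}
```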
So, which solution is better?
I personally prefer the first one.
What do you think? Is this memory overhead acceptable?