Summary
Implement BTreeMap::drain(range)
and BTreeSet::drain(range)
following rfc#1257-drain-range-2.
Motivation
As BTreeMap
and BTreeSet
are ordered collections, it is reasonable to have a drain
method which takes a range parameter to allow range removal.
Currently, there are .range(range)
and .drain_filter(pred)
methods available, and the .drain(range)
function can be achieved by .drain_filter(|k| range.contains(k))
. It requires full traverse of the tree, in O(n log(n))
, but for a B-Trees, only O(m log(n))
is needed, where m
is the length of the range. Having .drain(range)
method can undoubtedly benefit the situations where it needs to remove a range from a BTreeMap
or BTreeSet
.
There have be previous discussions on the furum such as BTree{Map,Set} range removal and BTreeMap::drain with a range.
Guide-level explanation
The signature of BTreeMap::<K, V>::drain
method will be
pub fn drain<T: ?Sized, R>(&mut self, range: R) -> DrainRange<'_, K, V>
where
T: Ord,
K: Borrow<T> + Ord,
R: RangeBounds<T>,
{ /* ... */ }
Like .range
, It returns an iterator that drains out all the nodes bounded in the given range, and like .drain_filter
, the drained items will be definitely removed from the tree when the returned iterator is dropped.
Here is a short example:
use std::collections::BTreeSet;
let mut set: BTreeSet<i32> = (0..=10).collect();
let drained: Vec<i32> = set.drain(5..=8).collect();
assert_eq!(set.into_iter().collect::<Vec<_>>(), [0, 1, 2, 3, 4, 9, 10]);
assert_eq!(drained, [5, 6, 7, 8]);
Reference-level explanation
Public API
pub mod alloc {
pub mod collections {
pub mod btree_map {
impl<K, V> BTreeMap<K, V> {
pub fn drain<T: ?Sized, R>(&mut self, range: R) -> DrainRange<'_, K, V>
where
T: Ord,
K: Borrow<T> + Ord,
R: RangeBounds<T>,
{ }
}
pub struct DrainRange<'a, K: 'a, V: 'a> {}
impl<K, V> Iterator for DrainRange<'_, K, V> {
type Item = (K, V);
fn next(&mut self) -> Option<(K, V)> {}
fn min(self) -> Option<(K, V)> {}
fn max(self) -> Option<(K, V)> {}
}
impl<K, V> FuseIterator for DrainRange<'_, K, V> {}
impl<K, V> DoubleEndedIterator for DrainRange<'_, K, V> {
fn next_back(&mut self) -> Option<(K, V)> {}
}
impl<K, V> Drop for DrainRange<'_, K, V> {
fn drop(&mut self) {}
}
impl<K, V> Debug for DrainRange<'_, K, V> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {}
}
}
pub mod btree_set {
impl<T> BTreeSet<T> {
pub fn drain<K: ?Sized, R>(&mut self, range: R) -> DrainRange<'_, T>
where
K: Ord,
T: Borrow<K> + Ord,
R: RangeBounds<K>,
{ }
}
pub struct DrainRange<'a, T: 'a> {}
impl<T> Iterator for DrainRange<'_, T> {
type Item = T;
fn next(&mut self) -> Option<T> {}
fn min(self) -> Option<T> {}
fn max(self) -> Option<T> {}
}
impl<T> FuseIterator for DrainRange<'_, T> {}
impl<T> DoubleEndedIterator for DrainRange<'_, T> {
fn next_back(&mut self) -> Option<T> {}
}
impl<T> Drop for DrainRange<'_, T> {
fn drop(&mut self) {}
}
impl<T> Debug for DrainRange<'_, T> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {}
}
}
}
}
Overview
There are two key points in DrainRange
:
- it must act like an iterator, i.e., it must provide capability of iterations over elements bounded by range;
- it is intended to do range removal after
.drain(range)
is called.
Let's highlight the two points one by one.
Range Iteration
DrainRange
iterators the elements in range like what the Range
and RangeMut
does, except that DrainRange
takes the KVs' ownership and returns to its caller. But for efficiency consideration, unlike DrainFIlter
, it does not immediately remove the node from the tree when calling next
and next_back
. Instead, it leaves the iterated nodes uninitialized. The drained nodes are actually removed when dropping the DrainRange
iterator.
Range Removal
Range removal is generally more efficient that single-node removal because of fewer balancing operations. That is the reason why we choose range removal but not multiple single-node removals.
Before removal
After draining, there is a big "hole" in the tree, and what we do is to remove the middle empty nodes, and collapsing the left and right half nodes, for each layers.
n2 +----+----+----+
| A | E | I |
+----+----+----+
/ \
: ... :
: :
n1 / \ n4
+----+----+----+ n3 +----+----+----+
| B | C | D | ... | F | G | H |
+----+----+----+ +----+----+----+
Imagine on the B-Tree as above, after draining a range (e.g. D..=F
), the following changes will happen:
- For the lowest common ancestor (LCA) layer (e.g.
n1
), a subrange (e.g.E
) of no more than one node is emptied, between one edge node (excluded, called left edge bound, or left edge) to another edge node (excluded right edge). - For any layer except the LCA layer, there is a left node (eg.
n1
) of which a suffix is emptied, and a right node (eg.n4
) of which a prefix is emptied. All the nodes between the left node and the right node is completely emptied. Call the empty and non-empty bound of the left node and the right node as left edge and right edge, respectively as the LCA layer. - For any layer, any edge node between two empty KV node is emptied, as well as its descendants if it is non-leaf.
Bottom-up collapse algorithm
As we already know the left leaf edge and the right leaf edge after iteration, range removal can be achieived in a bottom-up way.
Here the terminology "collapse" means a combination of merging and stealing, similar with the implementation of single KV removal in BTree, but more complicated. The next section is about how to collapse the two half nodes into one.
To clarify, here are some other important terminologies:
-
left_node
andright_node
represent the two nodes which have an empty suffix and an empty prefix, respectively.left_node
andright_node
might be the same node in the LCA layer. -
left_edge
andright_edge
represent the edge bound of empty and non-empty nodes ofleft_node
andright_node
, respectively. -
left_parent
andright_parent
represent the parent edges ofleft_node
andright_node
, respectively. They are both optional. -
collapsed
represent the edge whichleft_edge
andright_edge
collapsed into. It might contains descendants for non-leaf layers.
The collapse algorithm is as the following pseudocode:
fn remove_range(left_edge: Edge, right_edge: Edge) -> Edge {
let mut left_node = left_edge.to_node();
let mut right_node = right_edge.to_node();
let leaf = collapsed(left_node, right_node);
left_node = left_edge.get_parent_node();
right_node = right_edge.get_parent_node();
while (left_node, right_node) not None {
let collapsed: Edge = collapse(left_node, right_node);
let left_parent = left_edge.get_parent_edge();
let right_parent = right_edge.get_parent_edge();
collapsed.link_as_left_child(right_parent.left_kv());
left_node = left_parent.to_node();
right_node = right_parent.to_node();
}
leaf
}
After and before each collapse, collapsed
edge is linked as a left child of right_parent
, and left_parent
has no right child. This makes the left KV node of left_parent
(i.e. left_edge
of the next collapse) free to move, and it is very useful during collapse.
The collapse
Collapse is the most complicated part of this algorithm. Let's reveal its details.
The first thing is to recursively deallocate all the internal edge nodes (including the empty left_edge
) located on left_node
and right_node
, since they are empty.
Successively, to reuse the code of BalancingContext
to rebalence the tree, we shift all the KVs and edges at the right of right_edge
(inclusive) to compress the right_node
, and then using the merging or stealing strategy to rebalance the tree. We will handle this later.
Collapse of non-LCA
For non-LCA layers, take the following figure for example, where we the drained range is D..=G
.
+---------+====+ +====+---------+
| A | E | ... | F | K |
+---------+====+ +====+---------+
`left_parent`/ \ `right_parent`
/ \
`left_node`/ \ `right_node`
+----+----+====+ +====+----+----+
| B | C | D | ... | G | I | J |
+----+----+====+ +====+----+----+
/ | : / | |
B' C' H I' J'
(`left_edge` is empty) (H is `right_edge` as a child)
First check if sum of the length of drained left_node
and right_node
is no larger than CAPACITY
. If yes, it means we can simply merge left_node
and right_node
into a single node, make it a child of right_parent
, and leave another empty node as the child of left_parent
, being sure that the next collapse iteration will deallocate it. (In the figure above, mrege |B|C|
with |I|J|
into a single node, and let it be a left child of K
.)
Note that this merge operation is quite simple, and unlike that of single KV removal, it does not touch any nodes in the parent layer at all (keep in mind that left_edge
is always empty before and after the collapse). There are left_node.len() + right_node.len()
KV data, and left_node.len() + right_node.len() + 1
edge children before the merge, which is consistent with after the merge. (In the figure above, merged node |B|C|I|J|
has exactly 5 children B', C', H, I', J'
.)
Things are more complicated when left_node
and right_node
cannot be simply merged into one, which means there are more than CAPACITY
KV data in left_node
plus right_node
.
We construct a BalanceContext
by pushing up the right most non-empty KV datum of left_node
to the left of right_parent
, naming the pushed KV as right_parent_kv
, then linking the left_node
as a left child of right_parent_kv
, and we get a valid BalanceContext
consisting of right_parent_kv
, left_node
, and right_node
. (In the above figure, push C
up to replace F
, and the BalanceContext
is C
as parent, |B|
as left child, and I|J
as right child.)
After that, rebalance left_node
and right_node
via stealing. Here left_node
and right_node
are not really collapsed into a single node, but we can simply regard the left_node
as the collapsed new node, move the right_parent
pointer to its left adjancent edge (actually the left_node
before), then go on the next collapse.
Collapse of LCA
For the LCA node, simply shifting all edges and KVs of the right part can make it a single node. If it underflows after collapse, rebalance the node merging or (bulk) stealing like that in single KV removal. Go on to fix its ancestors and notify the caller to handle pop internal root if the root node becomes empty.
Cornor cases
When pushing up KV datum beside left_edge
, if the right_parent
node happens to be full, we first try to push up to the left_parent
and do stealing in a semmetric way. If both left_parent
and right_parent
is full, push it up to the nearest ancestor which has an empty KV to place it, then skip all the passed-by layers where both left_node
and right_node
are full (there are actually no work to do with these layers), and continue the next collapse on the layer where the descentant KV has been pushed up to.
If pushing up to its ancestor leaves left_node
and right_node
underflowing, do stealing on each of them as where we do in single KV removol.
Drawbacks
- The recursive empty edge deallocation operation might not be efficient enough and has risk of running out of stack space.
- Compared to
Vec::drain
,VecDeque::drain
andString::drain
, which take a index range as input parameter, this function signature might be confusing where the input range parameter means a range of keys, instead of indices.
Alternatives
Interval Deletion in B-trees proposed a top-down range deletion algorithm. In this algorithm, before descendant to the children during the binary search, it first deletes nodes with key bounded within the range, which makes ordered iterating of the deleted range is not possible.
Recursively deallocating the empty edge nodes can be done by a traverse from the left_edge
to the right_edge
, which deallocates all the empty edge nodes before collapsing in advance. Or even this can be done when iterating the DrainRange
.
Unresolved questions
- Should the method be named
drain_range
?