BTreeMap: searching with a predicate?


#1

The BTreeMap::range and BTreeSet::range functions are very handy for a variety of searching tasks; for example, they can be used to find the first element larger than something, or the last element smaller than or equal to something, etc. However, these functions have a common drawback: they require the caller to be able to construct something of a type Q for which K: Borrow<Q> holds. In some cases this might be difficult, for example if your key type is (i32, SomeLargeType) and you want to search for the first key whose first coordinate is at least 5.

One solution would be to add the method BTreeMap::range_by, with the signature

  fn range_by<F: FnMut(&K) -> Ordering>(&self, f: F) -> impl Iterator<Item=(&K, &V)>

(and similarly for BTreeSet) where the function f is supposed to return Ordering::Equal for some consecutive range of values, Ordering::Less for values before that range, and Ordering::Equal for values after that range. The return value would iterate over all values for which f returns Ordering::Equal.

There’s some precedent for this in the slice type, which has binary_search_by in addition to binary_search.

What do you think? Is there a better solution?

EDIT: I forgot to say it, but I want the solution to be as efficient as BTreeMap::range. That is, it should have a one-time cost of O(log n) (where n is the size of the map), and amortized cost of O(1) for each iteration.


#2

There is. BTreeMap has an iter() method, and Iterators have a .filter() method that takes a predicate.


#3

Unfortunately, that does not solve @jneem’s problem. An arbitrary predicate isn’t good enough. You want something that defines an ordering such that it can be implemented efficiently by traversing only a subset of the map. A btreemap iterator’s filter method cannot be implemented that way because its semantics do not require it to define an ordering.


#4

Sorry, I forgot to include the bit about time complexity in the original post, so I’ve added it.


#5

This will be slower than the current search methods (because there needs to be a binary search within a single tree node, meaning more comparisons), but not algorithmically slower since nodes are currently small (B=6 as I recall).


#6

I don’t understand why it would need to be any slower than the current range function; can you elaborate? Whatever the search strategy is for a node, can’t the same strategy be used for both range and range_by?


#7

The strategy within a node will be different. When you know the boundary key, you can just keep track of the next highest element and do a linear scan of the node, when you don’t know it you have to do a bisection, and the comparison function gets called a few times with some unpredictable branching.


#8

But you can also do a linear scan with the comparison function, no?


#9

You would just get a bunch of bools, nothing sufficiently useful to know which node to visit next.

There’s a reason the higher-level interface you propose is mostly used for binary trees. It has a performance cost on most cache-friendly structures.


#10

I asked for a function returning std::cmp::Ordering, so I’ll get a bunch of those instead of a bunch of bools. And unless I’m missing something, I do think it’s sufficient to know which node to visit next. For example, suppose I’m looking for the lower bound on my range. I do a linear search through the current node, evaluating my comparison function on every element. Then I find the first element that returned Ordering::Equal, and I recurse on the subtree to the left of that element. (I could have done a binary search instead of a linear search, but as you pointed out it would be less cache-friendly.)

Really, I think that the implementation and performance characteristics of my proposed function would be identical to those of the existing range function. I might have a try at writing an implementation on the weekend, if that would help convince you…