[Feature Request] Expose bulk_build_from_sorted_iter in BTreeMap

BTreeMap's from_iter implementation collects all values into a vec and sorts them, before passing into bulk_build_from_sorted_iter. Sometimes, the data comes sorted.

For my use case, from_iter is already insanely fast; I am looking mainly to avoid the extra Vec allocation necessary for sorting. Use case is a dictionary of 1mil+ terms each containing some misc info, parsed using a custom iterator (not in contiguous Vec<_> form so I think this can't possibly be compiler-optimized away). This is a wasm app as well (where the allocation is never really "reclaimed" due to the ArrayBuffer never shrinking again). Not sure if there's an alternative to achieve the allocation removal.

But I imagine this might also be useful for other use cases if the sort operation itself is expensive.

Maybe: also a from_sorted_unique_iter method - useful if the equality comparison in DedupSortedIter is costly.

What makes you think that? I"m pretty sure it just traverses the tree in order.

See the source:

Derp! I misread "from_iter" as "into_iter". I'm so sorry for adding nothing useful to the discussion.

1 Like

One interesting question here: are you expecting this to double-check the sorting? For example, should it be an unsafe from_sorted_unique_iter_unchecked?

Because I'd hope that I can trust a BTreeMap<usize, usize> at least, since it's a known and trusted type, even if I can't necessarily trust an arbitrary BTreeSet<T> where I don't know if T: Ord is well-behaved.

3 Likes

Wouldn't it work to have from_iter compare each item as it comes in in a loop and if the input is already sorted (i.e. compare with the last item read), take the optimal path; otherwise as soon as we detect an out-of-order item, take all the remaining items into a vector, sort it and append the rest as before?

In other words; the fast path becomes automatic; there's no need for a separate API.

5 Likes

That's an interesting idea. There may also be room for "mostly" sorted sequences -- only collect those that go backwards on the side, but continue with the forward ones directly, then sort and merge the others later. Although I guess one MAX out of order will interrupt the rest anyway.

1 Like

Thanks for the replies!

ps: I just found the rfc repo and this issue Make BTreeMap's from_sorted_iter public · Issue #1805 · rust-lang/rfcs · GitHub, sorry for the duplicate post.

I hadn't considered this when I posted, but was thinking and still favouring an unchecked version as a safe method, preferably named from_sorted_iter:

  • @dtolnay raised some great points in the above issue on the assertion, and binary search:

    I would prefer not to assert. The caller can wrap their iterator in an iterator adapter containing the assert if they want that.

    binary search also does not verify that its input is sorted.

  • As an api user, I think its pretty clear as well from the potential method name "from_sorted_iter" that the onus of ensuring the iterator is sorted would fall on the user. (even more so than binary_search)
  • As for T:Ord, my take is that its fine to leave that responsibility to the user much like vec.sort() methods (and anything that relies on a properly defined Ord I guess) which are also not unsafe unchecked.
  • Maximum performance (without the checks)

Though, this is mostly just going by precedent, I'm not quite familiar as to what warrants labelling an api unsafe.

Hmm to clarify, how would the appending work exactly?

E.g: vec![1, 2, 3, 4, 2].into_iter() (2 needs to be inserted before 3, but 1,2,3,4 was already consumed)

  • We could clone the iterator (at the start) then collect the entire sequence into a Vec once 4, 2 is detected but it may be costly to re-iterate (e.g. custom iterators with a fair bit of stuff going on inside, and having to restart the bulk_push subroutine from element 0 presumably).
  • Or collect 1,2,3,4 (up until an out of order element) into a Vec right from the start, but this reintroduces the need for an extra Vec<_> allocation.
  • Do something "smart" with the bulk_push subroutine -- Is this it? I'm not too familiar with BTrees to know if this is possible. My limited understanding is that the entire sorted iterator needs to be present in "one go" for bulk_push's benefits to be preserved. :thinking:

Sounds interesting too, but might be subject to the same issue per the above: how to deal with the already consumed elements.

1 Like

The idea as I understood it was that you'd already be pushing the pre-sorted items into a new map, then you'd sort the rest on the side and merge them together -- using BTreeMap::append or a similar internal mechanism.

2 Likes

It depends what the invariants of the type that unsafe code can rely on are deemed to be.

For example, with the current API surface I can know that BTreeSet<usize> is strictly increasing, and thus I could do something like rely on that for knowing there's no overlap to return a bunch of &muts, even if someone else passed me the tree.

Whereas if from_sorted_iter is safe and doesn't check that it's actually sorted, then the set I get could actually be [3, 2, 2, 1], even though usize is a type that I trust and which has a correct Ord impl.

4 Likes

I see, thanks @cuviper @scottmcm for clarifying.

Speaking as a user, I think I would prefer an unsafe unchecked method in that case if indeed necessary, since the main point of such an api would be to reap maximum performance gains.

The automatic fast path ideas are interesting as well if easily feasible, but I personally wouldn't mind having to slap an unsafe { from_sorted_iter(...) }, and giving up a single from_iter method's convenience. (also, digging around to find this method after I've indeed identified the application bottleneck to be BTree construction - which I speculate is rarely the case).

Might also avoid having any negative performance impact on existing use cases of randomly-sorted sequences.

As the append implementation is currently, this looks identical to collecting the pre-sorted items into a Vec in terms of memory usage. It currently converts both BTrees with into_iter and interleaves them into a new BTree with the bulk building function.

Not quite sure on other existing internal mechanisms. I'll try to dig around in the coming days and post back here if I find anything worthwhile :grinning:

This assumes that there is no “safe-transmute” that would allow taking a BTreeMap<Reverse<usize>, ()> and converting it to a BTreeMap<usize, ()>. I’m not sure if there is any API allowing that today, but it seems possible to have a safe API that does (since BTreeMap doesn’t rely, or make guarantees, about the ordering).

I think that's the question being discussed, not a fait accompli.

I agree that if you just have an unknown type in a set then you can't know that its Ord is well-behaved. But it seems silly to me to say that you can't even trust a set of a trusted type like usize, where you know that its Ord is definitely correct.

I'm not sure, the peak memory usage might be lower with fine-grained nodes being allocated and freed on the fly, rather than an entire Vec buffer on the side.

I believe the BTreeSet documentation already guarantees that using unsafe code to modify a BTreeSet<usize> to be out-of-order cannot be a memory safety error (emphasis mine):

It is a logic error for an item to be modified in such a way that the item’s ordering relative to any other item, as determined by the Ord trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified (it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.

ETA: I suppose you could argue that it's only saying "if this somehow were to happen, in a way that is otherwise permitted, then none of the BTreeSet methods specifically would invoke UB", but not saying that you are actually permitted to do it (because BTreeSet doesn't provide any APIs that would allow you to use unsafe code for this in a way that is sound, because it doesn't make any guarantees about its layout and things like get() only give you an &, which means that modifying it is specifically called out as UB by the Reference). But still, given that documentation, if I was writing unsafe code myself, I certainly wouldn't feel comfortable relying on "a BTreeSet<usize> passed from someone else's crate" to be in-order for memory safety.

1 Like

I think that whole section should be rewritten -- the intent of that documentation has nothing to do with items being modified. It doesn't matter whether the items are modified or not, what matters is whether Ord returns contradictory results, i.e. results inconsistent with a linear order (which may happen regardless of whether anything is being modified).

Personally I think it would be totally sufficient to say that if Ord ever tells BTreeSet results inconsistent with a linear order, a panic may happen, or BTreeSet will pick a subset of those comparisons that does result in a linear order. In other words: if BTreeSet ever makes any redundant comparisons, and the results are contradictory, it will either panic or ignore some of them.

This was previously discussed at length here.

Using unsafe to modify items owned by BTreeMap<usize> would be unsound. I agree with @scottmcm that BTreeMap<usize> should always be in order, so any such new API should either verify the order or be unsafe.

And that's not possible for a usize. There's no way to change the value of a usize inside a set.

(And before anyone says they could do it with unsafe, any attempt to do so is UB because it's behind a reference but not in an UnsafeCell, and thus can't happen.)

2 Likes