The current implementation of slice::get_disjoint_mut warns of an O(n^2) check for overlaps between the arguments. While this shouldn't generally be a major issue, it's also an entirely unnecessary one, since the check can be done in O(n log n). (Note that the code below is meant to be demonstrative rather than intended as the actual implementation.)
```rust
// stand-in for the stdlib's private `Sealed` pattern, so this sketch compiles
trait Sealed {}
impl<const N: usize> Sealed for [std::ops::Range<usize>; N] {}

unsafe trait GetDisjointMutIndices: Clone + Sealed {
    fn is_valid(&self, len: usize) -> bool;
}

// trivial to implement for `usize`, since you just sort and check adjacent
// indices for equality (see the sketch after this block)
unsafe impl<const N: usize> GetDisjointMutIndices for [std::ops::Range<usize>; N] {
    fn is_valid(&self, len: usize) -> bool {
        // every range must be well-formed and end within the slice
        if !self.iter().all(|r| r.start <= r.end && r.end <= len) {
            return false;
        }
        let mut indices = self.clone();
        indices.sort_by_key(|r| (r.start, r.end));
        // change `<=` to `<` for `RangeInclusive`
        indices.windows(2).all(|w| w[0].end <= w[1].start)
    }
}
```
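For comparison, here is what the `[usize; N]` case mentioned in the comment might look like, reusing the `Sealed` stand-in from above (again demonstrative, not the actual implementation):

```rust
impl<const N: usize> Sealed for [usize; N] {}

unsafe impl<const N: usize> GetDisjointMutIndices for [usize; N] {
    fn is_valid(&self, len: usize) -> bool {
        // every index must be in bounds
        if !self.iter().all(|&i| i < len) {
            return false;
        }
        let mut indices = *self; // `[usize; N]` is `Copy`
        indices.sort_unstable();
        // after sorting, any duplicate indices are adjacent
        indices.windows(2).all(|w| w[0] != w[1])
    }
}
```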
The range version is correct because if there exist two overlapping ranges a and b (assume without loss of generality that a is the one which starts first), then a.start <= b.start < a.end. In this case, after sorting:

- either no range has a start between a and b, in which case they end up adjacent to each other and the overlap is caught by the adjacent-pairs check;
- or some range c starts between a and b, in which case c necessarily also overlaps with a, and that is caught by the adjacent-pairs check. (Note the edge case of c.start == c.end == a.start: it sorts as c, a, b, not a, c, b, since we sort by the pair (r.start, r.end), so c won't interfere with anything in that case.)

Either way, at least one overlap is found between adjacent elements whenever an overlap exists, and validity is correctly rejected.
This could come with an if N > FAST_ALGORITHM_THRESHOLD check which falls back to the O(N^2) algorithm, in case the sorting approach proved to be slower for the typical case of N == 2.
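A sketch of that shape (the function name and FAST_ALGORITHM_THRESHOLD value are made up, and the per-range bounds checks are elided; the real cutoff would need benchmarking):

```rust
use std::ops::Range;

// hypothetical cutoff; the right value would have to be measured
const FAST_ALGORITHM_THRESHOLD: usize = 8;

fn ranges_disjoint<const N: usize>(ranges: &[Range<usize>; N]) -> bool {
    if N <= FAST_ALGORITHM_THRESHOLD {
        // O(N^2) pairwise check: trivially unrollable for tiny N
        (0..N).all(|i| {
            ((i + 1)..N).all(|j| {
                ranges[i].end <= ranges[j].start || ranges[j].end <= ranges[i].start
            })
        })
    } else {
        // O(N log N): sort by (start, end), then compare adjacent pairs
        let mut sorted = ranges.clone();
        sorted.sort_by_key(|r| (r.start, r.end));
        sorted.windows(2).all(|w| w[0].end <= w[1].start)
    }
}
```

Since N is a const generic, the branch is resolved at monomorphization time and only one arm survives in the emitted code.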
The sorted check can be performed while verifying that the indices are already in sorted order, so you can go straight to the check without sorting first. This would be a very simple fast path for users who are able to provide sorted indices.
If they're sorted and verified to be non-overlapping, you don't even need the extra bounds-checking loop: just check the first start and the last end.
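A sketch of that single pass, assuming half-open ranges (the helper name is hypothetical):

```rust
use std::ops::Range;

// one pass verifies sortedness, well-formedness, and disjointness together;
// only the final `end` then needs a bounds check
fn sorted_disjoint_in_bounds(ranges: &[Range<usize>], len: usize) -> bool {
    let Some(last) = ranges.last() else { return true };
    ranges
        .windows(2)
        .all(|w| w[0].start <= w[0].end && w[0].end <= w[1].start)
        && last.start <= last.end
        // the chain r0.end <= r1.start <= r1.end <= ... makes this sufficient
        && last.end <= len
}
```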
It's by design that the implementation doesn't bother doing this.
For any realistic number of indices passed, sort_by_key is actually quadratic anyway, since small inputs go through its insertion-sort path. So it's not worth a bunch of extra checks around the quadratic function, which in the vast majority of cases only does two or three comparisons anyway.
If you really are passing enough indices that this matters, then you don't actually want get_disjoint_mut to be checking them for uniqueness on every call anyway. What you want in those cases is to build up a pre-checked set that you can re-use, then call get_disjoint_unchecked_mut with the set you've already pre-checked.
(Or, just use the technically-correct answer: any call is O(1) because it's monomorphized and doesn't depend on the size of the runtime input, so there's nothing to worry about. AKA the "I say my hash table of strings is O(1) even though it's really not" approach.)
I'm not sure what you mean by it being "by design" that it "doesn't bother doing this." While it's true that sort_by_key is quadratic for small inputs, that doesn't seem like a particularly good reason not to support large numbers of indices when it's easy to do so?
It wouldn't be a bunch of extra checks, I think? I'm not sure where in the pipeline this gets resolved, but if you added the if N > THRESHOLD check, that's a single branch that should get resolved before much else happens. That seems like a pretty trivial compile-time cost compared to all the monomorphization and inlining already happening for each distinct value of N. Could you point me to precedent that says the standard library prioritizes minimal compile-time benefits over potentially (though admittedly rarely) significant run-time speed benefits in this way?
Using get_disjoint_unchecked_mut seems like poor advice for most code. I think the vast majority of code should avoid unsafe no matter how clever it thinks it is, and certainly shouldn't reach for it for what are, at best, minor performance gains.
(While almost certainly not worth bloating the standard library for, it could be interesting to have a type for "certified checked indices" which is only constructible by checking them and which allows safe repeated calls without repeated checks. But this seems like overkill for an uncommon use-case.)
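A rough sketch of what such a type could look like (entirely hypothetical, not a proposal; it leans on the real get_disjoint_unchecked_mut for the final call):

```rust
use std::ops::Range;

/// Hypothetical "certified checked indices": constructing it runs the
/// disjointness/bounds check once, after which it can be reused safely.
pub struct CheckedIndices<const N: usize> {
    ranges: [Range<usize>; N],
    len: usize, // the slice length the check was performed against
}

impl<const N: usize> CheckedIndices<N> {
    pub fn new(ranges: [Range<usize>; N], len: usize) -> Option<Self> {
        let mut sorted = ranges.clone();
        sorted.sort_by_key(|r| (r.start, r.end));
        let ok = sorted.iter().all(|r| r.start <= r.end && r.end <= len)
            && sorted.windows(2).all(|w| w[0].end <= w[1].start);
        ok.then(|| Self { ranges, len })
    }

    pub fn get_mut<'a, T>(&self, slice: &'a mut [T]) -> Option<[&'a mut [T]; N]> {
        // SAFETY: `new` verified the ranges are disjoint and in bounds for
        // any slice at least `self.len` long
        (slice.len() >= self.len)
            .then(|| unsafe { slice.get_disjoint_unchecked_mut(self.ranges.clone()) })
    }
}
```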
Calling out to a sorting algorithm has intense code-size implications, because sorting is currently always monomorphized in the Rust stdlib.
Generating a sorting network is theoretically viable here due to the monomorphization on size, though I suspect it's an open research question how to actually write that in Rust.
More viable is probably a custom range insertion sort that bails on detected overlap, since insertion sort is what gets invoked for small inputs anyway.
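Something along these lines, perhaps (a sketch assuming half-open ranges; the name is hypothetical):

```rust
use std::ops::Range;

// insertion sort specialized to ranges that returns early the moment an
// overlap (or malformed/out-of-bounds range) shows up
fn insertion_sort_and_check(ranges: &mut [Range<usize>], len: usize) -> bool {
    for i in 0..ranges.len() {
        if ranges[i].start > ranges[i].end || ranges[i].end > len {
            return false;
        }
        // shift element `i` left to its sorted position by (start, end)
        let mut j = i;
        while j > 0 && (ranges[j].start, ranges[j].end) < (ranges[j - 1].start, ranges[j - 1].end) {
            ranges.swap(j - 1, j);
            j -= 1;
        }
        // the sorted prefix was already pairwise disjoint, so only the two
        // new adjacencies need checking
        if j > 0 && ranges[j - 1].end > ranges[j].start {
            return false;
        }
        if j < i && ranges[j].end > ranges[j + 1].start {
            return false;
        }
    }
    true
}
```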
During libs-api meetings we spent a bit of time on this. The O(n²) approach was chosen because the expectation is that users will mostly use N=2 and maybe N=3 keys. For those cases the dumb approach and small code size make it easier for the optimizer to eliminate the checks entirely. Basically, the asymptotics don't matter for those.
If you need many indices and the check is too expensive you can use get_disjoint_unchecked_mut. If you have a concrete use case for many indices that happens to have sorted inputs, then we might consider adding that fast path.
Also note that this is array-based, not slice-based, so it's less likely that arbitrary, dynamic, user-supplied inputs will slip into this and require expensive runtime checks, simply because the N needs to be plumbed through to some point where a concrete constant is supplied.
Fair enough. I don't have any concrete use case; it just feels like a relatively easy optimization that would be easily pruned at compile time in cases where N is actually small.
Though running some checks in Godbolt (inspired by Riking's comment) suggests it may still have adverse effects on code size from pulling in copies of the sort functions. (These are never actually called, because they're inlined and optimized out, so perhaps they would later be removed by some dead-code-elimination pass that Godbolt's pipeline omits; but still, the compile-time overhead may be bigger than I'd assumed.)
Out of curiosity, since you're familiar with the libs-api meetings on this, can I ask why the chosen interface was an array (requiring homogeneous input types) rather than a tuple (which would have allowed mixing of range types and such, and, e.g., effectively subsumed .split_at_mut(i) as a .get_disjoint_mut((..i, i..)) call)?
Due to the lack of variadic generics we'd lose it being generic over N, which some matrix code might need. And iirc the support for ranges came fairly late; it started out as getting individual indices, like the HashMap version of the same function.
Fair enough. I could pretty easily imagine an interface based on a GetDisjointMutIndices trait which is implemented both for small tuples of (potentially heterogeneous) GetDisjointMutIndex types and for arrays of GetDisjointMutIndex types, but that increases interface complexity a fair bit.
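For instance, something shaped like this (purely hypothetical; it reuses the GetDisjointMutIndex name for the per-index trait, shows only the two-element tuple impl, and elides well-formedness checks for brevity):

```rust
use std::ops::Range;

// hypothetical per-index trait: anything that resolves to a (start, end) span
trait GetDisjointMutIndex {
    fn span(&self) -> Range<usize>;
}

impl GetDisjointMutIndex for usize {
    fn span(&self) -> Range<usize> {
        *self..*self + 1
    }
}

impl GetDisjointMutIndex for Range<usize> {
    fn span(&self) -> Range<usize> {
        self.clone()
    }
}

// hypothetical collection trait, implementable for heterogeneous tuples as
// well as homogeneous arrays
trait GetDisjointMutIndices {
    fn is_valid(&self, len: usize) -> bool;
}

impl<A: GetDisjointMutIndex, B: GetDisjointMutIndex> GetDisjointMutIndices for (A, B) {
    fn is_valid(&self, len: usize) -> bool {
        let (a, b) = (self.0.span(), self.1.span());
        a.end <= len
            && b.end <= len
            && (a.end <= b.start || b.end <= a.start) // disjoint either way around
    }
}
```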
Yeah, it's not a major issue; you can always convert all your indices to the same range type (though getting back slices rather than single items for your single indices by writing i..(i + 1) would be mildly obnoxious). I was mostly just wondering what motivated the decision.
In large part because it's easier for the optimizer to see through. If you have a constant set of indices, all of which are unique, it's fairly simple for the optimizer to unroll the assert!((0..N).all(|i| ((i+1)..N).all(|j| !overlap(i, j)))) into a bunch of comparisons that it knows are true. Having to see through a sort first is a lot more to peel through.
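Spelled out with a concrete overlap test (a hypothetical helper, assuming half-open ranges):

```rust
use std::ops::Range;

// the O(N^2) shape: for a constant set of provably disjoint indices the
// optimizer can unroll this and fold every comparison to a known `true`
fn assert_pairwise_disjoint<const N: usize>(ranges: &[Range<usize>; N]) {
    let overlap =
        |i: usize, j: usize| ranges[i].start < ranges[j].end && ranges[j].start < ranges[i].end;
    assert!((0..N).all(|i| ((i + 1)..N).all(|j| !overlap(i, j))));
}
```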
This is going to be true for most use cases in practice, but isn't technically correct in full generality. For example, you could call:
```rust
let indices: [usize; usize::BITS as usize] = std::array::from_fn(|x| x);
v.get_disjoint_mut(indices);
```
usize::BITS is not O(1) for the purposes of asymptotic analysis even though it's a compile-time constant. Proof: a slice of length n has to be addressable, so 2^usize::BITS >= n, which makes usize::BITS at least Ω(log n) where n is your slice length. Another way to see that it can't be treated as O(1) is that this would lead to the absurd conclusion that every slice has O(1) length, and hence that <[T]>::sort also has O(1) complexity. [1]
There is an unfortunate limit of 2^(2^32) on all slice lengths because usize::BITS is a u32 rather than a usize. That does ultimately defeat asymptotic analysis... It's a shame; if it were defined as a usize there would be no such theoretical problem. ↩︎