Performance of `get_many_mut()` with `DisjointIndices`

Tracking issue: Tracking Issue for get_many_mut · Issue #104642 · rust-lang/rust · GitHub

I would like to move the stabilization of that forward. That begins with answering the unresolved question.

One question raised throughout the thread is whether the current API, which takes an array of indices directly, is the best choice. An alternative that was proposed is to encapsulate the invariant of "disjoint indices" in a type. This has the benefit that the checked indices can be reused. It also makes it possible to distinguish between the "overlapping indices" and "out-of-bounds indices" error conditions (this can be done with the current API too, but that requires making the error type no longer a ZST, which may harm performance a bit).
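As a rough sketch of that split (the error-type names here are illustrative, not a proposed API), each failure mode can keep its own zero-sized error type:

// Sketch only; the error-type names are mine.
pub struct OverlappingIndices;  // ZST error: some indices were equal
pub struct IndexOutOfBounds;    // ZST error: some index was >= slice.len()

// fn DisjointIndices::new([usize; N]) -> Result<DisjointIndices<N>, OverlappingIndices>
// fn <[T]>::get_many_mut(DisjointIndices<N>) -> Result<[&mut T; N], IndexOutOfBounds>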

I quite like this idea. There are two main ways to implement it: one is to track only the disjointness invariant; the other is to also find the maximum index while checking it. The latter has the advantage that inside the method we can check with a single comparison whether any of the indices is out of bounds.

However, when I tried to implement the second approach, I found that LLVM generates worse code for it with few indices (2), and it performs worse in benchmarks too (487.40ps vs. 555.49ps with criterion). I don't think we can afford that, given that most usages will probably involve 2 indices.

Code:

#![feature(array_windows, get_many_mut)]

#[derive(Clone, Copy)]
pub struct DisjointIndices<const N: usize> {
    indices: [usize; N],
    max_index: usize,
}

impl<const N: usize> DisjointIndices<N> {
    #[inline]
    pub fn new(mut indices: [usize; N]) -> Result<Self, ()> {
        if N > 10 {
            indices.sort_unstable();
            return Self::new_sorted(indices);
        }

        let mut max_index = 0;
        let mut valid = true;
        for (i, &idx) in indices.iter().enumerate() {
            max_index = std::cmp::max(idx, max_index);
            for &idx2 in &indices[..i] {
                valid &= idx != idx2;
            }
        }
        match valid {
            true => Ok(Self { indices, max_index }),
            // Alternative without max tracking (with the `max_index` field removed):
            // true => Ok(Self { indices }),
            false => Err(()),
        }
    }

    #[inline]
    pub fn new_sorted(indices: [usize; N]) -> Result<Self, ()> {
        if N == 0 {
            return Ok(Self {
                indices,
                max_index: 0,
            });
        }

        let mut valid = true;
        for [prev, next] in indices.array_windows() {
            valid &= next > prev;
        }
        match valid {
            true => Ok(Self {
                max_index: *indices.last().unwrap(),
                indices,
            }),
            false => Err(()),
        }
    }
}

#[inline]
pub fn my_get_many_mut<'a, T, const N: usize>(
    slice: &'a mut [T],
    indices: &DisjointIndices<N>,
) -> Result<[&'a mut T; N], ()> {
    if N != 0 && indices.max_index >= slice.len() {
    // Alternative: check every index instead of the precomputed maximum:
    // if N != 0 && indices.indices.iter().any(|&i| i >= slice.len()) {
        Err(())
    } else {
        unsafe { Ok(slice.get_many_unchecked_mut(indices.indices)) }
    }
}

The alternative (commented out) performed just as well as the current get_many_mut() with few indices, but became worse with many (100) indices.

So, if we want maximum performance and we also want DisjointIndices, the only way I can see is to switch the strategy based on N: if N is less than <insert constant based on benchmarking here>, check the indices in get_many_mut(), otherwise check them in DisjointIndices.
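A minimal sketch of what that switch could look like (SMALL_N is a placeholder for the benchmark-derived constant, and this assumes it lives next to DisjointIndices so it can read the private fields):

const SMALL_N: usize = 4; // placeholder; the real value would come from benchmarking

#[inline]
pub fn get_many_mut_switched<'a, T, const N: usize>(
    slice: &'a mut [T],
    indices: &DisjointIndices<N>,
) -> Result<[&'a mut T; N], ()> {
    let out_of_bounds = if N < SMALL_N {
        // Few indices: check each one; LLVM fuses these checks well.
        indices.indices.iter().any(|&i| i >= slice.len())
    } else {
        // Many indices: one comparison against the precomputed maximum.
        N != 0 && indices.max_index >= slice.len()
    };
    if out_of_bounds {
        Err(())
    } else {
        unsafe { Ok(slice.get_many_unchecked_mut(indices.indices)) }
    }
}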

The disadvantages are that we carry a redundant field for small N (minor), and that we cannot provide a fully zero-cost DisjointIndices::new_unchecked() API: it would still have to calculate the maximum index, since a manually passed maximum won't be used for small N, and it would be weird and inconsistent to use or ignore the passed number depending on an implementation-detail constant. I am already opposed to an API that takes the maximum index, however, as I feel it exposes an implementation detail.

What are your thoughts?

Note that you can’t sort indices itself, as the result of get_many_mut must match the original index order given by the caller.

(But if you could sort or otherwise reorder them, then you wouldn’t need to store the max element separately!)

DisjointIndices::new_unchecked doesn’t sound very useful, given that get_many_unchecked_mut could just take an array and skip both uniqueness and bounds checks.

Bikeshed, but DisjointIndices is too specific a name for what is basically an ArraySet.

Anything that’s much more verbose than get_many_mut([i, j]) in the common case is ergonomically a non-starter IMO.

In what sort of use cases would one reuse the same index set so much that amortizing the uniqueness check is worth the extra ceremony?


My mistake, but it doesn't really matter for the discussion.

It is the ability to not skip the bounds check that is useful: if the indices are not arbitrary but the slice is, it can be useful to skip the uniqueness check but keep the bounds check.

If it was generic, sure, but it's a complication that isn't needed, and so I feel DisjointIndices is actually a very good name.

I disagree, but even with that requirement we can make DisjointIndices work by taking impl TryInto<DisjointIndices>.
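Roughly something like the following (a sketch building on my_get_many_mut from above; it assumes a TryFrom<[usize; N]> impl for DisjointIndices so that plain arrays still work, and it glosses over the error type):

pub fn get_many_mut_via<'a, T, const N: usize, I>(
    slice: &'a mut [T],
    indices: I,
) -> Result<[&'a mut T; N], ()>
where
    I: TryInto<DisjointIndices<N>>,
{
    // Arrays go through the disjointness check; a DisjointIndices passes through as-is.
    let indices = indices.try_into().map_err(|_| ())?;
    my_get_many_mut(slice, &indices)
}

Callers could then keep writing get_many_mut_via(slice, [i, j]) as long as [usize; N]: TryInto<DisjointIndices<N>> exists.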

This is a good point. I can imagine a case where some specific indices are reused with a data structure stored in an array, but even there they will probably be constants.

However, this is another advantage of DisjointIndices (mentioned in the tracking issue): it can be used to provide a new_sorted() constructor that guarantees an O(N) check. This sounds more useful.

By the way, there is a motivating example at Seek motivating example for more than 3 simultaneous borrows · Issue #27 · uazu/qcell · GitHub, and someone made a PR with benchmarks etc.: initial implementation of rw_array based on const generics by geeklint · Pull Request #26 · uazu/qcell · GitHub

cannot provide a fully zero-cost DisjointIndices::new_unchecked()

Can it skip evaluating the max and just store usize::MAX as a marker value? The code would then skip the max-based check if the stored value is usize::MAX.

I guess it is okay to forbid slices of length usize::MAX. It is also not the first time something like that is restricted. Take a look here:

/// The methods `index` and `index_mut` panic if:
/// - the end of the range is `usize::MAX` or
/// - the start of the range is greater than the end of the range or
/// - the end of the range is out of bounds.
#[stable(feature = "inclusive_range", since = "1.26.0")]
#[rustc_const_unstable(feature = "const_slice_index", issue = "none")]
unsafe impl<T> SliceIndex<[T]> for ops::RangeInclusive<usize>

and here (a capacity limit of `isize::MAX` bytes):

/// Constructs a new, empty `Vec<T>` with at least the specified capacity.
///
/// The vector will be able to hold at least `capacity` elements without
/// reallocating. This method is allowed to allocate for more elements than
/// `capacity`. If `capacity` is 0, the vector will not allocate.
///
/// # Errors
///
/// Returns an error if the capacity exceeds `isize::MAX` _bytes_,
/// or if the allocator reports allocation failure.
#[inline]
#[unstable(feature = "try_with_capacity", issue = "91913")]
pub fn try_with_capacity(capacity: usize) -> Result<Self, TryReserveError> {
    Self::try_with_capacity_in(capacity, Global)
}

I am already opposed to such API that takes the maximum index, however, as I feel it exposes an implementation detail

If new_unchecked had a safety requirement that the passed maximum value is correct, plus a debug assertion for it, then I don't think it would count as exposing an implementation detail.
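A sketch of that shape (the exact debug assertions are mine):

impl<const N: usize> DisjointIndices<N> {
    /// # Safety
    /// The indices must be pairwise distinct, and `max_index` must be the
    /// largest of them (any value is fine when `N == 0`).
    pub unsafe fn new_unchecked(indices: [usize; N], max_index: usize) -> Self {
        // Debug-only checks of the safety contract.
        debug_assert!(indices.iter().all(|&i| i <= max_index));
        debug_assert!(N == 0 || indices.contains(&max_index));
        debug_assert!((0..N).all(|i| !indices[..i].contains(&indices[i])));
        Self { indices, max_index }
    }
}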

Additionally, should we also explore the possibility of using sorting networks for this?
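For a fixed small N, a sorting network sorts a copy of the indices with a fixed, data-independent sequence of compare-exchanges (which could also be written branchlessly with min/max), after which the duplicate check is a linear pass over adjacent pairs. A sketch for N == 4, using the classic 5-comparator network (the function names are mine):

fn sort4(mut v: [usize; 4]) -> [usize; 4] {
    // Optimal 5-comparator sorting network for 4 elements.
    for &(a, b) in &[(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)] {
        if v[a] > v[b] {
            v.swap(a, b);
        }
    }
    v
}

fn has_duplicate_4(indices: [usize; 4]) -> bool {
    // Sort a copy so the caller's index order is preserved.
    let sorted = sort4(indices);
    sorted.windows(2).any(|w| w[0] == w[1])
}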

While libraries do enjoy doing so, it's generally considered an anti-pattern to take impl [Try]Into as an argument; the caller can write .into() if they want that type coercion. When the standard library does something similar, it's always with more specific traits, e.g. impl AsRef.

Existing slice indexing is proxied through trait SliceIndex. If get_many_mut is to accept more than just [usize; N], it should probably be via a trait SliceIndexMany or similar.

Proof of concept:

pub trait SliceMultiIndex<T: ?Sized> {
    type Output<'a>
    where
        T: 'a;

    fn get_many_mut(self, slice: &mut T) -> Option<Self::Output<'_>>;
    unsafe fn get_many_unchecked_mut(self, slice: &mut T) -> Self::Output<'_>;
}

impl<T, const N: usize> SliceMultiIndex<[T]> for [usize; N] { /* … */ }
impl<T, const N: usize> SliceMultiIndex<[T]> for MultiIndex<N> { /* … */ }

#[derive(Copy, Clone)]
pub struct MultiIndex<const N: usize>([usize; N]);

impl<const N: usize> MultiIndex<N> {
    pub fn new(indices: [usize; N]) -> Option<Self> { /* … */ }
}

Simple translation of optimized LLVM IR to pseudo-MIR for the two index case, just to illustrate the difference in how the check orders get compiled:

// NB: tuple indices for simpler to analyze ScalarPair ABI

#[no_mangle]
pub fn do_get_2_mut_fused(
    slice: &mut Slice<i32>,
    indices: (usize, usize),
) -> Option<[&mut i32; 2]> {
    slice.get_many_mut([indices.0, indices.1])
    /*
    {
        let _13ii = indices.0 < slice.1;
        let _131ii = indices.1 < slice.1;
        let _1 = _13ii & _131ii;
        let _211ii = indices.1 != indices.0;
        let _2 = _211ii & _1;
        match _2 {
            true => goto 'cont,
            false => goto 'exit(nullptr),
        }
    }
    'cont = {
        let _12ii = &raw mut (*slice.0)[indices.0];
        let _121ii = &raw mut (*slice.0)[indices.1];
        RET.1 = _121ii;
        goto 'exit(_12ii);
    }
    'exit(let _sinki) = {
        RET.0 = _sinki;
        return;
    }
    */
}

#[no_mangle]
pub fn do_get_2_mut_staged(
    slice: &mut Slice<i32>,
    indices: (usize, usize),
) -> Option<[&mut i32; 2]> {
    slice.get_many_mut(MultiIndex::new([indices.0, indices.1])?)
    /*
    {
        let _191inot = indices.1 == indices.0;
        match _191inot {
            true => goto 'yeet,
            false => goto 'try,
        }
    }
    'yeet = {
        RET.0 = nullptr; // NB: different !attributes from 'bail
        goto 'exit;
    }
    'try = {
        let _12i = indices.0 >= slice.1;
        let _121i = indices.1 >= slice.1;
        let _orcond = _12i | _121i;
        match _orcond {
            true => goto 'bail,
            false => goto 'cont,
        }
    }
    'cont = {
        let _14i = &raw mut (*slice.0)[indices.0];
        let _141i = &raw mut (*slice.0)[indices.1];
        RET.0 = _14i;
        RET.1 = _141i;
        goto 'exit;
    }
    'bail = {
        RET.0 = nullptr; // NB: different !attributes from 'yeet
        goto 'exit;
    }
    'exit = {
        return;
    }
    */
}

The x86_64 assembly difference between these two looks like it should just be noise w.r.t. perf. Although IIUC the OP was more alluding to the cost of (idx0 < len) & (idx1 < len) vs max(idx0, idx1) < len, which I haven't tried to inspect.

Edit to add: I plugged in a version tracking the max, and for two indices it's essentially the same codegen as without it, with only even more marginal ordering differences.

#[no_mangle]
pub fn do_get_2_mut_wmax(
    slice: &mut Slice<i32>,
    indices: (usize, usize),
) -> Option<[&mut i32; 2]> {
    slice.get_many_mut(MultiIndex::new([indices.0, indices.1])?)
    /*
    {
        let _191inot = indices.1 == indices.0;
        match _191inot {
            true => goto 'yeet,
            false => goto 'try,
        }
    }
    'yeet = {
        RET.0 = nullptr; // NB: same !attributes as 'bail
        goto 'exit;
    }
    'try = {
        let _sroaiiiiiiii = max(indices.0, indices.1);
        let _12 = _sroaiiiiiiii >= slice.1;
        match _12 {
            true => goto 'bail,
            false => goto 'cont,
        }
    }
    'exit = {
        return;
    }
    'cont = {
        let _12i = &raw mut (*slice.0)[indices.0];
        let _121i = &raw mut (*slice.0)[indices.1];
        RET.0 = _12i;
        RET.1 = _121i;
        goto 'exit;
    }
    'bail = {
        RET.0 = nullptr; // NB: same !attributes as 'yeet
        goto 'exit;
    }
    */
}

I think I feel fairly confident calling the OP's observed perf spread random noise (e.g. alignment of the machine instructions themselves), given this.

So... The performance difference in the benchmark is reliable: it persists across many variations. But inspecting the assembly, the versions are indeed almost equivalent. It seems (I have not yet inspected the assembly of the benchmark itself) that the combination with criterion makes LLVM generate worse assembly. The question is whether that says anything about the real-world performance of the two versions.
