Poor optimization of fallible construction of arrays from iterators

Poking around with variations on @newpavlov's "pathological bounds checking" example in the "pure annotations for stateless functions" thread, I found another case where the optimizer doesn't manage to do nearly as good a job as it could.

#![feature(array_try_from_fn)]

const N: usize = 8;

pub fn try_add_1(a: &[u32], b: &[u32], idx: u16) -> Option<[u32; N]> {
    let idx = usize::from(idx);
    if a.get(idx + N - 1).is_none() || b.get(idx + N - 1).is_none() {
        None
    } else {
        Some(core::array::from_fn(|i| {
            a[idx + i] + b[idx + i]
        }))
    }
}

pub fn try_add_2(a: &[u32], b: &[u32], idx: u16) -> Option<[u32; N]> {
    let idx = usize::from(idx);
    let mut sums = a.iter().zip(b.iter())
        .skip(idx).take(N)
        .map(|(x, y)| x + y);
    core::array::try_from_fn(|_| sums.next())
}

(playground link)

try_add_1 is optimized very tightly, with only one merged bounds check followed by vectorized code for the loop. (The return convention for Option<[u32; N]> is suboptimal, but that's a separate issue.) try_add_2, on the other hand, does multiple bounds checks and doesn't manage to vectorize the whole loop. [N.B. Both functions are meant to do the same abstract calculation, but I'm not sure I got try_add_2 exactly right; feel free to offer corrections.]

Is this worth filing a bug on?

I think it would depend on whether you can isolate it a bit more to a particular part that's just generally working badly, rather than this particular combination, which doesn't strike me as something that one ought to be writing in the first place.

When I look at the function, the obvious way to me is instead to do something like

const N: usize = 8;
pub fn try_add_2(a: &[u32], b: &[u32], idx: u16) -> Option<[u32; N]> {
    let idx = usize::from(idx);
    let a = a.get(idx..)?.first_chunk::<N>()?;
    let b = b.get(idx..)?.first_chunk::<N>()?;
    Some(core::array::from_fn(|i| a[i] + b[i]))
}

which is shorter than both, has no math on the indices, and is still optimized perfectly https://rust.godbolt.org/z/sao51bbeK.

A couple of reasons why it doesn't surprise me at all that try_add_2 is non-great:

  1. try_from_fn(|_| it.next()) is rarely any good, in my experience. It's a two-exit-condition loop -- one for the array length and one for the iterator exhaustion -- which LLVM is far worse at than single-exit-condition loops. (Part of the previous work making from_fn and map better on arrays was having them use next_unchecked instead so there's no fallibility inside the loop.)
  2. if one is using nightly anyway, might as well use Iterator::next_chunk from nightly instead, which has a way better chance of actually being optimized well. (It can see the loop, which next can't, so has more flexibility to do something smart.)
  3. zip is always harder than other iterators to optimize.
  4. .skip(idx) is a shrinking combinator which cannot propagate TrustedLen, so it's not surprising that it loses a bunch of potential optimizations.
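To make point 4 a bit more concrete, here's a small stable-Rust check of what the adaptor chain from try_add_2 reports through size_hint (TrustedLen itself is a nightly-only unsafe trait, so this sketch can only observe the hint, not the trust):

```rust
fn main() {
    let a = [0u32; 16];
    let b = [0u32; 16];
    // The same adaptor chain as in try_add_2, with idx = 4 and N = 8.
    let it = a.iter().zip(b.iter()).skip(4).take(8);
    // The *hint* is still exact after skip/take...
    assert_eq!(it.size_hint(), (8, Some(8)));
    // ...but size_hint carries no unsafe guarantee on its own, and per the
    // point above Skip can't promise the TrustedLen contract in general,
    // so array-building code still has to check every next() for exhaustion.
}
```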

I always forget that first_chunk exists and that .get() can take a range expression as well as a single value. That does look like a better way to write this function for real.
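For anyone else who forgets: both pieces are stable (slice::first_chunk since Rust 1.77). A quick illustration of what each returns:

```rust
fn main() {
    let v = [10u32, 20, 30, 40, 50];
    // get() with a range yields Option<&[T]>, so an out-of-bounds range
    // becomes None instead of a panic.
    let tail: &[u32] = v.get(2..).unwrap();
    assert_eq!(tail, &[30, 40, 50]);
    assert!(v.get(9..).is_none());
    // first_chunk::<N>() borrows the first N elements as an array reference,
    // or returns None if the slice is too short.
    let chunk: &[u32; 2] = tail.first_chunk::<2>().unwrap();
    assert_eq!(chunk, &[30, 40]);
    assert!(tail.first_chunk::<7>().is_none());
}
```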

To me, this seems like the most important thing to fix. Would it help to have array::try_from_iter here?
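For reference, a safe stable-Rust sketch of what try_from_iter semantics might look like. The name and the T: Copy + Default bound are artifacts of this sketch (a real std implementation would presumably use MaybeUninit); note the loop still has two exit conditions, slot count and iterator exhaustion:

```rust
// Hypothetical helper, not a real std API: build a [T; N] from an
// iterator, returning None if it yields fewer than N items.
fn try_from_iter<T: Copy + Default, const N: usize>(
    mut iter: impl Iterator<Item = T>,
) -> Option<[T; N]> {
    let mut out = [T::default(); N];
    for slot in out.iter_mut() {
        // Two ways out of the loop: the slots run out, or next() does.
        *slot = iter.next()?;
    }
    Some(out)
}

fn main() {
    let ok: Option<[u32; 4]> = try_from_iter((0u32..4).map(|x| x * 2));
    assert_eq!(ok, Some([0, 2, 4, 6]));
    // Too few elements: the whole construction fails.
    let short: Option<[u32; 4]> = try_from_iter(0u32..3);
    assert_eq!(short, None);
}
```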

.skip(idx) is a shrinking combinator which cannot propagate TrustedLen

I see why this is currently the case -- if the nested iterator is TrustedLen and its size hint is (usize::MAX, None), we have no way of defining a trustworthy upper bound on the number of elements after skipping k elements, and TrustedLen requires trustworthy upper as well as lower bounds. But for this scenario, what matters is the lower bound, and that can be accurately stated as usize::MAX - k, can't it? Yeah, we'd need to define a new TrustedLowerBound trait to take advantage...
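A sketch of what that hypothetical trait and the Skip lower-bound arithmetic could look like (nothing here is in std; the trait name is just the one suggested above):

```rust
/// Hypothetical marker trait, not in std: the lower bound returned by
/// size_hint() can be trusted by unsafe code. Unlike TrustedLen, it makes
/// no promise about the upper bound, so a shrinking adaptor like Skip
/// could still implement it.
#[allow(dead_code)]
unsafe trait TrustedLowerBound: Iterator {}

/// The lower-bound arithmetic from the post: skipping k elements of an
/// iterator with a trusted lower bound of n leaves at least
/// n.saturating_sub(k) elements, even when n is usize::MAX.
fn lower_bound_after_skip(n: usize, k: usize) -> usize {
    n.saturating_sub(k)
}

fn main() {
    assert_eq!(lower_bound_after_skip(usize::MAX, 5), usize::MAX - 5);
    // Skipping more than the trusted lower bound just saturates to 0.
    assert_eq!(lower_bound_after_skip(3, 5), 0);
}
```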

Well, that's just next_chunk at that point. But that's tricky to optimize, and it's also unclear whether the return type is the right choice; see Provide a means of turning iterators into fixed-size arrays · Issue #81615 · rust-lang/rust · GitHub


And a worse next_chunk too, because there wouldn't be the opportunity for individual iterators to override it with a smarter implementation.

But generally, the reason I don't think it could do much better is that it would need some trait bound telling it more than just Iterator before it could be smarter. For a general iterator, at most it could check the size_hint and exit early if the hint happens to say the iterator is definitely too short; but it might be too short even if the hint doesn't say so (either because the hint isn't actually trustworthy, or because it's the default (0, None) hint).
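To illustrate that last point on stable Rust: any Iterator impl that doesn't override size_hint inherits the default (0, None), which gives a would-be try_from_iter nothing to exit early on:

```rust
// A perfectly ordinary iterator that never overrides size_hint.
struct Countdown(u32);

impl Iterator for Countdown {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            None
        } else {
            self.0 -= 1;
            Some(self.0)
        }
    }
}

fn main() {
    let it = Countdown(5);
    // Default hint: "at least 0 elements, no known upper bound" -- useless
    // for proving up front that the iterator is long enough, or too short.
    assert_eq!(it.size_hint(), (0, None));
    assert_eq!(Countdown(3).collect::<Vec<_>>(), vec![2, 1, 0]);
}
```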

There is actually an internal array::try_from_trusted_iter, but it needs UncheckedIterator and actually assert!s that the length is long enough: