Zipping slices sucks

fn sum(x: &[i32], y: &[i32], z: &[i32]) -> i32 {
    // Re-nest ((x, y), z) into a flat (x, y, z) tuple.
    fn repack<'a>(((x, y), z): ((&'a i32, &'a i32), &'a i32))
                  -> (&'a i32, &'a i32, &'a i32) {
        (x, y, z)
    }
    let mut s = 0;
    for (&x, &y, &z) in x.iter().zip(y.iter()).zip(z.iter()).map(repack) {
        s += x + y + z;
    }
    s
}

The inner loop has 3 comparisons per iteration, when the equivalent in C would have 1. I have no idea how to solve this nicely, but it’s a problem. I’d love to hear some ideas.

I’m curious as to how the equivalent C code could manage only 1 comparison per iteration, since you have to make sure you haven’t reached the end of any of the slices, which means checking each slice individually.

I imagine the equivalent C would do a one-time min on the lengths, and then every iteration would just be comparing against the min.

I’m hopeful that there’s a fast path to be had with trusted_len here. Basically, if the lengths are trusted, take the min and apply assume or something.
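For concreteness, here’s a sketch of that idea on modern Rust, where std::hint::assert_unchecked plays the role of llvm.assume (at the time this would have been the unstable intrinsics::assume): compute the min once, then assert the bound to the optimizer so the indexed loop can drop its per-slice checks.

pub fn sum_assume(x: &[i32], y: &[i32], z: &[i32]) -> i32 {
    let n = x.len().min(y.len()).min(z.len());
    // SAFETY: n is the minimum of the three lengths, so every index
    // below n is in bounds for all three slices.
    unsafe {
        std::hint::assert_unchecked(n <= x.len());
        std::hint::assert_unchecked(n <= y.len());
        std::hint::assert_unchecked(n <= z.len());
    }
    let mut s = 0;
    for i in 0..n {
        s += x[i] + y[i] + z[i];
    }
    s
}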

OK, so to see what was possible without using zip, I did this:

pub fn sum(x: &[i32], y: &[i32], z: &[i32]) -> i32 {
    let n = min(x.len(), min(y.len(), z.len()));
    let mut i = 0;
    let mut s = 0;
    while i < n {
        // SAFETY: i < n, and n is the minimum of the three lengths,
        // so i is in bounds for all three slices.
        unsafe {
            s += *x.get_unchecked(i)
                + *y.get_unchecked(i)
                + *z.get_unchecked(i);
        }
        i += 1;
    }
    s
}

fn min(a: usize, b: usize) -> usize {
    if a < b { a } else { b }
}

Which LLVM helpfully vectorizes, and it indeed only does a single comparison per iteration. I’m not sure if the zip iterator could manage it though. I’ll do some more investigation.

Out of curiosity, I benchmarked the unsafe sum function and a naive version:

pub fn naive(x: &[i32], y: &[i32], z: &[i32]) -> i32 {
    // flatten: ((a, b), c) -> (a, b, c), same shape as repack above
    fn flatten<'a>(((a, b), c): ((&'a i32, &'a i32), &'a i32))
                   -> (&'a i32, &'a i32, &'a i32) {
        (a, b, c)
    }
    x.iter().zip(y.iter()).zip(z.iter())
        .map(flatten)
        .fold(0, |a, (x, y, z)| a + *x + *y + *z)
}

And a safe version of the loop:

pub fn naive_loop(x: &[i32], y: &[i32], z: &[i32]) -> i32 {
    let n = min(x.len(), min(y.len(), z.len()));
    let mut s = 0;
    for i in 0..n {
        s += x[i] + y[i] + z[i];
    }
    s
}

Results were:

test tests::naive       ... bench:      2094 ns/iter (+/- 106)
test tests::naive_loop  ... bench:      1412 ns/iter (+/- 44)
test tests::unsafe_loop ... bench:       360 ns/iter (+/- 16)

I can understand the unsafe one being faster… but the iterator version, which is supposed to avoid bounds checks, being slower than a naive loop? That’s… really less than ideal.

Edit: there’s a “compressed” test case on playpen if you want to reproduce this. Make sure to read the comment on the test data set.
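For anyone who wants the shape of the harness without the playpen link: a minimal nightly #[bench] setup like the one below is enough to run the comparison. The data size here is an assumption on my part, not the original “compressed” set.

#![feature(test)]
extern crate test;

#[bench]
fn unsafe_loop(b: &mut test::Bencher) {
    // Data size is an assumption; the original playpen case used its
    // own "compressed" inputs.
    let v: Vec<i32> = (0..10_000).collect();
    b.iter(|| sum(&v, &v, &v));
}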


Remember that when we boast about Rust’s iterators and how they behave exactly as efficiently as C++ iterators, the claim is about how Rust managed to get the slice iterators to compile to exactly the same code, and even that effort hasn’t reached the finish line just yet.

I’m extremely excited, because exactly that works! (At least in the two-iterator case… three or more still has improvements possible.)

Edit: let’s just experiment with that a bit. It seems I’m worse at benchmarking than at implementing this… I didn’t get everything exactly right in that commit (iterating by value vs. by pointer, etc.). It looks like triple iteration may already be equally fast…

It’s a shame that only works for the 2-zip case, and falls down for the k-zip case.

It doesn’t! That was just my benchmarks being inconsistent. It’s fixed in the next commit, and both the 2-slice and 3-slice cases are just as fast with ZipTrusted as with a numerical loop plus unchecked indexing.
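For readers who don’t want to dig through the commits: a ZipTrusted-style adapter can, roughly, look like the sketch below. This is my illustration of the shape, not the actual committed code; it computes the length bound once at construction and uses unchecked indexing in next(), so the loop is back to a single comparison.

pub struct Zip3<'a, T> {
    x: &'a [T],
    y: &'a [T],
    z: &'a [T],
    i: usize,
    n: usize,
}

pub fn zip3<'a, T>(x: &'a [T], y: &'a [T], z: &'a [T]) -> Zip3<'a, T> {
    // The only length comparison happens here, once.
    let n = x.len().min(y.len()).min(z.len());
    Zip3 { x, y, z, i: 0, n }
}

impl<'a, T> Iterator for Zip3<'a, T> {
    type Item = (&'a T, &'a T, &'a T);

    fn next(&mut self) -> Option<Self::Item> {
        if self.i < self.n {
            // SAFETY: i < n, and n is the minimum of the three lengths.
            let item = unsafe {
                (self.x.get_unchecked(self.i),
                 self.y.get_unchecked(self.i),
                 self.z.get_unchecked(self.i))
            };
            self.i += 1;
            Some(item)
        } else {
            None
        }
    }
}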

Niiiiice

TrustedLen ftw

llvm::assume is magic.

The solution to all optimisation problems:

let magic: f64 = get_magic();
llvm::assume(magic == magic);

:stuck_out_tongue:
