.zip() optimization riddles


#1

As mentioned in Niko’s recent blog post, .zip() can sometimes produce very good loop code.

Here’s some examples to show how fragile this is – two pieces of code that are very similar, yet only one compiles well. This code can be used as a test case for experiments with different llvm passes.

(Playground link).

// zip then for loop
// Bad - no loop optimization, no vectorization
pub fn copy_1(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}

// zip then .map()
// Good -- vectorized
pub fn copy_2(xs: &[u8], ys: &mut [u8]) {
    for _ in ys.iter_mut().zip(xs).map(|(a, b)| *a = *b) { }
}

I don’t understand why these are so different, the .map() code compiles so much better.

We can actually do even better. This triggers loop idiom recognize and the optimizer inserts memcpy.

// slice, then iterate with indices
// Best -- recognizes memcpy
pub fn copy_3(xs: &[u8], ys: &mut [u8]) {
    let len = std::cmp::min(ys.len(), xs.len());
    let ys = &mut ys[..len];
    let xs = &xs[..len];
    for i in 0..len {
        ys[i] = xs[i];
    }
}

For arithmetic I have a similar riddle.

.zip() produces good code, but the indexed approach is even better!

pub fn add_zip(xs: &[i32], ys: &mut [i32]) {
    for _ in ys.iter_mut().zip(xs).map(|(a, b)| *a += *b) { }
}

pub fn add_indexed(xs: &[i32], ys: &mut [i32]) {
    let len = std::cmp::min(ys.len(), xs.len());
    let ys = &mut ys[..len];
    let xs = &xs[..len];
    for i in 0..len {
        ys[i] += xs[i];
    }
}

Both autovectorize, but the indexed approach is better because not only has it vectorized and unrolled the loop, it has also taken the unrolled operations and interleaved them and allocated registers better.


#2

The last time I investigated this, it seemed to me that the problem was that zip did not quite that things have the same length, and so there were extra branches in the loop that were causing LLVM trouble. IOW, the code we were generating was not as smart as finding the minimum index, as you have done, which made the loop more branchy. (Though in my original example I just asserted that the two sequences were the same length, because I knew that they were.) However, I am really not that familiar with the specific LLVM optimizations and how/when they fire, and it may well be that it ought to be unraveling this, so you can take my “conclusions” with a good ol’ heap of salt.


#3

Some more data. copy_4 optimizes to memcpy, copy_5 does not. I believe copy_5 is closer to what we generate for the equivalent iterator? But I’ve exceeded my quota for playing with this right now. :slightly_smiling:

#[inline(never)]
pub fn copy_4(xs: &[u8], ys: &mut [u8]) {
    for i in 0.. {
        if i >= xs.len() { break; }
        if i >= ys.len() { break; }
        ys[i] = xs[i];
    }
}

#[inline(never)]
pub fn copy_5(xs: &[u8], ys: &mut [u8]) {
    unsafe {
        let mut p: *const u8 = &xs[0];
        let mut q: *mut u8 = &mut ys[0];
        let mut p_end: *const u8 = p.offset(xs.len() as isize);
        let mut q_end: *mut u8 = q.offset(ys.len() as isize);
        while p != p_end && q != q_end {
            *q = *p;
            p = p.offset(1);
            q = q.offset(1);
        }
    }
}