.zip() optimization riddles

bluss · February 21, 2016, 12:59am

As mentioned in Niko’s recent blog post, .zip() can sometimes produce very good loop code.

Here’s some examples to show how fragile this is – two pieces of code that are very similar, yet only one compiles well. This code can be used as a test case for experiments with different llvm passes.

(Playground link).

// zip then for loop
// Bad - no loop optimization, no vectorization
pub fn copy_1(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}

// zip then .map()
// Good -- vectorized
pub fn copy_2(xs: &[u8], ys: &mut [u8]) {
    for _ in ys.iter_mut().zip(xs).map(|(a, b)| *a = *b) { }
}

I don’t understand why these are so different, the .map() code compiles so much better.

We can actually do even better. This triggers loop idiom recognize and the optimizer inserts memcpy.

// slice, then iterate with indices
// Best -- recognizes memcpy
pub fn copy_3(xs: &[u8], ys: &mut [u8]) {
    let len = std::cmp::min(ys.len(), xs.len());
    let ys = &mut ys[..len];
    let xs = &xs[..len];
    for i in 0..len {
        ys[i] = xs[i];
    }
}

For arithmetic I have a similar riddle.

.zip() produces good code, but the indexed approach is even better!

pub fn add_zip(xs: &[i32], ys: &mut [i32]) {
    for _ in ys.iter_mut().zip(xs).map(|(a, b)| *a += *b) { }
}

pub fn add_indexed(xs: &[i32], ys: &mut [i32]) {
    let len = std::cmp::min(ys.len(), xs.len());
    let ys = &mut ys[..len];
    let xs = &xs[..len];
    for i in 0..len {
        ys[i] += xs[i];
    }
}

Both autovectorize, but the indexed approach is better because not only has it vectorized and unrolled the loop, it has also taken the unrolled operations and interleaved them and allocated registers better.

nikomatsakis · February 21, 2016, 9:31am

The last time I investigated this, it seemed to me that the problem was that zip did not quite that things have the same length, and so there were extra branches in the loop that were causing LLVM trouble. IOW, the code we were generating was not as smart as finding the minimum index, as you have done, which made the loop more branchy. (Though in my original example I just asserted that the two sequences were the same length, because I knew that they were.) However, I am really not that familiar with the specific LLVM optimizations and how/when they fire, and it may well be that it ought to be unraveling this, so you can take my “conclusions” with a good ol’ heap of salt.

nikomatsakis · February 21, 2016, 9:39am

Some more data. copy_4 optimizes to memcpy, copy_5 does not. I believe copy_5 is closer to what we generate for the equivalent iterator? But I’ve exceeded my quota for playing with this right now.

#[inline(never)]
pub fn copy_4(xs: &[u8], ys: &mut [u8]) {
    for i in 0.. {
        if i >= xs.len() { break; }
        if i >= ys.len() { break; }
        ys[i] = xs[i];
    }
}

#[inline(never)]
pub fn copy_5(xs: &[u8], ys: &mut [u8]) {
    unsafe {
        let mut p: *const u8 = &xs[0];
        let mut q: *mut u8 = &mut ys[0];
        let mut p_end: *const u8 = p.offset(xs.len() as isize);
        let mut q_end: *mut u8 = q.offset(ys.len() as isize);
        while p != p_end && q != q_end {
            *q = *p;
            p = p.offset(1);
            q = q.offset(1);
        }
    }
}

bluss · March 25, 2019, 8:25am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Zipping slices sucks libs	12	2374	March 25, 2019
Iterator codegen regression	6	1290	March 25, 2019
Non-truncating, more usable zip() language design	12	5249	March 25, 2019
Should there by an array::zip method? libs	7	253	September 27, 2024
Thoughts on a free `cmp` function libs	8	1213	July 27, 2022

.zip() optimization riddles

Related topics