As mentioned in Niko’s recent blog post, .zip() can sometimes produce very good loop code.
Here’s some examples to show how fragile this is – two pieces of code that are very similar, yet only one compiles well. This code can be used as a test case for experiments with different llvm passes.
// zip then for loop
// Bad - no loop optimization, no vectorization
pub fn copy_1(xs: &[u8], ys: &mut [u8]) {
for (a, b) in ys.iter_mut().zip(xs) {
*a = *b;
}
}
// zip then .map()
// Good -- vectorized
pub fn copy_2(xs: &[u8], ys: &mut [u8]) {
for _ in ys.iter_mut().zip(xs).map(|(a, b)| *a = *b) { }
}
I don’t understand why these are so different, the .map()
code compiles so much better.
We can actually do even better. This triggers loop idiom recognize and the optimizer inserts memcpy
.
// slice, then iterate with indices
// Best -- recognizes memcpy
pub fn copy_3(xs: &[u8], ys: &mut [u8]) {
let len = std::cmp::min(ys.len(), xs.len());
let ys = &mut ys[..len];
let xs = &xs[..len];
for i in 0..len {
ys[i] = xs[i];
}
}
For arithmetic I have a similar riddle.
.zip()
produces good code, but the indexed approach is even better!
pub fn add_zip(xs: &[i32], ys: &mut [i32]) {
for _ in ys.iter_mut().zip(xs).map(|(a, b)| *a += *b) { }
}
pub fn add_indexed(xs: &[i32], ys: &mut [i32]) {
let len = std::cmp::min(ys.len(), xs.len());
let ys = &mut ys[..len];
let xs = &xs[..len];
for i in 0..len {
ys[i] += xs[i];
}
}
Both autovectorize, but the indexed approach is better because not only has it vectorized and unrolled the loop, it has also taken the unrolled operations and interleaved them and allocated registers better.