MIR optimization pass that implements auto-vectorization

The general thing both of those examples hit is that they can panic when one of the slices isn't long enough, so LLVM has to carefully preserve the partial work done before the panic for the cases where the panic fires in the middle of the loop. We would have to do the same thing in MIR.
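
For concreteness, the shape those examples presumably have is something like the function below, i.e. the same loop as the fix that follows but without any up-front length check (this is my reconstruction and a made-up name, not the exact code from the earlier posts):

pub fn func1_no_check(arr1: &mut [u32], arr2: &mut [u32]) {
    for i in 1..arr2.len() {
        arr2[i] += arr2[i - 1];
        // If `arr1` is shorter than `arr2`, this indexing panics partway
        // through the loop, and the writes to `arr2` from earlier
        // iterations must still be observable after unwinding, which is
        // what forces LLVM to keep the loop in its scalar, checked form.
        arr1[i] += arr2[i];
    }
}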

If you assert up-front that things are long enough, then the loops become canonical again and it optimizes them better: https://rust.godbolt.org/z/4c9v5WKvb

pub fn func1_earlycheck(arr1: &mut [u32], arr2: &mut [u32]) {
    let n = arr2.len();
    // Reslice both inputs to exactly `n` elements up front: this panics
    // early if `arr1` is too short, so the indexing in the loop body can
    // no longer panic partway through.
    let (arr1, arr2) = (&mut arr1[..n], &mut arr2[..n]);
    for i in 1..n {
        arr2[i] += arr2[i - 1];
        arr1[i] += arr2[i];
    }
}

"reslicing" like that is a good general technique to encourage removal of bounds checks. It's what I used in MIRI says `reverse` is UB, so replace it with something LLVM can vectorize by scottmcm · Pull Request #90821 · rust-lang/rust · GitHub to get that to vectorize, for example.

(Autovectorizing this stuff is easier in C since the compiler can just assume everything is always long enough, as out-of-bounds indexing is UB.)
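
In Rust, the closest analogue is to make that promise yourself with unsafe `get_unchecked` indexing. This is just a sketch to illustrate the point, not a recommendation over the safe reslicing above, and the `func1_unchecked` name and its safety contract are hypothetical:

/// Sketch only: the caller must guarantee `arr1.len() >= arr2.len()`,
/// otherwise this is undefined behaviour, exactly like the C version.
pub unsafe fn func1_unchecked(arr1: &mut [u32], arr2: &mut [u32]) {
    for i in 1..arr2.len() {
        // SAFETY: `i` and `i - 1` are in bounds for `arr2` by the loop
        // range, and in bounds for `arr1` by the caller's guarantee.
        unsafe {
            let prev = *arr2.get_unchecked(i - 1);
            *arr2.get_unchecked_mut(i) += prev;
            let cur = *arr2.get_unchecked(i);
            *arr1.get_unchecked_mut(i) += cur;
        }
    }
}

This removes the bounds checks and the panic paths, mirroring the C assumption, though the loop-carried dependency through `arr2` still limits how much LLVM can vectorize here.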

Of course, often the best approach is just to use iterators instead. If we write that second example like so: https://rust.godbolt.org/z/eG8sc6fhs

pub fn func1_just_use_iter(arr1: &mut [u32], arr2: &[u32]) {
    // `zip` stops at the shorter of the two slices, so there are no
    // bounds checks and no panic path inside the loop.
    std::iter::zip(arr1, arr2)
        .for_each(|(a, b)| {
            if *a > 100 {
                *a = *b;
            }
        })
}

Then it does get vectorized:

  %13 = icmp ugt <8 x i32> %wide.load, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %14 = icmp ugt <8 x i32> %wide.load8, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %15 = icmp ugt <8 x i32> %wide.load9, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %16 = icmp ugt <8 x i32> %wide.load10, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %17 = getelementptr [0 x i32], [0 x i32]* %arr2.0, i64 0, i64 %index, !dbg !83
  %18 = bitcast i32* %17 to <8 x i32>*, !dbg !94
  %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %18, i32 4, <8 x i1> %13, <8 x i32> poison), !dbg !94, !noalias !89
  %19 = getelementptr i32, i32* %17, i64 8, !dbg !94
  %20 = bitcast i32* %19 to <8 x i32>*, !dbg !94
  %wide.masked.load11 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %20, i32 4, <8 x i1> %14, <8 x i32> poison), !dbg !94, !noalias !89
  %21 = getelementptr i32, i32* %17, i64 16, !dbg !94
  %22 = bitcast i32* %21 to <8 x i32>*, !dbg !94
  %wide.masked.load12 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %22, i32 4, <8 x i1> %15, <8 x i32> poison), !dbg !94, !noalias !89
  %23 = getelementptr i32, i32* %17, i64 24, !dbg !94
  %24 = bitcast i32* %23 to <8 x i32>*, !dbg !94
  %wide.masked.load13 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %24, i32 4, <8 x i1> %16, <8 x i32> poison), !dbg !94, !noalias !89
  %25 = bitcast i32* %5 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load, <8 x i32>* %25, i32 4, <8 x i1> %13), !dbg !95, !alias.scope !84, !noalias !89
  %26 = bitcast i32* %7 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load11, <8 x i32>* %26, i32 4, <8 x i1> %14), !dbg !95, !alias.scope !84, !noalias !89
  %27 = bitcast i32* %9 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load12, <8 x i32>* %27, i32 4, <8 x i1> %15), !dbg !95, !alias.scope !84, !noalias !89
  %28 = bitcast i32* %11 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load13, <8 x i32>* %28, i32 4, <8 x i1> %16), !dbg !95, !alias.scope !84, !noalias !89

Loops with indexes are often an anti-pattern in Rust -- see needless_range_loop in clippy -- so I'd be particularly sad if we got any feature that only worked with them.
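
As a tiny, made-up illustration of what that lint flags and the rewrite it suggests:

// Hypothetical example: clippy's needless_range_loop fires on the
// indexed form and suggests iterating over the slice directly.
pub fn sum_indexed(arr: &[u32]) -> u32 {
    let mut sum = 0;
    for i in 0..arr.len() {
        sum += arr[i];
    }
    sum
}

pub fn sum_iterated(arr: &[u32]) -> u32 {
    let mut sum = 0;
    for &x in arr {
        sum += x;
    }
    sum
}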
