MIR optimization pass that implements auto-vectorization

Hello everyone, I have made some contributions to both stdarch and rustc in rust-lang over the last two years. I would like to discuss the idea of implementing automatic vectorization in rustc. (I have done a preliminary implementation based on Rust 1.60.0 and written documentation in the README of this repository.)

SIMD (Single Instruction, Multiple Data), also called vectorization, is a commonly used acceleration technique in compute-heavy scenarios. LLVM provides some automatic vectorization mechanisms, but in many cases we still need to rewrite the program manually with SIMD instructions to get the speedup.

Rust developers currently have multiple ways to use these SIMD instructions:

· stdarch

· portable-simd

· use the `extern "platform-intrinsic"` ABI directly

But all of these approaches require developers to rewrite their original code, which has several distinct drawbacks:

(1) It makes the code complex and hard to read, and greatly increases its size.

(2) For those unfamiliar with SIMD, getting these speedups is very difficult.

To solve these problems, we can implement auto-vectorization in the Rust compiler itself. That way we don't need to change the original code at all; we only need to add an attribute to get the SIMD speedup.

I have made a preliminary implementation of this feature. It is based on Rust MIR, since MIR clearly expresses the structure of a function. It behaves like a normal MIR optimization pass: it automatically analyzes and refactors the loops in MIR to obtain SIMD acceleration in as many scenarios as possible while preserving safety and program behavior.

For example:

#[vectorization]
fn func1(arr: &[f32]) -> f32 {
    let mut sum = 0.;
    for i in arr {
        sum += *i;
    }
    sum
}

fn func2(arr: &[f32]) -> f32 {
    let mut sum = 0.;
    for i in arr {
        sum += *i;
    }
    sum
}

(Test code is here)

In this example, the only difference between func1 and func2 is the #[vectorization] attribute added to func1, indicating that automatic vectorization is enabled on that function. We call the two functions separately in main and measure the time each takes. Compiled and run with rustc -Copt-level=3 -Zmir-opt-level=4 in a local x86_64 environment, the results are as follows:

t1: 327
t2: 6519
ans1: 2201600
ans2: 2201600

We can see that the first function is almost 20 times faster than the second function.

More practical example: calculating the variance of an array:

Calculating the variance is a very common operation in statistics. The test code is here.

We can see the following two functions:

#[vectorization]
fn var1(arr: &[f32]) -> f32 {
    let mut sq_sum = 0.;
    let mut sum = 0.;
    let len = arr.len() as f32;
    for i in arr {
        sum += i;
        sq_sum += i * i;
    }
    let ave = sum / len;
    (sq_sum /  len - ave * ave).sqrt()
}

fn var2(arr: &[f32]) -> f32 {
    let mut sq_sum = 0.;
    let mut sum = 0.;
    let len = arr.len() as f32;
    for i in arr {
        sum += i;
        sq_sum += i * i;
    }
    let ave = sum / len;
    (sq_sum /  len - ave * ave).sqrt()
}

The difference between the two functions is again only the presence of the #[vectorization] attribute. Using the same compilation command and environment as before, the results are as follows:

t1: 7
t2: 67
ans1: 28.667055
ans2: 28.649292

We can see that the first function is about 10 times faster than the second. The results differ slightly because the SIMD instructions perform the floating-point operations in a different order, which produces different rounding errors. There is currently no good solution to this problem.
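
To see why the order matters, here is a small illustration (my own, not part of the test code linked above) of how a lane-wise SIMD sum can round differently from the sequential sum:

fn main() {
    // The spacing between f32 values near 1e8 is 8.0, so 1.0e8 + 1.0 rounds
    // straight back to 1.0e8 and the sequential sum loses three of the 1.0s.
    let x = [1.0e8f32, 1.0, 1.0, 1.0, -1.0e8, 1.0, 1.0, 1.0];
    let sequential: f32 = x.iter().sum();

    // A 4-lane vectorized loop accumulates x[i] into lane i % 4 and reduces at
    // the end, so the two large values cancel before the small ones are added.
    let mut lanes = [0.0f32; 4];
    for (i, v) in x.iter().enumerate() {
        lanes[i % 4] += *v;
    }
    let vectorized: f32 = lanes.iter().sum();

    println!("sequential = {sequential}, vectorized = {vectorized}"); // 3 vs 6
}

Both results are valid floating-point sums; they just apply the roundings in a different order.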

More complex example

We can look at this more complex example below. The test code is here.

#[vectorization]
fn func1(src: &[f32], src2: &[f32], val: &mut [f32]) {
    let mut sum = 0.;
    for x in 0..src.len() {
        let v = src[x];
        sum += v * v;
        val[x] = src2[x] + sum;
    }
}

In this example, the value of the variable sum is updated in each loop iteration based on its value from the previous iteration, which can have a significant impact on the effectiveness of auto-vectorization. The results of running it are:

t1: 6984
t2: 7952

The optimization effect is not very significant, but there is still more than 13% efficiency improvement.

There are more detailed descriptions and the implementation code in this repository. I think this feature deserves a proposal; anyone interested is welcome to work on it together.

Can you say more about why doing this in rustc, as opposed to the LLVM autovectorization that already exists, is important?

I would guess that a substantial part of that is from add_unordered, since the default addition in rust needs to be ordered.

If we had scalar add_unordered then I expect this would get vectorized the same way it does with i32s: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=985182c57dff729c7374b6c06ab2ee3c

  %12 = add <4 x i32> %wide.load, %vec.phi20
  %13 = add <4 x i32> %wide.load23, %vec.phi21
  %14 = mul <4 x i32> %wide.load, %wide.load
  %15 = mul <4 x i32> %wide.load23, %wide.load23
  %16 = add <4 x i32> %14, %vec.phi
  %17 = add <4 x i32> %15, %vec.phi19
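
For reference, a rough sketch of what "scalar add_unordered" could look like on today's nightly, using the unstable fadd_fast intrinsic (note: it is UB if any input or result is non-finite). Once the additions may be reassociated, LLVM is free to vectorize the reduction much like the i32 version above:

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;

pub fn var_fast(arr: &[f32]) -> f32 {
    let mut sq_sum = 0.0f32;
    let mut sum = 0.0f32;
    let len = arr.len() as f32;
    for &v in arr {
        // SAFETY: the caller must ensure these sums never become infinite or NaN.
        unsafe {
            sum = fadd_fast(sum, v);
            sq_sum = fadd_fast(sq_sum, v * v);
        }
    }
    let ave = sum / len;
    (sq_sum / len - ave * ave).sqrt()
}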

Yes, there is also auto-vectorization in LLVM. But if LLVM's auto-vectorization solved every problem, why would we use stdarch and portable-simd? We implement it in rustc to enable vectorization in the scenarios where LLVM's auto-vectorization falls short and the code would otherwise have to be rewritten manually: for example, loops that contain if...else statements, or reads of consecutive array indexes through an iterator.

The automatic vectorization I have implemented so far can only handle some uncomplicated scenarios, which LLVM's auto-vectorization can also optimize. For example, in the code above, the pass calls the "simd_add" intrinsic inside the loop and the "simd_reduce_add_unordered" intrinsic once after the loop ends. In theory, LLVM can do this kind of optimization as well. You can view the optimized MIR structure here.
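
To make the shape of that transformation concrete, here is a hand-written sketch (not the pass's actual output, and the lane count of 4 is arbitrary) of roughly what the rewritten sum loop amounts to on a nightly of that era, using the same intrinsics through the platform-intrinsic ABI:

#![feature(platform_intrinsics, repr_simd)]

#[repr(simd)]
#[derive(Clone, Copy)]
struct F32x4(f32, f32, f32, f32);

extern "platform-intrinsic" {
    fn simd_add<T>(a: T, b: T) -> T;
    fn simd_reduce_add_unordered<T, U>(v: T) -> U;
}

fn sum_vectorized(arr: &[f32]) -> f32 {
    let mut acc = F32x4(0.0, 0.0, 0.0, 0.0);
    let mut chunks = arr.chunks_exact(4);
    for c in &mut chunks {
        // simd_add inside the loop body...
        acc = unsafe { simd_add(acc, F32x4(c[0], c[1], c[2], c[3])) };
    }
    // ...and a single simd_reduce_add_unordered once the loop is done
    let mut sum: f32 = unsafe { simd_reduce_add_unordered(acc) };
    for &v in chunks.remainder() {
        sum += v;
    }
    sum
}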

Well, one reason is to use various operations that aren't easily coded in scalar Rust. Like, if you really want _mm512_bitshuffle_epi64_mask from core::arch::x86_64, then I'd be very surprised if LLVM autovectorization picked it up. But those cases, to me, are the ones that we wouldn't implement in a Rust autovectorizer either.

So basically the core of my question is what patterns can we reasonably notice in Rust in a way that's better than extending LLVM to notice those same patterns? (Especially considering things like https://polly.llvm.org/, which I suspect will always be more powerful than anything we do.)

Or is there some advantage we get from doing this in MIR even in cases where LLVM can do it anyway?

This is a good example of one of those things that's far easier to detect in LLVM.

This code, for example, https://rust.godbolt.org/z/Wsj1ssTxj

pub fn demo1(a: &[i32], b: &[i32]) -> i32 {
    std::iter::zip(a, b)
        .map(|(a, b)| a * b)
        .sum()
}

Already vectorizes in LLVM

  %21 = mul <8 x i32> %wide.load29, %wide.load, !dbg !66
  %22 = mul <8 x i32> %wide.load30, %wide.load26, !dbg !66
  %23 = mul <8 x i32> %wide.load31, %wide.load27, !dbg !66
  %24 = mul <8 x i32> %wide.load32, %wide.load28, !dbg !66
  %25 = add <8 x i32> %21, %vec.phi, !dbg !83
  %26 = add <8 x i32> %22, %vec.phi23, !dbg !83
  %27 = add <8 x i32> %23, %vec.phi24, !dbg !83
  %28 = add <8 x i32> %24, %vec.phi25, !dbg !83

But that's something that's much harder to see in MIR. We're just not good at seeing through all the intermediate noise.


Not that Rust has been great about exposing LLVM's auto-vectorization as much as it could. For example, this test had to be added only a couple of weeks ago along with a change (#94570) to stop Rust from blocking a bunch of autovectorization opportunities from getting picked up:

https://github.com/rust-lang/rust/blob/master/src/test/codegen/autovectorize-f32x4.rs


Also, let me give a shout-out to std::simd, which gives a really easy line-by-line translation of the variance example in the repo readme: https://rust.godbolt.org/z/9Wc6Kbb47

// nightly: needs #![feature(portable_simd, slice_as_chunks)]; newer toolchains
// may also need the SimdFloat trait in scope for reduce_sum
use std::simd::Simd;

const LANES: usize = 16;
pub fn var_std_simd(arr: &[f32]) -> f32 {
    let mut sq_sum = Simd::from_array([0.0; LANES]);
    let mut sum = Simd::from_array([0.0; LANES]);
    
    // the `platform-intrinsic` example in the introduction of the auto-vectorization 
    // fork's readme is ignoring the tail, so this does too
    let (chunks, _tail) = arr.as_chunks();
    for chunk in chunks {
        let chunk = Simd::from_array(*chunk);
        sum += chunk;
        sq_sum += chunk * chunk;
    }
    
    let len = arr.len() as f32;
    let ave = sum.reduce_sum() / len;
    sq_sum.reduce_sum() / len - ave * ave
}

For example this function:

pub fn func1(arr1: &mut [u32], arr2: &mut [u32]) {
    for i in 1..arr2.len() {
        arr2[i] += arr2[i - 1];
        arr1[i] += arr2[i];
    }
}

It is not auto-vectorized by LLVM: Compiler Explorer

But the second statement in the loop:

        arr1[i] += arr2[i];

By analyzing the MIR structure, we know that this statement can be auto-vectorized in rustc.

And this function too: Compiler Explorer

pub fn func1(arr1: &mut [u32], arr2: &[u32]) {
    for i in 0..arr1.len() {
        if arr1[i] > 100 {
            arr1[i] = arr2[i];
        }
    }
}

By analyzing the MIR, we can easily auto-vectorize it, for example on x86_64 using a combination of _mm_cmplt_* and _mm_blendv_*.

It's not clear to me what it would take to get LLVM's auto-vectorization to support these scenarios, but we can easily get what we want with MIR optimizations.
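
For illustration, a hand-written sketch of that compare-and-blend pattern could look like the following. This is not the pass's actual output; it assumes sse4.1 and a length that is a multiple of 4, and _mm_cmpgt_epi32 is a signed compare, so it only matches arr1[i] > 100 for values below 2^31:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn blend_gt_100(arr1: &mut [u32], arr2: &[u32]) {
    assert!(arr1.len() % 4 == 0 && arr1.len() <= arr2.len());
    let threshold = _mm_set1_epi32(100);
    for i in (0..arr1.len()).step_by(4) {
        let a = _mm_loadu_si128(arr1.as_ptr().add(i) as *const __m128i);
        let b = _mm_loadu_si128(arr2.as_ptr().add(i) as *const __m128i);
        // mask lanes where a > 100, then take b in those lanes and keep a elsewhere
        let mask = _mm_cmpgt_epi32(a, threshold);
        let blended = _mm_blendv_epi8(a, b, mask);
        _mm_storeu_si128(arr1.as_mut_ptr().add(i) as *mut __m128i, blended);
    }
}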


The general thing that both those examples are hitting is that they can panic when one of the arrays is not long enough. So LLVM is carefully maintaining the partial work that is done before the panic in the cases where that panic is hit in the middle of the loop. We would have to do this in MIR too.

If you assert up-front that things are long enough, then the loops become canonical again and it optimizes them better: https://rust.godbolt.org/z/4c9v5WKvb

pub fn func1_earlycheck(arr1: &mut [u32], arr2: &mut [u32]) {
    let n = arr2.len();
    let (arr1, arr2) = (&mut arr1[..n], &mut arr2[..n]);
    for i in 1..n {
        arr2[i] += arr2[i - 1];
        arr1[i] += arr2[i];
    }
}

"reslicing" like that is a good general technique to encourage removal of bounds checks. It's what I used in MIRI says `reverse` is UB, so replace it with something LLVM can vectorize by scottmcm · Pull Request #90821 · rust-lang/rust · GitHub to get that to vectorize, for example.

(Autovectorizing this stuff is easier in C since the compiler can just assume everything is always long enough, as out-of-bounds indexing is UB.)

Of course, often the best approach is just to use iterators instead. If we write that second example like so: https://rust.godbolt.org/z/eG8sc6fhs

pub fn func1_just_use_iter(arr1: &mut [u32], arr2: &[u32]) {
    std::iter::zip(arr1, arr2)
        .for_each(|(a, b)| {
            if *a > 100 {
                *a = *b;
            }
        })
}

Then it does get vectorized:

  %13 = icmp ugt <8 x i32> %wide.load, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %14 = icmp ugt <8 x i32> %wide.load8, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %15 = icmp ugt <8 x i32> %wide.load9, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %16 = icmp ugt <8 x i32> %wide.load10, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>, !dbg !75
  %17 = getelementptr [0 x i32], [0 x i32]* %arr2.0, i64 0, i64 %index, !dbg !83
  %18 = bitcast i32* %17 to <8 x i32>*, !dbg !94
  %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %18, i32 4, <8 x i1> %13, <8 x i32> poison), !dbg !94, !noalias !89
  %19 = getelementptr i32, i32* %17, i64 8, !dbg !94
  %20 = bitcast i32* %19 to <8 x i32>*, !dbg !94
  %wide.masked.load11 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %20, i32 4, <8 x i1> %14, <8 x i32> poison), !dbg !94, !noalias !89
  %21 = getelementptr i32, i32* %17, i64 16, !dbg !94
  %22 = bitcast i32* %21 to <8 x i32>*, !dbg !94
  %wide.masked.load12 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %22, i32 4, <8 x i1> %15, <8 x i32> poison), !dbg !94, !noalias !89
  %23 = getelementptr i32, i32* %17, i64 24, !dbg !94
  %24 = bitcast i32* %23 to <8 x i32>*, !dbg !94
  %wide.masked.load13 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %24, i32 4, <8 x i1> %16, <8 x i32> poison), !dbg !94, !noalias !89
  %25 = bitcast i32* %5 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load, <8 x i32>* %25, i32 4, <8 x i1> %13), !dbg !95, !alias.scope !84, !noalias !89
  %26 = bitcast i32* %7 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load11, <8 x i32>* %26, i32 4, <8 x i1> %14), !dbg !95, !alias.scope !84, !noalias !89
  %27 = bitcast i32* %9 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load12, <8 x i32>* %27, i32 4, <8 x i1> %15), !dbg !95, !alias.scope !84, !noalias !89
  %28 = bitcast i32* %11 to <8 x i32>*, !dbg !95
  call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %wide.masked.load13, <8 x i32>* %28, i32 4, <8 x i1> %16), !dbg !95, !alias.scope !84, !noalias !89

Loops with indexes are often an anti-pattern in Rust -- see needless_range_loop in clippy -- so I'd be particularly sad if we got any feature that only worked with them.


Thank you for the explanation. Based on the discussion above, I think maybe we can approach the auto-vectorization MIR opt from another direction: instead of calling SIMD intrinsics directly in MIR, we just adjust the structure of the loop so that LLVM's auto-vectorization mechanism recognizes it. Take the earlier example:

pub fn func1(arr1: &mut [u32], arr2: &mut [u32]) {
    for i in 1..arr2.len() {
        arr2[i] += arr2[i - 1];
        arr1[i] += arr2[i];
    }
}

It's not currently optimized by LLVM's auto-vectorization, but if we analyze the MIR and adjust its structure to something like this:

pub fn func1(arr1: &mut [u32], arr2: &mut [u32]) {
    for i in 1..arr2.len() {
        arr2[i] += arr2[i - 1];
    }
    for i in 1..arr2.len() {
        arr1[i] += arr2[i];
    }
}

Then it's easily auto-vectorized by LLVM: Compiler Explorer

Likewise, for the following example:

pub fn func1(arr1: &mut [u32], arr2: &[u32]) {
    for i in 0..arr1.len() {
        if arr1[i] > 100 {
            arr1[i] = arr2[i];
        }
    }
}

It can also be auto-vectorized by LLVM if we tweak it a bit through a MIR opt: Compiler Explorer

pub fn func1(arr1: &mut [u32], arr2: &[u32]) {
    for i in 0..arr1.len() {
        let x = arr1[i];
        let y = arr2[i];
        if x > 100 {
            arr1[i] = y;
        }
    }
}

I don't think those are valid transformations:

  • In the first one, arr1[i] could panic, and the transformation changes the function's observable result: in the original, a panic partway through the loop means the remaining arr2 elements are never updated, while the split version updates all of them first;

  • In the second one, arr2[i] could panic even when arr1[i] doesn't (consider for example func1(&mut [101, 0], &[42]): this doesn't panic today, but with your transformation it would).

Edit: compile -> panic typo, not sure what I was thinking at that time.


I think this is also the reason why sometimes developers have to use SIMD intrinsics manually - developers themselves need to be responsible for the correctness of the program.

I think we could declare some optimization passes as unsound at first. This at least gives developers a choice, if they can guarantee the correctness of the program from its overall logic themselves.

Note that that first loop is a prefix sum, which has data dependencies that make it non-trivial: https://en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithms

LLVM will unroll it, but it doesn't vectorize it in either index or iterator forms: https://rust.godbolt.org/z/1vdErE8jE

There are a multitude of possible choices for how to go about it, with different cache and parallelism tradeoffs.

EDIT: I stumbled on this great chart demonstrating that from Prefix Sum with SIMD - Algorithmica
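
For what it's worth, even the in-register step is a shuffle-and-add cascade rather than a single instruction. A minimal sketch (mine) of the classic 4-lane version with SSE2 intrinsics:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// In-register prefix sum of 4 u32 lanes: [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d].
// A prefix sum over a whole slice still has to carry the last lane of each chunk
// into the next one, which is where the tradeoffs in that chart come from.
#[cfg(target_arch = "x86_64")]
unsafe fn prefix_sum_4(v: __m128i) -> __m128i {
    // shift up by one 32-bit lane and add: [a, b, c, d] + [0, a, b, c]
    let v = _mm_add_epi32(v, _mm_slli_si128::<4>(v));
    // shift up by two lanes and add: [a, a+b, b+c, c+d] + [0, 0, a, a+b]
    _mm_add_epi32(v, _mm_slli_si128::<8>(v))
}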


Most of the time you can "solve" the problem that blocks those transformations by adding an explicit assert at the start which guarantees that the later assertions will always succeed if the first assert succeeds. In contrast, the current behaviour is a sequence of asserts where any one could fail even if the previous ones succeeded, and this is what makes many transformations (and thus optimizations) invalid.

Unfortunately this is something that needs to be done by the programmer writing the code, not the compiler.
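
For the earlier compare-and-assign example, the up-front check could look something like this sketch (once the assert passes, neither indexing operation in the loop can panic, so the body is free to be reordered or vectorized):

pub fn func1_assert_upfront(arr1: &mut [u32], arr2: &[u32]) {
    // a single check up front guarantees every later arr2[i] is in bounds
    // whenever arr1[i] is
    assert!(arr2.len() >= arr1.len());
    for i in 0..arr1.len() {
        if arr1[i] > 100 {
            arr1[i] = arr2[i];
        }
    }
}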

I feel like this could be a footgun. If you know the algorithm is right, then unsafe is already an option and makes what's happening pretty clear. Relying on some obscure optimization doing what you expect feels much more fragile.


It seems that I'm heading in the wrong direction entirely. Thank you all for taking the time to point this out. I think I won't go any further in this direction.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.