Hello everyone. I have made some contributions to both stdarch and rustc in the rust-lang organization over the last two years, and I would like to share my thoughts on implementing automatic vectorization in rustc. (I have done a preliminary implementation based on Rust 1.60.0 and written documentation for it in the README of this repository.)
SIMD (Single Instruction, Multiple Data), also called vectorization, is a commonly used acceleration technique in computational workloads. LLVM provides some automatic vectorization optimizations, but in many scenarios we still need to rewrite a program with SIMD instructions by hand to get the acceleration.
Rust developers currently have several ways to use these SIMD instructions:
· stdarch
· portable-SIMD
· using the `extern "platform-intrinsics"` ABI directly
But all of these approaches require developers to rewrite their original code, which has several distinct drawbacks:
(1) It makes the code complex and hard to read, multiplying its size (see the sketch below).
(2) For developers unfamiliar with SIMD, getting these speedups is very difficult.
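For instance, here is roughly what a hand-written stdarch version of a simple f32 sum looks like on x86_64. This is an illustrative sketch, not code from my repository; a production version would also need runtime feature detection to use wider instruction sets than baseline SSE:

```rust
#[cfg(target_arch = "x86_64")]
fn simd_sum(arr: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    // SSE is part of the x86_64 baseline, so these intrinsics are always
    // available on this target; they are still `unsafe fn` to call.
    unsafe {
        let mut acc = _mm_setzero_ps(); // four f32 partial sums
        let chunks = arr.chunks_exact(4);
        let tail = chunks.remainder(); // leftover elements, handled scalar
        for chunk in chunks {
            // Unaligned 4-lane load, then lane-wise add into the accumulator.
            acc = _mm_add_ps(acc, _mm_loadu_ps(chunk.as_ptr()));
        }
        // Horizontal reduction of the four lanes, plus the scalar tail.
        let mut lanes = [0.0f32; 4];
        _mm_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
    }
}
```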
To solve these problems, we can implement an auto-vectorization feature in the Rust compiler. That way, we don't need to change the original code at all; we only need to add an attribute to a function to get the SIMD acceleration.
I have made a preliminary implementation of this feature. It is based on Rust MIR, since MIR clearly expresses the structure of a function. It behaves like a normal MIR optimization pass: it automatically analyzes and refactors the loops in the MIR, aiming to obtain SIMD acceleration in as many scenarios as possible while preserving safety and program semantics.
For example:
```rust
#[vectorization]
fn func1(arr: &[f32]) -> f32 {
    let mut sum = 0.;
    for i in arr {
        sum += *i;
    }
    sum
}

fn func2(arr: &[f32]) -> f32 {
    let mut sum = 0.;
    for i in arr {
        sum += *i;
    }
    sum
}
```
(Test code is here)
In this example, the only difference between func1 and func2 is the #[vectorization] attribute added to func1, indicating that automatic vectorization is enabled for that function. We call the two functions separately in the main function and measure the time each one takes. After compiling and running with rustc -Copt-level=3 -Zmir-opt-level=4 in a local x86_64 environment, the results are as follows:
t1: 327
t2: 6519
ans1: 2201600
ans2: 2201600
We can see that the first function is almost 20 times faster than the second.
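For reference, the timing harness looks roughly like the following. This is a minimal sketch: the real test code is in the repository, the array contents and size here are made up, and I am assuming the timings are reported in microseconds:

```rust
use std::time::Instant;

fn main() {
    // Hypothetical test data; the actual contents and size are in the
    // repository's test code.
    let arr: Vec<f32> = (0..1_000_000).map(|i| (i % 100) as f32).collect();

    let start = Instant::now();
    let ans1 = func1(&arr); // attribute-vectorized version
    let t1 = start.elapsed().as_micros();

    let start = Instant::now();
    let ans2 = func2(&arr); // plain scalar version
    let t2 = start.elapsed().as_micros();

    println!("t1: {}", t1);
    println!("t2: {}", t2);
    println!("ans1: {}", ans1);
    println!("ans2: {}", ans2);
}
```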
More practical example: calculating the variance of an array:
Variance is a very commonly used quantity in statistics. The functions below compute it via the identity Var(x) = E[x²] − E[x]² and then take the square root, so strictly speaking they return the standard deviation. The test code is here.
We can see the following two functions:
```rust
#[vectorization]
fn var1(arr: &[f32]) -> f32 {
    let mut sq_sum = 0.;
    let mut sum = 0.;
    let len = arr.len() as f32;
    for i in arr {
        sum += i;
        sq_sum += i * i;
    }
    let ave = sum / len;
    (sq_sum / len - ave * ave).sqrt()
}

fn var2(arr: &[f32]) -> f32 {
    let mut sq_sum = 0.;
    let mut sum = 0.;
    let len = arr.len() as f32;
    for i in arr {
        sum += i;
        sq_sum += i * i;
    }
    let ave = sum / len;
    (sq_sum / len - ave * ave).sqrt()
}
```
The difference between the two functions is again only the #[vectorization] attribute. Using the same compilation command and environment as before, the results are as follows:
t1: 7
t2: 67
ans1: 28.667055
ans2: 28.649292
We can see that the first function is about 10 times faster than the second one. The results differ slightly because the SIMD instructions perform the floating-point operations in a different order: floating-point addition is not associative, so reordering a reduction produces different rounding errors. There is currently no good solution to this problem.
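As a minimal, self-contained demonstration of the underlying issue (unrelated to the repository's code):

```rust
fn main() {
    // f32 has a 24-bit significand, so near 1.0e8 adjacent representable
    // values are 8.0 apart and adding 1.0 rounds away entirely.
    let a = 1.0e8f32;
    let b = 1.0f32;
    assert_eq!((a + b) - a, 0.0); // b is absorbed: a + b rounds back to a
    assert_eq!((a - a) + b, 1.0); // same numbers, different order: exactly 1.0
}
```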
More complex example
We can look at this more complex example below. The test code is here.
```rust
#[vectorization]
fn func1(src: &[f32], src2: &[f32], val: &mut [f32]) {
    let mut sum = 0.;
    for x in 0..src.len() {
        let v = src[x];
        sum += v * v;
        val[x] = src2[x] + sum;
    }
}
```
In this example, the value of the variable sum is updated in each iteration of the loop based on its value from the previous iteration. Such a loop-carried dependency significantly limits the effect of auto-vectorization. We can check the results:
t1: 6984
t2: 7952
The optimization effect is not as dramatic, but there is still a more than 13% efficiency improvement.
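To see why, it helps to split the computation by hand. In the hypothetical restructuring below (illustrative only, not what the pass actually emits), the element-wise squaring and the final addition are independent across iterations and vectorize well, while the running sum in the middle is an inherently sequential prefix sum:

```rust
fn func1_restructured(src: &[f32], src2: &[f32], val: &mut [f32]) {
    // Vectorizable: independent element-wise squaring.
    let squares: Vec<f32> = src.iter().map(|v| v * v).collect();

    // Sequential: each prefix sum depends on the previous one.
    let mut sum = 0.0f32;
    let prefix: Vec<f32> = squares
        .iter()
        .map(|s| {
            sum += s;
            sum
        })
        .collect();

    // Vectorizable: independent element-wise addition.
    for x in 0..src.len() {
        val[x] = src2[x] + prefix[x];
    }
}
```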
There are more detailed descriptions, along with the implementation code, in this repository. I think this feature deserves a proposal, and anyone interested is welcome to work on it together.