Update
After some more attempts, I discovered that Clang generates TBAA (type-based alias analysis) metadata while rustc does not, which leads LLVM to a more conservative optimization strategy for these loops. For more details, check godbolt.
I'm quite curious why Rust doesn't emit TBAA metadata...
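My understanding (a hedged guess, not an authoritative answer): Clang can emit TBAA metadata because C/C++ strict aliasing forbids accessing an object through an incompatible type, so the frontend can tell LLVM that, say, an i32 access never aliases a float object. Rust has no strict-aliasing rule; reinterpreting memory through differently-typed raw pointers is well-defined, so rustc has no type-based no-alias facts to hand to LLVM. A minimal sketch of a pattern that is legal in Rust but would violate strict aliasing in C:

```rust
// In C, storing through an `int *` and then reading the same memory as a
// `float` violates strict aliasing, so Clang's TBAA may assume the
// accesses don't alias. In Rust this is well-defined, which (as far as I
// can tell) is why rustc cannot emit equivalent TBAA metadata.
fn reinterpret(x: &mut f32) -> i32 {
    let p = x as *mut f32 as *mut i32;
    unsafe {
        *p = 0x3f80_0000; // bit pattern of 1.0f32; a legal typed store in Rust
        *p
    }
}
```

Calling `reinterpret(&mut f)` leaves `f == 1.0` and returns the raw bits, so an optimizer assuming type-based no-alias here would miscompile Rust code.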
Problem details
This topic is a continuation of that topic.
Through further investigation, I discovered that the failure of vectorization is caused by the failure of Loop-Invariant Code Motion (LICM). Consider the following minimal example:
#[no_mangle]
unsafe fn add_of_1(a: &mut Vec<Vec<i32>>, b: *const i32) {
    let a = a.as_mut_slice();
    for i in 0..a.len() {
        for j in 0..a[i].len() {
            *a.get_unchecked_mut(i).get_unchecked_mut(j) += *b;
        }
    }
}
#[no_mangle]
unsafe fn add_of(a: &mut [Vec<i32>], b: *const i32) {
    for i in 0..4000 {
        for j in 0..4000 {
            *a.get_unchecked_mut(i).get_unchecked_mut(j) += *b;
        }
    }
}
Here is the LLVM IR at opt-level=1, showing the inner loop of each function:
bb8.preheader:
%spec.select18 = phi i64 [ 1, %start ], [ %spec.select, %bb2.loopexit ]
%iter.sroa.0.017 = phi i64 [ 0, %start ], [ %spec.select18, %bb2.loopexit ]
%0 = getelementptr inbounds %"alloc::vec::Vec<i32>", ptr %a.0, i64 %iter.sroa.0.017, i32 0, i32 1
%self1.i = load ptr, ptr %0, align 8
br label %bb10
bb6:
ret void
bb10:
%_0.i.i.i616 = phi i1 [ true, %bb8.preheader ], [ %_0.i.i.i6, %bb10 ]
%iter1.sroa.0.015 = phi i64 [ 0, %bb8.preheader ], [ %spec.select13, %bb10 ]
%_0.i1.i.i9 = zext i1 %_0.i.i.i616 to i64
%spec.select13 = add nuw i64 %iter1.sroa.0.015, %_0.i1.i.i9
%_15 = load i32, ptr %b, align 4
%_0.i.i11 = getelementptr inbounds i32, ptr %self1.i, i64 %iter1.sroa.0.015
%1 = load i32, ptr %_0.i.i11, align 4
%2 = add i32 %1, %_15
store i32 %2, ptr %_0.i.i11, align 4
%_0.i.i.i6 = icmp ult i64 %spec.select13, 4000
br i1 %_0.i.i.i6, label %bb10, label %bb2.loopexit
bb8.preheader:
%spec.select20 = phi i64 [ 1, %start ], [ %spec.select, %bb2.loopexit ]
%iter.sroa.0.019 = phi i64 [ 0, %start ], [ %spec.select20, %bb2.loopexit ]
%1 = getelementptr inbounds %"alloc::vec::Vec<i32>", ptr %self1.i, i64 %iter.sroa.0.019, i32 0, i32 1
br label %bb10
bb6:
ret void
bb10:
%_0.i.i.i618 = phi i1 [ true, %bb8.preheader ], [ %_0.i.i.i6, %bb10 ]
%iter1.sroa.0.017 = phi i64 [ 0, %bb8.preheader ], [ %spec.select15, %bb10 ]
%_0.i1.i.i9 = zext i1 %_0.i.i.i618 to i64
%spec.select15 = add nuw i64 %iter1.sroa.0.017, %_0.i1.i.i9
%_15 = load i32, ptr %b, align 4
%self1.i11 = load ptr, ptr %1, align 8
%_0.i.i13 = getelementptr inbounds i32, ptr %self1.i11, i64 %iter1.sroa.0.017
%2 = load i32, ptr %_0.i.i13, align 4
%3 = add i32 %2, %_15
store i32 %3, ptr %_0.i.i13, align 4
%_0.i.i.i6 = icmp ult i64 %spec.select15, 4000
br i1 %_0.i.i.i6, label %bb10, label %bb2.loopexit
The load of the inner vector's data pointer (%self1.i11 = load ptr, ptr %1, align 8 in the second snippet) is not hoisted out of the inner loop, so LLVM chooses to unroll the loop instead of vectorizing it.
For more details, see here and here.
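For this particular shape of code there is a workaround available today: an iterator-based rewrite reborrows each inner Vec as a slice exactly once per outer iteration, so the inner data pointer is loop-invariant by construction and no LICM is needed. A sketch (note the hoisted read of *b, which assumes b does not alias a):

```rust
// Iterator-based rewrite of `add_of_1` (a sketch, not the original code):
// each inner Vec is reborrowed as a slice once per outer iteration, so its
// data pointer is naturally loop-invariant and the inner loop can be
// vectorized without relying on LICM.
unsafe fn add_of_iter(a: &mut Vec<Vec<i32>>, b: *const i32) {
    let add = *b; // read `b` once up front; assumes `b` does not alias `a`
    for row in a.iter_mut() {
        for x in row.iter_mut() {
            *x += add;
        }
    }
}
```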
Why does this problem matter?
As described in this topic, nested vectors are quite typical in daily work. Manually optimizing such code can be quite tedious. Similar optimizations are already available in Clang.
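For reference, the manual optimization in question is essentially a one-line hoist. A sketch against the Vec<Vec<i32>> shape from the example above:

```rust
// Manual hoist (a sketch): take the inner slice, and therefore its data
// pointer, once per row, instead of indexing through the outer slice on
// every inner iteration.
unsafe fn add_of_hoisted(a: &mut [Vec<i32>], b: *const i32) {
    for i in 0..a.len() {
        let inner = a.get_unchecked_mut(i).as_mut_slice();
        for j in 0..inner.len() {
            *inner.get_unchecked_mut(j) += *b;
        }
    }
}
```

This is exactly the kind of mechanical rewrite that is easy to forget and tedious to apply across a codebase.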
Alternatively, we could introduce a lint, such as deref_before_loop, which would look like this:
#[no_mangle]
unsafe fn add_of(a: &mut MyStruct, b: *const i32) {
    for i in 0..4000 {
        // lint: take the slice before the inner loop so it can be vectorized
        let inner_vec = a.get_ith_vector_mut(i).as_mut_slice();
        for j in 0..4000 {
            *inner_vec.get_unchecked_mut(j) += *b;
        }
    }
}