TBAA Absence Causes LICM Optimization Failures

Updated

After some attempts, I discovered that Clang emits TBAA metadata while Rust does not, which leads LLVM to a more conservative optimization strategy for these loops. For more details, check godbolt.

I'm so curious why Rust doesn't have TBAA...

Problem details

This topic is a continuation of that topic.

Through further investigation, I discovered that the failure of vectorization is caused by the failure of Loop-Invariant Code Motion (LICM). Consider the following minimal example:

#[no_mangle]
unsafe fn add_of_1(a: &mut Vec<Vec<i32>>, b: *const i32) {
    let a = a.as_mut_slice();
    for i in 0..a.len() {
        for j in 0..a[i].len() {
            *a.get_unchecked_mut(i).get_unchecked_mut(j) += *b;
        }
    }
}

#[no_mangle]
unsafe fn add_of(a: &mut [Vec<i32>], b: *const i32) {
    for i in 0..4000 {
        for j in 0..4000 {
            *a.get_unchecked_mut(i).get_unchecked_mut(j) += *b;
        }
    }
}

Here is the LLVM IR at opt-level=1, showing the inner loop of each function. In the first dump the load of the inner Vec's data pointer sits in the preheader; in the second it stays inside the loop body:

bb8.preheader:
  %spec.select18 = phi i64 [ 1, %start ], [ %spec.select, %bb2.loopexit ]
  %iter.sroa.0.017 = phi i64 [ 0, %start ], [ %spec.select18, %bb2.loopexit ]
  %0 = getelementptr inbounds %"alloc::vec::Vec<i32>", ptr %a.0, i64 %iter.sroa.0.017, i32 0, i32 1
  %self1.i = load ptr, ptr %0, align 8
  br label %bb10

bb6:
  ret void

bb10:
  %_0.i.i.i616 = phi i1 [ true, %bb8.preheader ], [ %_0.i.i.i6, %bb10 ]
  %iter1.sroa.0.015 = phi i64 [ 0, %bb8.preheader ], [ %spec.select13, %bb10 ]
  %_0.i1.i.i9 = zext i1 %_0.i.i.i616 to i64
  %spec.select13 = add nuw i64 %iter1.sroa.0.015, %_0.i1.i.i9
  %_15 = load i32, ptr %b, align 4
  %_0.i.i11 = getelementptr inbounds i32, ptr %self1.i, i64 %iter1.sroa.0.015
  %1 = load i32, ptr %_0.i.i11, align 4
  %2 = add i32 %1, %_15
  store i32 %2, ptr %_0.i.i11, align 4
  %_0.i.i.i6 = icmp ult i64 %spec.select13, 4000
  br i1 %_0.i.i.i6, label %bb10, label %bb2.loopexit

bb8.preheader:
  %spec.select20 = phi i64 [ 1, %start ], [ %spec.select, %bb2.loopexit ]
  %iter.sroa.0.019 = phi i64 [ 0, %start ], [ %spec.select20, %bb2.loopexit ]
  %1 = getelementptr inbounds %"alloc::vec::Vec<i32>", ptr %self1.i, i64 %iter.sroa.0.019, i32 0, i32 1
  br label %bb10

bb6:
  ret void

bb10:
  %_0.i.i.i618 = phi i1 [ true, %bb8.preheader ], [ %_0.i.i.i6, %bb10 ]
  %iter1.sroa.0.017 = phi i64 [ 0, %bb8.preheader ], [ %spec.select15, %bb10 ]
  %_0.i1.i.i9 = zext i1 %_0.i.i.i618 to i64
  %spec.select15 = add nuw i64 %iter1.sroa.0.017, %_0.i1.i.i9
  %_15 = load i32, ptr %b, align 4
  %self1.i11 = load ptr, ptr %1, align 8
  %_0.i.i13 = getelementptr inbounds i32, ptr %self1.i11, i64 %iter1.sroa.0.017
  %2 = load i32, ptr %_0.i.i13, align 4
  %3 = add i32 %2, %_15
  store i32 %3, ptr %_0.i.i13, align 4
  %_0.i.i.i6 = icmp ult i64 %spec.select15, 4000
  br i1 %_0.i.i.i6, label %bb10, label %bb2.loopexit

In the second dump, the dereference of the inner vector (%self1.i11 = load ptr, ptr %1, align 8) is not hoisted out of the inner loop, so the LLVM optimizer chooses to unroll the loop instead of vectorizing it.

For more details, see here and here.

Why Does This Problem Matter?

As described in this topic, nested vectors are quite common in daily work, and manually optimizing such code can be quite tedious. Clang already performs this optimization.

Alternatively, we could introduce a lint, such as deref_before_loop, which would look like this:

#[no_mangle]
unsafe fn add_of(a: &mut MyStruct, b: *const i32) {
    for i in 0..4000 {
        // lint: take the raw data pointer before the inner loop to vectorize it
        let inner_vec = a.get_ith_vector_mut(i).as_mut_ptr();
        for j in 0..4000 {
            *inner_vec.add(j) += *b;
        }
    }
}

It is generally advisable to use iterators rather than manual loop + indexing, as iterators tend to optimize better by avoiding redundant bounds checks. In addition, iterators are safer than sidestepping bounds checks with unsafe code.

5 Likes

Just to emphasise what @bjorn3 said: I rewrote add_of_1 to use iterators rather than indexing, and the rewrite (a) doesn't need unsafe, and (b) vectorizes successfully. It's also, IMO, clearer about what's going on: iterate over every vector in a, and for each of those vectors add b to every element.

6 Likes

This sounds like either a problem with LLVM, or with the IR that rustc generates not giving enough information to LLVM. What optimizations does Clang do to avoid this issue?

I don't think a lint would be good here, since there's a high chance it will be made useless by a future upgrade of LLVM. It would also be very specific to LLVM (what about the GCC and Cranelift backends?).

4 Likes

I have found the reason for the LICM optimization failures: the lack of TBAA. Please check the update at the top.

TBAA is often confusing to people (many major C projects pass -fno-strict-aliasing to disable it) and only really makes sense when you have typed memory, as in C and C++ but not Rust. Rust also doesn't rely on it as much, because the aliasing rules of references let us emit the equivalent of C's restrict keyword at the LLVM level (noalias) for all references.

3 Likes

Rust does not have TBAA, and will not have TBAA.

We let you read a u32 through a *const [u16; 2], unlike C++. That's much easier for people to understand, and it means checkers don't need to know the defined type of the memory -- which would be a ton of extra state to track -- because the UB definitions are on typed operations instead.

The non-overlap we get from &mut is a far better solution for users than TBAA, especially because two &mut [f32]s are known to be distinct thanks to &mut, while TBAA doesn't help that case at all.

Stop using *const i32 for things like this. Use &i32, and the load of *b is trivially allowed to be hoisted.

6 Likes

Note that even with *const i32 changed to &i32, you can't persuade LLVM to vectorize the indexing-based version of the loop -- Compiler Explorer shows that it just won't do it. Making it iterate over the Vec does work.

1 Like