Update
After some more attempts, I discovered that Clang generates TBAA (type-based alias analysis) metadata while rustc does not, which leads LLVM to a more conservative optimization strategy for these loops. For more details, check godbolt.
I'm quite curious why Rust doesn't emit TBAA metadata...
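My understanding (a hedged guess, not an authoritative answer): Clang can emit TBAA metadata because C/C++ strict aliasing forbids accessing an object through an incompatible type, so the frontend can tell LLVM that, say, an i32 access never aliases a float object. Rust has no strict-aliasing rule; reinterpreting memory through differently-typed raw pointers is well-defined, so rustc has no type-based no-alias facts to hand to LLVM. A minimal sketch of a pattern that is legal in Rust but would violate strict aliasing in C:

```rust
// In C, storing through an `int *` and then reading the same memory as a
// `float` violates strict aliasing, so Clang's TBAA may assume the
// accesses don't alias. In Rust this is well-defined, which (as far as I
// can tell) is why rustc cannot emit equivalent TBAA metadata.
fn reinterpret(x: &mut f32) -> i32 {
    let p = x as *mut f32 as *mut i32;
    unsafe {
        *p = 0x3f80_0000; // bit pattern of 1.0f32; a legal typed store in Rust
        *p
    }
}
```

Calling `reinterpret(&mut f)` leaves `f == 1.0` and returns the raw bits, so an optimizer assuming type-based no-alias here would miscompile Rust code.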
Problem details
This topic is a continuation of that topic.
Through further investigation, I discovered that the failure of vectorization is caused by the failure of Loop-Invariant Code Motion (LICM). Consider the following minimal example:
#[no_mangle]
unsafe fn add_of_1(a: &mut Vec<Vec<i32>>, b: *const i32) {
    let a = a.as_mut_slice();
    for i in 0..a.len() {
        for j in 0..a[i].len() {
            *a.get_unchecked_mut(i).get_unchecked_mut(j) += *b;
        }
    }
}
#[no_mangle]
unsafe fn add_of(a: &mut [Vec<i32>], b: *const i32) {
    for i in 0..4000 {
        for j in 0..4000 {
            *a.get_unchecked_mut(i).get_unchecked_mut(j) += *b;
        }
    }
}
Here is the LLVM IR at opt-level=1, showing the inner loop of each function:
bb8.preheader:
%spec.select18 = phi i64 [ 1, %start ], [ %spec.select, %bb2.loopexit ]
%iter.sroa.0.017 = phi i64 [ 0, %start ], [ %spec.select18, %bb2.loopexit ]
%0 = getelementptr inbounds %"alloc::vec::Vec<i32>", ptr %a.0, i64 %iter.sroa.0.017, i32 0, i32 1
%self1.i = load ptr, ptr %0, align 8
br label %bb10
bb6:
ret void
bb10:
%_0.i.i.i616 = phi i1 [ true, %bb8.preheader ], [ %_0.i.i.i6, %bb10 ]
%iter1.sroa.0.015 = phi i64 [ 0, %bb8.preheader ], [ %spec.select13, %bb10 ]
%_0.i1.i.i9 = zext i1 %_0.i.i.i616 to i64
%spec.select13 = add nuw i64 %iter1.sroa.0.015, %_0.i1.i.i9
%_15 = load i32, ptr %b, align 4
%_0.i.i11 = getelementptr inbounds i32, ptr %self1.i, i64 %iter1.sroa.0.015
%1 = load i32, ptr %_0.i.i11, align 4
%2 = add i32 %1, %_15
store i32 %2, ptr %_0.i.i11, align 4
%_0.i.i.i6 = icmp ult i64 %spec.select13, 4000
br i1 %_0.i.i.i6, label %bb10, label %bb2.loopexit
bb8.preheader:
%spec.select20 = phi i64 [ 1, %start ], [ %spec.select, %bb2.loopexit ]
%iter.sroa.0.019 = phi i64 [ 0, %start ], [ %spec.select20, %bb2.loopexit ]
%1 = getelementptr inbounds %"alloc::vec::Vec<i32>", ptr %self1.i, i64 %iter.sroa.0.019, i32 0, i32 1
br label %bb10
bb6:
ret void
bb10:
%_0.i.i.i618 = phi i1 [ true, %bb8.preheader ], [ %_0.i.i.i6, %bb10 ]
%iter1.sroa.0.017 = phi i64 [ 0, %bb8.preheader ], [ %spec.select15, %bb10 ]
%_0.i1.i.i9 = zext i1 %_0.i.i.i618 to i64
%spec.select15 = add nuw i64 %iter1.sroa.0.017, %_0.i1.i.i9
%_15 = load i32, ptr %b, align 4
%self1.i11 = load ptr, ptr %1, align 8
%_0.i.i13 = getelementptr inbounds i32, ptr %self1.i11, i64 %iter1.sroa.0.017
%2 = load i32, ptr %_0.i.i13, align 4
%3 = add i32 %2, %_15
store i32 %3, ptr %_0.i.i13, align 4
%_0.i.i.i6 = icmp ult i64 %spec.select15, 4000
br i1 %_0.i.i.i6, label %bb10, label %bb2.loopexit
The load of the inner vector's data pointer (%self1.i11 = load ptr, ptr %1, align 8 in the second snippet) is not hoisted out of the inner loop, so LLVM chooses to unroll the loop instead of vectorizing it.
For more details, see here and here.
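For this particular shape of code there is a workaround available today: an iterator-based rewrite reborrows each inner Vec as a slice exactly once per outer iteration, so the inner data pointer is loop-invariant by construction and no LICM is needed. A sketch (note the hoisted read of *b, which assumes b does not alias a):

```rust
// Iterator-based rewrite of `add_of_1` (a sketch, not the original code):
// each inner Vec is reborrowed as a slice once per outer iteration, so its
// data pointer is naturally loop-invariant and the inner loop can be
// vectorized without relying on LICM.
unsafe fn add_of_iter(a: &mut Vec<Vec<i32>>, b: *const i32) {
    let add = *b; // read `b` once up front; assumes `b` does not alias `a`
    for row in a.iter_mut() {
        for x in row.iter_mut() {
            *x += add;
        }
    }
}
```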
Why does this problem matter?
As described in this topic, nested vectors are quite typical in daily work. Manually optimizing such code can be quite tedious. Similar optimizations are already available in Clang.
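For reference, the manual optimization in question is essentially a one-line hoist. A sketch against the Vec<Vec<i32>> shape from the example above:

```rust
// Manual hoist (a sketch): take the inner slice, and therefore its data
// pointer, once per row, instead of indexing through the outer slice on
// every inner iteration.
unsafe fn add_of_hoisted(a: &mut [Vec<i32>], b: *const i32) {
    for i in 0..a.len() {
        let inner = a.get_unchecked_mut(i).as_mut_slice();
        for j in 0..inner.len() {
            *inner.get_unchecked_mut(j) += *b;
        }
    }
}
```

This is exactly the kind of mechanical rewrite that is easy to forget and tedious to apply across a codebase.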
Alternatively, we could introduce a lint, such as deref_before_loop, which would look like this:
#[no_mangle]
unsafe fn add_of(a: &mut MyStruct, b: *const i32) {
    for i in 0..4000 {
        // lint: take the slice before the inner loop so it can be vectorized
        let inner_vec = a.get_ith_vector_mut(i).as_mut_slice();
        for j in 0..4000 {
            *inner_vec.get_unchecked_mut(j) += *b;
        }
    }
}