MIR: Optimize for-loop on integer ranges

https://rust.godbolt.org/z/195ee687T

pub fn test_v1(n: usize) -> usize {
    let mut s = 0;
    for i in 1..n {
        s += i;
    }
    s
}

pub fn test_v2(n: usize) -> usize {
    let mut s = 0;
    let mut i = 1;
    while i < n {
        s += i;
        i += 1;
    }
    s
}

test_v1 and test_v2 are equivalent, but test_v1 generates a ton of MIR once fully inlined.

What if rustc could transform test_v1 into test_v2 while optimizing MIR? I believe it would give some compile-time perf wins and may generate better machine code in more complicated cases.

The unoptimized for-loop has already caused an issue for SIMD code: LLVM fails to optimize it.

https://github.com/rust-lang/rust/issues/102220


Excellent idea.

But it seems hard for rustc to achieve.

IIRC, `for` logically calls the iterator's `next` function, and rustc has little special knowledge of iterators.

For example, if you write:

let mut a: i32;
let mut b: i32 = 0;
for i in 0..10 {
    b += i * i;
    a = b;
}
dbg!(a);

the program won't compile, since Rust cannot infer whether `a` is initialized:

4 | for i in 0..10 {
  |          ---- if the `for` loop runs 0 times, `a` is not initialized

It might be hard to do such an optimization.

I think it would be easier because:

  • Rust has built-in range syntax.
  • rustc knows whether the range is Range<{integer}>.
  • rustc has MIR optimizations which can transform special cases like slice::len.

Maybe we could talk about another trait, FixedSizeIterator<T, Idx>,

where FixedSizeIterator::next(&mut self) returns T directly, and a const fn returns the element count as Idx, where Idx is one of u8, u16, u32, u64 or u128.

After that, when we write Range::into_iter, we would get a FixedSizeIterator.

But that would actually be an API break:

for i in 0u8.. {
    // triggers an overflow panic when we try to yield the 256th element
}
for i in FixedSizeIterator(0u8..) {
    // should not panic, since the element count is a u8, thus at most 255
}

Maybe a discussion thread is needed, since we would be changing something.

Wow, there is a lot of garbage in the MIR of test_v1! Variables created and instantly moved out of, references created and instantly reborrowed.

_8 = &mut _5;                    // scope 2 at /app/example.rs:3:14: 3:18
// _7 is not used anywhere else
_7 = &mut (*_8);                 // scope 2 at /app/example.rs:3:14: 3:18
StorageLive(_12);                // scope 5 at /rustc/57f097ea25f2c05f424fc9b9dc50dbd6d399845c/library/core/src/iter/range.rs:711:9: 711:25
_12 = &mut (*_7);                // scope 5 at /rustc/57f097ea25f2c05f424fc9b9dc50dbd6d399845c/library/core/src/iter/range.rs:711:9: 711:25
StorageLive(_11);                // scope 3 at /app/example.rs:4:14: 4:15
// _11 is otherwise unused
_11 = _10;                       // scope 3 at /app/example.rs:4:14: 4:15
_0 = Add(_0, move _11);          // scope 3 at /app/example.rs:4:9: 4:15
StorageDead(_11);                // scope 3 at /app/example.rs:4:14: 4:15

I don't know whether the MIR for test_v1 can be made as succinct as for test_v2, but there is a ton of low-hanging fruit here. Simply eliminating redundant variables should cut MIR size in half, if not more. There is no reason to do lots of reborrows, or to store a range as a tuple and keep destructuring it. I wonder if there is some extra flag to enable those optimizations on current Rust?


While in this case it is pretty pointless, in the case of calls, copies can be necessary for correctness. rustc_mir_build doesn't know when they are necessary and when not (doing so requires a non-local analysis), so it unconditionally inserts the copies just in case. We also don't yet know the exact rules. In any case, you might be interested in following the ongoing work.

Optimizing this is done using SROA (scalar replacement of aggregates), which an in-progress MIR pass will implement.


This seems completely different from what the OP is talking about. The OP just wants the resulting codegen to be the same, while you want different semantics for `for`. Moreover, the current semantics of the `while` in the OP's post still don't let the compiler infer that the loop body will run at least once.


Can you explain more about this? LLVM optimizes test_v1 down to the closed-form solution (https://rust.godbolt.org/z/88obnj1jK), so the extra noise in the MIR clearly doesn't impact it.

LLVM fails to optimize the loop when the function is annotated with target_feature(enable = "neon"). I'm not sure whether it is an LLVM bug. https://rust.godbolt.org/z/1M6WvWqjG

That's probably due to ABI differences preventing inlining. If you specify -C target-feature=+neon as a compiler option instead, it works.


Interesting that a hand-written range iterator works: https://rust.godbolt.org/z/E4Y8Tv1o5

It seems that the std Range implementation is too complex for the optimizer.