Iterator performance regression 1.81 -> 1.82

Here is the code:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    (0..=1000)
        .flat_map(|a| {
            (0..=1000).flat_map(move |b| {
                (0..=1000)
                    .filter(move |&c| a * a + b * b == c * c && a + b + c == 1000)
                    .map(move |c| (a, b, c))
            })
        })
        .for_each(|(a, b, c)| {
            println!("a: {} b: {} c: {}", a, b, c);
        });

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

After updating from Rust 1.81 to 1.85, the same code runs approximately 80% slower with:

cargo run --release

on my laptop (Windows 11 26100.2894 + Intel Core i7-10750H). So I bisected through versions 1.82 to 1.84 and found that Rust 1.82 already produces roughly the same (slow) output as Rust 1.85.

The differences between the ASM outputs from Rust 1.81 and 1.82 are:

1.81:

        cmp r14d, eax
        jne .LBB5_6
        cmp esi, r12d
        jne .LBB5_6

1.82:

        xor eax, r14d
        mov ecx, esi
        xor ecx, r12d
        or ecx, eax
        jne .LBB5_5

And from 1.81:

        cmp dword ptr [rsp + 116], 0
        jne .LBB5_10
        cmp r14d, 1000000
        jne .LBB5_10

1.82:

        xor r14d, 1000000
        or dword ptr [rsp + 116], r14d
        jne .LBB5_8

So basically, starting from Rust 1.82, the compiler sometimes generates bitwise operations (XOR/OR) on the compared values instead of separate compare-and-jump pairs, which means those comparisons are no longer short-circuited.
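To make the difference concrete, here is the same pair of strategies written out by hand in Rust (a hand-written illustration, not the compiler's actual output; the function names are mine):

```rust
// Short-circuit form: with `&&`, the second comparison only runs when
// the first one succeeds, matching 1.81's cmp/jne pairs.
fn short_circuit(a: u32, b: u32, x: u32, y: u32) -> bool {
    a == x && b == y
}

// Branchless form: XOR each pair and OR the results, as in the 1.82
// output; both comparisons are always evaluated.
fn branchless(a: u32, b: u32, x: u32, y: u32) -> bool {
    ((a ^ x) | (b ^ y)) == 0
}

fn main() {
    // The two forms always agree on the result; only the work done differs.
    for &(a, b, x, y) in &[(1, 2, 1, 2), (1, 2, 1, 3), (7, 0, 0, 0)] {
        assert_eq!(short_circuit(a, b, x, y), branchless(a, b, x, y));
    }
    println!("ok");
}
```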

I'm not familiar with how rustc/LLVM works, so I don't really know how we should handle this kind of regression.


By the way, the "good old" for-loop style:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    for a in 0..=1000 {
        for b in 0..=1000 {
            for c in 0..=1000 {
                if a * a + b * b == c * c && a + b + c == 1000 {
                    let (a, b, c) = (a, b, c);
                    println!("a: {} b: {} c: {}", a, b, c);
                }
            }
        }
    }

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

takes about the same time to run as the FP-style version compiled with Rust 1.82+ (although the for-loop version compiles to fewer asm instructions).

And if I comment out this line in the innermost loop:

let (a, b, c) = (a, b, c);

the time it takes to run doubles.
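A minimal sketch of what that rebinding does (the search bound is shrunk to 12 here just to keep the example fast, and the counting helper is mine):

```rust
// Count Pythagorean triples (a, b, c) with a + b + c == n, using the
// rebinding trick: `let (a, b, c) = (a, b, c);` copies the loop
// counters into fresh shadowing locals, so println! borrows the copies
// and the counters' own addresses never escape the loop body.
fn count_triples(n: i64) -> usize {
    let mut hits = 0;
    for a in 0..=n {
        for b in 0..=n {
            for c in 0..=n {
                if a * a + b * b == c * c && a + b + c == n {
                    let (a, b, c) = (a, b, c); // copies; shadows the counters
                    println!("a: {} b: {} c: {}", a, b, c);
                    hits += 1;
                }
            }
        }
    }
    hits
}

fn main() {
    // For n = 12 the loop finds (0, 6, 6), (6, 0, 6), (3, 4, 5) and (4, 3, 5).
    assert_eq!(count_triples(12), 4);
}
```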


Meanwhile, with Rust nightly 1.87:

#![feature(gen_blocks)]

use std::time::Instant;

fn main() {
    let start = Instant::now();

    let gen_results = gen {
        for a in 0..=1000 {
            for b in 0..=1000 {
                for c in 0..=1000 {
                    if a * a + b * b == c * c && a + b + c == 1000 {
                        yield (a, b, c);
                    }
                }
            }
        }
    };

    for (a, b, c) in gen_results {
        println!("a: {} b: {} c: {}", a, b, c);
    }

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

The above code takes the same amount of time to run as the naive for-loop with the println! embedded and without let (a, b, c) = (a, b, c); in other words, it matches the slowest variant. And I couldn't find a way to make it faster.


File an issue for the regression.


Done:


This issue is likely due to the loop variables' addresses escaping via the println!, which prevents some optimizations.


Yeah, I've come across that before, but it's still undesirable, since:

println!("a: {} b: {} c: {}", a, b, c);

already semantically copies or moves a, b and c, which means the compiler should be free to change their addresses; if I wanted the original addresses, I would pass in &a, &b and &c.

To me, an optimization like this looks like it could well be a win in the general case (e.g. when the first branch is unpredictable).

While it is of course a toy example, this specific benchmark is somewhat less convincing in that you can easily compute the same results a thousand times faster without the innermost loop:

for a in 0..=1000 {
    for b in 0..=1000 {
        // only value that satisfies `a + b + c == 1000`
        let c = 1000 - a - b;
        if c >= 0 {
            if a * a + b * b == c * c {
                println!("a: {} b: {} c: {}", a, b, c);
            }
        }
    }
}
0.002927962 seconds

In the original, both filtering conditions are true at most once during the innermost loop, so it makes sense that branching on the non-short-circuited cond1() & cond2() would be the slowest choice: the extra work almost never pays off.


println! is supposed to borrow its parameters implicitly: it's a compiler builtin, but it behaves as if you had passed &a, &b and &c. The docs don't mention this, or if they do, it's rather hard to find.

If you want a move, there is dbg!. The output is different though.

Maybe there should be another macro, println_move!, that moves rather than borrows, for the times it matters.


As others said, println! is a macro and internally borrows a, b and c. You can force a move/copy by using a block expression, e.g. println!("a: {}", { a }).
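A small sketch of the difference, using a non-Copy String so the move is observable (the helper name is mine):

```rust
// println! borrows its arguments, so a non-Copy value stays usable
// afterwards; wrapping the argument in a block moves it instead.
fn demo() -> usize {
    let s = String::from("hello");
    println!("{}", s); // borrows: `s` is still owned here
    let len = s.len();
    println!("{}", { s }); // the block expression moves `s` into the macro
    // `s` is gone now; using it here would fail to compile
    len
}

fn main() {
    assert_eq!(demo(), 5);
}
```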


Now that you bring it up, I've tried this:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    (0..=30000)
        .flat_map(|a| {
            (30000 - a..=(60000 - 2 * a).min(30000 - a / 2)).map(move |b| (a, b, 60000 - a - b))
        })
        .filter(|(a, b, c)| a * a + b * b == c * c)
        .for_each(|(a, b, c)| {
            println!("a: {} b: {} c: {}", a, b, c);
        });

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

Good news: the above code actually performs the same when compiled with Rust 1.82+ as with Rust 1.81.

Bad news: the for-loop version is still 100% slower:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    for a in 0..=30000 {
        for b in (30000 - a)..=(60000 - 2 * a).min(30000 - a / 2) {
            let c = 60000 - a - b;
            if a * a + b * b == c * c {
                println!("a: {} b: {} c: {}", a, b, c);
            }
        }
    }

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

And this time, neither adding

let (a, b, c) = (a, b, c);

nor replacing

println!("a: {} b: {} c: {}", a, b, c);

with

println!("a: {} b: {} c: {}", { a }, { b }, { c });

helps.


It's actually quite sad because the for-loop version in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// `min` is not in standard C; MSVC's <stdlib.h> provides it on Windows.
// Define it for other platforms:
#ifndef min
#define min(x, y) ((x) < (y) ? (x) : (y))
#endif

int main(void) {
    clock_t start, finish;
    start = clock();
    double duration;

    for (int a = 0; a <= 30000; a++) {
        for (int b = 30000 - a; b <= min(60000 - 2 * a, 30000 - a / 2); b++) {
            int c = 60000 - a - b;
            if (a * a + b * b == c * c) {
                printf("a: %d b: %d c: %d\n", a, b, c);
            }
        }
    }

    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%f seconds\n", duration);
    return 0;
}

compiled with:

clang -O3 benchmark.c -o benchmark.exe

runs as fast as the FP-style version in Rust.

I don't see why for-loops in Rust shouldn't be able to run just as fast.

This is probably tied to the fact that inclusive ranges are unfortunately very difficult for LLVM to optimize: they can't just use <= where half-open ranges use <, because they need to handle the case where the upper bound is the maximum value of the type.

“Internal” iteration with for_each doesn't have this issue, so a slight change to the for-loop version should recover the performance, e.g.

    (0..=30000).for_each(|a| {
        ((30000 - a)..=(60000 - 2 * a).min(30000 - a / 2)).for_each(|b| {
            let c = 60000 - a - b;
            if a * a + b * b == c * c {
                println!("a: {} b: {} c: {}", a, b, c);
            }
        })
    })

It's definitely far from ideal that this pitfall exists, but it's proven very difficult to address. It'll probably take a peephole optimization in LLVM to turn the last-iteration loop flag that ..= uses into a plain range check whenever the bound is within the type's domain.
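When the bound is known to be below the type's maximum, the inclusive range can also be rewritten as a half-open one by hand, sidestepping the extra exhaustion flag; a toy check that the two forms agree (whether this actually recovers the lost performance in a given case still needs measuring):

```rust
// RangeInclusive carries an extra "done" flag because its upper bound
// may be the type's maximum value; a half-open range needs no flag.
fn sum_inclusive(n: i64) -> i64 {
    (0..=n).sum()
}

fn sum_half_open(n: i64) -> i64 {
    // valid only because n < i64::MAX, so n + 1 cannot overflow
    (0..n + 1).sum()
}

fn main() {
    // Both forms visit exactly the values 0, 1, ..., n.
    assert_eq!(sum_inclusive(30000), sum_half_open(30000));
    println!("{}", sum_inclusive(30000));
}
```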


I think this reduced loop reproduces the same regression:

// NOTE: to simplify the assembly, will loop infinitely if there is no solution
pub fn looptest(
    d: u64,
    s: u64,
) -> u64 {
    let mut c = 0;
    loop {
        if s == c * c && d + c == 1000000 {
            return c;
        }
        c += 1;
    }
}

godbolt

full benchmark code:

// NOTE: to simplify the assembly, will loop infinitely if there is no solution
#[no_mangle]
#[inline(never)]
pub fn looptest(
    d: u64,
    s: u64,
) -> u64 {
    let mut c = 0;
    loop {
        if s == c * c && d + c == SUM {
            return c;
        }
        c += 1;
    }
}

const SUM: u64 = 1_000_000_000_000;

fn main() {
    let c = std::hint::black_box(2_000_000_000);
    {
        let f: fn(u64, u64) -> u64 = std::hint::black_box(looptest);
        let start = std::time::Instant::now();
        assert_eq!(c, f(SUM - c, c * c));
        println!("c = {c} took {} seconds (merged)", start.elapsed().as_secs_f64());
    }
}

Which ran in 0.69s with rustc 1.81, and 1.06s with rustc 1.83.


I tested the performance of these variants on my laptop:

Implementation                     RangeInclusive (<=)   Range (<)
FP-style with flat_map & filter    ~0.1s                 ~0.125s
Plain for-loop                     ~0.2s                 ~0.125s
for replaced with .for_each()      ~0.15s                ~0.125s
for-loop in C                      ~0.1s                 ~0.14s

Strangely enough, the two fastest variants both got slower when switched from RangeInclusive to Range (or from <= to <, in C's case) with the end bound adjusted accordingly.