Iterator performance regression 1.81 -> 1.82

Here is the code:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    (0..=1000)
        .flat_map(|a| {
            (0..=1000).flat_map(move |b| {
                (0..=1000)
                    .filter(move |&c| a * a + b * b == c * c && a + b + c == 1000)
                    .map(move |c| (a, b, c))
            })
        })
        .for_each(|(a, b, c)| {
            println!("a: {} b: {} c: {}", a, b, c);
        });

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

After updating from Rust 1.81 to 1.85, the same code runs approximately 80% slower with:

cargo run --release

on my laptop (Windows 11 26100.2894 + Intel Core i7-10750H). So I bisected through versions 1.82 to 1.84 and found that Rust 1.82 already produces roughly the same (slow) output as Rust 1.85.

The differences between the ASM outputs from Rust 1.81 and 1.82 are:

1.81:

        cmp r14d, eax
        jne .LBB5_6
        cmp esi, r12d
        jne .LBB5_6

1.82:

        xor eax, r14d
        mov ecx, esi
        xor ecx, r12d
        or ecx, eax
        jne .LBB5_5

And from 1.81:

        cmp dword ptr [rsp + 116], 0
        jne .LBB5_10
        cmp r14d, 1000000
        jne .LBB5_10

1.82:

        xor r14d, 1000000
        or dword ptr [rsp + 116], r14d
        jne .LBB5_8

So basically, starting from Rust 1.82, the compiler sometimes generates bitwise operations (XOR/OR) on the compared values instead of separate compare-and-jump pairs, which means those comparisons are no longer short-circuited.
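To make the difference concrete, here is the same pair of strategies written out by hand in Rust (a hand-written illustration, not the compiler's actual output; the function names are mine):

```rust
// Short-circuit form: with `&&`, the second comparison only runs when
// the first one succeeds, matching 1.81's cmp/jne pairs.
fn short_circuit(a: u32, b: u32, x: u32, y: u32) -> bool {
    a == x && b == y
}

// Branchless form: XOR each pair and OR the results, as in the 1.82
// output; both comparisons are always evaluated.
fn branchless(a: u32, b: u32, x: u32, y: u32) -> bool {
    ((a ^ x) | (b ^ y)) == 0
}

fn main() {
    // The two forms always agree on the result; only the work done differs.
    for &(a, b, x, y) in &[(1, 2, 1, 2), (1, 2, 1, 3), (7, 0, 0, 0)] {
        assert_eq!(short_circuit(a, b, x, y), branchless(a, b, x, y));
    }
    println!("ok");
}
```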

I'm not familiar with how rustc/LLVM works, so I don't really know how we should handle this kind of regression.


By the way, the "good old" for-loop style:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    for a in 0..=1000 {
        for b in 0..=1000 {
            for c in 0..=1000 {
                if a * a + b * b == c * c && a + b + c == 1000 {
                    let (a, b, c) = (a, b, c);
                    println!("a: {} b: {} c: {}", a, b, c);
                }
            }
        }
    }

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

takes about the same time to run as the FP-style version compiled with Rust 1.82+ (although the for-loop version compiles to fewer asm instructions).

And if I comment out this line in the innermost loop:

let (a, b, c) = (a, b, c);

the time it takes to run doubles.
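A minimal sketch of what that rebinding does (the search bound is shrunk to 12 here just to keep the example fast, and the counting helper is mine):

```rust
// Count Pythagorean triples (a, b, c) with a + b + c == n, using the
// rebinding trick: `let (a, b, c) = (a, b, c);` copies the loop
// counters into fresh shadowing locals, so println! borrows the copies
// and the counters' own addresses never escape the loop body.
fn count_triples(n: i64) -> usize {
    let mut hits = 0;
    for a in 0..=n {
        for b in 0..=n {
            for c in 0..=n {
                if a * a + b * b == c * c && a + b + c == n {
                    let (a, b, c) = (a, b, c); // copies; shadows the counters
                    println!("a: {} b: {} c: {}", a, b, c);
                    hits += 1;
                }
            }
        }
    }
    hits
}

fn main() {
    // For n = 12 the loop finds (0, 6, 6), (6, 0, 6), (3, 4, 5) and (4, 3, 5).
    assert_eq!(count_triples(12), 4);
}
```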


Meanwhile, with Rust nightly 1.87:

#![feature(gen_blocks)]

use std::time::Instant;

fn main() {
    let start = Instant::now();

    let gen_results = gen {
        for a in 0..=1000 {
            for b in 0..=1000 {
                for c in 0..=1000 {
                    if a * a + b * b == c * c && a + b + c == 1000 {
                        yield (a, b, c);
                    }
                }
            }
        }
    };

    for (a, b, c) in gen_results {
        println!("a: {} b: {} c: {}", a, b, c);
    }

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

The above code takes the same amount of time to run as the naive for-loop with the println! embedded and without let (a, b, c) = (a, b, c); in other words, it matches the slowest variant. And I couldn't find a way to make it faster.


File an issue for the regression.


Done:


This issue is likely due to the loop variables' addresses escaping via the println!, which prevents some optimizations.


Yeah, I've come across that before, but it's still undesirable, since:

println!("a: {} b: {} c: {}", a, b, c);

already semantically copies or moves a, b and c, which means the compiler should be free to change their addresses; if I wanted the original addresses, I would pass in &a, &b and &c.

To me, an optimization like this looks like it could well be a win in the general case (e.g. when the first branch is unpredictable).

While it is of course a toy example, this specific benchmark is somewhat less convincing in that you can easily compute the same results a thousand times faster without the innermost loop:

for a in 0..=1000 {
    for b in 0..=1000 {
        // only value that satisfies `a + b + c == 1000`
        let c = 1000 - a - b;
        if c >= 0 {
            if a * a + b * b == c * c {
                println!("a: {} b: {} c: {}", a, b, c);
            }
        }
    }
}
0.002927962 seconds

In the original, both filtering conditions are true at most once during the innermost loop, so it makes sense that branching on the non-short-circuited cond1() & cond2() would be the slowest choice: the extra work almost never pays off.


println! is supposed to borrow its parameters implicitly: it's a compiler builtin, but it behaves as if you had passed &a, &b and &c. The docs don't mention this, or if they do, it's rather hard to find.

If you want a move, there is dbg!. The output is different though.

Maybe there should be another macro, println_move!, that moves rather than borrows, for the times it matters.


As others said, println! is a macro and internally borrows a, b and c. You can force a move/copy by using a block expression, e.g. println!("a: {}", { a }).
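A small sketch of the difference, using a non-Copy String so the move is observable (the helper name is mine):

```rust
// println! borrows its arguments, so a non-Copy value stays usable
// afterwards; wrapping the argument in a block moves it instead.
fn demo() -> usize {
    let s = String::from("hello");
    println!("{}", s); // borrows: `s` is still owned here
    let len = s.len();
    println!("{}", { s }); // the block expression moves `s` into the macro
    // `s` is gone now; using it here would fail to compile
    len
}

fn main() {
    assert_eq!(demo(), 5);
}
```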


Now that you bring it up, I've tried this:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    (0..=30000)
        .flat_map(|a| {
            (30000 - a..=(60000 - 2 * a).min(30000 - a / 2)).map(move |b| (a, b, 60000 - a - b))
        })
        .filter(|(a, b, c)| a * a + b * b == c * c)
        .for_each(|(a, b, c)| {
            println!("a: {} b: {} c: {}", a, b, c);
        });

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

Good news: the above code actually performs the same when compiled with Rust 1.82+ as with Rust 1.81.

Bad news: the for-loop version is still 100% slower:

use std::time::Instant;

fn main() {
    let start = Instant::now();

    for a in 0..=30000 {
        for b in (30000 - a)..=(60000 - 2 * a).min(30000 - a / 2) {
            let c = 60000 - a - b;
            if a * a + b * b == c * c {
                println!("a: {} b: {} c: {}", a, b, c);
            }
        }
    }

    let duration = start.elapsed();
    println!("{} seconds", duration.as_secs_f64());
}

And this time, neither adding

let (a, b, c) = (a, b, c);

nor replacing

println!("a: {} b: {} c: {}", a, b, c);

with

println!("a: {} b: {} c: {}", { a }, { b }, { c });

helps.


It's actually quite sad because the for-loop version in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// `min` is not in standard C; MSVC's <stdlib.h> provides it on Windows.
// Define it for other platforms:
#ifndef min
#define min(x, y) ((x) < (y) ? (x) : (y))
#endif

int main(void) {
    clock_t start, finish;
    start = clock();
    double duration;

    for (int a = 0; a <= 30000; a++) {
        for (int b = 30000 - a; b <= min(60000 - 2 * a, 30000 - a / 2); b++) {
            int c = 60000 - a - b;
            if (a * a + b * b == c * c) {
                printf("a: %d b: %d c: %d\n", a, b, c);
            }
        }
    }

    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%f seconds\n", duration);
    return 0;
}

compiled with:

clang -O3 benchmark.c -o benchmark.exe

runs as fast as the FP-style version in Rust.

I don't see why for-loops in Rust shouldn't be able to run just as fast.

This is probably tied to the fact that inclusive ranges are unfortunately very difficult for LLVM to optimize: they can't just use <= where half-open ranges use <, because they need to handle the case where the upper bound is the maximum value of the type.

“Internal” iteration with for_each doesn't have this issue, so a slight change to the for-loop version should recover the performance, e.g.

    (0..=30000).for_each(|a| {
        ((30000 - a)..=(60000 - 2 * a).min(30000 - a / 2)).for_each(|b| {
            let c = 60000 - a - b;
            if a * a + b * b == c * c {
                println!("a: {} b: {} c: {}", a, b, c);
            }
        })
    })

It's definitely far from ideal that this pitfall exists, but it's proven very difficult to address. It'll probably take a peephole optimization in LLVM to turn the last-iteration loop flag that ..= uses into a plain range check whenever the bound is within the type's domain.
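When the bound is known to be below the type's maximum, the inclusive range can also be rewritten as a half-open one by hand, sidestepping the extra exhaustion flag; a toy check that the two forms agree (whether this actually recovers the lost performance in a given case still needs measuring):

```rust
// RangeInclusive carries an extra "done" flag because its upper bound
// may be the type's maximum value; a half-open range needs no flag.
fn sum_inclusive(n: i64) -> i64 {
    (0..=n).sum()
}

fn sum_half_open(n: i64) -> i64 {
    // valid only because n < i64::MAX, so n + 1 cannot overflow
    (0..n + 1).sum()
}

fn main() {
    // Both forms visit exactly the values 0, 1, ..., n.
    assert_eq!(sum_inclusive(30000), sum_half_open(30000));
    println!("{}", sum_inclusive(30000));
}
```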


I think this reduced loop reproduces the same regression:

// NOTE: to simplify the assembly, will loop infinitely if there is no solution
pub fn looptest(
    d: u64,
    s: u64,
) -> u64 {
    let mut c = 0;
    loop {
        if s == c * c && d + c == 1000000 {
            return c;
        }
        c += 1;
    }
}

godbolt

full benchmark code:

// NOTE: to simplify the assembly, will loop infinitely if there is no solution
#[no_mangle]
#[inline(never)]
pub fn looptest(
    d: u64,
    s: u64,
) -> u64 {
    let mut c = 0;
    loop {
        if s == c * c && d + c == SUM {
            return c;
        }
        c += 1;
    }
}

const SUM: u64 = 1_000_000_000_000;

fn main() {
    let c = std::hint::black_box(2_000_000_000);
    {
        let f: fn(u64, u64) -> u64 = std::hint::black_box(looptest);
        let start = std::time::Instant::now();
        assert_eq!(c, f(SUM - c, c * c));
        println!("c = {c} took {} seconds (merged)", start.elapsed().as_secs_f64());
    }
}

Which ran in 0.69s with rustc 1.81, and 1.06s with rustc 1.83.


I tested the performance of these variants on my laptop:

Implementation                     RangeInclusive (<=)   Range (<)
FP-style with flat_map & filter    ~0.1s                 ~0.125s
Plain for-loop                     ~0.2s                 ~0.125s
for replaced with .for_each()      ~0.15s                ~0.125s
for-loop in C                      ~0.1s                 ~0.14s

Strangely enough, the two fastest variants both got slower when switched from RangeInclusive to Range (or from <= to <, in C's case) with the end bound adjusted accordingly.