Unusual missed optimization for signed integers modulo `1<<k`?

Coder-256 · August 12, 2022, 8:07am

Consider the following code:

#[inline(never)]
fn black_box<T: std::fmt::Display>(x: T) { println!("{}", x); }

pub fn foo_i32(n: usize) {
    let mut x = 0i32;
    for _ in 0..n {
        black_box(x);
        x = (x+5) % 16;
    }
}

pub fn foo_u32(n: usize) {
    let mut x = 0u32;
    for _ in 0..n {
        black_box(x);
        x = (x+5) % 16;
    }
}

As shown by Godbolt (with -C opt-level=3), the assembly for modifying x in foo_u32 is:

add     ebp, 5  ; bp += 5
and     ebp, 15 ; bp &= 15

Which essentially translates to x = (x+5) & 15. Great, this is probably as optimal as it could possibly be!

However, the assembly for foo_i32 is:

lea     eax, [rbx + 5]  ; ax = bx+5;
lea     ecx, [rbx + 20] ; cx = bx+20;
test    eax, eax
cmovns  ecx, eax        ; if ax >= 0 { cx = ax; }
and     ecx, -16        ; cx &= -16;   (i.e. cx &= ~15)
neg     ecx             ; cx *= -1;
add     ebx, ecx        ; bx += cx;
add     ebx, 5          ; bx += 5;

Which essentially translates to:

x+5 - ((if x+5 >= 0 { x+5 } else { x+20 }) & -16)

However, in the case that x+5 >= 0, it's the same as x+5 - ((x+5) & -16), and x+5 - ((x+5) & ~15), and (x+5) & 15 which is exactly what the assembly for foo_u32 does above. In other words, the only difference is that the assembly for foo_i32 needlessly handles the unreachable case where x < 0, and the instruction count suffers as a result.

Shockingly, if you just give the compiler a nudge, it literally compiles to the exact same assembly as the unsigned version, completely eliding any panics!

pub fn foo_i32(n: usize) {
    let mut x = 0i32;
    for _ in 0..n {
        black_box(x);
        if (x+5 < 0) { unreachable!() } // !!!
        x = (x+5) % 16;
    }
}

Literally, you don't even need to do unreachable_unchecked(), just regular unreachable!() does the trick, which can only mean that the compiler is able to prove that x+5 >= 0... but only if prompted?!?! Perhaps this is due to some sort of search depth limit or time limit in the optimizer, in which case maybe that limit should be relaxed?

Anyway, the bottom line is that this seems like a missed optimization to me; even though the compiler could (and clearly knows how to!!) optimize the operation, it normally doesn't.

sahnehaeubchen · August 12, 2022, 8:44am

This seems to be a missed optimization on llvm's side. The corresponding C++ code is optimized by gcc but not by clang:

scottmcm · August 12, 2022, 5:15pm

Well you black-boxed it. So of course it doesn't know that x is positive -- losing context like that is the whole point of black_box.

And (-5) % 16 is -5, not 11, so of course it doesn't use & 0xF for signed numbers.

EDIT: Oh, you called a function black_box that's neither std::hint::black_box nor test::black_box. That's really confusing.

jrose · August 12, 2022, 5:17pm

That’s a black-box input, passed by Copy. It shouldn’t be able to affect the local variable x.

scottmcm · August 12, 2022, 6:18pm

Oh, I thought that when it said black_box it meant the real black_box function, not something that isn't a black box at all, and can in fact be inlined (no, inline(always) doesn't prevent inlining).

Real black box absolutely affects non-arguments:

But also, just don't use signed integers, especially if you're trying to do something modular. If you think you need signed numbers, you probably don't.

(And if you're really doing a println!, the cost of the cmov is completely trivial compared to the string formatting and actual output.)

Conveniently, though, LLVM is on github now for issues, so it's much easier to file bugs than it used to be: https://github.com/llvm/llvm-project/issues/new.

Coder-256 · August 12, 2022, 7:12pm

I think my black box function behaves very similarly to the real std::hint::black_box, I just didn't want to use nightly; either way in this case the real one has the same effect. Correct me if I'm wrong here but my understanding is: a "black box" is basically just an extern function, it has an opaque body, and therefore unknown side effects and an unknown return value. The point is that the compiler needs to ensure that all inputs are fully computed, and can't assume anything about the output value. It shouldn't affect any non-arguments, and in your example, the assembly still is hardcoded to return 1. IIRC the real black_box just does a volatile read and returns the result (which is like a fast, inline-able, generic, pure-Rust way to emulate the "unknown side effects" of an extern function, and still shouldn't affect the optimizations of other variables).

Agreed. However this is still a missed optimization; IMHO it should be fixed if doing so is not too difficult, but I'm no compiler developer!

Aloso · August 12, 2022, 10:44pm

I think the reason is this: Without the panicking branch, (x + 5) % 16 can only be proven to be non-negative if x is non-negative, and x can only be proven to be non-negative if (x + 5) % 16 is non-negative. It appears that LLVM fails due to this circular reasoning.

By adding the panicking branch, proving it becomes much easier: The panicking branch ensures that (x + 5) % 16 is non-negative, and once that has been proven, LLVM can prove that x is in the interval [0; 15], so the panicking branch is actually unreachable.

simonbuchan · August 30, 2022, 10:47am

Huh, I thought that sort of analysis was pretty common across basic blocks, perhaps llvm just didn't handle with the right ordering with respect to expanding that modulo?

afetisov · August 30, 2022, 1:05pm

Your function is unrelated to the real black_box in any regard. You pass it a local variable by copy, so whatever it does inside has no effect on outer optimizations. Also, #[inline(never)] is just a hint, and has no real effect on black-box behaviour. If you want black-box behaviour, then you should either pass a &mut T, or return T and reassign it to the variable. You also need something which is properly blackboxed from the compiler, like a real extern fn call, or read_volatile/write_volatile, or inline assembly, or compiler intrinsics. Finally, note that Rust, strictly speaking, has no concept of a non-optimizable non-inlinable call, and all the methods listed above are only hacks which may be broken in various more compelx cases. For example, if you are compiling to WASM, then the WASM runtime can inline and optimize your function calls regardless of whatever Rust code shannanigans that you may try.

tczajka · August 30, 2022, 2:27pm

That's not true. If you make the black_box body empty, whole functions foo_i32 and foo_u32 get optimized into nothingness.

He doesn't want that, he wants the value to be consumed, not regenerated (as far as the compiler knows). The optimization we are talking about would no longer be valid if the value was arbitrarily changed inside the loop.

Coder-256 · August 30, 2022, 3:12pm

@afetisov Yes you correctly point out that my “black box” function is not guaranteed to function the same as the real black box in many situations, I’m only saying that in this case (stable compiler with this exact code) it has the right properties to work in a pinch. Indeed the generated assembly is nearly identical when using the real black box on nightly! Even an inlined, macro-expanded println!() would do the job in this situation, I just wanted the assembly to be clean so I made a function with a hint for inline(never).

Passing the variable by copy is 100% intentional; I am not trying to get the compiler to assume the variable can take on any valid value, which would defeat the point by making the optimization impossible. Just trying to force the compiler not to optimize away the entire value calculation by using it in a side effect (printing to stdout).

It’s just to demonstrate the missed optimization I was talking about, which was the original reason I created this thread. I hope the debate over what constitutes a black box doesn’t derail this too much.

tczajka · August 31, 2022, 11:28am

I have a simpler example:

pub fn foo(x: i32) -> i32 {
    if x >= 0 {
        x % 16
    } else {
        42
    }
}

The x % 16 could just be x & 15, but is compiled into something like x - ((if x >= 0 { x } else { x + 15 }) & -16).

tczajka · August 31, 2022, 11:42am

I have filed a bug with LLVM:

fee1-dead · September 27, 2022, 2:59pm

Note that using println! as a black box is a real issue that I think should be fixed, see Compiler fails to recognize that `println!` does not mutate the value · Issue #86510 · rust-lang/rust · GitHub

system · December 26, 2022, 3:00pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Initializing large arrays with non-zero values fails to optimize compiler	13	1568	August 15, 2021
Optimizing branched assignings based on possible states compiler	5	690	April 26, 2021
Faster PartialEq for small arrays of non 2^n length compiler	25	744	September 14, 2024
Copy elision & RVO optimization compiler	12	2641	November 26, 2022
AddAssign<i32> for Wrapping<u32> etc language design	4	582	January 30, 2024

Unusual missed optimization for signed integers modulo `1<<k`?

Related topics