Slightly faster loops by skipping compare-to-0 instructions

Having previously programmed in assembly language, I think there might be an opportunity to squeeze a little more performance out of loops, specifically by removing the cmp instruction.

Example:

    load count, number-of-iterations
.L3:
    ... loop internals ...
    add count, -1             # dec count
    jump_not_zero .L3

This works because the jump_not_zero conditional branch doesn't have to follow a compare instruction: the add instruction already sets the flags, so the branch effectively compares its result against 0 for free.

Do all supported architectures have a branch instruction that will work without a compare instruction?

Not to my knowledge. Condition-code dependencies like the one you're assuming are very detrimental to multi-issue instruction pipelines (though they worked fine on the PDP-11 and other in-order legacy processors).

For this code:

pub fn main() {
    let mut x = 999;
    while x >= 0 {
        sideeffect();
        x -= 1;
    }
}

#[inline(never)]
fn sideeffect() {
    println!();
}

LLVM already generates code that avoids unnecessary cmp instructions on x86-64:

example::main:
        push    rbx
        mov     ebx, -1000
.LBB0_1:
        call    example::sideeffect
        inc     ebx
        jne     .LBB0_1
        pop     rbx
        ret

(in fact it turned the loop into an up-counting loop, interestingly)
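My reading of that assembly, rendered back as a Rust-level sketch (the `run` function and its `calls` counter are my additions to make the iteration count visible; they are not in the generated code):

```rust
// Sketch of the loop shape LLVM produced: count up from -1000 toward
// zero, so the increment's flags drive `jne` directly with no `cmp`.
fn run() -> u32 {
    let mut calls = 0;
    let mut counter: i32 = -1000; // mov ebx, -1000
    loop {
        calls += 1;   // call example::sideeffect
        counter += 1; // inc ebx -- sets the zero flag, no cmp needed
        if counter == 0 {
            break;    // jne .LBB0_1 falls through once counter hits 0
        }
    }
    calls // the original loop body executes 1000 times (x = 999..=0)
}
```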


As before: if you are suggesting improvements to rustc/LLVM optimization passes, it would be very useful if you could include an example of Rust code that is currently not optimized well.


All the code I had seen, like this example, includes a cmp instruction in the loop:

    pub fn main() {
        let mut v: Vec<usize> = vec![0; 1117];
        let mut total: usize = 0;

        for vi in v.iter() {
            total += *vi;
        }
        println!("{}", total);
    }

The generated code uses a test or cmp instruction in the loop:

    ...
    add     rdx, 40
    cmp     rdx, rax
    je      .LBB2_3
    ...

Here is an example where a small improvement could be made by arranging for the index to reach zero at the exit condition:

    load ptr, vec_data_address
    load index, 8928  # 1116 * 8
.L5:
    ... loop internals ...
    add index, -8
    jge .L5

Interesting bits:

1.) This version of the iterator runs backwards.

Frequently that is fine, but there are times when programmers need an iterator to run in order.

2.) The whole loop would need to be skipped when vector length = 0.

3.) If the jge (conditional branch if greater or equal) instruction isn't available, the code can be made to work with a jne, but the ptr would then need to be offset backward by one element.
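Putting those three points together, the count-down version might look like this in Rust (a sketch with a hypothetical `sum_backwards` helper; I haven't verified what assembly LLVM emits for it):

```rust
fn sum_backwards(v: &[usize]) -> usize {
    let mut total = 0;
    // Point 2: starting at len() means a zero-length vector skips
    // the loop body entirely.
    let mut i = v.len();
    while i > 0 {
        // Point 1: the traversal runs from the last element down to
        // index 0, i.e. backwards.
        i -= 1;
        total += v[i];
    }
    total
}
```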

Thanks

Are you sure that's faster for that size? Here's a test where I prevent the compiler from optimizing for a particular length slice:

pub fn test(v: &[usize]) -> usize {
    v.iter().sum()
}

See godbolt for the assembly: https://godbolt.org/z/6sn69q

You can test how the assembly changes with slice length by taking a sub-slice. E.g. amend the test function like so:

pub fn test(v: &[usize]) -> usize {
    let v = &v[..27];
    v.iter().sum()
}

This should unroll the loop. If you use usize::MAX instead of 27 you'll get different assembly. Try using a value of 1117.

I'd assume LLVM optimizes these cases differently because they really do have an effect on performance. However, I've not run benchmarks myself.

Thanks Chris.

This optimization doesn't help when Rust unrolls the loop...

It does help when you can eliminate the cmp instruction by rearranging the loop so that the loop variable is compared against 0.

The most common rework is to change the loop from counting up to counting down.

Many loops have code like this in the loop:

...
add rdx, 40
cmp rcx, rdx
jne .LBB0_2

When you rework the loop so the counter/index is compared against zero, it looks like this:

...
add rdx, -40
jge .LBB0_2

It's only a small improvement, and it's complicated by the loop often running backwards from what is expected, etc. But when you can optimize the cmp out of the loop, it's just a LITTLE faster.
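In source form, the rewrite from the count-up shape to the count-down shape looks roughly like this (hypothetical `count_up`/`count_down` helpers for illustration; both compute the same sum, and whether LLVM actually drops the cmp depends on the target and compiler version):

```rust
// Count-up loop: the exit test compares the index against `n`,
// which typically needs a cmp (or test) inside the loop.
fn count_up(n: usize) -> usize {
    let mut acc = 0;
    let mut i = 0;
    while i < n {
        acc += i;
        i += 1;
    }
    acc
}

// Count-down rewrite: the exit test is against zero, which the
// decrement itself can establish via the flags on many ISAs.
fn count_down(n: usize) -> usize {
    let mut acc = 0;
    let mut i = n;
    while i > 0 {
        i -= 1;
        acc += i;
    }
    acc
}
```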

Going backwards is a change in behaviour, so it can't be applied as an optimisation. Try .iter().rev(); if that matches the pattern, it seems likely LLVM can optimise it.
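For reference, the reversed-iterator form suggested above, as a hypothetical `sum_rev` helper (a sketch; I haven't checked what assembly it actually produces):

```rust
fn sum_rev(v: &[usize]) -> usize {
    // Reverse iteration makes the backwards traversal explicit in the
    // source, giving LLVM a chance to emit a count-down loop.
    v.iter().rev().sum()
}
```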