Slightly faster loops by skipping compare-to-0 instructions

Having previously programmed in assembly language, I think there might be an opportunity to squeeze a little more performance out of loops, specifically by removing the cmp instruction.

Example:

    load count, number-of-iterations
.L3:
    ... loop internals ...
    add count, -1             # dec count
    jump_not_zero .L3

This works because the jump_not_zero conditional branch doesn't have to follow a compare instruction: the add instruction already sets the flags, so the branch effectively compares its result against 0 for free.

Do all supported architectures have a branch instruction that will work without a compare instruction?

Not to my knowledge. Condition-code dependencies like the one you're assuming are very detrimental to multi-issue instruction pipelines (though they worked fine on the PDP-11 and other in-order legacy processors).

For this code:

pub fn main() {
    let mut x = 999;
    while x >= 0 {
        sideeffect();
        x -= 1;
    }
}

#[inline(never)]
fn sideeffect() {
    println!();
}

LLVM already generates code that avoids unnecessary cmp instructions on x86-64:

example::main:
        push    rbx
        mov     ebx, -1000
.LBB0_1:
        call    example::sideeffect
        inc     ebx
        jne     .LBB0_1
        pop     rbx
        ret

(in fact it turned the loop into an up-counting loop, interestingly)
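My reading of that assembly, rendered back as a Rust-level sketch (the `run` function and its `calls` counter are my additions to make the iteration count visible; they are not in the generated code):

```rust
// Sketch of the loop shape LLVM produced: count up from -1000 toward
// zero, so the increment's flags drive `jne` directly with no `cmp`.
fn run() -> u32 {
    let mut calls = 0;
    let mut counter: i32 = -1000; // mov ebx, -1000
    loop {
        calls += 1;   // call example::sideeffect
        counter += 1; // inc ebx -- sets the zero flag, no cmp needed
        if counter == 0 {
            break;    // jne .LBB0_1 falls through once counter hits 0
        }
    }
    calls // the original loop body executes 1000 times (x = 999..=0)
}
```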


As before: if you are suggesting improvements to rustc/LLVM optimization passes, it would be very useful if you could include an example of Rust code that is currently not optimized well.


All the code I had seen, like this example, includes a cmp instruction in the loop:

    pub fn main() {
        let mut v: Vec<usize> = vec![0; 1117];
        let mut total: usize = 0;

        for vi in v.iter() {
            total += *vi;
        }
        println!("{}", total);
    }

The generated code uses a test or cmp instruction in the loop:

    ...
    add     rdx, 40
    cmp     rdx, rax
    je      .LBB2_3
    ...

Here is an example where a small improvement could be made by arranging for the index to reach zero at the exit condition:

    load ptr, vec_data_address
    load index, 8928  # 1116 * 8
.L5:
    ... loop internals ...
    add index, -8
    jge .L5

Interesting bits:

1.) This version of the iterator runs backwards.

Frequently that is fine, but there are times when programmers need an iterator to run in order.

2.) The whole loop would need to be skipped when vector length = 0.

3.) If the jge (conditional branch if greater or equal) instruction isn't available, the code can be made to work with a jne, but the ptr would then need to be offset backward by one element.
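Putting those three points together, the count-down version might look like this in Rust (a sketch with a hypothetical `sum_backwards` helper; I haven't verified what assembly LLVM emits for it):

```rust
fn sum_backwards(v: &[usize]) -> usize {
    let mut total = 0;
    // Point 2: starting at len() means a zero-length vector skips
    // the loop body entirely.
    let mut i = v.len();
    while i > 0 {
        // Point 1: the traversal runs from the last element down to
        // index 0, i.e. backwards.
        i -= 1;
        total += v[i];
    }
    total
}
```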

Thanks

Are you sure that's faster for that size? Here's a test where I prevent the compiler from optimizing for a particular length slice:

pub fn test(v: &[usize]) -> usize {
    v.iter().sum()
}

See godbolt for the assembly: https://godbolt.org/z/6sn69q

You can test how the assembly changes with slice length by taking a sub-slice. E.g. amend the test function like so:

pub fn test(v: &[usize]) -> usize {
    let v = &v[..27];
    v.iter().sum()
}

This should unroll the loop. If you use usize::MAX instead of 27 you'll get different assembly. Try using a value of 1117.

I'd assume LLVM optimizes these cases differently because they really do have an effect on performance. However, I've not run benchmarks myself.

Thanks Chris.

This optimization doesn't help when Rust unrolls the loop...

It does help when you can eliminate the cmp instruction by rearranging the loop so that the loop variable is compared against 0.

The most common rework is to change the loop from counting up to counting down.

Many loops have code like this in the loop:

...
add rdx, 40
cmp rcx, rdx
jne .LBB0_2

When you rework the loop so the counter/index is compared against zero, it looks like this:

...
add rdx, -40
jge .LBB0_2

It's only a small improvement, and it's complicated by the loop often running backwards from what is expected, etc. But when you can optimize the cmp out of the loop, it's just a LITTLE faster.
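In source form, the rewrite from the count-up shape to the count-down shape looks roughly like this (hypothetical `count_up`/`count_down` helpers for illustration; both compute the same sum, and whether LLVM actually drops the cmp depends on the target and compiler version):

```rust
// Count-up loop: the exit test compares the index against `n`,
// which typically needs a cmp (or test) inside the loop.
fn count_up(n: usize) -> usize {
    let mut acc = 0;
    let mut i = 0;
    while i < n {
        acc += i;
        i += 1;
    }
    acc
}

// Count-down rewrite: the exit test is against zero, which the
// decrement itself can establish via the flags on many ISAs.
fn count_down(n: usize) -> usize {
    let mut acc = 0;
    let mut i = n;
    while i > 0 {
        i -= 1;
        acc += i;
    }
    acc
}
```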

Going backwards is a change in behaviour, so it can't be applied as an optimisation. Try .iter().rev(); if that matches the pattern, it seems likely LLVM can optimise it.
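For reference, the reversed-iterator form suggested above, as a hypothetical `sum_rev` helper (a sketch; I haven't checked what assembly it actually produces):

```rust
fn sum_rev(v: &[usize]) -> usize {
    // Reverse iteration makes the backwards traversal explicit in the
    // source, giving LLVM a chance to emit a count-down loop.
    v.iter().rev().sum()
}
```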