Having previously programmed in assembly language, I think there might be an opportunity to squeeze a little more performance out of loops, specifically by removing the cmp instruction.
This would work because the jump_not_zero conditional branch instruction doesn't have to follow a compare instruction when it is testing the result of the add instruction against 0.
Do all supported architectures have a branch instruction that will work without a compare instruction?
Not to my knowledge. Condition-code dependencies such as you presume are very detrimental to multi-issue instruction pipelines (though they worked fine on the PDP-11 and other in-order legacy processors).
As before: If you are suggesting improvements in rustc/LLVM optimization passes, it would be very useful if you could include an example of Rust code that is currently not optimized correctly.
Frequently that is fine, but there are times when programmers need an iterator to run in order.
2.) The whole loop would need to be skipped when vector length = 0.
3.) If the jge (conditional branch if greater/equal) instruction isn't available, the code can be made to work with a jne, but it seems like it would need the ptr to be offset by one element backward.
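To make points 2 and 3 concrete, here is a minimal Rust sketch of a count-down loop. The function name `sum_count_down` and the summing body are made up for illustration. Whether rustc/LLVM actually drops the cmp for this shape depends on the target and optimization level, so it is worth checking the generated assembly rather than assuming it.

```rust
/// Hypothetical example: sums a slice while walking the index from
/// `data.len()` down to 0, so the loop test is a comparison against zero.
fn sum_count_down(data: &[u64]) -> u64 {
    let mut total = 0u64;
    let mut i = data.len();
    // Point 2: when the length is 0 the condition is false on entry,
    // so the whole loop is skipped.
    while i != 0 {
        // Point 3: decrement before indexing, so `i` is always one element
        // behind the count and stays a valid index.
        i -= 1;
        total = total.wrapping_add(data[i]);
    }
    total
}
```

Note that this visits the elements last to first, which only gives the same result because the addition here does not depend on order.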
This optimization doesn't help when Rust unrolls the loop...
It does help when you can eliminate the cmp instruction in the loop by manipulating the parameters in such a way that you would be comparing the loop variable with 0.
The most common modification is to change the loop from a count up to a count down loop.
Many loops end with code like this:
...
add rdx, 40
cmp rcx, rdx
jne .LBB0_2
When you rework the loop so that the loop counter/index is compared against zero, the end of the loop looks like this:
...
add rdx, -40
jge .LBB0_2
It's only a small improvement, and it's complicated by the loop often running backwards from what is expected. But when you can optimize the cmp out of the loop, it's just a LITTLE faster.
Going backwards is a change in behaviour, so it can't be an optimisation. Try .iter().rev(); it seems likely that if that matches the pattern, LLVM can optimise it.
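As a small sketch of why direction is observable, the two functions below (hypothetical names `visit_forward` and `visit_reversed`) record the order in which elements are visited. The compiler cannot swap one for the other on its own, but the programmer can opt in with `.rev()` when the order doesn't matter.

```rust
/// Hypothetical example: records elements in the order a forward loop sees them.
fn visit_forward(data: &[u32]) -> Vec<u32> {
    let mut seen = Vec::with_capacity(data.len());
    for &x in data.iter() {
        seen.push(x);
    }
    seen
}

/// Same loop body, but the programmer explicitly asks for reverse order.
fn visit_reversed(data: &[u32]) -> Vec<u32> {
    let mut seen = Vec::with_capacity(data.len());
    for &x in data.iter().rev() {
        seen.push(x);
    }
    seen
}
```

For `[1, 2, 3]` the first returns the elements in their original order and the second returns them reversed, so the two loops are not interchangeable unless the body is order-independent.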