In theory a modern compiler should not need annotations like:
#[inline]
#[inline(always)]
Because in theory a smart modern compiler knows better than the programmer where to inline (especially with PGO: https://en.wikipedia.org/wiki/Profile-guided_optimization ). In practice I have seen that these annotations are sometimes useful in Rust code. And sometimes I've used the opposite annotation:
#[inline(never)]
To get a clean assembly listing of a function (reading the asm of an inlined function is harder).
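For example, keeping a small function out of its callers makes its assembly easy to find in the `--emit asm` output (a minimal sketch; square() is just a stand-in function):

```rust
// `#[inline(never)]` keeps this function as a separate symbol, so its
// assembly stays a self-contained block in the `--emit asm` listing
// even at -C opt-level=3.
#[inline(never)]
fn square(x: u64) -> u64 {
    x * x
}

fn main() {
    // The call survives as a real `call` instruction instead of being
    // folded into main().
    println!("{}", square(7));
}
```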
From my experience with the D language, and from my smaller experience with Rust, I've seen a similar situation with loop unrolling: in many cases the compiler has not unrolled a loop, or has not unrolled it enough. With some tricks to force more unrolling, I have several times made the computational kernels of my code faster on my PC, on both older and newer CPUs.
So I suggest adding a way to ask the Rust compiler to unroll a specific loop (this is an enhancement request).
This post is quite long because I try to give some evidence for my claims and for this request.
Currently there is a way to ask LLVM to unroll loops more, like this:
rustc -C llvm-args="-unroll-threshold=1000" my_code.rs
But this is a blunt tool: it unrolls loops in the whole module (or the whole crate), and unrolling loops indiscriminately often produces a slowdown. So you want a way to unroll only specific loops.
A possible syntax (using the #![feature(stmt_expr_attributes)] feature gate):
This is just a generic hint to unroll a loop more:
#[unroll]
Disables unrolling for a loop:
#[unroll(never)]
Asks the compiler to try to fully unroll a specific loop:
#[unroll(try_full)]
(A #[unroll(full)] annotation is also possible, but it sounds dangerous, and it can't be used with dynamic loops, where the number of iterations is not statically known.)
Asks the compiler to unroll the loop N times:
#[unroll(N)]
Example usages:
#[unroll(8)] for x in ....
#[unroll(try_full)] while pred { ....
And now two examples to show my point.
The first program computes the integer square root of u64 numbers. The basic function is isqrt1(), while in isqrt2() I have manually unrolled the loop:
// isqrt from:
// http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
use std::u32;
use std::mem::size_of;
type Tin = u64;
type Tout = u32;
fn isqrt1(mut a: u64) -> Tout {
    let mut rem: Tin = 0;
    let mut root: Tin = 0;
    for _ in 0 .. size_of::<Tin>() * 4 {
        root <<= 1;
        rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
        a <<= 2;
        root += 1;
        if root <= rem {
            rem -= root;
            root += 1;
        } else {
            root -= 1;
        }
    }
    return (root >> 1) as Tout;
}
fn isqrt2(mut a: u64) -> Tout {
    let mut rem: Tin = 0;
    let mut root: Tin = 0;
    // Iteration 1:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 2:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 3:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 4:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 5:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 6:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 7:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 8:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 9:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 10:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 11:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 12:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 13:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 14:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 15:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 16:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 17:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 18:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 19:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 20:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 21:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 22:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 23:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 24:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 25:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 26:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 27:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 28:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 29:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 30:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 31:
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    // Iteration 32 (last; the commented-out lines have no effect on the result):
    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    //a <<= 2;
    root += 1;
    if root <= rem {
        //rem -= root;
        root += 1;
    } else {
        root -= 1;
    }
    return (root >> 1) as Tout;
}
fn main() {
    let n_iter = 10_000_000;
    let mut n: u64 = u32::MAX as u64;
    let mut result: u64 = 0;
    for _ in 0 .. n_iter {
        result += isqrt1(n) as u64;
        n += u32::MAX as u64;
    }
    println!("{}", result);
}
Manually unrolling a loop is ugly and bug-prone, and it ties isqrt2 to a specific data type (if the inputs are u32, the loop has to run only 16 times).
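One way to avoid the hand-copying today is a small declarative macro that repeats an expression a fixed number of times, so each iteration is written once (a sketch; unroll4! and step are my own names, not part of this proposal, and the real unrolling is still up to the optimizer):

```rust
// Repeats an expression four times at expansion time.
macro_rules! unroll4 {
    ($body:expr) => {{ $body; $body; $body; $body; }};
}

// Same bit-by-bit isqrt as isqrt1/isqrt2 above, for u64 inputs:
// 32 iterations total, written as 8 expansions of 4 steps.
fn isqrt_unrolled(mut a: u64) -> u32 {
    let mut rem: u64 = 0;
    let mut root: u64 = 0;
    let mut step = || {
        root <<= 1;
        rem = (rem << 2) + (a >> 62); // 62 = 8 * 8 - 2 for u64.
        a <<= 2;
        root += 1;
        if root <= rem {
            rem -= root;
            root += 1;
        } else {
            root -= 1;
        }
    };
    for _ in 0 .. 8 {
        unroll4!(step());
    }
    (root >> 1) as u32
}

fn main() {
    for &x in &[0u64, 1, 4, 16, 4_294_967_295] {
        println!("isqrt({}) = {}", x, isqrt_unrolled(x));
    }
}
```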
The generated asm, using this on a modern 64 bit i7 CPU:
rustc -C opt-level=3 -C target-cpu=native --emit asm test.rs
For isqrt1:
_ZN5sqrt120hb91a06ee8b9d7345GbaE:
pushq %r14
pushq %rsi
pushq %rdi
pushq %rbx
xorl %r8d, %r8d
movl $32, %edx
movl $572, %r9d
movl $570, %r10d
movl $568, %r11d
xorl %edi, %edi
xorl %eax, %eax
.align 16, 0x90
.LBB0_1:
shldq $2, %rcx, %rdi
movq %rcx, %r14
shlq $8, %r14
leaq 1(%rax,%rax), %rbx
cmpq %rbx, %rdi
leaq 2(%rax,%rax), %rsi
leaq (%rax,%rax), %rax
cmovaeq %rsi, %rax
cmovbq %r8, %rbx
subq %rbx, %rdi
shlq $2, %rdi
bextrq %r9, %rcx, %rsi
orq %rdi, %rsi
leaq 1(%rax,%rax), %rdi
cmpq %rdi, %rsi
leaq 2(%rax,%rax), %rbx
leaq (%rax,%rax), %rax
cmovaeq %rbx, %rax
cmovbq %r8, %rdi
subq %rdi, %rsi
shlq $2, %rsi
bextrq %r10, %rcx, %rbx
orq %rsi, %rbx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rbx
leaq 2(%rax,%rax), %rdi
leaq (%rax,%rax), %rax
cmovaeq %rdi, %rax
cmovbq %r8, %rsi
subq %rsi, %rbx
shlq $2, %rbx
bextrq %r11, %rcx, %rdi
orq %rbx, %rdi
leaq 1(%rax,%rax), %rcx
cmpq %rcx, %rdi
leaq 2(%rax,%rax), %rsi
leaq (%rax,%rax), %rax
cmovaeq %rsi, %rax
cmovbq %r8, %rcx
subq %rcx, %rdi
movq %r14, %rcx
addq $-4, %rdx
jne .LBB0_1
shrq %rax
popq %rbx
popq %rdi
popq %rsi
popq %r14
retq
For isqrt2:
_ZN5sqrt220hb052487376d2024e4caE:
pushq %rsi
movq %rcx, %rdx
shrq $62, %rdx
movl %ecx, %r8d
andl $3, %r8d
testq %rdx, %rdx
setne %al
movzbl %al, %r9d
leaq -4(,%rdx,4), %rax
cmoveq %rdx, %rax
movl $572, %edx
bextrq %rdx, %rcx, %rdx
orq %rax, %rdx
leaq 1(,%r9,4), %rsi
xorl %r10d, %r10d
cmpq %rsi, %rdx
leaq 2(,%r9,4), %r11
leaq (,%r9,4), %rax
cmovaeq %r11, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $570, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $568, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $566, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $564, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $562, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $560, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $558, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $556, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $554, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $552, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $550, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $548, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $546, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $544, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $542, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $540, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $538, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $536, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $534, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $532, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $530, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $528, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $526, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $524, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $522, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $520, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $518, %esi
bextrq %rsi, %rcx, %rsi
orq %rdx, %rsi
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rsi
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rdx
subq %rdx, %rsi
shlq $2, %rsi
movl $516, %edx
bextrq %rdx, %rcx, %rdx
orq %rsi, %rdx
leaq 1(%rax,%rax), %rsi
cmpq %rsi, %rdx
leaq 2(%rax,%rax), %r9
leaq (%rax,%rax), %rax
cmovaeq %r9, %rax
cmovbq %r10, %rsi
subq %rsi, %rdx
shlq $2, %rdx
movl $514, %esi
bextrq %rsi, %rcx, %rcx
orq %rdx, %rcx
leaq 1(%rax,%rax), %rdx
cmpq %rdx, %rcx
leaq 2(%rax,%rax), %rsi
leaq (%rax,%rax), %rax
cmovaeq %rsi, %rax
cmovbq %r10, %rdx
subq %rdx, %rcx
shlq $2, %rcx
orq %r8, %rcx
leaq 1(%rax,%rax), %rdx
cmpq %rcx, %rdx
leaq 2(%rax,%rax), %rcx
leaq (%rax,%rax), %rax
cmovbeq %rcx, %rax
shrq %rax
popq %rsi
retq
As you can see, the loop in isqrt1 was unrolled 4 times, but the run-time shows that's not enough.
On my system, with n_iter = 10_000_000, isqrt1 takes 0.74 seconds and isqrt2 takes 0.26 seconds.
Both return the same result: 1381620290024350
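For reference, the timings above come from a plain wall-clock harness along these lines (a sketch using std::time::Instant; n_iter is reduced here from 10_000_000 so it also runs quickly unoptimized, so the sum and the timing differ from the numbers above):

```rust
use std::time::Instant;

// The basic loop version from above, monomorphic: u64 in, u32 out.
fn isqrt1(mut a: u64) -> u32 {
    let mut rem: u64 = 0;
    let mut root: u64 = 0;
    for _ in 0 .. 32 {
        root <<= 1;
        rem = (rem << 2) + (a >> 62);
        a <<= 2;
        root += 1;
        if root <= rem {
            rem -= root;
            root += 1;
        } else {
            root -= 1;
        }
    }
    (root >> 1) as u32
}

fn main() {
    let n_iter = 1_000_000u64; // reduced from 10_000_000 for a quick run
    let mut n: u64 = u32::MAX as u64;
    let mut result: u64 = 0;
    let t0 = Instant::now();
    for _ in 0 .. n_iter {
        result += isqrt1(n) as u64;
        n += u32::MAX as u64;
    }
    // Printing the accumulated sum keeps the calls from being
    // dead-code eliminated by the optimizer.
    println!("sum = {} in {:?}", result, t0.elapsed());
}
```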
An annotation like the one proposed above should suffice to fully unroll the loop, while keeping the code more generic:
fn isqrt1b(mut a: u64) -> Tout {
    let mut rem: Tin = 0;
    let mut root: Tin = 0;
    #[unroll(try_full)] for _ in 0 .. size_of::<Tin>() * 4 {
        root <<= 1;
        rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
        a <<= 2;
        root += 1;
        if root <= rem {
            rem -= root;
            root += 1;
        } else {
            root -= 1;
        }
    }
    return (root >> 1) as Tout;
}
Extra note: this is also a situation where I'd really like a "typeof" in Rust, so I could write that code like this D version:
import std.typetuple: TypeTuple;

template Iota(uint stop) {
    static if (stop <= 0)
        alias Iota = TypeTuple!();
    else
        alias Iota = TypeTuple!(Iota!(stop - 1), stop - 1);
}

uint isqrt1c(ulong a) pure nothrow @safe @nogc {
    alias T = typeof(a);
    T rem = 0;
    T root = 0;
    foreach (immutable _; Iota!(T.sizeof * 4)) {
        root <<= 1;
        rem = (rem << 2) + (a >> (T.sizeof * 8 - 2));
        a <<= 2;
        root++;
        if (root <= rem) {
            rem -= root;
            root++;
        } else {
            root--;
        }
    }
    return cast(typeof(return))(root >> 1);
}

void main() {
    import std.stdio;
    enum size_t nIter = 10_000_000;
    ulong n = uint.max;
    ulong result = 0;
    foreach (immutable _; 0 .. nIter) {
        result += n.isqrt1c;
        n += uint.max;
    }
    result.writeln;
}
That Iota() is a simple way to force full loop unrolling in the D language (that foreach is a static, compile-time loop).
The D code also uses cast(typeof(return)), which makes the code more DRY. typeof(return) is the return type of the enclosing function, in this case a uint (u32 in Rust).
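Lacking a typeof, the Rust version can still stay DRY across input widths with a macro that stamps out the function per type pair, deriving the iteration count and shift from the input width much as the D code derives them from T.sizeof (a sketch; def_isqrt! and the generated function names are my own):

```rust
use std::mem::size_of;

// Generates an integer square root function for a given input/output
// type pair; bit width, iteration count, and shift all follow from the
// input type, so the body is written only once.
macro_rules! def_isqrt {
    ($name:ident, $tin:ty, $tout:ty) => {
        fn $name(mut a: $tin) -> $tout {
            let bits = size_of::<$tin>() * 8;
            let mut rem: $tin = 0;
            let mut root: $tin = 0;
            for _ in 0 .. bits / 2 {
                root <<= 1;
                rem = (rem << 2) + (a >> (bits - 2));
                a <<= 2;
                root += 1;
                if root <= rem {
                    rem -= root;
                    root += 1;
                } else {
                    root -= 1;
                }
            }
            (root >> 1) as $tout
        }
    };
}

def_isqrt!(isqrt_u64, u64, u32);
def_isqrt!(isqrt_u32, u32, u16);

fn main() {
    println!("{}", isqrt_u64(4_294_967_295)); // 65535
    println!("{}", isqrt_u32(100)); // 10
}
```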
As a second example I use this N-body benchmark: http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=nbody
That page contains two Rust versions with different run-times and efficiency.
The slower one is Rust #1: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=nbody&lang=rust&id=1
The faster of the two is Rust #2: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=nbody&lang=rust&id=2
In my benchmarks I give an input: 20000000
And the output is:
-0.169075164
-0.169031665
I compile the two programs with (the second program gets a bit slower with LTO, I don’t know why):
rustc -C opt-level=3 -C lto nbody1.rs
rustc -C opt-level=3 nbody2.rs
The run-time for the two programs on my PC is, in seconds:
Rust #1: 2.11 s
Rust #2: 1.62 s
This is my version of the (slower) Rust #1 program; I've simplified it, making the code more straightforward, similar to a basic C version:
// Rust #3
// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/
use std::f64::consts::PI;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const N_BODIES: usize = 5;
struct Planet {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64
}

type Planets = [Planet; N_BODIES];

fn advance(bodies: &mut Planets, dt: f64, n: usize) {
    for _ in 0 .. n {
        for i in 0 .. bodies.len() {
            for j in i + 1 .. bodies.len() {
                let dx = bodies[i].x - bodies[j].x;
                let dy = bodies[i].y - bodies[j].y;
                let dz = bodies[i].z - bodies[j].z;
                let distance = (dx * dx + dy * dy + dz * dz).sqrt();
                let mag = dt / (distance * distance * distance);
                bodies[i].vx -= dx * bodies[j].mass * mag;
                bodies[i].vy -= dy * bodies[j].mass * mag;
                bodies[i].vz -= dz * bodies[j].mass * mag;
                bodies[j].vx += dx * bodies[i].mass * mag;
                bodies[j].vy += dy * bodies[i].mass * mag;
                bodies[j].vz += dz * bodies[i].mass * mag;
            }
        }
        for b in bodies.iter_mut() {
            b.x += dt * b.vx;
            b.y += dt * b.vy;
            b.z += dt * b.vz;
        }
    }
}

fn energy(bodies: &Planets) -> f64 {
    let mut result = 0.0;
    for (i, bi) in bodies.iter().enumerate() {
        result += (bi.vx * bi.vx + bi.vy * bi.vy + bi.vz * bi.vz) * bi.mass / 2.0;
        for bj in bodies[i + 1 ..].iter() {
            let dx = bi.x - bj.x;
            let dy = bi.y - bj.y;
            let dz = bi.z - bj.z;
            let distance = (dx * dx + dy * dy + dz * dz).sqrt();
            result -= bi.mass * bj.mass / distance;
        }
    }
    result
}

fn offset_momentum(bodies: &mut Planets) {
    let (mut px, mut py, mut pz) = (0.0, 0.0, 0.0);
    for b in bodies.iter() {
        px += b.vx * b.mass;
        py += b.vy * b.mass;
        pz += b.vz * b.mass;
    }
    bodies[0].vx = - px / SOLAR_MASS;
    bodies[0].vy = - py / SOLAR_MASS;
    bodies[0].vz = - pz / SOLAR_MASS;
}

fn main() {
    const YEAR: f64 = 365.24;
    let n = std::env::args().nth(1).unwrap().parse().unwrap_or(1000);
    let mut bodies: Planets = [
        Planet { // Sun
            x: 0.0, y: 0.0, z: 0.0,
            vx: 0.0, vy: 0.0, vz: 0.0,
            mass: SOLAR_MASS
        },
        Planet { // Jupiter
            x: 4.84143144246472090e+00,
            y: -1.16032004402742839e+00,
            z: -1.03622044471123109e-01,
            vx: 1.66007664274403694e-03 * YEAR,
            vy: 7.69901118419740425e-03 * YEAR,
            vz: -6.90460016972063023e-05 * YEAR,
            mass: 9.54791938424326609e-04 * SOLAR_MASS
        },
        Planet { // Saturn
            x: 8.34336671824457987e+00,
            y: 4.12479856412430479e+00,
            z: -4.03523417114321381e-01,
            vx: -2.76742510726862411e-03 * YEAR,
            vy: 4.99852801234917238e-03 * YEAR,
            vz: 2.30417297573763929e-05 * YEAR,
            mass: 2.85885980666130812e-04 * SOLAR_MASS
        },
        Planet { // Uranus
            x: 1.28943695621391310e+01,
            y: -1.51111514016986312e+01,
            z: -2.23307578892655734e-01,
            vx: 2.96460137564761618e-03 * YEAR,
            vy: 2.37847173959480950e-03 * YEAR,
            vz: -2.96589568540237556e-05 * YEAR,
            mass: 4.36624404335156298e-05 * SOLAR_MASS
        },
        Planet { // Neptune
            x: 1.53796971148509165e+01,
            y: -2.59193146099879641e+01,
            z: 1.79258772950371181e-01,
            vx: 2.68067772490389322e-03 * YEAR,
            vy: 1.62824170038242295e-03 * YEAR,
            vz: -9.51592254519715870e-05 * YEAR,
            mass: 5.15138902046611451e-05 * SOLAR_MASS
        }
    ];
    offset_momentum(&mut bodies);
    println!("{:.9}", energy(&bodies));
    advance(&mut bodies, 0.01, n);
    println!("{:.9}", energy(&bodies));
}
And this is the same code, where I have manually and fully unrolled the pair loop in the advance() function:
// Rust #4
// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/
use std::f64::consts::PI;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const N_BODIES: usize = 5;
struct Planet {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64
}

type Planets = [Planet; N_BODIES];

fn advance(mut bodies: &mut Planets, dt: f64) {
    {
        let mut f = |i: usize, j: usize| {
            let dx = bodies[i].x - bodies[j].x;
            let dy = bodies[i].y - bodies[j].y;
            let dz = bodies[i].z - bodies[j].z;
            let distance = (dx * dx + dy * dy + dz * dz).sqrt();
            let mag = dt / (distance * distance * distance);
            bodies[i].vx -= dx * bodies[j].mass * mag;
            bodies[i].vy -= dy * bodies[j].mass * mag;
            bodies[i].vz -= dz * bodies[j].mass * mag;
            bodies[j].vx += dx * bodies[i].mass * mag;
            bodies[j].vy += dy * bodies[i].mass * mag;
            bodies[j].vz += dz * bodies[i].mass * mag;
        };
        // Manual loop unroll.
        f(0, 1); f(0, 2); f(0, 3); f(0, 4); f(1, 2);
        f(1, 3); f(1, 4); f(2, 3); f(2, 4); f(3, 4);
    }
    for b in bodies.iter_mut() {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}

fn energy(bodies: &Planets) -> f64 {
    let mut result = 0.0;
    for (i, bi) in bodies.iter().enumerate() {
        result += (bi.vx * bi.vx + bi.vy * bi.vy + bi.vz * bi.vz) * bi.mass / 2.0;
        for bj in bodies[i + 1 ..].iter() {
            let dx = bi.x - bj.x;
            let dy = bi.y - bj.y;
            let dz = bi.z - bj.z;
            let distance = (dx * dx + dy * dy + dz * dz).sqrt();
            result -= bi.mass * bj.mass / distance;
        }
    }
    result
}

fn offset_momentum(bodies: &mut Planets) {
    let (mut px, mut py, mut pz) = (0.0, 0.0, 0.0);
    for b in bodies.iter() {
        px += b.vx * b.mass;
        py += b.vy * b.mass;
        pz += b.vz * b.mass;
    }
    bodies[0].vx = - px / SOLAR_MASS;
    bodies[0].vy = - py / SOLAR_MASS;
    bodies[0].vz = - pz / SOLAR_MASS;
}

fn main() {
    const YEAR: f64 = 365.24;
    let n = std::env::args().nth(1).unwrap().parse().unwrap_or(1000);
    let mut bodies: Planets = [
        Planet { // Sun
            x: 0.0, y: 0.0, z: 0.0,
            vx: 0.0, vy: 0.0, vz: 0.0,
            mass: SOLAR_MASS
        },
        Planet { // Jupiter
            x: 4.84143144246472090e+00,
            y: -1.16032004402742839e+00,
            z: -1.03622044471123109e-01,
            vx: 1.66007664274403694e-03 * YEAR,
            vy: 7.69901118419740425e-03 * YEAR,
            vz: -6.90460016972063023e-05 * YEAR,
            mass: 9.54791938424326609e-04 * SOLAR_MASS
        },
        Planet { // Saturn
            x: 8.34336671824457987e+00,
            y: 4.12479856412430479e+00,
            z: -4.03523417114321381e-01,
            vx: -2.76742510726862411e-03 * YEAR,
            vy: 4.99852801234917238e-03 * YEAR,
            vz: 2.30417297573763929e-05 * YEAR,
            mass: 2.85885980666130812e-04 * SOLAR_MASS
        },
        Planet { // Uranus
            x: 1.28943695621391310e+01,
            y: -1.51111514016986312e+01,
            z: -2.23307578892655734e-01,
            vx: 2.96460137564761618e-03 * YEAR,
            vy: 2.37847173959480950e-03 * YEAR,
            vz: -2.96589568540237556e-05 * YEAR,
            mass: 4.36624404335156298e-05 * SOLAR_MASS
        },
        Planet { // Neptune
            x: 1.53796971148509165e+01,
            y: -2.59193146099879641e+01,
            z: 1.79258772950371181e-01,
            vx: 2.68067772490389322e-03 * YEAR,
            vy: 1.62824170038242295e-03 * YEAR,
            vz: -9.51592254519715870e-05 * YEAR,
            mass: 5.15138902046611451e-05 * SOLAR_MASS
        }
    ];
    offset_momentum(&mut bodies);
    println!("{:.9}", energy(&bodies));
    for _ in 0 .. n {
        advance(&mut bodies, 0.01);
    }
    println!("{:.9}", energy(&bodies));
}
I compile the two programs with (both with LTO):
rustc -C opt-level=3 -C lto nbody_simple.rs
rustc -C opt-level=3 -C lto nbody_simple_unrolled.rs
The output is unchanged. The run-time is, in seconds:
Rust #3: 1.96 s (basic)
Rust #4: 1.27 s (basic manually unrolled)
On my system this unrolled Rust #4 version is faster than the fastest C++ entry on the benchmark page (but on your system the results could be different).
(Also note that the "basic" Rust #3 version is faster than the more complex Rust #1 from the Benchmarks Game site. Straightforward code is sometimes better.)
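The ten hand-written pair calls in advance() could themselves be generated, which keeps the unrolling but removes the error-prone enumeration (a sketch; for_pairs! is my own name, not an existing macro):

```rust
// Expands `f(i, j)` for every ordered pair i < j taken from the given
// list, recursing on the tail; for 0..5 this produces the same ten
// calls as the manual unroll in advance().
macro_rules! for_pairs {
    ($f:ident; $head:literal, $($rest:literal),+) => {
        $( $f($head, $rest); )+
        for_pairs!($f; $($rest),+);
    };
    ($f:ident; $last:literal) => {};
}

// Collects the generated (i, j) pairs so the expansion can be checked.
fn pairs_0_to_4() -> Vec<(usize, usize)> {
    let mut seen = Vec::new();
    let mut f = |i: usize, j: usize| seen.push((i, j));
    for_pairs!(f; 0, 1, 2, 3, 4);
    seen
}

fn main() {
    println!("{:?}", pairs_0_to_4());
}
```

In the real advance() the closure f would mutate the bodies array exactly as in Rust #4; only the call sites are macro-generated.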
If I ask for a more aggressive unrolling of the basic Rust #3 version with this:
rustc -C opt-level=3 -C llvm-args="-unroll-threshold=1000" nbody_simple.rs
I get an improvement:
Rust #3: 1.54 s (basic with auto-unrolling)
As you can see, the speed improvement from "-unroll-threshold" is limited, and LLVM also unrolls the loops in the energy() and offset_momentum() functions, bloating the binary pointlessly (those two functions have a negligible run-time in this program).
If I add an #[inline(never)] annotation to the advance() function of the Rust #3 and Rust #4 programs, I can get their asm with:
rustc -C opt-level=3 -C lto --emit asm nbody_simple.rs
rustc -C opt-level=3 -C lto --emit asm nbody_simple_unrolled.rs
The non-inlined advance() function of the basic Rust #3:
_ZN7advance20hc525da322d48cc5eLaaE:
pushq %r14
pushq %rsi
pushq %rdi
pushq %rbx
subq $104, %rsp
movapd %xmm11, 80(%rsp)
movapd %xmm10, 64(%rsp)
movapd %xmm9, 48(%rsp)
movapd %xmm8, 32(%rsp)
movapd %xmm7, 16(%rsp)
movapd %xmm6, (%rsp)
testq %rdx, %rdx
je .LBB0_8
leaq 104(%rcx), %r8
xorl %r9d, %r9d
movsd LCPI0_0(%rip), %xmm9
movapd LCPI0_1(%rip), %xmm8
.align 16, 0x90
.LBB0_2:
movq %r8, %r11
xorl %r10d, %r10d
.align 16, 0x90
.LBB0_4:
movq %r10, %r14
leaq 1(%r14), %r10
cmpq $4, %r10
ja .LBB0_3
imulq $56, %r14, %rax
movupd (%rcx,%rax), %xmm10
movsd 16(%rcx,%rax), %xmm11
movsd 48(%rcx,%rax), %xmm4
leaq 24(%rcx,%rax), %rsi
leaq 40(%rcx,%rax), %rdi
movapd %xmm4, %xmm5
movlhps %xmm5, %xmm5
movq %r11, %rbx
movl $4, %eax
.align 16, 0x90
.LBB0_6:
movupd -48(%rbx), %xmm0
movapd %xmm10, %xmm6
subpd %xmm0, %xmm6
movapd %xmm11, %xmm7
subsd -32(%rbx), %xmm7
movapd %xmm6, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm6, %xmm1
shufpd $1, %xmm1, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm0, %xmm1
movapd %xmm7, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm1, %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm1
mulsd %xmm1, %xmm1
mulsd %xmm0, %xmm1
movapd %xmm9, %xmm0
divsd %xmm1, %xmm0
movsd (%rbx), %xmm1
movupd (%rsi), %xmm2
movapd %xmm7, %xmm3
mulsd %xmm1, %xmm3
movlhps %xmm1, %xmm1
mulpd %xmm6, %xmm1
mulsd %xmm0, %xmm3
mulsd %xmm4, %xmm7
mulsd %xmm0, %xmm7
movlhps %xmm0, %xmm0
mulpd %xmm0, %xmm1
subpd %xmm1, %xmm2
movupd %xmm2, (%rsi)
movsd (%rdi), %xmm1
subsd %xmm3, %xmm1
movsd %xmm1, (%rdi)
movupd -24(%rbx), %xmm1
mulpd %xmm5, %xmm6
mulpd %xmm0, %xmm6
addpd %xmm1, %xmm6
movupd %xmm6, -24(%rbx)
addsd -8(%rbx), %xmm7
movsd %xmm7, -8(%rbx)
decq %rax
addq $56, %rbx
cmpq %r14, %rax
jne .LBB0_6
.LBB0_3:
addq $56, %r11
cmpq $5, %r10
jne .LBB0_4
movupd (%rcx), %xmm0
movupd 24(%rcx), %xmm1
mulpd %xmm8, %xmm1
addpd %xmm0, %xmm1
movupd %xmm1, (%rcx)
movsd 40(%rcx), %xmm0
mulsd %xmm9, %xmm0
addsd 16(%rcx), %xmm0
movsd %xmm0, 16(%rcx)
movupd 56(%rcx), %xmm0
movupd 80(%rcx), %xmm1
mulpd %xmm8, %xmm1
addpd %xmm0, %xmm1
movupd %xmm1, 56(%rcx)
movsd 96(%rcx), %xmm0
mulsd %xmm9, %xmm0
addsd 72(%rcx), %xmm0
movsd %xmm0, 72(%rcx)
movupd 112(%rcx), %xmm0
movupd 136(%rcx), %xmm1
mulpd %xmm8, %xmm1
addpd %xmm0, %xmm1
movupd %xmm1, 112(%rcx)
movsd 152(%rcx), %xmm0
mulsd %xmm9, %xmm0
addsd 128(%rcx), %xmm0
movsd %xmm0, 128(%rcx)
movupd 168(%rcx), %xmm0
movupd 192(%rcx), %xmm1
mulpd %xmm8, %xmm1
addpd %xmm0, %xmm1
movupd %xmm1, 168(%rcx)
movsd 208(%rcx), %xmm0
mulsd %xmm9, %xmm0
addsd 184(%rcx), %xmm0
movsd %xmm0, 184(%rcx)
movupd 224(%rcx), %xmm0
movupd 248(%rcx), %xmm1
mulpd %xmm8, %xmm1
addpd %xmm0, %xmm1
movupd %xmm1, 224(%rcx)
movsd 264(%rcx), %xmm0
mulsd %xmm9, %xmm0
addsd 240(%rcx), %xmm0
movsd %xmm0, 240(%rcx)
incq %r9
cmpq %rdx, %r9
jne .LBB0_2
.LBB0_8:
movaps (%rsp), %xmm6
movaps 16(%rsp), %xmm7
movaps 32(%rsp), %xmm8
movaps 48(%rsp), %xmm9
movaps 64(%rsp), %xmm10
movaps 80(%rsp), %xmm11
addq $104, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %r14
retq
As you can see, LLVM performs some unrolling of the inner loop, but it's not enough.
The non-inlined advance() function of the manually unrolled Rust #4:
_ZN7advance20ha99dc24a84950b29LaaE:
subq $376, %rsp
movaps %xmm15, 352(%rsp)
movaps %xmm14, 336(%rsp)
movaps %xmm13, 320(%rsp)
movaps %xmm12, 304(%rsp)
movaps %xmm11, 288(%rsp)
movaps %xmm10, 272(%rsp)
movaps %xmm9, 256(%rsp)
movaps %xmm8, 240(%rsp)
movaps %xmm7, 224(%rsp)
movaps %xmm6, 208(%rsp)
movupd (%rcx), %xmm1
movupd 56(%rcx), %xmm0
movapd %xmm1, %xmm6
subpd %xmm0, %xmm6
movsd 16(%rcx), %xmm2
movsd 40(%rcx), %xmm9
movapd %xmm2, %xmm7
subsd 72(%rcx), %xmm7
movapd %xmm6, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm6, %xmm4
shufpd $1, %xmm4, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm0, %xmm4
movapd %xmm7, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm4, %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm4
mulsd %xmm4, %xmm4
mulsd %xmm0, %xmm4
movsd LCPI0_0(%rip), %xmm0
movapd %xmm0, %xmm5
movapd %xmm0, %xmm11
divsd %xmm4, %xmm5
movsd 104(%rcx), %xmm4
movupd 24(%rcx), %xmm0
movapd %xmm7, %xmm3
mulsd %xmm4, %xmm3
movlhps %xmm4, %xmm4
mulpd %xmm6, %xmm4
mulsd %xmm5, %xmm3
movsd 48(%rcx), %xmm8
mulsd %xmm8, %xmm7
mulsd %xmm5, %xmm7
movlhps %xmm5, %xmm5
mulpd %xmm5, %xmm4
subpd %xmm4, %xmm0
subsd %xmm3, %xmm9
movupd 80(%rcx), %xmm4
movapd %xmm8, %xmm10
movlhps %xmm10, %xmm10
mulpd %xmm10, %xmm6
mulpd %xmm5, %xmm6
addpd %xmm4, %xmm6
movupd %xmm6, 80(%rcx)
addsd 96(%rcx), %xmm7
movsd %xmm7, 96(%rcx)
movupd 112(%rcx), %xmm4
movapd %xmm1, %xmm7
subpd %xmm4, %xmm7
movapd %xmm2, %xmm6
subsd 128(%rcx), %xmm6
movapd %xmm7, %xmm4
mulsd %xmm4, %xmm4
movapd %xmm7, %xmm5
shufpd $1, %xmm5, %xmm5
mulsd %xmm5, %xmm5
addsd %xmm4, %xmm5
movapd %xmm6, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm5, %xmm4
sqrtsd %xmm4, %xmm4
movapd %xmm4, %xmm5
mulsd %xmm5, %xmm5
mulsd %xmm4, %xmm5
movapd %xmm11, %xmm4
divsd %xmm5, %xmm4
movsd 160(%rcx), %xmm5
movapd %xmm6, %xmm3
mulsd %xmm5, %xmm3
movlhps %xmm5, %xmm5
mulpd %xmm7, %xmm5
mulsd %xmm4, %xmm3
mulsd %xmm8, %xmm6
mulsd %xmm4, %xmm6
movlhps %xmm4, %xmm4
mulpd %xmm4, %xmm5
subpd %xmm5, %xmm0
subsd %xmm3, %xmm9
movupd 136(%rcx), %xmm3
mulpd %xmm10, %xmm7
mulpd %xmm4, %xmm7
addpd %xmm3, %xmm7
movupd %xmm7, 136(%rcx)
addsd 152(%rcx), %xmm6
movsd %xmm6, 152(%rcx)
movupd 168(%rcx), %xmm3
movapd %xmm1, %xmm7
subpd %xmm3, %xmm7
movapd %xmm2, %xmm6
subsd 184(%rcx), %xmm6
movapd %xmm7, %xmm3
mulsd %xmm3, %xmm3
movapd %xmm7, %xmm4
shufpd $1, %xmm4, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm3, %xmm4
movapd %xmm6, %xmm3
mulsd %xmm3, %xmm3
addsd %xmm4, %xmm3
sqrtsd %xmm3, %xmm3
movapd %xmm3, %xmm4
mulsd %xmm4, %xmm4
mulsd %xmm3, %xmm4
movapd %xmm11, %xmm3
divsd %xmm4, %xmm3
movsd 216(%rcx), %xmm4
movapd %xmm6, %xmm5
mulsd %xmm4, %xmm5
movlhps %xmm4, %xmm4
mulpd %xmm7, %xmm4
mulsd %xmm3, %xmm5
mulsd %xmm8, %xmm6
mulsd %xmm3, %xmm6
movlhps %xmm3, %xmm3
mulpd %xmm3, %xmm4
subpd %xmm4, %xmm0
subsd %xmm5, %xmm9
movupd 192(%rcx), %xmm4
mulpd %xmm10, %xmm7
mulpd %xmm3, %xmm7
addpd %xmm4, %xmm7
movupd %xmm7, 192(%rcx)
addsd 208(%rcx), %xmm6
movsd %xmm6, 208(%rcx)
movupd 224(%rcx), %xmm3
subpd %xmm3, %xmm1
subsd 240(%rcx), %xmm2
movapd %xmm1, %xmm3
mulsd %xmm3, %xmm3
movapd %xmm1, %xmm4
shufpd $1, %xmm4, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm3, %xmm4
movapd %xmm2, %xmm3
mulsd %xmm3, %xmm3
addsd %xmm4, %xmm3
sqrtsd %xmm3, %xmm3
movapd %xmm3, %xmm4
mulsd %xmm4, %xmm4
mulsd %xmm3, %xmm4
movapd %xmm11, %xmm3
movapd %xmm11, %xmm12
divsd %xmm4, %xmm3
movsd 272(%rcx), %xmm4
movapd %xmm2, %xmm5
mulsd %xmm4, %xmm5
movlhps %xmm4, %xmm4
mulpd %xmm1, %xmm4
mulsd %xmm3, %xmm5
mulsd %xmm8, %xmm2
mulsd %xmm3, %xmm2
movlhps %xmm3, %xmm3
mulpd %xmm3, %xmm4
subpd %xmm4, %xmm0
movupd %xmm0, 24(%rcx)
subsd %xmm5, %xmm9
movsd %xmm9, 40(%rcx)
movupd 248(%rcx), %xmm0
mulpd %xmm10, %xmm1
mulpd %xmm3, %xmm1
addpd %xmm0, %xmm1
movupd %xmm1, 248(%rcx)
addsd 264(%rcx), %xmm2
movsd %xmm2, 264(%rcx)
movupd 56(%rcx), %xmm3
movupd 112(%rcx), %xmm11
movapd %xmm3, %xmm2
subpd %xmm11, %xmm2
movapd %xmm2, 112(%rsp)
movsd 128(%rcx), %xmm1
movapd %xmm1, 192(%rsp)
movsd 160(%rcx), %xmm0
movapd %xmm0, %xmm9
movlhps %xmm9, %xmm9
mulpd %xmm2, %xmm9
movapd %xmm2, %xmm8
movupd 224(%rcx), %xmm2
movapd %xmm2, 176(%rsp)
movsd 240(%rcx), %xmm4
movapd %xmm4, 160(%rsp)
movapd %xmm11, %xmm7
subpd %xmm2, %xmm7
movapd %xmm7, 128(%rsp)
movsd 72(%rcx), %xmm6
movapd %xmm6, %xmm13
unpcklpd %xmm1, %xmm13
movapd %xmm1, %xmm2
unpcklpd %xmm4, %xmm2
subpd %xmm2, %xmm13
movapd %xmm13, 80(%rsp)
movapd %xmm8, %xmm2
unpcklpd %xmm7, %xmm2
mulpd %xmm2, %xmm2
shufpd $1, %xmm8, %xmm8
movapd %xmm8, 96(%rsp)
movapd %xmm7, %xmm4
movsd %xmm8, %xmm4
mulpd %xmm4, %xmm4
addpd %xmm2, %xmm4
movapd %xmm13, %xmm2
mulpd %xmm2, %xmm2
addpd %xmm4, %xmm2
sqrtpd %xmm2, %xmm2
movapd %xmm2, %xmm4
mulpd %xmm4, %xmm4
mulpd %xmm2, %xmm4
movapd LCPI0_1(%rip), %xmm10
divpd %xmm4, %xmm10
movapd %xmm10, %xmm2
movlhps %xmm2, %xmm2
mulpd %xmm9, %xmm2
movupd 80(%rcx), %xmm1
subpd %xmm2, %xmm1
movapd %xmm1, %xmm2
movsd 96(%rcx), %xmm9
mulsd %xmm13, %xmm0
mulsd %xmm10, %xmm0
subsd %xmm0, %xmm9
movsd 104(%rcx), %xmm5
mulsd %xmm5, %xmm13
mulsd %xmm10, %xmm13
addsd 152(%rcx), %xmm13
movsd 144(%rcx), %xmm7
movsd %xmm13, 152(%rcx)
movupd 168(%rcx), %xmm0
movapd %xmm0, 48(%rsp)
movapd %xmm3, %xmm8
subpd %xmm0, %xmm8
movapd %xmm8, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm8, %xmm4
shufpd $1, %xmm4, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm0, %xmm4
movsd 184(%rcx), %xmm0
movsd %xmm0, 40(%rsp)
movapd %xmm6, %xmm14
subsd %xmm0, %xmm14
movapd %xmm14, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm4, %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm4
mulsd %xmm4, %xmm4
mulsd %xmm0, %xmm4
movapd %xmm12, %xmm15
movapd %xmm12, %xmm1
divsd %xmm4, %xmm15
movsd 216(%rcx), %xmm12
movapd %xmm12, %xmm0
movlhps %xmm0, %xmm0
movaps %xmm0, 64(%rsp)
movapd %xmm8, %xmm4
mulpd %xmm0, %xmm4
movapd %xmm14, %xmm0
mulsd %xmm12, %xmm0
mulsd %xmm15, %xmm0
mulsd %xmm5, %xmm14
mulsd %xmm15, %xmm14
movlhps %xmm15, %xmm15
mulpd %xmm15, %xmm4
subpd %xmm4, %xmm2
movapd %xmm2, (%rsp)
subsd %xmm0, %xmm9
movapd %xmm5, %xmm2
movlhps %xmm2, %xmm2
mulpd %xmm2, %xmm8
mulpd %xmm15, %xmm8
movupd 192(%rcx), %xmm4
addpd %xmm4, %xmm8
subpd 176(%rsp), %xmm3
movapd %xmm3, %xmm15
mulsd %xmm15, %xmm15
movapd %xmm3, %xmm4
shufpd $1, %xmm4, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm15, %xmm4
movapd %xmm6, %xmm0
subsd 160(%rsp), %xmm0
movapd %xmm0, %xmm6
mulsd %xmm6, %xmm6
addsd %xmm4, %xmm6
xorps %xmm4, %xmm4
sqrtsd %xmm6, %xmm4
movapd %xmm4, %xmm6
mulsd %xmm6, %xmm6
mulsd %xmm4, %xmm6
movapd %xmm1, %xmm4
divsd %xmm6, %xmm4
movsd 272(%rcx), %xmm1
movapd %xmm1, 144(%rsp)
movapd %xmm1, %xmm15
movlhps %xmm15, %xmm15
mulpd %xmm3, %xmm2
mulpd %xmm15, %xmm3
movapd %xmm0, %xmm6
mulsd %xmm1, %xmm6
mulsd %xmm4, %xmm6
mulsd %xmm5, %xmm0
mulsd %xmm4, %xmm0
movapd %xmm0, 16(%rsp)
movlhps %xmm4, %xmm4
mulpd %xmm4, %xmm3
movapd (%rsp), %xmm0
subpd %xmm3, %xmm0
addsd 208(%rcx), %xmm14
movupd %xmm0, 80(%rcx)
subsd %xmm6, %xmm9
movsd %xmm9, 96(%rcx)
mulpd %xmm2, %xmm4
movupd 248(%rcx), %xmm0
addpd %xmm0, %xmm4
subpd 48(%rsp), %xmm11
movapd 192(%rsp), %xmm1
subsd 40(%rsp), %xmm1
movapd %xmm1, 192(%rsp)
movapd %xmm11, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm11, %xmm3
shufpd $1, %xmm3, %xmm3
mulsd %xmm3, %xmm3
addsd %xmm0, %xmm3
movapd %xmm1, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm3, %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm6
mulsd %xmm6, %xmm6
mulsd %xmm0, %xmm6
movsd LCPI0_0(%rip), %xmm2
movapd %xmm2, %xmm3
divsd %xmm6, %xmm3
mulsd %xmm1, %xmm12
mulsd %xmm3, %xmm12
subsd %xmm12, %xmm13
movapd 80(%rsp), %xmm1
movapd %xmm1, %xmm9
movapd 96(%rsp), %xmm0
movsd %xmm0, %xmm1
movapd 112(%rsp), %xmm6
mulsd %xmm5, %xmm6
unpcklpd 144(%rsp), %xmm5
mulpd %xmm1, %xmm5
unpcklpd %xmm13, %xmm7
mulpd %xmm10, %xmm5
movapd %xmm7, %xmm0
addpd %xmm5, %xmm0
subpd %xmm5, %xmm7
movapd %xmm7, %xmm1
movsd %xmm0, %xmm1
movapd %xmm2, %xmm12
movapd %xmm12, %xmm0
unpcklpd %xmm6, %xmm0
shufpd $1, %xmm10, %xmm7
mulpd %xmm0, %xmm7
movupd 128(%rcx), %xmm0
addpd %xmm0, %xmm7
movapd 16(%rsp), %xmm6
addsd 264(%rcx), %xmm6
movupd %xmm7, 128(%rcx)
movupd %xmm1, 144(%rcx)
movapd 64(%rsp), %xmm2
mulpd %xmm11, %xmm2
movsd 160(%rcx), %xmm1
movapd 192(%rsp), %xmm0
mulsd %xmm1, %xmm0
mulsd %xmm3, %xmm0
movapd %xmm0, %xmm5
movlhps %xmm3, %xmm3
mulpd %xmm3, %xmm2
movupd 136(%rcx), %xmm0
subpd %xmm2, %xmm0
shufpd $1, %xmm9, %xmm9
mulsd %xmm1, %xmm9
movlhps %xmm1, %xmm1
mulpd %xmm1, %xmm11
mulpd %xmm3, %xmm11
addpd %xmm8, %xmm11
addsd %xmm14, %xmm5
movapd 128(%rsp), %xmm3
mulpd %xmm3, %xmm1
mulpd %xmm15, %xmm3
movapd %xmm10, %xmm2
movhlps %xmm10, %xmm10
mulpd %xmm10, %xmm3
subpd %xmm3, %xmm0
shufpd $1, %xmm2, %xmm2
movupd %xmm0, 136(%rcx)
mulpd %xmm1, %xmm10
addpd %xmm4, %xmm10
mulsd %xmm2, %xmm9
addsd %xmm6, %xmm9
movupd 168(%rcx), %xmm13
movapd %xmm13, %xmm4
subpd 176(%rsp), %xmm4
movsd 184(%rcx), %xmm8
movapd %xmm8, %xmm1
subsd 160(%rsp), %xmm1
movapd %xmm4, %xmm6
mulsd %xmm6, %xmm6
movapd %xmm4, %xmm7
shufpd $1, %xmm7, %xmm7
mulsd %xmm7, %xmm7
addsd %xmm6, %xmm7
movapd %xmm1, %xmm6
mulsd %xmm6, %xmm6
addsd %xmm7, %xmm6
sqrtsd %xmm6, %xmm6
movapd %xmm6, %xmm7
mulsd %xmm7, %xmm7
mulsd %xmm6, %xmm7
movapd %xmm12, %xmm6
divsd %xmm7, %xmm6
mulpd %xmm4, %xmm15
movapd %xmm6, %xmm2
movlhps %xmm2, %xmm2
mulpd %xmm2, %xmm15
subpd %xmm15, %xmm11
movupd %xmm11, 192(%rcx)
movapd 144(%rsp), %xmm3
mulsd %xmm1, %xmm3
mulsd %xmm6, %xmm3
subsd %xmm3, %xmm5
movsd %xmm5, 208(%rcx)
movsd 216(%rcx), %xmm7
mulsd %xmm7, %xmm1
movlhps %xmm7, %xmm7
mulpd %xmm4, %xmm7
mulpd %xmm2, %xmm7
addpd %xmm10, %xmm7
movupd %xmm7, 248(%rcx)
mulsd %xmm6, %xmm1
addsd %xmm9, %xmm1
movsd %xmm1, 264(%rcx)
movupd (%rcx), %xmm2
movupd 24(%rcx), %xmm4
movapd LCPI0_1(%rip), %xmm3
mulpd %xmm3, %xmm4
addpd %xmm2, %xmm4
movupd %xmm4, (%rcx)
movsd 40(%rcx), %xmm2
movapd %xmm12, %xmm6
mulsd %xmm6, %xmm2
addsd 16(%rcx), %xmm2
movsd %xmm2, 16(%rcx)
movupd 56(%rcx), %xmm2
movupd 80(%rcx), %xmm4
mulpd %xmm3, %xmm4
addpd %xmm2, %xmm4
movupd %xmm4, 56(%rcx)
movsd 96(%rcx), %xmm2
mulsd %xmm6, %xmm2
addsd 72(%rcx), %xmm2
movsd %xmm2, 72(%rcx)
movupd 112(%rcx), %xmm2
mulpd %xmm3, %xmm0
addpd %xmm2, %xmm0
movupd %xmm0, 112(%rcx)
mulpd %xmm3, %xmm11
addpd %xmm13, %xmm11
movupd %xmm11, 168(%rcx)
mulsd %xmm6, %xmm5
addsd %xmm8, %xmm5
movsd %xmm5, 184(%rcx)
mulpd %xmm3, %xmm7
movupd 224(%rcx), %xmm0
addpd %xmm0, %xmm7
movupd %xmm7, 224(%rcx)
mulsd %xmm6, %xmm1
addsd 240(%rcx), %xmm1
movsd %xmm1, 240(%rcx)
movaps 208(%rsp), %xmm6
movaps 224(%rsp), %xmm7
movaps 240(%rsp), %xmm8
movaps 256(%rsp), %xmm9
movaps 272(%rsp), %xmm10
movaps 288(%rsp), %xmm11
movaps 304(%rsp), %xmm12
movaps 320(%rsp), %xmm13
movaps 336(%rsp), %xmm14
movaps 352(%rsp), %xmm15
addq $376, %rsp
retq
As you can see, the advance() of Rust#4 contains no loops and no jumps. Modern CPUs love this kind of straight-line code, and they scream through it.
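Until an #[unroll(N)] attribute exists, the closest workaround I know of is manual chunking: splitting the data into fixed-size pieces so the inner loop has a constant trip count that LLVM will fully unroll on its own. This is just a minimal sketch (sum_unrolled is a hypothetical example function, not part of the nbody benchmark code):

```rust
// Hypothetical sketch: approximate #[unroll(4)] by hand, giving the
// inner loop a constant bound so LLVM can fully unroll it.
fn sum_unrolled(data: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let chunks = data.chunks_exact(4);
    let rem = chunks.remainder();
    for c in chunks {
        // Constant trip count of 4: LLVM unrolls this completely,
        // and the four accumulators break the dependency chain.
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    acc.iter().sum::<f64>() + rem.iter().sum::<f64>()
}

fn main() {
    let v: Vec<f64> = (1..=10).map(|x| x as f64).collect();
    println!("{}", sum_unrolled(&v)); // 55
}
```

The drawback is obvious: this trick obscures the code, has to be re-tuned per kernel, and is exactly the kind of noise a per-loop annotation would remove.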
Finally, here is the asm of the advance() function of Rust#3, compiled with more aggressive automatic loop unrolling:
rustc -C opt-level=3 -C lto -C llvm-args="-unroll-threshold=1000" --emit asm nbody_simple.rs
_ZN7advance20hc525da322d48cc5eLaaE:
subq $840, %rsp
movapd %xmm15, 816(%rsp)
movapd %xmm14, 800(%rsp)
movapd %xmm13, 784(%rsp)
movapd %xmm12, 768(%rsp)
movapd %xmm11, 752(%rsp)
movapd %xmm10, 736(%rsp)
movapd %xmm9, 720(%rsp)
movaps %xmm8, 704(%rsp)
movapd %xmm7, 688(%rsp)
movapd %xmm6, 672(%rsp)
testq %rdx, %rdx
je .LBB0_7
xorl %eax, %eax
testb $1, %dl
je .LBB0_3
movupd (%rcx), %xmm4
movsd 48(%rcx), %xmm8
movapd %xmm8, %xmm9
movlhps %xmm9, %xmm9
movupd 56(%rcx), %xmm3
movapd %xmm4, %xmm1
subpd %xmm3, %xmm1
movsd 16(%rcx), %xmm3
movsd 40(%rcx), %xmm10
movapd %xmm3, %xmm7
subsd 72(%rcx), %xmm7
movapd %xmm1, %xmm6
mulsd %xmm6, %xmm6
movapd %xmm1, %xmm0
shufpd $1, %xmm0, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm6, %xmm0
movapd %xmm7, %xmm6
mulsd %xmm6, %xmm6
addsd %xmm0, %xmm6
xorps %xmm0, %xmm0
sqrtsd %xmm6, %xmm0
movapd %xmm0, %xmm6
mulsd %xmm6, %xmm6
mulsd %xmm0, %xmm6
movsd LCPI0_0(%rip), %xmm2
movapd %xmm2, %xmm0
movapd %xmm2, %xmm12
divsd %xmm6, %xmm0
movsd 104(%rcx), %xmm2
movupd 24(%rcx), %xmm6
movapd %xmm7, %xmm5
mulsd %xmm2, %xmm5
movlhps %xmm2, %xmm2
mulpd %xmm1, %xmm2
mulsd %xmm0, %xmm5
mulsd %xmm8, %xmm7
mulsd %xmm0, %xmm7
movlhps %xmm0, %xmm0
mulpd %xmm0, %xmm2
subpd %xmm2, %xmm6
subsd %xmm5, %xmm10
movupd 80(%rcx), %xmm2
mulpd %xmm9, %xmm1
mulpd %xmm0, %xmm1
addpd %xmm2, %xmm1
movupd %xmm1, 80(%rcx)
addsd 96(%rcx), %xmm7
movsd %xmm7, 96(%rcx)
movupd 112(%rcx), %xmm0
movapd %xmm4, %xmm7
subpd %xmm0, %xmm7
movapd %xmm3, %xmm0
subsd 128(%rcx), %xmm0
movapd %xmm7, %xmm1
mulsd %xmm1, %xmm1
movapd %xmm7, %xmm2
shufpd $1, %xmm2, %xmm2
mulsd %xmm2, %xmm2
addsd %xmm1, %xmm2
movapd %xmm0, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm1
sqrtsd %xmm1, %xmm1
movapd %xmm1, %xmm2
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm2
movapd %xmm12, %xmm1
divsd %xmm2, %xmm1
movsd 160(%rcx), %xmm2
movapd %xmm0, %xmm5
mulsd %xmm2, %xmm5
movlhps %xmm2, %xmm2
mulpd %xmm7, %xmm2
mulsd %xmm1, %xmm5
mulsd %xmm8, %xmm0
mulsd %xmm1, %xmm0
movlhps %xmm1, %xmm1
mulpd %xmm1, %xmm2
subpd %xmm2, %xmm6
subsd %xmm5, %xmm10
movupd 136(%rcx), %xmm2
mulpd %xmm9, %xmm7
mulpd %xmm1, %xmm7
addpd %xmm2, %xmm7
movupd %xmm7, 136(%rcx)
addsd 152(%rcx), %xmm0
movsd %xmm0, 152(%rcx)
movupd 168(%rcx), %xmm0
movapd %xmm4, %xmm7
subpd %xmm0, %xmm7
movapd %xmm3, %xmm0
subsd 184(%rcx), %xmm0
movapd %xmm7, %xmm1
mulsd %xmm1, %xmm1
movapd %xmm7, %xmm2
shufpd $1, %xmm2, %xmm2
mulsd %xmm2, %xmm2
addsd %xmm1, %xmm2
movapd %xmm0, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm1
sqrtsd %xmm1, %xmm1
movapd %xmm1, %xmm2
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm2
movapd %xmm12, %xmm1
divsd %xmm2, %xmm1
movsd 216(%rcx), %xmm2
movapd %xmm0, %xmm5
mulsd %xmm2, %xmm5
movlhps %xmm2, %xmm2
mulpd %xmm7, %xmm2
mulsd %xmm1, %xmm5
mulsd %xmm8, %xmm0
mulsd %xmm1, %xmm0
movlhps %xmm1, %xmm1
mulpd %xmm1, %xmm2
subpd %xmm2, %xmm6
subsd %xmm5, %xmm10
movupd 192(%rcx), %xmm2
mulpd %xmm9, %xmm7
mulpd %xmm1, %xmm7
addpd %xmm2, %xmm7
movupd %xmm7, 192(%rcx)
addsd 208(%rcx), %xmm0
movsd %xmm0, 208(%rcx)
movupd 224(%rcx), %xmm0
subpd %xmm0, %xmm4
subsd 240(%rcx), %xmm3
movapd %xmm4, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm4, %xmm1
shufpd $1, %xmm1, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm0, %xmm1
movapd %xmm3, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm1, %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm1
mulsd %xmm1, %xmm1
mulsd %xmm0, %xmm1
movapd %xmm12, %xmm0
divsd %xmm1, %xmm0
movsd 272(%rcx), %xmm1
movapd %xmm3, %xmm2
mulsd %xmm1, %xmm2
movlhps %xmm1, %xmm1
mulpd %xmm4, %xmm1
mulsd %xmm0, %xmm2
mulsd %xmm8, %xmm3
mulsd %xmm0, %xmm3
movlhps %xmm0, %xmm0
mulpd %xmm0, %xmm1
subpd %xmm1, %xmm6
movupd %xmm6, 24(%rcx)
subsd %xmm2, %xmm10
movsd %xmm10, 40(%rcx)
movupd 248(%rcx), %xmm1
mulpd %xmm9, %xmm4
mulpd %xmm0, %xmm4
addpd %xmm1, %xmm4
movupd %xmm4, 248(%rcx)
addsd 264(%rcx), %xmm3
movsd %xmm3, 264(%rcx)
movupd 56(%rcx), %xmm4
movsd 104(%rcx), %xmm14
movapd %xmm14, %xmm3
movlhps %xmm3, %xmm3
movupd 112(%rcx), %xmm0
movapd %xmm4, %xmm1
subpd %xmm0, %xmm1
movsd 72(%rcx), %xmm11
movsd 96(%rcx), %xmm5
movapd %xmm11, %xmm2
subsd 128(%rcx), %xmm2
movapd %xmm1, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm1, %xmm7
shufpd $1, %xmm7, %xmm7
mulsd %xmm7, %xmm7
addsd %xmm0, %xmm7
movapd %xmm2, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm7, %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm7
mulsd %xmm7, %xmm7
mulsd %xmm0, %xmm7
movapd %xmm12, %xmm8
divsd %xmm7, %xmm8
movsd 160(%rcx), %xmm7
movupd 80(%rcx), %xmm0
movapd %xmm2, %xmm6
mulsd %xmm7, %xmm6
movlhps %xmm7, %xmm7
mulpd %xmm1, %xmm7
mulsd %xmm8, %xmm6
mulsd %xmm14, %xmm2
mulsd %xmm8, %xmm2
movlhps %xmm8, %xmm8
mulpd %xmm8, %xmm7
subpd %xmm7, %xmm0
subsd %xmm6, %xmm5
movupd 136(%rcx), %xmm6
mulpd %xmm3, %xmm1
mulpd %xmm8, %xmm1
addpd %xmm6, %xmm1
movupd %xmm1, 136(%rcx)
addsd 152(%rcx), %xmm2
movsd %xmm2, 152(%rcx)
movupd 168(%rcx), %xmm9
movapd %xmm4, %xmm2
subpd %xmm9, %xmm2
movsd 184(%rcx), %xmm1
movsd %xmm1, 640(%rsp)
movsd 216(%rcx), %xmm15
movapd %xmm11, %xmm13
subsd %xmm1, %xmm13
movapd %xmm2, %xmm1
mulsd %xmm1, %xmm1
movapd %xmm2, %xmm6
shufpd $1, %xmm6, %xmm6
mulsd %xmm6, %xmm6
addsd %xmm1, %xmm6
movapd %xmm13, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm6, %xmm1
sqrtsd %xmm1, %xmm1
movapd %xmm1, %xmm6
mulsd %xmm6, %xmm6
mulsd %xmm1, %xmm6
movapd %xmm12, %xmm1
movapd %xmm1, %xmm7
divsd %xmm6, %xmm7
movapd %xmm15, %xmm10
movlhps %xmm10, %xmm10
movapd %xmm3, %xmm8
mulpd %xmm2, %xmm8
mulpd %xmm10, %xmm2
movapd %xmm13, %xmm6
mulsd %xmm15, %xmm6
mulsd %xmm7, %xmm6
mulsd %xmm14, %xmm13
mulsd %xmm7, %xmm13
movapd %xmm7, %xmm12
movlhps %xmm12, %xmm12
mulpd %xmm12, %xmm2
subpd %xmm2, %xmm0
subsd %xmm6, %xmm5
movupd 192(%rcx), %xmm2
mulpd %xmm8, %xmm12
addpd %xmm2, %xmm12
movupd 224(%rcx), %xmm2
movapd %xmm2, 656(%rsp)
subpd %xmm2, %xmm4
movapd %xmm4, %xmm2
mulsd %xmm2, %xmm2
movapd %xmm4, %xmm6
shufpd $1, %xmm6, %xmm6
mulsd %xmm6, %xmm6
addsd %xmm2, %xmm6
movsd 240(%rcx), %xmm2
movsd %xmm2, 624(%rsp)
subsd %xmm2, %xmm11
movapd %xmm11, %xmm2
mulsd %xmm2, %xmm2
addsd %xmm6, %xmm2
sqrtsd %xmm2, %xmm2
movapd %xmm2, %xmm6
mulsd %xmm6, %xmm6
mulsd %xmm2, %xmm6
movapd %xmm1, %xmm2
movapd %xmm1, %xmm8
divsd %xmm6, %xmm2
movapd %xmm11, %xmm6
mulsd %xmm14, %xmm11
movsd 272(%rcx), %xmm14
movapd %xmm14, %xmm7
movlhps %xmm7, %xmm7
mulpd %xmm4, %xmm3
mulpd %xmm7, %xmm4
mulsd %xmm14, %xmm6
mulsd %xmm2, %xmm6
mulsd %xmm2, %xmm11
movlhps %xmm2, %xmm2
mulpd %xmm2, %xmm4
subpd %xmm4, %xmm0
addsd 208(%rcx), %xmm13
movupd %xmm0, 80(%rcx)
subsd %xmm6, %xmm5
movsd %xmm5, 96(%rcx)
movupd 248(%rcx), %xmm0
mulpd %xmm3, %xmm2
addpd %xmm0, %xmm2
movupd 112(%rcx), %xmm4
movapd %xmm4, %xmm3
subpd %xmm9, %xmm3
movsd 128(%rcx), %xmm1
movapd %xmm1, %xmm5
subsd 640(%rsp), %xmm5
movapd %xmm3, %xmm9
mulsd %xmm9, %xmm9
movapd %xmm3, %xmm0
shufpd $1, %xmm0, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm9, %xmm0
movapd %xmm5, %xmm6
mulsd %xmm6, %xmm6
addsd %xmm0, %xmm6
xorps %xmm0, %xmm0
sqrtsd %xmm6, %xmm0
movapd %xmm0, %xmm6
mulsd %xmm6, %xmm6
mulsd %xmm0, %xmm6
movapd %xmm8, %xmm0
divsd %xmm6, %xmm0
mulpd %xmm3, %xmm10
mulsd %xmm5, %xmm15
mulsd %xmm0, %xmm15
movapd %xmm15, %xmm8
movsd 160(%rcx), %xmm9
mulsd %xmm9, %xmm5
mulsd %xmm0, %xmm5
movlhps %xmm0, %xmm0
mulpd %xmm0, %xmm10
movupd 136(%rcx), %xmm15
subpd %xmm10, %xmm15
movsd 152(%rcx), %xmm6
subsd %xmm8, %xmm6
subsd 624(%rsp), %xmm1
mulsd %xmm1, %xmm14
movapd %xmm1, %xmm8
mulsd %xmm9, %xmm1
movlhps %xmm9, %xmm9
mulpd %xmm9, %xmm3
mulpd %xmm0, %xmm3
addsd 264(%rcx), %xmm11
addpd %xmm12, %xmm3
addsd %xmm13, %xmm5
subpd 656(%rsp), %xmm4
movapd %xmm4, %xmm10
mulsd %xmm10, %xmm10
movapd %xmm4, %xmm0
shufpd $1, %xmm0, %xmm0
mulsd %xmm0, %xmm0
addsd %xmm10, %xmm0
mulsd %xmm8, %xmm8
addsd %xmm0, %xmm8
sqrtsd %xmm8, %xmm8
movapd %xmm8, %xmm0
mulsd %xmm0, %xmm0
mulsd %xmm8, %xmm0
movsd LCPI0_0(%rip), %xmm10
movapd %xmm10, %xmm8
divsd %xmm0, %xmm8
mulpd %xmm4, %xmm7
mulsd %xmm8, %xmm14
mulsd %xmm8, %xmm1
movlhps %xmm8, %xmm8
mulpd %xmm8, %xmm7
subpd %xmm7, %xmm15
movupd %xmm15, 136(%rcx)
subsd %xmm14, %xmm6
movsd %xmm6, 152(%rcx)
mulpd %xmm9, %xmm4
mulpd %xmm8, %xmm4
addpd %xmm2, %xmm4
addsd %xmm11, %xmm1
movupd 168(%rcx), %xmm11
movapd %xmm11, 624(%rsp)
movsd 184(%rcx), %xmm2
movsd 216(%rcx), %xmm14
movupd 224(%rcx), %xmm0
movapd %xmm0, 640(%rsp)
subpd %xmm0, %xmm11
movsd 240(%rcx), %xmm0
movsd %xmm0, 656(%rsp)
subsd %xmm0, %xmm2
movapd %xmm2, %xmm13
movsd 272(%rcx), %xmm0
movapd %xmm2, %xmm12
mulsd %xmm0, %xmm12
movlhps %xmm0, %xmm0
mulpd %xmm11, %xmm0
mulsd %xmm14, %xmm2
movlhps %xmm14, %xmm14
mulpd %xmm11, %xmm14
movapd %xmm11, %xmm7
mulsd %xmm11, %xmm11
shufpd $1, %xmm7, %xmm7
mulsd %xmm7, %xmm7
addsd %xmm11, %xmm7
mulsd %xmm13, %xmm13
addsd %xmm7, %xmm13
xorps %xmm11, %xmm11
sqrtsd %xmm13, %xmm11
movapd %xmm11, %xmm7
mulsd %xmm7, %xmm7
mulsd %xmm11, %xmm7
movapd %xmm10, %xmm11
divsd %xmm7, %xmm11
mulsd %xmm11, %xmm12
mulsd %xmm11, %xmm2
movlhps %xmm11, %xmm11
mulpd %xmm11, %xmm0
subpd %xmm0, %xmm3
movupd %xmm3, 192(%rcx)
subsd %xmm12, %xmm5
movsd %xmm5, 208(%rcx)
mulpd %xmm11, %xmm14
addpd %xmm4, %xmm14
movupd %xmm14, 248(%rcx)
addsd %xmm1, %xmm2
movsd %xmm2, 264(%rcx)
movupd (%rcx), %xmm1
movupd 24(%rcx), %xmm4
movupd 56(%rcx), %xmm12
movupd 80(%rcx), %xmm7
movupd 112(%rcx), %xmm11
movapd LCPI0_1(%rip), %xmm9
mulpd %xmm9, %xmm4
addpd %xmm1, %xmm4
movsd 40(%rcx), %xmm1
mulsd %xmm10, %xmm1
addsd 16(%rcx), %xmm1
movsd 96(%rcx), %xmm5
mulsd %xmm10, %xmm5
addsd 72(%rcx), %xmm5
mulsd %xmm10, %xmm6
addsd 128(%rcx), %xmm6
movupd 176(%rcx), %xmm13
movupd 200(%rcx), %xmm0
movupd %xmm4, (%rcx)
movsd %xmm1, 16(%rcx)
mulpd %xmm9, %xmm7
addpd %xmm12, %xmm7
movupd %xmm7, 56(%rcx)
movsd %xmm5, 72(%rcx)
mulpd %xmm9, %xmm15
addpd %xmm11, %xmm15
movupd %xmm15, 112(%rcx)
movsd %xmm6, 128(%rcx)
mulsd %xmm10, %xmm3
addsd 624(%rsp), %xmm3
movsd %xmm3, 168(%rcx)
mulpd %xmm9, %xmm0
addpd %xmm13, %xmm0
movupd %xmm0, 176(%rcx)
mulpd %xmm9, %xmm14
addpd 640(%rsp), %xmm14
movupd %xmm14, 224(%rcx)
mulsd %xmm10, %xmm2
addsd 656(%rsp), %xmm2
movsd %xmm2, 240(%rcx)
movl $1, %eax
.LBB0_3:
cmpq $1, %rdx
je .LBB0_7
movsd 48(%rcx), %xmm0
movapd %xmm0, 288(%rsp)
movsd 72(%rcx), %xmm12
movapd %xmm0, %xmm1
movlhps %xmm1, %xmm1
movaps %xmm1, 496(%rsp)
movsd 104(%rcx), %xmm1
movapd %xmm1, 272(%rsp)
movapd %xmm1, %xmm2
movlhps %xmm2, %xmm2
movaps %xmm2, 544(%rsp)
movsd 160(%rcx), %xmm2
movapd %xmm2, 256(%rsp)
movapd %xmm2, %xmm3
movlhps %xmm3, %xmm3
movaps %xmm3, 480(%rsp)
movsd 216(%rcx), %xmm3
movapd %xmm3, 240(%rsp)
movapd %xmm3, %xmm4
movlhps %xmm4, %xmm4
movaps %xmm4, 512(%rsp)
movsd 272(%rcx), %xmm4
movapd %xmm4, 224(%rsp)
movapd %xmm4, %xmm5
movlhps %xmm5, %xmm5
movaps %xmm5, 528(%rsp)
movsd 240(%rcx), %xmm9
movhpd 128(%rcx), %xmm9
movsd 264(%rcx), %xmm6
movhpd 152(%rcx), %xmm6
movhpd 16(%rcx), %xmm12
movsd 96(%rcx), %xmm5
movhpd 40(%rcx), %xmm5
movapd %xmm5, 640(%rsp)
movapd %xmm3, %xmm5
unpcklpd %xmm4, %xmm5
movapd %xmm5, 208(%rsp)
movapd %xmm2, %xmm4
unpcklpd %xmm3, %xmm4
movapd %xmm4, 32(%rsp)
unpcklpd %xmm1, %xmm0
movapd %xmm0, 16(%rsp)
unpcklpd %xmm2, %xmm1
movapd %xmm1, (%rsp)
subq %rax, %rdx
movupd (%rcx), %xmm7
movupd 56(%rcx), %xmm5
movups 24(%rcx), %xmm0
movaps %xmm0, 656(%rsp)
movups 80(%rcx), %xmm0
movaps %xmm0, 560(%rsp)
movupd 112(%rcx), %xmm10
movupd 136(%rcx), %xmm0
movapd %xmm0, 576(%rsp)
movupd 224(%rcx), %xmm4
movupd 248(%rcx), %xmm15
movsd 184(%rcx), %xmm1
movsd 208(%rcx), %xmm14
movupd 192(%rcx), %xmm11
.align 16, 0x90
.LBB0_5:
movapd %xmm10, 352(%rsp)
movapd %xmm9, 400(%rsp)
movapd %xmm4, 416(%rsp)
movapd %xmm6, 192(%rsp)
movapd %xmm5, 304(%rsp)
movapd %xmm7, 320(%rsp)
movapd %xmm12, 384(%rsp)
movapd %xmm7, %xmm2
subpd %xmm5, %xmm2
movapd %xmm2, 592(%rsp)
movapd %xmm2, %xmm0
mulsd %xmm0, %xmm0
shufpd $1, %xmm2, %xmm2
mulsd %xmm2, %xmm2
addsd %xmm0, %xmm2
movapd %xmm12, %xmm8
shufpd $1, %xmm8, %xmm8
subsd %xmm12, %xmm8
movapd %xmm5, %xmm0
movapd %xmm8, %xmm13
mulsd %xmm13, %xmm13
addsd %xmm2, %xmm13
movapd %xmm7, %xmm5
subpd %xmm10, %xmm5
movapd %xmm5, 368(%rsp)
movapd %xmm7, %xmm3
subpd %xmm4, %xmm3
movapd %xmm3, 176(%rsp)
movapd %xmm3, %xmm2
movapd %xmm3, %xmm4
unpcklpd %xmm5, %xmm2
mulpd %xmm2, %xmm2
movapd %xmm4, %xmm6
unpckhpd %xmm5, %xmm6
mulpd %xmm6, %xmm6
addpd %xmm2, %xmm6
movapd %xmm12, %xmm2
movhlps %xmm2, %xmm2
subpd %xmm9, %xmm2
movapd %xmm2, 336(%rsp)
movapd %xmm2, %xmm3
mulpd %xmm3, %xmm3
addpd %xmm6, %xmm3
movsd 184(%rcx), %xmm6
movapd %xmm6, 432(%rsp)
unpcklpd %xmm1, %xmm6
sqrtsd %xmm13, %xmm1
movapd %xmm1, %xmm5
mulsd %xmm5, %xmm5
mulsd %xmm1, %xmm5
subpd %xmm6, %xmm12
movapd %xmm12, 624(%rsp)
movapd %xmm12, %xmm13
sqrtpd %xmm3, %xmm3
movapd %xmm3, %xmm1
mulpd %xmm1, %xmm1
mulpd %xmm3, %xmm1
movupd 168(%rcx), %xmm2
movapd %xmm2, 464(%rsp)
subpd %xmm2, %xmm7
movapd %xmm0, %xmm9
subpd %xmm2, %xmm9
movapd %xmm7, %xmm3
mulsd %xmm3, %xmm3
movapd %xmm9, %xmm6
mulsd %xmm6, %xmm6
unpcklpd %xmm3, %xmm6
movapd %xmm7, %xmm3
movapd %xmm7, %xmm12
shufpd $1, %xmm3, %xmm3
mulsd %xmm3, %xmm3
movapd %xmm9, %xmm7
shufpd $1, %xmm7, %xmm7
mulsd %xmm7, %xmm7
unpcklpd %xmm3, %xmm7
movsd LCPI0_0(%rip), %xmm3
divsd %xmm5, %xmm3
addpd %xmm6, %xmm7
movapd %xmm13, %xmm5
mulpd %xmm5, %xmm5
addpd %xmm7, %xmm5
movapd LCPI0_1(%rip), %xmm6
movapd %xmm6, %xmm2
movapd %xmm2, %xmm4
divpd %xmm1, %xmm4
movapd %xmm4, 144(%rsp)
sqrtpd %xmm5, %xmm1
movapd %xmm1, %xmm6
mulpd %xmm6, %xmm6
mulpd %xmm1, %xmm6
movapd %xmm2, %xmm5
divpd %xmm6, %xmm5
movapd 592(%rsp), %xmm6
movapd %xmm6, %xmm1
mulpd 544(%rsp), %xmm1
movlhps %xmm3, %xmm3
mulpd %xmm3, %xmm1
movapd 656(%rsp), %xmm2
subpd %xmm1, %xmm2
movapd %xmm2, 656(%rsp)
movapd %xmm12, %xmm2
mulpd 512(%rsp), %xmm2
movapd %xmm5, %xmm1
movhlps %xmm1, %xmm1
movapd 496(%rsp), %xmm7
mulpd %xmm7, %xmm12
mulpd %xmm1, %xmm2
movapd %xmm2, 128(%rsp)
mulpd %xmm1, %xmm12
movapd %xmm12, 448(%rsp)
movapd %xmm13, %xmm2
shufpd $1, %xmm2, %xmm2
mulsd 288(%rsp), %xmm2
movapd %xmm5, %xmm1
shufpd $1, %xmm1, %xmm1
mulsd %xmm1, %xmm2
movapd %xmm2, 112(%rsp)
movlhps %xmm8, %xmm8
mulpd 16(%rsp), %xmm8
mulpd %xmm3, %xmm8
movapd %xmm6, %xmm2
mulpd %xmm7, %xmm2
mulpd %xmm3, %xmm2
movapd 640(%rsp), %xmm13
movapd %xmm13, %xmm1
addpd %xmm8, %xmm1
subpd %xmm8, %xmm13
movapd %xmm0, %xmm8
subpd %xmm10, %xmm8
movapd %xmm0, %xmm10
subpd 416(%rsp), %xmm10
movapd %xmm10, %xmm3
unpcklpd %xmm8, %xmm3
mulpd %xmm3, %xmm3
movapd %xmm10, %xmm6
unpckhpd %xmm8, %xmm6
mulpd %xmm6, %xmm6
addpd %xmm3, %xmm6
movaps 384(%rsp), %xmm4
movlhps %xmm4, %xmm4
subpd 400(%rsp), %xmm4
movapd %xmm4, 608(%rsp)
movapd %xmm4, %xmm3
mulpd %xmm3, %xmm3
addpd %xmm6, %xmm3
addpd 560(%rsp), %xmm2
movapd 368(%rsp), %xmm6
movapd 480(%rsp), %xmm0
mulpd %xmm0, %xmm6
movaps 144(%rsp), %xmm12
movaps %xmm12, %xmm7
movhlps %xmm7, %xmm7
movaps %xmm7, 96(%rsp)
mulpd %xmm7, %xmm6
movapd 656(%rsp), %xmm7
subpd %xmm6, %xmm7
movapd %xmm7, 656(%rsp)
sqrtpd %xmm3, %xmm3
movapd %xmm3, %xmm6
mulpd %xmm6, %xmm6
mulpd %xmm3, %xmm6
movapd %xmm13, %xmm7
movsd %xmm1, %xmm7
movapd LCPI0_1(%rip), %xmm13
divpd %xmm6, %xmm13
movapd %xmm8, %xmm1
mulpd %xmm0, %xmm1
movapd %xmm13, %xmm3
movhlps %xmm3, %xmm3
movaps %xmm3, 80(%rsp)
mulpd %xmm3, %xmm1
subpd %xmm1, %xmm2
movapd %xmm2, 592(%rsp)
movapd %xmm4, %xmm1
unpckhpd 336(%rsp), %xmm1
mulpd %xmm0, %xmm1
movapd %xmm13, %xmm6
unpckhpd %xmm12, %xmm6
mulpd %xmm1, %xmm6
subpd %xmm6, %xmm7
movapd %xmm7, 640(%rsp)
movapd 624(%rsp), %xmm2
movapd %xmm2, %xmm3
movapd 512(%rsp), %xmm0
mulpd %xmm0, %xmm3
mulpd %xmm5, %xmm3
movapd 544(%rsp), %xmm6
mulpd %xmm9, %xmm6
mulpd %xmm0, %xmm9
movapd %xmm2, %xmm0
mulsd 272(%rsp), %xmm0
mulsd %xmm5, %xmm0
movapd %xmm0, 624(%rsp)
movlhps %xmm5, %xmm5
mulpd %xmm5, %xmm9
mulpd %xmm6, %xmm5
movapd %xmm5, 560(%rsp)
movapd %xmm13, %xmm2
unpcklpd %xmm12, %xmm2
movapd 608(%rsp), %xmm4
movapd %xmm4, %xmm6
movapd 336(%rsp), %xmm1
unpcklpd %xmm1, %xmm6
movapd 528(%rsp), %xmm0
mulpd %xmm0, %xmm6
movapd %xmm0, %xmm5
mulpd %xmm6, %xmm2
movapd %xmm2, 160(%rsp)
movapd 496(%rsp), %xmm7
mulpd %xmm7, %xmm1
mulpd %xmm12, %xmm1
movapd 176(%rsp), %xmm0
movapd %xmm0, %xmm2
mulpd %xmm5, %xmm2
movlhps %xmm12, %xmm12
mulpd %xmm12, %xmm2
movapd %xmm2, 144(%rsp)
mulpd %xmm7, %xmm0
mulpd %xmm12, %xmm0
addpd %xmm15, %xmm0
movapd %xmm0, %xmm6
movapd 448(%rsp), %xmm0
addpd %xmm11, %xmm0
movapd %xmm0, 448(%rsp)
movapd 544(%rsp), %xmm2
mulpd %xmm2, %xmm4
mulpd %xmm13, %xmm4
movapd %xmm10, %xmm11
mulpd %xmm5, %xmm11
movlhps %xmm13, %xmm13
mulpd %xmm13, %xmm11
mulpd %xmm2, %xmm10
mulpd %xmm13, %xmm10
movapd 112(%rsp), %xmm5
addsd %xmm14, %xmm5
addpd %xmm6, %xmm10
movapd %xmm10, 336(%rsp)
addpd 192(%rsp), %xmm1
addpd %xmm1, %xmm4
movapd %xmm4, 608(%rsp)
movapd 656(%rsp), %xmm0
subpd 128(%rsp), %xmm0
movapd %xmm0, 656(%rsp)
movapd 368(%rsp), %xmm0
mulpd %xmm7, %xmm0
mulpd 96(%rsp), %xmm0
addpd 576(%rsp), %xmm0
movapd 592(%rsp), %xmm15
subpd %xmm9, %xmm15
mulpd %xmm2, %xmm8
mulpd 80(%rsp), %xmm8
addpd %xmm0, %xmm8
movapd %xmm8, %xmm0
movapd 640(%rsp), %xmm2
subpd %xmm3, %xmm2
movapd %xmm2, 640(%rsp)
movapd 352(%rsp), %xmm12
movapd %xmm12, %xmm9
subpd 464(%rsp), %xmm9
movapd 416(%rsp), %xmm8
subpd %xmm8, %xmm12
movapd %xmm12, %xmm3
mulsd %xmm3, %xmm3
movapd %xmm9, %xmm2
mulsd %xmm2, %xmm2
unpcklpd %xmm2, %xmm3
movapd %xmm9, %xmm4
shufpd $1, %xmm4, %xmm4
movapd %xmm12, %xmm2
shufpd $1, %xmm2, %xmm2
mulsd %xmm4, %xmm4
mulsd %xmm2, %xmm2
unpcklpd %xmm4, %xmm2
addpd %xmm3, %xmm2
movapd 400(%rsp), %xmm1
movapd %xmm1, %xmm6
shufpd $1, %xmm6, %xmm6
movapd %xmm6, %xmm10
movapd 432(%rsp), %xmm3
subsd %xmm3, %xmm10
subsd %xmm1, %xmm6
movapd %xmm1, %xmm4
movapd %xmm3, %xmm1
subsd %xmm4, %xmm1
movapd %xmm1, %xmm13
movapd %xmm1, %xmm14
unpcklpd %xmm6, %xmm1
movapd %xmm1, 432(%rsp)
unpcklpd %xmm10, %xmm6
movapd %xmm6, %xmm1
mulpd %xmm1, %xmm1
addpd %xmm2, %xmm1
movapd 560(%rsp), %xmm2
addpd 448(%rsp), %xmm2
movapd %xmm2, 560(%rsp)
sqrtpd %xmm1, %xmm1
movapd %xmm1, %xmm2
mulpd %xmm2, %xmm2
mulpd %xmm1, %xmm2
movapd 624(%rsp), %xmm1
addsd %xmm5, %xmm1
movapd %xmm1, 624(%rsp)
movapd LCPI0_1(%rip), %xmm7
divpd %xmm2, %xmm7
movapd %xmm9, %xmm2
movapd 512(%rsp), %xmm5
mulpd %xmm5, %xmm2
movapd %xmm7, %xmm1
movhlps %xmm1, %xmm1
mulpd %xmm1, %xmm2
subpd %xmm2, %xmm0
movapd 480(%rsp), %xmm4
mulpd %xmm4, %xmm9
mulpd %xmm1, %xmm9
movapd 656(%rsp), %xmm1
subpd 144(%rsp), %xmm1
movapd %xmm1, 656(%rsp)
movapd %xmm7, %xmm1
shufpd $1, %xmm1, %xmm1
mulsd 256(%rsp), %xmm10
mulsd %xmm1, %xmm10
subpd %xmm11, %xmm15
movapd %xmm15, 592(%rsp)
movapd 464(%rsp), %xmm3
subpd %xmm8, %xmm3
movapd %xmm3, %xmm2
mulsd %xmm2, %xmm2
movapd %xmm3, %xmm1
shufpd $1, %xmm1, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm1
mulsd %xmm13, %xmm13
addsd %xmm1, %xmm13
xorps %xmm1, %xmm1
sqrtsd %xmm13, %xmm1
movapd %xmm1, %xmm2
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm2
movsd LCPI0_0(%rip), %xmm11
movapd %xmm11, %xmm1
divsd %xmm2, %xmm1
movapd 640(%rsp), %xmm13
subpd 160(%rsp), %xmm13
movapd %xmm13, 640(%rsp)
mulpd 32(%rsp), %xmm6
mulsd 224(%rsp), %xmm14
mulsd %xmm1, %xmm14
mulpd %xmm7, %xmm6
movapd %xmm1, 448(%rsp)
unpcklpd %xmm7, %xmm1
movapd %xmm4, %xmm2
mulpd %xmm12, %xmm2
movapd 528(%rsp), %xmm8
mulpd %xmm8, %xmm12
movlhps %xmm7, %xmm7
mulpd %xmm7, %xmm12
subpd %xmm12, %xmm0
movapd %xmm0, 368(%rsp)
mulpd %xmm2, %xmm7
addpd 560(%rsp), %xmm9
addpd 336(%rsp), %xmm7
movapd 608(%rsp), %xmm4
movapd %xmm4, %xmm2
addpd %xmm6, %xmm2
subpd %xmm6, %xmm4
addsd 624(%rsp), %xmm10
mulpd %xmm3, %xmm5
mulpd %xmm8, %xmm3
movaps 448(%rsp), %xmm0
movlhps %xmm0, %xmm0
mulpd %xmm0, %xmm3
subpd %xmm3, %xmm9
subsd %xmm14, %xmm10
mulpd %xmm5, %xmm0
movsd %xmm2, %xmm4
addpd %xmm7, %xmm0
movapd %xmm0, 448(%rsp)
movapd %xmm0, %xmm14
movapd 432(%rsp), %xmm0
mulpd 208(%rsp), %xmm0
mulpd %xmm0, %xmm1
movapd %xmm4, %xmm2
movupd %xmm9, 192(%rcx)
addpd %xmm1, %xmm2
subpd %xmm1, %xmm4
movsd %xmm2, %xmm4
movapd %xmm4, 608(%rsp)
movapd %xmm4, %xmm5
movsd %xmm10, 208(%rcx)
movupd 200(%rcx), %xmm2
movapd 656(%rsp), %xmm0
movapd LCPI0_1(%rip), %xmm3
mulpd %xmm3, %xmm0
addpd 320(%rsp), %xmm0
movapd %xmm0, 192(%rsp)
movapd %xmm0, %xmm7
movapd %xmm15, %xmm8
mulpd %xmm3, %xmm8
addpd 304(%rsp), %xmm8
movapd %xmm8, 80(%rsp)
movapd %xmm13, %xmm4
movapd 368(%rsp), %xmm10
mulpd %xmm3, %xmm10
addpd 352(%rsp), %xmm10
movapd %xmm10, 432(%rsp)
movupd 176(%rcx), %xmm1
mulpd %xmm3, %xmm4
mulsd %xmm11, %xmm9
movapd %xmm11, %xmm15
addsd 464(%rsp), %xmm9
movsd %xmm9, 168(%rcx)
mulpd %xmm3, %xmm2
addpd %xmm1, %xmm2
mulpd %xmm3, %xmm14
addpd 384(%rsp), %xmm4
movapd %xmm4, 176(%rsp)
addpd 416(%rsp), %xmm14
movapd %xmm14, 336(%rsp)
movapd %xmm5, %xmm9
mulpd %xmm3, %xmm9
addpd 400(%rsp), %xmm9
movapd %xmm9, 96(%rsp)
movapd %xmm7, %xmm11
subpd %xmm8, %xmm11
movapd %xmm4, %xmm6
shufpd $1, %xmm6, %xmm6
movapd %xmm11, %xmm0
mulsd %xmm0, %xmm0
movapd %xmm11, %xmm1
shufpd $1, %xmm1, %xmm1
mulsd %xmm1, %xmm1
addsd %xmm0, %xmm1
movapd %xmm6, %xmm0
movapd %xmm6, %xmm13
subsd %xmm4, %xmm0
movapd %xmm0, 624(%rsp)
mulsd %xmm0, %xmm0
addsd %xmm1, %xmm0
movapd %xmm9, %xmm1
movhlps %xmm1, %xmm1
movapd %xmm4, %xmm3
movapd %xmm4, %xmm12
subpd %xmm1, %xmm3
movapd %xmm3, 560(%rsp)
movapd %xmm3, %xmm4
movapd %xmm7, %xmm6
movapd %xmm6, %xmm5
subpd %xmm10, %xmm5
movapd %xmm5, 352(%rsp)
movapd %xmm8, %xmm3
subpd %xmm10, %xmm3
movapd %xmm3, 576(%rsp)
movapd %xmm3, %xmm1
unpcklpd %xmm5, %xmm1
mulpd %xmm1, %xmm1
unpckhpd %xmm5, %xmm3
mulpd %xmm3, %xmm3
addpd %xmm1, %xmm3
xorps %xmm1, %xmm1
sqrtsd %xmm0, %xmm1
movapd %xmm1, %xmm0
mulsd %xmm0, %xmm0
mulsd %xmm1, %xmm0
movapd %xmm4, %xmm1
mulpd %xmm1, %xmm1
addpd %xmm3, %xmm1
movupd %xmm2, 176(%rcx)
shufpd $1, %xmm2, %xmm2
subsd %xmm2, %xmm13
movapd %xmm13, 464(%rsp)
movupd 168(%rcx), %xmm2
movapd %xmm2, 400(%rsp)
movapd %xmm6, %xmm3
subpd %xmm2, %xmm3
movapd %xmm3, 416(%rsp)
movapd %xmm3, %xmm2
mulsd %xmm2, %xmm2
movapd %xmm3, %xmm4
shufpd $1, %xmm4, %xmm4
mulsd %xmm4, %xmm4
addsd %xmm2, %xmm4
movapd %xmm13, %xmm3
mulsd %xmm3, %xmm3
addsd %xmm4, %xmm3
sqrtpd %xmm1, %xmm2
movapd %xmm2, %xmm1
mulpd %xmm1, %xmm1
mulpd %xmm2, %xmm1
subpd %xmm14, %xmm6
movapd %xmm6, 128(%rsp)
movapd %xmm8, %xmm2
movapd %xmm8, %xmm7
subpd %xmm14, %xmm2
movapd %xmm2, 160(%rsp)
movapd %xmm2, %xmm4
unpcklpd %xmm6, %xmm4
mulpd %xmm4, %xmm4
movapd %xmm2, %xmm5
unpckhpd %xmm6, %xmm5
mulpd %xmm5, %xmm5
addpd %xmm4, %xmm5
movapd %xmm9, %xmm2
movapd %xmm2, %xmm4
movlhps %xmm4, %xmm4
movaps %xmm4, 144(%rsp)
movapd %xmm12, %xmm13
subpd %xmm4, %xmm13
movapd %xmm13, %xmm4
mulpd %xmm4, %xmm4
addpd %xmm5, %xmm4
divsd %xmm0, %xmm15
sqrtpd %xmm4, %xmm0
movapd %xmm0, %xmm4
mulpd %xmm4, %xmm4
mulpd %xmm0, %xmm4
movapd LCPI0_1(%rip), %xmm10
movapd %xmm10, %xmm0
divpd %xmm1, %xmm0
movapd %xmm0, 112(%rsp)
movapd %xmm10, %xmm8
divpd %xmm4, %xmm8
xorps %xmm0, %xmm0
sqrtsd %xmm3, %xmm0
movapd %xmm0, %xmm1
mulsd %xmm1, %xmm1
mulsd %xmm0, %xmm1
movsd %xmm1, 304(%rsp)
movapd %xmm11, %xmm5
mulpd 544(%rsp), %xmm5
movapd 624(%rsp), %xmm14
movapd %xmm14, %xmm0
mulsd 272(%rsp), %xmm0
movapd 288(%rsp), %xmm9
mulsd %xmm9, %xmm14
mulsd %xmm15, %xmm0
movapd %xmm0, 64(%rsp)
mulsd %xmm15, %xmm14
movapd %xmm14, %xmm3
movlhps %xmm15, %xmm15
movapd 496(%rsp), %xmm4
mulpd %xmm4, %xmm11
mulpd %xmm15, %xmm5
mulpd %xmm15, %xmm11
movapd %xmm11, %xmm14
movapd %xmm2, %xmm0
movsd %xmm12, %xmm0
movapd %xmm0, %xmm1
movsd 184(%rcx), %xmm0
movsd %xmm0, %xmm2
movapd %xmm2, 384(%rsp)
movlhps %xmm0, %xmm0
movapd %xmm1, %xmm2
subpd %xmm0, %xmm2
movapd %xmm2, 320(%rsp)
movapd %xmm7, %xmm15
movapd 400(%rsp), %xmm0
subpd %xmm0, %xmm15
movapd 432(%rsp), %xmm6
subpd %xmm0, %xmm6
movapd %xmm6, 48(%rsp)
movapd %xmm15, %xmm0
unpcklpd %xmm6, %xmm0
mulpd %xmm0, %xmm0
movapd %xmm15, %xmm1
unpckhpd %xmm6, %xmm1
mulpd %xmm1, %xmm1
addpd %xmm0, %xmm1
movapd %xmm2, %xmm0
mulpd %xmm0, %xmm0
addpd %xmm1, %xmm0
movsd LCPI0_0(%rip), %xmm1
divsd 304(%rsp), %xmm1
sqrtpd %xmm0, %xmm0
movapd %xmm0, %xmm7
mulpd %xmm7, %xmm7
mulpd %xmm0, %xmm7
movapd 656(%rsp), %xmm0
subpd %xmm5, %xmm0
movapd %xmm0, 656(%rsp)
movapd %xmm10, %xmm11
divpd %xmm7, %xmm11
movapd %xmm11, 304(%rsp)
movapd 640(%rsp), %xmm0
addsd %xmm0, %xmm3
movapd %xmm3, 624(%rsp)
movapd %xmm0, %xmm2
shufpd $1, %xmm2, %xmm2
subsd 64(%rsp), %xmm2
movapd 352(%rsp), %xmm0
movapd 480(%rsp), %xmm7
mulpd %xmm7, %xmm0
movaps 112(%rsp), %xmm12
movaps %xmm12, %xmm3
movhlps %xmm3, %xmm3
movaps %xmm3, 64(%rsp)
mulpd %xmm3, %xmm0
movapd 656(%rsp), %xmm3
subpd %xmm0, %xmm3
movapd %xmm3, 656(%rsp)
movapd %xmm4, %xmm0
movapd 416(%rsp), %xmm6
mulpd %xmm6, %xmm0
movapd 512(%rsp), %xmm10
mulpd %xmm10, %xmm6
movapd 464(%rsp), %xmm4
movapd %xmm4, %xmm3
mulsd %xmm9, %xmm4
mulsd %xmm1, %xmm4
movapd %xmm4, 464(%rsp)
unpcklpd %xmm1, %xmm11
movapd %xmm1, %xmm9
movlhps %xmm9, %xmm9
mulpd %xmm9, %xmm6
movapd %xmm6, 416(%rsp)
mulpd %xmm0, %xmm9
movapd %xmm14, %xmm1
addpd 592(%rsp), %xmm1
movapd %xmm13, %xmm6
movapd 560(%rsp), %xmm14
unpckhpd %xmm14, %xmm6
movapd %xmm13, %xmm4
unpcklpd %xmm14, %xmm13
movapd %xmm13, 592(%rsp)
mulpd %xmm7, %xmm14
movapd %xmm7, %xmm13
movaps %xmm12, %xmm7
mulpd %xmm7, %xmm14
movapd %xmm8, %xmm5
unpckhpd %xmm7, %xmm5
mulpd 528(%rsp), %xmm4
mulpd %xmm8, %xmm4
movapd %xmm4, 640(%rsp)
movapd %xmm8, %xmm4
movapd %xmm8, %xmm12
unpcklpd %xmm7, %xmm8
movapd 576(%rsp), %xmm0
mulpd %xmm13, %xmm0
movlhps %xmm7, %xmm7
mulpd %xmm7, %xmm0
subpd %xmm0, %xmm1
movapd %xmm1, 560(%rsp)
movapd 624(%rsp), %xmm0
unpcklpd %xmm2, %xmm0
subpd %xmm14, %xmm0
movapd %xmm0, 624(%rsp)
movapd 240(%rsp), %xmm1
mulsd %xmm1, %xmm3
movapd 320(%rsp), %xmm0
mulsd %xmm1, %xmm0
unpcklpd %xmm3, %xmm0
mulpd %xmm11, %xmm0
movapd 656(%rsp), %xmm2
subpd 416(%rsp), %xmm2
movapd %xmm2, 656(%rsp)
movapd %xmm15, %xmm11
mulpd %xmm10, %xmm11
movaps 304(%rsp), %xmm2
movlhps %xmm2, %xmm2
movapd 544(%rsp), %xmm10
mulpd %xmm10, %xmm15
mulpd %xmm2, %xmm11
mulpd %xmm2, %xmm15
movapd 128(%rsp), %xmm13
movapd %xmm13, %xmm2
movapd 528(%rsp), %xmm3
mulpd %xmm3, %xmm2
movhlps %xmm4, %xmm4
mulpd %xmm4, %xmm2
movapd 496(%rsp), %xmm14
mulpd %xmm14, %xmm13
mulpd %xmm4, %xmm13
movupd 192(%rcx), %xmm1
addpd %xmm1, %xmm9
addpd 448(%rsp), %xmm13
movapd %xmm10, %xmm1
movapd %xmm10, %xmm4
movapd 160(%rsp), %xmm10
mulpd %xmm10, %xmm1
mulpd %xmm3, %xmm10
movlhps %xmm12, %xmm12
mulpd %xmm12, %xmm10
movapd %xmm10, %xmm3
mulpd %xmm1, %xmm12
addpd %xmm13, %xmm12
movapd 560(%rsp), %xmm13
subpd %xmm11, %xmm13
mulpd %xmm14, %xmm6
mulpd %xmm6, %xmm5
addpd 608(%rsp), %xmm5
movapd 624(%rsp), %xmm1
subpd %xmm0, %xmm1
movapd %xmm1, 624(%rsp)
movapd %xmm4, %xmm1
movapd 592(%rsp), %xmm0
mulpd %xmm1, %xmm0
mulpd %xmm0, %xmm8
addpd %xmm5, %xmm8
movapd 656(%rsp), %xmm0
subpd %xmm2, %xmm0
movapd %xmm0, 656(%rsp)
movapd 352(%rsp), %xmm2
mulpd %xmm14, %xmm2
mulpd 64(%rsp), %xmm2
addpd 368(%rsp), %xmm2
movapd 576(%rsp), %xmm10
mulpd %xmm1, %xmm10
mulpd %xmm7, %xmm10
addpd %xmm2, %xmm10
movapd 384(%rsp), %xmm4
subpd 144(%rsp), %xmm4
movapd %xmm4, 384(%rsp)
addpd %xmm9, %xmm15
movapd %xmm15, 608(%rsp)
movapd 432(%rsp), %xmm2
movapd 336(%rsp), %xmm0
subpd %xmm0, %xmm2
movapd 400(%rsp), %xmm7
subpd %xmm0, %xmm7
movapd %xmm7, %xmm0
unpcklpd %xmm2, %xmm0
mulpd %xmm0, %xmm0
movapd %xmm7, %xmm1
unpckhpd %xmm2, %xmm1
mulpd %xmm1, %xmm1
addpd %xmm0, %xmm1
movapd %xmm13, %xmm0
subpd %xmm3, %xmm0
movapd %xmm0, 560(%rsp)
movapd %xmm4, %xmm0
mulpd %xmm0, %xmm0
addpd %xmm1, %xmm0
sqrtpd %xmm0, %xmm0
movapd %xmm0, %xmm1
mulpd %xmm1, %xmm1
mulpd %xmm0, %xmm1
movapd LCPI0_1(%rip), %xmm15
divpd %xmm1, %xmm15
movapd 320(%rsp), %xmm5
movapd %xmm5, %xmm14
mulpd (%rsp), %xmm14
movapd 304(%rsp), %xmm11
mulpd %xmm11, %xmm14
movapd %xmm15, %xmm0
unpckhpd %xmm11, %xmm0
movapd 480(%rsp), %xmm3
movapd %xmm3, %xmm1
movapd 48(%rsp), %xmm4
mulpd %xmm4, %xmm1
movapd 512(%rsp), %xmm13
mulpd %xmm13, %xmm4
movhlps %xmm11, %xmm11
mulpd %xmm11, %xmm4
subpd %xmm4, %xmm10
mulpd %xmm1, %xmm11
movapd 624(%rsp), %xmm9
subpd 640(%rsp), %xmm9
movapd %xmm9, 640(%rsp)
movapd %xmm2, %xmm1
movapd 528(%rsp), %xmm6
mulpd %xmm6, %xmm1
movapd %xmm15, %xmm4
movhlps %xmm4, %xmm4
mulpd %xmm3, %xmm2
mulpd %xmm4, %xmm1
mulpd %xmm4, %xmm2
addpd %xmm12, %xmm2
shufpd $1, %xmm5, %xmm5
mulsd 240(%rsp), %xmm5
movapd %xmm5, %xmm4
movapd 384(%rsp), %xmm5
movapd %xmm5, %xmm3
shufpd $1, %xmm3, %xmm3
mulsd 256(%rsp), %xmm3
unpcklpd %xmm4, %xmm3
subpd %xmm1, %xmm10
movapd %xmm10, 576(%rsp)
movapd 464(%rsp), %xmm1
addsd 208(%rcx), %xmm1
addsd %xmm14, %xmm1
shufpd $1, %xmm14, %xmm14
addsd %xmm1, %xmm14
mulpd %xmm0, %xmm3
movapd %xmm8, %xmm0
addpd %xmm3, %xmm0
subpd %xmm3, %xmm8
addpd 608(%rsp), %xmm11
movsd %xmm0, %xmm8
movapd %xmm5, %xmm1
movapd %xmm1, %xmm0
mulsd 224(%rsp), %xmm0
mulsd %xmm15, %xmm0
subsd %xmm0, %xmm14
mulpd %xmm7, %xmm13
mulpd %xmm6, %xmm7
mulpd 208(%rsp), %xmm1
mulpd %xmm15, %xmm1
movlhps %xmm15, %xmm15
mulpd %xmm15, %xmm7
mulpd %xmm13, %xmm15
subpd %xmm7, %xmm11
movupd %xmm11, 192(%rcx)
addpd %xmm2, %xmm15
movapd %xmm8, %xmm0
addpd %xmm1, %xmm0
movsd %xmm14, 208(%rcx)
subpd %xmm1, %xmm8
movapd 656(%rsp), %xmm7
movapd LCPI0_1(%rip), %xmm3
mulpd %xmm3, %xmm7
addpd 192(%rsp), %xmm7
movapd 560(%rsp), %xmm5
mulpd %xmm3, %xmm5
addpd 80(%rsp), %xmm5
movapd %xmm9, %xmm12
mulpd %xmm3, %xmm12
addpd 176(%rsp), %xmm12
mulpd %xmm3, %xmm10
addpd 432(%rsp), %xmm10
movapd %xmm11, %xmm2
mulsd LCPI0_0(%rip), %xmm2
addsd 400(%rsp), %xmm2
movupd 176(%rcx), %xmm4
movupd 200(%rcx), %xmm1
movsd %xmm2, 168(%rcx)
mulpd %xmm3, %xmm1
addpd %xmm4, %xmm1
movapd %xmm15, %xmm4
mulpd %xmm3, %xmm4
addpd 336(%rsp), %xmm4
movapd %xmm8, %xmm6
movsd %xmm0, %xmm6
movapd %xmm6, %xmm9
mulpd %xmm3, %xmm9
addpd 96(%rsp), %xmm9
movupd %xmm1, 176(%rcx)
addq $-2, %rdx
shufpd $1, %xmm1, %xmm1
jne .LBB0_5
movupd %xmm7, (%rcx)
movhpd %xmm12, 16(%rcx)
movupd %xmm5, 56(%rcx)
movlpd %xmm12, 72(%rcx)
movaps 656(%rsp), %xmm1
movups %xmm1, 24(%rcx)
movapd 640(%rsp), %xmm2
movhpd %xmm2, 40(%rcx)
movaps 560(%rsp), %xmm1
movups %xmm1, 80(%rcx)
movlpd %xmm2, 96(%rcx)
movupd %xmm10, 112(%rcx)
movhpd %xmm9, 128(%rcx)
movapd 576(%rsp), %xmm1
movupd %xmm1, 136(%rcx)
movhpd %xmm8, 152(%rcx)
movupd %xmm4, 224(%rcx)
movlpd %xmm9, 240(%rcx)
movupd %xmm15, 248(%rcx)
movlpd %xmm0, 264(%rcx)
.LBB0_7:
movaps 672(%rsp), %xmm6
movaps 688(%rsp), %xmm7
movaps 704(%rsp), %xmm8
movaps 720(%rsp), %xmm9
movaps 736(%rsp), %xmm10
movaps 752(%rsp), %xmm11
movaps 768(%rsp), %xmm12
movaps 784(%rsp), %xmm13
movaps 800(%rsp), %xmm14
movaps 816(%rsp), %xmm15
addq $840, %rsp
retq
This asm is partially unrolled, but it's also a bit messy and too long. I think it has been unrolled in a bad way (there are three nested loops after all, and you want to unroll only the inner two).
I’d like to annotate the code of this kernel like this (leaving the outer loop unmodified):
fn advance(bodies: &mut Planets, dt: f64, n: usize) {
    #[unroll(never)] for _ in 0 .. n {
        #[unroll(try_full)] for i in 0 .. bodies.len() {
            #[unroll(try_full)] for j in i + 1 .. bodies.len() {
                let dx = bodies[i].x - bodies[j].x;
                let dy = bodies[i].y - bodies[j].y;
                let dz = bodies[i].z - bodies[j].z;
                let distance = (dx * dx + dy * dy + dz * dz).sqrt();
                let mag = dt / (distance * distance * distance);
                bodies[i].vx -= dx * bodies[j].mass * mag;
                bodies[i].vy -= dy * bodies[j].mass * mag;
                bodies[i].vz -= dz * bodies[j].mass * mag;
                bodies[j].vx += dx * bodies[i].mass * mag;
                bodies[j].vy += dy * bodies[i].mass * mag;
                bodies[j].vz += dz * bodies[i].mass * mag;
            }
        }
        for b in bodies.iter_mut() {
            b.x += dt * b.vx;
            b.y += dt * b.vy;
            b.z += dt * b.vz;
        }
    }
}
Note: if LLVM doesn't have an annotation to unroll a single specific loop, then the Rust front-end has to implement the unrolling itself…
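For reference, Clang already exposes this kind of per-loop control with `#pragma clang loop unroll_count(N)`, which is lowered to per-loop `llvm.loop.unroll` metadata, so the backend support appears to exist. Until something like `#[unroll(N)]` lands in Rust, the only portable workaround I know of is unrolling by hand. Below is a minimal sketch (the function name `dot` and the factor 4 are my own choices, not part of the proposal): processing the data through `chunks_exact(4)` gives the inner block a statically known trip count, which current rustc/LLVM reliably flattens:

```rust
// Hand-unrolled dot product: a stand-in for the kind of
// computational kernel discussed above.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    // Four independent accumulators also break the serial
    // dependency chain between additions.
    let mut acc = [0.0f64; 4];
    let main = a.len() / 4 * 4;
    for (ca, cb) in a[..main].chunks_exact(4).zip(b[..main].chunks_exact(4)) {
        // Fixed trip count: the compiler unrolls this fully.
        for k in 0 .. 4 {
            acc[k] += ca[k] * cb[k];
        }
    }
    let mut sum: f64 = acc.iter().sum();
    // Scalar tail for the leftover elements.
    for i in main .. a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a: Vec<f64> = (0 .. 10).map(|i| i as f64).collect();
    let b = vec![2.0f64; 10];
    println!("{}", dot(&a, &b)); // (0 + 1 + ... + 9) * 2 = 90
}
```

This is exactly the boilerplate that a `#[unroll(4)]` annotation would make unnecessary, and unlike the annotation it changes the source structure and the floating-point summation order.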