Loop unrolling on request

In theory a modern compiler should not need annotations like:

#[inline]
#[inline(always)]

Because in theory a modern smart compiler knows better than the programmer (expecially with PGO https://en.wikipedia.org/wiki/Profile-guided_optimization ). In practice I’ve already seen they are sometimes useful in Rust code. And sometimes I’ve used the opposite annotation:

#[inline(never)]

To have a clean assembly listing of a function (reading the same asm of an inline function is harder).

From my experience with D programming, and from the small experience of Rust, I’ve seen a similar situation with loop unrolling. This means I’ve seen many situations where the compiler has not unrolled or has not unrolled a loop enough. With some tricks to force a higher unrolling I’ve seen several times the computational kernels my code get faster on my PC, both on older and newer CPUs.

So I suggest to add a way to ask the Rust compiler to unroll a specific loop (so this is an enhancement request).

This post is quite long because I try to give some evidence for my claims and request.


Currently there is a way to ask LLVM to unroll loops more, like this:

rustc -C llvm-args="-unroll-threshold=1000" my_code.r

But this is a blunt tool, that unrolls loops in the whole module (or the whole crate). If you unroll loops randomly, you often get a slowdown of your code. So you want a way to unroll only specific loops.


A possible syntax to do it (that uses #![feature(stmt_expr_attributes)] ):

This is just a generic hint to unroll a loop more:
#[unroll]

Disables unrolling for a loop:
#[unroll(never)]

asks the compiler to try to fully unroll a specific loop:
#[unroll(try_full)]

(A #[unroll(full)] annotations is also possible, but sounds dangerous, and it can’t be used with dynamic loops, where the number of loops is not statically known).

Asks the compiler to unroll the loop N times:
#[unroll(N)]

Example usages:

#[unroll(8)] for x in ....

#[unroll(try_full)] while pred { ....


And now two examples to show my point.

This first program computes the integer square root on u64 numbers. The basic function is isqrt1(), while in isqrt2() I have manually unrolled the loop:

// isqrt from:
// http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots

use std::u32;
use std::mem::size_of;

type Tin = u64;
type Tout = u32;

fn isqrt1(mut a: u64) -> Tout {
    let mut rem: Tin = 0;
    let mut root: Tin = 0;
    for _ in 0 .. size_of::<Tin>() * 4 {
        root <<= 1;
        rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
        a <<= 2;
        root += 1;
        if root <= rem {
            rem -= root;
            root += 1;
        } else {
            root -= 1;
        }
    }
    return (root >> 1) as Tout;
}

fn isqrt2(mut a: u64) -> Tout {
    let mut rem: Tin = 0;
    let mut root: Tin = 0;

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    a <<= 2;
    root += 1;
    if root <= rem {
        rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    root <<= 1;
    rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
    //a <<= 2;
    root += 1;
    if root <= rem {
        //rem -= root;
        root += 1;
    } else {
        root -= 1;
    }

    return (root >> 1) as Tout;
}

fn main() {
    let n_iter = 10_000_000;
    let mut n: u64 = u32::MAX as u64;
    let mut result: u64 = 0;
    for _ in 0 .. n_iter {
        result += isqrt1(n) as u64;
        n += u32::MAX as u64;
    }
    println!("{}", result);
}

Manually unrolling a loop is ugly, bug-prone, and it ties this isqrt2 to a specific data type (if I use u32 as inputs then the loop has to run only 16 times).

The generated asm, using this on a modern 64 bit i7 CPU:

rustc -C opt-level=3 -C target-cpu=native --emit asm test.rs

For isqrt1:

_ZN5sqrt120hb91a06ee8b9d7345GbaE:
    pushq   %r14
    pushq   %rsi
    pushq   %rdi
    pushq   %rbx
    xorl    %r8d, %r8d
    movl    $32, %edx
    movl    $572, %r9d
    movl    $570, %r10d
    movl    $568, %r11d
    xorl    %edi, %edi
    xorl    %eax, %eax
    .align  16, 0x90
.LBB0_1:
    shldq   $2, %rcx, %rdi
    movq    %rcx, %r14
    shlq    $8, %r14
    leaq    1(%rax,%rax), %rbx
    cmpq    %rbx, %rdi
    leaq    2(%rax,%rax), %rsi
    leaq    (%rax,%rax), %rax
    cmovaeq %rsi, %rax
    cmovbq  %r8, %rbx
    subq    %rbx, %rdi
    shlq    $2, %rdi
    bextrq  %r9, %rcx, %rsi
    orq %rdi, %rsi
    leaq    1(%rax,%rax), %rdi
    cmpq    %rdi, %rsi
    leaq    2(%rax,%rax), %rbx
    leaq    (%rax,%rax), %rax
    cmovaeq %rbx, %rax
    cmovbq  %r8, %rdi
    subq    %rdi, %rsi
    shlq    $2, %rsi
    bextrq  %r10, %rcx, %rbx
    orq %rsi, %rbx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rbx
    leaq    2(%rax,%rax), %rdi
    leaq    (%rax,%rax), %rax
    cmovaeq %rdi, %rax
    cmovbq  %r8, %rsi
    subq    %rsi, %rbx
    shlq    $2, %rbx
    bextrq  %r11, %rcx, %rdi
    orq %rbx, %rdi
    leaq    1(%rax,%rax), %rcx
    cmpq    %rcx, %rdi
    leaq    2(%rax,%rax), %rsi
    leaq    (%rax,%rax), %rax
    cmovaeq %rsi, %rax
    cmovbq  %r8, %rcx
    subq    %rcx, %rdi
    movq    %r14, %rcx
    addq    $-4, %rdx
    jne .LBB0_1
    shrq    %rax
    popq    %rbx
    popq    %rdi
    popq    %rsi
    popq    %r14
    retq

For isqrt2:

_ZN5sqrt220hb052487376d2024e4caE:
    pushq   %rsi
    movq    %rcx, %rdx
    shrq    $62, %rdx
    movl    %ecx, %r8d
    andl    $3, %r8d
    testq   %rdx, %rdx
    setne   %al
    movzbl  %al, %r9d
    leaq    -4(,%rdx,4), %rax
    cmoveq  %rdx, %rax
    movl    $572, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rax, %rdx
    leaq    1(,%r9,4), %rsi
    xorl    %r10d, %r10d
    cmpq    %rsi, %rdx
    leaq    2(,%r9,4), %r11
    leaq    (,%r9,4), %rax
    cmovaeq %r11, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $570, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $568, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $566, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $564, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $562, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $560, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $558, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $556, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $554, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $552, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $550, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $548, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $546, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $544, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $542, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $540, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $538, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $536, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $534, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $532, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $530, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $528, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $526, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $524, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $522, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $520, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $518, %esi
    bextrq  %rsi, %rcx, %rsi
    orq %rdx, %rsi
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rsi
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rsi
    shlq    $2, %rsi
    movl    $516, %edx
    bextrq  %rdx, %rcx, %rdx
    orq %rsi, %rdx
    leaq    1(%rax,%rax), %rsi
    cmpq    %rsi, %rdx
    leaq    2(%rax,%rax), %r9
    leaq    (%rax,%rax), %rax
    cmovaeq %r9, %rax
    cmovbq  %r10, %rsi
    subq    %rsi, %rdx
    shlq    $2, %rdx
    movl    $514, %esi
    bextrq  %rsi, %rcx, %rcx
    orq %rdx, %rcx
    leaq    1(%rax,%rax), %rdx
    cmpq    %rdx, %rcx
    leaq    2(%rax,%rax), %rsi
    leaq    (%rax,%rax), %rax
    cmovaeq %rsi, %rax
    cmovbq  %r10, %rdx
    subq    %rdx, %rcx
    shlq    $2, %rcx
    orq %r8, %rcx
    leaq    1(%rax,%rax), %rdx
    cmpq    %rcx, %rdx
    leaq    2(%rax,%rax), %rcx
    leaq    (%rax,%rax), %rax
    cmovbeq %rcx, %rax
    shrq    %rax
    popq    %rsi
    retq

As you see the isqrt1 was unrolled 4 times. But the run-time shows it’s not enough.

On my system with n_iter=10_000_000, with isqrt1 takes 0.74 seconds, with isqrt2 it takes 0.26 seconds.

Both return the result: 1381620290024350

This annotation should suffice to fully unroll the loop, and keep the code more generic:

fn isqrt1b(mut a: u64) -> Tout {
    let mut rem: Tin = 0;
    let mut root: Tin = 0;
    #[unroll(try_full)] for _ in 0 .. size_of::<Tin>() * 4 {
        root <<= 1;
        rem = (rem << 2) + (a >> (size_of::<Tin>() * 8 - 2));
        a <<= 2;
        root += 1;
        if root <= rem {
            rem -= root;
            root += 1;
        } else {
            root -= 1;
        }
    }
    return (root >> 1) as Tout;
}

Extra note: this is also a situations where I’d really like a “typeof” in Rust, so I can write that code like the D one:

import std.typetuple: TypeTuple;

template Iota(uint stop) {
    static if (stop <= 0)
        alias Iota = TypeTuple!();
    else
        alias Iota = TypeTuple!(Iota!(stop - 1), stop - 1);
}

uint isqrt1c(ulong a) pure nothrow @safe @nogc {
    alias T = typeof(a);
    T rem = 0;
    T root = 0;
    foreach (immutable _; Iota!(T.sizeof * 4)) {
        root <<= 1;
        rem = (rem << 2) + (a >> (T.sizeof * 8 - 2));
        a <<= 2;
        root++;
        if (root <= rem) {
            rem -= root;
            root++;
        } else {
            root--;
        }
    }
    return cast(typeof(return))(root >> 1);
}

void main() {
    import std.stdio;
    enum size_t nIter = 10_000_000;
    ulong n = uint.max;
    ulong result = 0;
    foreach (immutable _; 0 .. nIter) {
        result += n.isqrt1c;
        n += uint.max;
    }
    result.writeln;
}

That Iota() is a simple way to force full loop unrolling in D language (the foreach is static).

The D code also uses a cast(typeof(return)) that makes the code more DRY. typeof(return) is the type of the return of the function, in this case an “uint” (u32 in Rust).


As second example I use this N-body benchmark: http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=nbody

In that page there are two Rust versions, that have a different run-time and efficiency.

This is the less fast, the Rust #1: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=nbody&lang=rust&id=1

The faster of the two is the Rust #2: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=nbody&lang=rust&id=2

In my benchmarks I give an input: 20000000

And the output is:

-0.169075164
-0.169031665

I compile the two programs with (the second program gets a bit slower with LTO, I don’t know why):

rustc -C opt-level=3 -C lto nbody1.rs
rustc -C opt-level=3 nbody2.rs

The run-time for the two programs on my PC is, in seconds:

Rust #1: 2.11 s
Rust #2: 1.62 s

This is a version of mine of the (slower) Rust #1 program, I’ve simplified it, making the code more straightforward, similar to a simple version in C language:


// Rust #3
// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/

use std::f64::consts::PI;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const N_BODIES: usize = 5;

struct Planet {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64
}

type Planets = [Planet; N_BODIES];

fn advance(bodies: &mut Planets, dt: f64, n: usize) {
    for _ in 0 .. n {
        for i in 0 .. bodies.len() {
            for j in i + 1 .. bodies.len() {
                let dx = bodies[i].x - bodies[j].x;
                let dy = bodies[i].y - bodies[j].y;
                let dz = bodies[i].z - bodies[j].z;

                let distance = (dx * dx + dy * dy + dz * dz).sqrt();
                let mag = dt / (distance * distance * distance);

                bodies[i].vx -= dx * bodies[j].mass * mag;
                bodies[i].vy -= dy * bodies[j].mass * mag;
                bodies[i].vz -= dz * bodies[j].mass * mag;

                bodies[j].vx += dx * bodies[i].mass * mag;
                bodies[j].vy += dy * bodies[i].mass * mag;
                bodies[j].vz += dz * bodies[i].mass * mag;
            }
        }

        for b in bodies.iter_mut() {
            b.x += dt * b.vx;
            b.y += dt * b.vy;
            b.z += dt * b.vz;
        }
    }
}

fn energy(bodies: &Planets) -> f64 {
    let mut result = 0.0;

    for (i, bi) in bodies.iter().enumerate() {
        result += (bi.vx * bi.vx + bi.vy * bi.vy + bi.vz * bi.vz) * bi.mass / 2.0;
        for bj in bodies[i + 1 ..].iter() {
            let dx = bi.x - bj.x;
            let dy = bi.y - bj.y;
            let dz = bi.z - bj.z;
            let distance = (dx * dx + dy * dy + dz * dz).sqrt();
            result -= bi.mass * bj.mass / distance;
        }
    }
    result
}

fn offset_momentum(bodies: &mut Planets) {
    let (mut px, mut py, mut pz) = (0.0, 0.0, 0.0);
    for b in bodies.iter() {
        px += b.vx * b.mass;
        py += b.vy * b.mass;
        pz += b.vz * b.mass;
    }
    bodies[0].vx = - px / SOLAR_MASS;
    bodies[0].vy = - py / SOLAR_MASS;
    bodies[0].vz = - pz / SOLAR_MASS;
}

fn main() {
    const YEAR: f64 = 365.24;
    let n = std::env::args().nth(1).unwrap().parse().unwrap_or(1000);

    let mut bodies: Planets = [
        Planet { // Sun
            x: 0.0, y: 0.0, z: 0.0,
            vx: 0.0, vy: 0.0, vz: 0.0,
            mass: SOLAR_MASS
        },
        Planet { // Jupiter
            x: 4.84143144246472090e+00,
            y: -1.16032004402742839e+00,
            z: -1.03622044471123109e-01,
            vx: 1.66007664274403694e-03 * YEAR,
            vy: 7.69901118419740425e-03 * YEAR,
            vz: -6.90460016972063023e-05 * YEAR,
            mass: 9.54791938424326609e-04 * SOLAR_MASS
        },
        Planet { // Saturn
            x: 8.34336671824457987e+00,
            y: 4.12479856412430479e+00,
            z: -4.03523417114321381e-01,
            vx: -2.76742510726862411e-03 * YEAR,
            vy: 4.99852801234917238e-03 * YEAR,
            vz: 2.30417297573763929e-05 * YEAR,
            mass: 2.85885980666130812e-04 * SOLAR_MASS
        },
        Planet { // Uranus
            x: 1.28943695621391310e+01,
            y: -1.51111514016986312e+01,
            z: -2.23307578892655734e-01,
            vx: 2.96460137564761618e-03 * YEAR,
            vy: 2.37847173959480950e-03 * YEAR,
            vz: -2.96589568540237556e-05 * YEAR,
            mass: 4.36624404335156298e-05 * SOLAR_MASS
        },
        Planet { // Neptune
            x: 1.53796971148509165e+01,
            y: -2.59193146099879641e+01,
            z: 1.79258772950371181e-01,
            vx: 2.68067772490389322e-03 * YEAR,
            vy: 1.62824170038242295e-03 * YEAR,
            vz: -9.51592254519715870e-05 * YEAR,
            mass: 5.15138902046611451e-05 * SOLAR_MASS
        }
    ];

    offset_momentum(&mut bodies);
    println!("{:.9}", energy(&bodies));
    advance(&mut bodies, 0.01, n);
    println!("{:.9}", energy(&bodies));
}

And this is the same code, where I have manually fully unrolled the loops in the advance() function:


// Rust #4
// The Computer Language Benchmarks Game
// http://benchmarksgame.alioth.debian.org/

use std::f64::consts::PI;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const N_BODIES: usize = 5;

struct Planet {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64
}

type Planets = [Planet; N_BODIES];

fn advance(mut bodies: &mut Planets, dt: f64) {
    {
        let mut f = |i: usize, j: usize| {
            let dx = bodies[i].x - bodies[j].x;
            let dy = bodies[i].y - bodies[j].y;
            let dz = bodies[i].z - bodies[j].z;

            let distance = (dx * dx + dy * dy + dz * dz).sqrt();
            let mag = dt / (distance * distance * distance);

            bodies[i].vx -= dx * bodies[j].mass * mag;
            bodies[i].vy -= dy * bodies[j].mass * mag;
            bodies[i].vz -= dz * bodies[j].mass * mag;

            bodies[j].vx += dx * bodies[i].mass * mag;
            bodies[j].vy += dy * bodies[i].mass * mag;
            bodies[j].vz += dz * bodies[i].mass * mag;
        };

        // Manual loop unroll.
        f(0, 1); f(0, 2); f(0, 3); f(0, 4); f(1, 2);
        f(1, 3); f(1, 4); f(2, 3); f(2, 4); f(3, 4);
    }

    for b in bodies.iter_mut() {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}

fn energy(bodies: &Planets) -> f64 {
    let mut result = 0.0;
    for (i, bi) in bodies.iter().enumerate() {
        result += (bi.vx * bi.vx + bi.vy * bi.vy + bi.vz * bi.vz) * bi.mass / 2.0;
        for bj in bodies[i + 1 ..].iter() {
            let dx = bi.x - bj.x;
            let dy = bi.y - bj.y;
            let dz = bi.z - bj.z;
            let distance = (dx * dx + dy * dy + dz * dz).sqrt();
            result -= bi.mass * bj.mass / distance;
        }
    }
    result
}

fn offset_momentum(bodies: &mut Planets) {
    let (mut px, mut py, mut pz) = (0.0, 0.0, 0.0);
    for b in bodies.iter() {
        px += b.vx * b.mass;
        py += b.vy * b.mass;
        pz += b.vz * b.mass;
    }
    bodies[0].vx = - px / SOLAR_MASS;
    bodies[0].vy = - py / SOLAR_MASS;
    bodies[0].vz = - pz / SOLAR_MASS;
}

fn main() {
    const YEAR: f64 = 365.24;
    let n = std::env::args().nth(1).unwrap().parse().unwrap_or(1000);

    let mut bodies: Planets = [
        Planet { // Sun
            x: 0.0, y: 0.0, z: 0.0,
            vx: 0.0, vy: 0.0, vz: 0.0,
            mass: SOLAR_MASS
        },
        Planet { // Jupiter
            x: 4.84143144246472090e+00,
            y: -1.16032004402742839e+00,
            z: -1.03622044471123109e-01,
            vx: 1.66007664274403694e-03 * YEAR,
            vy: 7.69901118419740425e-03 * YEAR,
            vz: -6.90460016972063023e-05 * YEAR,
            mass: 9.54791938424326609e-04 * SOLAR_MASS
        },
        Planet { // Saturn
            x: 8.34336671824457987e+00,
            y: 4.12479856412430479e+00,
            z: -4.03523417114321381e-01,
            vx: -2.76742510726862411e-03 * YEAR,
            vy: 4.99852801234917238e-03 * YEAR,
            vz: 2.30417297573763929e-05 * YEAR,
            mass: 2.85885980666130812e-04 * SOLAR_MASS
        },
        Planet { // Uranus
            x: 1.28943695621391310e+01,
            y: -1.51111514016986312e+01,
            z: -2.23307578892655734e-01,
            vx: 2.96460137564761618e-03 * YEAR,
            vy: 2.37847173959480950e-03 * YEAR,
            vz: -2.96589568540237556e-05 * YEAR,
            mass: 4.36624404335156298e-05 * SOLAR_MASS
        },
        Planet { // Neptune
            x: 1.53796971148509165e+01,
            y: -2.59193146099879641e+01,
            z: 1.79258772950371181e-01,
            vx: 2.68067772490389322e-03 * YEAR,
            vy: 1.62824170038242295e-03 * YEAR,
            vz: -9.51592254519715870e-05 * YEAR,
            mass: 5.15138902046611451e-05 * SOLAR_MASS
        }
    ];

    offset_momentum(&mut bodies);
    println!("{:.9}", energy(&bodies));
    for _ in 0 .. n {
        advance(&mut bodies, 0.01);
    }
    println!("{:.9}", energy(&bodies));
}

I compile the two programs with (both with LTO):

rustc -C opt-level=3 -C lto nbody_simple.rs
rustc -C opt-level=3 -C lto nbody_simple_unrolled.rs

The output is unchanged. The run-time is, in seconds:

Rust #3: 1.96 s (basic)
Rust #4: 1.27 s (basic manually unrolled)

On my system this unrolled Rust#4 version is faster than the fastest C++ entry on the benchmark page (but on your system the results could be different).

(Also note that the “basic” version is faster than the more complex Rust #1 that’s on the Computer Game site. Straightforward code is sometimes better).

If I ask a more aggressive unrolling of the basic Rust#3 version with this:

rustc -C opt-level=3 -C llvm-args="-unroll-threshold=1000" nbody_simple.rs

I get an improvement:

Rust #3: 1.54 s (basic with auto-unrolling)

As you see the speed improvement with “-unroll-threshold” is limited, and LLVM unrolls loops in the energy() and the offset_momentum() functions, bloating the binary with no point (because those two functions have a run-time that’s irrelevent in this program).

If I add a #[inline(never)] annotation to the advance() function in the Rust#3 and Rust#4 programs I get their asm:

rustc -C opt-level=3 -C lto --emit asm nbody_simple.rs
rustc -C opt-level=3 -C lto --emit asm nbody_simple_unrolled.rs

The not-inlined advance() function of the basic Rust#3:

_ZN7advance20hc525da322d48cc5eLaaE:
    pushq   %r14
    pushq   %rsi
    pushq   %rdi
    pushq   %rbx
    subq    $104, %rsp
    movapd  %xmm11, 80(%rsp)
    movapd  %xmm10, 64(%rsp)
    movapd  %xmm9, 48(%rsp)
    movapd  %xmm8, 32(%rsp)
    movapd  %xmm7, 16(%rsp)
    movapd  %xmm6, (%rsp)
    testq   %rdx, %rdx
    je  .LBB0_8
    leaq    104(%rcx), %r8
    xorl    %r9d, %r9d
    movsd   LCPI0_0(%rip), %xmm9
    movapd  LCPI0_1(%rip), %xmm8
    .align  16, 0x90
.LBB0_2:
    movq    %r8, %r11
    xorl    %r10d, %r10d
    .align  16, 0x90
.LBB0_4:
    movq    %r10, %r14
    leaq    1(%r14), %r10
    cmpq    $4, %r10
    ja  .LBB0_3
    imulq   $56, %r14, %rax
    movupd  (%rcx,%rax), %xmm10
    movsd   16(%rcx,%rax), %xmm11
    movsd   48(%rcx,%rax), %xmm4
    leaq    24(%rcx,%rax), %rsi
    leaq    40(%rcx,%rax), %rdi
    movapd  %xmm4, %xmm5
    movlhps %xmm5, %xmm5
    movq    %r11, %rbx
    movl    $4, %eax
    .align  16, 0x90
.LBB0_6:
    movupd  -48(%rbx), %xmm0
    movapd  %xmm10, %xmm6
    subpd   %xmm0, %xmm6
    movapd  %xmm11, %xmm7
    subsd   -32(%rbx), %xmm7
    movapd  %xmm6, %xmm0
    mulsd   %xmm0, %xmm0
    movapd  %xmm6, %xmm1
    shufpd  $1, %xmm1, %xmm1
    mulsd   %xmm1, %xmm1
    addsd   %xmm0, %xmm1
    movapd  %xmm7, %xmm0
    mulsd   %xmm0, %xmm0
    addsd   %xmm1, %xmm0
    sqrtsd  %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm1, %xmm1
    mulsd   %xmm0, %xmm1
    movapd  %xmm9, %xmm0
    divsd   %xmm1, %xmm0
    movsd   (%rbx), %xmm1
    movupd  (%rsi), %xmm2
    movapd  %xmm7, %xmm3
    mulsd   %xmm1, %xmm3
    movlhps %xmm1, %xmm1
    mulpd   %xmm6, %xmm1
    mulsd   %xmm0, %xmm3
    mulsd   %xmm4, %xmm7
    mulsd   %xmm0, %xmm7
    movlhps %xmm0, %xmm0
    mulpd   %xmm0, %xmm1
    subpd   %xmm1, %xmm2
    movupd  %xmm2, (%rsi)
    movsd   (%rdi), %xmm1
    subsd   %xmm3, %xmm1
    movsd   %xmm1, (%rdi)
    movupd  -24(%rbx), %xmm1
    mulpd   %xmm5, %xmm6
    mulpd   %xmm0, %xmm6
    addpd   %xmm1, %xmm6
    movupd  %xmm6, -24(%rbx)
    addsd   -8(%rbx), %xmm7
    movsd   %xmm7, -8(%rbx)
    decq    %rax
    addq    $56, %rbx
    cmpq    %r14, %rax
    jne .LBB0_6
.LBB0_3:
    addq    $56, %r11
    cmpq    $5, %r10
    jne .LBB0_4
    movupd  (%rcx), %xmm0
    movupd  24(%rcx), %xmm1
    mulpd   %xmm8, %xmm1
    addpd   %xmm0, %xmm1
    movupd  %xmm1, (%rcx)
    movsd   40(%rcx), %xmm0
    mulsd   %xmm9, %xmm0
    addsd   16(%rcx), %xmm0
    movsd   %xmm0, 16(%rcx)
    movupd  56(%rcx), %xmm0
    movupd  80(%rcx), %xmm1
    mulpd   %xmm8, %xmm1
    addpd   %xmm0, %xmm1
    movupd  %xmm1, 56(%rcx)
    movsd   96(%rcx), %xmm0
    mulsd   %xmm9, %xmm0
    addsd   72(%rcx), %xmm0
    movsd   %xmm0, 72(%rcx)
    movupd  112(%rcx), %xmm0
    movupd  136(%rcx), %xmm1
    mulpd   %xmm8, %xmm1
    addpd   %xmm0, %xmm1
    movupd  %xmm1, 112(%rcx)
    movsd   152(%rcx), %xmm0
    mulsd   %xmm9, %xmm0
    addsd   128(%rcx), %xmm0
    movsd   %xmm0, 128(%rcx)
    movupd  168(%rcx), %xmm0
    movupd  192(%rcx), %xmm1
    mulpd   %xmm8, %xmm1
    addpd   %xmm0, %xmm1
    movupd  %xmm1, 168(%rcx)
    movsd   208(%rcx), %xmm0
    mulsd   %xmm9, %xmm0
    addsd   184(%rcx), %xmm0
    movsd   %xmm0, 184(%rcx)
    movupd  224(%rcx), %xmm0
    movupd  248(%rcx), %xmm1
    mulpd   %xmm8, %xmm1
    addpd   %xmm0, %xmm1
    movupd  %xmm1, 224(%rcx)
    movsd   264(%rcx), %xmm0
    mulsd   %xmm9, %xmm0
    addsd   240(%rcx), %xmm0
    movsd   %xmm0, 240(%rcx)
    incq    %r9
    cmpq    %rdx, %r9
    jne .LBB0_2
.LBB0_8:
    movaps  (%rsp), %xmm6
    movaps  16(%rsp), %xmm7
    movaps  32(%rsp), %xmm8
    movaps  48(%rsp), %xmm9
    movaps  64(%rsp), %xmm10
    movaps  80(%rsp), %xmm11
    addq    $104, %rsp
    popq    %rbx
    popq    %rdi
    popq    %rsi
    popq    %r14
    retq

As you see LLVM performs some unrolling of the inner loop, but it’s not enough.


The not-inlined advance() function of the manually unrolled Rust#4:

_ZN7advance20ha99dc24a84950b29LaaE:
    subq    $376, %rsp
    movaps  %xmm15, 352(%rsp)
    movaps  %xmm14, 336(%rsp)
    movaps  %xmm13, 320(%rsp)
    movaps  %xmm12, 304(%rsp)
    movaps  %xmm11, 288(%rsp)
    movaps  %xmm10, 272(%rsp)
    movaps  %xmm9, 256(%rsp)
    movaps  %xmm8, 240(%rsp)
    movaps  %xmm7, 224(%rsp)
    movaps  %xmm6, 208(%rsp)
    movupd  (%rcx), %xmm1
    movupd  56(%rcx), %xmm0
    movapd  %xmm1, %xmm6
    subpd   %xmm0, %xmm6
    movsd   16(%rcx), %xmm2
    movsd   40(%rcx), %xmm9
    movapd  %xmm2, %xmm7
    subsd   72(%rcx), %xmm7
    movapd  %xmm6, %xmm0
    mulsd   %xmm0, %xmm0
    movapd  %xmm6, %xmm4
    shufpd  $1, %xmm4, %xmm4
    mulsd   %xmm4, %xmm4
    addsd   %xmm0, %xmm4
    movapd  %xmm7, %xmm0
    mulsd   %xmm0, %xmm0
    addsd   %xmm4, %xmm0
    sqrtsd  %xmm0, %xmm0
    movapd  %xmm0, %xmm4
    mulsd   %xmm4, %xmm4
    mulsd   %xmm0, %xmm4
    movsd   LCPI0_0(%rip), %xmm0
    movapd  %xmm0, %xmm5
    movapd  %xmm0, %xmm11
    divsd   %xmm4, %xmm5
    movsd   104(%rcx), %xmm4
    movupd  24(%rcx), %xmm0
    movapd  %xmm7, %xmm3
    mulsd   %xmm4, %xmm3
    movlhps %xmm4, %xmm4
    mulpd   %xmm6, %xmm4
    mulsd   %xmm5, %xmm3
    movsd   48(%rcx), %xmm8
    mulsd   %xmm8, %xmm7
    mulsd   %xmm5, %xmm7
    movlhps %xmm5, %xmm5
    mulpd   %xmm5, %xmm4
    subpd   %xmm4, %xmm0
    subsd   %xmm3, %xmm9
    movupd  80(%rcx), %xmm4
    movapd  %xmm8, %xmm10
    movlhps %xmm10, %xmm10
    mulpd   %xmm10, %xmm6
    mulpd   %xmm5, %xmm6
    addpd   %xmm4, %xmm6
    movupd  %xmm6, 80(%rcx)
    addsd   96(%rcx), %xmm7
    movsd   %xmm7, 96(%rcx)
    movupd  112(%rcx), %xmm4
    movapd  %xmm1, %xmm7
    subpd   %xmm4, %xmm7
    movapd  %xmm2, %xmm6
    subsd   128(%rcx), %xmm6
    movapd  %xmm7, %xmm4
    mulsd   %xmm4, %xmm4
    movapd  %xmm7, %xmm5
    shufpd  $1, %xmm5, %xmm5
    mulsd   %xmm5, %xmm5
    addsd   %xmm4, %xmm5
    movapd  %xmm6, %xmm4
    mulsd   %xmm4, %xmm4
    addsd   %xmm5, %xmm4
    sqrtsd  %xmm4, %xmm4
    movapd  %xmm4, %xmm5
    mulsd   %xmm5, %xmm5
    mulsd   %xmm4, %xmm5
    movapd  %xmm11, %xmm4
    divsd   %xmm5, %xmm4
    movsd   160(%rcx), %xmm5
    movapd  %xmm6, %xmm3
    mulsd   %xmm5, %xmm3
    movlhps %xmm5, %xmm5
    mulpd   %xmm7, %xmm5
    mulsd   %xmm4, %xmm3
    mulsd   %xmm8, %xmm6
    mulsd   %xmm4, %xmm6
    movlhps %xmm4, %xmm4
    mulpd   %xmm4, %xmm5
    subpd   %xmm5, %xmm0
    subsd   %xmm3, %xmm9
    movupd  136(%rcx), %xmm3
    mulpd   %xmm10, %xmm7
    mulpd   %xmm4, %xmm7
    addpd   %xmm3, %xmm7
    movupd  %xmm7, 136(%rcx)
    addsd   152(%rcx), %xmm6
    movsd   %xmm6, 152(%rcx)
    movupd  168(%rcx), %xmm3
    movapd  %xmm1, %xmm7
    subpd   %xmm3, %xmm7
    movapd  %xmm2, %xmm6
    subsd   184(%rcx), %xmm6
    movapd  %xmm7, %xmm3
    mulsd   %xmm3, %xmm3
    movapd  %xmm7, %xmm4
    shufpd  $1, %xmm4, %xmm4
    mulsd   %xmm4, %xmm4
    addsd   %xmm3, %xmm4
    movapd  %xmm6, %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm4, %xmm3
    sqrtsd  %xmm3, %xmm3
    movapd  %xmm3, %xmm4
    mulsd   %xmm4, %xmm4
    mulsd   %xmm3, %xmm4
    movapd  %xmm11, %xmm3
    divsd   %xmm4, %xmm3
    movsd   216(%rcx), %xmm4
    movapd  %xmm6, %xmm5
    mulsd   %xmm4, %xmm5
    movlhps %xmm4, %xmm4
    mulpd   %xmm7, %xmm4
    mulsd   %xmm3, %xmm5
    mulsd   %xmm8, %xmm6
    mulsd   %xmm3, %xmm6
    movlhps %xmm3, %xmm3
    mulpd   %xmm3, %xmm4
    subpd   %xmm4, %xmm0
    subsd   %xmm5, %xmm9
    movupd  192(%rcx), %xmm4
    mulpd   %xmm10, %xmm7
    mulpd   %xmm3, %xmm7
    addpd   %xmm4, %xmm7
    movupd  %xmm7, 192(%rcx)
    addsd   208(%rcx), %xmm6
    movsd   %xmm6, 208(%rcx)
    movupd  224(%rcx), %xmm3
    subpd   %xmm3, %xmm1
    subsd   240(%rcx), %xmm2
    movapd  %xmm1, %xmm3
    mulsd   %xmm3, %xmm3
    movapd  %xmm1, %xmm4
    shufpd  $1, %xmm4, %xmm4
    mulsd   %xmm4, %xmm4
    addsd   %xmm3, %xmm4
    movapd  %xmm2, %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm4, %xmm3
    sqrtsd  %xmm3, %xmm3
    movapd  %xmm3, %xmm4
    mulsd   %xmm4, %xmm4
    mulsd   %xmm3, %xmm4
    movapd  %xmm11, %xmm3
    movapd  %xmm11, %xmm12
    divsd   %xmm4, %xmm3
    movsd   272(%rcx), %xmm4
    movapd  %xmm2, %xmm5
    mulsd   %xmm4, %xmm5
    movlhps %xmm4, %xmm4
    mulpd   %xmm1, %xmm4
    mulsd   %xmm3, %xmm5
    mulsd   %xmm8, %xmm2
    mulsd   %xmm3, %xmm2
    movlhps %xmm3, %xmm3
    mulpd   %xmm3, %xmm4
    subpd   %xmm4, %xmm0
    movupd  %xmm0, 24(%rcx)
    subsd   %xmm5, %xmm9
    movsd   %xmm9, 40(%rcx)
    movupd  248(%rcx), %xmm0
    mulpd   %xmm10, %xmm1
    mulpd   %xmm3, %xmm1
    addpd   %xmm0, %xmm1
    movupd  %xmm1, 248(%rcx)
    addsd   264(%rcx), %xmm2
    movsd   %xmm2, 264(%rcx)
    movupd  56(%rcx), %xmm3
    movupd  112(%rcx), %xmm11
    movapd  %xmm3, %xmm2
    subpd   %xmm11, %xmm2
    movapd  %xmm2, 112(%rsp)
    movsd   128(%rcx), %xmm1
    movapd  %xmm1, 192(%rsp)
    movsd   160(%rcx), %xmm0
    movapd  %xmm0, %xmm9
    movlhps %xmm9, %xmm9
    mulpd   %xmm2, %xmm9
    movapd  %xmm2, %xmm8
    movupd  224(%rcx), %xmm2
    movapd  %xmm2, 176(%rsp)
    movsd   240(%rcx), %xmm4
    movapd  %xmm4, 160(%rsp)
    movapd  %xmm11, %xmm7
    subpd   %xmm2, %xmm7
    movapd  %xmm7, 128(%rsp)
    movsd   72(%rcx), %xmm6
    movapd  %xmm6, %xmm13
    unpcklpd    %xmm1, %xmm13
    movapd  %xmm1, %xmm2
    unpcklpd    %xmm4, %xmm2
    subpd   %xmm2, %xmm13
    movapd  %xmm13, 80(%rsp)
    movapd  %xmm8, %xmm2
    unpcklpd    %xmm7, %xmm2
    mulpd   %xmm2, %xmm2
    shufpd  $1, %xmm8, %xmm8
    movapd  %xmm8, 96(%rsp)
    movapd  %xmm7, %xmm4
    movsd   %xmm8, %xmm4
    mulpd   %xmm4, %xmm4
    addpd   %xmm2, %xmm4
    movapd  %xmm13, %xmm2
    mulpd   %xmm2, %xmm2
    addpd   %xmm4, %xmm2
    sqrtpd  %xmm2, %xmm2
    movapd  %xmm2, %xmm4
    mulpd   %xmm4, %xmm4
    mulpd   %xmm2, %xmm4
    movapd  LCPI0_1(%rip), %xmm10
    divpd   %xmm4, %xmm10
    movapd  %xmm10, %xmm2
    movlhps %xmm2, %xmm2
    mulpd   %xmm9, %xmm2
    movupd  80(%rcx), %xmm1
    subpd   %xmm2, %xmm1
    movapd  %xmm1, %xmm2
    movsd   96(%rcx), %xmm9
    mulsd   %xmm13, %xmm0
    mulsd   %xmm10, %xmm0
    subsd   %xmm0, %xmm9
    movsd   104(%rcx), %xmm5
    mulsd   %xmm5, %xmm13
    mulsd   %xmm10, %xmm13
    addsd   152(%rcx), %xmm13
    movsd   144(%rcx), %xmm7
    movsd   %xmm13, 152(%rcx)
    movupd  168(%rcx), %xmm0
    movapd  %xmm0, 48(%rsp)
    movapd  %xmm3, %xmm8
    subpd   %xmm0, %xmm8
    movapd  %xmm8, %xmm0
    mulsd   %xmm0, %xmm0
    movapd  %xmm8, %xmm4
    shufpd  $1, %xmm4, %xmm4
    mulsd   %xmm4, %xmm4
    addsd   %xmm0, %xmm4
    movsd   184(%rcx), %xmm0
    movsd   %xmm0, 40(%rsp)
    movapd  %xmm6, %xmm14
    subsd   %xmm0, %xmm14
    movapd  %xmm14, %xmm0
    mulsd   %xmm0, %xmm0
    addsd   %xmm4, %xmm0
    sqrtsd  %xmm0, %xmm0
    movapd  %xmm0, %xmm4
    mulsd   %xmm4, %xmm4
    mulsd   %xmm0, %xmm4
    movapd  %xmm12, %xmm15
    movapd  %xmm12, %xmm1
    divsd   %xmm4, %xmm15
    movsd   216(%rcx), %xmm12
    movapd  %xmm12, %xmm0
    movlhps %xmm0, %xmm0
    movaps  %xmm0, 64(%rsp)
    movapd  %xmm8, %xmm4
    mulpd   %xmm0, %xmm4
    movapd  %xmm14, %xmm0
    mulsd   %xmm12, %xmm0
    mulsd   %xmm15, %xmm0
    mulsd   %xmm5, %xmm14
    mulsd   %xmm15, %xmm14
    movlhps %xmm15, %xmm15
    mulpd   %xmm15, %xmm4
    subpd   %xmm4, %xmm2
    movapd  %xmm2, (%rsp)
    subsd   %xmm0, %xmm9
    movapd  %xmm5, %xmm2
    movlhps %xmm2, %xmm2
    mulpd   %xmm2, %xmm8
    mulpd   %xmm15, %xmm8
    movupd  192(%rcx), %xmm4
    addpd   %xmm4, %xmm8
    subpd   176(%rsp), %xmm3
    movapd  %xmm3, %xmm15
    mulsd   %xmm15, %xmm15
    movapd  %xmm3, %xmm4
    shufpd  $1, %xmm4, %xmm4
    mulsd   %xmm4, %xmm4
    addsd   %xmm15, %xmm4
    movapd  %xmm6, %xmm0
    subsd   160(%rsp), %xmm0
    movapd  %xmm0, %xmm6
    mulsd   %xmm6, %xmm6
    addsd   %xmm4, %xmm6
    xorps   %xmm4, %xmm4
    sqrtsd  %xmm6, %xmm4
    movapd  %xmm4, %xmm6
    mulsd   %xmm6, %xmm6
    mulsd   %xmm4, %xmm6
    movapd  %xmm1, %xmm4
    divsd   %xmm6, %xmm4
    movsd   272(%rcx), %xmm1
    movapd  %xmm1, 144(%rsp)
    movapd  %xmm1, %xmm15
    movlhps %xmm15, %xmm15
    mulpd   %xmm3, %xmm2
    mulpd   %xmm15, %xmm3
    movapd  %xmm0, %xmm6
    mulsd   %xmm1, %xmm6
    mulsd   %xmm4, %xmm6
    mulsd   %xmm5, %xmm0
    mulsd   %xmm4, %xmm0
    movapd  %xmm0, 16(%rsp)
    movlhps %xmm4, %xmm4
    mulpd   %xmm4, %xmm3
    movapd  (%rsp), %xmm0
    subpd   %xmm3, %xmm0
    addsd   208(%rcx), %xmm14
    movupd  %xmm0, 80(%rcx)
    subsd   %xmm6, %xmm9
    movsd   %xmm9, 96(%rcx)
    mulpd   %xmm2, %xmm4
    movupd  248(%rcx), %xmm0
    addpd   %xmm0, %xmm4
    subpd   48(%rsp), %xmm11
    movapd  192(%rsp), %xmm1
    subsd   40(%rsp), %xmm1
    movapd  %xmm1, 192(%rsp)
    movapd  %xmm11, %xmm0
    mulsd   %xmm0, %xmm0
    movapd  %xmm11, %xmm3
    shufpd  $1, %xmm3, %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm0, %xmm3
    movapd  %xmm1, %xmm0
    mulsd   %xmm0, %xmm0
    addsd   %xmm3, %xmm0
    sqrtsd  %xmm0, %xmm0
    movapd  %xmm0, %xmm6
    mulsd   %xmm6, %xmm6
    mulsd   %xmm0, %xmm6
    movsd   LCPI0_0(%rip), %xmm2
    movapd  %xmm2, %xmm3
    divsd   %xmm6, %xmm3
    mulsd   %xmm1, %xmm12
    mulsd   %xmm3, %xmm12
    subsd   %xmm12, %xmm13
    movapd  80(%rsp), %xmm1
    movapd  %xmm1, %xmm9
    movapd  96(%rsp), %xmm0
    movsd   %xmm0, %xmm1
    movapd  112(%rsp), %xmm6
    mulsd   %xmm5, %xmm6
    unpcklpd    144(%rsp), %xmm5
    mulpd   %xmm1, %xmm5
    unpcklpd    %xmm13, %xmm7
    mulpd   %xmm10, %xmm5
    movapd  %xmm7, %xmm0
    addpd   %xmm5, %xmm0
    subpd   %xmm5, %xmm7
    movapd  %xmm7, %xmm1
    movsd   %xmm0, %xmm1
    movapd  %xmm2, %xmm12
    movapd  %xmm12, %xmm0
    unpcklpd    %xmm6, %xmm0
    shufpd  $1, %xmm10, %xmm7
    mulpd   %xmm0, %xmm7
    movupd  128(%rcx), %xmm0
    addpd   %xmm0, %xmm7
    movapd  16(%rsp), %xmm6
    addsd   264(%rcx), %xmm6
    movupd  %xmm7, 128(%rcx)
    movupd  %xmm1, 144(%rcx)
    movapd  64(%rsp), %xmm2
    mulpd   %xmm11, %xmm2
    movsd   160(%rcx), %xmm1
    movapd  192(%rsp), %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm3, %xmm0
    movapd  %xmm0, %xmm5
    movlhps %xmm3, %xmm3
    mulpd   %xmm3, %xmm2
    movupd  136(%rcx), %xmm0
    subpd   %xmm2, %xmm0
    shufpd  $1, %xmm9, %xmm9
    mulsd   %xmm1, %xmm9
    movlhps %xmm1, %xmm1
    mulpd   %xmm1, %xmm11
    mulpd   %xmm3, %xmm11
    addpd   %xmm8, %xmm11
    addsd   %xmm14, %xmm5
    movapd  128(%rsp), %xmm3
    mulpd   %xmm3, %xmm1
    mulpd   %xmm15, %xmm3
    movapd  %xmm10, %xmm2
    movhlps %xmm10, %xmm10
    mulpd   %xmm10, %xmm3
    subpd   %xmm3, %xmm0
    shufpd  $1, %xmm2, %xmm2
    movupd  %xmm0, 136(%rcx)
    mulpd   %xmm1, %xmm10
    addpd   %xmm4, %xmm10
    mulsd   %xmm2, %xmm9
    addsd   %xmm6, %xmm9
    movupd  168(%rcx), %xmm13
    movapd  %xmm13, %xmm4
    subpd   176(%rsp), %xmm4
    movsd   184(%rcx), %xmm8
    movapd  %xmm8, %xmm1
    subsd   160(%rsp), %xmm1
    movapd  %xmm4, %xmm6
    mulsd   %xmm6, %xmm6
    movapd  %xmm4, %xmm7
    shufpd  $1, %xmm7, %xmm7
    mulsd   %xmm7, %xmm7
    addsd   %xmm6, %xmm7
    movapd  %xmm1, %xmm6
    mulsd   %xmm6, %xmm6
    addsd   %xmm7, %xmm6
    sqrtsd  %xmm6, %xmm6
    movapd  %xmm6, %xmm7
    mulsd   %xmm7, %xmm7
    mulsd   %xmm6, %xmm7
    movapd  %xmm12, %xmm6
    divsd   %xmm7, %xmm6
    mulpd   %xmm4, %xmm15
    movapd  %xmm6, %xmm2
    movlhps %xmm2, %xmm2
    mulpd   %xmm2, %xmm15
    subpd   %xmm15, %xmm11
    movupd  %xmm11, 192(%rcx)
    movapd  144(%rsp), %xmm3
    mulsd   %xmm1, %xmm3
    mulsd   %xmm6, %xmm3
    subsd   %xmm3, %xmm5
    movsd   %xmm5, 208(%rcx)
    movsd   216(%rcx), %xmm7
    mulsd   %xmm7, %xmm1
    movlhps %xmm7, %xmm7
    mulpd   %xmm4, %xmm7
    mulpd   %xmm2, %xmm7
    addpd   %xmm10, %xmm7
    movupd  %xmm7, 248(%rcx)
    mulsd   %xmm6, %xmm1
    addsd   %xmm9, %xmm1
    movsd   %xmm1, 264(%rcx)
    movupd  (%rcx), %xmm2
    movupd  24(%rcx), %xmm4
    movapd  LCPI0_1(%rip), %xmm3
    mulpd   %xmm3, %xmm4
    addpd   %xmm2, %xmm4
    movupd  %xmm4, (%rcx)
    movsd   40(%rcx), %xmm2
    movapd  %xmm12, %xmm6
    mulsd   %xmm6, %xmm2
    addsd   16(%rcx), %xmm2
    movsd   %xmm2, 16(%rcx)
    movupd  56(%rcx), %xmm2
    movupd  80(%rcx), %xmm4
    mulpd   %xmm3, %xmm4
    addpd   %xmm2, %xmm4
    movupd  %xmm4, 56(%rcx)
    movsd   96(%rcx), %xmm2
    mulsd   %xmm6, %xmm2
    addsd   72(%rcx), %xmm2
    movsd   %xmm2, 72(%rcx)
    movupd  112(%rcx), %xmm2
    mulpd   %xmm3, %xmm0
    addpd   %xmm2, %xmm0
    movupd  %xmm0, 112(%rcx)
    mulpd   %xmm3, %xmm11
    addpd   %xmm13, %xmm11
    movupd  %xmm11, 168(%rcx)
    mulsd   %xmm6, %xmm5
    addsd   %xmm8, %xmm5
    movsd   %xmm5, 184(%rcx)
    mulpd   %xmm3, %xmm7
    movupd  224(%rcx), %xmm0
    addpd   %xmm0, %xmm7
    movupd  %xmm7, 224(%rcx)
    mulsd   %xmm6, %xmm1
    addsd   240(%rcx), %xmm1
    movsd   %xmm1, 240(%rcx)
    movaps  208(%rsp), %xmm6
    movaps  224(%rsp), %xmm7
    movaps  240(%rsp), %xmm8
    movaps  256(%rsp), %xmm9
    movaps  272(%rsp), %xmm10
    movaps  288(%rsp), %xmm11
    movaps  304(%rsp), %xmm12
    movaps  320(%rsp), %xmm13
    movaps  336(%rsp), %xmm14
    movaps  352(%rsp), %xmm15
    addq    $376, %rsp
    retq

As you see the advance() of Rust#4 has no loops and no jumps. Modern CPUs love this kind of code, and they scream through it.

Finally the asm of the advance() function of Rust#3 compiled with (a more aggressive) automatic loop unrolling:

rustc -C opt-level=3 -C lto -C llvm-args="-unroll-threshold=1000" --emit asm nbody_simple.rs

_ZN7advance20hc525da322d48cc5eLaaE:
    subq    $840, %rsp
    movapd    %xmm15, 816(%rsp)
    movapd    %xmm14, 800(%rsp)
    movapd    %xmm13, 784(%rsp)
    movapd    %xmm12, 768(%rsp)
    movapd    %xmm11, 752(%rsp)
    movapd    %xmm10, 736(%rsp)
    movapd    %xmm9, 720(%rsp)
    movaps    %xmm8, 704(%rsp)
    movapd    %xmm7, 688(%rsp)
    movapd    %xmm6, 672(%rsp)
    testq    %rdx, %rdx
    je    .LBB0_7
    xorl    %eax, %eax
    testb    $1, %dl
    je    .LBB0_3
    movupd    (%rcx), %xmm4
    movsd    48(%rcx), %xmm8
    movapd    %xmm8, %xmm9
    movlhps    %xmm9, %xmm9
    movupd    56(%rcx), %xmm3
    movapd    %xmm4, %xmm1
    subpd    %xmm3, %xmm1
    movsd    16(%rcx), %xmm3
    movsd    40(%rcx), %xmm10
    movapd    %xmm3, %xmm7
    subsd    72(%rcx), %xmm7
    movapd    %xmm1, %xmm6
    mulsd    %xmm6, %xmm6
    movapd    %xmm1, %xmm0
    shufpd    $1, %xmm0, %xmm0
    mulsd    %xmm0, %xmm0
    addsd    %xmm6, %xmm0
    movapd    %xmm7, %xmm6
    mulsd    %xmm6, %xmm6
    addsd    %xmm0, %xmm6
    xorps    %xmm0, %xmm0
    sqrtsd    %xmm6, %xmm0
    movapd    %xmm0, %xmm6
    mulsd    %xmm6, %xmm6
    mulsd    %xmm0, %xmm6
    movsd    LCPI0_0(%rip), %xmm2
    movapd    %xmm2, %xmm0
    movapd    %xmm2, %xmm12
    divsd    %xmm6, %xmm0
    movsd    104(%rcx), %xmm2
    movupd    24(%rcx), %xmm6
    movapd    %xmm7, %xmm5
    mulsd    %xmm2, %xmm5
    movlhps    %xmm2, %xmm2
    mulpd    %xmm1, %xmm2
    mulsd    %xmm0, %xmm5
    mulsd    %xmm8, %xmm7
    mulsd    %xmm0, %xmm7
    movlhps    %xmm0, %xmm0
    mulpd    %xmm0, %xmm2
    subpd    %xmm2, %xmm6
    subsd    %xmm5, %xmm10
    movupd    80(%rcx), %xmm2
    mulpd    %xmm9, %xmm1
    mulpd    %xmm0, %xmm1
    addpd    %xmm2, %xmm1
    movupd    %xmm1, 80(%rcx)
    addsd    96(%rcx), %xmm7
    movsd    %xmm7, 96(%rcx)
    movupd    112(%rcx), %xmm0
    movapd    %xmm4, %xmm7
    subpd    %xmm0, %xmm7
    movapd    %xmm3, %xmm0
    subsd    128(%rcx), %xmm0
    movapd    %xmm7, %xmm1
    mulsd    %xmm1, %xmm1
    movapd    %xmm7, %xmm2
    shufpd    $1, %xmm2, %xmm2
    mulsd    %xmm2, %xmm2
    addsd    %xmm1, %xmm2
    movapd    %xmm0, %xmm1
    mulsd    %xmm1, %xmm1
    addsd    %xmm2, %xmm1
    sqrtsd    %xmm1, %xmm1
    movapd    %xmm1, %xmm2
    mulsd    %xmm2, %xmm2
    mulsd    %xmm1, %xmm2
    movapd    %xmm12, %xmm1
    divsd    %xmm2, %xmm1
    movsd    160(%rcx), %xmm2
    movapd    %xmm0, %xmm5
    mulsd    %xmm2, %xmm5
    movlhps    %xmm2, %xmm2
    mulpd    %xmm7, %xmm2
    mulsd    %xmm1, %xmm5
    mulsd    %xmm8, %xmm0
    mulsd    %xmm1, %xmm0
    movlhps    %xmm1, %xmm1
    mulpd    %xmm1, %xmm2
    subpd    %xmm2, %xmm6
    subsd    %xmm5, %xmm10
    movupd    136(%rcx), %xmm2
    mulpd    %xmm9, %xmm7
    mulpd    %xmm1, %xmm7
    addpd    %xmm2, %xmm7
    movupd    %xmm7, 136(%rcx)
    addsd    152(%rcx), %xmm0
    movsd    %xmm0, 152(%rcx)
    movupd    168(%rcx), %xmm0
    movapd    %xmm4, %xmm7
    subpd    %xmm0, %xmm7
    movapd    %xmm3, %xmm0
    subsd    184(%rcx), %xmm0
    movapd    %xmm7, %xmm1
    mulsd    %xmm1, %xmm1
    movapd    %xmm7, %xmm2
    shufpd    $1, %xmm2, %xmm2
    mulsd    %xmm2, %xmm2
    addsd    %xmm1, %xmm2
    movapd    %xmm0, %xmm1
    mulsd    %xmm1, %xmm1
    addsd    %xmm2, %xmm1
    sqrtsd    %xmm1, %xmm1
    movapd    %xmm1, %xmm2
    mulsd    %xmm2, %xmm2
    mulsd    %xmm1, %xmm2
    movapd    %xmm12, %xmm1
    divsd    %xmm2, %xmm1
    movsd    216(%rcx), %xmm2
    movapd    %xmm0, %xmm5
    mulsd    %xmm2, %xmm5
    movlhps    %xmm2, %xmm2
    mulpd    %xmm7, %xmm2
    mulsd    %xmm1, %xmm5
    mulsd    %xmm8, %xmm0
    mulsd    %xmm1, %xmm0
    movlhps    %xmm1, %xmm1
    mulpd    %xmm1, %xmm2
    subpd    %xmm2, %xmm6
    subsd    %xmm5, %xmm10
    movupd    192(%rcx), %xmm2
    mulpd    %xmm9, %xmm7
    mulpd    %xmm1, %xmm7
    addpd    %xmm2, %xmm7
    movupd    %xmm7, 192(%rcx)
    addsd    208(%rcx), %xmm0
    movsd    %xmm0, 208(%rcx)
    movupd    224(%rcx), %xmm0
    subpd    %xmm0, %xmm4
    subsd    240(%rcx), %xmm3
    movapd    %xmm4, %xmm0
    mulsd    %xmm0, %xmm0
    movapd    %xmm4, %xmm1
    shufpd    $1, %xmm1, %xmm1
    mulsd    %xmm1, %xmm1
    addsd    %xmm0, %xmm1
    movapd    %xmm3, %xmm0
    mulsd    %xmm0, %xmm0
    addsd    %xmm1, %xmm0
    sqrtsd    %xmm0, %xmm0
    movapd    %xmm0, %xmm1
    mulsd    %xmm1, %xmm1
    mulsd    %xmm0, %xmm1
    movapd    %xmm12, %xmm0
    divsd    %xmm1, %xmm0
    movsd    272(%rcx), %xmm1
    movapd    %xmm3, %xmm2
    mulsd    %xmm1, %xmm2
    movlhps    %xmm1, %xmm1
    mulpd    %xmm4, %xmm1
    mulsd    %xmm0, %xmm2
    mulsd    %xmm8, %xmm3
    mulsd    %xmm0, %xmm3
    movlhps    %xmm0, %xmm0
    mulpd    %xmm0, %xmm1
    subpd    %xmm1, %xmm6
    movupd    %xmm6, 24(%rcx)
    subsd    %xmm2, %xmm10
    movsd    %xmm10, 40(%rcx)
    movupd    248(%rcx), %xmm1
    mulpd    %xmm9, %xmm4
    mulpd    %xmm0, %xmm4
    addpd    %xmm1, %xmm4
    movupd    %xmm4, 248(%rcx)
    addsd    264(%rcx), %xmm3
    movsd    %xmm3, 264(%rcx)
    movupd    56(%rcx), %xmm4
    movsd    104(%rcx), %xmm14
    movapd    %xmm14, %xmm3
    movlhps    %xmm3, %xmm3
    movupd    112(%rcx), %xmm0
    movapd    %xmm4, %xmm1
    subpd    %xmm0, %xmm1
    movsd    72(%rcx), %xmm11
    movsd    96(%rcx), %xmm5
    movapd    %xmm11, %xmm2
    subsd    128(%rcx), %xmm2
    movapd    %xmm1, %xmm0
    mulsd    %xmm0, %xmm0
    movapd    %xmm1, %xmm7
    shufpd    $1, %xmm7, %xmm7
    mulsd    %xmm7, %xmm7
    addsd    %xmm0, %xmm7
    movapd    %xmm2, %xmm0
    mulsd    %xmm0, %xmm0
    addsd    %xmm7, %xmm0
    sqrtsd    %xmm0, %xmm0
    movapd    %xmm0, %xmm7
    mulsd    %xmm7, %xmm7
    mulsd    %xmm0, %xmm7
    movapd    %xmm12, %xmm8
    divsd    %xmm7, %xmm8
    movsd    160(%rcx), %xmm7
    movupd    80(%rcx), %xmm0
    movapd    %xmm2, %xmm6
    mulsd    %xmm7, %xmm6
    movlhps    %xmm7, %xmm7
    mulpd    %xmm1, %xmm7
    mulsd    %xmm8, %xmm6
    mulsd    %xmm14, %xmm2
    mulsd    %xmm8, %xmm2
    movlhps    %xmm8, %xmm8
    mulpd    %xmm8, %xmm7
    subpd    %xmm7, %xmm0
    subsd    %xmm6, %xmm5
    movupd    136(%rcx), %xmm6
    mulpd    %xmm3, %xmm1
    mulpd    %xmm8, %xmm1
    addpd    %xmm6, %xmm1
    movupd    %xmm1, 136(%rcx)
    addsd    152(%rcx), %xmm2
    movsd    %xmm2, 152(%rcx)
    movupd    168(%rcx), %xmm9
    movapd    %xmm4, %xmm2
    subpd    %xmm9, %xmm2
    movsd    184(%rcx), %xmm1
    movsd    %xmm1, 640(%rsp)
    movsd    216(%rcx), %xmm15
    movapd    %xmm11, %xmm13
    subsd    %xmm1, %xmm13
    movapd    %xmm2, %xmm1
    mulsd    %xmm1, %xmm1
    movapd    %xmm2, %xmm6
    shufpd    $1, %xmm6, %xmm6
    mulsd    %xmm6, %xmm6
    addsd    %xmm1, %xmm6
    movapd    %xmm13, %xmm1
    mulsd    %xmm1, %xmm1
    addsd    %xmm6, %xmm1
    sqrtsd    %xmm1, %xmm1
    movapd    %xmm1, %xmm6
    mulsd    %xmm6, %xmm6
    mulsd    %xmm1, %xmm6
    movapd    %xmm12, %xmm1
    movapd    %xmm1, %xmm7
    divsd    %xmm6, %xmm7
    movapd    %xmm15, %xmm10
    movlhps    %xmm10, %xmm10
    movapd    %xmm3, %xmm8
    mulpd    %xmm2, %xmm8
    mulpd    %xmm10, %xmm2
    movapd    %xmm13, %xmm6
    mulsd    %xmm15, %xmm6
    mulsd    %xmm7, %xmm6
    mulsd    %xmm14, %xmm13
    mulsd    %xmm7, %xmm13
    movapd    %xmm7, %xmm12
    movlhps    %xmm12, %xmm12
    mulpd    %xmm12, %xmm2
    subpd    %xmm2, %xmm0
    subsd    %xmm6, %xmm5
    movupd    192(%rcx), %xmm2
    mulpd    %xmm8, %xmm12
    addpd    %xmm2, %xmm12
    movupd    224(%rcx), %xmm2
    movapd    %xmm2, 656(%rsp)
    subpd    %xmm2, %xmm4
    movapd    %xmm4, %xmm2
    mulsd    %xmm2, %xmm2
    movapd    %xmm4, %xmm6
    shufpd    $1, %xmm6, %xmm6
    mulsd    %xmm6, %xmm6
    addsd    %xmm2, %xmm6
    movsd    240(%rcx), %xmm2
    movsd    %xmm2, 624(%rsp)
    subsd    %xmm2, %xmm11
    movapd    %xmm11, %xmm2
    mulsd    %xmm2, %xmm2
    addsd    %xmm6, %xmm2
    sqrtsd    %xmm2, %xmm2
    movapd    %xmm2, %xmm6
    mulsd    %xmm6, %xmm6
    mulsd    %xmm2, %xmm6
    movapd    %xmm1, %xmm2
    movapd    %xmm1, %xmm8
    divsd    %xmm6, %xmm2
    movapd    %xmm11, %xmm6
    mulsd    %xmm14, %xmm11
    movsd    272(%rcx), %xmm14
    movapd    %xmm14, %xmm7
    movlhps    %xmm7, %xmm7
    mulpd    %xmm4, %xmm3
    mulpd    %xmm7, %xmm4
    mulsd    %xmm14, %xmm6
    mulsd    %xmm2, %xmm6
    mulsd    %xmm2, %xmm11
    movlhps    %xmm2, %xmm2
    mulpd    %xmm2, %xmm4
    subpd    %xmm4, %xmm0
    addsd    208(%rcx), %xmm13
    movupd    %xmm0, 80(%rcx)
    subsd    %xmm6, %xmm5
    movsd    %xmm5, 96(%rcx)
    movupd    248(%rcx), %xmm0
    mulpd    %xmm3, %xmm2
    addpd    %xmm0, %xmm2
    movupd    112(%rcx), %xmm4
    movapd    %xmm4, %xmm3
    subpd    %xmm9, %xmm3
    movsd    128(%rcx), %xmm1
    movapd    %xmm1, %xmm5
    subsd    640(%rsp), %xmm5
    movapd    %xmm3, %xmm9
    mulsd    %xmm9, %xmm9
    movapd    %xmm3, %xmm0
    shufpd    $1, %xmm0, %xmm0
    mulsd    %xmm0, %xmm0
    addsd    %xmm9, %xmm0
    movapd    %xmm5, %xmm6
    mulsd    %xmm6, %xmm6
    addsd    %xmm0, %xmm6
    xorps    %xmm0, %xmm0
    sqrtsd    %xmm6, %xmm0
    movapd    %xmm0, %xmm6
    mulsd    %xmm6, %xmm6
    mulsd    %xmm0, %xmm6
    movapd    %xmm8, %xmm0
    divsd    %xmm6, %xmm0
    mulpd    %xmm3, %xmm10
    mulsd    %xmm5, %xmm15
    mulsd    %xmm0, %xmm15
    movapd    %xmm15, %xmm8
    movsd    160(%rcx), %xmm9
    mulsd    %xmm9, %xmm5
    mulsd    %xmm0, %xmm5
    movlhps    %xmm0, %xmm0
    mulpd    %xmm0, %xmm10
    movupd    136(%rcx), %xmm15
    subpd    %xmm10, %xmm15
    movsd    152(%rcx), %xmm6
    subsd    %xmm8, %xmm6
    subsd    624(%rsp), %xmm1
    mulsd    %xmm1, %xmm14
    movapd    %xmm1, %xmm8
    mulsd    %xmm9, %xmm1
    movlhps    %xmm9, %xmm9
    mulpd    %xmm9, %xmm3
    mulpd    %xmm0, %xmm3
    addsd    264(%rcx), %xmm11
    addpd    %xmm12, %xmm3
    addsd    %xmm13, %xmm5
    subpd    656(%rsp), %xmm4
    movapd    %xmm4, %xmm10
    mulsd    %xmm10, %xmm10
    movapd    %xmm4, %xmm0
    shufpd    $1, %xmm0, %xmm0
    mulsd    %xmm0, %xmm0
    addsd    %xmm10, %xmm0
    mulsd    %xmm8, %xmm8
    addsd    %xmm0, %xmm8
    sqrtsd    %xmm8, %xmm8
    movapd    %xmm8, %xmm0
    mulsd    %xmm0, %xmm0
    mulsd    %xmm8, %xmm0
    movsd    LCPI0_0(%rip), %xmm10
    movapd    %xmm10, %xmm8
    divsd    %xmm0, %xmm8
    mulpd    %xmm4, %xmm7
    mulsd    %xmm8, %xmm14
    mulsd    %xmm8, %xmm1
    movlhps    %xmm8, %xmm8
    mulpd    %xmm8, %xmm7
    subpd    %xmm7, %xmm15
    movupd    %xmm15, 136(%rcx)
    subsd    %xmm14, %xmm6
    movsd    %xmm6, 152(%rcx)
    mulpd    %xmm9, %xmm4
    mulpd    %xmm8, %xmm4
    addpd    %xmm2, %xmm4
    addsd    %xmm11, %xmm1
    movupd    168(%rcx), %xmm11
    movapd    %xmm11, 624(%rsp)
    movsd    184(%rcx), %xmm2
    movsd    216(%rcx), %xmm14
    movupd    224(%rcx), %xmm0
    movapd    %xmm0, 640(%rsp)
    subpd    %xmm0, %xmm11
    movsd    240(%rcx), %xmm0
    movsd    %xmm0, 656(%rsp)
    subsd    %xmm0, %xmm2
    movapd    %xmm2, %xmm13
    movsd    272(%rcx), %xmm0
    movapd    %xmm2, %xmm12
    mulsd    %xmm0, %xmm12
    movlhps    %xmm0, %xmm0
    mulpd    %xmm11, %xmm0
    mulsd    %xmm14, %xmm2
    movlhps    %xmm14, %xmm14
    mulpd    %xmm11, %xmm14
    movapd    %xmm11, %xmm7
    mulsd    %xmm11, %xmm11
    shufpd    $1, %xmm7, %xmm7
    mulsd    %xmm7, %xmm7
    addsd    %xmm11, %xmm7
    mulsd    %xmm13, %xmm13
    addsd    %xmm7, %xmm13
    xorps    %xmm11, %xmm11
    sqrtsd    %xmm13, %xmm11
    movapd    %xmm11, %xmm7
    mulsd    %xmm7, %xmm7
    mulsd    %xmm11, %xmm7
    movapd    %xmm10, %xmm11
    divsd    %xmm7, %xmm11
    mulsd    %xmm11, %xmm12
    mulsd    %xmm11, %xmm2
    movlhps    %xmm11, %xmm11
    mulpd    %xmm11, %xmm0
    subpd    %xmm0, %xmm3
    movupd    %xmm3, 192(%rcx)
    subsd    %xmm12, %xmm5
    movsd    %xmm5, 208(%rcx)
    mulpd    %xmm11, %xmm14
    addpd    %xmm4, %xmm14
    movupd    %xmm14, 248(%rcx)
    addsd    %xmm1, %xmm2
    movsd    %xmm2, 264(%rcx)
    movupd    (%rcx), %xmm1
    movupd    24(%rcx), %xmm4
    movupd    56(%rcx), %xmm12
    movupd    80(%rcx), %xmm7
    movupd    112(%rcx), %xmm11
    movapd    LCPI0_1(%rip), %xmm9
    mulpd    %xmm9, %xmm4
    addpd    %xmm1, %xmm4
    movsd    40(%rcx), %xmm1
    mulsd    %xmm10, %xmm1
    addsd    16(%rcx), %xmm1
    movsd    96(%rcx), %xmm5
    mulsd    %xmm10, %xmm5
    addsd    72(%rcx), %xmm5
    mulsd    %xmm10, %xmm6
    addsd    128(%rcx), %xmm6
    movupd    176(%rcx), %xmm13
    movupd    200(%rcx), %xmm0
    movupd    %xmm4, (%rcx)
    movsd    %xmm1, 16(%rcx)
    mulpd    %xmm9, %xmm7
    addpd    %xmm12, %xmm7
    movupd    %xmm7, 56(%rcx)
    movsd    %xmm5, 72(%rcx)
    mulpd    %xmm9, %xmm15
    addpd    %xmm11, %xmm15
    movupd    %xmm15, 112(%rcx)
    movsd    %xmm6, 128(%rcx)
    mulsd    %xmm10, %xmm3
    addsd    624(%rsp), %xmm3
    movsd    %xmm3, 168(%rcx)
    mulpd    %xmm9, %xmm0
    addpd    %xmm13, %xmm0
    movupd    %xmm0, 176(%rcx)
    mulpd    %xmm9, %xmm14
    addpd    640(%rsp), %xmm14
    movupd    %xmm14, 224(%rcx)
    mulsd    %xmm10, %xmm2
    addsd    656(%rsp), %xmm2
    movsd    %xmm2, 240(%rcx)
    movl    $1, %eax
.LBB0_3:
    cmpq    $1, %rdx
    je    .LBB0_7
    movsd    48(%rcx), %xmm0
    movapd    %xmm0, 288(%rsp)
    movsd    72(%rcx), %xmm12
    movapd    %xmm0, %xmm1
    movlhps    %xmm1, %xmm1
    movaps    %xmm1, 496(%rsp)
    movsd    104(%rcx), %xmm1
    movapd    %xmm1, 272(%rsp)
    movapd    %xmm1, %xmm2
    movlhps    %xmm2, %xmm2
    movaps    %xmm2, 544(%rsp)
    movsd    160(%rcx), %xmm2
    movapd    %xmm2, 256(%rsp)
    movapd    %xmm2, %xmm3
    movlhps    %xmm3, %xmm3
    movaps    %xmm3, 480(%rsp)
    movsd    216(%rcx), %xmm3
    movapd    %xmm3, 240(%rsp)
    movapd    %xmm3, %xmm4
    movlhps    %xmm4, %xmm4
    movaps    %xmm4, 512(%rsp)
    movsd    272(%rcx), %xmm4
    movapd    %xmm4, 224(%rsp)
    movapd    %xmm4, %xmm5
    movlhps    %xmm5, %xmm5
    movaps    %xmm5, 528(%rsp)
    movsd    240(%rcx), %xmm9
    movhpd    128(%rcx), %xmm9
    movsd    264(%rcx), %xmm6
    movhpd    152(%rcx), %xmm6
    movhpd    16(%rcx), %xmm12
    movsd    96(%rcx), %xmm5
    movhpd    40(%rcx), %xmm5
    movapd    %xmm5, 640(%rsp)
    movapd    %xmm3, %xmm5
    unpcklpd    %xmm4, %xmm5
    movapd    %xmm5, 208(%rsp)
    movapd    %xmm2, %xmm4
    unpcklpd    %xmm3, %xmm4
    movapd    %xmm4, 32(%rsp)
    unpcklpd    %xmm1, %xmm0
    movapd    %xmm0, 16(%rsp)
    unpcklpd    %xmm2, %xmm1
    movapd    %xmm1, (%rsp)
    subq    %rax, %rdx
    movupd    (%rcx), %xmm7
    movupd    56(%rcx), %xmm5
    movups    24(%rcx), %xmm0
    movaps    %xmm0, 656(%rsp)
    movups    80(%rcx), %xmm0
    movaps    %xmm0, 560(%rsp)
    movupd    112(%rcx), %xmm10
    movupd    136(%rcx), %xmm0
    movapd    %xmm0, 576(%rsp)
    movupd    224(%rcx), %xmm4
    movupd    248(%rcx), %xmm15
    movsd    184(%rcx), %xmm1
    movsd    208(%rcx), %xmm14
    movupd    192(%rcx), %xmm11
    .align    16, 0x90
.LBB0_5:
    movapd    %xmm10, 352(%rsp)
    movapd    %xmm9, 400(%rsp)
    movapd    %xmm4, 416(%rsp)
    movapd    %xmm6, 192(%rsp)
    movapd    %xmm5, 304(%rsp)
    movapd    %xmm7, 320(%rsp)
    movapd    %xmm12, 384(%rsp)
    movapd    %xmm7, %xmm2
    subpd    %xmm5, %xmm2
    movapd    %xmm2, 592(%rsp)
    movapd    %xmm2, %xmm0
    mulsd    %xmm0, %xmm0
    shufpd    $1, %xmm2, %xmm2
    mulsd    %xmm2, %xmm2
    addsd    %xmm0, %xmm2
    movapd    %xmm12, %xmm8
    shufpd    $1, %xmm8, %xmm8
    subsd    %xmm12, %xmm8
    movapd    %xmm5, %xmm0
    movapd    %xmm8, %xmm13
    mulsd    %xmm13, %xmm13
    addsd    %xmm2, %xmm13
    movapd    %xmm7, %xmm5
    subpd    %xmm10, %xmm5
    movapd    %xmm5, 368(%rsp)
    movapd    %xmm7, %xmm3
    subpd    %xmm4, %xmm3
    movapd    %xmm3, 176(%rsp)
    movapd    %xmm3, %xmm2
    movapd    %xmm3, %xmm4
    unpcklpd    %xmm5, %xmm2
    mulpd    %xmm2, %xmm2
    movapd    %xmm4, %xmm6
    unpckhpd    %xmm5, %xmm6
    mulpd    %xmm6, %xmm6
    addpd    %xmm2, %xmm6
    movapd    %xmm12, %xmm2
    movhlps    %xmm2, %xmm2
    subpd    %xmm9, %xmm2
    movapd    %xmm2, 336(%rsp)
    movapd    %xmm2, %xmm3
    mulpd    %xmm3, %xmm3
    addpd    %xmm6, %xmm3
    movsd    184(%rcx), %xmm6
    movapd    %xmm6, 432(%rsp)
    unpcklpd    %xmm1, %xmm6
    sqrtsd    %xmm13, %xmm1
    movapd    %xmm1, %xmm5
    mulsd    %xmm5, %xmm5
    mulsd    %xmm1, %xmm5
    subpd    %xmm6, %xmm12
    movapd    %xmm12, 624(%rsp)
    movapd    %xmm12, %xmm13
    sqrtpd    %xmm3, %xmm3
    movapd    %xmm3, %xmm1
    mulpd    %xmm1, %xmm1
    mulpd    %xmm3, %xmm1
    movupd    168(%rcx), %xmm2
    movapd    %xmm2, 464(%rsp)
    subpd    %xmm2, %xmm7
    movapd    %xmm0, %xmm9
    subpd    %xmm2, %xmm9
    movapd    %xmm7, %xmm3
    mulsd    %xmm3, %xmm3
    movapd    %xmm9, %xmm6
    mulsd    %xmm6, %xmm6
    unpcklpd    %xmm3, %xmm6
    movapd    %xmm7, %xmm3
    movapd    %xmm7, %xmm12
    shufpd    $1, %xmm3, %xmm3
    mulsd    %xmm3, %xmm3
    movapd    %xmm9, %xmm7
    shufpd    $1, %xmm7, %xmm7
    mulsd    %xmm7, %xmm7
    unpcklpd    %xmm3, %xmm7
    movsd    LCPI0_0(%rip), %xmm3
    divsd    %xmm5, %xmm3
    addpd    %xmm6, %xmm7
    movapd    %xmm13, %xmm5
    mulpd    %xmm5, %xmm5
    addpd    %xmm7, %xmm5
    movapd    LCPI0_1(%rip), %xmm6
    movapd    %xmm6, %xmm2
    movapd    %xmm2, %xmm4
    divpd    %xmm1, %xmm4
    movapd    %xmm4, 144(%rsp)
    sqrtpd    %xmm5, %xmm1
    movapd    %xmm1, %xmm6
    mulpd    %xmm6, %xmm6
    mulpd    %xmm1, %xmm6
    movapd    %xmm2, %xmm5
    divpd    %xmm6, %xmm5
    movapd    592(%rsp), %xmm6
    movapd    %xmm6, %xmm1
    mulpd    544(%rsp), %xmm1
    movlhps    %xmm3, %xmm3
    mulpd    %xmm3, %xmm1
    movapd    656(%rsp), %xmm2
    subpd    %xmm1, %xmm2
    movapd    %xmm2, 656(%rsp)
    movapd    %xmm12, %xmm2
    mulpd    512(%rsp), %xmm2
    movapd    %xmm5, %xmm1
    movhlps    %xmm1, %xmm1
    movapd    496(%rsp), %xmm7
    mulpd    %xmm7, %xmm12
    mulpd    %xmm1, %xmm2
    movapd    %xmm2, 128(%rsp)
    mulpd    %xmm1, %xmm12
    movapd    %xmm12, 448(%rsp)
    movapd    %xmm13, %xmm2
    shufpd    $1, %xmm2, %xmm2
    mulsd    288(%rsp), %xmm2
    movapd    %xmm5, %xmm1
    shufpd    $1, %xmm1, %xmm1
    mulsd    %xmm1, %xmm2
    movapd    %xmm2, 112(%rsp)
    movlhps    %xmm8, %xmm8
    mulpd    16(%rsp), %xmm8
    mulpd    %xmm3, %xmm8
    movapd    %xmm6, %xmm2
    mulpd    %xmm7, %xmm2
    mulpd    %xmm3, %xmm2
    movapd    640(%rsp), %xmm13
    movapd    %xmm13, %xmm1
    addpd    %xmm8, %xmm1
    subpd    %xmm8, %xmm13
    movapd    %xmm0, %xmm8
    subpd    %xmm10, %xmm8
    movapd    %xmm0, %xmm10
    subpd    416(%rsp), %xmm10
    movapd    %xmm10, %xmm3
    unpcklpd    %xmm8, %xmm3
    mulpd    %xmm3, %xmm3
    movapd    %xmm10, %xmm6
    unpckhpd    %xmm8, %xmm6
    mulpd    %xmm6, %xmm6
    addpd    %xmm3, %xmm6
    movaps    384(%rsp), %xmm4
    movlhps    %xmm4, %xmm4
    subpd    400(%rsp), %xmm4
    movapd    %xmm4, 608(%rsp)
    movapd    %xmm4, %xmm3
    mulpd    %xmm3, %xmm3
    addpd    %xmm6, %xmm3
    addpd    560(%rsp), %xmm2
    movapd    368(%rsp), %xmm6
    movapd    480(%rsp), %xmm0
    mulpd    %xmm0, %xmm6
    movaps    144(%rsp), %xmm12
    movaps    %xmm12, %xmm7
    movhlps    %xmm7, %xmm7
    movaps    %xmm7, 96(%rsp)
    mulpd    %xmm7, %xmm6
    movapd    656(%rsp), %xmm7
    subpd    %xmm6, %xmm7
    movapd    %xmm7, 656(%rsp)
    sqrtpd    %xmm3, %xmm3
    movapd    %xmm3, %xmm6
    mulpd    %xmm6, %xmm6
    mulpd    %xmm3, %xmm6
    movapd    %xmm13, %xmm7
    movsd    %xmm1, %xmm7
    movapd    LCPI0_1(%rip), %xmm13
    divpd    %xmm6, %xmm13
    movapd    %xmm8, %xmm1
    mulpd    %xmm0, %xmm1
    movapd    %xmm13, %xmm3
    movhlps    %xmm3, %xmm3
    movaps    %xmm3, 80(%rsp)
    mulpd    %xmm3, %xmm1
    subpd    %xmm1, %xmm2
    movapd    %xmm2, 592(%rsp)
    movapd    %xmm4, %xmm1
    unpckhpd    336(%rsp), %xmm1
    mulpd    %xmm0, %xmm1
    movapd    %xmm13, %xmm6
    unpckhpd    %xmm12, %xmm6
    mulpd    %xmm1, %xmm6
    subpd    %xmm6, %xmm7
    movapd    %xmm7, 640(%rsp)
    movapd    624(%rsp), %xmm2
    movapd    %xmm2, %xmm3
    movapd    512(%rsp), %xmm0
    mulpd    %xmm0, %xmm3
    mulpd    %xmm5, %xmm3
    movapd    544(%rsp), %xmm6
    mulpd    %xmm9, %xmm6
    mulpd    %xmm0, %xmm9
    movapd    %xmm2, %xmm0
    mulsd    272(%rsp), %xmm0
    mulsd    %xmm5, %xmm0
    movapd    %xmm0, 624(%rsp)
    movlhps    %xmm5, %xmm5
    mulpd    %xmm5, %xmm9
    mulpd    %xmm6, %xmm5
    movapd    %xmm5, 560(%rsp)
    movapd    %xmm13, %xmm2
    unpcklpd    %xmm12, %xmm2
    movapd    608(%rsp), %xmm4
    movapd    %xmm4, %xmm6
    movapd    336(%rsp), %xmm1
    unpcklpd    %xmm1, %xmm6
    movapd    528(%rsp), %xmm0
    mulpd    %xmm0, %xmm6
    movapd    %xmm0, %xmm5
    mulpd    %xmm6, %xmm2
    movapd    %xmm2, 160(%rsp)
    movapd    496(%rsp), %xmm7
    mulpd    %xmm7, %xmm1
    mulpd    %xmm12, %xmm1
    movapd    176(%rsp), %xmm0
    movapd    %xmm0, %xmm2
    mulpd    %xmm5, %xmm2
    movlhps    %xmm12, %xmm12
    mulpd    %xmm12, %xmm2
    movapd    %xmm2, 144(%rsp)
    mulpd    %xmm7, %xmm0
    mulpd    %xmm12, %xmm0
    addpd    %xmm15, %xmm0
    movapd    %xmm0, %xmm6
    movapd    448(%rsp), %xmm0
    addpd    %xmm11, %xmm0
    movapd    %xmm0, 448(%rsp)
    movapd    544(%rsp), %xmm2
    mulpd    %xmm2, %xmm4
    mulpd    %xmm13, %xmm4
    movapd    %xmm10, %xmm11
    mulpd    %xmm5, %xmm11
    movlhps    %xmm13, %xmm13
    mulpd    %xmm13, %xmm11
    mulpd    %xmm2, %xmm10
    mulpd    %xmm13, %xmm10
    movapd    112(%rsp), %xmm5
    addsd    %xmm14, %xmm5
    addpd    %xmm6, %xmm10
    movapd    %xmm10, 336(%rsp)
    addpd    192(%rsp), %xmm1
    addpd    %xmm1, %xmm4
    movapd    %xmm4, 608(%rsp)
    movapd    656(%rsp), %xmm0
    subpd    128(%rsp), %xmm0
    movapd    %xmm0, 656(%rsp)
    movapd    368(%rsp), %xmm0
    mulpd    %xmm7, %xmm0
    mulpd    96(%rsp), %xmm0
    addpd    576(%rsp), %xmm0
    movapd    592(%rsp), %xmm15
    subpd    %xmm9, %xmm15
    mulpd    %xmm2, %xmm8
    mulpd    80(%rsp), %xmm8
    addpd    %xmm0, %xmm8
    movapd    %xmm8, %xmm0
    movapd    640(%rsp), %xmm2
    subpd    %xmm3, %xmm2
    movapd    %xmm2, 640(%rsp)
    movapd    352(%rsp), %xmm12
    movapd    %xmm12, %xmm9
    subpd    464(%rsp), %xmm9
    movapd    416(%rsp), %xmm8
    subpd    %xmm8, %xmm12
    movapd    %xmm12, %xmm3
    mulsd    %xmm3, %xmm3
    movapd    %xmm9, %xmm2
    mulsd    %xmm2, %xmm2
    unpcklpd    %xmm2, %xmm3
    movapd    %xmm9, %xmm4
    shufpd    $1, %xmm4, %xmm4
    movapd    %xmm12, %xmm2
    shufpd    $1, %xmm2, %xmm2
    mulsd    %xmm4, %xmm4
    mulsd    %xmm2, %xmm2
    unpcklpd    %xmm4, %xmm2
    addpd    %xmm3, %xmm2
    movapd    400(%rsp), %xmm1
    movapd    %xmm1, %xmm6
    shufpd    $1, %xmm6, %xmm6
    movapd    %xmm6, %xmm10
    movapd    432(%rsp), %xmm3
    subsd    %xmm3, %xmm10
    subsd    %xmm1, %xmm6
    movapd    %xmm1, %xmm4
    movapd    %xmm3, %xmm1
    subsd    %xmm4, %xmm1
    movapd    %xmm1, %xmm13
    movapd    %xmm1, %xmm14
    unpcklpd    %xmm6, %xmm1
    movapd    %xmm1, 432(%rsp)
    unpcklpd    %xmm10, %xmm6
    movapd    %xmm6, %xmm1
    mulpd    %xmm1, %xmm1
    addpd    %xmm2, %xmm1
    movapd    560(%rsp), %xmm2
    addpd    448(%rsp), %xmm2
    movapd    %xmm2, 560(%rsp)
    sqrtpd    %xmm1, %xmm1
    movapd    %xmm1, %xmm2
    mulpd    %xmm2, %xmm2
    mulpd    %xmm1, %xmm2
    movapd    624(%rsp), %xmm1
    addsd    %xmm5, %xmm1
    movapd    %xmm1, 624(%rsp)
    movapd    LCPI0_1(%rip), %xmm7
    divpd    %xmm2, %xmm7
    movapd    %xmm9, %xmm2
    movapd    512(%rsp), %xmm5
    mulpd    %xmm5, %xmm2
    movapd    %xmm7, %xmm1
    movhlps    %xmm1, %xmm1
    mulpd    %xmm1, %xmm2
    subpd    %xmm2, %xmm0
    movapd    480(%rsp), %xmm4
    mulpd    %xmm4, %xmm9
    mulpd    %xmm1, %xmm9
    movapd    656(%rsp), %xmm1
    subpd    144(%rsp), %xmm1
    movapd    %xmm1, 656(%rsp)
    movapd    %xmm7, %xmm1
    shufpd    $1, %xmm1, %xmm1
    mulsd    256(%rsp), %xmm10
    mulsd    %xmm1, %xmm10
    subpd    %xmm11, %xmm15
    movapd    %xmm15, 592(%rsp)
    movapd    464(%rsp), %xmm3
    subpd    %xmm8, %xmm3
    movapd    %xmm3, %xmm2
    mulsd    %xmm2, %xmm2
    movapd    %xmm3, %xmm1
    shufpd    $1, %xmm1, %xmm1
    mulsd    %xmm1, %xmm1
    addsd    %xmm2, %xmm1
    mulsd    %xmm13, %xmm13
    addsd    %xmm1, %xmm13
    xorps    %xmm1, %xmm1
    sqrtsd    %xmm13, %xmm1
    movapd    %xmm1, %xmm2
    mulsd    %xmm2, %xmm2
    mulsd    %xmm1, %xmm2
    movsd    LCPI0_0(%rip), %xmm11
    movapd    %xmm11, %xmm1
    divsd    %xmm2, %xmm1
    movapd    640(%rsp), %xmm13
    subpd    160(%rsp), %xmm13
    movapd    %xmm13, 640(%rsp)
    mulpd    32(%rsp), %xmm6
    mulsd    224(%rsp), %xmm14
    mulsd    %xmm1, %xmm14
    mulpd    %xmm7, %xmm6
    movapd    %xmm1, 448(%rsp)
    unpcklpd    %xmm7, %xmm1
    movapd    %xmm4, %xmm2
    mulpd    %xmm12, %xmm2
    movapd    528(%rsp), %xmm8
    mulpd    %xmm8, %xmm12
    movlhps    %xmm7, %xmm7
    mulpd    %xmm7, %xmm12
    subpd    %xmm12, %xmm0
    movapd    %xmm0, 368(%rsp)
    mulpd    %xmm2, %xmm7
    addpd    560(%rsp), %xmm9
    addpd    336(%rsp), %xmm7
    movapd    608(%rsp), %xmm4
    movapd    %xmm4, %xmm2
    addpd    %xmm6, %xmm2
    subpd    %xmm6, %xmm4
    addsd    624(%rsp), %xmm10
    mulpd    %xmm3, %xmm5
    mulpd    %xmm8, %xmm3
    movaps    448(%rsp), %xmm0
    movlhps    %xmm0, %xmm0
    mulpd    %xmm0, %xmm3
    subpd    %xmm3, %xmm9
    subsd    %xmm14, %xmm10
    mulpd    %xmm5, %xmm0
    movsd    %xmm2, %xmm4
    addpd    %xmm7, %xmm0
    movapd    %xmm0, 448(%rsp)
    movapd    %xmm0, %xmm14
    movapd    432(%rsp), %xmm0
    mulpd    208(%rsp), %xmm0
    mulpd    %xmm0, %xmm1
    movapd    %xmm4, %xmm2
    movupd    %xmm9, 192(%rcx)
    addpd    %xmm1, %xmm2
    subpd    %xmm1, %xmm4
    movsd    %xmm2, %xmm4
    movapd    %xmm4, 608(%rsp)
    movapd    %xmm4, %xmm5
    movsd    %xmm10, 208(%rcx)
    movupd    200(%rcx), %xmm2
    movapd    656(%rsp), %xmm0
    movapd    LCPI0_1(%rip), %xmm3
    mulpd    %xmm3, %xmm0
    addpd    320(%rsp), %xmm0
    movapd    %xmm0, 192(%rsp)
    movapd    %xmm0, %xmm7
    movapd    %xmm15, %xmm8
    mulpd    %xmm3, %xmm8
    addpd    304(%rsp), %xmm8
    movapd    %xmm8, 80(%rsp)
    movapd    %xmm13, %xmm4
    movapd    368(%rsp), %xmm10
    mulpd    %xmm3, %xmm10
    addpd    352(%rsp), %xmm10
    movapd    %xmm10, 432(%rsp)
    movupd    176(%rcx), %xmm1
    mulpd    %xmm3, %xmm4
    mulsd    %xmm11, %xmm9
    movapd    %xmm11, %xmm15
    addsd    464(%rsp), %xmm9
    movsd    %xmm9, 168(%rcx)
    mulpd    %xmm3, %xmm2
    addpd    %xmm1, %xmm2
    mulpd    %xmm3, %xmm14
    addpd    384(%rsp), %xmm4
    movapd    %xmm4, 176(%rsp)
    addpd    416(%rsp), %xmm14
    movapd    %xmm14, 336(%rsp)
    movapd    %xmm5, %xmm9
    mulpd    %xmm3, %xmm9
    addpd    400(%rsp), %xmm9
    movapd    %xmm9, 96(%rsp)
    movapd    %xmm7, %xmm11
    subpd    %xmm8, %xmm11
    movapd    %xmm4, %xmm6
    shufpd    $1, %xmm6, %xmm6
    movapd    %xmm11, %xmm0
    mulsd    %xmm0, %xmm0
    movapd    %xmm11, %xmm1
    shufpd    $1, %xmm1, %xmm1
    mulsd    %xmm1, %xmm1
    addsd    %xmm0, %xmm1
    movapd    %xmm6, %xmm0
    movapd    %xmm6, %xmm13
    subsd    %xmm4, %xmm0
    movapd    %xmm0, 624(%rsp)
    mulsd    %xmm0, %xmm0
    addsd    %xmm1, %xmm0
    movapd    %xmm9, %xmm1
    movhlps    %xmm1, %xmm1
    movapd    %xmm4, %xmm3
    movapd    %xmm4, %xmm12
    subpd    %xmm1, %xmm3
    movapd    %xmm3, 560(%rsp)
    movapd    %xmm3, %xmm4
    movapd    %xmm7, %xmm6
    movapd    %xmm6, %xmm5
    subpd    %xmm10, %xmm5
    movapd    %xmm5, 352(%rsp)
    movapd    %xmm8, %xmm3
    subpd    %xmm10, %xmm3
    movapd    %xmm3, 576(%rsp)
    movapd    %xmm3, %xmm1
    unpcklpd    %xmm5, %xmm1
    mulpd    %xmm1, %xmm1
    unpckhpd    %xmm5, %xmm3
    mulpd    %xmm3, %xmm3
    addpd    %xmm1, %xmm3
    xorps    %xmm1, %xmm1
    sqrtsd    %xmm0, %xmm1
    movapd    %xmm1, %xmm0
    mulsd    %xmm0, %xmm0
    mulsd    %xmm1, %xmm0
    movapd    %xmm4, %xmm1
    mulpd    %xmm1, %xmm1
    addpd    %xmm3, %xmm1
    movupd    %xmm2, 176(%rcx)
    shufpd    $1, %xmm2, %xmm2
    subsd    %xmm2, %xmm13
    movapd    %xmm13, 464(%rsp)
    movupd    168(%rcx), %xmm2
    movapd    %xmm2, 400(%rsp)
    movapd    %xmm6, %xmm3
    subpd    %xmm2, %xmm3
    movapd    %xmm3, 416(%rsp)
    movapd    %xmm3, %xmm2
    mulsd    %xmm2, %xmm2
    movapd    %xmm3, %xmm4
    shufpd    $1, %xmm4, %xmm4
    mulsd    %xmm4, %xmm4
    addsd    %xmm2, %xmm4
    movapd    %xmm13, %xmm3
    mulsd    %xmm3, %xmm3
    addsd    %xmm4, %xmm3
    sqrtpd    %xmm1, %xmm2
    movapd    %xmm2, %xmm1
    mulpd    %xmm1, %xmm1
    mulpd    %xmm2, %xmm1
    subpd    %xmm14, %xmm6
    movapd    %xmm6, 128(%rsp)
    movapd    %xmm8, %xmm2
    movapd    %xmm8, %xmm7
    subpd    %xmm14, %xmm2
    movapd    %xmm2, 160(%rsp)
    movapd    %xmm2, %xmm4
    unpcklpd    %xmm6, %xmm4
    mulpd    %xmm4, %xmm4
    movapd    %xmm2, %xmm5
    unpckhpd    %xmm6, %xmm5
    mulpd    %xmm5, %xmm5
    addpd    %xmm4, %xmm5
    movapd    %xmm9, %xmm2
    movapd    %xmm2, %xmm4
    movlhps    %xmm4, %xmm4
    movaps    %xmm4, 144(%rsp)
    movapd    %xmm12, %xmm13
    subpd    %xmm4, %xmm13
    movapd    %xmm13, %xmm4
    mulpd    %xmm4, %xmm4
    addpd    %xmm5, %xmm4
    divsd    %xmm0, %xmm15
    sqrtpd    %xmm4, %xmm0
    movapd    %xmm0, %xmm4
    mulpd    %xmm4, %xmm4
    mulpd    %xmm0, %xmm4
    movapd    LCPI0_1(%rip), %xmm10
    movapd    %xmm10, %xmm0
    divpd    %xmm1, %xmm0
    movapd    %xmm0, 112(%rsp)
    movapd    %xmm10, %xmm8
    divpd    %xmm4, %xmm8
    xorps    %xmm0, %xmm0
    sqrtsd    %xmm3, %xmm0
    movapd    %xmm0, %xmm1
    mulsd    %xmm1, %xmm1
    mulsd    %xmm0, %xmm1
    movsd    %xmm1, 304(%rsp)
    movapd    %xmm11, %xmm5
    mulpd    544(%rsp), %xmm5
    movapd    624(%rsp), %xmm14
    movapd    %xmm14, %xmm0
    mulsd    272(%rsp), %xmm0
    movapd    288(%rsp), %xmm9
    mulsd    %xmm9, %xmm14
    mulsd    %xmm15, %xmm0
    movapd    %xmm0, 64(%rsp)
    mulsd    %xmm15, %xmm14
    movapd    %xmm14, %xmm3
    movlhps    %xmm15, %xmm15
    movapd    496(%rsp), %xmm4
    mulpd    %xmm4, %xmm11
    mulpd    %xmm15, %xmm5
    mulpd    %xmm15, %xmm11
    movapd    %xmm11, %xmm14
    movapd    %xmm2, %xmm0
    movsd    %xmm12, %xmm0
    movapd    %xmm0, %xmm1
    movsd    184(%rcx), %xmm0
    movsd    %xmm0, %xmm2
    movapd    %xmm2, 384(%rsp)
    movlhps    %xmm0, %xmm0
    movapd    %xmm1, %xmm2
    subpd    %xmm0, %xmm2
    movapd    %xmm2, 320(%rsp)
    movapd    %xmm7, %xmm15
    movapd    400(%rsp), %xmm0
    subpd    %xmm0, %xmm15
    movapd    432(%rsp), %xmm6
    subpd    %xmm0, %xmm6
    movapd    %xmm6, 48(%rsp)
    movapd    %xmm15, %xmm0
    unpcklpd    %xmm6, %xmm0
    mulpd    %xmm0, %xmm0
    movapd    %xmm15, %xmm1
    unpckhpd    %xmm6, %xmm1
    mulpd    %xmm1, %xmm1
    addpd    %xmm0, %xmm1
    movapd    %xmm2, %xmm0
    mulpd    %xmm0, %xmm0
    addpd    %xmm1, %xmm0
    movsd    LCPI0_0(%rip), %xmm1
    divsd    304(%rsp), %xmm1
    sqrtpd    %xmm0, %xmm0
    movapd    %xmm0, %xmm7
    mulpd    %xmm7, %xmm7
    mulpd    %xmm0, %xmm7
    movapd    656(%rsp), %xmm0
    subpd    %xmm5, %xmm0
    movapd    %xmm0, 656(%rsp)
    movapd    %xmm10, %xmm11
    divpd    %xmm7, %xmm11
    movapd    %xmm11, 304(%rsp)
    movapd    640(%rsp), %xmm0
    addsd    %xmm0, %xmm3
    movapd    %xmm3, 624(%rsp)
    movapd    %xmm0, %xmm2
    shufpd    $1, %xmm2, %xmm2
    subsd    64(%rsp), %xmm2
    movapd    352(%rsp), %xmm0
    movapd    480(%rsp), %xmm7
    mulpd    %xmm7, %xmm0
    movaps    112(%rsp), %xmm12
    movaps    %xmm12, %xmm3
    movhlps    %xmm3, %xmm3
    movaps    %xmm3, 64(%rsp)
    mulpd    %xmm3, %xmm0
    movapd    656(%rsp), %xmm3
    subpd    %xmm0, %xmm3
    movapd    %xmm3, 656(%rsp)
    movapd    %xmm4, %xmm0
    movapd    416(%rsp), %xmm6
    mulpd    %xmm6, %xmm0
    movapd    512(%rsp), %xmm10
    mulpd    %xmm10, %xmm6
    movapd    464(%rsp), %xmm4
    movapd    %xmm4, %xmm3
    mulsd    %xmm9, %xmm4
    mulsd    %xmm1, %xmm4
    movapd    %xmm4, 464(%rsp)
    unpcklpd    %xmm1, %xmm11
    movapd    %xmm1, %xmm9
    movlhps    %xmm9, %xmm9
    mulpd    %xmm9, %xmm6
    movapd    %xmm6, 416(%rsp)
    mulpd    %xmm0, %xmm9
    movapd    %xmm14, %xmm1
    addpd    592(%rsp), %xmm1
    movapd    %xmm13, %xmm6
    movapd    560(%rsp), %xmm14
    unpckhpd    %xmm14, %xmm6
    movapd    %xmm13, %xmm4
    unpcklpd    %xmm14, %xmm13
    movapd    %xmm13, 592(%rsp)
    mulpd    %xmm7, %xmm14
    movapd    %xmm7, %xmm13
    movaps    %xmm12, %xmm7
    mulpd    %xmm7, %xmm14
    movapd    %xmm8, %xmm5
    unpckhpd    %xmm7, %xmm5
    mulpd    528(%rsp), %xmm4
    mulpd    %xmm8, %xmm4
    movapd    %xmm4, 640(%rsp)
    movapd    %xmm8, %xmm4
    movapd    %xmm8, %xmm12
    unpcklpd    %xmm7, %xmm8
    movapd    576(%rsp), %xmm0
    mulpd    %xmm13, %xmm0
    movlhps    %xmm7, %xmm7
    mulpd    %xmm7, %xmm0
    subpd    %xmm0, %xmm1
    movapd    %xmm1, 560(%rsp)
    movapd    624(%rsp), %xmm0
    unpcklpd    %xmm2, %xmm0
    subpd    %xmm14, %xmm0
    movapd    %xmm0, 624(%rsp)
    movapd    240(%rsp), %xmm1
    mulsd    %xmm1, %xmm3
    movapd    320(%rsp), %xmm0
    mulsd    %xmm1, %xmm0
    unpcklpd    %xmm3, %xmm0
    mulpd    %xmm11, %xmm0
    movapd    656(%rsp), %xmm2
    subpd    416(%rsp), %xmm2
    movapd    %xmm2, 656(%rsp)
    movapd    %xmm15, %xmm11
    mulpd    %xmm10, %xmm11
    movaps    304(%rsp), %xmm2
    movlhps    %xmm2, %xmm2
    movapd    544(%rsp), %xmm10
    mulpd    %xmm10, %xmm15
    mulpd    %xmm2, %xmm11
    mulpd    %xmm2, %xmm15
    movapd    128(%rsp), %xmm13
    movapd    %xmm13, %xmm2
    movapd    528(%rsp), %xmm3
    mulpd    %xmm3, %xmm2
    movhlps    %xmm4, %xmm4
    mulpd    %xmm4, %xmm2
    movapd    496(%rsp), %xmm14
    mulpd    %xmm14, %xmm13
    mulpd    %xmm4, %xmm13
    movupd    192(%rcx), %xmm1
    addpd    %xmm1, %xmm9
    addpd    448(%rsp), %xmm13
    movapd    %xmm10, %xmm1
    movapd    %xmm10, %xmm4
    movapd    160(%rsp), %xmm10
    mulpd    %xmm10, %xmm1
    mulpd    %xmm3, %xmm10
    movlhps    %xmm12, %xmm12
    mulpd    %xmm12, %xmm10
    movapd    %xmm10, %xmm3
    mulpd    %xmm1, %xmm12
    addpd    %xmm13, %xmm12
    movapd    560(%rsp), %xmm13
    subpd    %xmm11, %xmm13
    mulpd    %xmm14, %xmm6
    mulpd    %xmm6, %xmm5
    addpd    608(%rsp), %xmm5
    movapd    624(%rsp), %xmm1
    subpd    %xmm0, %xmm1
    movapd    %xmm1, 624(%rsp)
    movapd    %xmm4, %xmm1
    movapd    592(%rsp), %xmm0
    mulpd    %xmm1, %xmm0
    mulpd    %xmm0, %xmm8
    addpd    %xmm5, %xmm8
    movapd    656(%rsp), %xmm0
    subpd    %xmm2, %xmm0
    movapd    %xmm0, 656(%rsp)
    movapd    352(%rsp), %xmm2
    mulpd    %xmm14, %xmm2
    mulpd    64(%rsp), %xmm2
    addpd    368(%rsp), %xmm2
    movapd    576(%rsp), %xmm10
    mulpd    %xmm1, %xmm10
    mulpd    %xmm7, %xmm10
    addpd    %xmm2, %xmm10
    movapd    384(%rsp), %xmm4
    subpd    144(%rsp), %xmm4
    movapd    %xmm4, 384(%rsp)
    addpd    %xmm9, %xmm15
    movapd    %xmm15, 608(%rsp)
    movapd    432(%rsp), %xmm2
    movapd    336(%rsp), %xmm0
    subpd    %xmm0, %xmm2
    movapd    400(%rsp), %xmm7
    subpd    %xmm0, %xmm7
    movapd    %xmm7, %xmm0
    unpcklpd    %xmm2, %xmm0
    mulpd    %xmm0, %xmm0
    movapd    %xmm7, %xmm1
    unpckhpd    %xmm2, %xmm1
    mulpd    %xmm1, %xmm1
    addpd    %xmm0, %xmm1
    movapd    %xmm13, %xmm0
    subpd    %xmm3, %xmm0
    movapd    %xmm0, 560(%rsp)
    movapd    %xmm4, %xmm0
    mulpd    %xmm0, %xmm0
    addpd    %xmm1, %xmm0
    sqrtpd    %xmm0, %xmm0
    movapd    %xmm0, %xmm1
    mulpd    %xmm1, %xmm1
    mulpd    %xmm0, %xmm1
    movapd    LCPI0_1(%rip), %xmm15
    divpd    %xmm1, %xmm15
    movapd    320(%rsp), %xmm5
    movapd    %xmm5, %xmm14
    mulpd    (%rsp), %xmm14
    movapd    304(%rsp), %xmm11
    mulpd    %xmm11, %xmm14
    movapd    %xmm15, %xmm0
    unpckhpd    %xmm11, %xmm0
    movapd    480(%rsp), %xmm3
    movapd    %xmm3, %xmm1
    movapd    48(%rsp), %xmm4
    mulpd    %xmm4, %xmm1
    movapd    512(%rsp), %xmm13
    mulpd    %xmm13, %xmm4
    movhlps    %xmm11, %xmm11
    mulpd    %xmm11, %xmm4
    subpd    %xmm4, %xmm10
    mulpd    %xmm1, %xmm11
    movapd    624(%rsp), %xmm9
    subpd    640(%rsp), %xmm9
    movapd    %xmm9, 640(%rsp)
    movapd    %xmm2, %xmm1
    movapd    528(%rsp), %xmm6
    mulpd    %xmm6, %xmm1
    movapd    %xmm15, %xmm4
    movhlps    %xmm4, %xmm4
    mulpd    %xmm3, %xmm2
    mulpd    %xmm4, %xmm1
    mulpd    %xmm4, %xmm2
    addpd    %xmm12, %xmm2
    shufpd    $1, %xmm5, %xmm5
    mulsd    240(%rsp), %xmm5
    movapd    %xmm5, %xmm4
    movapd    384(%rsp), %xmm5
    movapd    %xmm5, %xmm3
    shufpd    $1, %xmm3, %xmm3
    mulsd    256(%rsp), %xmm3
    unpcklpd    %xmm4, %xmm3
    subpd    %xmm1, %xmm10
    movapd    %xmm10, 576(%rsp)
    movapd    464(%rsp), %xmm1
    addsd    208(%rcx), %xmm1
    addsd    %xmm14, %xmm1
    shufpd    $1, %xmm14, %xmm14
    addsd    %xmm1, %xmm14
    mulpd    %xmm0, %xmm3
    movapd    %xmm8, %xmm0
    addpd    %xmm3, %xmm0
    subpd    %xmm3, %xmm8
    addpd    608(%rsp), %xmm11
    movsd    %xmm0, %xmm8
    movapd    %xmm5, %xmm1
    movapd    %xmm1, %xmm0
    mulsd    224(%rsp), %xmm0
    mulsd    %xmm15, %xmm0
    subsd    %xmm0, %xmm14
    mulpd    %xmm7, %xmm13
    mulpd    %xmm6, %xmm7
    mulpd    208(%rsp), %xmm1
    mulpd    %xmm15, %xmm1
    movlhps    %xmm15, %xmm15
    mulpd    %xmm15, %xmm7
    mulpd    %xmm13, %xmm15
    subpd    %xmm7, %xmm11
    movupd    %xmm11, 192(%rcx)
    addpd    %xmm2, %xmm15
    movapd    %xmm8, %xmm0
    addpd    %xmm1, %xmm0
    movsd    %xmm14, 208(%rcx)
    subpd    %xmm1, %xmm8
    movapd    656(%rsp), %xmm7
    movapd    LCPI0_1(%rip), %xmm3
    mulpd    %xmm3, %xmm7
    addpd    192(%rsp), %xmm7
    movapd    560(%rsp), %xmm5
    mulpd    %xmm3, %xmm5
    addpd    80(%rsp), %xmm5
    movapd    %xmm9, %xmm12
    mulpd    %xmm3, %xmm12
    addpd    176(%rsp), %xmm12
    mulpd    %xmm3, %xmm10
    addpd    432(%rsp), %xmm10
    movapd    %xmm11, %xmm2
    mulsd    LCPI0_0(%rip), %xmm2
    addsd    400(%rsp), %xmm2
    movupd    176(%rcx), %xmm4
    movupd    200(%rcx), %xmm1
    movsd    %xmm2, 168(%rcx)
    mulpd    %xmm3, %xmm1
    addpd    %xmm4, %xmm1
    movapd    %xmm15, %xmm4
    mulpd    %xmm3, %xmm4
    addpd    336(%rsp), %xmm4
    movapd    %xmm8, %xmm6
    movsd    %xmm0, %xmm6
    movapd    %xmm6, %xmm9
    mulpd    %xmm3, %xmm9
    addpd    96(%rsp), %xmm9
    movupd    %xmm1, 176(%rcx)
    addq    $-2, %rdx
    shufpd    $1, %xmm1, %xmm1
    jne    .LBB0_5
    movupd    %xmm7, (%rcx)
    movhpd    %xmm12, 16(%rcx)
    movupd    %xmm5, 56(%rcx)
    movlpd    %xmm12, 72(%rcx)
    movaps    656(%rsp), %xmm1
    movups    %xmm1, 24(%rcx)
    movapd    640(%rsp), %xmm2
    movhpd    %xmm2, 40(%rcx)
    movaps    560(%rsp), %xmm1
    movups    %xmm1, 80(%rcx)
    movlpd    %xmm2, 96(%rcx)
    movupd    %xmm10, 112(%rcx)
    movhpd    %xmm9, 128(%rcx)
    movapd    576(%rsp), %xmm1
    movupd    %xmm1, 136(%rcx)
    movhpd    %xmm8, 152(%rcx)
    movupd    %xmm4, 224(%rcx)
    movlpd    %xmm9, 240(%rcx)
    movupd    %xmm15, 248(%rcx)
    movlpd    %xmm0, 264(%rcx)
.LBB0_7:
    movaps    672(%rsp), %xmm6
    movaps    688(%rsp), %xmm7
    movaps    704(%rsp), %xmm8
    movaps    720(%rsp), %xmm9
    movaps    736(%rsp), %xmm10
    movaps    752(%rsp), %xmm11
    movaps    768(%rsp), %xmm12
    movaps    784(%rsp), %xmm13
    movaps    800(%rsp), %xmm14
    movaps    816(%rsp), %xmm15
    addq    $840, %rsp
    retq

This asm is partially unrolled, but it’s also a bit messy and too much long. I think it has unrolled in a not good way (there are three loops after all, you want to unroll only the inner two).

I’d like to annotate the code of this kernel like this (leaving the outer loop unmodified):

fn advance(bodies: &mut Planets, dt: f64, n: usize) {
    #[unroll(never)] for _ in 0 .. n {
        #[unroll(try_full)] for i in 0 .. bodies.len() {
            #[unroll(try_full)] for j in i + 1 .. bodies.len() {
                let dx = bodies[i].x - bodies[j].x;
                let dy = bodies[i].y - bodies[j].y;
                let dz = bodies[i].z - bodies[j].z;

                let distance = (dx * dx + dy * dy + dz * dz).sqrt();
                let mag = dt / (distance * distance * distance);

                bodies[i].vx -= dx * bodies[j].mass * mag;
                bodies[i].vy -= dy * bodies[j].mass * mag;
                bodies[i].vz -= dz * bodies[j].mass * mag;

                bodies[j].vx += dx * bodies[i].mass * mag;
                bodies[j].vy += dy * bodies[i].mass * mag;
                bodies[j].vz += dz * bodies[i].mass * mag;
            }
        }

        for b in bodies.iter_mut() {
            b.x += dt * b.vx;
            b.y += dt * b.vy;
            b.z += dt * b.vz;
        }
    }
}

Note: if the LLVM doesn’t have an annotation to unroll a single specific loop, then the Rust-front end has to do it…

23 Likes

I have compared the manually unrolled nbody version, with a SIMD version here:

And the unrolled version is faster on my PC than the SIMD version (with N=50_000_000, the simd version runs in about 3.97 seconds, while the manually unrolled non-simd version runs in about 3.64 seconds).

3 Likes

What’s the default threshold value?

1 Like

Here there are some initializations:

http://llvm.org/docs/doxygen/html/LoopUnrollPass_8cpp_source.html

00099   // Set up the defaults
00100   UP.Threshold = 150;
00101   UP.PercentDynamicCostSavedThreshold = 20;
00102   UP.DynamicCostSavingsDiscount = 2000;
00103   UP.OptSizeThreshold = 50;
00104   UP.PartialThreshold = UP.Threshold;
00105   UP.PartialOptSizeThreshold = UP.OptSizeThreshold;
00106   UP.Count = 0;
00107   UP.MaxCount = UINT_MAX;
00108   UP.Partial = false;
00109   UP.Runtime = false;
00110   UP.AllowExpensiveTripCount = false;
1 Like

I have another code example to show. This solves the Euler problem n.222 (from the original Java code https://github.com/nayuki/Project-Euler-solutions/blob/master/java/p222.java with some optimizations added):

#![feature(box_syntax)]

const N_RADII: usize = 21;
const M: usize = 1 << N_RADII;

struct E222 {
    sphere_radii: [f32; N_RADII],
    min_len: Box<[[f32; N_RADII]; M]>
}

fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}

impl E222 {
    fn find_min_len(&mut self, current_sphere_index: usize, set_of_spheres: usize) -> f32 {
        debug_assert!((set_of_spheres & (1 << current_sphere_index)) != 0);

        if self.min_len[set_of_spheres][current_sphere_index] == 0.0 {
            self.min_len[set_of_spheres][current_sphere_index] =
                if set_of_spheres.count_ones() == 1 {
                    self.sphere_radii[current_sphere_index]
                } else {
                    let mut result = std::f32::INFINITY;
                    let new_set_of_spheres = set_of_spheres ^ (1 << current_sphere_index);
                    for i in 0 .. N_RADII {
                        if (new_set_of_spheres & (1 << i)) == 0 {
                            continue;
                        }
                        let temp = ((self.sphere_radii[i] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(i, new_set_of_spheres);
                        result = min(result, temp);
                    }
                    result
                };
        }

        self.min_len[set_of_spheres][current_sphere_index]
    }
}

fn main() {
    let mut e222 = E222 {
        sphere_radii: [0f32; N_RADII],
        min_len: box [[0f32; N_RADII]; M]
    };

    for (i, sri) in e222.sphere_radii.iter_mut().enumerate() {
        *sri = (i as f32 + 30.0) * 1000.0;
    }

    let mut res = std::f32::INFINITY;
    for i in 0 .. N_RADII {
        res = min(res, e222.find_min_len(i, (1 << N_RADII) - 1) + e222.sphere_radii[i]);
    }

    println!("{}", res.round() as u32); // 1_590_933
}

The same with the inner loop unrolled manually:

#![feature(box_syntax)]

const N_RADII: usize = 21;
const M: usize = 1 << N_RADII;

struct E222 {
    sphere_radii: [f32; N_RADII],
    min_len: Box<[[f32; N_RADII]; M]>
}

fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}

impl E222 {
    fn find_min_len(&mut self, current_sphere_index: usize, set_of_spheres: usize) -> f32 {
        debug_assert!((set_of_spheres & (1 << current_sphere_index)) != 0);

        if self.min_len[set_of_spheres][current_sphere_index] == 0.0 {
            self.min_len[set_of_spheres][current_sphere_index] =
                if set_of_spheres.count_ones() == 1 {
                    self.sphere_radii[current_sphere_index]
                } else {
                    let mut result = std::f32::INFINITY;
                    let new_set_of_spheres = set_of_spheres ^ (1 << current_sphere_index);

                    // Loop manually unrolled 21 times.

                    if (new_set_of_spheres & (1 << 0)) != 0 {
                        let temp = ((self.sphere_radii[0] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(0, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 1)) != 0 {
                        let temp = ((self.sphere_radii[1] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(1, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 2)) != 0 {
                        let temp = ((self.sphere_radii[2] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(2, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 3)) != 0 {
                        let temp = ((self.sphere_radii[3] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(3, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 4)) != 0 {
                        let temp = ((self.sphere_radii[4] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(4, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 5)) != 0 {
                        let temp = ((self.sphere_radii[5] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(5, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 6)) != 0 {
                        let temp = ((self.sphere_radii[6] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(6, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 7)) != 0 {
                        let temp = ((self.sphere_radii[7] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(7, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 8)) != 0 {
                        let temp = ((self.sphere_radii[8] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(8, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 9)) != 0 {
                        let temp = ((self.sphere_radii[9] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(9, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 10)) != 0 {
                        let temp = ((self.sphere_radii[10] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(10, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 11)) != 0 {
                        let temp = ((self.sphere_radii[11] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(11, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 12)) != 0 {
                        let temp = ((self.sphere_radii[12] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(12, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 13)) != 0 {
                        let temp = ((self.sphere_radii[13] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(13, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 14)) != 0 {
                        let temp = ((self.sphere_radii[14] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(14, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 15)) != 0 {
                        let temp = ((self.sphere_radii[15] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(15, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 16)) != 0 {
                        let temp = ((self.sphere_radii[16] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(16, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 17)) != 0 {
                        let temp = ((self.sphere_radii[17] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(17, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 18)) != 0 {
                        let temp = ((self.sphere_radii[18] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(18, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 19)) != 0 {
                        let temp = ((self.sphere_radii[19] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(19, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    if (new_set_of_spheres & (1 << 20)) != 0 {
                        let temp = ((self.sphere_radii[20] + self.sphere_radii[current_sphere_index] - 50000.0) * 200000.0).sqrt() +
                                   self.find_min_len(20, new_set_of_spheres);
                        result = min(result, temp);
                    }

                    result
                };
        }

        self.min_len[set_of_spheres][current_sphere_index]
    }
}

fn main() {
    let mut e222 = E222 {
        sphere_radii: [0f32; N_RADII],
        min_len: box [[0f32; N_RADII]; M]
    };

    for (i, sri) in e222.sphere_radii.iter_mut().enumerate() {
        *sri = (i as f32 + 30.0) * 1000.0;
    }

    let mut res = std::f32::INFINITY;
    for i in 0 .. N_RADII {
        res = min(res, e222.find_min_len(i, (1 << N_RADII) - 1) + e222.sphere_radii[i]);
    }

    println!("{}", res.round() as u32); // 1_590_933
}

The first version runs in about 2.99 seconds on my PC, while the second runs in about 2.21 seconds. With a #[unroll(try_full)] or #[unroll(full)] annotation added to the first program you can avoid the code duplication and its potential mistakes, and the code keeps being flexible (in the second code if you change the N_RADII constant you have to modify the code). And compiling the first program with -C llvm-args="-unroll-threshold=10000" doesn’t improve its performance.

Should be easy enough to implement unroll_for!(...) that will take care of the code duplication, that’s what macros are for!

As usual in engineering, you have first to decide how much well you want your job done.

#[unroll(full)] gives an error if the compiler can't do a full unrolling (like when the loop bounds are dynamic), but #[unroll(try_full)] doesn't give an error in that case, and unrolls the dynamic loop some times (like 4 or 8 times).

#[unroll(never)] disables loop unrolling for a loop to reduce code bloat and to make the asm more readable to humans (it's like when I use inline(never) to read the asm of a function more clearly).

#[unroll(N)] unrolls both fixed-length and dynamic-length for/while loops N times, or it gives a compilation error if it can't.

All those things need compiler integration, a macro can't do that. But of course do you really need that level of control? Perhaps a dumb macro is enough for most cases and keeps the logic out of the compiler.

I think a macro isn't enough, but the problem is still open.

5 Likes

Could you please draft this macro? I am not that fluent with macros, but as I understand you'll need procedural macros for that. not sure if even macros 1.1 will be up to the task.

A similar pragma:

https://software.intel.com/en-us/node/524556

1 Like

Maybe it’s worth to create RFC for loop unrolling? Though I think it will be best to start with loops with constant bounds so it will have a smaller scope.

I would imagine that #[unroll] would be a great use for a procedural or macros 2.0 decorator, once those are fully supported. I think I would prefer to wait until those are stable and solve the problem that way, rather than building it into the compiler.

Btw, this is false. #[inline] is necessary; it signals to the compiler that it should include the AST of the function so that it may be used for cross-crate inlining. The optimizer can't figure out whether or not a function is going to be inlined based on its usage in other crates which it hasn't seen yet.

#[inline(always)] is not too useful though -- it only tends to increase the compile times while rarely being better than the optimizer.

5 Likes

From LLVM slides:

!llvm.loop.* Metadata Optimization Control:
!llvm.loop.interleave.count: Sets the preferred interleaving (modulo unrolling) count
!llvm.loop.vectorize.enable: Enable loop vectorization for this loop, even if vectorization is otherwise disabled
!llvm.loop.vectorize.width: Sets the preferred vector width for loop vectorization
!llvm.loop.unroll.disable: Disable loop unrolling for this loop, even when it is otherwise enabled
!llvm.loop.unroll.full: Suggest that the loop be fully unrolled (overriding the cost model)
!llvm.loop.unroll.count: Sets the preferred unrolling factor for partial and runtime unrolling (overriding the cost model)
Clang exposes these via the pragma:
#pragma clang loop vectorize/interleave/vectorize_width/interleave_count/unroll/unroll_count

I've seen a similar situation with loop unrolling.

I like the proposal, but first of all, neither inlining nor loop unrolling are solved problems, and both perfect inlining and perfect loop unrolling are probably unsolvable in practice. This is well known, and sad, but when the compiler does the wrong thing with respect to inlining or loop unrolling/vectorization/etc. most compilers allow their users to tell the compiler exactly what to do.

So I suggest to add a way to ask the Rust compiler to unroll a specific loop (so this is an enhancement request).

clang (also a LLVM based compiler) has #pragma clang vectorize (+- a couple of options) for this, and so does OpenMP (also available in LLVM). As @nikomatsakis mentions, it might be worth it to phrase the #[unroll] attributes in the context of macros 2.0 (Rust has "user-defined pragmas", so to speak), and maybe phrase the semantics using similar wording to clang, and then also discuss a bit in which cases you want the macros to behave differently to those in clang and openmp.

Also, once we get into macros for loop unrolling, it is important to realize that these macros are actually workarounds optimization bugs. So I think it is very important that they go hand in hand with a macro that will tell you how a particular snippet of code is optimized. LLVM opt will tell you for a particular loop, whether it attempted to unroll it, by how much, what happened for each time (was it profitable? was it vectorizable? why? why not? alignment issues with some variable, some function could not get inlined, ...). That is, if we are going to add an easy way to work around optimization bugs, we should also add an easy way to detect and debug optimization bugs in the first place.

Clang has this, and it is very useful. Most of the time (if you have the LLVM Polly plugin enabled) the only thing you actually need is make some alignment guarantees (e.g. this pointer is 16bit aligned) or mark some function as inline for all optimizations to trigger. Unrolling nitty-willy is rarely the answer if the rest of the code is "optimization friendly".

6 Likes

Something is moving forward: https://github.com/rust-lang/rfcs/issues/2219

3 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.