Array bound tests

A little example function:

fn foo() -> u64 {
    const N: usize = 5_000;
    const M: u64 = 10_000_000_000_000_000;
    let check_every = fto!{(((1_u64 << 63) / M) as f64).log2().ceil(), usize};
    let primes = eratosthenes_sieve(N);
    let mut ways = vec![0_u64; primes.iter().sum::<usize>() + 1];
    ways[0] = 1;
    let mut max: usize = 0;

    for (i, &p) in primes.iter().enumerate() {
        for j in (0 ..= max).rev() {
            unsafe {
                *ways.get_unchecked_mut(j + p) +=
                    *ways.get_unchecked(j);
            }
        }

        if (i + 1) % check_every == 0 {
            for w in ways[..= max].iter_mut() {
                *w = *w % M;
            }
        }
        max += p;
    }

    let sieve = SieveBits::new(max);

    (3 ..= max)
    .step_by(2)
    .filter(|&i| sieve.is_prime(i))
    .fold(ways[2], |acc, i| (acc + ways[i]) % M)
}

In the most critical cycle I've used a unsafe{} with a get_unchecked/get_unchecked_mut to omit array bound tests, because they slow down that code a little. In some simple cases LLVM is able to remove some bound tests, but in most slightly more complex situations it isn't able to. Sometimes you can give some safe or unsafe suggestions to LLVM. The unsafe ones are similar to using get_unchecked, like using an assume() (some people say that assume cause bad codegen, but I usually don't see this downside in practice). The safe ones are asserts, or slicing of slices/vectors/arrays that in some way tell the compiler that an array is long enough. Both introduce new tests in the asm, but sometimes this few tests can help LLVM to remove many other tests (because of Rust semantics that asks for panics to happen only exactly where they are needed instead of slightly before). A third strategy is to use zip and the like to walk arrays in parallel, sometimes this needs a slice::split_at_mut. But in some situations all those strategies fail. I didn't find a safe way to remove in a safe way the array bound tests in that inner loop (if you have ideas I'll welcome them. I can't use a zip because of the borrowck).

Future languages (or present languages like Ada-SPARK) will allow the programmer to prove that some array accesses are always in-bound (and probably some divisions are never by zero) allowing faster code while keeping memory safety and code correctness. Perhaps Rust will do that (using get_unchecked by itself where you have proved a slice access to be in-bound).

Array bound tests are fast, they are just a comparison plus a jump, and sometimes you don't need the comparison because you already have some useful flags because of something else. And the CPU jump predictors are able to waste very little time from a jump that's nearly never taken. The problem is that such array bound tests don't allow LLVM to generate efficient asm code. So it's almost like using 15 years old CPU with a slightly faster clock.

If in that code I use regular vec indexing in the innermost j cycle:

ways[j + p] += *ways[j];

The resulting asm contains the two jae jump to the out-of-bounds panics (panics not shown):

.LBB0_5:
    cmp     rax, rsi
    jae     .LBB0_18
    lea     r10, [r14 + rax]
    cmp     r10, rsi
    jae     .LBB0_19
    mov     rbx, qword ptr [r8 + 8*rax]
    add     qword ptr [rdx + 8*rax], rbx
    add     rax, -1
    jb      .LBB0_5

If I use the get_unchecked/get_unchecked_mut it omits the jumps, vectorizes the code and I think it unrolls it 32 times:

.LBB0_30:
    vmovdqu ymm0, ymmword ptr [rbx + 8*rsi]
    vmovdqu ymm1, ymmword ptr [rbx + 8*rsi - 32]
    vmovdqu ymm2, ymmword ptr [rbx + 8*rsi - 64]
    vmovdqu ymm3, ymmword ptr [rbx + 8*rsi - 96]
    vpaddq  ymm0, ymm0, ymmword ptr [rbp + 8*rsi]
    vpaddq  ymm1, ymm1, ymmword ptr [rbp + 8*rsi - 32]
    vpaddq  ymm2, ymm2, ymmword ptr [rbp + 8*rsi - 64]
    vpaddq  ymm3, ymm3, ymmword ptr [rbp + 8*rsi - 96]
    vmovdqu ymmword ptr [rbx + 8*rsi], ymm0
    vmovdqu ymmword ptr [rbx + 8*rsi - 32], ymm1
    vmovdqu ymmword ptr [rbx + 8*rsi - 64], ymm2
    vmovdqu ymmword ptr [rbx + 8*rsi - 96], ymm3
    vmovdqu ymm0, ymmword ptr [rbx + 8*rsi - 224]
    vmovdqu ymm1, ymmword ptr [rbx + 8*rsi - 192]
    vmovdqu ymm2, ymmword ptr [rbx + 8*rsi - 160]
    vmovdqu ymm3, ymmword ptr [rbx + 8*rsi - 128]
    vpaddq  ymm3, ymm3, ymmword ptr [rbp + 8*rsi - 128]
    vpaddq  ymm2, ymm2, ymmword ptr [rbp + 8*rsi - 160]
    vpaddq  ymm1, ymm1, ymmword ptr [rbp + 8*rsi - 192]
    vpaddq  ymm0, ymm0, ymmword ptr [rbp + 8*rsi - 224]
    vmovdqu ymmword ptr [rbx + 8*rsi - 128], ymm3
    vmovdqu ymmword ptr [rbx + 8*rsi - 160], ymm2
    vmovdqu ymmword ptr [rbx + 8*rsi - 192], ymm1
    vmovdqu ymmword ptr [rbx + 8*rsi - 224], ymm0
    add     rsi, -32
    add     rdx, 2
    jne     .LBB0_30

And this gives some speed boost.

I don’t have your context, so I put in some stubs for parts of the code that are missing but (with a bit more [meaning 2x] memory usage) I could get rid of both unsafe code and bound checks in the inner loops, like so:

use core::mem::swap;
fn eratosthenes_sieve(_: usize) -> Vec<usize> { unimplemented!() }

pub fn foo() -> u64 {
    const N: usize = 5_000;
    const M: u64 = 10_000_000_000_000_000;
    // idk what fto does.
    let check_every = (((1_u64 << 63) / M) as f64).log2().ceil() as usize;
    let primes = eratosthenes_sieve(N);
    let mut ways = vec![0_u64; primes.iter().sum::<usize>() + 1];
    let mut new_ways = vec![0_u64; primes.iter().sum::<usize>() + 1];
    ways[0] = 1;
    let mut max: usize = 0;

    for (i, &p) in primes.iter().enumerate() {
        new_ways[0..p].copy_from_slice(&ways[0..p]);
        ways[0..=max].iter().zip(&ways[p..=max+p]).zip(&mut new_ways[p..=max+p])
        .for_each(|((a,b),dest)| *dest = a+b);
        

        if (i + 1) % check_every == 0 {
            // isn't max+p a better bound here?
            for w in new_ways[..= max+p].iter_mut() {
                *w = *w % M;
            }
        }
        max += p;
        swap(&mut ways, &mut new_ways);
    }
    0
}

Of course, the original behavior could also be archived by some kind of safe abstraction for iterating an array with two mutable references in parallel with an offset. Perhaps this exists in some crate already.