A little example function:
fn foo() -> u64 {
    const N: usize = 5_000;
    const M: u64 = 10_000_000_000_000_000;

    // How often (in primes processed) the counters must be reduced
    // modulo M to avoid overflowing u64.
    let check_every = fto!{(((1_u64 << 63) / M) as f64).log2().ceil(), usize};

    let primes = eratosthenes_sieve(N);

    // ways[s] = number of subsets of the primes processed so far that sum to s.
    let mut ways = vec![0_u64; primes.iter().sum::<usize>() + 1];
    ways[0] = 1;

    let mut max: usize = 0;
    for (i, &p) in primes.iter().enumerate() {
        for j in (0 ..= max).rev() {
            unsafe {
                *ways.get_unchecked_mut(j + p) += *ways.get_unchecked(j);
            }
        }

        // Reduce modulo M only every check_every primes, keeping the
        // modulo out of the hot loop.
        if (i + 1) % check_every == 0 {
            for w in ways[..= max].iter_mut() {
                *w %= M;
            }
        }
        max += p;
    }

    // Sum ways[q] over all primes q <= max, modulo M.
    let sieve = SieveBits::new(max);
    (3 ..= max)
        .step_by(2)
        .filter(|&i| sieve.is_prime(i))
        .fold(ways[2], |acc, i| (acc + ways[i]) % M)
}
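Here eratosthenes_sieve, SieveBits and fto! are small helpers of mine not shown here. As a rough, hypothetical sketch of what the example assumes (fto! taken as a plain as cast, and SieveBits shown as a plain Vec<bool> instead of a real bit set):

// Hypothetical stand-ins for the helpers used by the example above.

/// Assumed to be a simple numeric conversion macro (here just an `as` cast).
macro_rules! fto {
    ($e:expr, $t:ty) => { ($e) as $t };
}

/// Boolean primality flags for 0 ..= n.
fn sieve_flags(n: usize) -> Vec<bool> {
    let mut is_p = vec![true; n + 1];
    is_p[0] = false;
    if n > 0 {
        is_p[1] = false;
    }
    let mut i = 2;
    while i * i <= n {
        if is_p[i] {
            let mut j = i * i;
            while j <= n {
                is_p[j] = false;
                j += i;
            }
        }
        i += 1;
    }
    is_p
}

/// All primes up to and including `n`.
fn eratosthenes_sieve(n: usize) -> Vec<usize> {
    let flags = sieve_flags(n);
    (2 ..= n).filter(|&i| flags[i]).collect()
}

/// Primality lookup for 0 ..= n (a Vec<bool> here, presumably bit-packed in the real one).
struct SieveBits {
    flags: Vec<bool>,
}

impl SieveBits {
    fn new(n: usize) -> Self {
        SieveBits { flags: sieve_flags(n) }
    }

    fn is_prime(&self, i: usize) -> bool {
        self.flags[i]
    }
}
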
In the most critical loop I've used an unsafe{} block with get_unchecked/get_unchecked_mut to omit the array bound tests, because they slow that code down a little. In some simple cases LLVM is able to remove bound tests on its own, but in most slightly more complex situations it isn't. Sometimes you can give LLVM safe or unsafe hints. The unsafe ones are similar to using get_unchecked, like calling assume() (some people say assume() causes bad codegen, but I usually don't see that downside in practice). The safe ones are asserts, or re-slicing of slices/vectors/arrays, which tell the compiler that an array is long enough. Both introduce new tests in the asm, but sometimes those few tests help LLVM remove many others (Rust semantics requires panics to happen exactly where they are needed, not slightly before, so a single check hoisted in front of a loop has to be written explicitly by the programmer). A third strategy is to use zip and the like to walk arrays in parallel, which sometimes requires a slice::split_at_mut. But in some situations all those strategies fail: I didn't find a safe way to remove the array bound tests in that inner loop (if you have ideas I'll welcome them; I can't use a zip here because of the borrowck, given that the source range ways[0 ..= max] and the destination range ways[p ..= max + p] overlap).
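To show the safe-hint idea on a loop where it does work (a toy function, not the inner loop above, whose ranges overlap): one re-slice in front of the loop is enough for LLVM to drop the per-iteration bound tests and vectorize.

// Toy example of the safe hints: the two re-slices give LLVM slices of
// length exactly n, so the indexed accesses below are provably in bounds
// and the bound tests disappear from the generated asm.
fn add_into(dst: &mut [u64], src: &[u64]) {
    let n = dst.len().min(src.len());
    let dst = &mut dst[.. n];
    let src = &src[.. n];
    for i in 0 .. n {
        dst[i] += src[i];
    }
}

An assert!(dst.len() >= n && src.len() >= n); in front of the loop usually has the same effect.
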
Future languages (or present ones like Ada-SPARK) will allow the programmer to prove that some array accesses are always in bounds (and probably that some divisions are never by zero), allowing faster code while keeping memory safety and code correctness. Perhaps Rust will do that one day (using the equivalent of get_unchecked on its own where you have proved a slice access to be in bounds).
Array bound tests are fast: they are just a comparison plus a jump, and sometimes you don't even need the comparison because the flags are already set by something else, and the CPU branch predictors waste very little time on a jump that's almost never taken. The problem is that such bound tests don't allow LLVM to generate efficient asm code, so it's almost like using a 15-year-old CPU with a slightly faster clock.
If in that code I use regular Vec indexing in the innermost j loop:

ways[j + p] += ways[j];

the resulting asm contains two jae jumps to the out-of-bounds panics (the panic blocks are not shown):
.LBB0_5:
        cmp rax, rsi
        jae .LBB0_18
        lea r10, [r14 + rax]
        cmp r10, rsi
        jae .LBB0_19
        mov rbx, qword ptr [r8 + 8*rax]
        add qword ptr [rdx + 8*rax], rbx
        add rax, -1
        jb .LBB0_5
If instead I use get_unchecked/get_unchecked_mut, the jumps are omitted and LLVM vectorizes the code and unrolls it to process 32 values per iteration:
.LBB0_30:
        vmovdqu ymm0, ymmword ptr [rbx + 8*rsi]
        vmovdqu ymm1, ymmword ptr [rbx + 8*rsi - 32]
        vmovdqu ymm2, ymmword ptr [rbx + 8*rsi - 64]
        vmovdqu ymm3, ymmword ptr [rbx + 8*rsi - 96]
        vpaddq ymm0, ymm0, ymmword ptr [rbp + 8*rsi]
        vpaddq ymm1, ymm1, ymmword ptr [rbp + 8*rsi - 32]
        vpaddq ymm2, ymm2, ymmword ptr [rbp + 8*rsi - 64]
        vpaddq ymm3, ymm3, ymmword ptr [rbp + 8*rsi - 96]
        vmovdqu ymmword ptr [rbx + 8*rsi], ymm0
        vmovdqu ymmword ptr [rbx + 8*rsi - 32], ymm1
        vmovdqu ymmword ptr [rbx + 8*rsi - 64], ymm2
        vmovdqu ymmword ptr [rbx + 8*rsi - 96], ymm3
        vmovdqu ymm0, ymmword ptr [rbx + 8*rsi - 224]
        vmovdqu ymm1, ymmword ptr [rbx + 8*rsi - 192]
        vmovdqu ymm2, ymmword ptr [rbx + 8*rsi - 160]
        vmovdqu ymm3, ymmword ptr [rbx + 8*rsi - 128]
        vpaddq ymm3, ymm3, ymmword ptr [rbp + 8*rsi - 128]
        vpaddq ymm2, ymm2, ymmword ptr [rbp + 8*rsi - 160]
        vpaddq ymm1, ymm1, ymmword ptr [rbp + 8*rsi - 192]
        vpaddq ymm0, ymm0, ymmword ptr [rbp + 8*rsi - 224]
        vmovdqu ymmword ptr [rbx + 8*rsi - 128], ymm3
        vmovdqu ymmword ptr [rbx + 8*rsi - 160], ymm2
        vmovdqu ymmword ptr [rbx + 8*rsi - 192], ymm1
        vmovdqu ymmword ptr [rbx + 8*rsi - 224], ymm0
        add rsi, -32
        add rdx, 2
        jne .LBB0_30
And this gives a nice speed boost.
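For completeness, here is a self-contained reduction of that inner loop (hypothetical free functions, handy for comparing the two codegens on the playground or on godbolt); built with optimizations and a target CPU that has AVX2, they should reproduce roughly the two listings above:

// Indexed version: keeps the two jae bound tests per iteration.
pub fn inner_indexed(ways: &mut [u64], max: usize, p: usize) {
    for j in (0 ..= max).rev() {
        ways[j + p] += ways[j];
    }
}

// Unchecked version: no bound tests, so LLVM vectorizes and unrolls it.
// Safety: the caller must guarantee that max + p < ways.len().
pub unsafe fn inner_unchecked(ways: &mut [u64], max: usize, p: usize) {
    for j in (0 ..= max).rev() {
        *ways.get_unchecked_mut(j + p) += *ways.get_unchecked(j);
    }
}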