Performance of isqrt

I'm really curious about the performance issues about isqrt, since they are quite long.

There is a very simple idea come into my mind: Will it faster to convert u32 into f64, calculating its square root directly, and convert its result back, is faster than calling .isqrt() directly?

$ cat main.rs && cargo rr
fn fsqrt(u: u32) -> u32 {
    (u as f64).sqrt() as u32
}
fn isqrt(u: u32) -> u32 {
    u.isqrt()
}
fn timing(f: fn(u32) -> u32, range: impl Iterator<Item = u32>) {
    let now = std::time::Instant::now();
    let mut res = 0;
    for i in range {
        res += f(i);
    }
    println!("res = {res} {:?}", now.elapsed());
}
fn main() {
    timing(fsqrt, 0..=u32::MAX);
    timing(isqrt, 0..=u32::MAX);
}
    Finished `release` profile [optimized] target(s) in 0.00s
     Running `/me/test-isqrt/target/release/test-isqrt`
res = 715816960 8.357294192s
res = 715816960 18.820145928s

I have no idea whether it deserved to create a PR about this trick.

1 Like

I feel I have seen a thread on this already here, a year ago or so. Wasn't the issue raised (and never addressed) that not all targets have hardware floating point?

If the issue is "not all targets have hardware floating point", we could pick all the target that isqrt is slower than convert it to f64 directly.

for example:

#[cfg(or(target_arch = "x86_64", target_arch = "x86"))]
fn isqrt(i: u32) -> u32 {
    (i as f64).sqrt() as f32
}
#[cfg(not(or(target_arch = "x86_64", target_arch = "x86")))]
fn isqrt(i: u32) -> u32 {
    // old impl
}

IIRC part of the problem is that it's a full target question, not just an arch problem. There was something like the linux kernel not wanting to use SSE (which rust does for f64 operations like this) in kernel mode even on x64 chips that all have it.

3 Likes

Some previous discussion of this: Do the square root intrinsics work on all platforms?

Note that the discussion only start part way through, it starts out as a different question and... meanders. As often happens.

This is mostly about not wanting to save and restore floating point state from the userspace process, especially since interrupts and preemption can happen at any time. There are no callee/caller-save ABI rules to limit state here, so the kernel has to save everything that it might clobber.

3 Likes

We already implement f32::midpoint in platform-dependent way.

But I think there might be a much bigger variance between integer and float behavior compared to the gap between f32 and f64 (with the assumption that both of the latter are hardfloat).

see also, stabilization report of the isqrt feature: Tracking Issue for `{u8,i8,...}::isqrt` · Issue #116226 · rust-lang/rust · GitHub