I'm really curious about the performance of isqrt, since the implementations are quite long. A very simple idea came to mind: would converting a u32 to f64, taking the square root directly, and converting the result back be faster than calling .isqrt()?
```
$ cat main.rs && cargo rr
fn fsqrt(u: u32) -> u32 {
    (u as f64).sqrt() as u32
}

fn isqrt(u: u32) -> u32 {
    u.isqrt()
}

fn timing(f: fn(u32) -> u32, range: impl Iterator<Item = u32>) {
    let now = std::time::Instant::now();
    let mut res = 0;
    for i in range {
        // Wraps on overflow in release mode; only used as a checksum
        // so the calls aren't optimized away.
        res += f(i);
    }
    println!("res = {res} {:?}", now.elapsed());
}

fn main() {
    timing(fsqrt, 0..=u32::MAX);
    timing(isqrt, 0..=u32::MAX);
}
    Finished `release` profile [optimized] target(s) in 0.00s
     Running `/me/test-isqrt/target/release/test-isqrt`
res = 715816960 8.357294192s
res = 715816960 18.820145928s
```
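For what it's worth, the float path should be exact for every u32 input: a u32 always fits in f64's 53-bit significand, and IEEE 754 requires sqrt to be correctly rounded, so truncating the result back can't land on the wrong integer. A quick, non-exhaustive sanity check (my own sketch, separate from the benchmark above) around perfect squares, where an off-by-one would show up first:

```rust
fn main() {
    // Each u32 converts to f64 exactly, and a correctly rounded sqrt of
    // r*r - 1 stays strictly below r, so truncation should match isqrt.
    for r in [1u32, 255, 4096, 46341, 65535] {
        let sq = r * r;
        for u in [sq - 1, sq, sq + 1] {
            assert_eq!((u as f64).sqrt() as u32, u.isqrt(), "mismatch at {u}");
        }
    }
    println!("spot checks passed");
}
```

The same trick is not exact for u64, though: inputs above 2^53 already round during the u64-to-f64 conversion, so a float-based path there would need a correction step.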
I have no idea whether this trick deserves a PR.
I feel I have seen a thread on this already here, a year ago or so. Wasn't the issue raised (and never addressed) that not all targets have hardware floating point?
IIRC part of the problem is that it's a full-target question, not just an architecture question. There was something like the Linux kernel not wanting to use SSE (which Rust uses for f64 operations like this) in kernel mode, even on x86-64 chips that all have it.
This is mostly about not wanting to save and restore floating point state from the userspace process, especially since interrupts and preemption can happen at any time. There are no callee/caller-save ABI rules to limit state here, so the kernel has to save everything that it might clobber.
We already implement f32::midpoint in a platform-dependent way.
But I think there might be a much bigger variance between integer and float behavior than between f32 and f64 (assuming both of the latter are hardfloat).
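If someone does pursue this, a cfg-gated version along the lines of the midpoint approach seems like the natural shape. A rough sketch, with the caveat that the sse2 gate is just an illustrative stand-in for whatever condition would actually be right (hardfloat availability, no kernel/soft-float constraints, etc.):

```rust
// Hypothetical per-target selection; the feature gate is illustrative,
// not necessarily what a real PR would check.
#[cfg(target_feature = "sse2")]
fn isqrt_u32(u: u32) -> u32 {
    // Hardware f64 sqrt is exact for all u32 inputs (53-bit significand,
    // correctly rounded sqrt), and measured faster above.
    (u as f64).sqrt() as u32
}

#[cfg(not(target_feature = "sse2"))]
fn isqrt_u32(u: u32) -> u32 {
    // Soft-float and kernel-style targets keep the pure integer algorithm.
    u.isqrt()
}
```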