jyn recommended that I ask whether this is "an 'api does not guarantee it' thing or an 'implementation changes all the time' thing".
Sorry, I should have mentioned it earlier, but one of the places I planned on using the square root intrinsics was in core itself, as the fastest isqrt implementation I could find for most integer types.
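For concreteness, the float-based approach is shaped roughly like this (a sketch only, not the actual implementation; `isqrt_via_f64` is a hypothetical helper, and the fix-up loops guard against f64 rounding error):

```rust
// Sketch: isqrt built on the hardware f64 square root. The cast is
// exact for n < 2^53; the adjustment loops repair any f64 rounding.
fn isqrt_via_f64(n: u64) -> u64 {
    let mut r = (n as f64).sqrt() as u64;
    // The estimate may overshoot near rounding boundaries; walk it down...
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    // ...or up, until r is exactly floor(sqrt(n)).
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
        r += 1;
    }
    r
}

fn main() {
    assert_eq!(isqrt_via_f64(24), 4);
    assert_eq!(isqrt_via_f64(u64::MAX), u32::MAX as u64);
}
```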
Is it OK to use these intrinsics there, provided an addition to the core test suite testing the outputs of isqrt would detect a breaking change in these two intrinsics? If not, is there a libm used by core? And if not, is there some way to implement a method on the integer types differently in core and std, so that the floating-point method can at least be used in std (or a better way to handle it than that)?
So one potential answer would be to make a new isqrt intrinsic. The LLVM codegen backend for that could then either emit something using llvm.sqrt or the integer-only implementation depending on whether the target has an efficient sqrt.
llvm.sqrt might work in core on targets without an instruction for it, but if it did it would probably be emitted as a call to a function that would be expected to be linked in, which you probably don't ever want for integer sqrt.
Looks like FSQRT on x86 at least is something like 20-40 cycles, so hard to say for sure. That said, it's almost certainly faster than anything doing integer divisions for Newton's method, since a general IDIV is about the same latency.
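For reference, the division-heavy Newton's-method isqrt being compared against looks roughly like this (a sketch; `isqrt_newton` is a made-up name, and each loop iteration pays for one integer division):

```rust
// Sketch of a classic Newton's-method integer square root; the n / x
// in each step is the integer division whose latency dominates.
fn isqrt_newton(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    // Initial guess: a power of two at least as large as sqrt(n).
    let mut x = 1u64 << ((65 - n.leading_zeros()) / 2);
    loop {
        let y = (x + n / x) / 2; // one IDIV per iteration
        if y >= x {
            return x; // converged to floor(sqrt(n))
        }
        x = y;
    }
}

fn main() {
    assert_eq!(isqrt_newton(99), 9);
    assert_eq!(isqrt_newton(u64::MAX), 4_294_967_295);
}
```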
My question wasn't about the algorithm. If you are doing a lot of square roots, then floating point will be faster, as the OP's benchmarks showed. However, there is a fixed cost to gratuitously using SIMD that is obscured by microbenchmarks.
fsqrt is the wrong instruction to check; that's the old x87 instruction that no one uses unless they have to or they're using f80. The instructions to check are [v]sqrtss for f32 (15-cycle latency on Zen 4, 12 on Ice Lake) and [v]sqrtsd for f64 (21-cycle latency on Zen 4, 15-16 on Ice Lake).
Does that include the time to convert back and forth between integers and floating point though?
Also, won't there be a problem of precision anyway? f64 has a 53-bit mantissa, so by the pigeonhole principle it cannot represent all 64-bit integers, meaning it can't be used for the square root without a range check and fallback.
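The hazard is easy to demonstrate (a hypothetical demo; `naive_isqrt` is a made-up name): u64::MAX rounds up to exactly 2^64 as an f64, so the naive cast overshoots by one.

```rust
// Demo of the 53-bit mantissa problem: 2^64 - 1 is not representable
// in f64, so `n as f64` rounds up to 2^64 and sqrt() returns 2^32.
fn naive_isqrt(n: u64) -> u64 {
    (n as f64).sqrt() as u64
}

fn main() {
    let n = u64::MAX;
    let naive = naive_isqrt(n);   // 4_294_967_296, i.e. 2^32
    let exact = 4_294_967_295u64; // true floor(sqrt(n)) is 2^32 - 1
    assert_eq!(naive, exact + 1); // off by one without a fallback check
    assert!(exact * exact <= n);  // fits: (2^32 - 1)^2 = 2^64 - 2^33 + 1
}
```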
It might be a pessimisation rather than optimisation in some cases.
If you need to do an integer version, maybe use Goldschmidt's algorithm with fixed-point arithmetic? It only uses table lookup, multiply, add, sub, and shifting (after changing the input range to 1/2 to 2 using count leading zeros and shifting), so it should be pretty fast.
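The iteration looks something like this (an f64 sketch of the shape only; a real version would use fixed-point arithmetic, the 4-entry seed table here is made up for the demo, and the range reduction to [1/2, 2) via count-leading-zeros is assumed to have already happened):

```rust
// Sketch of the Goldschmidt square-root iteration for s in [0.5, 2).
// Each step is only multiply/add/sub; the tiny seed table stands in
// for a real lookup table (its values are illustrative guesses).
fn goldschmidt_sqrt(s: f64) -> f64 {
    assert!((0.5..2.0).contains(&s), "range reduction assumed done");
    // Crude seeds for 1/sqrt(s), one per quarter of the input range.
    const SEED: [f64; 4] = [1.206, 0.970, 0.834, 0.743];
    let idx = (((s - 0.5) / 1.5) * 4.0) as usize;
    let y0 = SEED[idx.min(3)];
    let mut x = s * y0;   // converges to sqrt(s)
    let mut h = y0 * 0.5; // converges to 1 / (2 * sqrt(s))
    for _ in 0..5 {
        let r = 0.5 - x * h; // residual shrinks quadratically
        x += x * r;
        h += h * r;
    }
    x
}

fn main() {
    assert!((goldschmidt_sqrt(1.5) - 1.5f64.sqrt()).abs() < 1e-9);
}
```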
If that doesn't work, is there some way for rustc to detect code that shouldn't use hardware floating point operations? For example, is kernel code compiled with +soft-float or something else to disable accidental hardware floating point use?
Or perhaps is there some way to detect kernel code specifically?