jyn recommended that I ask whether this is "an 'api does not guarantee it' thing or an 'implementation changes all the time' thing".
Sorry, I should have mentioned it earlier, but one of the places I planned on using the square root intrinsics was in core itself, as the fastest isqrt implementation I could find for most integer types.
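For concreteness, the float-based approach is shaped roughly like this (a sketch only, not the actual implementation; `isqrt_via_f64` is a hypothetical helper, and the fix-up loops guard against f64 rounding error):

```rust
// Sketch: isqrt built on the hardware f64 square root. The cast is
// exact for n < 2^53; the adjustment loops repair any f64 rounding.
fn isqrt_via_f64(n: u64) -> u64 {
    let mut r = (n as f64).sqrt() as u64;
    // The estimate may overshoot near rounding boundaries; walk it down...
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    // ...or up, until r is exactly floor(sqrt(n)).
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
        r += 1;
    }
    r
}

fn main() {
    assert_eq!(isqrt_via_f64(24), 4);
    assert_eq!(isqrt_via_f64(u64::MAX), u32::MAX as u64);
}
```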
Is it OK to use these intrinsics there, provided an addition to the core test suite testing the outputs of isqrt would detect a breaking change in these two intrinsics? If not, is there a libm used by core? And if not, is there some way to implement a method on the integer types differently in core and std, so that the floating-point method can at least be used in std (or a better way to handle it than that)?
So one potential answer would be to make a new isqrt intrinsic. The LLVM codegen backend for that could then either emit something using llvm.sqrt or the integer-only implementation depending on whether the target has an efficient sqrt.
llvm.sqrt might work in core on targets without an instruction for it, but if it did it would probably be emitted as a call to a function that would be expected to be linked in, which you probably don't ever want for integer sqrt.
Looks like FSQRT on x86 at least is something like 20-40 cycles, so hard to say for sure. That said, it's almost certainly faster than anything doing integer divisions for Newton's method, since a general IDIV is about the same latency.
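For reference, the division-heavy Newton's-method isqrt being compared against looks roughly like this (a sketch; `isqrt_newton` is a made-up name, and each loop iteration pays for one integer division):

```rust
// Sketch of a classic Newton's-method integer square root; the n / x
// in each step is the integer division whose latency dominates.
fn isqrt_newton(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    // Initial guess: a power of two at least as large as sqrt(n).
    let mut x = 1u64 << ((65 - n.leading_zeros()) / 2);
    loop {
        let y = (x + n / x) / 2; // one IDIV per iteration
        if y >= x {
            return x; // converged to floor(sqrt(n))
        }
        x = y;
    }
}

fn main() {
    assert_eq!(isqrt_newton(99), 9);
    assert_eq!(isqrt_newton(u64::MAX), 4_294_967_295);
}
```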
My question wasn't about the algorithm. If you are doing a lot of square roots, then floating point will be faster, as the OP's benchmarks showed. However, there is a fixed cost to gratuitously using SIMD that is obscured by microbenchmarks.
fsqrt is the wrong instruction to check; that's the old x87 instruction that no one uses unless they have to or they're using f80. The instructions to check are [v]sqrtss for f32 (15-cycle latency on Zen 4, 12 on Ice Lake) and [v]sqrtsd for f64 (21-cycle latency on Zen 4, 15-16 on Ice Lake).
Does that include the time to convert back and forth between integers and floating point though?
Also, won't there be a problem of precision anyway? f64 has a 53-bit mantissa, so by the pigeonhole principle it cannot represent all 64-bit integers, meaning it can't be used for the square root without a range check and fallback.
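The hazard is easy to demonstrate (a hypothetical demo; `naive_isqrt` is a made-up name): u64::MAX rounds up to exactly 2^64 as an f64, so the naive cast overshoots by one.

```rust
// Demo of the 53-bit mantissa problem: 2^64 - 1 is not representable
// in f64, so `n as f64` rounds up to 2^64 and sqrt() returns 2^32.
fn naive_isqrt(n: u64) -> u64 {
    (n as f64).sqrt() as u64
}

fn main() {
    let n = u64::MAX;
    let naive = naive_isqrt(n);   // 4_294_967_296, i.e. 2^32
    let exact = 4_294_967_295u64; // true floor(sqrt(n)) is 2^32 - 1
    assert_eq!(naive, exact + 1); // off by one without a fallback check
    assert!(exact * exact <= n);  // fits: (2^32 - 1)^2 = 2^64 - 2^33 + 1
}
```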
It might be a pessimisation rather than optimisation in some cases.
If you need to do an integer version, maybe use Goldschmidt's algorithm with fixed-point arithmetic? It only uses table lookup, multiply, add, sub, and shifting (after changing the input range to 1/2 to 2 using count leading zeros and shifting), so it should be pretty fast.
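The iteration looks something like this (an f64 sketch of the shape only; a real version would use fixed-point arithmetic, the 4-entry seed table here is made up for the demo, and the range reduction to [1/2, 2) via count-leading-zeros is assumed to have already happened):

```rust
// Sketch of the Goldschmidt square-root iteration for s in [0.5, 2).
// Each step is only multiply/add/sub; the tiny seed table stands in
// for a real lookup table (its values are illustrative guesses).
fn goldschmidt_sqrt(s: f64) -> f64 {
    assert!((0.5..2.0).contains(&s), "range reduction assumed done");
    // Crude seeds for 1/sqrt(s), one per quarter of the input range.
    const SEED: [f64; 4] = [1.206, 0.970, 0.834, 0.743];
    let idx = (((s - 0.5) / 1.5) * 4.0) as usize;
    let y0 = SEED[idx.min(3)];
    let mut x = s * y0;   // converges to sqrt(s)
    let mut h = y0 * 0.5; // converges to 1 / (2 * sqrt(s))
    for _ in 0..5 {
        let r = 0.5 - x * h; // residual shrinks quadratically
        x += x * r;
        h += h * r;
    }
    x
}

fn main() {
    assert!((goldschmidt_sqrt(1.5) - 1.5f64.sqrt()).abs() < 1e-9);
}
```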
If that doesn't work, is there some way for rustc to detect code that shouldn't use hardware floating point operations? For example, is kernel code compiled with +soft-float or something else to disable accidental hardware floating point use?
Or perhaps is there some way to detect kernel code specifically?