Suggestion: FFastMath intrinsics in stable

What is the problem

After that comment on Reddit, I started thinking about the potential optimizations we prevent by keeping fast-math intrinsics like fadd_fast and fmul_fast unstable.

I have already seen advice like "if you need performance, don't rely on the optimizer and write SIMD manually" a few times. Writing SIMD instructions by hand is very error-prone because the user needs to check alignments and sizes, remember the remaining elements that don't fit into SIMD registers, check instruction availability, and so on. The user also needs to repeat a lot of code because different CPU architectures have different instructions.

Additionally, the lack of -ffast-math can prevent some C/C++ users, especially in gamedev, from adopting Rust.

On my computer, using the intrinsics (without manual vectorization) speeds up the dot product of 20k f64s from 155 microseconds to 69 in a release build, and to 54 if I use -C target-cpu=native.
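
To make the source of that speedup concrete, here is a minimal sketch (not the benchmark code from the Reddit thread, which is not shown here). The function names are made up for illustration. The first function is the strict loop, where every addition must happen in source order. The second is a hand-reassociated variant with four independent partial sums, which is roughly the kind of rewrite a reassociation-permitting flag would let the compiler do on its own.

```rust
/// Strict IEEE-754 dot product: each addition depends on the previous one,
/// so the compiler may not reorder the sum across SIMD lanes by itself.
fn dot_strict(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let mut sum = 0.0;
    for (x, y) in a.iter().zip(b) {
        sum += x * y;
    }
    sum
}

/// Hand-reassociated variant (illustrative): four independent accumulators
/// that a vectorizer can map onto SIMD lanes. The summation order differs
/// from `dot_strict`, so results may differ in the last bits in general.
fn dot_four_lanes(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f64; 4];
    let mut chunks_a = a.chunks_exact(4);
    let mut chunks_b = b.chunks_exact(4);
    for (ca, cb) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        for i in 0..4 {
            acc[i] += ca[i] * cb[i];
        }
    }
    // Combine the partial sums, then handle the elements that did not fill a chunk.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        sum += x * y;
    }
    sum
}
```

For inputs whose products and sums are exactly representable (small integers, for example) both functions return the same value; for arbitrary data they can disagree slightly, which is exactly the freedom the strict version does not grant the optimizer.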

Possible solution

I suggest adding a new type FastFloat(T), similar to Wrapping for integers. It would have implementations only for f32 and f64, which replace all simple arithmetic operations with calls to the intrinsics and therefore generate faster code. Code that uses such intrinsics takes advantage of modern SIMD extensions automatically, so programmers wouldn't need to write SIMD code for audio/video processing manually.
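
A rough sketch of what such a wrapper could look like, assuming it is spelled as a generic tuple struct in the style of std::num::Wrapping. Everything here is hypothetical: no such type exists in std, and since the fast-math intrinsics are unstable, the operator bodies below use ordinary IEEE arithmetic as a placeholder for the intrinsic calls.

```rust
use std::ops::{Add, Mul};

/// Hypothetical wrapper, modelled on `std::num::Wrapping`.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
pub struct FastFloat<T>(pub T);

// Generate the operator impls for both float widths.
macro_rules! impl_fast_ops {
    ($($t:ty),*) => {$(
        impl Add for FastFloat<$t> {
            type Output = Self;
            fn add(self, rhs: Self) -> Self {
                // Placeholder: a real implementation would call the
                // fast-math addition intrinsic here instead of `+`.
                FastFloat(self.0 + rhs.0)
            }
        }
        impl Mul for FastFloat<$t> {
            type Output = Self;
            fn mul(self, rhs: Self) -> Self {
                // Placeholder for the fast-math multiplication intrinsic.
                FastFloat(self.0 * rhs.0)
            }
        }
    )*};
}
impl_fast_ops!(f32, f64);

/// Example user code: the fast semantics are visible in the types.
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let mut sum = FastFloat(0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        sum = sum + FastFloat(x) * FastFloat(y);
    }
    sum.0
}
```

The point of the sketch is only the shape of the API: user code keeps its ordinary operators, and leaving the wrapper (here via `.0`) marks the boundary back to plain floats.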

Another option

Just add methods like fast_add, fast_mul, etc. to the float types. I like this option less because I prefer the properties of a program to be described by types (a different type would better separate IEEE-754 floats from fast ones), but other people may like this option more.

What do you think?

It would be great to have it. There's an issue for it in rust-lang/rust, and some of the things that merit thought and design are mentioned or hinted at there: https://github.com/rust-lang/rust/issues/21690 The central design questions (for me) are: which fast-math flags to enable, and which fast-math flags are compatible with safe Rust and which are not?

My usual question here is which optimizations you're hoping to actually get from things.

Notably, the full fast flag allows a bunch of things that can easily result in UB, and I've never seen a single program that uses -ffast-math in C++ that actually did the work to check that those cases can't happen, which seems sketchy at best.

Note that there's a portable SIMD working group aiming to give a rusty, cross-platform interface to SIMD, which may well end up being a common choice for mathy uses of SIMD.

Also, we now have freeze available in LLVM, so some possibilities have opened up that didn't exist back in LLVM 6 or so. Something like raw nnan can't be exposed to safe Rust, but we might be able to make a safe version that freezes the result so that poison becomes just an unspecified value instead of being able to trigger UB.

(That's in addition to the options that have existed for a while like nsz and arcp and such that never produce poison, so could always be exposed safely.)

FWIW: at least everywhere I've been so far (but I'm young) has specifically not used -ffast-math and instead enforced strict IEEE floats. This isn't by itself enough to get consistent behavior on all machines (you need a fixed integration timestep and to be careful in a lot of other places), but it is completely untenable to try to calculate the same float result on multiple machines (of potentially different CPUs) with -ffast-math.

You can get basically all of the advantage that -ffast-math provides (in C++, anyway, where less optimization carries over from your math lib into your game's use of it) by doing the simplification yourself in routines like dot product that you want to be vocabulary-level cheap. Fused multiply-add is probably the biggest perf bump you'll get from -ffast-math that isn't a simple recognizable pattern, and it's not that difficult to just write FMA when it's available.
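
As an illustration of writing FMA by hand, here is a small sketch using the standard f64::mul_add method, which computes a fused multiply-add with a single rounding. The function name is made up for this example. Whether it is actually faster depends on the target: without a hardware FMA instruction enabled at compile time, mul_add may fall back to a much slower software routine, so treat this as a sketch rather than a recommendation.

```rust
/// Dot product that asks for fused multiply-add explicitly.
fn dot_fma(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let mut sum = 0.0f64;
    for (&x, &y) in a.iter().zip(b) {
        // Computes `x * y + sum` with only one rounding step.
        sum = x.mul_add(y, sum);
    }
    sum
}
```

Note that this still accumulates strictly in order, so it gains contraction but not the reordering that lets the loop vectorize.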

(That said, I'd love a FastFloat type, especially if it's NoisyFloat in debug and panics on NaN! (EDIT: I always forget that -ffast-math means NaN is UB rather than just nondeterministic.))

UE4 uses relaxed math on Windows if not compiled by Intel or Clang (these compilers emit rsqrtss, which is too imprecise). And UE4 is a very common choice for a lot of different companies, especially indie developers. Note: to view this page, you need to join the EpicGames organization on GitHub, and I don't know if the EULA allows me to share the code itself here. https://github.com/EpicGames/UnrealEngine/blob/5df54b7ef1714f28fb5da319c3e83d96f0bedf08/Engine/Source/Programs/UnrealBuildTool/Platform/Windows/VCToolChain.cs#L471

I don't need anything like this right now (because I write Rust only in my sparse free time); I just wonder if someone else wants this. For me, probably the best thing is the ability to use those instructions to get autovectorization without writing SIMD manually and thinking at such a low level. In the Reddit topic, I just wrote a for loop over zipped values that adds them one by one, and the compiler handled everything for me, making it 3 times faster than the IEEE-754 implementation.

Thanks, I had never heard about this. Their approach is still very low level, IMHO. https://rust-lang.github.io/stdsimd/core_simd/index.html

I've proposed this before, and the discussion has been lost to bikeshedding. It went like this:

  1. If you allow fast math, then it needs the floats to be non-NaN.
  2. People who want floats to be non-NaN may also want to control whether these are signaling NaNs and whether Inf is allowed.
  3. This would require a complex type wrapper for floats and have generic arguments that customize every IEEE-754 feature under the sun.
  4. Such wrapper would be too complex and couldn't possibly work, because upholding all of the invariants would require runtime checks, e.g. after every division.

Sounds like what's really needed is a version of fast-math that limits the effects of incorrect non-NaN assumptions to "your math goes wrong" rather than "nasal demons".

You could probably achieve at least some of that by passing LLVM only the subset of fast-math flags that don't produce poison, i.e. the ones other than nnan and ninf. But then it wouldn't be able to do as many optimizations as full fast-math. Alternately, you could theoretically use nnan and ninf but then freeze the result before it can cause UB, but that might end up being a net loss due to hindering optimizations...

My thought there is that, with a wrapper type, it would only need to freeze in impl From<FunkyF32> for f32. So long as it stays in the newtype, it wouldn't block any optimizations. (Which might need something like struct FunkyF32(MaybeUninit<f32>); because it could have some weird stuff in it.)

I read the LLVM docs: LLVM Language Reference Manual (LLVM 12 documentation)

It looks like we can use nsz, arcp, contract and reassoc flags without risks of UB and this would allow us to still pass and get NaNs and infinities.

I will test their effect later.

Hmm, poison is somewhat tricky. It's actually less volatile than I thought: it's safe to do further arithmetic on it, and even store it to memory and load it back. But it's undefined behavior to use poison as a memory address, a branch condition, or a few other things. And roughly speaking, any operation with poison as one of its inputs produces poison as output. So if you do anything with a poison float that results in an integer, like:

  • compare it to another float (resulting in a boolean, or 1-bit integer)
  • convert to integer
  • transmute to integer

...and then use that integer in either a branch condition or an array index (the latter resulting in a poison memory address), you get UB.

In theory the frontend could track which values are potentially poison, note when you do something potentially unsafe with them, and freeze them at that point. But even setting aside the cost of such tracking, it would be hard to prevent the tracking from breaking as soon as math is hidden behind a function call (which LLVM would inline, but only after the frontend is already finished generating IR).

It would probably be easier and more effective to 'just' modify LLVM to add a weaker version of nnan and ninf that doesn't produce poison – though I'm not sure which optimizations exactly are enabled by those flags anyway.

I don't think it's that bad -- the point is to use the newtype so the type system does the tracking.

So, for example, we just don't offer to_bits() on the newtype, and thus people who want to do that need to call f32::from(x).to_bits() instead, and because the from does the freeze, it might be garbage, but it'll never be UB (even if you use it as an index, or whatever). And we make sure to remember to freeze the result of the comparison in all the PartialEq/PartialOrd methods, and whatnot. (In general, any time it doesn't give back a FunkyF32 it'd need to be frozen. I guess maybe the wrapper could be MaybePoison, even...)

Isn't that just "freeze it afterwards"?

I tested! And it worked pretty well.

Running IEEE-754 floating math
Elapsed 155 microseconds
Result: 2666686666700000

Running fast floating math
Elapsed 104 microseconds
Result: 2666686666700000

And that is without using any flags that can cause UB.

I did it in a very hacky way, though. I generated unoptimized LLVM IR for a shared lib with 2 functions, added fast-math flags to one of them, then ran the LLVM optimizer and built it.

I would be grateful if you tested it too. I wrote a script to make testing easier, but you need Linux, clang, Rust and Python to run it. Here is the code:

Only the nnan and ninf fast flags introduce UB, and I think we can just avoid using them, with no need to freeze values later. It would also allow running math on NaN and infinity, causing only calculation errors.

So I used these flags:

  • nsz - ignore sign of zero
  • arcp - allow using multiply by 1/x instead of /x
  • contract - contraction (combining a few instructions into a single one)
  • reassoc - associative arithmetics

I avoided these flags:

  • nnan - can cause UB
  • ninf - can cause UB
  • afn - the UE4 repository complains that the errors it generates are too big even for games.
  • fast - implies the other avoided flags.
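
To illustrate what the nsz flag gives up, here is a small sketch of the places where the sign of zero is observable under strict IEEE-754 rules. The helper names are made up for this example. It uses only ordinary stable Rust arithmetic; none of the fast-math flags are involved.

```rust
/// True when `x` is a zero with the sign bit set.
fn is_negative_zero(x: f64) -> bool {
    x == 0.0 && x.is_sign_negative()
}

/// The reciprocal is the classic place where the sign of zero shows up:
/// it decides between positive and negative infinity.
fn reciprocal(x: f64) -> f64 {
    1.0 / x
}

/// Under strict IEEE-754 this cannot be simplified to `x`,
/// because `-0.0 + 0.0` is `+0.0`, not `-0.0`.
fn add_zero(x: f64) -> f64 {
    x + 0.0
}
```

Here `reciprocal(0.0)` is positive infinity while `reciprocal(-0.0)` is negative infinity, even though `0.0 == -0.0` compares equal. With nsz the optimizer is allowed to treat the two zeros as interchangeable, for example by folding `add_zero(x)` to `x`.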

I think I should start writing an RFC.

One thing you might want to ponder is whether everyone wants those same flags you picked here.

For example, I suspect the dot product example only actually needed the contract and reassoc flags. But I can also easily imagine code that wants arcp and contract, because it can afford an extra ½ULP of uncertainty per operation, but would be unwilling to use reassoc since that can "dramatically change results" (because of cancellation effects).
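
A tiny worked example of the cancellation effect that makes reassoc risky, using plain stable Rust (the function names are illustrative only). The two functions add the same three numbers in different groupings; under strict IEEE-754 rules the compiler must keep each grouping as written, while reassoc would allow it to pick either.

```rust
/// Adds left to right: (a + b) + c.
fn sum_left_to_right(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

/// Adds right to left: a + (b + c).
fn sum_right_to_left(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}
```

With a = 1e17, b = -1e17 and c = 1.0, the first grouping yields 1.0: the large terms cancel exactly and then 1.0 is added. The second yields 0.0, because the spacing between adjacent f64 values near 1e17 is 16, so `-1e17 + 1.0` rounds back to `-1e17` before the cancellation happens.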

We have (min) const generics now, so we have a few options:

Option A: full granularity

struct LooseF32<
  const IGNORE_SIGNED_ZERO: bool,
  const ALLOW_EXPANSION: bool,
  const ALLOW_CONTRACTION: bool,
  const ALLOW_REASSOCIATION: bool,
>(f32);

(A super "fun" way to do this is to make actual primitive type f32 generic over a bitflags! struct IeeeConformance and default to full conformance.)

Option B: no granularity

Offer a -ffast-math flag, which turns on nsz, arcp, contract, and reassoc globally for all f32/f64. (Or without reassoc, to stick to ½ULP error.)

Option C: binary granularity

A single struct LooseF32(f32); which enables the non-UB fast math flags while wrapped. (Or without reassoc, ½ULP.)

Option D: some granularity

A single struct LooseF32(f32);, and global configuration for how loose it actually is. If unconfigured, probably not reassoc, to stick to ½ULP extra error.

A library using LooseF32 is saying that it's correct under looser floating point math, but the final application compilation gets to decide just how much accuracy they're willing to give up.

But what about non-LLVM backends

Honestly, the four flags here seem reasonably concerned with just IEEE float (which Rust the language knows about) and divorced from any one optimizing backend. If we go the source-configurable route, we'd want to make sure we can add further flags in the future if other backends have other things available (such as no-UB ASSUME_FINITE).

I also think it's still fairly reasonable to offer the doesn't-introduce-language-UB options as global switches for f32/f64, or at least global across a single crate at a time. Crates relying on perfect IEEE conformance for correctness (beyond just precision) should be fairly rare and obvious.

(And, as always, wrappers around primitives with literals won't be super ergonomic until it's possible to have literals inferred to them.)

I really like this because it would give the user much better and more explicit control over low-level details. This expressiveness is one of the main reasons rustaceans love Rust.

I tried to implement full granularity (without good intrinsics yet, only flags). It seems to be possible.

The core idea is this:

// More granular choice for user
pub struct LooseF32<
    const IGNORE_SIGNED_ZERO: bool,
    const ALLOW_EXPANSION: bool,
    const ALLOW_CONTRACTION: bool,
    const ALLOW_REASSOCIATION: bool,
    const ALLOW_APPROXIMATION: bool,
>(f32);

// Default choice for user
pub type DefaultLooseF32 = LooseF32<true, true, true, true, true>;

which implements the safe traits Add, Mul, Div, etc.,

and has unsafe methods like

    pub unsafe fn unsafe_add<const IGNORE_NAN: bool, const IGNORE_INF: bool>(
        self,
        right: Self,
    ) -> Self
    {...}

or

    pub unsafe fn unsafe_add2(self, right: Self, unsafe_flags: UnsafeFloatingParams) -> Self {
        match (unsafe_flags.ignore_nan, unsafe_flags.ignore_inf) {
            (true, true) => self.unsafe_add::<true, true>(right),
            (true, false) => self.unsafe_add::<true, false>(right),
            (false, true) => self.unsafe_add::<false, true>(right),
            (false, false) => self.unsafe_add::<false, false>(right),
        }
    }

I don't know which of the unsafe methods is more ergonomic, but I lean toward the first because it can't produce inefficient code if used incorrectly.
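
For the safe side of this design, here is a minimal sketch of a const-generic wrapper that compiles on stable Rust. It is not the playground code referred to in this thread; it is trimmed to two flags for brevity, the helper methods `flags` and `with_flags` are invented for the example, and the addition is ordinary IEEE arithmetic standing in for a flag-aware intrinsic.

```rust
use std::ops::Add;

/// Cut-down version of the wrapper: only two of the flags are shown.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LooseF32<const ALLOW_CONTRACTION: bool, const ALLOW_REASSOCIATION: bool>(pub f32);

impl<const C: bool, const R: bool> Add for LooseF32<C, R> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Placeholder: plain IEEE addition. A real implementation would
        // pick an intrinsic based on the compile-time flags `C` and `R`.
        LooseF32(self.0 + rhs.0)
    }
}

impl<const C: bool, const R: bool> LooseF32<C, R> {
    /// The flags are ordinary compile-time constants of the type.
    pub const fn flags() -> (bool, bool) {
        (C, R)
    }

    /// Hypothetical explicit conversion between flag sets.
    pub fn with_flags<const C2: bool, const R2: bool>(self) -> LooseF32<C2, R2> {
        LooseF32(self.0)
    }
}

/// A named flag combination, as a user might define it.
pub type ContractOnly = LooseF32<true, false>;
```

Because the flags are part of the type, values with different flag sets cannot be added to each other by accident; mixing them requires an explicit conversion such as the `with_flags` helper above.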

I am strongly against this because it would constrain the user. If the user has a single place where precise floating-point math is necessary and a lot of places where it isn't, they wouldn't be able to use this. It would also invite errors in big projects due to build misconfiguration (e.g. when the build is configured by one person and the code is written by another). IMHO, decisions about using imprecise math need to be more local.

This was my original thought, but I like your first option much more.

I think global configs that change the semantics of code are bad. One part of a codebase could assume one semantics and another part a different one. It would essentially lead to the same problems as Option B.

Well, AFAIK there are only 2 alternative backends now: GCC and Cranelift.

As for GCC, I tried to find documentation for its IR, but failed.

Cranelift is made especially for Rust, and it would likely support fast-math instructions only once those instructions appear in the language.

I think, we should emit precise instructions if our backend doesn't support fast ones.

Another problem is the set of fast flags changing on the backend side. I don't know how likely that is, but it can be mitigated by using a struct as the generic parameter of our type (this would, however, make stabilization of this feature depend on full const generics).

Do you like @CAD97's idea of a fully granular API and my sketch of it in the playground? The user would be able to pick the flags they need and even use the unsafe ones.

It's great to see that NaN/Inf doesn't have to be solved, but I'm worried that we're at stage 3 again.

At the LLVM level these properties are given to operations, not to types. So perhaps instead of LooseF32 there should be f32.loose_mul(x)/f32.loose_add(x), etc.? (or f32.loose_mul(x, flags))

Exposing this as methods/intrinsics is the minimum necessary. Newtypes with fancy generics can be created in user libraries.
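
A minimal sketch of the method-based shape, written as a user-side extension trait since the methods do not exist on the primitive types. The trait name `LooseOps` is invented for this example, and the bodies are ordinary IEEE operations standing in for the flagged intrinsics.

```rust
/// Hypothetical extension trait; method names follow the
/// `loose_add`/`loose_mul` suggestion above.
pub trait LooseOps: Sized {
    fn loose_add(self, rhs: Self) -> Self;
    fn loose_mul(self, rhs: Self) -> Self;
}

impl LooseOps for f64 {
    #[inline]
    fn loose_add(self, rhs: Self) -> Self {
        // Placeholder for an addition carrying fast-math flags.
        self + rhs
    }

    #[inline]
    fn loose_mul(self, rhs: Self) -> Self {
        // Placeholder for a multiplication carrying fast-math flags.
        self * rhs
    }
}

/// Example user code built only on the methods.
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .fold(0.0f64, |acc, (&x, &y)| acc.loose_add(x.loose_mul(y)))
}
```

A newtype with operator overloads could then be layered on top of such methods in a library, without the language having to settle on one wrapper design.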

There's some precedent in std in the form of the various (wrapping|overflowing|etc)_(add|mul|etc) methods. On the other hand, there's precedent to also having a convenient wrapper type in the form of std::num::Wrapping :slight_smile: (And as has been discussed, arguably there should be other wrappers as well given that Wrapping is already there.) I do find that the many similar methods pollute the docs somewhat, for what it's worth.
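
For readers less familiar with that precedent, here are the two existing integer styles side by side, using only stable std APIs (the function names are illustrative). Both compute the same wrapping checksum; the difference is whether the overflow behaviour is chosen per call or carried by the type.

```rust
use std::num::Wrapping;

/// Per-operation style: wrapping is requested at each call site.
fn checksum_methods(bytes: &[u8]) -> u8 {
    bytes.iter().fold(0u8, |acc, &b| acc.wrapping_add(b))
}

/// Type-level style: the `Wrapping` newtype makes every `+=` wrap.
fn checksum_wrapper(bytes: &[u8]) -> u8 {
    let mut acc = Wrapping(0u8);
    for &b in bytes {
        acc += Wrapping(b);
    }
    acc.0
}
```

For the input `[200, 100]` both return 44, since 300 wraps around modulo 256.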

Thinking about it a bit, I think the usage patterns differ somewhat though. With floating point it seems it's more likely that you just want to say "make all operations in this module/package/crate fast", whereas integer overflow control tends to be more fine-grained in my experience. The stakes are different, too: an unexpected wraparound usually makes your results catastrophically wrong.

It would be very unergonomic, I think :frowning:

Compare these two snippets:

With a generic struct

// In case if we would use struct as generic param
const FLAGS: LooseFloatFlags = LooseFloatFlags::new()
    .no_signed_zero()
    .reassocciatable()
    .expandable()
    .approximable();

type Float = LooseFloat<FLAGS>;

fn calc_dot_product(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let mut result = Float(0.0f64);
    for (a, b) in a.iter().copied().map(Float).zip(b.iter().copied().map(Float)) {
        result += a * b;
    }
    result.into_inner()
}

With methods on floats

const FLAGS: LooseFloatFlags = LooseFloatFlags::new()
    .no_signed_zero()
    .reassocciatable()
    .expandable()
    .approximable();

fn calc_dot_product(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let mut result = 0.0f64;
    for (a, b) in a.iter().copied().zip(b.iter().copied()) {
        result = unsafe {
            result.loose_add::<FLAGS>(a.loose_mul::<FLAGS>(b))
        };
    }
    result
}

If we decide to use separate generic arguments instead of a struct, we would need to turbofish all 7 attributes on each method call, which is too verbose for my taste.

I think the approach of "make all operations in this module/package/crate fast" is just colored by the precedent in C where a global -ffast-math is all you can have. It's not a good design.

It's super ambiguous in its meaning. Does it affect iter().sum()? sum is defined in a different crate! If it doesn't affect code defined elsewhere, then it basically won't work at all, because all float std::ops::* are defined in std. Does it simply affect only code generated during compilation of the crate, like ffast-math? Now monomorphisation, inlining decisions and LTO will affect code's semantics. Or does it affect meaning of all operations in all function calls, recursively? That's going to be hard to compile, and will affect code in functions that may have never been written with relaxed precision in mind.
