Pre-RFC: What's the best way to implement `-ffast-math`?


#1

There are several situations where giving the compiler the freedom to reorder mathematical operations in ways that may lead to numerical imprecision also leads to big gains in performance. Here’s the open issue about it:

https://github.com/rust-lang/rust/issues/21690

And here’s a matrix multiplication benchmark that shows that it’s easy to find 20-30% speed improvements just by enabling that compiler flag:

To try and get an initial discussion going, here are a few options for how this could be added to Rust.

Option 1: Add a compiler flag

The most similar thing to C/C++ would be to have a compiler flag that enables that optimization (by passing that option to LLVM) for everything being compiled.

Advantages

  • Most similar to what programmers are used to in other languages

Disadvantages

  • Suffers from the same issues that -ffast-math has in other languages, which make it dangerous in several cases. For example, end-users will enable it in cases where it’s unsafe to do so, get broken software, and then proceed to blame the software authors for it.

Option 2: Specific tagging for this in functions

Add a tag that can be used on functions and that sets the relevant compilation flag in LLVM. Something like:

#[fast-math]
fn myfunc() {
    // ... your f32/f64-using code goes here
}

Advantages

  • Works like other rust features
  • Allows the original author to tag only the specific bits of code where this is safe to do

Disadvantages

  • It’s another feature-specific tag that pollutes this namespace

Option 3: Use the target_feature machinery for this

In some ways -ffast-math is just another special CPU feature that can be enabled or not; there are even special instructions that can only be used if you allow reordering of math. See the target_feature discussion for details, but this would be something like:

#[target_feature = "fast-math"]
fn myfunc() {
    // ... your f32/f64-using code goes here
}

Advantages

  • Doesn’t introduce a new tag but instead just reuses a similar concept
  • Should work nicely together with other target_feature uses, like enabling certain SIMD features, since you often only get the full benefit of those extra instructions in non-handrolled code by allowing the optimizer to reorder math operations

Disadvantages

  • Stretches the concept too far?

Option 4: Use a wrapper type

Wrappers for f32/f64 would be used that imply that feature:

fn myfunc(a: f32, b: f32) -> f32 {
    let afast = FastFloat(a);
    let bfast = FastFloat(b);
    // ... your f32/f64-using code goes here, e.g.:
    let result = afast * bfast;
    result.into()
}
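For concreteness, here is a minimal sketch of what such a wrapper might look like. `FastFloat` and everything in it is hypothetical: plain arithmetic stands in for the fast-math operations, whereas a real implementation would route them through LLVM’s fast-math flags (e.g. the unstable `fadd_fast`/`fmul_fast` intrinsics).

```rust
use std::ops::{Add, Mul};

// Hypothetical wrapper type; the name and details are illustrative only.
// Plain ops stand in for fast-math ops here.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct FastFloat(pub f32);

impl Add for FastFloat {
    type Output = FastFloat;
    fn add(self, rhs: FastFloat) -> FastFloat {
        FastFloat(self.0 + rhs.0) // placeholder for a fast-math add
    }
}

impl Mul for FastFloat {
    type Output = FastFloat;
    fn mul(self, rhs: FastFloat) -> FastFloat {
        FastFloat(self.0 * rhs.0) // placeholder for a fast-math mul
    }
}

impl From<FastFloat> for f32 {
    fn from(x: FastFloat) -> f32 {
        x.0
    }
}

// The wrapping/unwrapping that this option requires:
pub fn myfunc(a: f32, b: f32) -> f32 {
    let afast = FastFloat(a);
    let bfast = FastFloat(b);
    let result = afast * bfast + afast; // a * b + a
    result.into()
}

fn main() {
    assert_eq!(myfunc(2.0, 3.0), 8.0);
}
```

Even this tiny example shows the ergonomic cost: every input must be wrapped and every output unwrapped, and the wrapper must re-implement each operator trait.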

Advantages

  • Doesn’t introduce anything into the language itself

Disadvantages

  • Requires a bunch of wrapping/unwrapping which makes for ugly code
  • It’s much harder to compose with other things you might want to do. If you’re using f32x4, that needs a wrapper type too. If you’re using ordered_float, you need a way to wrap twice.

Discussion

From my view, Option 3, using target_feature, seems best. The syntax looks good and it composes well with other things that will be done with it, like the SIMD use cases.

Curious as to what other people think, if there are entirely new options I haven’t thought of or if there are advantages and disadvantages to these ones that I haven’t considered.


#2

I’ve implemented the fourth option:

I think this approach is promising:

Advantages

  • Easy to apply to whole code/use-cases. For example an alias for a “pixel” type can be wrapped, speeding up all graphics calculations, without affecting physics calculations in the same program.
  • Easy to limit effects. Even within a function, or even a single expression, it’s possible to control precision (e.g. compute a value within a loop using fast math, but add it to an accumulator using “slow” math to avoid systematic errors in the sum)
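The accumulator point can be illustrated even without a fast-math wrapper. This is a hypothetical sketch, using f32 as a stand-in for cheaply computed per-iteration values and f64 as the “slow”, precise accumulator:

```rust
fn main() {
    // Add 0.1f32 ten million times; the exact answer is 1,000,000.
    let mut fast_acc = 0.0_f32; // low-precision accumulator
    let mut slow_acc = 0.0_f64; // "slow"/precise accumulator
    for _ in 0..10_000_000 {
        let v = 0.1_f32; // stand-in for a value computed with fast math
        fast_acc += v;
        slow_acc += v as f64;
    }
    // The f32 accumulator drifts badly once the sum grows large and
    // each addend falls below the rounding granularity; the f64
    // accumulator stays close to the true value.
    assert!((slow_acc - 1_000_000.0).abs() < 1.0);
    assert!((fast_acc as f64 - 1_000_000.0).abs() > 1.0);
    println!("f32 sum = {}, f64 sum = {}", fast_acc, slow_acc);
}
```

The same separation applies to fast-math wrappers: compute each term however cheaply you like, but keep the running sum in ordinary arithmetic.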

Disadvantages

  • Rust doesn’t have syntax for fast-math literals, so it needs extra syntactic noise like 1.0.to_fast(). This could be fixed if it were made a native Rust type.
  • Last time I checked the wrapper type inhibited some optimizations. That’s just a bug.

#3

@kornel I mostly dislike the wrapper because of code bloat and non-composability with other reasons to wrap f32 and f64. If finer-grained control is wanted then there’s no other option, although I’m not sure your example holds. As far as I know there’s no “slow” math for individual operations like a sum: what’s fast about -ffast-math is that you allow the compiler to reorder operations arithmetically, taking advantage of faster ways to compute the same thing that don’t have the same precision characteristics (and may even have higher precision, I think).

Even if we conclude that wrapper types are desirable for fine-grained use cases, I’d still like to have the simpler tagging options available, as those are much easier to use in most cases.


#4

-ffast-math is not actually all-or-nothing. The property can be switched on or off for each operation; there is already a “slow” and a “fast” add in Rust today: https://doc.rust-lang.org/core/intrinsics/fn.fadd_fast.html


#5

Another solution (4b) is to introduce generic marker traits for algebraic properties of numbers (properties that regular floating-point values don’t fully satisfy) and use them to tell the compiler to perform integer-like optimizations on user-defined numbers (like FP wrappers). Later the same traits would be useful for bigints and other user-defined numbers.


#6

@kornel I believe that what that does is mark the operation as reorderable; the hardware doesn’t have a faster add that can be used in isolation from other operations.


#7

@leonardo could you please clarify what that would look like in code that wants -ffast-math for f32 for example?


#8

What I’ve suggested in 4b is just a different way to create the wrappers, so the user code is the same as in the fourth option.

In that case the FP wrappers use those marker traits, which allow the compiler to perform some of the fast-math optimizations (but NaN-related optimizations need more specific annotations besides the marker traits).


#9

-ffast-math can be decomposed into a number of lower-level knobs that may be useful to expose independently. This is cribbed from the GCC manual; clang’s manual fails to document how many of these are supported as command-line switches, but I expect at the level of LLVM IR they are all present. This list is in roughly decreasing order of how likely they are to make your code produce results that are just totally wrong, which is sadly also decreasing order of how useful they are to the optimizer.

-fassociative-math: Allow re-association of operands in series of floating-point operations. For instance, a * (b * c) can be transformed into (a * b) * c and vice versa. Extremely dangerous, as it can create opportunities for catastrophic cancellation where none existed in the source code (in fact, it can undo manual adjustments intended to avoid catastrophic cancellation).

Also allows the compiler to assume that a < b and !(a >= b) are equivalent, which is not true in the presence of NaN.
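Both hazards are easy to demonstrate in plain Rust — addition is not associative, and NaN makes ordered comparisons non-negatable:

```rust
fn main() {
    // Re-association changes results: the same three addends,
    // grouped two ways, round differently.
    let left = (0.1_f64 + 0.2) + 0.3;  // what the source says
    let right = 0.1_f64 + (0.2 + 0.3); // what re-association may produce
    assert_ne!(left, right);
    println!("{:.17} != {:.17}", left, right);

    // With NaN, every ordered comparison is false, so an ordered
    // comparison and its apparent negation are both false at once.
    let nan = f64::NAN;
    assert!(!(nan < 1.0));  // false...
    assert!(!(nan >= 1.0)); // ...and so is its "negation"
}
```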

-fcx-limited-range: Complex multiplication, division, and absolute value may use the “usual mathematical formulas” instead of ones that avoid overflow and catastrophic cancellation in intermediate stages. (This one’s behavior is actually specified in C99 under the name of the CX_LIMITED_RANGE pragma.) Also disables checks for half-NaN complex numbers (i.e. NaN + i*finite or vice versa) in these calculations.

-freciprocal-math: Allow replacement of x / y with x * (1/y). This hurts precision, but can be huge performance-wise if it means the compiler can hoist the 1/y part out of a loop.
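The substitution is observable: dividing rounds once, while multiplying by a precomputed reciprocal rounds twice. For these particular inputs the two results differ by one ulp:

```rust
fn main() {
    let x = 10.0_f64;
    let y = 3.0_f64;
    let direct = x / y;  // one correctly rounded operation
    let inv = 1.0 / y;   // first rounding
    let recip = x * inv; // second rounding
    assert_ne!(direct, recip); // off by one ulp for these inputs
    println!("{:e} vs {:e}", direct, recip);
}
```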

-ffinite-math-only: Assume that Inf and NaN never occur, either as arguments to or results from operations. This is relatively safe unless your code is specifically looking for Inf and NaN (e.g. to detect overflows).
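A typical overflow check of the kind this flag can break — under finite-math-only the optimizer is entitled to assume the `is_infinite` test is always false and delete it:

```rust
fn main() {
    let big = f64::MAX;
    let sum = big + big; // overflows to +Inf under IEEE semantics
    // Overflow detection that -ffinite-math-only would license the
    // optimizer to fold away:
    assert!(sum.is_infinite());
}
```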

-fno-signed-zeros: Treat +0.0 and -0.0 as equivalent. This allows e.g. replacing x+0.0 with x and 0.0*x with 0.0. Only problematic if your code relies on the presence of negative zero (e.g. to ensure that an underflowed negative value remains negative and therefore stays on the appropriate side of a discontinuity).
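The two zeros compare equal but are distinguishable through division, which is exactly the distinction -fno-signed-zeros allows the compiler to discard:

```rust
fn main() {
    let pz = 0.0_f64;
    let nz = -0.0_f64;
    // The two zeros compare equal...
    assert_eq!(pz, nz);
    // ...but dividing by them reveals the sign.
    assert_eq!(1.0 / pz, f64::INFINITY);
    assert_eq!(1.0 / nz, f64::NEG_INFINITY);
}
```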

-fno-trapping-math: Generate code assuming that floating-point operations do not generate traps.
-fno-rounding-math: Generate code assuming that the program never changes the rounding mode from the default.

These exist only because a C compiler has no way of knowing whether IEEE floating-point traps have been enabled, or the rounding mode has been changed; I would like to think that Rust could specify these to work lexically, such that (in safe code) the compiler would always know which mode was in effect. Note however that “no trapping” implies “signaling NaN is never used” and that’s not something that can be statically known.

-fno-math-errno: Math library functions need not bother setting errno, even under circumstances where the C standard says they do.

This one isn’t dangerous at all and should be on by default for Rust code; IEEE floating point has perfectly good in-band signalling of domain and range errors, even when you’re not cutting corners. In fact, since it uses Inf and NaN, it works better when you aren’t.
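Rust’s standard float methods already behave this way: a domain or range error is reported in-band as NaN or Inf rather than through errno.

```rust
fn main() {
    // Domain error: sqrt of a negative number yields NaN, no errno.
    let r = (-1.0_f64).sqrt();
    assert!(r.is_nan());
    // Range error: division by zero signals in-band as Inf.
    assert!((1.0_f64 / 0.0).is_infinite());
}
```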


#10

Thanks for those. It seems to me the target_feature route is the easiest way to expose all of this: there could be a fast_math feature that implies them all, with each of them also available independently.


#11

LLVM equivalents: https://llvm.org/docs/LangRef.html#fast-math-flags

Does ordered_float use nnan?


#12

I think it just handles things manually. For example:


#13

Target-feature may not be a good idea when you need both fast and accurate results in the same build. I am working on a small compiler that allows cleaner input and generates pretty decent code.

It looks like this:

math!(sin(x) cos(x+y^2) - 2)

The code generated is roughly

x.sin().mul(x.add(y.powi(2)).cos()).sub(2f32)

However, it would be relatively simple to extend this with custom attributes to specify the precision.

(There is a lack of documentation. Please ping me (sebk) on #rust-sci for details.)


#14

Not sure what you mean. target-feature allows you to have a specific function compiled with specific settings, and there are already procedural-macro solutions being implemented that allow you to have multiple versions of a function compiled with different settings and dispatched dynamically:

Things like fadd_fast() already exist if what you want is op-by-op granularity. The use case that’s not so easily covered is “compile this block of code with fast-math optimizations”. That can already be implemented today with wrapper types, but it’s quite cumbersome.


#15

I mean… target-feature affects the whole build. But what if one part requires precise arithmetic, and the other one could use some speed up?

target-feature only allows you to either make everything precise or fast.


#16

@sebk that’s not how target-feature works. It allows you to tag functions and sections of code to enable specific features. See here: