Pre-RFC: What's the best way to implement `-ffast-math`?


#1

There are several situations where giving the compiler the freedom to reorder mathematical operations in ways that may lead to numerical imprecision also leads to big gains in performance. Here’s the open issue about it:

https://github.com/rust-lang/rust/issues/21690

And here’s a matrix multiplication benchmark that shows that it’s easy to find 20-30% speed improvements just by enabling that compiler flag:

To try and get an initial discussion going, here are a few options for how this could be added to Rust.

Option 1: Add a compiler flag

The most similar thing to C/C++ would be to have a compiler flag that enables that optimization (by passing that option to LLVM) for everything being compiled.

Advantages

  • Most similar to what programmers are used to in other languages

Disadvantages

  • Suffers from the same issues that -ffast-math has in other languages, which make it dangerous in several cases. For example, end-users will enable it in cases where it’s unsafe to do so, get broken software, and then proceed to blame the software authors for it.

Option 2: Specific tagging for this in functions

Add a tag that can be used on functions and that sets the relevant compilation flag in LLVM. Something like:

#[fast-math]
fn myfunc() {
    // ... your f32/f64-using code goes here
}

Advantages

  • Works like other rust features
  • Allows the original author to tag only the specific bits of code where this is safe to do

Disadvantages

  • It’s another feature-specific tag that pollutes this namespace

Option 3: Use the target_feature machinery for this

In some ways -ffast-math is just another special CPU feature that can be enabled or not; there are even special instructions that can only be used if you allow reordering of math. See the target_feature discussion for details, but this would be something like:

#[target_feature = "fast-math"]
fn myfunc() {
    // ... your f32/f64-using code goes here
}

Advantages

  • Doesn’t introduce a new tag but instead just reuses a similar concept
  • Should work nicely together with other target_feature uses, like enabling certain SIMD features, since you often only get the full benefit of those extra instructions in non-handrolled code by allowing the optimizer to reorder math operations

Disadvantages

  • Stretches the concept too far?

Option 4: Use a wrapper type

Wrappers for f32/f64 would be used that imply that feature:

fn myfunc(a: f32, b: f32) -> f32 {
    let afast = FastFloat(a);
    let bfast = FastFloat(b);
    // ... your f32/f64-using code goes here, e.g.:
    let result = afast * bfast;
    result.into()
}
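For concreteness, here is a minimal sketch of what such a wrapper might look like. `FastFloat` and everything in it is hypothetical: plain arithmetic stands in for the fast-math operations, whereas a real implementation would route them through LLVM’s fast-math flags (e.g. the unstable `fadd_fast`/`fmul_fast` intrinsics).

```rust
use std::ops::{Add, Mul};

// Hypothetical wrapper type; the name and details are illustrative only.
// Plain ops stand in for fast-math ops here.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct FastFloat(pub f32);

impl Add for FastFloat {
    type Output = FastFloat;
    fn add(self, rhs: FastFloat) -> FastFloat {
        FastFloat(self.0 + rhs.0) // placeholder for a fast-math add
    }
}

impl Mul for FastFloat {
    type Output = FastFloat;
    fn mul(self, rhs: FastFloat) -> FastFloat {
        FastFloat(self.0 * rhs.0) // placeholder for a fast-math mul
    }
}

impl From<FastFloat> for f32 {
    fn from(x: FastFloat) -> f32 {
        x.0
    }
}

// The wrapping/unwrapping that this option requires:
pub fn myfunc(a: f32, b: f32) -> f32 {
    let afast = FastFloat(a);
    let bfast = FastFloat(b);
    let result = afast * bfast + afast; // a * b + a
    result.into()
}

fn main() {
    assert_eq!(myfunc(2.0, 3.0), 8.0);
}
```

Even this tiny example shows the ergonomic cost: every input must be wrapped and every output unwrapped, and the wrapper must re-implement each operator trait.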

Advantages

  • Doesn’t introduce anything into the language itself

Disadvantages

  • Requires a bunch of wrapping/unwrapping which makes for ugly code
  • It’s much harder to compose with other things you might want to do. If you’re using f32x4, that needs a wrapper type too. If you’re using ordered_float, you need a way to wrap twice.

Discussion

From my view, Option 3, using target_feature, seems best. The syntax looks good and it composes well with other things that will be done with it, like the SIMD use cases.

Curious as to what other people think, if there are entirely new options I haven’t thought of or if there are advantages and disadvantages to these ones that I haven’t considered.


#2

I’ve implemented the fourth option:

I think this approach is promising:

Advantages

  • Easy to apply to whole code/use-cases. For example an alias for a “pixel” type can be wrapped, speeding up all graphics calculations, without affecting physics calculations in the same program.
  • Easy to limit effects. Even within a function, or even a single expression, it’s possible to control precision (e.g. compute a value within a loop using fast math, but add it to an accumulator using “slow” math to avoid systematic errors in the sum)
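The accumulator point can be illustrated even without a fast-math wrapper. This is a hypothetical sketch, using f32 as a stand-in for cheaply computed per-iteration values and f64 as the “slow”, precise accumulator:

```rust
fn main() {
    // Add 0.1f32 ten million times; the exact answer is 1,000,000.
    let mut fast_acc = 0.0_f32; // low-precision accumulator
    let mut slow_acc = 0.0_f64; // "slow"/precise accumulator
    for _ in 0..10_000_000 {
        let v = 0.1_f32; // stand-in for a value computed with fast math
        fast_acc += v;
        slow_acc += v as f64;
    }
    // The f32 accumulator drifts badly once the sum grows large and
    // each addend falls below the rounding granularity; the f64
    // accumulator stays close to the true value.
    assert!((slow_acc - 1_000_000.0).abs() < 1.0);
    assert!((fast_acc as f64 - 1_000_000.0).abs() > 1.0);
    println!("f32 sum = {}, f64 sum = {}", fast_acc, slow_acc);
}
```

The same separation applies to fast-math wrappers: compute each term however cheaply you like, but keep the running sum in ordinary arithmetic.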

Disadvantages

  • Rust doesn’t have syntax for fast-math literals, so it needs extra syntactic noise like 1.0.to_fast(). This could be fixed if it were made a native Rust type.
  • Last time I checked the wrapper type inhibited some optimizations. That’s just a bug.

#3

@kornel I mostly dislike the wrapper because of code bloat and non-composability with other reasons to wrap f32 and f64. If finer-grained control is wanted then there’s no other option, although I’m not sure your example holds. As far as I know there’s no “slow” math for individual operations like a sum: what’s fast about -ffast-math is that you allow the compiler to reorder operations arithmetically, taking advantage of faster ways to compute the same thing that don’t have the same precision characteristics (and may even have higher precision, I think).

Even if we conclude that wrapper types are desirable for fine-grained use cases, I’d still like to have the simpler tagging options available, as those are much easier to use in most cases.


#4

-ffast-math is not actually all-or-nothing. The property can be switched on or off for each operation; there is already a “slow” and a “fast” add in Rust today: https://doc.rust-lang.org/core/intrinsics/fn.fadd_fast.html


#5

Another solution (4b) is to introduce generic marker traits for algebraic properties of numbers (properties that regular floating-point values don’t fully satisfy) and use them to tell the compiler to perform integer-like optimizations on user-defined numbers (like FP wrappers). Later the same traits would be useful for bigints and other user-defined numbers.


#6

@kornel I believe that what that does is mark the operation as reorderable; the hardware doesn’t have a faster add that can be used in isolation from other operations.


#7

@leonardo could you please clarify what that would look like in code that wants -ffast-math for f32 for example?


#8

What I’ve suggested in 4b is just a different way to create the wrappers, so the user code is the same as in the fourth option.

In that case the FP wrappers use those marker traits, which allow the compiler to perform some of the fast-math optimizations (but NaN-related optimizations need more specific annotations besides the marker traits).


#9

-ffast-math can be decomposed into a number of lower-level knobs that may be useful to expose independently. This is cribbed from the GCC manual; clang’s manual fails to document how many of these are supported as command-line switches, but I expect at the level of LLVM IR they are all present. This list is in roughly decreasing order of how likely they are to make your code produce results that are just totally wrong, which is sadly also decreasing order of how useful they are to the optimizer.

-fassociative-math: Allow re-association of operands in series of floating-point operations. For instance, a * (b * c) can be transformed into (a * b) * c and vice versa. Extremely dangerous, as it can create opportunities for catastrophic cancellation where none existed in the source code (in fact, it can undo manual adjustments intended to avoid catastrophic cancellation).

Also allows the compiler to assume that a < b and !(a >= b) are equivalent, which is not true in the presence of NaN.
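Both hazards are easy to demonstrate in plain Rust — addition is not associative, and NaN makes ordered comparisons non-negatable:

```rust
fn main() {
    // Re-association changes results: the same three addends,
    // grouped two ways, round differently.
    let left = (0.1_f64 + 0.2) + 0.3;  // what the source says
    let right = 0.1_f64 + (0.2 + 0.3); // what re-association may produce
    assert_ne!(left, right);
    println!("{:.17} != {:.17}", left, right);

    // With NaN, every ordered comparison is false, so an ordered
    // comparison and its apparent negation are both false at once.
    let nan = f64::NAN;
    assert!(!(nan < 1.0));  // false...
    assert!(!(nan >= 1.0)); // ...and so is its "negation"
}
```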

-fcx-limited-range: Complex multiplication, division, and absolute value may use the “usual mathematical formulas” instead of ones that avoid overflow and catastrophic cancellation in intermediate stages. (This one’s behavior is actually specified in C99 under the name of the CX_LIMITED_RANGE pragma.) Also disables checks for half-NaN complex numbers (i.e. NaN + i*finite or vice versa) in these calculations.

-freciprocal-math: Allow replacement of x / y with x * (1/y). This hurts precision, but can be huge performance-wise if it means the compiler can hoist the 1/y part out of a loop.
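The substitution is observable: dividing rounds once, while multiplying by a precomputed reciprocal rounds twice. For these particular inputs the two results differ by one ulp:

```rust
fn main() {
    let x = 10.0_f64;
    let y = 3.0_f64;
    let direct = x / y;  // one correctly rounded operation
    let inv = 1.0 / y;   // first rounding
    let recip = x * inv; // second rounding
    assert_ne!(direct, recip); // off by one ulp for these inputs
    println!("{:e} vs {:e}", direct, recip);
}
```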

-ffinite-math-only: Assume that Inf and NaN never occur, either as arguments to or results from operations. This is relatively safe unless your code is specifically looking for Inf and NaN (e.g. to detect overflows).
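A typical overflow check of the kind this flag can break — under finite-math-only the optimizer is entitled to assume the `is_infinite` test is always false and delete it:

```rust
fn main() {
    let big = f64::MAX;
    let sum = big + big; // overflows to +Inf under IEEE semantics
    // Overflow detection that -ffinite-math-only would license the
    // optimizer to fold away:
    assert!(sum.is_infinite());
}
```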

-fno-signed-zeros: Treat +0.0 and -0.0 as equivalent. This allows e.g. replacing x+0.0 with x and 0.0*x with 0.0. Only problematic if your code relies on the presence of negative zero (e.g. to ensure that an underflowed negative value remains negative and therefore stays on the appropriate side of a discontinuity).
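The two zeros compare equal but are distinguishable through division, which is exactly the distinction -fno-signed-zeros allows the compiler to discard:

```rust
fn main() {
    let pz = 0.0_f64;
    let nz = -0.0_f64;
    // The two zeros compare equal...
    assert_eq!(pz, nz);
    // ...but dividing by them reveals the sign.
    assert_eq!(1.0 / pz, f64::INFINITY);
    assert_eq!(1.0 / nz, f64::NEG_INFINITY);
}
```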

-fno-trapping-math: Generate code assuming that floating-point operations do not generate traps.
-fno-rounding-math: Generate code assuming that the program never changes the rounding mode from the default.

These exist only because a C compiler has no way of knowing whether IEEE floating-point traps have been enabled, or the rounding mode has been changed; I would like to think that Rust could specify these to work lexically, such that (in safe code) the compiler would always know which mode was in effect. Note however that “no trapping” implies “signaling NaN is never used” and that’s not something that can be statically known.

-fno-math-errno: Math library functions need not bother setting errno, even under circumstances where the C standard says they do.

This one isn’t dangerous at all and should be on by default for Rust code; IEEE floating point has perfectly good in-band signalling of domain and range errors, even when you’re not cutting corners. In fact, since it uses Inf and NaN, it works better when you aren’t.
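Rust’s standard float methods already behave this way: a domain or range error is reported in-band as NaN or Inf rather than through errno.

```rust
fn main() {
    // Domain error: sqrt of a negative number yields NaN, no errno.
    let r = (-1.0_f64).sqrt();
    assert!(r.is_nan());
    // Range error: division by zero signals in-band as Inf.
    assert!((1.0_f64 / 0.0).is_infinite());
}
```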


#10

Thanks for those. It seems to me the target_feature route is the easiest way to expose all of this: there could be a fast_math feature that implies them all, with each of them also available independently.


#11

LLVM equivalents: https://llvm.org/docs/LangRef.html#fast-math-flags

Does ordered_float use nnan?


#12

I think it just handles things manually. For example:


#13

Target-feature may not be a good idea when you need both fast and accurate results in the same build. I am working on a small compiler that allows cleaner input and generates pretty decent code.

It looks like this:

math!(sin(x) cos(x+y^2) - 2)

The code generated is roughly

x.sin().mul(x.add(y.powi(2)).cos()).sub(2f32)

However, it would be relatively simple to extend this with custom attributes to specify the precision.

(There is a lack of documentation. Please ping me (sebk) on #rust-sci for details.)


#14

Not sure what you mean. target-feature allows you to have a specific function compiled with specific settings, and there are already procedural-macro solutions being implemented that allow you to have multiple versions of a function compiled with different settings and dispatched dynamically:

Things like fadd_fast() already exist if what you want is op-by-op granularity. The use case that’s not so easily covered is “compile this block of code with fast-math optimizations”. That can already be implemented today with wrapper types, but it’s quite cumbersome.


#15

I mean… target-feature affects the whole build. But what if one part requires precise arithmetic, and the other one could use some speed up?

target-feature only allows you to either make everything precise or fast.


#16

@sebk that’s not how target-feature works. It allows you to tag functions and sections of code to enable specific features. See here: