Suggestion: FFastMath intrinsics in stable

Absolutely agree.

Also, if someone wants to make "all operations in this crate fast", it's possible to wrap all incoming floats in LooseFloat at the boundary and use that type everywhere internally.
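
For concreteness, a minimal sketch of such a wrapper as it could be written on nightly today, built on the existing fadd_fast/fmul_fast intrinsics (which are UB on non-finite values, hence the unsafe blocks):

#![feature(core_intrinsics)]
use core::intrinsics::{fadd_fast, fmul_fast};
use std::ops::{Add, Mul};

// Minimal LooseFloat newtype; a real version would cover the remaining
// operators, conversions, and document the finiteness requirements.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LooseFloat(pub f32);

impl Add for LooseFloat {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Safety: callers must keep values finite; fadd_fast is UB otherwise.
        Self(unsafe { fadd_fast(self.0, rhs.0) })
    }
}

impl Mul for LooseFloat {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        Self(unsafe { fmul_fast(self.0, rhs.0) })
    }
}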

I presume in both cases people would use the first option if they needed to use it a lot.

The difference is only whether it'd be std::float::LooseFloat that internally uses intrinsics::loose_add() or some_crate::LooseFloat that internally uses std::f32::loose_add().

Yeah, I agree. My wording was rather poor. More accurately, what I meant was that you'd usually want to make the choice at the type level rather than at the level of individual operation invocations, and additionally that you'd probably want to do that by choosing a single type (alias) that's used consistently by the whole module or what have you.
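
In code, that amounts to one alias per module (using the LooseFloat sketch above; illustrative, not an existing std type):

// Flip this one line to switch the whole module between fast and
// reproducible arithmetic.
type Real = LooseFloat; // or `type Real = f32;` for the reproducible build

fn axpy(a: Real, x: Real, y: Real) -> Real {
    a * x + y
}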

But on the other hand, I guess that even in many fp-intensive uses you might only want to trade accuracy for speed in tight inner loops rather than everywhere.

I could imagine both, really. Perhaps the rendering and particle system effects code wants loose math everywhere and the gameplay physics code wants reproducible math everywhere -- even though they're both in the same app or even crate.

8 Likes

(sorry for bringing up the overly generic types and risking derailing)

So here's a question. We know the maximal set of -ffast-math optimizations that LLVM can provide, as well as the maximal set that can be provided safely. (And it's worth noting that inlined code can often infer nnan and ninf and propagate them for optimization; fpsev has gotten even better and the author has written more posts about it, but I found those two interesting.)

Providing intrinsics for f32::add, a single f32::loose_add, and maybe an unsafe f32::assume_finite for nnan/ninf is at least somewhat tenable. Having loose_op intrinsics for every combination of potential fast math flags is obviously combinatorially prohibitive, though.
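
As signatures, that trio might look like this (names taken from this paragraph; none of these exist yet):

// IEEE-conformant addition, i.e. what `+` already does.
fn add(a: f32, b: f32) -> f32;
// Addition with the safe subset of fast-math flags applied.
fn loose_add(a: f32, b: f32) -> f32;
// Unsafe: UB if the value is NaN or infinite, licensing nnan/ninf downstream.
unsafe fn assume_finite(x: f32) -> f32;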

So the actual question: does rustc have a mechanism to provide an intrinsic that is const generic over something like IeeeConformance/FastMathFlags?

If rustc already has the capability, then I think adding (or modifying) intrinsics to expose the ability to call operations with fast math flags is an obvious next step, and such a PR would face little to no opposition. This would allow experimentation on nightly, to see what kind of benefit this provides in real-world cases, and to experiment with a more well-typed wrapper.

If rustc doesn't have this capability, then it would need to be added before a general mechanism could even be provided. In that case, I believe the way forward would be to add (or modify) intrinsics to provide just the maximal set of safe fast math flags, and then draft a plan and MCP to add the requisite functionality for const generic intrinsics.
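
For reference, the shape such an intrinsic might take (the flag encoding and all names here are invented for illustration):

// Hypothetical const-generic intrinsic: rustc would lower FLAGS, a bitset
// covering nnan/ninf/nsz/arcp/contract/afn/reassoc, into LLVM fast-math
// flags on the emitted fadd.
extern "rust-intrinsic" {
    fn fadd_flagged<T: Copy, const FLAGS: u8>(a: T, b: T) -> T;
}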


Even if a wrapper type is provided by std, the intrinsics need to be there for it to call. And I see little to no reason not to (unstably) provide the intrinsics, so someone should do the leg work to make a PR to provide them.

(The only exception I could possibly see is if f32 itself became the generic type, as f32 + f32 isn't a (normal) intrinsic, it's literally baked into the implementation of the language. Though this could change so that we have impl Add for f32 { fn add(self, o: f32) -> f32 { intrinsics::add(self, o) } } instead of the current impl Add for f32 { fn add(self, o: f32) -> f32 { self + o } }, I suppose. Maybe that'd even be better, for some definition of the word better.)

3 Likes

If necessary, the underlying intrinsic could accept the mode value as a regular function parameter which is required to be constant, rather than a const generic. This is already used for SIMD shuffles (plus other SIMD things where the constant value is pulled out by LLVM rather than the frontend).
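
In that style the declaration might look like this (hypothetical; the constant-argument requirement mirrors how SIMD shuffle indices are handled):

extern "rust-intrinsic" {
    // `flags` must be a constant at every call site; the backend extracts
    // the value, as it does for SIMD shuffle indices today.
    fn fadd_with_flags<T: Copy>(a: T, b: T, flags: u8) -> T;
}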

1 Like

For the top-level ergonomics, a macro block wrapper seems more appropriate for Rust than crate-wide compilation flags.

let out = loose_math!(maybe_params, {
  multizip!(activations, weights, biases)
    .map(|(a, w, b)| a * w + b)
    .reduce(|a, b| a + b)
});

This is insensitive to the final choice of intrinsic, but should be enough to write a coherent experimental implementation, motivation, and usage examples without resorting to crate-wide or compilation-wide loosening.


Additionally, I'd like to raise a bit of caution around defining this in terms of LLVM flags. Instead, list the permitted transformations, which currently happen to match what LLVM provides.

2 Likes

Does it make sense to do this with an attribute that allows approximating optimizations for the covered code? A crate that wants everything approximated can put, for example, #![approximate(float-reassoc)] at the top, or items/statements can be annotated individually.

I’d expect that the argument here would be a comma-separated list of terms like:

  • name or enable name: Allows (but doesn’t require) the compiler to use the named approximation.
  • disable name: Forbids the named approximation from being used, even if it was enabled at a higher level.
  • inherit name: (only applies to function definitions) Allow the named approximation iff it is allowed at the callsite. If the compiler can’t determine this statically, it disables the approximation.

Add the unsafe keyword in there somewhere for approximations that can cause UB.
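
Putting those pieces together, usage might look like this (syntax entirely hypothetical):

#![approximate(float-reassoc)] // crate-wide opt-in

#[approximate(disable float-reassoc)] // this test needs exact ordering
fn converged(prev: f64, next: f64, tol: f64) -> bool {
    (next - prev).abs() < tol
}

#[approximate(unsafe float-nnan)] // UB if a NaN ever flows through here
fn kernel(x: f64) -> f64 {
    x * x + 1.0
}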

(Having never looked at the internals of the compiler, I have no idea how feasible this is to implement, especially inherit)

2 Likes

I agree with kornel's post above on this. Anything affecting a scope is awkward because it's unclear how it should affect calls to other functions -- be that .sum(), .sqrt(), overloaded operators, ndarray, etc.

6 Likes

Several folks in this thread talked about this over on Zulip a few months ago and my implementation plan had been to add intrinsics like fn fadd_with_flags<T: Copy>(a: T, b: T, f: Flags) -> T with intent to stabilize (only allowing safe flags). Pandemic life got in the way, but I still think it's the right first step since:

  1. This maps clearly to LLVM's intrinsics and mirrors existing fadd_fast.
  2. It's a necessary first step to more ergonomic interfaces.
  3. There is an eager user base for such features right now.
  4. More ergonomic interfaces should be worked out in libraries first and only added to core if consensus forms or the desired interface can't be implemented as a library.

If nobody beats me to it, I should be able to work up a PR after my semester ends. The current state is a roadblock to taking Rust seriously in a lot of numerical/scientific computing contexts.
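
To make the intent concrete, a sketch of that intrinsic plus a safe caller (the Flags representation and the matching fmul_with_flags are illustrative, not a settled design):

// Only flags that cannot cause UB would be exposed, so the functions are safe.
#[derive(Clone, Copy)]
pub struct Flags {
    pub reassoc: bool,
    pub contract: bool,
    pub arcp: bool,
}

pub fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    const F: Flags = Flags { reassoc: true, contract: true, arcp: false };
    // reassoc lets LLVM vectorize the reduction; contract lets it emit FMAs.
    xs.iter().zip(ys).fold(0.0, |acc, (&x, &y)| {
        fadd_with_flags(acc, fmul_with_flags(x, y, F), F)
    })
}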

BTW, I agree with others that almost everyone wants contract for performance-sensitive numerical code, but I'll point out that despite usually being at least as accurate, the relative error under contract can be infinite due to a*a - a*a != 0, and there exists real-world performance-sensitive code that has eschewed contraction due to substantial changes in results [1]. Meanwhile, reassoc is much more important than contract for fast reductions (dot products and the like) because it's necessary for both vectorization and ILP (and doing those manually and arch-aware is tedious). However, reassoc can significantly change results in more circumstances, so code dealing with ill conditioning and convergence tests should usually avoid it. In my experience, arcp is a mild convenience and I don't think I've worked with real code that it would harm.
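
To make the contract pitfall concrete: f32::mul_add computes exactly what a contracted a*a - a*a may compile to, rounding the product only once:

let a = 1.1_f32;
let plain = a * a - a * a;                // both products round identically: exactly 0.0
let fused = f32::mul_add(a, a, -(a * a)); // exact a*a minus rounded a*a: ~1.4e-8 here
let rel = (fused - plain) / plain;        // relative error is infinite (division by 0.0)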

[1] The Community Earth System Model (CESM) is one such example, and it's likely that the FMA code is not actually less accurate, but it does require recalibrating parameters (a labor-intensive multi-stakeholder endeavor).

6 Likes

This can be done in an unambiguous way. For example, if the appropriate methods are added to the primitive types, the annotation only affects which method is invoked by the operators (i.e., the meaning of + or * changes in that scope).

This of course means the answer to "Does this affect .sum(), .sqrt() etc." is "No".
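
Under that reading, the scope boundary is easy to state (annotation syntax hypothetical):

#[approximate(float-reassoc)]
fn f(xs: &[f32]) -> f32 {
    let a = xs[0] + xs[1];        // `+` resolves to the loose addition here
    let b: f32 = xs.iter().sum(); // an ordinary method call: unaffected
    a + b
}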

On the surface it might seem like the lack of propagation to called methods is a downside. However I think it may be necessary regardless of the approach.

It ought to be possible for a particular crate / method to control its own meaning.

It is highly undesirable for an application to depend on a library that has many tests checking its correctness, only for those tests to be circumvented because the embedding application changes the meaning of multiplication inside the library. This pitfall is precisely why -ffast-math has a bad reputation.

Regardless of syntax, if the control is local, we can maintain consistency and testability and don't need to regard it as unsafe.

1 Like

There could be an attribute on functions, something like #[inherits_floating_point_fastness], that allows functions like sqrt to opt in to having -ffast-math propagate into them. However, Rust already has too many magical, ad-hoc, built-in attributes designed to paper over the lack of a proper effects system.

3 Likes

A flag for inheriting floating point behavior would effectively be an alternative syntax for generics (f32 -> impl FloatAtAnySpeed). It's not possible to (sensibly) compile such a function by generating its code only once like a regular function. It would have to be monomorphized, possibly more than once, with the float implementation chosen by the caller.
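
In code terms (trait name hypothetical), the inheriting function is morally:

// Equivalent to a generic over the float flavor; each caller's choice
// forces a separate monomorphization, just like any other generic.
fn norm2<F: FloatAtAnySpeed>(a: F, b: F) -> F {
    a * a + b * b
}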

3 Likes

Then again, that's how it works for #[rustc_inherit_overflow_checks], right? Although I suppose that has the advantage of being exclusively a global toggle for the default.

The number of f32::method() math functions provided is an interesting wrinkle for fast-math flags. At the very least, afn doesn't even do anything (AIUI) for the primitive ops, only for calls to a builtin (or well-known) function.

I still don't know how much I like it (it's still around a mild dislike), but this is kind of an argument to just making it f32::<IeeeConformance>::method(), i.e. making f32 itself generic, and doing something like -Zshare-generics to still provide the current impl precompiled.

But no, I do think that at the root language level, exposing these as new operations on f32, rather than a new primitive type, is the proper way of doing things, as fundamentally the flags we're talking about apply to the operations over the type, not the type itself. There's just a few more than I originally thought.

2 Likes

It absolutely can, but that was a bit of a trap question -- if I'm using afn fast-math, then I really want 1/x.sqrt() to be compiled to a special invsqrt.

So if it's not, then we either need to make different methods for all these things -- of which there are far more than just +-*/% -- or to make a new type on which sqrt does that. At which point it seems mostly pointless to have a magic scope that changes the operators (affecting only primitives, not overloaded ones) when those could just as well be methods or go on the other type.

Yes, though that's also a major hack. If we ever get build-std support figured out, it could just go away so that std could follow whatever your normal build flags are -- and this is desirable for a bunch of other reasons too, like being able to have debug_asserts in core that actually do something outside of rustc's CI runs.

1 Like

It would. That transformation is already done by the LLVM optimizer, not by the frontend.

But LLVM wouldn't have permission to do this if fast math flags don't propagate. You'd effectively have (forgive my LLIR non-knowledge)

%2 = call float @llvm.sqrt.f32(float %1)
%3 = fdiv fast float 1.0, %2

And this cannot legally be transformed into a reciprocal square root AIUI, because the sqrt is not tagged with the fast-math flags, so it must be exact. Or do the fast-math flags back-propagate in this case, so this can contract into an rsqrt without the sqrt being fast-math tagged?

3 Likes

Fundamentally there are only three options:

  1. Fixed at language level - Unavoidably slow
  2. Inherited settings - Unsafe and impossible to test
  3. Local settings - Developers choose between slow or verbose

This is an argument towards making it a type-level knob -- then if you're using the correct flag, both (all) ops will be tagged with it and LLVM can perform the contraction. Making f32 itself generic seems the most elegant, as long as it doesn't break anything.
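
Sketching the type-level knob, extending the LooseFloat idea from earlier (the *_afn intrinsics do not exist today; they stand in for whatever flag-carrying intrinsics eventually land):

use std::ops::Div;

impl LooseFloat {
    fn sqrt(self) -> Self {
        Self(unsafe { intrinsics::sqrt_afn(self.0) }) // hypothetical
    }
}

impl Div for LooseFloat {
    type Output = Self;
    fn div(self, rhs: Self) -> Self {
        Self(unsafe { intrinsics::fdiv_afn(self.0, rhs.0) }) // hypothetical
    }
}

// With both operations tagged afn, 1.0 / x.sqrt() is eligible for the
// reciprocal-square-root rewrite.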

However, as mentioned above, the intrinsics need to be available first, regardless of the more ergonomic syntax built on top of them.

This variant would not be optimized, because nothing tells LLVM that the optimization is allowed. To get it converted to an inverse square root, we need to emit IR like this:

%2 = call afn float @llvm.sqrt.f32(float %1)
%3 = fdiv afn float 1.0, %2

You can specify fast-math flags on calls to intrinsics like sqrt, sin, exp, log, etc.: https://llvm.org/docs/LangRef.html#i-call

Also, you cannot just add these flags to get a whole called function optimized, because they are only valid in certain situations:

The optional fast-math flags marker indicates that the call has one or more fast-math flags, which are optimization hints to enable otherwise unsafe floating-point optimizations. Fast-math flags are only valid for calls that return a floating-point scalar or vector type, or an array (nested to any depth) of floating-point scalar or vector types.

I think that with my type-based approach, we could just add such flags to the LLVM intrinsic calls when needed.