Absolutely agree.
Also, if someone wants to make "all operations in this crate fast", it is possible to wrap all incoming floats into `LooseFloat` and use them everywhere internally.
I presume in both cases people would use the first option if they needed to use it a lot. The difference is only whether it'd be `std::float::LooseFloat` that internally uses `intrinsics::loose_add()`, or `some_crate::LooseFloat` that internally uses `std::f32::loose_add()`.
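As a minimal sketch of that wrapper-type idea: `LooseFloat` here is a hypothetical newtype, and since the loose intrinsics don't exist yet, the stand-in bodies just use the ordinary operators (a real implementation would forward to something like `intrinsics::fadd_fast`):

```rust
use std::ops::{Add, Mul};

// Hypothetical newtype: all arithmetic on it is allowed to be "loose".
#[derive(Clone, Copy, Debug, PartialEq)]
struct LooseFloat(f32);

impl Add for LooseFloat {
    type Output = LooseFloat;
    fn add(self, rhs: LooseFloat) -> LooseFloat {
        // A real impl would call a loose intrinsic here instead of `+`.
        LooseFloat(self.0 + rhs.0)
    }
}

impl Mul for LooseFloat {
    type Output = LooseFloat;
    fn mul(self, rhs: LooseFloat) -> LooseFloat {
        LooseFloat(self.0 * rhs.0)
    }
}

fn main() {
    // Wrap incoming floats once at the boundary, then use them everywhere.
    let (a, w, b) = (LooseFloat(2.0), LooseFloat(3.0), LooseFloat(1.0));
    assert_eq!(a * w + b, LooseFloat(7.0));
}
```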
Yeah, I agree. My wording was rather poor. More accurately, what I meant was that usually you'd probably want to make the choice at the type level rather than the level of individual operation invocations, and additionally that you'd probably want to do that by choosing a single type (alias) that's used consistently by the whole module or what have you.
But on the other hand, I guess that even in many fp-intensive uses you might only want to trade accuracy for speed in tight inner loops rather than everywhere.
I could imagine both, really. Perhaps the rendering and particle-system effects code wants loose math everywhere and the gameplay physics code wants reproducible math everywhere -- even though they're both in the same app or even crate.
(sorry for bringing up the overly generic types and risking derailing)
So here's a question. We know the maximal set of `-ffast-math` optimizations that LLVM can provide, as well as the maximal set that can be provided safely. (And it's worth noting that inlined code can often infer `nnan` and `ninf` and propagate them for optimization. (fpsev has gotten even better and the author has written more posts about it, but I found those two interesting.))
Providing intrinsics for `f32::add`, a single `f32::loose_add`, and maybe an unsafe `f32::assume_finite` for nnan/ninf is at least somewhat tenable. Having `loose_op` intrinsics for every combination of potential fast-math flags is obviously combinatorially prohibitive, though.
So the actual question: does rustc have a mechanism to provide a const generic intrinsic, one that would be const generic over `IeeeConformance`/`FastMathFlags`?
If rustc already has the capability, then I think adding (or modifying) intrinsics to expose the ability to call operations with fast-math flags is an obvious next step, and such a PR would likely face little to no opposition. This would allow experimentation on nightly, to see what kind of benefit this provides in real-world cases, and to experiment with a more well-typed wrapper.
If rustc doesn't have this capacity, then it would need to be added before a general mechanism could even be provided. In that case, I believe the way forward would be to add (or modify) intrinsics to provide just the maximal set of safe fast math flags, and then draft a plan and MCP to add the requisite functionality for const generic intrinsics.
Even if a wrapper type is provided by std, the intrinsics need to be there for it to call. And I see little to no reason not to (unstably) provide the intrinsics, so someone should do the leg work to make a PR to provide them.
(The only exception I could possibly see is if `f32` itself became the generic type, as `f32 + f32` isn't a (normal) intrinsic; it's literally baked into the implementation of the language. Though this could change so that we have `impl Add for f32 { fn add(self, o: f32) -> f32 { intrinsics::add(self, o) } }` instead of the current `impl Add for f32 { fn add(self, o: f32) -> f32 { self + o } }`, I suppose. Maybe that'd even be better, for some definition of the word better.)
If necessary, the underlying intrinsic could accept the mode value as a regular function parameter which is required to be constant, rather than a const generic. This is already used for SIMD shuffles (plus other SIMD things where the constant value is pulled out by LLVM rather than the frontend).
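To make that concrete, here is a hedged sketch of a flags-as-const-generic surface. Every name in it (`REASSOC`, `CONTRACT`, `ARCP`, `fadd_with_flags`) is hypothetical, and the body is a plain-addition placeholder for what would really be an intrinsic lowering to a flag-carrying LLVM `fadd`:

```rust
// Hypothetical fast-math flag bits, modeled as a const-generic bitmask.
// None of these names exist in rustc today; this only shows the shape.
pub const REASSOC: u8 = 1 << 0;
pub const CONTRACT: u8 = 1 << 1;
pub const ARCP: u8 = 1 << 2;

// A real intrinsic would lower to an LLVM `fadd` carrying the chosen flags;
// this placeholder just performs the plain, IEEE-conformant addition.
pub fn fadd_with_flags<const FLAGS: u8>(a: f32, b: f32) -> f32 {
    a + b
}

fn main() {
    // The flag set is fixed at compile time, so each combination is resolved
    // during monomorphization rather than at runtime.
    let x = fadd_with_flags::<{ REASSOC | CONTRACT }>(1.0, 2.0);
    assert_eq!(x, 3.0);
}
```

Whether the constant travels as a const generic or as a "must be constant" regular parameter (as with the SIMD shuffles mentioned above) only changes which part of the compiler enforces the constness.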
For the top level ergonomics, a macro block wrapper seems more appropriate for Rust than crate compilation flags.
```rust
let out = loose_math!(maybe_params) {
    multizip!(activations, weights, biases)
        .map(|(a, w, b)| a * w + b)
        .reduce(|a, b| a + b)
};
```
This is insensitive to the final choice of intrinsic, but should be enough to write a coherent experimental implementation, Motivation, and usage examples without resorting to crate-wide or compilation-wide loosening.
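For what it's worth, the surface shape can be prototyped today with a plain pass-through macro. This sketch does no operator rewriting at all (a real `loose_math!` would need to retag the arithmetic into flag-carrying intrinsics); it only demonstrates the ergonomics:

```rust
// Placeholder: evaluates the block unchanged. A real implementation would
// rewrite `+` / `*` inside into loose intrinsic calls.
macro_rules! loose_math {
    ($body:block) => { $body };
}

fn main() {
    let out = loose_math!({
        let (a, w, b) = (2.0_f32, 3.0_f32, 1.0_f32);
        a * w + b
    });
    assert_eq!(out, 7.0);
}
```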
Additionally, I'd like to raise a little bit of caution around defining this in terms of LLVM flags. Instead, list the permitted changes which happen to currently match what LLVM provides.
Does it make sense to do this with an attribute that allows approximating optimizations for the covered code? A crate that wants everything approximated can put, for example, `#![approximate(float-reassoc)]` at the top, or items/statements can be annotated individually.
I’d expect that the argument here would be a comma-separated list of terms like:

- `enable name`: Allows (but doesn’t require) the compiler to use the named approximation.
- `disable name`: Forbids the named approximation from being used, even if it was enabled at a higher level.
- `inherit name`: (only applies to function definitions) Allow the named approximation iff it is allowed at the callsite. If the compiler can’t determine this statically, it disables the approximation.

Add the `unsafe` keyword in there somewhere for approximations that can cause UB.
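In hypothetical surface syntax (none of these attributes or approximation names exist today; this is purely illustrative pseudocode):

```rust
#![approximate(enable reassoc, enable contract)] // crate-wide default

#[approximate(disable reassoc)] // convergence test must stay reproducible
fn converged(residual: f64, tol: f64) -> bool {
    residual.abs() < tol
}

#[approximate(unsafe enable nnan)] // UB if a NaN actually flows through
fn hot_kernel(xs: &[f32]) -> f32 {
    xs.iter().copied().sum()
}
```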
(Having never looked at the internals of the compiler, I have no idea how feasible this is to implement, especially `inherit`.)
I agree with kornel's post above on this. Anything affecting a scope is awkward because it's unclear how it should affect calls to other functions -- be that `.sum()`, `.sqrt()`, overloaded operators, ndarray, etc.
Several folks in this thread talked about this over on Zulip a few months ago, and my implementation plan had been to add intrinsics like `fn fadd_with_flags<T: Copy>(a: T, b: T, f: Flags) -> T` with intent to stabilize (only allowing safe flags). Pandemic life got in the way, but I still think it's the right first step, since it would subsume the existing `fadd_fast`. If nobody beats me to it, I should be able to work up a PR after my semester ends. The current state is a roadblock to taking Rust seriously in a lot of numerical/scientific computing contexts.
BTW, I agree with others that almost everyone wants `contract` for performance-sensitive numerical code, but point out that despite usually being at least as accurate, the relative error under `contract` can be infinite due to `a*a - a*a != 0`, and there exists real-world performance-sensitive code that has eschewed contraction due to substantial changes in results [1]. Meanwhile, `reassoc` is much more important than `contract` for fast reductions (dot products and the like) because it's necessary for both vectorization and ILP (and doing those manually and arch-aware is tedious). However, `reassoc` can significantly change results in more circumstances, so code dealing with ill conditioning and convergence tests should usually avoid it. In my experience, `arcp` is a mild convenience and I don't think I've worked with real code that it would harm.

[1] The Community Earth System Model (CESM) is one such example, and it's likely that the FMA code is not actually less accurate, but it does require recalibrating parameters (a labor-intensive multi-stakeholder endeavor).
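The `a*a - a*a` hazard above can be demonstrated without any compiler flags by writing the contracted form by hand with `mul_add` (a fused multiply-add with a single rounding), which is what `contract` permits the compiler to produce:

```rust
// Under `contract`, `a * a - a * a` may become `a.mul_add(a, -(a * a))`:
// an exact (fused) product minus a rounded one, i.e. the rounding error.
fn main() {
    let a = 1.1_f32; // 1.1 is not exactly representable, so a*a must round
    let plain = a * a - a * a;               // exactly 0.0, always
    let contracted = a.mul_add(a, -(a * a)); // the rounding error of a*a

    assert_eq!(plain, 0.0);
    assert_ne!(contracted, 0.0); // true answer is 0, so relative error is infinite
    println!("plain = {plain}, contracted = {contracted}");
}
```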
This can be done in an unambiguous way. For example, if the appropriate calls are added to the primitive types, the annotation only affects which call is invoked when using operators. (I.e., the meaning of `+` or `*` is changed in that scope.)

This of course means the answer to "Does this affect `.sum()`, `.sqrt()`, etc." is "No".
On the surface it might seem like the lack of propagation to called methods is a downside. However I think it may be necessary regardless of the approach.
It ought to be possible for a particular crate / method to control its own meaning.
It is highly undesirable for an application to depend on a library, where the library has many tests to check its correctness, but these are circumvented because the application it is embedded in changes the meaning of multiplication inside of it. This pitfall is precisely why `-ffast-math` has a bad reputation.
Regardless of syntax, if the control is local, we can maintain consistency and testability and don't need to regard it as unsafe.
There could be an attribute on functions, something like `#[inherits_floating_point_fastness]`, that allows functions like `sqrt` to opt in to having `-ffast-math` propagate into them. However, Rust already has too many magical, ad-hoc, built-in attributes designed to paper over the lack of a proper effects system.
A flag for inheriting floating point behavior would effectively be an alternative syntax for generics (`f32` -> `impl FloatAtAnySpeed`). It's not possible to (sensibly) compile such a function by generating its code only once like a regular function. It would have to be monomorphized, possibly more than once, with the float implementation chosen by the caller.
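A sketch of that monomorphization point, with a hypothetical `FloatAtAnySpeed` trait and marker types (the `Loose` impl falls back to ordinary ops here, since the loose intrinsics don't exist yet):

```rust
// Hypothetical: the float "speed" is a type parameter chosen by the caller.
trait FloatAtAnySpeed {
    fn add(a: f32, b: f32) -> f32;
}

struct Strict; // IEEE-conformant
struct Loose;  // allowed to reassociate, contract, etc.

impl FloatAtAnySpeed for Strict {
    fn add(a: f32, b: f32) -> f32 { a + b }
}

impl FloatAtAnySpeed for Loose {
    fn add(a: f32, b: f32) -> f32 {
        // Stand-in: a real impl would call a flag-carrying intrinsic.
        a + b
    }
}

// Monomorphized once per caller-chosen `S`, like any generic function.
fn sum<S: FloatAtAnySpeed>(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| S::add(acc, x))
}

fn main() {
    let xs = [1.0_f32, 2.0, 3.0];
    assert_eq!(sum::<Strict>(&xs), 6.0);
    assert_eq!(sum::<Loose>(&xs), 6.0);
}
```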
Then again, that's how it works for `#[rustc_inherit_overflow_checks]`, right? Although I suppose that has the advantage of being exclusively a global toggle for the default.
The number of `f32::method()` math functions provided is an interesting wrinkle in fast-math flags. At the very least, `afn` doesn't even do anything (AIUI) for the primitive ops, only for calling a builtin (or well-known) function.
I still don't know how much I like it (it's still around a mild dislike), but this is kind of an argument for just making it `f32::<IeeeConformance>::method()`, i.e. making `f32` itself generic, and doing something like `-Zshare-generics` to still provide the current impl precompiled.

But no, I do think that at the root language level, exposing these as new operations on `f32`, rather than as a new primitive type, is the proper way of doing things, as fundamentally the flags we're talking about apply to the operations over the type, not the type itself. There are just a few more of them than I originally thought.
It absolutely can, but that was a bit of a trap question -- if I'm using `afn` fast-math, then I really want `1.0 / x.sqrt()` to be compiled to a special `invsqrt`.

So if it's not, then we either need to make different methods for all these things -- of which there are far more than just `+-*/%` -- or to make a new type on which `sqrt` does that. At which point it seems mostly pointless to have a magic scope that changes the operators (one that only affects primitives, not overloaded ones) when those could just also be methods on the other type.
Yes, though that's also a major hack. If we ever get `build-std` support figured out, it could just go away, so that `std` could follow whatever your normal build flags are -- and this is desirable for a bunch of other reasons too, like being able to have `debug_assert!`s in `core` that actually do something outside of rustc's CI runs.
It would. It is already transformed to this by the LLVM optimizer, not by the frontend.

But LLVM wouldn't have permission to do this if fast-math flags don't propagate. You'd effectively have (forgive my LLVM IR non-knowledge; roughly)

```llvm
%2 = call float @llvm.sqrt.f32(float %1)   ; no fast-math flags
%3 = fdiv fast float 1.000000e+00, %2
```

And this cannot legally be transformed into `invsqrt` AIUI, because the `sqrt` is not tagged with the fast-math flags, so it must be accurate. Or do the fast-math flags back-propagate in this case, so this can contract into `invsqrt` without the `sqrt` being fast-math tagged?
Fundamentally there are only three options:
This is an argument towards making it a type-level knob -- then, if you're using the correct flag, both (all) ops will be tagged with it and LLVM can perform the contraction. Making `f32` itself generic seems the most elegant, as long as it doesn't break anything.

However, as mentioned above, the intrinsics need to be available first, regardless of the more ergonomic syntax built on top of them.
This variant would not optimize, because you don't say that optimizations are allowed. To make it convert to an inverse square root, we need to emit IR like this:

```llvm
%2 = call afn float @llvm.sqrt.f32(float %1)
%3 = fdiv afn float 1.000000e+00, %2
```

You can specify fast-math flags on calls of intrinsics like sqrt, sin, exp, log, etc.: https://llvm.org/docs/LangRef.html#i-call
Also, you cannot just add these flags to make the whole inner function optimized, because they only work in certain situations:

> The optional fast-math flags marker indicates that the call has one or more fast-math flags, which are optimization hints to enable otherwise unsafe floating-point optimizations. Fast-math flags are only valid for calls that return a floating-point scalar or vector type, or an array (nested to any depth) of floating-point scalar or vector types.

I think that if we used my type-based approach, we could just add such flags to the LLVM intrinsic calls when needed.