My (old) thought on this was to introduce some way of narrowing the set of valid bit patterns for a type. That could be used to construct r32/r64 types with “holes” where the NaN bit patterns would be. Those holes could then distinguish Some(x) from None, much as the all-zero pattern does for the NonZero types.
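For reference, here is a small sketch of the NonZero behaviour I mean, using only what std has today. The helper name `option_is_free` is made up for illustration. It shows that Option costs nothing extra when the inner type has an unused bit pattern, and that f32 currently has no such hole:

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

/// Hypothetical helper: true when `Option<T>` is the same size as `T`,
/// i.e. when `None` fits into a bit pattern that `T` never uses.
fn option_is_free<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}

fn main() {
    // NonZeroU32 never holds 0, so Option uses the all-zero pattern for None.
    assert!(option_is_free::<NonZeroU32>());

    // Every f32 bit pattern is a valid value (NaNs included),
    // so Option<f32> needs a separate discriminant.
    assert!(!option_is_free::<f32>());

    // Construction checks the forbidden pattern at runtime.
    assert!(NonZeroU32::new(0).is_none());
    assert_eq!(NonZeroU32::new(7).map(|n| n.get()), Some(7));
}
```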
Then every operation that cannot produce NaN from non-NaN inputs would return r32/r64, and everything else would return Option<r32>/Option<r64>. Those Option types would be bit-compatible with f32/f64 (i.e. you could transmute between them), with None taking the place of NaN.
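A rough sketch of what that API split could look like, written as a plain wrapper in today's Rust. The names `R32`, `new` and `get` are hypothetical. The compiler does not know that the NaN patterns are unused here, so this `Option<R32>` is not actually bit-compatible with f32; that part would need the language feature described above.

```rust
use std::ops::{Add, Div, Neg};

/// Hypothetical NaN-free float (the "r32" above), as a plain newtype.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
pub struct R32(f32);

impl R32 {
    /// Returns None for any NaN input.
    pub fn new(x: f32) -> Option<R32> {
        if x.is_nan() { None } else { Some(R32(x)) }
    }

    pub fn get(self) -> f32 {
        self.0
    }

    /// abs cannot turn a non-NaN into NaN, so it returns R32 directly.
    pub fn abs(self) -> R32 {
        R32(self.0.abs())
    }
}

/// Negation cannot produce NaN either.
impl Neg for R32 {
    type Output = R32;
    fn neg(self) -> R32 {
        R32(-self.0)
    }
}

/// Addition can produce NaN (inf + -inf), so it returns Option.
impl Add for R32 {
    type Output = Option<R32>;
    fn add(self, rhs: R32) -> Option<R32> {
        R32::new(self.0 + rhs.0)
    }
}

/// Division can produce NaN (0/0, inf/inf), so it returns Option.
impl Div for R32 {
    type Output = Option<R32>;
    fn div(self, rhs: R32) -> Option<R32> {
        R32::new(self.0 / rhs.0)
    }
}

fn main() {
    let one = R32::new(1.0).unwrap();
    let zero = R32::new(0.0).unwrap();
    let inf = R32::new(f32::INFINITY).unwrap();

    assert_eq!((one + one).map(R32::get), Some(2.0));
    assert_eq!(zero / zero, None);   // 0/0 would be NaN
    assert_eq!(inf + (-inf), None);  // inf - inf would be NaN
    assert_eq!((-one).abs(), one);
    assert!(R32::new(f32::NAN).is_none());
}
```

Note how few operations land in the "returns R32" bucket: even plain addition and multiplication can produce NaN once infinities are allowed, so most arithmetic ends up returning Option.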
However, I think “fast math” is an independent problem. There was a proposal a while ago to add an annotation for turning range checking (i.e. overflow and underflow detection) on or off in a block of code. Something similar might be a better approach for speed optimisation: let programmers say “I need speed for this bit of code, so #[fast_math(3)], and feel free to sacrifice up to 3 bits of precision”. If you want it crate-wide, apply it to the crate, then override it with #[fast_math(0)] wherever you need full precision.