[Pre-RFC?] NonNaN type

I believe that it would be beneficial and consistent to have a NonNaN struct for floats. The NonNull and NonZero types exist largely because an Option of these types has the same underlying representation as the type itself. It’s a type-safety win and a performance non-issue.
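
For reference, the existing niche layouts are observable directly; a hypothetical NonNaN could use the NaN bit patterns as its niche in exactly the same way (the NonNaN comment below is the wish, not current Rust):

use std::mem::size_of;
use std::num::NonZeroU32;

fn main() {
    // The niche optimization today: None is stored as the forbidden value,
    // so the Option wrapper adds no space.
    assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<u32>());
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
    // The wish: Option<NonNaN<f32>> stored as 4 bytes, with some NaN
    // bit pattern playing the role of None.
}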

I believe there are a large number of fields that would use Option<NonNaN> to great effect. For example, in gamedev you may see a lot of things like:

struct World {
     pos: Vec<Vector3<f32>>,
     vel: Vec<Option<Vector3<f32>>>,
     target: Vec<Option<Vector3<f32>>>,
}

Being able to specify that positions, velocities, and targets cannot be NaN would allow all three Vecs to use the same element representation.

Another benefit is that bugs are caught sooner. One of the most frustrating parts of NaNs is that they propagate long before causing a crash. If an assertion failure or panic were raised at the precise operation that produced the NaN, debugging NaNs would be a lot less painful.

Overall, I think NonNaN would be a struct I’d love in the standard library.

15 Likes

NaNs cause crashes?

I’d rather see fun glitches powered by NaNs than DoS exploits caused by NonNaN panicking.

So are you advocating for undefined behavior as being more desirable than panicking? Or are you joking?

7 Likes

Some prior art: (Pre-?)Pre-RFC: Range-restricting wrappers for floating-point types

I definitely want non-NaN types. I think they’re the best answer to the “why can’t I sort floats?” question. But there does seem to be disagreement about how math should work on them. (Some prefer the noisy-float solution.)
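
As a concrete illustration of the sorting point, here is a minimal sketch of such a wrapper; the name NonNaN and its API are made up for this example:

use std::cmp::Ordering;

#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct NonNaN(f64);

impl NonNaN {
    fn new(x: f64) -> Option<Self> {
        if x.is_nan() { None } else { Some(NonNaN(x)) }
    }
}

// With NaN ruled out by construction, the partial order is total,
// so Eq and Ord are sound to implement.
impl Eq for NonNaN {}
impl Ord for NonNaN {
    fn cmp(&self, other: &Self) -> Ordering {
        self.partial_cmp(other).unwrap() // never None: no NaN can be inside
    }
}

fn main() {
    let mut v: Vec<NonNaN> =
        [2.0, -1.0, 0.5].iter().map(|&x| NonNaN::new(x).unwrap()).collect();
    v.sort(); // a Vec<f64> has no .sort(); you'd need sort_by(f64::total_cmp)
    assert_eq!(v[0], NonNaN::new(-1.0).unwrap());
}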

9 Likes

It’s not undefined, it’s just unexpected. (If NaNs were some sort of magical CSPRNG, I think everyone would be using them.)

Have you seen the NaNs of Minecraft, for example? They don’t cause the server to go down (generally); they just exist and the game does its best to keep on going.

1 Like

The types in the aforementioned noisy-float crate will panic in debug builds when set to NaN and will act like normal floats (i.e. allowing NaN) in release mode.
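
The underlying pattern is roughly the following; this is a sketch of the idea only, not noisy_float’s actual API:

use std::ops::Add;

#[derive(Clone, Copy, Debug)]
struct N64(f64); // illustrative stand-in for a checked-float newtype

impl Add for N64 {
    type Output = N64;
    fn add(self, other: N64) -> N64 {
        let r = self.0 + other.0;
        // Active in debug builds only: the panic points at the exact operation
        // that produced the NaN. In release builds the check compiles away and
        // NaN propagates as with a plain f64.
        debug_assert!(!r.is_nan(), "operation produced NaN");
        N64(r)
    }
}

fn main() {
    println!("{:?}", N64(1.5) + N64(2.5)); // N64(4.0)
}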

Perhaps “crash” is the wrong word because it implies some sort of system exception. NaNs may not create errors in that sense, but they definitely cause crashes in the sense of producing unusable system states and assertion failures/panics.

And if you’d rather weird glitches that are hard to debug, you can always just use the standard float types :stuck_out_tongue:

2 Likes

Indeed. This also implies that if Option<NonNaN> and the underlying float have the same representation, then in release mode NaNs would simply propagate as Nones. This could potentially be a lot safer.

2 Likes

I just wanted to link the older, related discussion: Avoiding PartialOrd problems by introducing fast finite floating-point types

I don’t know if it’s good to conflate this issue with fast-math, but it’s not completely orthogonal IMO.

To illustrate the “disagreement about how math should work” @scottmcm described, let’s walk through some options for the signature of addition for this type. To keep this manageable I’ll assume that it is unacceptable (and not merely undesirable) to have a NonNaN value that actually contains a NaN bit pattern – this is required for us to do the layout optimizations in Option<NonNaN> that the OP brought up. (The last two options are sketched in code after the list.)

  • No addition functionality, the newtype is just “for storage” (similar to NonZero and NonNull) – this means having to unpack and often re-wrap (which costs a runtime check) any time you want to do anything with your numbers
  • fn add(self: NonNaN, other: NonNaN) -> NonNaN – this function isn’t total, since addition of non-NaN floats can result in NaN [1], so it has to panic (or return a nonsensical value), and that requires a runtime check
  • fn add(self: NonNaN, other: NonNaN) -> Option<NonNaN> – can return None instead of panicking, but only shifts the problem to the caller (who, to add insult to injury, can’t even chain this operation)
  • fn add(self: Option<NonNaN>, other: Option<NonNaN>) -> Option<NonNaN> – this is total and can be chained, and I believe clever layout can possibly avoid the runtime check in some circumstances [2], but it’s roughly isomorphic to just operating on regular NaN-ful floats. It is significantly less useful than most Option<T> types: even though one is forced to check for None/NaN, after doing that and getting your hands on the x in Some(x), you can’t do a lot of useful things with it.
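
In code, the last two options might look like this (a self-contained sketch; the NonNaN type and its methods are hypothetical):

#[derive(Clone, Copy, Debug)]
struct NonNaN(f64);

impl NonNaN {
    fn new(x: f64) -> Option<NonNaN> {
        if x.is_nan() { None } else { Some(NonNaN(x)) }
    }

    // Total, but every caller must handle the None, and calls don't chain.
    fn add(self, other: NonNaN) -> Option<NonNaN> {
        NonNaN::new(self.0 + other.0)
    }
}

// Total and chainable, but roughly isomorphic to plain NaN-ful floats.
fn add_opt(a: Option<NonNaN>, b: Option<NonNaN>) -> Option<NonNaN> {
    match (a, b) {
        (Some(a), Some(b)) => a.add(b),
        _ => None,
    }
}

fn main() {
    let a = NonNaN::new(f64::INFINITY);
    let b = NonNaN::new(f64::NEG_INFINITY);
    assert_eq!(add_opt(a, b).map(|x| x.0), None); // INF + -INF is NaN -> None
}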

Most other arithmetic operations don’t fare better. This design space is full of trade-offs:

  • there’s almost certainly runtime overhead, either in the arithmetic operations themselves or in the Option/Result handling outside of them
  • eliding some of the runtime checks means losing detail about when the NaN was produced
  • because NonNaN is not closed under arithmetic, you have to choose between the convenience of chaining and ruling out NaNs early and widely
  • do you even want to soundly rule out NaNs, enabling layout optimizations, or only do optional checking like noisy-float does (so NaNs are still possible and the compiler can’t optimize on the assumption that they’re impossible)?

Thus I would caution everyone to be very precise about what exactly it is they want from a “no NaNs” type, and be aware of the consequences those requirements have – if they are even mutually compatible at all. It’s very easy to produce a long list of problems caused by NaNs and wish for a solution for all of them, but as far as I can tell there is no universal solution in this space, just several incompatible partial solutions.


[1] For addition, one example is INFINITY + NEG_INFINITY; note that NonNaN also isn’t closed under subtraction, multiplication, division, square root, and many transcendental functions.

[2] Basically we’d have to pick one NaN bit pattern (e.g., quiet, positive sign, payload = 0) that the floating-point implementation always produces in the operations we want to support on NonNaN. Treating all NaNs as None seems quite tricky to support, since the same variant would have many different tags in memory, which is unprecedented.
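
On footnote [2]: NaN is not a single bit pattern but a whole family (any all-ones exponent with a nonzero mantissa), so “treat all NaNs as None” means one enum variant with many different in-memory tags. A quick demonstration:

fn main() {
    let a = f64::NAN;
    // Flipping the sign bit yields a different bit pattern that is still NaN.
    let b = f64::from_bits(a.to_bits() ^ (1u64 << 63));
    assert!(a.is_nan() && b.is_nan());
    assert_ne!(a.to_bits(), b.to_bits()); // two distinct "tags", both NaN
}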

11 Likes

Fair points. I’d say that perhaps my original idea of having operations panic at the source of the NaN may have been misguided. I might argue that it could be done by implementing both fn add(self: NonNaN, other: NonNaN) -> Option<NonNaN> and fn add(self: Option<NonNaN>, other: Option<NonNaN>) -> Option<NonNaN> in order to cover both cases, but I have no idea if the orphan trait rules even allow this.
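
For what it’s worth, I believe coherence does allow both today (since Rust 1.41, a foreign trait like Add may be implemented for Option<LocalType>). A sketch, with all names hypothetical:

use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct NonNaN(f64);

impl NonNaN {
    fn new(x: f64) -> Option<NonNaN> {
        if x.is_nan() { None } else { Some(NonNaN(x)) }
    }
}

// NonNaN + NonNaN -> Option<NonNaN>
impl Add for NonNaN {
    type Output = Option<NonNaN>;
    fn add(self, other: NonNaN) -> Option<NonNaN> {
        NonNaN::new(self.0 + other.0)
    }
}

// Option<NonNaN> + Option<NonNaN> -> Option<NonNaN>, so `+` chains.
impl Add for Option<NonNaN> {
    type Output = Option<NonNaN>;
    fn add(self, other: Option<NonNaN>) -> Option<NonNaN> {
        match (self, other) {
            (Some(a), Some(b)) => a + b,
            _ => None,
        }
    }
}

fn main() {
    let a = NonNaN::new(1.0).unwrap();
    let b = NonNaN::new(2.0).unwrap();
    let c = (a + b) + NonNaN::new(3.0); // chains through the Option impl
    assert_eq!(c, NonNaN::new(6.0));
}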

I believe there is still some fundamental value in adding no functionality besides storage, because (as has been pointed out) it would allow NonNaN to be used in data structures more fluidly.

I don’t think there is much of a use case for an operation like fn add(self: NonNaN, other: NonNaN) -> Option<NonNaN>.

This may stray a little into meta-commentary; it is not only a reply to this, but also to other discussions about floating-point math in the community.

The reason FP exists is performance. There are other representations that are better approximations of real numbers and preserve more of their characteristics. It’s about going as fast as possible while still getting something workable out of it. Since it’s one standard serving a wide variety of use cases, it’s too slow for some and “too wrong” for others. But the ubiquity of IEEE 754 floats is testament to the quality of the trade-off, and to it being at least near a sweet spot.

To come back to the initial point: a NaN check on every operation skews that trade-off and makes it worse overall. If you can stomach the performance hit, you can probably stomach using a more precise number representation.

If we’re getting native non-conforming floats, then we should be explicit about the trade-off we’re making and the use cases we want to serve.

2 Likes

I think NaN is perfectly analogous to nullptr. It's the same bad design: the absence of a value, baked into the actual type. I am pretty sure that f64 and f32 are used as a number most of the time, and not as maybe-a-number. I don't have statistical data, but I'd guess that about 95% of code assumes its floating-point data isn't not-a-number, just like probably 85% of code assumes the objects it uses aren't nullptr. As we all know, the difference between what we think the code does and what the code actually does is bugs.

How is nullptr avoided in Rust? It's the Option type. It is returned where a computation may return nothing instead of the actual result, and where a variable may contain a value or nothing.

I think the same pattern should be applied to floating point types.

I think there are mainly two use cases concerning NaN:

  1. Control Flow. It is required to check the validity of a computation at runtime. Example: a division may divide by zero, so the programmer has to check whether the divisor is zero before dividing. This use case will always require a runtime check, which can be done either before dividing, by checking the divisor, or after computing, by checking the result for NaN.

  2. Asserting a number. A computation is assumed to not return not-a-number (it should return a number). Example: computing the average age of two persons. As mentioned before, a simple addition may return NaN when one operand is infinity and the other is negative infinity. In this group of use cases, however, the programmer assumes that each operand is finite anyway. I estimate that roughly 80% of code assumes all its variables are finite numbers (finite and not not-a-number).

So does accessing an element in an array. But Rust was designed to be ergonomic and safe, so we decided to check array indexing every single time.

Code that uses numbers[i] assumes that the index is in range, but Rust will check it anyway. Every time. Even in release builds. Rust checks the array index even though the programmer probably feels really confident that it will never fall out of bounds. There is, however, an escape hatch for performance: the programmer can opt out explicitly (via the unsafe get_unchecked) where a bottleneck has been measured. This corresponds to the second use case.

For cases where the programmer is uncertain about the index bounds, they use get: numbers.get(i).copied().unwrap_or(1). This corresponds to the first use case.

The same could be done for floats. For example with division: There could be two ways to divide a number.

  1. a.div(b).unwrap_or(1.0), where div returns an Option. This method could also be called 'try_div' or something similar. This is used for the first use case, where the computation would require a check anyway.

  2. a / b just assumes the computation will return a number. The result would have to be checked internally, and this could panic if the operands are not valid, just like direct indexing with numbers[i]. These checks would always be performed unless the programmer actively requests omitting them for the 1% of code where a performance bottleneck has been measured.
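
A sketch of what those two entry points could look like on a checked-float newtype; the names Real and try_div are invented for this example:

use std::ops::Div;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Real(f64); // hypothetical never-NaN float

impl Real {
    // Use case 1: the caller expects failure and handles it in control flow.
    fn try_div(self, other: Real) -> Option<Real> {
        let r = self.0 / other.0;
        if r.is_nan() { None } else { Some(Real(r)) }
    }
}

// Use case 2: the caller asserts the result is a number; the check happens
// internally and panics otherwise, like the always-on bounds check in numbers[i].
impl Div for Real {
    type Output = Real;
    fn div(self, other: Real) -> Real {
        self.try_div(other).expect("division produced NaN")
    }
}

fn main() {
    let one = Real(1.0);
    let zero = Real(0.0);
    assert_eq!(zero.try_div(zero), None); // 0.0 / 0.0 is NaN
    assert_eq!(one.try_div(zero), Some(Real(f64::INFINITY))); // x / 0.0 is INF, not NaN
    let _ = one / Real(2.0); // fine; `zero / zero` would panic
}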

I think the perfect solution would be to have all default floats (f64, f32, ...) guaranteed to never contain NaN, proven at compile time. I know this is a big request. But just adding a separate non-NaN type won't make the average programmer produce safer code. This way, Option<f64> could be represented in memory just like the old f64, with NaN representing None, just as the OP suggested.

7 Likes

This is not a really good comparison. Out-of-bounds errors are not only logic errors, they're memory-unsafe. That's a whole other ballgame. Also, I'm pretty sure the performance impact is a lot smaller than checking each and every floating-point operation, since we work with iterators most of the time, and if not, the CPU is probably waiting on the indexed memory anyway.

A better (but still not 100% apt) comparison would be integers. Overflow is checked in debug builds but wraps around in release builds. This can lead to logic (but not memory-access) errors, but it is the right trade-off, since the performance hit would be too large for the gain.
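
Concretely, for integers today:

fn main() {
    let x = i32::MAX;
    // `x + 1` panics in debug builds ("attempt to add with overflow")
    // and wraps around in release builds by default.
    println!("{}", x.wrapping_add(1));  // -2147483648, the wrapped value
    println!("{:?}", x.checked_add(1)); // None: the explicit, always-checked form
}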

2 Likes

Yes, I think that would be a good alternative. In release builds, NaN could just be ignored at runtime.

2 Likes

I think that this is a really important point. If you want something that has true real/integer/rational semantics, use BigNumber/BigInteger/BigRational types that implement those semantics and don't have the limitations of floats/ints, which do not actually follow the normal rules of mathematics 100% of the time (which is what all this NaN, Infinity, wrap-around, min/max values, limited precision, etc. is about). You are sacrificing exact numeric adherence for performance. Most of the time this is good, but when you need exact rules and precision, there are types for that. Don't try to make one into the other; it serves no useful purpose.

2 Likes

Just to validate my understanding of floating point: there is no nice, clean proper subset of f32 or f64 values that’s closed under basic arithmetic. Excluding NaN doesn’t work because of things like INF + -INF and 0/0. Excluding INFs too doesn’t work, because finite arithmetic can still overflow to +/-INF (e.g. MAX + MAX), as does 1 divided by +/-0. Excluding -0 fails because -1*0 is -0. So @hanna-kruppe’s points about the fundamental trade-offs here apply equally to all the other “weird” FP values, unless you have a use case that’s willing to abandon IEEE 754 and its many hardware implementations just to change how float arithmetic works. Does that all sound right?
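
The individual cases are easy to verify directly:

fn main() {
    assert!((f64::INFINITY + f64::NEG_INFINITY).is_nan()); // no NaN-free subset
    assert!((0.0_f64 / 0.0).is_nan());
    assert_eq!(f64::MAX + f64::MAX, f64::INFINITY); // finite values overflow to INF
    assert_eq!(1.0_f64 / 0.0, f64::INFINITY);
    assert!((-1.0_f64 * 0.0).is_sign_negative()); // excluding -0 fails too: -1*0 is -0
}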

5 Likes

To add to that, the other reason to wrap around in release mode is that debug and release modes have different purposes. In debug mode we want the program to crash eagerly whenever something unexpected happens so that the bug can be found and fixed. In release mode we often want the program to keep hobbling along as best it can, rather than crashing on the user or taking down the server. This isn't always the case - if the code in question is handling sensitive data then it's better to just have it crash rather than start misbehaving - but I doubt this caveat applies to much code which is handling floats.

I don't think ubiquity and quality are related. As with x86, it's just a cycle: it's there, so it's popular and supported, so it gets used, because there isn't anything else viable.

For my uses (image processing), floats are awful: NaNs, Infs, irregular precision and the lack of a fixed range (e.g. a uniform 0..1), rounding modes, the impossibility of safely optimizing arithmetic, fast integer conversions being UB (on x86 at least), and the lack of a 16-bit size on CPUs.

For my use-cases everything about IEEE 754 is terrible. And yet, I still use them, because the hardware already has them, and half of the CPU sits idle if I use fixed-point exclusively.

5 Likes

So, in other words, they are the only performant option available, despite all their niggling idiosyncrasies. They are what we have for performance. If you can tolerate less performance and need better accuracy and behavior, use types that support that, like BigInteger/BigNumber/BigRational, etc.