Adding "minifloats" (f24, f16, f8) as native types

Currently Rust has support for the quite common floating point types f32 (single precision) and f64 (double precision), which I consider sufficient for everyday purposes.

When it comes to larger data structures that require a higher dynamic range than integers can provide (raw image photography, videos, voxel data, etc.), f32 has some disadvantages. The obvious one is size: using f32 as the data type, a raw image from a 20 Mpx camera would produce 80 MB of data. The other is speed in real-time applications (like the computation of optical flow in computer vision).
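The size argument is simple arithmetic. A minimal sketch, assuming one sample per pixel and decimal megabytes; the helper name `raw_size_bytes` is made up for illustration:

```rust
/// Bytes needed to store `pixels` samples at `bits_per_sample` bits each.
/// Assumes one sample per pixel and no padding between samples.
fn raw_size_bytes(pixels: u64, bits_per_sample: u64) -> u64 {
    pixels * bits_per_sample / 8
}
```

For a 20 Mpx image this gives 80 MB at 32 bits per sample, 60 MB at 24 bits and 40 MB at 16 bits.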

This is why f16 (aka half float) is becoming more and more common on specialized hardware (OpenCL Specification 2.0, Minimum List of Supported Image Formats), in imaging software (GIMP, if I remember correctly) and even in file formats (TIFF Technical Note 3).

I’d like to see f24 implemented for similar reasons. It finds its use where a mantissa of 11 bits (as in f16) isn’t enough, but f32 would be oversized.

f8 is not so common (from my perspective), but it could find its use on embedded devices like 8-bit microcontrollers. An example could be a lookup table holding a gamma correction for a PWM to compensate for the nonlinearity of the human eye when perceiving brightness. Other possible uses might be per-pixel filter kernels.

I don’t see any bigger downsides (apart from some extra work, of course). Platforms that do not support minifloats natively have several options for emulation:

  • Classic computation in Software using downscaled versions of existing float-libraries
  • Upcasting to f32 → compute → downcasting (maybe using appropriate rounding)
  • Lookup-tables and -trees for conversion and (in some cases) for computation, if memory doesn’t matter that much
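The second option (upcast, compute, downcast) could look roughly like the sketch below. It assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 stored mantissa bits) held in a `u16`, and rounds to nearest with ties to even on the way down. The names `f16_to_f32`, `f32_to_f16` and `add_f16` are made up for illustration, and the sketch makes no claim about whether the second rounding is acceptable for every operation.

```rust
/// Widen IEEE 754 binary16 bits to f32. This direction is always exact.
fn f16_to_f32(h: u16) -> f32 {
    let sign = ((h & 0x8000) as u32) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let man = (h & 0x03ff) as u32;
    let magnitude = match exp {
        // Zero or subnormal: the value is man * 2^-24.
        0 => (man as f32) / 16_777_216.0,
        // Infinity or NaN: keep the payload bits.
        0x1f => f32::from_bits(0x7f80_0000 | (man << 13)),
        // Normal: rebias the exponent from 15 to 127.
        _ => f32::from_bits(((exp + 112) << 23) | (man << 13)),
    };
    f32::from_bits(sign | magnitude.to_bits())
}

/// Narrow f32 to binary16 bits, rounding to nearest, ties to even.
fn f32_to_f16(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let man = bits & 0x007f_ffff;

    if exp == 0xff {
        // Infinity or NaN; force a mantissa bit so a NaN stays a NaN.
        let nan_bit: u16 = if man != 0 { 0x0200 } else { 0 };
        return sign | 0x7c00 | nan_bit | (man >> 13) as u16;
    }
    let e = exp - 127 + 15; // exponent rebiased for binary16
    if e >= 0x1f {
        return sign | 0x7c00; // too large: overflow to infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // below half of the smallest subnormal: zero
        }
        // Subnormal result: shift the 24-bit significand into place.
        let m = man | 0x0080_0000;
        let shift = (14 - e) as u32;
        let half = 1u32 << (shift - 1);
        let rem = m & ((1 << shift) - 1);
        let mut q = m >> shift;
        if rem > half || (rem == half && q & 1 == 1) {
            q += 1;
        }
        return sign | q as u16;
    }
    // Normal result: drop 13 mantissa bits. A rounding carry may
    // propagate into the exponent field, which is the desired behavior.
    let mut q = ((e as u32) << 10) | (man >> 13);
    let rem = man & 0x1fff;
    if rem > 0x1000 || (rem == 0x1000 && q & 1 == 1) {
        q += 1;
    }
    sign | q as u16
}

/// Emulated addition: upcast both operands, add in f32, round back down.
fn add_f16(a: u16, b: u16) -> u16 {
    f32_to_f16(f16_to_f32(a) + f16_to_f32(b))
}
```

With these definitions, `add_f16(0x3c00, 0x4000)` (1.0 + 2.0) yields `0x4200`, the binary16 encoding of 3.0.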

The relation f32 ↔ f64 is just like f16 ↔ f32, which is just like f8 ↔ f16. So there’s nothing special to learn.

Support for f128 would be nice, but introduces a bigger challenge as emulation would be computationally intensive. I’d rather like to see arbitrary precision … different topic.

/edit Fixed link to TIFF paper (sorry)


I am sympathetic to the general idea, but there are several issues and open questions off the top of my head:

  • Except f16, these are not IEEE standards. Most of the behavior (rounding, for example) is size-agnostic or easy to extrapolate, but how many bits the exponent and significand get is not obvious, at least to me. Is all this widely agreed upon by vendors who implement these types or is “24 bit float” a family of several different types?
  • Emulation is tricky. Very tricky if you care about accuracy (even more important if you have fewer bits). Just doing f32 arithmetic and rounding the result is wrong: double rounding can amplify the rounding error beyond the error of the f32 operation. The x87 FPU stack has a long history of causing such problems (admittedly in part because compilers would be inconsistent about at which steps they round the 80 bit intermediate results), and it has often bitten programmers. Let’s not repeat this mistake. However, re-implementing all the float operations in software is not just slow, it is extremely complicated once you get to transcendental functions. Basic arithmetic is not trivial either.
  • What level of support is actually needed for various applications? For example, interacting with OpenCL applications (as opposed to writing the kernel in pseudo-Rust) does not necessarily require any operations on the data, or only requires converting to and from f32. Likewise, dealing with image file formats (aside: I could not find 16 bit floats in the TIFF Technical Note you link) probably won’t require transcendental functions. However, if the surface of these types is going to be much smaller, does it make sense to make them built-in types rather than library types? Doesn’t the f64 ↔ f32 ↔ f16 ↔ f8 analogy break down then?
  • I have huge doubts about the usefulness of f8. I’m not very familiar with the embedded industry but it is my impression that systems so small and low-power are well on their way out. In any case, your description makes it sound like such a niche use case that I’m not sure supporting it is worth more than an hour of our work, especially since these platforms are unlikely to have LLVM backends anyway, and in many cases will be partially programmed in assembly.

Sorry, I fixed the link. It actually covers support of 16- and 24-bit.

Note that f128 was a thing, a year ago.


I totally agree that Rust should support only formats backed by an official standard or at least a solid common practice; this rules out the f8 format, for which I was unable to find anything established.

f24 at least has a common and reproducible layout (8 exponent bits, 1+15 mantissa bits).
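One way to read that layout is 1 sign bit, 8 exponent bits and 15 stored mantissa bits with an implicit leading 1. Under that assumption (it is an interpretation, not a standard), an f24 is just the upper 24 bits of an f32, so a library type is short. `F24` below is a hypothetical name, and big-endian byte order is an arbitrary choice for the sketch.

```rust
/// Hypothetical 24-bit float: sign, 8 exponent bits, 15 stored mantissa
/// bits, i.e. the upper three bytes of an f32 (stored big-endian here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct F24([u8; 3]);

impl F24 {
    /// Narrow an f32, rounding to nearest with ties to even.
    fn from_f32(x: f32) -> F24 {
        let bits = x.to_bits();
        let top = if x.is_nan() {
            // Set the top mantissa bit so the result is still a NaN.
            (bits >> 8) | 0x4000
        } else {
            // Add just under half of the dropped range, plus the lowest
            // kept bit, so that exact ties round to an even mantissa.
            // A carry into the exponent correctly overflows to infinity.
            let lsb = (bits >> 8) & 1;
            (bits + 0x7f + lsb) >> 8
        };
        let b = top.to_be_bytes();
        F24([b[1], b[2], b[3]])
    }
}

/// Widening is exact: pad the mantissa with eight zero bits.
impl From<F24> for f32 {
    fn from(v: F24) -> f32 {
        f32::from_bits(u32::from_be_bytes([v.0[0], v.0[1], v.0[2], 0]))
    }
}
```

Because the exponent width matches f32, no rebiasing or subnormal handling is needed; only the rounding of the dropped eight bits requires care.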

Could you explain this a little more? As far as I know, it's common practice and the primary reason for the 80-bit format to exist on x87 FPUs (see also IEEE floating point - Minimizing the effect of accuracy problems). I don't see a reason why a calculation with higher precision should yield worse results than directly using the lower precision.

As long as the error (typically measured in ULPs) is within the range specified by the IEEE, I don't see any source of confusion. Or are you worried about reproducibility? Then, how is that problem different from a soft-float implementation for f32/f64 which is required on some platforms?

(I'm currently unaware of how Rust handles strict math and rounding.)

[quote="hanna-kruppe, post:2, topic:2367"]However, re-implementing all the float operations in software is not just slow, it is extremely complicated once you get to transcendental functions. Basic arithmetic is not trivial either. [/quote]

Basic arithmetic isn't that hard. Transcendental functions are mostly unary and as such could be emulated using a simple 128 KiB-lookup table (for f16) as easiest and possibly fastest fall-back.
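The 128 KiB figure follows from 65536 possible f16 bit patterns times 2 bytes per result. A minimal sketch of such a table, working on raw `u16` bit patterns; `UnaryTable` is a hypothetical name, and the function that fills the table is supplied by the caller:

```rust
use std::mem::size_of;

/// Lookup table for a unary f16 function, indexed by the raw bit
/// pattern of the argument and holding the raw bits of the result.
struct UnaryTable {
    entries: Vec<u16>,
}

impl UnaryTable {
    /// Evaluate `f` once for every possible 16-bit input.
    fn build(f: impl Fn(u16) -> u16) -> Self {
        UnaryTable {
            entries: (0..=u16::MAX).map(f).collect(),
        }
    }

    /// A lookup is a single indexed load.
    fn eval(&self, x: u16) -> u16 {
        self.entries[x as usize]
    }

    /// Memory taken by the table contents.
    fn size_in_bytes(&self) -> usize {
        self.entries.len() * size_of::<u16>()
    }
}
```

Building the table for negation, `UnaryTable::build(|bits| bits ^ 0x8000)`, gives a table of 131072 bytes; a transcendental function would use the same structure with a slower generator.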

As far as I know OpenCL currently supports only conversion from/to f16, which might change in near future. If these were the only operations adopted by Rust, I agree that an integration as built-in type f16 wouldn't make sense.

It's a bit off-topic, but at least for the AVR microcontrollers a GCC-backend exists and I personally think Rust would be a great candidate to target those omnipresent devices. As much as I like assembly languages, it isn't always the right choice, even for microcontrollers :smile:

I don’t think we should support these other types natively, as they are so uncommon, and the use cases seem so niche. Supporting features isn’t as simple as flicking a switch somewhere to “on”, it takes effort to maintain them. There’s no obvious reason these types couldn’t be supported as library types. The only disadvantage is that you don’t get casting via as for them, hardly a deal-breaker in my eyes.

You don’t only need to show how these types are useful, you also need to show why they should be supported on a language level.


I know it's really counterintuitive, but it's true, at least when double rounding after every operation. Here's an example with four-bit and six-bit significands for simplicity. Consider adding 1.010 >> 1 (that is, 0.101) to 1.000 << 3 (that is, 1000) after extending to six bits. The exact sum is 1000.101.

Rounding this to six bits yields 1000.10 (half to even), further rounding to four bits yields 1000 (again half to even). Directly rounding to four bits would round up to 1001 since the fractional part is 0.5 + 0.125, i.e. more than half. The result of 1000 is more than half an ULP off!
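The example can be checked mechanically. The sketch below takes the exact sum to be 1000.101 in binary (an assumption consistent with the rounded values quoted above) and stores it as an integer with three fractional bits, so 1000.101 becomes `0b1000101`. The helper `round_sig` is made up for illustration.

```rust
/// Round `x` to `bits` significant binary digits, ties to even.
/// `x` is treated as a plain integer; the position of the binary
/// point does not affect the rounding decision.
fn round_sig(x: u32, bits: u32) -> u32 {
    let len = 32 - x.leading_zeros(); // significant bits in x
    if len <= bits {
        return x; // already fits, nothing to round
    }
    let shift = len - bits;
    let half = 1u32 << (shift - 1);
    let rem = x & ((1u32 << shift) - 1); // the bits being dropped
    let mut q = x >> shift;
    if rem > half || (rem == half && q & 1 == 1) {
        q += 1;
    }
    q << shift
}
```

Here `round_sig(0b1000101, 6)` gives `0b1000100` (1000.10), rounding that to four bits gives `0b1000000` (1000), while `round_sig(0b1000101, 4)` gives `0b1001000` (1001).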

It seems plausible that performing an entire chain of calculations in higher precision and rounding once at the end might improve things, if not always, then at least in most cases. But double rounding after every operation means there is no opportunity for the extra bits to help, only for the information discarded by the first rounding to make the second worse.

Note that double rounding being wrong is not an IEEE 754 quirk. It can happen with base 10 and other rounding modes (let's take round-half-up): 0.049 rounded to one decimal digit is 0.0, but first rounding to two decimal digits yields 0.05 and rounding that again yields 0.1.
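The decimal case is easy to reproduce with integers. A small sketch that counts in thousandths, so 0.049 is `49`; `round_half_up` is a hypothetical helper and only handles non-negative values:

```rust
/// Round a non-negative `value` to a multiple of `unit`, with exact
/// halves rounding up. Both arguments are counted in thousandths.
fn round_half_up(value: u32, unit: u32) -> u32 {
    (value + unit / 2) / unit * unit
}
```

Rounding 49 directly to tenths (`unit = 100`) gives 0, but rounding to hundredths first (`unit = 10`) gives 50, and rounding that to tenths gives 100, i.e. 0.1.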

Having spent a good chunk of the last weeks with rounding floating point numbers, I respectfully disagree. Regardless, a 128 KiB lookup table for every transcendental function would result in one or even several megabytes of binary size added. That is a pretty big cost.

But why can't it be a newtype struct f8(u8); with inline assembly to access the hardware support for those numbers? More generally, I agree with @Aatch here.

f16 is less weird than the others. It’s an IEEE standard, it’s supported by LLVM and most modern hardware supports it.

Seems like there’s a case for adding it at least.


Thanks a lot for clarifying! I bet there's a solution (like intelligently choosing different roundings), but it's not as simple as I previously thought.

As there's no hardware support for f8 in sight, I don't see any downsides with your solution and recommend this approach. Same with f24.

I've seen ARM and SSE5 instructions about f16, but they are all conversions from and to f32/f64.

f16 is widely used on GPUs (for example: ). Several years ago lots of video cards even used them to emulate small integers.


@nodakai My bad. Looks like most hardware supports it as a storage-only format. Although this article suggests they can still be faster than f32s by using less cache and memory bandwidth.

OTOH I tried to build this

define half @addem(half %x, half %y) {
  %ret = fadd half %x, %y
  ret half %ret
}

and LLVM segfaulted during ‘X86 DAG->DAG Instruction Selection’. So I’m guessing we’d have to insert the conversion intrinsics manually when generating f16-handling code on an architecture that doesn’t fully support them, meaning they’ll be more of a PITA to support than f32/f64.

The latest CUDA (7.5) adds support for f16 (half-floats).


GCC at least has a few options for f16 on ARM, which is where I’m interested in using it:

I’m a little worried about how GCC seems to have two incompatible implementations here.

If it’s used as a “storage only” format, perhaps the best thing to do is start with a library in cargo for converting floats between f32 and f16 when loading and saving, à la vcvt_f16_f32() in armcc.


Relevant C++ standardization proposal: Adding Fundamental Type for Short Float (contains some overview and motivation).
According to Botond Ballo’s report the proposal was received favorably by the language evolution group.


Interesting! Unsurprisingly given the tradition of C and C++, this new float type seems to be very platform-dependent, both in size (it’s allowed to be 32 bit) and behavior (most importantly, whether it’s roughly IEEE-compatible). Aside from the tradition aspect though, it might indicate that the C++ people are also skeptical about whether a 16 bit float can reasonably be implemented everywhere. This thread named many reasons why it’s hard.

What features might we need to add to Rust in order to handle a user-defined f8 or whatever as if it were a built-in type? What is preventing a library author from writing a usable external f8 crate?

User-defined literals and const fn support?

A note for posterity that the half crate supports f16 as storage-only, and (on nightly) can use the LLVM intrinsics that map to the F16C instruction set. I haven’t performed experiments to test how well LLVM actually vectorizes these yet.


I’d like to see d64 and d128 floats, myself, to provide decimal floats (ALU-based) as an alternative to the binary floats (FPU-based) offered by default. Very useful in any situation that requires accurate results, like financial calculations.
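To illustrate the accuracy concern: binary floats cannot represent most decimal fractions exactly, so sums like 0.1 + 0.2 pick up a rounding error. Rust has no d64 or d128 type, so the sketch below uses a scaled integer as a stand-in for a decimal type (this is fixed point, not decimal floating point); `Cents` is a hypothetical name.

```rust
use std::ops::Add;

/// Stand-in for a decimal type: an exact integer count of cents.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Cents(i64);

impl Add for Cents {
    type Output = Cents;
    fn add(self, rhs: Cents) -> Cents {
        Cents(self.0 + rhs.0)
    }
}

/// In binary floating point, 0.1 + 0.2 does not compare equal to 0.3.
fn binary_sum_is_exact() -> bool {
    0.1_f64 + 0.2_f64 == 0.3_f64
}
```

`binary_sum_is_exact()` returns `false`, while `Cents(10) + Cents(20) == Cents(30)` holds exactly.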
