Pre-RFC Introduction of Half and Quadruple Precision Floats (f16 and f128)



Introduction of 16-bit floats (f16, half precision) and 128-bit floats (f128, quadruple precision).


16-bit floats are used for storage in some data formats and in computer graphics, for example in OpenGL, OpenEXR and JPEG XR. Half precision floats offer increased dynamic range over 8-bit and 16-bit integers, while requiring only half the storage, memory and bandwidth of single precision floats.

128-bit floats are used in scientific computing and other computation where accuracy is required: the extra bits provide 33-36 significant decimal digits. They can also be used to compute double precision results with increased accuracy, reduced rounding error and less risk of overflow.

Guide Level explanation

The new types will behave the same as f32 and f64, except that they are named f16 and f128.

Reference Level

LLVM has support for half and quad precision first class types. This should allow for implementation similar to how f32 and f64 are currently implemented in the language.


Drawbacks

f16: It does not have many applications to do with arithmetic and mainly storage/application. It may be more apt to pull it into a library instead.

f128: Not many processors have native support for 128-bit operations (none of Rust’s Tier 1 architectures do), so f128 has drawbacks similar to those of f64 on 32-bit systems and f32 on 16-bit systems.

Unresolved Questions

Given the drawbacks of f128, should f80 be implemented instead? This is used by gcc as long double and is also available in LLVM.

There are two different 128-bit float types in LLVM: fp128, based on IEEE 754-2008’s binary128, and ppc_fp128, which is a pair of 64-bit doubles.
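To make the "pair of 64-bit doubles" idea concrete, here is a minimal sketch of the error-free addition that double-double formats are built on. The names DoubleDouble and two_sum are made up for illustration; this is not how LLVM or any library implements ppc_fp128, only the underlying trick of keeping the rounding error of an f64 operation in a second f64.

```rust
/// Hypothetical double-double value: the unevaluated sum `hi + lo`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct DoubleDouble {
    hi: f64,
    lo: f64,
}

/// Knuth's TwoSum: `hi` is the rounded f64 sum of `a` and `b`, and
/// `lo` is the rounding error that the plain f64 addition discards.
fn two_sum(a: f64, b: f64) -> DoubleDouble {
    let s = a + b;
    let bb = s - a; // the part of `b` that actually made it into `s`
    let err = (a - (s - bb)) + (b - bb);
    DoubleDouble { hi: s, lo: err }
}
```

With plain f64, 1.0 + 1e-20 rounds back to 1.0; two_sum(1.0, 1e-20) keeps the 1e-20 in the lo half. Note that this gives extra precision but not the extra exponent range of binary128.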

ARM has two different half precision float types, the IEEE one and an alternative one. How will support be provided?

Should it be implemented in std, or should only the core::intrinsics be added?


I also see this, though:

For most target platforms, half precision floating-point is a storage-only format. This means that it is a dense encoding (in memory) but does not support computation in the format.

This means that code must first load the half-precision floating-point value as an i16, then convert it to float with llvm.convert.from.fp16. Computation can then be performed on the float value (including extending to double etc). To store the value back to memory, it is first converted to float if needed, then converted to i16 with llvm.convert.to.fp16, then storing as an i16 value.

So would the same implementation work, and LLVM will lower things automatically, or does it need to be handled specially?
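For reference, the load / widen / compute / narrow / store sequence described in that quote can be sketched entirely in software. This is only an illustration under my own assumptions: the function names are hypothetical, the conversions are done by hand on the bit patterns rather than through the LLVM intrinsics, and narrowing uses round-to-nearest-even.

```rust
/// Widen IEEE 754 binary16 bits to f32 (exact: every f16 value fits in f32).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h as u32) >> 10) & 0x1F;
    let man = (h as u32) & 0x03FF;
    match (exp, man) {
        (0, _) => {
            // Zero and subnormals: value = man * 2^-24 (0x3380_0000 is 2^-24).
            let magnitude = man as f32 * f32::from_bits(0x3380_0000);
            if sign != 0 { -magnitude } else { magnitude }
        }
        (0x1F, 0) => f32::from_bits(sign | 0x7F80_0000), // infinity
        (0x1F, _) => f32::from_bits(sign | 0x7FC0_0000 | (man << 13)), // NaN
        // Normal: rebias the exponent (127 - 15 = 112) and widen the mantissa.
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (man << 13)),
    }
}

/// Narrow an f32 to binary16 bits, rounding to nearest, ties to even.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xFF) as i32;
    let man = bits & 0x007F_FFFF;
    if exp == 0xFF {
        // Infinity stays infinity; any NaN becomes a quiet NaN.
        return sign | 0x7C00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // exponent rebiased for f16
    if e >= 0x1F {
        return sign | 0x7C00; // too large for f16: overflow to infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // smaller than half the smallest subnormal: signed zero
        }
        // Subnormal result: shift the 24-bit significand into units of 2^-24.
        let m = man | 0x0080_0000;
        let shift = (14 - e) as u32;
        let half = m >> shift;
        let rem = m & ((1 << shift) - 1);
        let halfway = 1u32 << (shift - 1);
        let round_up = rem > halfway || (rem == halfway && (half & 1) == 1);
        return sign | (half + round_up as u32) as u16;
    }
    // Normal result: a rounding carry may bump the exponent, which is correct.
    let half = ((e as u32) << 10) | (man >> 13);
    let rem = man & 0x1FFF;
    let round_up = rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1);
    sign | (half + round_up as u32) as u16
}

/// Storage-only usage: load the bits, compute in f32, store the bits back.
fn scale_stored_half(stored: u16, factor: f32) -> u16 {
    f32_to_f16_bits(f16_bits_to_f32(stored) * factor)
}
```

For example, scale_stored_half(0x3C00, 2.0) takes the stored bits for 1.0 and returns 0x4000, the bits for 2.0. A native f16 type would presumably let the compiler emit this sequence (or a hardware instruction) itself.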


As far as I can tell, clang does exactly that, where it has this to convert half precision floats to single precision and perform the operation. Also as a note from clang, a lot of the results of searching for half are for OpenCL, which is not a Rust target.


Letting aside all the technical details (like, if it can be implemented), I believe this is a great example of a feature that doesn’t increase the cognitive size of the language. The same model that works for f32 works for f16, so I guess adding this ‒ if there’s an actual use case for it ‒ sounds fine. If there are scientific applications this could enable (even when emulating the f128), it could be a good PR for Rust.

The fact that f16 could be slower than f32 might be a bit surprising, but I think the same can be said for eg. i8 over isize on some platforms, so it’s also nothing completely new.


Aren’t GPUs and TPUs the primary users of f16? If so, Rust should support interworking with those specialized processing units.


It appears that the Google TPU’s giant FMAs take u8 inputs and output u16. I believe they train the network at a higher precision and then quantize it, which allegedly doesn’t impact predictive results much.

CUDA only recently (Pascal, within past two yearsish) was empowered to perform operations on f16x2, u16x2, and u8x4 vectors. Before that, f16 were used as a compressed storage representation for textures and implicitly upcast to f32 before any operations. NVIDIA has heavily segmented this feature, so it’s only really useable on some high-end enterprise cards. These ops are much slower on consumer hardware.

It’s not clear what hardware actually supports non-vectorized f16 operations. Notably, std::simd on nightly currently does not have a f16x2 primitive yet. I can still imagine non-vectorized softfloat intrinsics being useful if your application deals with f16 and you want to perform a handful of small operations without incurring the latency/power cost of activating the very large vector functional unit.


Which processors have native support for 128-bit wide floats?

f16: It does not have many applications to do with arithmetic and mainly storage/application.

Note that many processors have support for arithmetic on f16s. For example, NVPTX and RISC-V have full support for these on some of their products. Correct me if I am wrong, but IIRC RISC-V actually requires full f16 support on all chips with floating-point support, which IIRC means full f16 support is required for all chips.

Other architectures like x86, arm, etc. have varying degrees of support for arithmetic on these, mainly via f16 vector types.

Notably, std::simd on nightly currently does not have a f16x2 primitive yet.

The main reason I haven’t implemented these is that we lack a clear story for f16 in Rust. It is unclear whether these vector types should be focused on storage, in which case we can implement them with an API that uses f32s (like: f16x2::new(1_f32, 2_f32)), or whether f16 should be used in their API (like: f16x2::new(1_f16, 2_f16)). At the same time, f16s are blocked on whether vector types are enough.

From an implementation point-of-view, having f16s in the language would allow f16xN vector types to have the exact same API as f32xN and f64xN vector types, which would be nice.

Also, because we don’t have f16 vector types in std::simd, none of the std::arch intrinsics working on f16 vectors are currently implemented. There are a couple of them in x86, and quite a lot in ARM.



IBM Power 9 CPU has support for it, which I believe is a Tier 2 target at the moment. Some older architectures have a 128-bit float, but those predate the 2008 IEEE standard, which means that the mantissa and exponent bits are different lengths…

You are correct on the use of f16 on RISC-V Platforms, but in my opinion, the main purpose of adding in f16 would be for GPU and other similar platforms, where Rust could prove to be a competitor to OpenCL C given the functions required.

ARM is an interesting one, because ARM has an alternate half precision type that is not the same as the IEEE one (it doesn’t have NaN or inf) but can encode larger numbers. This would lead to compromises needing to be made whichever one Rust (and LLVM I presume) uses.
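To show what the difference amounts to, here is a rough decoder for both layouts. The function name and the alternative flag are hypothetical; the only thing it assumes is that the two formats share the same bit layout and differ in how the all-ones exponent (31) is interpreted: IEEE reserves it for inf/NaN, while the ARM alternative format treats it as one more ordinary exponent.

```rust
/// Decode 16 half-precision bits to f64, either as IEEE 754 binary16
/// (`alternative == false`) or as ARM's alternative half-precision format.
fn decode_half(bits: u16, alternative: bool) -> f64 {
    let sign = if bits & 0x8000 != 0 { -1.0 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1F) as i32;
    let man = (bits & 0x03FF) as f64;
    match exp {
        // Zero and subnormals are the same in both formats.
        0 => sign * man * 2f64.powi(-24),
        // IEEE only: exponent 31 encodes infinity or NaN.
        31 if !alternative => {
            if man == 0.0 { sign * f64::INFINITY } else { f64::NAN }
        }
        // Normal numbers; in the alternative format this includes exponent 31.
        _ => sign * (1.0 + man / 1024.0) * 2f64.powi(exp - 15),
    }
}
```

So the largest finite IEEE value is decode_half(0x7BFF, false), which is 65504, while the alternative format goes up to decode_half(0x7FFF, true), which is 131008. The same bits 0x7C00 are +inf in one format and 65536 in the other.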


I think that is a pretty big “could”. There are many languages that are competitors of OpenCL C, like Khronos’ C++ SYCL and CUDA, both of which are more ergonomic than OpenCL C and will remain more ergonomic than Rust for the foreseeable future.

After how long it took to get i128/u128 kind of right many would be wary of adding new primitive types to the language, and f16 is in a spot where it would be nice to have, yet not a priority to solve in the near term.

You might have more chances of success with an RFC if you split the proposal for f16 from the one for f128 and tackle those independently. For f16 you can play the GPGPU angle, and also the angle that “we need it for SIMD which is a 2018 roadmap goal”. For f128 I am unsure which angle you could play. IIRC Power9 altivec and VSX had 128x1 vector types that aren’t implemented yet either, but those vectors contain exclusively integer types, so I don’t think we need f128 for SIMD.

In any case, for a primitive type, you should probably stick to the IEEE semantics. The ARM and Power specific versions can be provided in std::arch, for example.


For f128 I am unsure which angle you could play.

I’ve heard that there are many applications (scientific things in particular) that rely on 128-bit floats for the extended precision.

I am not sure how widespread that is, though.


If you take a look at the Top 500 you’ll see that while there are many PowerPC based systems there, most of the systems are neither PowerPC based nor based on any hardware with native support for 128-bit floats. I don’t know if all PowerPC systems there have 128-bit float support either, since there are actually many Power8 systems still in use.

I suspect that if they were widely required, more hardware would support them. For example, hardware for f16s has evolved pretty quickly out of true need. This hasn’t really happened for f128.

All of this is obviously anecdotal and not evidence of anything. There are certain applications, like solving linear systems of equations, where using 128-bit floats can “make sense”, but as with everything, doing that has trade-offs. I personally haven’t heard of anyone widely using them in production, nor of libraries that advertise their support. Not even on PowerPC systems.


Note that f16 can largely be provided by a crate today:

As such, another direction to approach this would be something like user-defined literal support, so that let x: f16 = 3.141592; could work via a library. That’d also be useful for BigInteger and such…


One issue is that you would like the half crate to use hardware support for f16s when available, and that might involve using some std::arch intrinsics, which themselves require f16s in their API.

So the coupling there might not be just as loose as “let’s just use a crate” the moment core adds that crate as a dependency.


That is my plan at the moment, but I thought it would be best to have one thread on internals because there are many shared discussions and concerns about both types.

As far as I can tell, GCC 8.1 and clang 6.1 on x86_64 generate different instructions for __float128 than for a double-double type. Also, after some more documentation digging, clang calls __multf3, which according to the docs is the soft float routine for long double, but long double uses fmul instead.

Digging further into this, clang generates code for __float128 using xmm0 and xmm1, which would imply SIMD for it, but I can’t seem to find much on how __multf3 works. Either way, it could make sense to add this into Rust: although there doesn’t seem to be much hardware support at the moment, there do seem to be implementations that allow for optimizations and instructions that support it, and these aren’t available without compiler intrinsics (although inline asm may allow for it?)


Cross-post: (closing that issue in favor of this internals thread…)


That’d also be useful for BigInteger and such…

It’s slightly off-topic, but I conceptually do not understand how floats can act as a building block for large integer accounting. Could someone explain this to me? I tried googling but all I keep finding is pages about Java’s BigInteger and BigDecimal rather than conceptual explanations.


I believe @scottmcm was referring to letting more types support literals, so something like this:

let x: BigInteger = 1_000_000;


I see, that would be nice to have. Just to double-check, BigInteger here is conceptually an integer with a truly arbitrary amount of digits (within RAM limits of course)?


@jjpe Yes, like

(And @shingtaklam1324 is correct. Also things like let x: NonZeroU32 = 1;.)
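For comparison, this is roughly what has to be written today without such literal support. BigInteger here is a made-up stand-in (a Vec of base 2^32 limbs), not a type from any real crate; NonZeroU32::new is the existing std API, which returns an Option because the value is checked at run time.

```rust
use std::num::NonZeroU32;

/// Hypothetical arbitrary-precision integer: little-endian base-2^32 limbs.
#[derive(Debug, PartialEq)]
struct BigInteger {
    limbs: Vec<u32>,
}

impl From<u64> for BigInteger {
    fn from(value: u64) -> Self {
        let mut limbs = vec![value as u32, (value >> 32) as u32];
        // Drop leading zero limbs, always keeping at least one limb.
        while limbs.len() > 1 && limbs.last() == Some(&0) {
            limbs.pop();
        }
        BigInteger { limbs }
    }
}

/// Today: an explicit conversion instead of `let x: BigInteger = 1_000_000;`.
fn million() -> BigInteger {
    BigInteger::from(1_000_000u64)
}

/// Today: a run-time checked constructor instead of `let x: NonZeroU32 = 1;`.
fn one() -> NonZeroU32 {
    NonZeroU32::new(1).expect("1 is non-zero")
}
```

With literal support, the conversion (and for NonZeroU32 the zero check) could happen where the literal is written rather than through an explicit call.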



An alternative is to expose a native f16 (and/or f128) type via std::arch only in those archs in which it is available.

Then users could use cfg(...) to either use the type from std::arch or the half crate, or the half crate could do this internally so that users don’t have to do anything.

This could be pretty much uncontroversial since the std::arch module already exposes architecture-specific types like __m128.
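A minimal sketch of what that cfg(...) selection could look like inside a crate. Everything here is an assumption for illustration: the function names are hypothetical, "f16c" is used only as an example of a target feature to key on, and the hardware path is a placeholder that falls back to the portable code, since the std::arch f16 type discussed above does not exist.

```rust
/// Portable fallback: decode IEEE binary16 bits by hand.
fn portable_widen(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0_f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1F) as i32;
    let man = (bits & 0x03FF) as f32;
    match exp {
        0 => sign * man * 2f32.powi(-24), // zero and subnormals
        31 => if man == 0.0 { sign * f32::INFINITY } else { f32::NAN },
        _ => sign * (1.0 + man / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Selected when the target advertises hardware half-precision conversion.
#[cfg(target_feature = "f16c")]
fn half_to_f32(bits: u16) -> f32 {
    // Placeholder: a real build would call an architecture-specific
    // conversion here; the sketch reuses the portable path so it compiles
    // everywhere.
    portable_widen(bits)
}

/// Selected on every other target.
#[cfg(not(target_feature = "f16c"))]
fn half_to_f32(bits: u16) -> f32 {
    portable_widen(bits)
}

/// Reports which path was compiled in.
fn backend_name() -> &'static str {
    if cfg!(target_feature = "f16c") { "arch" } else { "portable" }
}
```

Callers only ever see half_to_f32, so the choice between the arch-specific and the portable path stays an implementation detail of the crate.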