Introduction of 16-bit floats (f16, half precision) and 128-bit floats (f128, quadruple precision).
16-bit floats are used in some storage formats and in computer graphics, such as OpenGL, OpenEXR, and JPEG XR. Half precision floats offer increased dynamic range over 8- and 16-bit integers while requiring only half the space of an f32 in storage, memory, and bandwidth.
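To illustrate the storage use case, here is a hedged sketch in today's Rust of converting between f32 and the IEEE binary16 bit pattern. It is not part of this proposal, and for brevity it handles normal values only; subnormals, infinities, and NaN would need extra cases in a real implementation.

```rust
/// Simplified f32 -> binary16 bit conversion (normal values only).
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32 - 127 + 15; // rebias exponent
    let frac = ((bits >> 13) & 0x3ff) as u16;          // truncate mantissa
    sign | ((exp as u16) << 10) | frac
}

/// Simplified binary16 bits -> f32 conversion (normal values only).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = (((h as u32) >> 10) & 0x1f) + 127 - 15;
    let frac = ((h as u32) & 0x3ff) << 13;
    f32::from_bits(sign | (exp << 23) | frac)
}

fn main() {
    let h = f32_to_f16_bits(1.5);
    assert_eq!(f16_bits_to_f32(h), 1.5);
    // A half float needs 2 bytes instead of 4: half the storage.
    assert_eq!(std::mem::size_of::<u16>(), std::mem::size_of::<f32>() / 2);
}
```

A native f16 type would make this conversion and the associated arithmetic a compiler concern rather than user code.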
128-bit floats are used in scientific and other computation where the extra bits provide increased accuracy, up to 33-36 significant decimal digits. They can also be used to compute double precision results with reduced rounding error and a lower risk of intermediate overflow.
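The accuracy limits that motivate f128 can be seen with today's f64 (a minimal illustration, not part of the proposal):

```rust
fn main() {
    // Classic double precision rounding: 0.1 and 0.2 have no exact
    // binary representation, so the f64 sum is not exactly 0.3.
    // Quadruple precision would shrink (though not eliminate) this error.
    assert!(0.1_f64 + 0.2 != 0.3);

    // binary128 carries a 113-bit significand; 113 * log10(2) is about 34,
    // matching the 33-36 significant decimal digits quoted above.
    let digits = 113.0_f64 * 2.0_f64.log10();
    assert!(digits > 33.0 && digits < 36.0);
}
```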
Guide-level explanation
They will behave the same as f32 and f64, except for the types being named f16 and f128. LLVM supports half and quad precision as first-class types, which should allow an implementation similar to how f32 and f64 are currently implemented in the language.
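Usage would mirror the existing float types. The snippet below is a hypothetical sketch: the f16/f128 literal suffixes and casts it assumes are not valid Rust at the time of writing.

```rust
// Hypothetical: assumes f16 and f128 gain the same literal and
// cast surface as f32/f64. Does not compile today.
let half: f16 = 1.5f16;
let quad: f128 = 1.5f128;
let widened: f128 = half as f128; // casts like the existing float types
```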
Drawbacks
f16: It has few arithmetic applications and is used mainly for storage and interchange. It may be more apt to provide it in a library instead.
f128: Not many processors have native support for 128-bit float operations (none of Rust's Tier 1 architectures do), so the operations must be emulated in software, with drawbacks similar to f64 on 32-bit systems and f32 on 16-bit systems.
Rationale and alternatives
Given the drawbacks of f128, should f80 be implemented instead? That format is used for long double on x86 and is also available in LLVM.
Unresolved questions
LLVM has two different 128-bit float types: fp128, based on IEEE 754-2008 binary128, and ppc_fp128, which is a pair of 64-bit doubles. Which should f128 map to?
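The ppc_fp128 representation can be sketched with Knuth's error-free two-sum, the building block of double-double arithmetic. This is a minimal illustration of the idea, not how the compiler would implement the type:

```rust
/// Knuth's error-free two-sum: returns (s, e) where s = fl(a + b)
/// and a + b == s + e exactly, assuming no overflow.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let bb = s - a;
    let e = (a - (s - bb)) + (b - bb);
    (s, e)
}

fn main() {
    // 1e16 + 1.0 loses the 1.0 entirely in plain f64 arithmetic...
    let (s, e) = two_sum(1.0e16, 1.0);
    assert_eq!(s, 1.0e16);
    // ...but the double-double pair (s, e) still carries it exactly.
    assert_eq!(e, 1.0);
}
```

This gives roughly twice the significand bits of f64, but with a different value distribution and narrower exponent range than IEEE binary128.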
ARM has two different half precision formats: the IEEE one and an alternative format that trades infinities and NaNs for extra exponent range. How will support be provided?
Should the types be implemented in std, or should only the core::intrinsics be added?