Introduction of 16-bit floats (f16, half precision) and 128-bit floats (f128, quadruple precision).
16-bit floats are used in some storage formats and in computer graphics, such as OpenGL, OpenEXR, and JPEG XR. Half precision floats offer increased dynamic range over 8- and 16-bit integers while requiring only half the space of an f32 in storage, memory, and bandwidth.
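To illustrate the storage use case, here is a hedged sketch in today's Rust of converting between f32 and the IEEE binary16 bit pattern. It is not part of this proposal, and for brevity it handles normal values only; subnormals, infinities, and NaN would need extra cases in a real implementation.

```rust
/// Simplified f32 -> binary16 bit conversion (normal values only).
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32 - 127 + 15; // rebias exponent
    let frac = ((bits >> 13) & 0x3ff) as u16;          // truncate mantissa
    sign | ((exp as u16) << 10) | frac
}

/// Simplified binary16 bits -> f32 conversion (normal values only).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = (((h as u32) >> 10) & 0x1f) + 127 - 15;
    let frac = ((h as u32) & 0x3ff) << 13;
    f32::from_bits(sign | (exp << 23) | frac)
}

fn main() {
    let h = f32_to_f16_bits(1.5);
    assert_eq!(f16_bits_to_f32(h), 1.5);
    // A half float needs 2 bytes instead of 4: half the storage.
    assert_eq!(std::mem::size_of::<u16>(), std::mem::size_of::<f32>() / 2);
}
```

A native f16 type would make this conversion and the associated arithmetic a compiler concern rather than user code.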
128-bit floats are used in scientific and other computation where the extra bits provide increased accuracy, up to 33-36 significant decimal digits. They can also be used to compute double precision results with reduced rounding error and a lower risk of intermediate overflow.
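The accuracy limits that motivate f128 can be seen with today's f64 (a minimal illustration, not part of the proposal):

```rust
fn main() {
    // Classic double precision rounding: 0.1 and 0.2 have no exact
    // binary representation, so the f64 sum is not exactly 0.3.
    // Quadruple precision would shrink (though not eliminate) this error.
    assert!(0.1_f64 + 0.2 != 0.3);

    // binary128 carries a 113-bit significand; 113 * log10(2) is about 34,
    // matching the 33-36 significant decimal digits quoted above.
    let digits = 113.0_f64 * 2.0_f64.log10();
    assert!(digits > 33.0 && digits < 36.0);
}
```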
Guide-level explanation
They will behave the same as f32 and f64, except for the types being named f16 and f128. LLVM supports half and quad precision as first-class types, which should allow an implementation similar to how f32 and f64 are currently implemented in the language.
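Usage would mirror the existing float types. The snippet below is a hypothetical sketch: the f16/f128 literal suffixes and casts it assumes are not valid Rust at the time of writing.

```rust
// Hypothetical: assumes f16 and f128 gain the same literal and
// cast surface as f32/f64. Does not compile today.
let half: f16 = 1.5f16;
let quad: f128 = 1.5f128;
let widened: f128 = half as f128; // casts like the existing float types
```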
Drawbacks
f16: It has few arithmetic applications and is used mainly for storage and interchange. It may be more apt to provide it in a library instead.
f128: Not many processors have native support for 128-bit float operations (none of Rust's Tier 1 architectures do), so the operations must be emulated in software, with drawbacks similar to f64 on 32-bit systems and f32 on 16-bit systems.
Rationale and alternatives
Given the drawbacks of f128, should f80 be implemented instead? That format is used for long double on x86 and is also available in LLVM.
Unresolved questions
LLVM has two different 128-bit float types: fp128, based on IEEE 754-2008 binary128, and ppc_fp128, which is a pair of 64-bit doubles. Which should f128 map to?
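The ppc_fp128 representation can be sketched with Knuth's error-free two-sum, the building block of double-double arithmetic. This is a minimal illustration of the idea, not how the compiler would implement the type:

```rust
/// Knuth's error-free two-sum: returns (s, e) where s = fl(a + b)
/// and a + b == s + e exactly, assuming no overflow.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let bb = s - a;
    let e = (a - (s - bb)) + (b - bb);
    (s, e)
}

fn main() {
    // 1e16 + 1.0 loses the 1.0 entirely in plain f64 arithmetic...
    let (s, e) = two_sum(1.0e16, 1.0);
    assert_eq!(s, 1.0e16);
    // ...but the double-double pair (s, e) still carries it exactly.
    assert_eq!(e, 1.0);
}
```

This gives roughly twice the significand bits of f64, but with a different value distribution and narrower exponent range than IEEE binary128.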
ARM has two different half precision formats: the IEEE one and an alternative format that trades infinities and NaNs for extra exponent range. How will support be provided?
Should the types be implemented in std, or should only the core::intrinsics be added?