Implementing a Fast, Correct Float Parser

Alexhuszagh · May 10, 2021, 7:38pm

~2 years ago, I posted about making Rust float-parsing fast and correct. Although I had planned on working towards integrating these changes into the Rust core library, I needed to take a step back for personal reasons (mental health). I'm much better suited to implement the necessary changes now, and have also improved various algorithms along the way. In addition, for integration into a popular crate, I've created a barebones library that should be a sufficient starting point for integrating this functionality into core.

If this is of interest, please respond here so I can find appropriate channels to work towards integrating this into the Rust core library.

Background

First, a quick bit of background, to ensure everyone reading this is familiar with the issues in accurate float parsing. IEEE-754 floating-point numbers have a sign bit, exponent bits, and then mantissa (significant digit) bits, and the precision is 1+mantissa_size, due to an implicit hidden bit. Although float parsing is generally simple, many decimal (base-10) numbers cannot be exactly represented in a fixed-width binary (base-2) float, and it can take up to 767 digits to unambiguously determine how to round the number.

A simple example of this phenomenon is, for single-precision (f32) floats:

16777216.0 => 100000000000000000000000 0
16777217.0 => 100000000000000000000000 1
16777218.0 => 100000000000000000000001 0
16777219.0 => 100000000000000000000001 1
16777220.0 => 100000000000000000000010 0

The first bit is the implicit, hidden bit. The next 23 bits are the significant digits, prior to rounding. The bit after the space is truncated, prior to rounding. IEEE-754 defaults to round-nearest, tie-even, so the examples would therefore round to:

16777216.0 => 100000000000000000000000 0 => 16777216.0
16777217.0 => 100000000000000000000000 1 => 16777216.0
16777218.0 => 100000000000000000000001 0 => 16777218.0
16777219.0 => 100000000000000000000001 1 => 16777220.0
16777220.0 => 100000000000000000000010 0 => 16777220.0
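The rounding in the table above can be checked with the standard library's own parser. This is only a small sketch: `parse_f32` is a hypothetical wrapper name, not something from the post, and it simply relies on `str::parse::<f32>` rounding to nearest, ties to even.

```rust
/// Hypothetical wrapper (not from the post): parse a decimal string as f32
/// using the standard library, which rounds to nearest, ties to even.
fn parse_f32(s: &str) -> f32 {
    s.parse::<f32>().expect("valid float literal")
}

fn main() {
    let inputs = ["16777216.0", "16777217.0", "16777218.0", "16777219.0", "16777220.0"];
    for s in inputs.iter() {
        let f = parse_f32(s);
        // Print the parsed value next to its raw bit pattern.
        println!("{} => {:.1} (bits: {:#034b})", s, f, f.to_bits());
    }
}
```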

This can get significantly more complex with larger, more complicated numbers, since we can require 112 digits for f32, or 767 digits for f64, to determine the correct way to round.

For terminology, for near-halfway cases we will use the terms b to represent the float, rounded-down, b+h to represent the halfway case, and b+u to represent the next, positive float.
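To make the terminology concrete, here is a minimal sketch that computes b, b+h, and b+u for a positive, normal f32. The function name `neighbors` is an assumption for illustration; the values are returned as f64 because the halfway point b+h is, by definition, not representable as an f32.

```rust
/// Hypothetical helper: for a positive, finite, normal f32 `b` (below f32::MAX),
/// returns (b, b+h, b+u) as f64 values.
fn neighbors(b: f32) -> (f64, f64, f64) {
    // For positive floats, incrementing the bit pattern yields the next float.
    let next = f32::from_bits(b.to_bits() + 1);
    let b = b as f64;
    let bu = next as f64;
    // The sum of two adjacent 24-bit floats fits exactly in f64's 53 bits.
    let bh = (b + bu) / 2.0;
    (b, bh, bu)
}

fn main() {
    let (b, bh, bu) = neighbors(16777216.0);
    println!("b = {:.1}, b+h = {:.1}, b+u = {:.1}", b, bh, bu);
}
```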

Approach

A correct float parser is generally broken into 3 general algorithms:

A fast path algorithm, using native floats.
0 or more moderate path algorithms, using extended-precision, fixed-width floats.
A slow path algorithm, using arbitrary-precision arithmetic.

We'll assume we've parsed the number at this stage, and extracted the significant digits, the raw exponent, and the exponent relative to the significant digits. This expects all leading zeros in the mantissa to be trimmed, and all trailing zeros (after the decimal point) to similarly be trimmed.

For example, "1.2345e5" would be:

let mantissa = "12345";
let raw_exponent = 5;
let mantissa_exponent = 1;
let digits = mantissa.parse::<u64>().unwrap();
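As a rough sketch of how these components could be extracted, the following splits an already-validated, unsigned decimal string. The helper name `decompose` and its return shape are assumptions for illustration only; a real parser would work on bytes and validate its input.

```rust
/// Hypothetical helper: splits a validated, unsigned decimal string such as
/// "1.2345e5" into (significant digits, raw exponent, mantissa exponent).
fn decompose(s: &str) -> (String, i32, i32) {
    // Split off the exponent, if any.
    let (number, raw_exponent) = match s.find(|c: char| c == 'e' || c == 'E') {
        Some(i) => (&s[..i], s[i + 1..].parse::<i32>().unwrap()),
        None => (s, 0),
    };
    // Split the integer and fraction digits.
    let (integer, fraction) = match number.find('.') {
        Some(i) => (&number[..i], &number[i + 1..]),
        None => (number, ""),
    };
    // Trailing zeros after the decimal point carry no information.
    let fraction = fraction.trim_end_matches('0');
    let joined = format!("{}{}", integer, fraction);
    // Leading zeros carry no information either.
    let digits = joined.trim_start_matches('0').to_string();
    // Each remaining fraction digit lowers the exponent relative to the digits by one.
    let mantissa_exponent = raw_exponent - fraction.len() as i32;
    (digits, raw_exponent, mantissa_exponent)
}

fn main() {
    println!("{:?}", decompose("1.2345e5"));
}
```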

Just a quick notice: all the estimates of performance below are estimated on an i7-6560U CPU @ 2.20GHz, using the rust target x86_64-unknown-linux-gnu.

Fast Path

A decimal number can be exactly represented as a binary float if the number of digits can fit exactly in the mantissa of the float, without truncation, and the power of ten implied by the exponent can also be exactly represented as a native float. For an f32, the exponent limits are [-10, 10], inclusive, and for an f64, they are [-22, 22], inclusive. This is simply ⌊(mantissa_size + 1) / log2(5)⌋, since we can remove a power-of-two (since it can be exactly represented).

If the significant digits can be represented exactly without truncation (digits >> 53 == 0), and the exponent can be represented exactly (mantissa_exponent >= -22 && mantissa_exponent <= 22), then we have a valid float, and can simply multiply digits * 10**exponent (or divide by 10**-exponent when the exponent is negative), where the power is precomputed to ensure accuracy.

If the exponent is larger than this limit, we might still have a disguised fast-path case, and can shift digits from the exponent to the mantissa. If the exponent is > 10 && <= 17 (for f32) or > 22 && <= 37 (for f64), we can try to shift digits over, see if we still have an untruncated mantissa, and then multiply digits * 10**exponent as before.
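A minimal sketch of the f64 fast path described above follows. The names `F64_POW10` and `fast_path_f64` are assumptions for illustration, not the identifiers used in core or in the library, and the sketch assumes a target where f64 arithmetic is not performed in extended x87 precision.

```rust
/// Powers of ten that are exactly representable in an f64: 10^0 through 10^22.
const F64_POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

/// Hypothetical fast path: returns None when the slower paths are required.
fn fast_path_f64(digits: u64, mantissa_exponent: i32) -> Option<f64> {
    // The digits must fit in 53 bits and the power of ten must be exact.
    if digits >> 53 != 0 || mantissa_exponent < -22 || mantissa_exponent > 22 {
        return None;
    }
    let value = digits as f64; // exact, since digits < 2^53
    if mantissa_exponent >= 0 {
        // One multiplication of two exact operands: a single, correct rounding.
        Some(value * F64_POW10[mantissa_exponent as usize])
    } else {
        // Negative powers of ten are not exact, so divide by the positive power.
        Some(value / F64_POW10[(-mantissa_exponent) as usize])
    }
}

fn main() {
    println!("{:?}", fast_path_f64(12345, 1)); // digits of "1.2345e5"
    println!("{:?}", fast_path_f64(12345, 26)); // digits of "1.2345e30"
}
```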
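The disguised case can be sketched for f64 as below. The names `INT_POW10` and `disguised_fast_path_f64` are hypothetical, and the bounds (exponents 23 through 37) follow the f64 limits stated in this section.

```rust
/// Integer powers of ten used to shift digits from the exponent: 10^0 through 10^15.
const INT_POW10: [u64; 16] = [
    1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000,
    1_000_000_000, 10_000_000_000, 100_000_000_000, 1_000_000_000_000,
    10_000_000_000_000, 100_000_000_000_000, 1_000_000_000_000_000,
];

/// Largest power of ten handled by the regular f64 fast path.
const MAX_EXP: i32 = 22;

/// Hypothetical disguised fast path for f64: handles exponents in 23..=37.
fn disguised_fast_path_f64(digits: u64, mantissa_exponent: i32) -> Option<f64> {
    if digits >> 53 != 0 || mantissa_exponent <= MAX_EXP || mantissa_exponent > MAX_EXP + 15 {
        return None;
    }
    // Move the excess power of ten from the exponent into the digits.
    let shift = (mantissa_exponent - MAX_EXP) as usize;
    let shifted = digits.checked_mul(INT_POW10[shift])?;
    // The shifted digits must still fit in the 53-bit significand.
    if shifted >> 53 != 0 {
        return None;
    }
    // 1e22 is exactly representable, so this is a single, correct rounding.
    Some(shifted as f64 * 1e22)
}

fn main() {
    // "1.2345e30" has digits 12345 and a mantissa exponent of 26.
    println!("{:?}", disguised_fast_path_f64(12345, 26));
}
```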

Parsing using the fast-path can take ~20-45ns, depending on the length of the input, and almost entirely depends on the speed of parsing the mantissa. Rust already has a fast-path algorithm, although it does not handle disguised fast-path cases.

Moderate Path

If we cannot use the fast-path algorithm, we need to fall back to 0 or more moderate-path algorithms, which are considerably faster than the slow-path algorithms. My library uses extended-precision float representations, and there are 2 algorithms I'd like to propose for inclusion (either or both of which may be included).

An extended-float representation using 64-bits for the mantissa and 16-bits for the binary exponent.
The Eisel-Lemire algorithm, which creates a 128-bit (fallback 192-bit) representation of the mantissa and infers the binary exponent from the mantissa exponent.

Both algorithms use 64-bit multiplication to scale the normalized mantissa (the MSB is 1) by the decimal exponent, using precomputed powers of 10. However, the extended-float representation truncates intermediate results to 64-bits, and calculates the error from this truncation, to estimate if the errors could lead to ambiguity in halfway cases. An implementation of this algorithm can be found here, although this is based off the Go implementation. For licensing reasons, I would likely explain the algorithm (which is trivial to implement) to a core developer, and add the necessary routines in the ExtendedFloat struct to core (which I can contribute freely). Meanwhile, the Eisel-Lemire algorithm computes the high and low bits of the 64-bit multiplication, and uses them to create a 128-bit representation of the float. If this representation cannot be differentiated from halfway cases, a 192-bit representation is attempted. An implementation of this algorithm can be found here, and it is similarly based off the Go implementation.

Both algorithms, however, are trivial to implement and may be done in <80 lines of code. Parsing using the moderate path can take ~50-60ns, depending on the length of the input, and handles the vast majority of the remaining cases. Note that there is not 100% overlap between both algorithms, so if performance is the primary concern, both may be implemented together for optimal speed.

If the extended-float algorithm is used, it requires 76 u64s for precomputed powers of 10. If the Eisel-Lemire algorithm is used, it requires ~1300 u64s for precomputed powers of 10. If both are used, the extended-float algorithm can use Eisel-Lemire's precomputed powers, so no additional storage is required.
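To illustrate only the multiplication step, here is a sketch of an extended float with a 64-bit mantissa whose product is cut back to 64 bits. The struct layout, method names, and the round-half-up on the discarded bits are assumptions made for this example; the error tracking that decides whether the result is ambiguous is deliberately left out.

```rust
/// Hypothetical extended float: value = mant * 2^exp.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ExtendedFloat {
    mant: u64,
    exp: i32,
}

impl ExtendedFloat {
    /// Shifts the mantissa left until its most-significant bit is set.
    fn normalize(self) -> Self {
        if self.mant == 0 {
            return self;
        }
        let shift = self.mant.leading_zeros();
        ExtendedFloat { mant: self.mant << shift, exp: self.exp - shift as i32 }
    }

    /// Multiplies two extended floats, keeping only the high 64 bits of the product.
    fn mul(self, other: Self) -> Self {
        // Full 128-bit product of the two 64-bit mantissas.
        let product = (self.mant as u128) * (other.mant as u128);
        // Round half up on the 64 bits that are about to be discarded.
        let rounded = product + (1u128 << 63);
        ExtendedFloat {
            mant: (rounded >> 64) as u64,
            // Dropping 64 low bits is compensated by adding 64 to the exponent.
            exp: self.exp + other.exp + 64,
        }
    }
}

fn main() {
    let ten = ExtendedFloat { mant: 10, exp: 0 }.normalize();
    let three = ExtendedFloat { mant: 3, exp: 0 }.normalize();
    println!("{:?}", ten.mul(three).normalize());
}
```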

Slow Path

Although initially (as described in the post) I used a modified form of Algorithm-M (which Rust currently uses, to my knowledge), I have implemented my own algorithms with much faster performance here. The general approach is quite simple: create an arbitrary-precision integer (with at most ~3600 bits of precision), parse the significant digits into the big integer.

If the value is larger than 1, we can use large_atof, which then scales the significant digits by a power-of-ten, extracts the high 64-bits of the number, creates an ExtendedFloat from this representation, and rounds-nearest, ties-even.
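The "extract the high 64 bits" step can be sketched with a `u128` standing in for the arbitrary-precision integer. The function name `hi64` and its return shape are assumptions for illustration; the real big integer spans many limbs, but the idea of normalizing, keeping the top 64 bits, and remembering whether anything non-zero was cut off is the same.

```rust
/// Hypothetical helper: returns (hi, exponent, truncated) such that
/// value ~= hi * 2^exponent, with the most-significant bit of `hi` set.
/// `truncated` reports whether any non-zero low bits were discarded,
/// which matters when rounding a value that looks exactly halfway.
fn hi64(value: u128) -> (u64, i32, bool) {
    if value == 0 {
        return (0, 0, false);
    }
    // Normalize so the most-significant bit is in the top position.
    let shift = value.leading_zeros();
    let normalized = value << shift;
    let hi = (normalized >> 64) as u64;
    let truncated = (normalized as u64) != 0;
    (hi, 64 - shift as i32, truncated)
}

fn main() {
    println!("{:?}", hi64((1u128 << 64) + 1));
}
```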

If the value is less than 1 (including denormal floats), we create a theoretical representation of the bits of b (generated from the moderate path). We then scale the two numbers to the same binary exponent (first by removing a power-of-two) and significant digits, compare the digits, and round up or down accordingly.

These algorithms are significantly faster than rust core (by up to 500x), and are more comprehensive (can parse large and subnormal floats that are beyond the scope of core right now). The speed of these algorithms for near-halfway cases is ~200ns (for 20-30 significant digits), ~1.2μs for 400 digits, and ~6.5μs for 6400 digits. This is orders of magnitude faster than rust's existing implementation, and comparable with GCC's strtod for very large inputs (but considerably faster for small inputs).

Validation

In order to ensure the float parser is correct, I've done extensive validation: I used a curated test suite of near-halfway cases and other difficult-to-parse examples, ran the comprehensive float tests from the Rust core library, and ran the curated tests used by Wuffs and Go for validation. Furthermore, these algorithms have been used in serde-json and nom, and downloaded more than 5 million times, with no reported correctness issues since the very initial versions. Errors in correctness are therefore extremely unlikely.

Additional Considerations

The last time I proposed integrating slow-path algorithms into rust, I learned that the internal big-integer types do not do normalization after operations (this may have changed recently). As my algorithms only need normalization to the nearest digit (so there are no zero most-significant-digits), this should be trivial to alleviate with a normalize function.

Appendix - Radix Conversions

The algorithm for determining the number of digits required to round is the following, in Python:

import math

def digits(emin, p2, b=10):
    return -emin + p2 + math.floor((emin+ 1)*math.log(2, b)-math.log(1-2**(-p2), b))

f32_digits = digits(-126, 24)
f64_digits = digits(-1022, 53)

Appendix - Repository

I've created a minimal repository (so I don't have to build Rust every time I make changes) tracking these changes, with each branch representing changes to improve each algorithm.

Appendix - Open Issues and Pull Requests

#85198 Improved Performance for Disguised Fast-Path Cases in Float Parsing

#85214 Infer, Rather than Store, Binary Exponents when Float Parsing

#85234 Improve Float Parsing Speeds by up to 99.94% Through Improvements to the Bellerophon Algorithm

Alexhuszagh · May 10, 2021, 8:28pm

Looking more carefully at the fast path implementation, it does not seem core currently handles disguised fast path cases. This would be a trivial PR, and would be a first step towards improving float parsing in rust.

Alexhuszagh · May 12, 2021, 6:32pm

I've since opened 3 issues with improvements to the fast-path and moderate-path algorithms. These focus on relatively minor, incremental changes with major performance impacts, since Hanna Kruppe, who originally authored Rust's float parsing algorithms, has retired from Rust core development.

github.com/rust-lang/rust

Improved Performance for Disguised Fast-Path Cases in Float Parsing

opened 09:59PM - 11 May 21 UTC

Alexhuszagh

# Summary

Rust's float-parsing algorithm dec2flt uses a slower parsing algorithm than necessary to parse numbers like `"1.2345e30"`, which can slow down parsing times by nearly 300%. Adding trivial changes to dec2flt leads to dramatically improved parsing times, without increasing binary sizes, or slowing down other parse cases. Please see the "Sample Repository" below for the exact specifics, or in order to replicate these changes.

This is an [initial attempt](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670) as part of an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back in the core library.

# Issue

When parsing floating-point numbers, there is a fast-path algorithm that uses native floats to parse the float if applicable. This only occurs if:

- The significant digits of the float, or mantissa, can be represented in `mantissa_size+1` bits.
- The exponent can be exactly represented, or the absolute value is less than or equal to `⌊(mantissa_size+1) / log2(5)⌋`.

Please note that this is the exponent relative to the significant digits, for example, for `"1.2345e5"`, this exponent would be `1`, but for `"12345e5"` this exponent would be `5`. The reason why we use `mantissa_size+1` is due to the implicit, hidden bit of the float. A longer post detailing the attempts to improve float parsing on rust-internals can be found [here](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670).

The exact values for `f32` and `f64` are as follows:

**f32:**
- significant digit bits: 24
- exponent range: `[-10, 10]`

**f64:**
- significant digit bits: 53
- exponent range: `[-22, 22]`

However, there is an exception: if the value has fewer significant bits than the maximum, but has an exponent larger than our range, we can shift powers-of-10 from the exponent to the significant digits.
For example, `"1.2345e30"` would have significant digits of `12345` and an exponent of `26`, which is outside our range of `[-22, 22]`. However, if we shift `10^4` from the exponent to the significant digits, we get significant digits of `123450000` and an exponent of `22`, which is a valid fast-path case. This leads to a massive performance improvement with a large number of real-world float cases, and has an insignificant impact on other cases.

# Binary Sizes

These were compiled on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The sizes reflect the binary sizes reported by `ls -sh`, both before and after running the `strip` command. The debug profile was used for opt-levels `0` and `1`, and was as follows:

```toml
[profile.dev]
opt-level = "..."
debug = true
lto = false
```

The release profile was used for opt-levels `2`, `3`, `s` and `z` and was as follows:

```toml
[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true
```

**core**

These are the binary sizes prior to making changes.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|244K|
|z|1.3M|248K|

**disguised**

These are the binary sizes after making changes to speed up disguised fast-path cases.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|252K|
|z|1.3M|248K|

# Performance

Overall, the changes to speed up disguised fast-path cases led to ~-75% change in performance relative to core, without impacting any other benchmarks.

These benchmarks were run on an `i7-6560U CPU @ 2.20GHz`, on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The performance CPU governor was used for all benchmarks, and they were run consecutively on A/C power with only tmux and Sublime Text open.
The floats that were parsed are as follows:

```rust
// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
```

**core**

These are the benchmarks prior to making changes.

|float|speed|
|:-:|:-:|
|fast|32.952ns|
|disguised|129.86ns|
|moderate|237.08ns|
|halfway|371.21ns|
|large|287.81us|
|denormal|122.36us|

**disguised**

These are the benchmarks after making changes to speed up disguised fast-path cases.

|float|speed|
|:-:|:-:|
|fast|32.572ns|
|disguised|33.813ns|
|moderate|233.03ns|
|halfway|350.99ns|
|large|300.29us|
|denormal|129.36us|

# Correctness Concerns

None, since this merely transfers powers-of-10 from the exponent to the significant digits, using integer multiplication, and therefore can trivially be verified for correctness.

# Changes

The diff, which would be relative to `library/core/src/num`, is as follows:

```diff
diff --git a/src/dec2flt/algorithm.rs b/src/dec2flt/algorithm.rs
index 2b0b4cb..76d8105 100644
--- a/src/dec2flt/algorithm.rs
+++ b/src/dec2flt/algorithm.rs
@@ -110,7 +110,7 @@ mod fpu_precision {
 ///
 /// This is extracted into a separate function so that it can be attempted before constructing
 /// a bignum.
-pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], e: i64) -> Option<T> {
+pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], mut e: i64) -> Option<T> {
     let num_digits = integral.len() + fractional.len();
     // log_10(f64::MAX_SIG) ~ 15.95. We compare the exact value to MAX_SIG near the end,
     // this is just a quick, cheap rejection (and also frees the rest of the code from
@@ -118,14 +118,29 @@ pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], e: i64) -> Opt
     if num_digits > 16 {
         return None;
     }
-    if e.abs() >= T::CEIL_LOG5_OF_MAX_SIG as i64 {
+    let max_exp = T::FLOOR_LOG5_OF_MAX_SIG as i64;
+    let min_exp = -max_exp;
+    let shift_exp = T::FLOOR_LOG10_OF_MAX_SIG as i64;
+    let disguised_exp = max_exp + shift_exp;
+    if e < min_exp || e > disguised_exp {
         return None;
     }
-    let f = num::from_str_unchecked(integral.iter().chain(fractional.iter()));
+    let mut f = num::from_str_unchecked(integral.iter().chain(fractional.iter()));
     if f > T::MAX_SIG {
         return None;
     }
 
+    // Handle a disguised fast path case here.
+    if e > max_exp {
+        let shift = e - max_exp;
+        let value = f.checked_mul(T::short_int_pow10(shift as usize))?;
+        if value > T::MAX_SIG {
+            return None;
+        }
+        f = value;
+        e = max_exp;
+    }
+
     // The fast path crucially depends on arithmetic being rounded to the correct number of bits
     // without any intermediate rounding. On x86 (without SSE or SSE2) this requires the precision
     // of the x87 FPU stack to be changed so that it directly rounds to 64/32 bit.
diff --git a/src/dec2flt/rawfp.rs b/src/dec2flt/rawfp.rs
index a3acf3d..15a5839 100644
--- a/src/dec2flt/rawfp.rs
+++ b/src/dec2flt/rawfp.rs
@@ -73,13 +73,21 @@ pub trait RawFloat:
     /// represented, the other code in this module makes sure to never let that happen.
     fn from_int(x: u64) -> Self;
 
+    fn short_int_pow10(e: usize) -> u64 {
+        table::SHORT_POWERS[e]
+    }
+
     /// Gets the value 10<sup>e</sup> from a pre-computed table.
-    /// Panics for `e >= CEIL_LOG5_OF_MAX_SIG`.
+    /// Panics for `e >= FLOOR_LOG5_OF_MAX_SIG`.
     fn short_fast_pow10(e: usize) -> Self;
 
     /// What the name says. It's easier to hard code than juggling intrinsics and
     /// hoping LLVM constant folds it.
-    const CEIL_LOG5_OF_MAX_SIG: i16;
+    const FLOOR_LOG5_OF_MAX_SIG: i16;
+
+    /// What the name says. It's easier to hard code than juggling intrinsics and
+    /// hoping LLVM constant folds it.
+    const FLOOR_LOG10_OF_MAX_SIG: i16;
 
     // A conservative bound on the decimal digits of inputs that can't produce overflow or zero or
     /// subnormals. Probably the decimal exponent of the maximum normal value, hence the name.
@@ -147,7 +155,8 @@ impl RawFloat for f32 {
 
     const SIG_BITS: u8 = 24;
     const EXP_BITS: u8 = 8;
-    const CEIL_LOG5_OF_MAX_SIG: i16 = 11;
+    const FLOOR_LOG5_OF_MAX_SIG: i16 = 10;
+    const FLOOR_LOG10_OF_MAX_SIG: i16 = 7;
     const MAX_NORMAL_DIGITS: usize = 35;
     const INF_CUTOFF: i64 = 40;
     const ZERO_CUTOFF: i64 = -48;
@@ -196,7 +205,8 @@ impl RawFloat for f64 {
 
     const SIG_BITS: u8 = 53;
     const EXP_BITS: u8 = 11;
-    const CEIL_LOG5_OF_MAX_SIG: i16 = 23;
+    const FLOOR_LOG5_OF_MAX_SIG: i16 = 22;
+    const FLOOR_LOG10_OF_MAX_SIG: i16 = 15;
     const MAX_NORMAL_DIGITS: usize = 305;
     const INF_CUTOFF: i64 = 310;
     const ZERO_CUTOFF: i64 = -326;
diff --git a/src/dec2flt/table.rs b/src/dec2flt/table.rs
index 97b497e..bd9e53d 100644
--- a/src/dec2flt/table.rs
+++ b/src/dec2flt/table.rs
@@ -1234,6 +1234,30 @@ pub static POWERS: ([u64; 611], [i16; 611]) = (
     ],
 );
 
+#[rustfmt::skip]
+pub const SHORT_POWERS: [u64; 20] = [
+    1,
+    10,
+    100,
+    1000,
+    10000,
+    100000,
+    1000000,
+    10000000,
+    100000000,
+    1000000000,
+    10000000000,
+    100000000000,
+    1000000000000,
+    10000000000000,
+    100000000000000,
+    1000000000000000,
+    10000000000000000,
+    100000000000000000,
+    1000000000000000000,
+    10000000000000000000,
+];
+
 #[rustfmt::skip]
 pub const F32_SHORT_POWERS: [f32; 11] = [
     1e0,
```

I'd be happy to submit a pull request with these changes, if they are satisfactory to you.
# Sample Repository

I've created a simple, minimal repository tracking these changes on [rust-dec2flt](https://github.com/Alexhuszagh/rust-dec2flt), which has a [core branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/core) that is identical to Rust's current implementation in the core library. The [disguised branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/disguised) contains the changes to improve parsing speeds for disguised fast-path cases. I will also, if there is interest, gradually be making changes for the moderate and slow-path algorithms.

github.com/rust-lang/rust

Infer, Rather than Store, Binary Exponents when Float Parsing

opened 06:04AM - 12 May 21 UTC

Alexhuszagh

# Issue

When float-parsing, [precomputed](https://github.com/rust-lang/rust/blob/ea3068efe44f11d379a28a812d4a78ab73a80137/library/core/src/num/dec2flt/table.rs#L8) powers-of-10, along with a binary exponent, are stored to scale the significant digits of an extended-precision (80-bit) floating-point type to the decimal exponent. The general approach is as follows:

```rust
// Get our extended-precision float type from the significant digits and the decimal exponent.
let mantissa = "...";
let exp10 = "...";
let fp = Fp { f: mantissa, e: 0 };

// Get the scaling factor, so we can multiply the two.
let i = exp10 - table::MIN_E;
let sig = table::POWERS.0[i as usize];
let e = table::POWERS.1[i as usize];
let pow10 = Fp { f: sig, e };

// Multiply the two, then do float rounding.
let scaled = fp.mul(pow10);
...
```

However, the binary exponents (stored in `table::POWERS.1`) do not need to be explicitly stored, and there is no significant performance penalty for inferring them instead. We can replace the above code with the following:

```rust
// Get our extended-precision float type from the significant digits and the decimal exponent.
let mantissa = "...";
let exp10 = "...";
let fp = Fp { f: mantissa, e: 0 };

// Get the scaling factor, so we can multiply the two.
let i = exp10 - table::MIN_E;
let sig = table::POWERS[i as usize];
// Infer the binary exponent from the decimal exponent instead of loading it.
let e = ((217706 * exp10 as i64) >> 16) - 63;
let pow10 = Fp { f: sig, e };

// Multiply the two, then do float rounding.
let scaled = fp.mul(pow10);
...
```

# Related Work

This is an [initial attempt](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670) as part of an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back in the core library.

# Binary Sizes

Overall, when compiling with opt-levels of `s` or `z`, binary sizes were ~4KB smaller than before.
These were compiled on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The sizes reflect the binary sizes reported by `ls -sh`, both before and after running the `strip` command. The debug profile was used for opt-levels `0` and `1`, and was as follows:

```toml
[profile.dev]
opt-level = "..."
debug = true
lto = false
```

The release profile was used for opt-levels `2`, `3`, `s` and `z` and was as follows:

```toml
[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true
```

**core**

These are the binary sizes prior to making changes.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|244K|
|z|1.3M|248K|

**infer**

These are the binary sizes after making changes to infer the binary exponents.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|244K|
|z|1.3M|244K|

# Performance

Overall, no significant change in performance was detected for any of the example floats.

These benchmarks were run on an `i7-6560U CPU @ 2.20GHz`, on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The performance CPU governor was used for all benchmarks, and they were run on A/C power with only tmux and Sublime Text open.

The floats that were parsed are as follows:

```rust
// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
```

**core**

These are the benchmarks prior to making changes.

|float|speed|
|:-:|:-:|
|fast|32.952ns|
|disguised|129.86ns|
|moderate|237.08ns|
|halfway|371.21ns|
|large|287.81us|
|denormal|122.36us|

**infer**

These are the benchmarks after making changes to infer the binary exponent.

|float|speed|
|:-:|:-:|
|fast|31.753ns|
|disguised|124.73ns|
|moderate|229.22ns|
|halfway|319.39ns|
|large|266.29us|
|denormal|116.24us|

# Correctness Concerns

None, since the inferred exponents can be trivially shown using the Python code below to be identical to those stored in the dec2flt table. This only uses integer multiplication that cannot overflow, and a fixed shr of 16 bits.

# Magic Number Generation

The code to generate the magic number to convert the decimal exponent to the binary exponent is as follows, which verifies the magic number is valid over the entire range.

```python
import math

def get_range(max_exp, bitshift):
    # Fixed-point approximation of log2(10): num / den.
    den = 1 << bitshift
    num = int(math.ceil(math.log2(10) * den))
    for exp10 in range(0, max_exp):
        exp2_exact = int(math.log2(10**exp10))
        exp2_guess = num * exp10 // den
        if exp2_exact != exp2_guess:
            raise ValueError(f'{exp10}')
    return num, den

get_range(350, 16)  # (217706, 65536)
```

# Sample Repository

I've created a simple, minimal repository tracking these changes on [rust-dec2flt](https://github.com/Alexhuszagh/rust-dec2flt), which has a [core branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/core) that is identical to Rust's current implementation in the core library. The [infer branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/infer) contains the changes to infer the binary exponents rather than explicitly store them. I will also, if there is interest, gradually be making changes for the moderate and slow-path algorithms.
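The exponent inference from this issue can also be cross-checked in Rust for the powers of ten that fit in a `u128`. This is only a sketch: `infer_exp2` and `exact_exp2` are hypothetical names, and the check covers 10^0 through 10^38 rather than the full table range verified by the Python script.

```rust
/// Infers floor(log2(10^exp10)) using the magic number 217706 from the issue,
/// a 16-bit fixed-point approximation of log2(10).
fn infer_exp2(exp10: i64) -> i64 {
    (217706 * exp10) >> 16
}

/// Exact floor(log2(10^exp10)) for 0 <= exp10 <= 38, computed with u128.
fn exact_exp2(exp10: u32) -> i64 {
    let power = 10u128.pow(exp10);
    // The index of the highest set bit is floor(log2(power)).
    (127 - power.leading_zeros()) as i64
}

fn main() {
    for exp10 in 0..=38u32 {
        assert_eq!(infer_exp2(exp10 as i64), exact_exp2(exp10));
    }
    println!("inferred exponents match for 10^0 through 10^38");
}
```

Subtracting 63 from the inferred value, as the issue does, accounts for the power of ten being stored as a normalized 64-bit significand.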

github.com/rust-lang/rust

Improve Float Parsing Speeds by up to 99.94% Through Improvements to the Bellerophon Algorithm

opened 06:26PM - 12 May 21 UTC

Alexhuszagh

# Summary

When the fast-path algorithm cannot be used (see #85198), Rust defaults back to the Bellerophon algorithm, based off this [paper](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4152). Examples of floats that can be correctly parsed via the Bellerophon algorithm include `"9007199254740992.0"` (`1 << 53`), while near-halfway cases such as `"9007199254740992992e-3"` must fall back to slower algorithms (just less than halfway of `(1 << 53) + 1`). Unfortunately, the current implementation of the Bellerophon algorithm requires the use of arbitrary-precision arithmetic, which can lead to a 10,000x performance penalty. Please see the "Sample Repository" below for the exact specifics, or in order to replicate these changes.

This is an [initial attempt](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670) as part of an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back in the core library.

# Issue

When parsing floating-point numbers, if the number cannot be exactly parsed by the fast-path algorithm, it falls back to an extended-precision representation (often consisting of 64-bits for the significant digits, or mantissa, and 16-bits for the exponent). For a more detailed description of halfway cases, see the halfway cases section below.

If this extended-precision representation can be unambiguously rounded to the nearest native float, by showing that the max error is less than the difference to the nearest halfway case, then we have an accurate representation and can skip slower algorithms. These slower algorithms make use of arbitrary-precision arithmetic to exactly represent the significant digits of the float, and therefore round to the nearest native float.
The current implementation of Bellerophon, however, generates the significant digits from a [big integer](https://github.com/rust-lang/rust/blob/28e2b29b8952485679367cc05699fb5154f4e5c3/library/core/src/num/dec2flt/algorithm.rs#L161-L178), which leads to significantly reduced performance. By using a 64-bit representation of the significant digits parsed from the first 19-20 digits of the float, we can improve performance by orders of magnitude.

# Halfway Cases

When parsing floats, the most significant problem is determining how to round the resulting value. The IEEE-754 standard specifies rounding to nearest, then tie even. For example, using this rounding scheme to decimal numbers:

- `8.9` would round to `9.0`.
- `9.1` would round to `9.0`.
- `9.5` would round to `10.0`.
- `10.5` would round to `10.0`.

With parsing from decimal strings to binary, fixed-width floating point numbers, we must round to the nearest float. This becomes tricky when values are near their halfway point. For example, with a single-precision float `f32`, we would round as follows:

- `16777216.9` rounds to `16777216.0`
- `16777217.0` rounds to `16777216.0`
- `16777217.1` rounds to `16777218.0`

This is easier illustrated if we represent the float in binary. First, here's the layout of an IEEE-754 single-precision float as bits:

🟦🟩🟩🟩🟩🟩🟩🟩🟩🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪

Where:

- 🟦 is the sign bit.
- 🟩 are the exponent bits.
- 🟪 are the mantissa, or significant digit, bits.

We'll ignore the exponent and sign bits right now, and only consider the mantissa, or significant digits. The lowest exponent bit, also called the hidden bit, is used as an implicit, extra bit of precision for normal floats, meaning we have 24-bits of precision.
For 3 numbers, we would therefore have the following representations, where the last bit is truncated off:

- `16777216.0` => `100000000000000000000000 0`
- `16777217.0` => `100000000000000000000000 1`
- `16777218.0` => `100000000000000000000001 0`

Therefore, `16777217.0` is exactly halfway between `16777216.0` and `16777218.0`.

Although solving these halfway cases can superficially seem easy, simple algorithms will fail even when parsing the shortest, accurate decimal representation.

# Binary Sizes

These were compiled on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The sizes reflect the binary sizes reported by `ls -sh`, both before and after running the `strip` command. The debug profile was used for opt-levels `0` and `1`, and was as follows:

```toml
[profile.dev]
opt-level = "..."
debug = true
lto = false
```

The release profile was used for opt-levels `2`, `3`, `s` and `z` and was as follows:

```toml
[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true
```

**core**

These are the binary sizes prior to making changes.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|360K|
|1|3.5M|316K|
|2|1.3M|236K|
|3|1.3M|248K|
|s|1.3M|244K|
|z|1.3M|248K|

**moderate**

These are the binary sizes after making changes to speed up the Bellerophon algorithm.

|opt-level|size|size(stripped)|
|:-:|:-:|:-:|
|0|3.6M|364K|
|1|3.5M|316K|
|2|1.3M|248K|
|3|1.3M|252K|
|s|1.3M|244K|
|z|1.3M|244K|

# Performance

Overall, the changes to speed up the Bellerophon algorithm led to a:

- ~-79% change in performance for the `MODERATE` float.
- ~-99.7% change in performance for the `LARGE` float.
- ~-99.94% change in performance for the `DENORMAL` float.

And it did not affect the performance of the fast-path algorithm.
These benchmarks were run on an `i7-6560U CPU @ 2.20GHz`, on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The performance CPU governor was used for all benchmarks, which were run on A/C power with only tmux and Sublime Text open.

The floats that were parsed are as follows:

```rust
// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
```

**core**

These are the benchmarks prior to making changes.

|float|time|
|:-:|:-:|
|fast|32.952ns|
|disguised|129.86ns|
|moderate|237.08ns|
|halfway|371.21ns|
|large|287.81us|
|denormal|122.36us|

**moderate**

These are the benchmarks after making changes to speed up the Bellerophon algorithm.

|float|time|
|:-:|:-:|
|fast|26.668ns|
|disguised|34.599ns|
|moderate|49.378ns|
|halfway|224.81ns|
|large|796.34ns|
|denormal|63.763ns|

# Correctness Concerns

There are a few correctness concerns, since this uses a potentially truncated representation of the significant digits for error calculation. I've therefore made the error detection stricter, so it rejects more [halfway cases](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/algorithm.rs#L168-L184) than before and correctly compounds error with truncated cases and [non-normalized representations](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/algorithm.rs#L186-L188) after multiplication.
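To illustrate what a "potentially truncated representation of the significant digits" means, here is a minimal sketch, not the code from rust-dec2flt. The `Mantissa` struct and `parse_mantissa` function are hypothetical names, and the sketch assumes the input has already been reduced to ASCII digits with the decimal point removed.

```rust
/// Result of reading the leading decimal digits into a 64-bit integer.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct Mantissa {
    /// Value of the leading digits that fit in a `u64`.
    value: u64,
    /// Number of trailing digits that did not fit and were dropped.
    dropped: usize,
    /// Whether any dropped digit was non-zero, i.e. `value` is inexact.
    truncated: bool,
}

/// Accumulates digits until the next one would overflow a `u64`.
/// Assumes `digits` contains only ASCII `0`-`9`.
fn parse_mantissa(digits: &[u8]) -> Mantissa {
    let mut value: u64 = 0;
    let mut consumed = 0;
    for &c in digits {
        debug_assert!(c.is_ascii_digit());
        let digit = (c - b'0') as u64;
        match value.checked_mul(10).and_then(|v| v.checked_add(digit)) {
            Some(v) => {
                value = v;
                consumed += 1;
            }
            None => break,
        }
    }
    let rest = &digits[consumed..];
    Mantissa {
        value,
        dropped: rest.len(),
        truncated: rest.iter().any(|&c| c != b'0'),
    }
}

fn main() {
    // Digits of the HALFWAY and LARGE benchmark floats, decimal point removed.
    println!("{:?}", parse_mantissa(b"90071992547409930"));
    println!("{:?}", parse_mantissa(b"8988465674311580536566680"));
}
```

In a sketch like this, `dropped` would be added to the decimal exponent, and `truncated` tells the error calculation that the true significand is slightly larger than `value`, which is why the error bounds have to be widened in that case.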
In practice, the stricter error detection only rejects a [handful of cases](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/tests/bellerophon.rs) that would normally be accepted by the algorithm, with a major benefit to overall performance. I've also extended the powers-of-10 table to handle [denormal floats](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/table.rs#L4-L6), as well as values that could lead to Infinity, and updated the internal logic to ensure correct rounding.

This passes all of Rust's float parsing [tests](https://github.com/rust-lang/rust/blob/master/src/etc/test-float-parse/runtests.py), as well as carefully crafted examples designed to detect errors, and is therefore unlikely to have any correctness issues.

# Sample Repository

I've created a simple, minimal repository tracking these changes on [rust-dec2flt](https://github.com/Alexhuszagh/rust-dec2flt), which has a [core branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/core) that is identical to Rust's current implementation in the core library. The [moderate branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/moderate) contains the changes to improve parsing speeds for the Bellerophon algorithm. This currently relies on changes made to [infer binary exponents](https://github.com/rust-lang/rust/issues/85214), but can be trivially rewritten to explicitly store them.

Any review would be greatly appreciated. In addition, another library has considered integrating its float-parsing algorithms into Rust:

Alexhuszagh · June 30, 2021, 9:50pm

I've submitted a PR to core implementing a modified version of fast-float-rust, with better safety guarantees and improved code generation. Feedback is more than appreciated.

system · September 28, 2021, 9:51pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
