Optimizing Fallback Algorithms for Float Parsing

After implementing a fast and correct float parsing algorithm, there is still one major point of optimization. It only matters in rare cases, but it would make the core library implementation optimal in every case. I'm the author of the algorithm and the Rust implementation, so there are no licensing concerns.

A detailed description can be found in the discussion of including this in the reference implementation that was used for the original PR. Some key advantages are:

  1. Less static storage.
  2. Faster performance: more than 30x in some cases.
  3. Fewer magic numbers and an algorithm that is easier to reason about.

This only applies when the parsed float can neither be exactly represented by machine floats nor unambiguously rounded by the Eisel-Lemire algorithm. In short, this targets rare cases, but provides major optimizations under those conditions.

The gains come from using big-integer arithmetic, together with a few algorithms optimized for big-integer math when parsing decimal strings, which minimizes the number of operations relative to the decimal class in fast_float. There is more code involved; however, that code relies on fewer magic numbers and instead uses simple big-integer algorithms. The only additional static storage required is ~32 64-bit integers, which is considerably smaller than what the decimal implementation needs.
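To make the idea concrete, here is a minimal sketch (not the code in core; all names and layout are illustrative) of how the decimal significand can be accumulated exactly into 64-bit limbs instead of being kept one digit per byte:

```rust
/// Little-endian big integer: `limbs[0]` is the least significant limb.
struct BigInt {
    limbs: Vec<u64>,
}

impl BigInt {
    fn from_small(value: u64) -> Self {
        BigInt { limbs: vec![value] }
    }

    /// Multiply in place by a small scalar, propagating the carry.
    fn mul_small(&mut self, factor: u64) {
        let mut carry: u64 = 0;
        for limb in &mut self.limbs {
            let wide = (*limb as u128) * (factor as u128) + (carry as u128);
            *limb = wide as u64;
            carry = (wide >> 64) as u64;
        }
        if carry != 0 {
            self.limbs.push(carry);
        }
    }

    /// Add a small scalar in place, propagating the carry.
    fn add_small(&mut self, value: u64) {
        let mut carry = value;
        for limb in &mut self.limbs {
            let (sum, overflow) = limb.overflowing_add(carry);
            *limb = sum;
            carry = overflow as u64;
            if carry == 0 {
                return;
            }
        }
        if carry != 0 {
            self.limbs.push(carry);
        }
    }
}

/// Parse a run of ASCII digits exactly into a big integer, stepping in
/// chunks of 19 digits so each chunk still fits in a u64.
fn parse_mantissa(digits: &[u8]) -> BigInt {
    let mut result = BigInt::from_small(0);
    for chunk in digits.chunks(19) {
        let mut step: u64 = 0;
        for &d in chunk {
            step = step * 10 + (d - b'0') as u64;
        }
        result.mul_small(10u64.pow(chunk.len() as u32));
        result.add_small(step);
    }
    result
}
```

A scaled form of that exact value can then be compared against the halfway point between the two candidate floats to decide the rounding, which is where the big-integer arithmetic pays off.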

The performance gap is quite noticeable with floats like "8.988465674311580536566680e307", where the difference goes from ~500ns/iter (lexical) to ~14.465us/iter (fast_float). For denormal floats or those with negative exponents like "8.442911973260991817129021e-309", the gap is smaller: ~58.073ns/iter (lexical) vs. ~69.264ns/iter (fast_float). This also scales very well with input size, up to 768 digits (767, the maximum number of significant digits for a double-precision float, plus 1 extra) and beyond.
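For anyone who wants a rough sanity check of numbers like these, a crude timing loop over the standard library's parser looks something like the following; it is not the harness used for the figures above, just an illustration:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Near-halfway values that force the slow (big-integer) path.
    let inputs = [
        "8.988465674311580536566680e307",
        "8.442911973260991817129021e-309",
    ];
    const ITERS: u32 = 100_000;

    for input in inputs {
        let start = Instant::now();
        for _ in 0..ITERS {
            let value: f64 = black_box(input).parse().unwrap();
            black_box(value);
        }
        let elapsed = start.elapsed();
        println!("{input}: {:?}/iter", elapsed / ITERS);
    }
}
```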

This implementation is rather simple, as it uses big-integer arithmetic with grade-school multiplication and an extended float representation.
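A sketch of what those two pieces might look like (my own simplified version, not the exact types in core):

```rust
/// Unrounded float: `mant * 2^exp`, converted to f64 only at the very end.
struct ExtendedFloat {
    mant: u64,
    exp: i32,
}

/// Schoolbook (grade-school) multiplication of two little-endian limb
/// slices. O(n * m), which is fine because the operands here stay small.
fn mul_limbs(x: &[u64], y: &[u64]) -> Vec<u64> {
    let mut result = vec![0u64; x.len() + y.len()];
    for (i, &xi) in x.iter().enumerate() {
        let mut carry: u64 = 0;
        for (j, &yj) in y.iter().enumerate() {
            let wide = (xi as u128) * (yj as u128)
                + (result[i + j] as u128)
                + (carry as u128);
            result[i + j] = wide as u64;
            carry = (wide >> 64) as u64;
        }
        // This slot has not been written yet for row `i`, so the final
        // carry can be stored directly.
        result[i + y.len()] = carry;
    }
    // Trim leading zero limbs.
    while result.len() > 1 && *result.last().unwrap() == 0 {
        result.pop();
    }
    result
}
```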

2 Likes

What are the icache effects of this in the context of a whole program? I don't doubt this is faster on its own in a microbenchmark, but many real-world programs are L1 icache-starved, so what is faster in a microbenchmark can often be slower in a real program.

Plus you said this is for rare edge cases.

3 Likes

I may be wrong, but the graphs above appear to be from the C version, if I understand the issue you linked correctly. Do you have measurements from the Rust implementation of the algorithm?

1 Like

Which of these algorithms is the existing one in rustc, and which is the new one you are proposing? Your link goes to fastfloat/fast_float so I assume fast_float is the new one, but that is the slower one in these numbers you are giving here, so I am confused.

Later you compare "core" and "lexical"; what do those names refer to?

These are actually from the Rust implementation; however, the C results are quite similar. The approach has been pretty successful in JSON parser implementations, so I'd assume it is not just something that works well in microbenchmarks but works well more generally. I can get more formal benchmarks later.

As for the performance, the decimal representation has to repeatedly shift over an array of 768 elements, which is why it is so slow.
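To illustrate what shifting over such an array costs, compare a per-decimal-digit shift with a per-64-bit-limb shift (both are toy stand-ins, not the actual code in fast_float or core):

```rust
/// Double a base-10 digit buffer (most significant digit first):
/// every call has to walk the whole digit array.
fn decimal_times_two(digits: &mut Vec<u8>) {
    let mut carry = 0u8;
    for d in digits.iter_mut().rev() {
        let v = *d * 2 + carry;
        *d = v % 10;
        carry = v / 10;
    }
    if carry != 0 {
        digits.insert(0, carry);
    }
}

/// The limb-based equivalent moves 64 bits per element instead of one
/// decimal digit, so the same scaling takes far fewer passes.
fn limbs_shl(limbs: &mut Vec<u64>, shift: u32) {
    debug_assert!(shift > 0 && shift < 64);
    let mut carry = 0u64;
    for limb in limbs.iter_mut() {
        let new_carry = *limb >> (64 - shift);
        *limb = (*limb << shift) | carry;
        carry = new_carry;
    }
    if carry != 0 {
        limbs.push(carry);
    }
}
```

The decimal form repeats the full digit walk for every small scaling step, while the limb form handles 64 bits per pass, which matches the explanation above.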

However, you are right: this is for rare edge cases. In my own experience (which is mass spectrometry, and admittedly not representative of many forms of scientific data), large datasets with floating-point numbers tend to either contain simpler numbers or store them in binary rather than decimal format, so in formats where decimal text is used (such as Mascot General Files) these cases would be exceedingly rare.

Sorry if this wasn't clear: fastfloat/fast_float is the version prior to my PR being merged. The benchmarks are from the Rust implementation, and they compare the near-halfway representation algorithm in my implementation vs. the one currently in the Rust core library. I hope that makes sense?

I understand entirely if these changes are not desirable: unlike the original PR, these are edge cases and rarer in real-world datasets.

So "lexical" is your new proposal?

I don't have a strong opinion either way. Faster and easier to reason about sounds nice, but more code sounds bad. I also don't have to maintain this code since I am not on the libs team, so I'll defer such decisions to them.

I don't know what the best way is to get a libs-team opinion; probably filing an issue and labeling it I-libs-nominated.

1 Like