Pre-RFC: Parsing a Float From Parts

Alexhuszagh · May 16, 2021, 7:48pm

With the ongoing discussion of, and major advances in, improving the float-parsing algorithms in Rust core, I believe it's time to consider a major use-case of float parsing in Rust: parsing floats in storage formats such as ENDF-6, TOML, JSON, and many more.

These decimal strings representing floats may have major syntactic differences from what Rust considers a valid string. For example, in JSON, we can annotate the following strings as valid floats or not:

"NaN"       // invalid
"nan"       // invalid
"1.23"      // valid
"1.23e"     // invalid
"1."        // invalid
".1"        // invalid
"1.23e5"    // valid
"+1.23e5"   // invalid
"-1.23e5"   // valid

Meanwhile, str::parse classifies the same strings as follows:

"NaN"       // valid
"nan"       // invalid
"1.23"      // valid
"1.23e"     // invalid
"1."        // valid
".1"        // valid
"1.23e5"    // valid
"+1.23e5"   // valid
"-1.23e5"   // valid

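To make the mismatch concrete, here is a minimal sketch of a validator for the JSON number grammar (RFC 8259), which can be compared against what str::parse accepts. The names is_json_number and skip_digits are made up for this illustration and are not part of any library.

```rust
/// Advance `i` past any ASCII digits and return the new index.
fn skip_digits(bytes: &[u8], mut i: usize) -> usize {
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    i
}

/// Hypothetical helper: returns true if `s` matches the JSON number grammar,
/// `-? (0 | [1-9][0-9]*) ('.' [0-9]+)? ([eE] [+-]? [0-9]+)?`.
pub fn is_json_number(s: &str) -> bool {
    let b = s.as_bytes();
    let mut i = 0;
    // Optional leading minus; JSON does not allow a leading plus.
    if b.first() == Some(&b'-') {
        i += 1;
    }
    // Integer part: a lone zero, or a non-zero digit followed by digits.
    match b.get(i).copied() {
        Some(b'0') => i += 1,
        Some(b'1'..=b'9') => i = skip_digits(b, i),
        _ => return false,
    }
    // Optional fraction: '.' followed by at least one digit.
    if b.get(i).copied() == Some(b'.') {
        let start = i + 1;
        i = skip_digits(b, start);
        if i == start {
            return false;
        }
    }
    // Optional exponent: 'e' or 'E', optional sign, at least one digit.
    if matches!(b.get(i).copied(), Some(b'e') | Some(b'E')) {
        i += 1;
        if matches!(b.get(i).copied(), Some(b'+') | Some(b'-')) {
            i += 1;
        }
        let start = i;
        i = skip_digits(b, start);
        if i == start {
            return false;
        }
    }
    // Everything must have been consumed.
    i == b.len()
}
```

With this, is_json_number("1.") and is_json_number("+1.23e5") are both false, while "1.".parse::<f64>() and "+1.23e5".parse::<f64>() both succeed, so a JSON parser has to validate the syntax itself before it can hand anything to core.
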
In short, these implementations cannot use Rust core's float parser, due to the design choices of the core library. Although this might be fine in most languages, Rust is a common choice for implementing high-performance parsers for data interchange formats, and the performance and abundance of features in the standard library are two of the major reasons for this. There are therefore two common alternatives when parsing a data interchange format:

Create your own float parser (or fork an existing implementation, such as serde-json).
Tokenize the float, and re-format it to be passed on to str::parse.

Neither of these solutions is ideal for high-performance parsing, either due to the complexity of correct float parsers in the former, or the major performance issues in the latter.
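
As an illustration of the second alternative, here is a rough sketch of what "tokenize and re-format" looks like once a format-specific tokenizer has already split the number. The function name parse_by_reformatting is hypothetical. The point is that the digits get copied into a fresh String and are then scanned a second time by str::parse.

```rust
/// Hypothetical helper: rebuild a Rust-compatible float string from
/// already-tokenized parts and hand it to `str::parse`.
/// Returns `None` if the parts are not plain ASCII digit strings.
pub fn parse_by_reformatting(
    integral: &str,
    fractional: &str,
    exponent: i64,
    negative: bool,
) -> Option<f64> {
    let all_digits = |s: &str| s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(integral) || !all_digits(fractional) {
        return None;
    }
    if integral.is_empty() && fractional.is_empty() {
        return None;
    }
    // This allocation and copy is the overhead the proposal wants to avoid.
    let mut text = String::with_capacity(integral.len() + fractional.len() + 24);
    if negative {
        text.push('-');
    }
    // Always emit at least one integral digit.
    text.push_str(if integral.is_empty() { "0" } else { integral });
    if !fractional.is_empty() {
        text.push('.');
        text.push_str(fractional);
    }
    text.push('e');
    text.push_str(&exponent.to_string());
    // The digits are now scanned a second time by the core parser.
    text.parse().ok()
}
```

For example, parse_by_reformatting("1", "23", 5, true) builds the string "-1.23e5" and returns Some(-123000.0).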

A solution that would satisfy the vast majority of cases would be as follows:

pub trait FloatFromParts {
    fn from_parts(integral: &str, fractional: &str, exponent: i64, negative: bool) -> Self;
}

impl FloatFromParts for f32 {
    ...
}

impl FloatFromParts for f64 {
    ...
}

This has major advantages:

It allows significant code re-use with dec2flt, since we already need to parse the integral and fractional digits separately, parse an exponent to an i64, and determine if the float is negative. All the internal algorithms will therefore share the same code.
It covers the vast majority of cases, without adding performance penalties. The only common case this does not cover is floats with digit separators.

This would allow numerous data-interchange parsers (as well as compilers written in Rust) to use the Rust core library for float parsing. This would also require minimal additions to dec2flt. This would not include special values (which is a feature, since many data interchange formats, such as JSON, do not support special floats).

Note: This would only encompass decimal strings, so float strings like C/C++ hexadecimal strings would not be included.
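
To show how a caller might use such an API, here is a hedged sketch. The trait is repeated from above so the example is self-contained. The impl body is only a stand-in that re-formats and calls str::parse (in core it would presumably share code with dec2flt instead), and returning NaN on malformed parts is my assumption, made only because the proposed signature is infallible. split_number and parse_tokenized are hypothetical names, and split_number does no syntax validation beyond the exponent.

```rust
/// The trait from the proposal.
pub trait FloatFromParts {
    fn from_parts(integral: &str, fractional: &str, exponent: i64, negative: bool) -> Self;
}

/// Stand-in implementation (an assumption, not the proposed internals).
impl FloatFromParts for f64 {
    fn from_parts(integral: &str, fractional: &str, exponent: i64, negative: bool) -> Self {
        let sign = if negative { "-" } else { "" };
        // The extra leading and trailing zero keep the string well-formed
        // when either part is empty, without changing the value.
        format!("{}0{}.{}0e{}", sign, integral, fractional, exponent)
            .parse()
            .unwrap_or(f64::NAN)
    }
}

/// Hypothetical tokenizer: split `[-]digits[.digits][(e|E)[+-]digits]`
/// into (integral, fractional, exponent, negative).
pub fn split_number(s: &str) -> Option<(&str, &str, i64, bool)> {
    let (negative, rest) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (mantissa, exponent) = match rest.find(|c: char| c == 'e' || c == 'E') {
        Some(pos) => (&rest[..pos], rest[pos + 1..].parse::<i64>().ok()?),
        None => (rest, 0),
    };
    let (integral, fractional) = match mantissa.find('.') {
        Some(pos) => (&mantissa[..pos], &mantissa[pos + 1..]),
        None => (mantissa, ""),
    };
    Some((integral, fractional, exponent, negative))
}

/// Tokenize, then build the float from its parts.
pub fn parse_tokenized(s: &str) -> Option<f64> {
    let (integral, fractional, exponent, negative) = split_number(s)?;
    Some(f64::from_parts(integral, fractional, exponent, negative))
}
```

A format-specific parser would do its own validation first and then call from_parts with slices into the original input, so nothing needs to be copied on the caller's side.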

matklad · May 16, 2021, 8:10pm

It seems like there are two ideas here:

introduce glassbox float parsing API, reusable in different contexts
add this API to core

I think it’s pretty clear that the first is a big benefit, but it’s unclear why we want to do the second as well. I have a strong feeling that we don’t want this in std, as the design space is too large.

Seems we might want to do the converse? Add crates.io library with the trait, and use that from core?

Alexhuszagh · May 16, 2021, 8:31pm

@matklad There are already a few attempts at the former (all of which have been authored by me, with significant feedback from real-world cases of improving Rust float parsing). However, they're somewhat limited because this API requires a complete float parser to be implemented, since for performance reasons we cannot merely re-format the strings and then use str::parse.

Due to the complexity of float parsing, this means a lot of duplicated work. If any performance or correctness changes happen in core, they will not be reflected in libraries that use effectively duplicated algorithms. Also, I believe you mean use this crate instead of core? This raises questions of maintenance: I accidentally forced a delayed release of {integer}::BITS (from 1.51.0 to 1.53.0) because my mental health issues meant I had to take a step back from open source work.

I understand API space is a major concern, however, I feel this is complex enough, and important enough for a lot of applications using Rust (such as serde-json) to justify adding a single trait and trait method. It could be renamed parse_from_parts (or many other things) to further minimize the risk of any future functions sharing this name.

My perspective is notably biased by numerous interactions with people who need high-performance, correct float parsers (both on issue trackers and in private) and who therefore needed significant modifications to popular crates (including serde-json). A user emailed me today because updating their FORTRAN77 code to Rust was delayed since such functionality is not present.

The goal of this function is to limit API space with a single, extensible function that handles the vast majority of use-cases, without duplicating significant effort for what is effectively core functionality written in 3rd-party crates.

Aloso · May 16, 2021, 11:45pm

I really like the idea. However, I don't think a trait is necessary for this -- inherent {f32, f64}::from_parts methods are sufficient. Also, I'd prefer an API that works with digit separators (and doesn't require allocating a String, like s.chars().filter(|c| *c != '_').collect()). Maybe using impl IntoIterator<Item = char> instead of &str is enough?
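
A small sketch of why an iterator-based signature helps with digit separators: the caller can filter out '_' lazily, with no intermediate String. The function digits_to_u64 is a hypothetical stand-in that only accumulates an integer. A real from_parts would feed the digits into the float algorithm instead.

```rust
/// Hypothetical helper: accumulate decimal digits from any `char` iterator.
/// Returns `None` on an empty input, a non-digit, or `u64` overflow.
pub fn digits_to_u64(digits: impl IntoIterator<Item = char>) -> Option<u64> {
    let mut value: u64 = 0;
    let mut seen_digit = false;
    for c in digits {
        // `to_digit(10)` only accepts the ASCII digits '0'..='9'.
        let digit = u64::from(c.to_digit(10)?);
        value = value.checked_mul(10)?.checked_add(digit)?;
        seen_digit = true;
    }
    if seen_digit {
        Some(value)
    } else {
        None
    }
}
```

A caller that allows separators would write digits_to_u64("1_000_000".chars().filter(|&c| c != '_')), while a caller without separators can pass "1000000".chars() directly.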

Alexhuszagh · May 17, 2021, 12:03am

That is a decent idea. It does, however, disable some optimizations (in the grand scheme of things, those optimizations are pretty small). minimal-lexical is an example of this (I am the author), and it uses Iterator<Item=&'a u8>, which we could trivially change into IntoIterator<Item = char>.

If we go the more generalized approach (which is the goal, after all), this should definitely work. I guess no trait needs to be exposed as long as the RawFloat trait is used internally, which should simplify the API space concerns.

There are also numerous things I don't think are worth adding to the Rust core library:

parse_partial: This only makes sense in a small subset of parsers, with a grammar we already know. Either it only applies to parsers recognizing Rust's syntax for floats, or it opens Pandora's box.
Any form of syntax validation internally. Lexical does this, and it's a mess for performance, API design, and internal implementations alike. It most notably leads to bloated compile times. It has its place, and the standard library is absolutely not one of them.

The goal here is a single function that can encompass nearly every use-case (there's a very small minority of programming languages that accept non-ASCII characters as digits, but they're extremely rare and well beyond the scope of this proposal).

If we support digit separators, this would allow us to parse floats for (and these are only the ones I've personally tested):

Rust
Python
C++ (98, 03, 11, 14, 17)
C (89, 90, 99, 11, 18)
Ruby
Swift
Go
Haskell
Javascript
Perl
PHP
Java
R
Kotlin
Julia
C# (ISO-1, ISO-2, 3, 4, 5, 6, 7)
Kawa
Gambit-C
Guile
Clojure
Erlang
Elm
Scala
Elixir
FORTRAN
D
Coffeescript
Cobol
F#
Visual Basic
OCaml
Objective-C
ReasonML
Octave
Matlab
Zig
SageMath
JSON
TOML
XML
SQLite
PostgreSQL
MySQL
MongoDB

This would make Rust an even more ideal language for implementing a data interchange format parser, or an interpreter/compiler.

riking · May 17, 2021, 7:11am

This API doesn't appear to support hex floats (0x0.1a2b3cp67) - is the assumption that the caller will recognize this case and separately use int parse + from_bits instead of parse_from_parts? Seems fair but you might want to call it out.

kornel · May 17, 2021, 2:11pm

If you're worried about implementation work, then this may be solved without changes to core. core is open-source, you can take its sources, copy the parser and adjust the implementation to do what you want.

Parsing of IEEE754 floats is a "solved" problem, so forking may be fine. I don't think it'd be important to track changes to it (but if you build on top of rust repo, you could rebase your changes to track core's changes).

If your concern is code size, to reuse functions in applications that parse both Rust's syntax and other languages' syntax... then that may depend on how much overhead a flexible API adds, and how reusable it's going to be.

aldanor · May 19, 2021, 12:30pm

I think that's one of the points here. Since float parsing is a "solved problem", if we replace the subpar dec2flt in core with a more efficient algorithm, there would be no need for anyone to fork anything for performance reasons because it's as good as it gets - if the interface is more than just FromStr and allows parsing tokenized inputs at the very least. Many crates like serde-json etc could then potentially ditch copied/forked/customised float parsers and use stdlib for efficient float parsing.

As for whether it should be a trait or not - doesn't really matter, as long as it's exposed to the outer world in some way.

Alexhuszagh · May 19, 2021, 9:45pm

Yes, sorry, the documentation would clearly say "decimal strings", since it cannot realistically support a lot of esoteric floats. Since implementing your own C/C++ hexadecimal float parser is quite trivial, this shouldn't be an issue.
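
For reference, a deliberately restricted sketch of such a hexadecimal float parser, along the lines of the "int parse + from_bits" suggestion. The function name parse_hex_float is hypothetical. To stay exact it assumes the mantissa fits in 53 bits and the adjusted binary exponent is in the normal range, and it returns None otherwise instead of rounding.

```rust
/// Hypothetical helper: parse C-style hex floats such as `0x1.8p3`.
/// Restricted sketch: rejects mantissas wider than 53 bits and adjusted
/// exponents outside -1022..=1023 rather than rounding them.
pub fn parse_hex_float(s: &str) -> Option<f64> {
    let (negative, rest) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let rest = rest.strip_prefix("0x").or_else(|| rest.strip_prefix("0X"))?;
    // The binary exponent ('p' or 'P') is required here, as in C literals.
    let pos = rest.find(|c: char| c == 'p' || c == 'P')?;
    let (mantissa, exp_text) = (&rest[..pos], &rest[pos + 1..]);
    let mut exponent: i64 = exp_text.parse().ok()?;
    let (integral, fractional) = match mantissa.find('.') {
        Some(dot) => (&mantissa[..dot], &mantissa[dot + 1..]),
        None => (mantissa, ""),
    };
    if integral.is_empty() && fractional.is_empty() {
        return None;
    }
    // Accumulate all hex digits into one integer mantissa.
    let mut bits: u64 = 0;
    for c in integral.chars().chain(fractional.chars()) {
        let digit = u64::from(c.to_digit(16)?);
        if bits >> 49 != 0 {
            return None; // the next shift would exceed 53 bits
        }
        bits = (bits << 4) | digit;
    }
    // Each fractional hex digit scales the value by 2^-4.
    exponent = exponent.checked_sub(4 * fractional.len() as i64)?;
    if !(-1022..=1023).contains(&exponent) {
        return None;
    }
    // Build 2^exponent directly from its IEEE 754 bit pattern.
    let scale = f64::from_bits(((exponent + 1023) as u64) << 52);
    let value = bits as f64 * scale;
    if !value.is_finite() {
        return None;
    }
    Some(if negative { -value } else { value })
}
```

For example, parse_hex_float("0x1.8p3") accumulates the mantissa 0x18 = 24, adjusts the exponent to 3 - 4 = -1, and returns Some(12.0).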

Alexhuszagh · May 19, 2021, 9:52pm

There are also a few things the Rust core library can do that 3rd-party libraries can't do (at least on stable), which can matter on i386 without SSE/SSE2 (due to the weird behavior of the x87). Although I've never seen a practical issue where an error in correctness actually occurs, we can theoretically get rounding issues that 3rd-party libraries cannot solve on stable:

github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/algorithm.rs#L37-L113

// On x86, the x87 FPU is used for float operations if the SSE/SSE2 extensions are not available.
// The x87 FPU operates with 80 bits of precision by default, which means that operations will
// round to 80 bits causing double rounding to happen when values are eventually represented as
// 32/64 bit float values. To overcome this, the FPU control word can be set so that the
// computations are performed in the desired precision.
#[cfg(all(target_arch = "x86", not(target_feature = "sse2")))]
mod fpu_precision {
    use crate::mem::size_of;

    /// A structure used to preserve the original value of the FPU control word, so that it can be
    /// restored when the structure is dropped.
    ///
    /// The x87 FPU is a 16-bits register whose fields are as follows:
    ///
    /// | 12-15 | 10-11 | 8-9 | 6-7 |  5 |  4 |  3 |  2 |  1 |  0 |
    /// |------:|------:|----:|----:|---:|---:|---:|---:|---:|---:|
    /// |       | RC    | PC  |     | PM | UM | OM | ZM | DM | IM |
    ///
    /// The documentation for all of the fields is available in the IA-32 Architectures Software
    /// Developer's Manual (Volume 1).

(The excerpt is truncated here.)

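Specifically, the x87 has 3 modes of precision, selected by the precision control (PC) field of the FPU control word. The control word is read with the fnstcw instruction and written with the fldcw instruction.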

The 3 modes are:

f32 => 0x0000, // 32 bits
f64 => 0x0200, // 64 bits
_   => 0x0300, // default, 80 bits

In practice, no 3rd-party library can use this functionality on stable, since it requires inline assembly: the crate would either have to be nightly-only, which would be a deal-breaker, or hide it behind a nightly-only feature or a build script.
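
The bit manipulation itself is simple and can be sketched portably. The following only computes the new control word value from an old one. Actually reading and writing the register needs inline assembly and is not shown. The names PC_MASK and with_precision are made up for this illustration.

```rust
/// Precision control (PC) field of the x87 control word: bits 8 and 9.
const PC_MASK: u16 = 0x0300;

/// Hypothetical helper: return `control_word` with its PC field replaced by
/// the mode matching a float type of `float_size` bytes.
pub fn with_precision(control_word: u16, float_size: usize) -> u16 {
    let precision = match float_size {
        4 => 0x0000, // f32: single precision
        8 => 0x0200, // f64: double precision
        _ => 0x0300, // default: extended precision
    };
    // Clear the old PC bits, then set the new ones; other fields are untouched.
    (control_word & !PC_MASK) | precision
}
```

Starting from the x87 power-on default of 0x037F, with_precision(0x037F, std::mem::size_of::<f64>()) gives 0x027F, so only bits 8 and 9 change and the exception masks stay as they were.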

system · August 17, 2021, 9:53pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
