Pre-RFC: Parsing a Float From Parts

With the ongoing discussion of, and major progress on, improving the float-parsing algorithms in Rust core, I believe it's time to consider a major use case of float parsing in Rust: parsing floats in storage formats such as ENDF-6, TOML, JSON, and many more.

The decimal strings that represent floats in these formats may differ significantly in syntax from what Rust considers a valid string. For example, in JSON, we can annotate the following strings as valid floats or not:

"NaN"       // invalid
"nan"       // invalid
"1.23"      // valid
"1.23e"     // invalid
"1."        // invalid
".1"        // invalid
"1.23e5"    // valid
"+1.23e5"   // invalid
"-1.23e5"   // valid
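
To make the JSON column concrete, here is a minimal sketch of a validator for the JSON number grammar (optional minus, integer part without leading zeros, optional fraction with at least one digit, optional exponent with at least one digit). The name `is_json_number` is made up for illustration and does not come from any library; it takes the bare token, without surrounding quotes.

```rust
/// Returns true if `s` matches the JSON number grammar (RFC 8259, section 6).
/// `is_json_number` is a hypothetical helper written for this example.
fn is_json_number(s: &str) -> bool {
    let b = s.as_bytes();
    let mut i = 0;

    // Optional leading minus; a leading plus is not allowed in JSON.
    if b.get(i) == Some(&b'-') {
        i += 1;
    }

    // Integer part: a single `0`, or a non-zero digit followed by more digits.
    match b.get(i) {
        Some(b'0') => i += 1,
        Some(b'1'..=b'9') => {
            while matches!(b.get(i), Some(b'0'..=b'9')) {
                i += 1;
            }
        }
        _ => return false,
    }

    // Optional fraction: `.` followed by at least one digit.
    if b.get(i) == Some(&b'.') {
        i += 1;
        let start = i;
        while matches!(b.get(i), Some(b'0'..=b'9')) {
            i += 1;
        }
        if i == start {
            return false;
        }
    }

    // Optional exponent: `e` or `E`, optional sign, at least one digit.
    if matches!(b.get(i), Some(b'e' | b'E')) {
        i += 1;
        if matches!(b.get(i), Some(b'+' | b'-')) {
            i += 1;
        }
        let start = i;
        while matches!(b.get(i), Some(b'0'..=b'9')) {
            i += 1;
        }
        if i == start {
            return false;
        }
    }

    // Valid only if the whole token was consumed.
    i == b.len()
}
```

Calling it on each string from the list above reproduces the valid/invalid annotations.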

Meanwhile, str::parse classifies the same strings as follows:

"NaN"       // valid
"nan"       // invalid
"1.23"      // valid
"1.23e"     // invalid
"1."        // valid
".1"        // valid
"1.23e5"    // valid
"+1.23e5"   // valid
"-1.23e5"   // valid
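
These results are easy to check directly. The sketch below wraps `str::parse::<f64>` in a hypothetical `rust_accepts` helper, plus a small `classify` driver (also made up for this example). The special values are deliberately left out of the usage note, since which spellings of NaN and infinity are accepted has varied between Rust versions.

```rust
/// Returns true if Rust's built-in parser accepts `s` as an f64.
/// `rust_accepts` is a hypothetical helper for this example.
fn rust_accepts(s: &str) -> bool {
    s.parse::<f64>().is_ok()
}

/// Pairs each input with whether `str::parse::<f64>` accepts it.
fn classify<'a>(inputs: &[&'a str]) -> Vec<(&'a str, bool)> {
    inputs.iter().map(|&s| (s, rust_accepts(s))).collect()
}
```

For instance, `classify(&["1.", ".1", "+1.23e5"])` reports all three as accepted, which is exactly where Rust and JSON disagree.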

In short, these implementations cannot use Rust core's float parser, due to the design choices of the core library. Although this might be fine in most languages, Rust is a common choice for implementing high-performance parsers for data interchange formats, and the performance and abundance of features in the standard library are two of the major reasons for this. As a result, there are two common alternatives when parsing a data interchange format:

  1. Create your own float parser (or fork an existing implementation, such as serde-json).
  2. Tokenize the float, and re-format it so it can be passed to str::parse.

Neither of these solutions is ideal for high-performance parsing, either due to the complexity of correct float parsers in the former, or the major performance issues in the latter.
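
As an illustration of the second alternative, the sketch below handles a Fortran-style double-precision literal, where the exponent is introduced by `d` or `D`. The helper name `parse_fortran_double` and the choice of input format are assumptions made for the example, and it does not validate the rest of the Fortran grammar. The point is that the token has to be copied into a fresh String before str::parse can see it.

```rust
/// Parses a Fortran-style double such as "1.23D5" by rewriting it into
/// Rust's syntax first. `parse_fortran_double` is a hypothetical helper.
fn parse_fortran_double(token: &str) -> Option<f64> {
    // Allocate a new String with the exponent marker rewritten to `e`.
    // This copy is the cost the "re-format" approach pays on every float.
    let normalized: String = token
        .chars()
        .map(|c| match c {
            'd' | 'D' => 'e',
            other => other,
        })
        .collect();

    // Everything else is delegated to the standard parser.
    normalized.parse().ok()
}
```

A call like `parse_fortran_double("1.23D5")` returns the same value as parsing `"1.23e5"`, at the price of one allocation per token.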

A solution that would satisfy the vast majority of cases would be as follows:

pub trait FloatFromParts {
    fn from_parts(integral: &str, fractional: &str, exponent: i64, negative: bool) -> Self;
}

impl FloatFromParts for f32 {
    ...
}

impl FloatFromParts for f64 {
    ...
}

This has major advantages:

  1. It allows significant code re-use with dec2flt, since we already need to parse the integral and fractional digits separately, parse an exponent to an i64, and determine if the float is negative. All the internal algorithms will therefore share the same code.
  2. It covers the vast majority of cases, without adding performance penalties. The only common case this does not cover is floats with digit separators.

This would allow numerous data-interchange parsers (as well as compilers written in Rust) to use the Rust core library for float parsing. This would also require minimal additions to dec2flt. This would not include special values (which is a feature, since many data interchange formats, such as JSON, do not support special floats).

Note: This would only encompass decimal strings, so float strings like C/C++ hexadecimal strings would not be included.
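
To pin down the intended semantics of the proposed signature, here is a deliberately naive reference sketch. It is not how the function would be implemented in core: it rebuilds a decimal string and defers to str::parse, which is exactly the overhead the proposal wants to avoid. Treating empty digit strings as zero and panicking on non-digit input are assumptions of this sketch, not part of the proposal.

```rust
pub trait FloatFromParts {
    fn from_parts(integral: &str, fractional: &str, exponent: i64, negative: bool) -> Self;
}

impl FloatFromParts for f64 {
    fn from_parts(integral: &str, fractional: &str, exponent: i64, negative: bool) -> Self {
        // Reference semantics only: the result is the float nearest to
        // (-1)^negative * integral.fractional * 10^exponent.
        let sign = if negative { "-" } else { "" };
        // Assumption of this sketch: an empty digit string means zero.
        let integral = if integral.is_empty() { "0" } else { integral };
        let fractional = if fractional.is_empty() { "0" } else { fractional };

        // Rebuild a string in Rust's own float syntax and reuse str::parse.
        let text = format!("{}{}.{}e{}", sign, integral, fractional, exponent);
        text.parse()
            .expect("integral and fractional must contain only ASCII digits")
    }
}
```

With this in scope, a JSON parser that has already tokenized `-1.23e5` would call `f64::from_parts("1", "23", 5, true)`.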

It seems like there are two ideas here:

  • introduce a glass-box float parsing API, reusable in different contexts
  • add this API to core

I think it’s pretty clear that the first is a big benefit, but it’s unclear why we want to do the second as well. I have a strong feeling that we don’t want this in std, as the design space is too large.

Seems we might want to do the converse? Add a crates.io library with the trait, and use that from core?

@matklad There are already a few attempts at the former (all of which have been authored by me, with significant feedback from real-world cases of improving Rust float parsing). However, they're somewhat limited: this API requires a complete float parser to be implemented behind it, since for performance reasons we cannot merely re-format the strings and then use str::parse.

Due to the complexity of float parsing, this means a lot of duplicated work. If any performance or correctness changes happen in core, these changes will not also be reflected in any libraries using effectively duplicated algorithms. Also, I believe you mean use this crate instead of core? This raises questions of maintenance: I accidentally forced a delayed release of {integer}::BITS (from 1.51.0 to 1.53.0) because my mental health issues meant I had to take a step back from open source work.

I understand API space is a major concern, however, I feel this is complex enough, and important enough for a lot of applications using Rust (such as serde-json) to justify adding a single trait and trait method. It could be renamed parse_from_parts (or many other things) to further minimize the risk of any future functions sharing this name.

My perspective is notably biased due to numerous interactions with people (both on issue trackers and in private) who needed high-performance, correct float parsers and therefore had to make significant modifications to popular crates (including serde-json). A user emailed me today because porting their FORTRAN77 code to Rust was delayed since such functionality is not present.

The goal of this function is to limit API space with a single, extensible function that handles the vast majority of use-cases, without duplicating significant effort for what is effectively core functionality written in 3rd-party crates.

I really like the idea. However, I don't think a trait is necessary for this -- inherent {f32, f64}::from_parts methods are sufficient. Also, I'd prefer an API that works with digit separators (and doesn't require allocating a String, like s.chars().filter(|&c| c != '_').collect()). Maybe using impl IntoIterator<Item = char> instead of &str is enough?
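
The allocation-free variant can be sketched as follows. The helpers `digits_to_u64` and `parse_separated` are hypothetical and only accumulate an integer significand, which is far less than a real float parser does. They are here to show that a caller can filter out separators lazily and hand over an iterator instead of a String.

```rust
/// Accumulates decimal digits from any char iterator into a u64.
/// Returns None on a non-digit or on overflow. Hypothetical helper.
fn digits_to_u64(digits: impl IntoIterator<Item = char>) -> Option<u64> {
    let mut value: u64 = 0;
    for c in digits {
        let d = c.to_digit(10)? as u64;
        value = value.checked_mul(10)?.checked_add(d)?;
    }
    Some(value)
}

/// Caller side: strip `_` separators on the fly, with no intermediate String.
fn parse_separated(s: &str) -> Option<u64> {
    digits_to_u64(s.chars().filter(|&c| c != '_'))
}
```

A plain `&str` still works through `s.chars()`, so the iterator-based signature accepts everything the `&str` version would, just less directly.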

That is a decent idea. It does, however, disable some optimizations (in the grand scheme of things, those optimizations are pretty small). minimal-lexical is an example of this (I am the author), and it uses Iterator<Item=&'a u8>, which we could trivially change into IntoIterator<Item = char>.

If we go the more generalized approach (which is the goal, after all), this should definitely work. I guess no trait needs to be exposed as long as the RawFloat trait is used internally, which should simplify the API space concerns.

There are also numerous things I don't think are worth adding to the Rust core library:

  • parse_partial: This only makes sense in a small subset of parsers, with a grammar known in advance. Either it only applies to parsers recognizing Rust's syntax for floats, or it opens Pandora's box.
  • Any form of syntax validation internally. Lexical does this, and it's a mess for performance, API design, and internal implementations alike. It most notably leads to bloated compile times. It has its place, and the standard library is absolutely not it.

The goal here is a single function that can encompass nearly every use-case (there's a very small minority of programming languages that accept non-ASCII characters as digits, but they're extremely rare and well beyond the scope of this proposal).

If we support digit separators, this would allow us to parse floats for (and these are only the ones I've personally tested):

  1. Rust
  2. Python
  3. C++ (98, 03, 11, 14, 17)
  4. C (89, 90, 99, 11, 18)
  5. Ruby
  6. Swift
  7. Go
  8. Haskell
  9. JavaScript
  10. Perl
  11. PHP
  12. Java
  13. R
  14. Kotlin
  15. Julia
  16. C# (ISO-1, ISO-2, 3, 4, 5, 6, 7)
  17. Kawa
  18. Gambit-C
  19. Guile
  20. Clojure
  21. Erlang
  22. Elm
  23. Scala
  24. Elixir
  25. FORTRAN
  26. D
  27. CoffeeScript
  28. COBOL
  29. F#
  30. Visual Basic
  31. OCaml
  32. Objective-C
  33. ReasonML
  34. Octave
  35. MATLAB
  36. Zig
  37. SageMath
  38. JSON
  39. TOML
  40. XML
  41. SQLite
  42. PostgreSQL
  43. MySQL
  44. MongoDB

This would make Rust an even more ideal language for implementing a data interchange format parser, or an interpreter/compiler.

This API doesn't appear to support hex floats (0x0.1a2b3cp67) - is the assumption that the caller will recognize this case and separately use int parse + from_bits instead of parse_from_parts? Seems fair but you might want to call it out.

If you're worried about implementation work, then this may be solved without changes to core. core is open-source, you can take its sources, copy the parser and adjust the implementation to do what you want.

Parsing of IEEE754 floats is a "solved" problem, so forking may be fine. I don't think it'd be important to track changes to it (but if you build on top of rust repo, you could rebase your changes to track core's changes).

If your concern is code size, to reuse functions in applications that parse both Rust's syntax and other languages' syntax... then that may depend on how much overhead a flexible API adds, and how reusable it's going to be.

I think that's one of the points here. Since float parsing is a "solved problem", if we replace the subpar dec2flt in core with a more efficient algorithm, there would be no need for anyone to fork anything for performance reasons because it's as good as it gets - if the interface is more than just FromStr and allows parsing tokenized inputs at the very least. Many crates like serde-json etc could then potentially ditch copied/forked/customised float parsers and use stdlib for efficient float parsing.

As for whether it should be a trait or not - doesn't really matter, as long as it's exposed to the outer world in some way.

Yes, sorry, the documentation would clearly say "decimal strings", since it cannot realistically support a lot of esoteric float formats. Since implementing your own C/C++ hexadecimal float parser is quite trivial, this shouldn't be an issue.
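
To give a sense of why the hexadecimal case is simple, here is a minimal sketch of such a parser for f64. The function name `parse_hex_float` is made up, and the sketch is intentionally restricted: it requires the `0x` prefix and the `p` exponent, rejects significands of 2^53 or more, and only accepts scaled exponents in the normal range. Within those limits, the only rounding happens in a single multiplication by an exact power of two.

```rust
/// Parses a C-style hexadecimal float such as "0x1.8p1".
/// Hypothetical helper; a restricted sketch, not a complete parser.
fn parse_hex_float(s: &str) -> Option<f64> {
    let (negative, rest) = match s.strip_prefix('-') {
        Some(r) => (true, r),
        None => (false, s),
    };
    let rest = rest.strip_prefix("0x").or_else(|| rest.strip_prefix("0X"))?;

    // Split "digits[.digits]" from the binary exponent after `p`.
    let (mantissa, exp) = rest.split_once(|c: char| c == 'p' || c == 'P')?;
    let exp: i32 = exp.parse().ok()?;
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }

    // Accumulate all hex digits into one integer significand.
    let mut m: u64 = 0;
    for c in int_part.chars().chain(frac_part.chars()) {
        let digit = c.to_digit(16)? as u64;
        m = m.checked_mul(16)?.checked_add(digit)?;
    }
    // Restriction of this sketch: the significand must fit in 53 bits,
    // so that converting it to f64 is exact.
    if m >= (1u64 << 53) {
        return None;
    }

    // Each fractional hex digit shifts the binary exponent down by 4.
    let e = exp.checked_sub(4 * frac_part.len() as i32)?;
    if !(-1022..=1023).contains(&e) {
        return None;
    }

    // Build 2^e exactly from its bit pattern (biased exponent, zero fraction).
    let pow2 = f64::from_bits(((e + 1023) as u64) << 52);
    let value = m as f64 * pow2;
    Some(if negative { -value } else { value })
}
```

For example, `parse_hex_float("0x1.8p1")` yields 3.0. A production parser would also need to round longer significands and handle exponents outside the normal range.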

There are also a few things the Rust core library can do that 3rd-party libraries can't (at least on stable), which can matter on i386 without SSE/SSE2 (due to the weird behavior of the x87). Although I've never seen a practical case where a correctness error actually occurs, we can theoretically get rounding issues that 3rd-party libraries cannot solve on stable:

Specifically, the x87 has 3 modes of precision, selected by the precision-control bits of its control word, which is read with the fnstcw instruction and written with fldcw.

The 3 modes are:

f32 => 0x0000, // 32 bits
f64 => 0x0200, // 64 bits
 _ => 0x0300, // default, 80 bits

In practice, no 3rd-party library can use this functionality on stable: the only ways to reach it are a nightly-only feature or a build script, and requiring nightly would be a deal-breaker.
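
The bit manipulation involved can be sketched without any assembly. The names below (`Precision`, `with_precision`) are hypothetical, and the code only computes the new control word value; actually reading and writing the register needs the fnstcw and fldcw instructions, which are not shown here.

```rust
/// The three x87 precision modes listed above. Hypothetical type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Single,   // 32 bits
    Double,   // 64 bits
    Extended, // 80 bits, the default
}

/// Bits 8 and 9 of the control word hold the precision-control field.
const PRECISION_MASK: u16 = 0x0300;

/// Maps a precision mode to its control-word bits.
fn precision_bits(p: Precision) -> u16 {
    match p {
        Precision::Single => 0x0000,
        Precision::Double => 0x0200,
        Precision::Extended => 0x0300,
    }
}

/// Returns `cw` with its precision field replaced and all other bits kept.
fn with_precision(cw: u16, p: Precision) -> u16 {
    (cw & !PRECISION_MASK) | precision_bits(p)
}
```

Starting from the x87 initial control word 0x037F, `with_precision(0x037F, Precision::Double)` gives 0x027F. A parser would load that value before its floating-point work and restore the saved word afterwards.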
