I would like to see the comprehensive measurements you base your statements above on, as well as benchmarks in real-world usage (rather than just microbenchmarks). Benchmarks for something like this should also be run across a few different system types and architectures, including at least one soft-float configuration (representing embedded / kernel development), to make sure the change is not a major regression on any of them.
Variations of this seem to come up every so often, so I'll also link back to my statement from one of the previous times: [algorithm] new float/double to string algorithm - #2 by Vorpal (as well as the conversation that follows and then peters out).
Note: I'm not saying that the library is bad, just that the bar for std is high, especially for code that ends up available in no-std. It needs to work for every use case there.
TBH, what we really need is someone to figure out a language feature for things like the allocator shim so that it can be less special and we can use that mechanism for more things -- like this.
Then core could offer a default implementation that's a fine middle ground, but people could plug in different implementations in their binaries depending on their needs: a size-optimized one if you only format a couple of parameters in a CLI, one that calls out to the C library on embedded targets that already ship a C implementation, a big-but-fast one for something formatting a ton of floats into JSON, etc.
That would potentially be amazing for many other things too. One that comes to mind is being able to make `Mutex` (and other locking primitives, as well as channels) use priority inheritance. The lack of that is currently a thorn in the side of anyone doing hard realtime on Linux.