Benchmark for std::str::from_utf8()?


#1

Is there a benchmark that purported performance improvements to std::str::from_utf8() should score better on in order to be considered improvements?

That is, what kind of ASCII/non-ASCII mix workloads are considered relevant?

CC @bluss


#2

For web use cases at least, the single most important thing is to provide a very fast path for ASCII chars, because that’s what web content is mostly composed of (due not only to the dominance of English content, but also to HTML tags, URLs, CSS and JavaScript syntax… you get the idea). So a first benchmark on ASCII-only content would be good. As a first approximation, you can just generate a big bunch of random ASCII chars.
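A sketch of generating such an ASCII-only input is below; the LCG-based generator is purely illustrative (a real harness might use corpus data instead), and the constants are Knuth's MMIX LCG parameters:

```rust
use std::str;

/// Generate `len` pseudo-random printable ASCII bytes using a simple
/// LCG, so the benchmark input is cheap to build and reproducible.
/// (Illustrative only, not any particular harness's actual code.)
fn random_ascii(len: usize, mut seed: u64) -> Vec<u8> {
    (0..len)
        .map(|_| {
            // Knuth's MMIX LCG constants.
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Map the high bits into the printable range 0x20..=0x7E.
            0x20 + ((seed >> 33) % 95) as u8
        })
        .collect()
}

fn main() {
    let data = random_ascii(1 << 20, 42);
    // Pure ASCII must always validate as UTF-8.
    let s = str::from_utf8(&data).unwrap();
    assert!(s.is_ascii());
    assert_eq!(s.len(), 1 << 20);
}
```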

After that, you would also need to demonstrate that your changes do not affect true Unicode content negatively, which can be done by starting from a plain text document in any language you like that has lots of non-ASCII characters. A Chinese text, for example, would be a good “pure Unicode” benchmark, since Chinese writing is mostly composed of characters outside the ASCII range which, to top it off, encode to multiple bytes each in UTF-8.
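To illustrate why such text stresses the multi-byte path: each CJK character encodes to three bytes in UTF-8, so the byte count is triple the character count (the sample string here is an arbitrary example):

```rust
fn main() {
    let zh = "百科全书"; // "encyclopedia": four CJK characters
    assert_eq!(zh.chars().count(), 4);
    // CJK Unified Ideographs encode to 3 bytes each in UTF-8.
    assert_eq!(zh.len(), 12);
    // Round-trip: the raw bytes validate back to the same &str.
    assert_eq!(std::str::from_utf8(zh.as_bytes()), Ok(zh));
}
```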

Finally, as a mixed workload, you could dump the source code of a Chinese web page (Unicode contents + ASCII code) and see how well your algorithm deals with it. Results of a Baidu search would be one easy possibility.

Although my point of view here is somewhat web-centric, I am convinced that most real-world big text files are a combination of ASCII and “true” Unicode with infrequent switches from one to the other. Think configuration files, system logs, rich text documents, text-based network protocols, computer programs with strings in them… All of these basically integrate Unicode text inside of an ASCII-only substrate.
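A mixed input of that shape can be synthesized directly; the tag name and the embedded Chinese sentence below are arbitrary placeholders, not taken from any real page:

```rust
use std::str;

/// Build a mixed ASCII/Unicode workload: an ASCII "markup" substrate
/// with embedded multi-byte runs, switching infrequently between the two.
/// (Illustrative sketch only.)
fn mixed_workload(repeats: usize) -> Vec<u8> {
    let mut s = String::new();
    for _ in 0..repeats {
        s.push_str("<p class=\"entry\">"); // ASCII-only run
        s.push_str("这是一段中文示例文本。"); // 3-byte-per-char run
        s.push_str("</p>\n"); // back to ASCII
    }
    s.into_bytes()
}

fn main() {
    let data = mixed_workload(10_000);
    // The whole buffer is valid UTF-8 by construction.
    assert!(str::from_utf8(&data).is_ok());
}
```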


#3

#30740 links to some benchmarks for various from_utf8 workloads.


#4

Indeed. encoding_rs optimized UTF-8 validation (and decode speed in general) for Wikipedia HTML and, for the 100% ASCII case, for jQuery.

Workload-specific tuning like this is hazardous, though. On ARMv7+NEON, I was able to make the 100% ASCII case faster with NEON, but even German Wikipedia HTML (ASCII runs in the markup, plus runs of ASCII within the natural-language text, long enough to fit in a SIMD register) got pessimized. As a result, I kept the ARMv7 case as ALU code, albeit faster ALU code (on my benchmark, on ARMv7) than what’s in the standard library.
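For reference, an ALU-style ASCII fast path generally looks something like the word-at-a-time scan below: a machine word is all-ASCII iff no byte in it has its high bit set. This is a sketch of the general technique, not the standard library's or encoding_rs's actual code:

```rust
/// Length of the leading all-ASCII prefix of `buf`, checked one
/// machine word at a time. (Sketch of the general ALU technique only.)
fn ascii_prefix_len(buf: &[u8]) -> usize {
    const N: usize = std::mem::size_of::<usize>();
    // A word with the high bit of every byte set, e.g. 0x8080...80.
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80u8; N]);
    let mut i = 0;
    while i + N <= buf.len() {
        let word = usize::from_ne_bytes(buf[i..i + N].try_into().unwrap());
        if word & HIGH_BITS != 0 {
            break; // some byte in this word is non-ASCII
        }
        i += N;
    }
    // Finish the tail (or the word that tripped) byte by byte.
    while i < buf.len() && buf[i] < 0x80 {
        i += 1;
    }
    i
}

fn main() {
    assert_eq!(ascii_prefix_len(b"pure ascii markup"), 17);
    // 10 ASCII bytes, then a 2-byte 'é'.
    let mixed = "abcdefghij\u{00e9}tail".as_bytes();
    assert_eq!(ascii_prefix_len(mixed), 10);
}
```

A validator can run this scan first and fall back to the full UTF-8 state machine only from the first non-ASCII byte onward, which is what makes the ASCII-heavy workloads above so sensitive to it.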

Thanks. It’s not immediately obvious why these workloads are considered relevant to the standard library, but I’ll try to see how my modifications to UTF-8 validation compare to the standard library on these workloads.