Benchmark for std::str::from_utf8()?


#1

Is there a benchmark that purported performance improvements to std::str::from_utf8() should score better on in order to be considered improvements?

That is, what kind of ASCII/non-ASCII mix workloads are considered relevant?

CC @bluss


#2

For web use cases at least, the single most important thing is to provide a very fast path for ASCII chars, because that’s what web content is mostly composed of (due not only to the dominance of English content, but also to HTML tags, URLs, CSS and JavaScript syntax… you get the idea). So a first benchmark on ASCII-only content would be good. As a first approximation, you can just generate a big bunch of random ASCII chars.
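A sketch of generating such an ASCII-only input is below; the LCG-based generator is purely illustrative (a real harness might use corpus data instead), and the constants are Knuth's MMIX LCG parameters:

```rust
use std::str;

/// Generate `len` pseudo-random printable ASCII bytes using a simple
/// LCG, so the benchmark input is cheap to build and reproducible.
/// (Illustrative only, not any particular harness's actual code.)
fn random_ascii(len: usize, mut seed: u64) -> Vec<u8> {
    (0..len)
        .map(|_| {
            // Knuth's MMIX LCG constants.
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Map the high bits into the printable range 0x20..=0x7E.
            0x20 + ((seed >> 33) % 95) as u8
        })
        .collect()
}

fn main() {
    let data = random_ascii(1 << 20, 42);
    // Pure ASCII must always validate as UTF-8.
    let s = str::from_utf8(&data).unwrap();
    assert!(s.is_ascii());
    assert_eq!(s.len(), 1 << 20);
}
```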

After that, you would also need to demonstrate that your changes do not affect true Unicode content negatively, which can be done by starting from a plain text document in any language you like that has lots of non-ASCII characters. A Chinese text, for example, would be a good “pure Unicode” benchmark, since Chinese writing is mostly composed of characters outside the ASCII range which, to top it off, encode to multiple bytes each in UTF-8.
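To illustrate why such text stresses the multi-byte path: each CJK character encodes to three bytes in UTF-8, so the byte count is triple the character count (the sample string here is an arbitrary example):

```rust
fn main() {
    let zh = "百科全书"; // "encyclopedia": four CJK characters
    assert_eq!(zh.chars().count(), 4);
    // CJK Unified Ideographs encode to 3 bytes each in UTF-8.
    assert_eq!(zh.len(), 12);
    // Round-trip: the raw bytes validate back to the same &str.
    assert_eq!(std::str::from_utf8(zh.as_bytes()), Ok(zh));
}
```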

Finally, as a mixed workload, you could dump the source code of a Chinese web page (Unicode contents + ASCII code) and see how well your algorithm deals with it. Results of a Baidu search would be one easy possibility.

Although my point of view here is somewhat web-centric, I am convinced that most real-world big text files are a combination of ASCII and “true” Unicode with infrequent switches from one to the other. Think configuration files, system logs, rich text documents, text-based network protocols, computer programs with strings in them… All of these basically integrate Unicode text inside of an ASCII-only substrate.
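A mixed input of that shape can be synthesized directly; the tag name and the embedded Chinese sentence below are arbitrary placeholders, not taken from any real page:

```rust
use std::str;

/// Build a mixed ASCII/Unicode workload: an ASCII "markup" substrate
/// with embedded multi-byte runs, switching infrequently between the two.
/// (Illustrative sketch only.)
fn mixed_workload(repeats: usize) -> Vec<u8> {
    let mut s = String::new();
    for _ in 0..repeats {
        s.push_str("<p class=\"entry\">"); // ASCII-only run
        s.push_str("这是一段中文示例文本。"); // 3-byte-per-char run
        s.push_str("</p>\n"); // back to ASCII
    }
    s.into_bytes()
}

fn main() {
    let data = mixed_workload(10_000);
    // The whole buffer is valid UTF-8 by construction.
    assert!(str::from_utf8(&data).is_ok());
}
```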


#3

#30740 links to some benchmarks for various from_utf8 workloads.


#4

Indeed. encoding_rs optimized UTF-8 validation (and decode speed in general) for Wikipedia HTML and, for the 100% ASCII case, for jQuery.

Workload-specific tuning like this is hazardous, though. On ARMv7+NEON, I was able to make the 100% ASCII case faster with NEON, but even German Wikipedia HTML (ASCII runs in the markup, plus runs of ASCII within the natural-language text, long enough to fit in a SIMD register) got pessimized. As a result, I kept the ARMv7 case as ALU code, albeit faster ALU code (on my benchmark, on ARMv7) than what’s in the standard library.
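For reference, an ALU-style ASCII fast path generally looks something like the word-at-a-time scan below: a machine word is all-ASCII iff no byte in it has its high bit set. This is a sketch of the general technique, not the standard library's or encoding_rs's actual code:

```rust
/// Length of the leading all-ASCII prefix of `buf`, checked one
/// machine word at a time. (Sketch of the general ALU technique only.)
fn ascii_prefix_len(buf: &[u8]) -> usize {
    const N: usize = std::mem::size_of::<usize>();
    // A word with the high bit of every byte set, e.g. 0x8080...80.
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80u8; N]);
    let mut i = 0;
    while i + N <= buf.len() {
        let word = usize::from_ne_bytes(buf[i..i + N].try_into().unwrap());
        if word & HIGH_BITS != 0 {
            break; // some byte in this word is non-ASCII
        }
        i += N;
    }
    // Finish the tail (or the word that tripped) byte by byte.
    while i < buf.len() && buf[i] < 0x80 {
        i += 1;
    }
    i
}

fn main() {
    assert_eq!(ascii_prefix_len(b"pure ascii markup"), 17);
    // 10 ASCII bytes, then a 2-byte 'é'.
    let mixed = "abcdefghij\u{00e9}tail".as_bytes();
    assert_eq!(ascii_prefix_len(mixed), 10);
}
```

A validator can run this scan first and fall back to the full UTF-8 state machine only from the first non-ASCII byte onward, which is what makes the ASCII-heavy workloads above so sensitive to it.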

Thanks. It’s not immediately obvious why these workloads are considered relevant to the standard library, but I’ll try to see how my modifications to UTF-8 validation compare to the standard library on these workloads.