Interest in branchless UTF-8 decoder?

ishitatsuyuki · October 8, 2017, 1:53am

Just saw an alternative UTF-8 decoder on HN.

Is the current UTF8 decoder optimized enough? Do you think this microoptimization can work in Rust std?

dhardy · October 8, 2017, 7:44am

That’s a really interesting piece of work. But:

This is very much designed around the hugely parallel architecture of x86_64 CPUs. Would it translate well to other CPUs, such as ARM?
It relies on the input buffer having at least three trailing null bytes. This means it couldn’t be used via a general-purpose API like in std without an expensive alloc and copy.
The author uses left-shift on the result of a comparison — *e = (*c < mins[len]) << 6; — is this even defined in C or is it platform-specific behaviour? Pretty sure it’s not possible in safe Rust.
The author notes his goal is to beat the performance of Höhrmann’s DFA-based decoder, which he does by 20% — with specially crafted input. He notes that on real-world input branch prediction usually works well enough anyway, so there’s generally no advantage in using his branchless algorithm.

Summary: neat work but pretty useless.

bluss · October 8, 2017, 2:13pm

If you can improve the .chars() iterator, it would be very cool. Just go for it and experiment (probably most productively out of the main tree). I suggest assuming it’s for .chars() specifically and not for generic iterators of bytes, as it is written right now more or less.

zackw · October 8, 2017, 5:11pm

Yes, it is well-defined in C. The result of all comparison operators is always either 0 or 1 (with type int).

The zero-padding requirement is definitely a fatal flaw here, but it might be possible to get around it with sufficiently clever predicated vector instructions. In fact I think I remember reading that AArch64 has vector instructions specifically designed to process C-style strings, doing auto-alignment of the loads and generating predicate vectors to ignore bytes beyond the terminator -- which would make this algorithm even better suited to AArch64.

ishitatsuyuki · October 9, 2017, 4:26am

It is. true as i32 works.

twmb · October 9, 2017, 5:18am

I want to expand on @dhardy's point

The author writes

The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction.

Real text does not have this even distribution, and pursuing adding a branchless decoder to Rust is a wasted endeavor until it is proven better on a standard corpus of text. Further, the author updated his post:

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

Even with the author's own benchmarks, he cannot beat the faster variant of the DFA decoder with clang.

bluss · October 9, 2017, 4:50pm

I hope you all are just saying it’s wasted time because you don’t want someone to scoop you on the optimization you’re secretly worked on…

I wrote the current chars() formulation and I want to see it beaten!

system · March 25, 2019, 8:29am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Feature Request: Utf8 prefix inspection for u8 language design	6	579	March 26, 2024
Wild idea: deprecating APIs that conflate str and [u8] libs	59	3620	November 12, 2020
Fn char::as_ascii(self) -> Option<u8>	4	732	April 20, 2020
Pre-RFC - allow usage of str::utf8_char_width libs	2	656	March 8, 2022
Str vs slice APIs libs	3	1592	March 25, 2019

Interest in branchless UTF-8 decoder?

Related topics