Interest in branchless UTF-8 decoder?


Just saw an alternative UTF-8 decoder on HN.

Is the current UTF8 decoder optimized enough? Do you think this microoptimization can work in Rust std?


That’s a really interesting piece of work. But:

  • This is very much designed around the hugely parallel architecture of x86_64 CPUs. Would it translate well to other CPUs, such as ARM?
  • It relies on the input buffer having at least three trailing null bytes. This means it couldn’t be used via a general-purpose API like in std without an expensive alloc and copy.
  • The author uses left-shift on the result of a comparison — *e = (*c < mins[len]) << 6; — is this even defined in C or is it platform-specific behaviour? Pretty sure it’s not possible in safe Rust.
  • The author notes his goal is to beat the performance of Höhrmann’s DFA-based decoder, which he does by 20% — with specially crafted input. He notes that on real-world input branch prediction usually works well enough anyway, so there’s generally no advantage in using his branchless algorithm.

Summary: neat work but pretty useless.


If you can improve the .chars() iterator, it would be very cool. Just go for it and experiment (probably most productively out of the main tree). I suggest assuming it’s for .chars() specifically and not for generic iterators of bytes, as it is written right now more or less.


Yes, it is well-defined in C. The result of all comparison operators is always either 0 or 1 (with type int).

The zero-padding requirement is definitely a fatal flaw here, but it might be possible to get around it with sufficiently clever predicated vector instructions. In fact I think I remember reading that AArch64 has vector instructions specifically designed to process C-style strings, doing auto-alignment of the loads and generating predicate vectors to ignore bytes beyond the terminator – which would make this algorithm even better suited to AArch64.


It is. true as i32 works.


I want to expand on @dhardy’s point

The author writes

The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction.

Real text does not have this even distribution, and pursuing adding a branchless decoder to Rust is a wasted endeavor until it is proven better on a standard corpus of text. Further, the author updated his post:

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

Even with the author’s own benchmarks, he cannot beat the faster variant of the DFA decoder with clang.


I hope you all are just saying it’s wasted time because you don’t want someone to scoop you on the optimization you’re secretly worked on… :smiley:

I wrote the current chars() formulation and I want to see it beaten!