Interest in branchless UTF-8 decoder?


#1

Just saw an alternative UTF-8 decoder on HN.

Is the current UTF8 decoder optimized enough? Do you think this microoptimization can work in Rust std?


#2

That’s a really interesting piece of work. But:

  • This is very much designed around the hugely parallel architecture of x86_64 CPUs. Would it translate well to other CPUs, such as ARM?
  • It relies on the input buffer having at least three trailing null bytes. This means it couldn’t be used via a general-purpose API like in std without an expensive alloc and copy.
  • The author uses left-shift on the result of a comparison — *e = (*c < mins[len]) << 6; — is this even defined in C or is it platform-specific behaviour? Pretty sure it’s not possible in safe Rust.
  • The author notes his goal is to beat the performance of Höhrmann’s DFA-based decoder, which he does by 20% — with specially crafted input. He notes that on real-world input branch prediction usually works well enough anyway, so there’s generally no advantage in using his branchless algorithm.

Summary: neat work but pretty useless.


#3

If you can improve the .chars() iterator, it would be very cool. Just go for it and experiment (probably most productively out of the main tree). I suggest assuming it’s for .chars() specifically and not for generic iterators of bytes, as it is written right now more or less.


#4

Yes, it is well-defined in C. The result of all comparison operators is always either 0 or 1 (with type int).

The zero-padding requirement is definitely a fatal flaw here, but it might be possible to get around it with sufficiently clever predicated vector instructions. In fact I think I remember reading that AArch64 has vector instructions specifically designed to process C-style strings, doing auto-alignment of the loads and generating predicate vectors to ignore bytes beyond the terminator – which would make this algorithm even better suited to AArch64.


#5

It is. true as i32 works.


#6

I want to expand on @dhardy’s point

The author writes

The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction.

Real text does not have this even distribution, and pursuing adding a branchless decoder to Rust is a wasted endeavor until it is proven better on a standard corpus of text. Further, the author updated his post:

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

Even with the author’s own benchmarks, he cannot beat the faster variant of the DFA decoder with clang.


#7

I hope you all are just saying it’s wasted time because you don’t want someone to scoop you on the optimization you’re secretly worked on… :smiley:

I wrote the current chars() formulation and I want to see it beaten!