Just saw an alternative UTF-8 decoder on HN.
Is the current UTF8 decoder optimized enough? Do you think this microoptimization can work in Rust std?
That's a really interesting piece of work. But:

- The input must be zero-padded, which you can't get from `std` without an expensive alloc and copy.
- `*e = (*c < mins[len]) << 6;` -- is this even defined in C, or is it platform-specific behaviour? Pretty sure it's not possible in safe Rust.

Summary: neat work, but pretty useless.
If you can improve the `.chars()` iterator, it would be very cool. Just go for it and experiment (probably most productively out of the main tree). I suggest assuming it's for `.chars()` specifically and not for generic iterators of bytes, which is more or less how it is written right now.
Yes, it is well-defined in C. The result of all comparison operators is always either 0 or 1 (with type `int`).
The zero-padding requirement is definitely a fatal flaw here, but it might be possible to get around it with sufficiently clever predicated vector instructions. In fact I think I remember reading that AArch64 has vector instructions specifically designed to process C-style strings, doing auto-alignment of the loads and generating predicate vectors to ignore bytes beyond the terminator -- which would make this algorithm even better suited to AArch64.
It is. `true as i32` works.
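To spell that out, here is a small safe-Rust sketch of the same line. The names mirror the C snippet; the `mins` table below uses the standard UTF-8 minimum code point per sequence length (to detect overlong encodings), which is an assumption and not necessarily the author's exact table:

```rust
fn main() {
    // Safe-Rust rendering of the C line `*e = (*c < mins[len]) << 6;`.
    // Index 0 is a sentinel for invalid lengths (placeholder value).
    let mins: [u32; 5] = [u32::MAX, 0, 0x80, 0x800, 0x1_0000];
    let len = 2;
    let c: u32 = 0x7F; // an overlong candidate for a 2-byte sequence
    // `bool as u32` is guaranteed to be 0 or 1, just like the C comparison.
    let e = ((c < mins[len]) as u32) << 6;
    assert_eq!(e, 64);
    println!("{e}"); // prints 64
}
```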
I want to expand on @dhardy's point. The author writes:

> The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction.
Real text does not have this even distribution, and pursuing the addition of a branchless decoder to Rust is a wasted endeavor until it is proven better on a standard corpus of text. Further, the author updated his post:

> Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it's 20% faster than the branchless decoder with Clang. So, in a sense, it's still faster on average, even on a benchmark that favors a branchless decoder.

Even on the author's own benchmarks, he cannot beat the faster variant of the DFA decoder with Clang.
I hope you all are just saying it's wasted time because you don't want someone to scoop you on the optimization you're secretly working on…
I wrote the current `chars()` formulation and I want to see it beaten!
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.