I want to expand on @dhardy’s point.
The author writes:
“The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction.”
Real text does not have this even distribution, so adding a branchless decoder to Rust is wasted effort until it is shown to be faster on a standard corpus of text. Further, the author updated his post:
“Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.”
Even on the author’s own benchmark, the branchless decoder cannot beat the faster variant of the DFA decoder when compiled with Clang.
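To make the distribution point concrete, here is a minimal sketch of how one could check how skewed a given text is before trusting any benchmark built on it. The function name `utf8_length_histogram` is my own and is not taken from the author's post or from any decoder under discussion; it only uses `str::chars` and `char::len_utf8` from the standard library.

```rust
/// Count how many code points in `text` encode to 1, 2, 3, and 4 bytes
/// in UTF-8. Index 0 holds the 1-byte (ASCII) count, index 3 the 4-byte count.
/// Hypothetical helper for illustration; not part of any benchmark discussed here.
fn utf8_length_histogram(text: &str) -> [usize; 4] {
    let mut counts = [0usize; 4];
    for c in text.chars() {
        // `len_utf8` is always in 1..=4, so the index is in bounds.
        counts[c.len_utf8() - 1] += 1;
    }
    counts
}
```

Calling `utf8_length_histogram("aé€🦀")` gives `[1, 1, 1, 1]`, which is the kind of even mix the author's benchmark feeds the decoders. Running it over a real corpus would show whether the lengths there are anywhere near that even, and so how much of the branch-prediction penalty a conventional decoder would actually pay on it.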