Adding a (predictable) branch to existing code can increase branch mispredictions

In many Rust-related discussions about safety checks I've often heard "it's just an always-taken branch, so it shouldn't be a problem", so this is an interesting result: a predictable branch may still cause mispredictions by affecting predictions of other branches.

1 Like

Given that Daniel Lemire has been using icc in some micro-optimization studies before, and assuming that he did it again here, I have one possible alternate explanation for the observed effect: icc generates completely different machine code for the first and second versions of the code, and in that situation comparisons of relative CPU branch prediction rates are meaningless because the absolute number of branches is completely different in the two codes being compared.

Let me rant about ICC for a second

Sadly, such ASM instability is to be expected. The Intel compiler is well-known in number-crunching circles to be dishonestly optimized towards winning micro-benchmarking contests, even when that means copy-pasting hand-tuned assembly for simple loops (like this compute-and-conditional one) or breaking the "as-if" rule of bitwise-perfect floating point computation reproducibility which one normally expects of modern C compilers that use SSE over the x87 FPU, in absence of fast-math flags.

Even if Daniel used GCC as he often does, it also generates slightly different assembly for the two loops. The difference is less egregious though (it simply unrolls the first iteration of the loop), and for a large enough number of loop iterations it should be inconsequential.

As a general comment, in this situation where the number of branches varies, "absolute" metrics like number of correctly predicted and mispredicted branches per iteration are easier to reason about than relative metrics like total rate of mispredicted branches, and I would therefore recommend using the latter when microbenchmarking to spot that sort of mistake more easily.

In this particular case, the reason why I'm suspecting a benchmark flaw is that the 4% x86 branch misprediction rate on the initial loop is very fishy, as it means either that the chosen pseudo-random number generator is so bad at pseudorandomness that a CPU's branch predictor can reliably predict the low-order bit of its output, or that something is wrong with the microbenchmark itself.

OTOH, the ARM results are more reasonable (25% misprediction rate is exactly what one would theoretically expect from a loop with one random conditional and one predictable loop branch), and it is a well-known fact that embedded chips cannot afford power-hungry features like sophisticated branch predictors as much as x86 chips do. Therefore, Daniel might genuinely be onto something here.

Now, a less arcane reason to worry about always-taken branches, which you can mention in such discussion threads when the occasion comes up, is that they mess with autovectorization. This can cost you a CPU performance factor of 2x to 64x when doing intensive number-crunching.


Ah, scratch that, I've been studying the source code published by Daniel Lemire alongside the blog post further and it does count absolute branch rates as I recommended above.

So if that table in the blog post is just the averaged output of that program, in the configuration where it was provided and without any further postprocessing, then I have no idea why it was expressed as a percentage as if it were a fraction of mispredicted branches. But it's actually an absolute missed branch per loop iteration counter, so the part of my analysis above concerning changes in absolute number of branches when using different compilers making relative branch rates meaningless, is incorrect.

However, in that case, it becomes worrying that the output of that program is complete noise on my machine, with both loops producing the same branch misprediction count output within (large) statistical error. Increasing the amount of benchmark iterations until the benchmark output stabilizes yields exactly 0.5 failed branch per loop iteration in both examples on the GCC-compiled program.

Further, my other point about 4% branch misprediction rate on pseudorandom data being fishy stands. If you get that kind of result, it means that either your pseudorandom generator fails hard at producing a pseudorandom low-order bit, or your CPU is cheating hard at the benchmark by remembering all pseudorandom outputs across benchmark runs (after all, 2000 trials is "only" 2 kbit of state...). No matter which one it is, I remain unconvinced that this benchmark's output is meaningful and that one can infer something about the behaviour of realistic software from it.