What is perf.rust-lang.org measuring and why is "instructions:u" the default?

I have strong opinions on this topic.

Wall time is the ultimate metric. Unfortunately, it has high variance, which makes it very hard to use in practice. Small improvements and regressions (e.g. < 1%) are difficult to detect reliably with wall time.

Instruction counts are a high-quality proxy for wall time. The correlation between instruction counts and wall time is high. Instruction counts also have low variance. Very small improvements and regressions (e.g. 0.3%) can be reliably detected with instruction counts.

Cycles are an even better proxy for wall time than instructions, but they also have high variance. That leaves them with no real advantage over wall time, so they don't have much value.

Let's look at some examples.

  • Measurements for a tiny change that should not have affected performance. For instruction counts, the outliers are +1.3% and -1.1%, and the vast majority are inside the range -0.2% to +0.2%. For wall time, the outliers are -8.9% and +9.7%, and many are outside the range -1.0% to +1.0%.
  • Measurements for a large improvement. This improvement is large enough that it's clear even from the wall time measurements, though the instruction counts are clearly more reliable. The instruction counts underestimate the wall time improvements somewhat.
  • And just look at the graphs at perf.rust-lang.org. The instruction count graphs are basically straight lines. The wall time graphs bubble up and down significantly.
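To make the noise argument concrete, here is a minimal sketch of how one might compare the spread of the two metrics for a change that should be a no-op. The function names (percent_change, fraction_within, exceeds_noise) and all of the numbers are hypothetical and purely illustrative; they are not taken from perf.rust-lang.org or its tooling.

```rust
/// Relative change from `before` to `after`, in percent.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Fraction of changes whose magnitude is at most `band` percent.
/// Returns 0.0 for an empty slice.
fn fraction_within(changes: &[f64], band: f64) -> f64 {
    if changes.is_empty() {
        return 0.0;
    }
    let inside = changes.iter().filter(|c| c.abs() <= band).count();
    inside as f64 / changes.len() as f64
}

/// A measured change is only worth trusting if it is larger than
/// the noise band observed for changes that should do nothing.
fn exceeds_noise(change: f64, band: f64) -> bool {
    change.abs() > band
}

fn main() {
    // Hypothetical (before, after) pairs for a no-op change.
    // Made-up values, chosen only to illustrate the calculation.
    let instruction_counts = [
        (1_000_000.0, 1_001_000.0),
        (2_000_000.0, 1_997_000.0),
        (500_000.0, 500_500.0),
    ];
    let wall_times_secs = [(2.00, 2.10), (1.50, 1.47), (3.00, 3.01)];

    let instr: Vec<f64> = instruction_counts
        .iter()
        .map(|&(b, a)| percent_change(b, a))
        .collect();
    let wall: Vec<f64> = wall_times_secs
        .iter()
        .map(|&(b, a)| percent_change(b, a))
        .collect();

    // Share of benchmarks that stay inside a +/-0.2% band.
    println!("instructions within 0.2%: {:.2}", fraction_within(&instr, 0.2));
    println!("wall time within 0.2%:    {:.2}", fraction_within(&wall, 0.2));

    // Would a 0.3% improvement stand out against a given noise band?
    println!("0.3% vs 0.2% band: {}", exceeds_noise(-0.3, 0.2));
    println!("0.3% vs 1.0% band: {}", exceeds_noise(-0.3, 1.0));
}
```

The point of the sketch is only that the same 0.3% change is a signal against a narrow noise band and invisible against a wide one.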

I am well aware of the ways that instruction counts and wall time can diverge: cache misses, branch mispredictions, slow instructions such as integer division, etc. However, in my experience, significant divergence is extremely rare in practice. I say this as the author and long-term user of a profiling tool (Cachegrind) that measures instruction counts and simulates cache misses and branch mispredictions. Back in 2011 I wrote about using Cachegrind on SpiderMonkey:

"it counts instructions, memory accesses, etc, rather than time. When making a lot of very small improvements, noise variations often swamp the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful."

For all of the above reasons, I think instruction counts should stay as the default metric. It is certainly worth looking at wall time (and possibly the other metrics) as a sanity check, and I regularly do that.
