Today I observed this discussion on PR https://github.com/rust-lang/rust/pull/59887:
> **Zoxc:** This seems to be a slight performance regression.
>
> **nnethercote:** How so? Pre-landing measurements indicated it was a slight improvement.
>
> **Zoxc:** Instruction count doesn’t matter =P
So, I wanted to clarify with the people making performance changes to the compiler and running benchmarks: what are we actually measuring, and why?
perf.rust-lang.org currently has multiple metrics, collected by `perf`:

- `wall-time` - this is what we are ultimately aiming for, I think? People should wait less for compilation to complete. However, this metric is probably hard to use due to high fluctuation (?).
- `instructions:u` - instructions executed by our process (but not by the operating system). On x86, (macro-)instructions are a relatively high-level language that is interpreted by the CPU front-end and converted into lower-level uops, which are what actually execute on the rest of the CPU, in parallel fashion. The mapping to uops is certainly not 1-to-1; moreover, even uops can take different numbers of cycles to execute, both statically and depending on the state of the various caches and the branch predictor.
- `cycles:u` - cycles required to execute the instructions from our process (but not the operating system). Time as measured by the CPU.
- `max-rss` - memory; we are not talking about it right now.
- `faults` - page faults; the memory page is not yet known to the MMU, so OS intervention is needed. A pretty specialized metric.
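
For context, this is roughly how such counters can be read for a single compiler invocation with Linux `perf` on any machine. This is only an illustrative sketch: the source file and event list are made up for the example, and the actual perf.rust-lang.org collection setup may differ.

```console
$ perf stat -e instructions:u,cycles:u,task-clock,page-faults \
    rustc -O hello.rs
```

The `:u` modifier restricts counting to user space, which is the "our process, but not the operating system" distinction above.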
What we need to measure depends heavily on the PR in question.

- The PR changes IO behavior, or threading behavior, or some other behavior that significantly changes how often we jump into OS code (e.g. page faults or something). In this case the CPU time (`instructions:u`/`cycles:u`/anything `:u`) goes out of the window and we need to use something else.
- The PR changes memory access patterns heavily, affecting the behavior of caches. We probably need `cycles:u` rather than `instructions:u` in this case? (See the sketch after this list.)
- The PR just changes something computational. We probably need `cycles:u` here as well?
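
To illustrate the cache point above (a purely hypothetical sketch, unrelated to the PR in question): the two loops below do similar per-element work, but the second one touches memory in a cache-hostile order. Run under `perf stat -e instructions:u,cycles:u`, the instruction counts should stay close while the cycle counts (and wall-time) diverge. The array size and stride are arbitrary values chosen for the example.

```rust
use std::hint::black_box;
use std::time::Instant;

// Sum the same 16M-element vector twice: once in order (cache-friendly),
// once with a large stride (cache-hostile). Per-element work is similar,
// so instructions:u stays close between the two loops, while cycles:u
// and wall-time usually diverge because of cache misses.
fn main() {
    const N: usize = 1 << 24;
    const STRIDE: usize = 4096;
    let data: Vec<u64> = (0..N as u64).collect();

    let t = Instant::now();
    let mut seq_sum = 0u64;
    let mut i = 0;
    while i < N {
        // black_box discourages vectorization so both loops stay comparable.
        seq_sum = seq_sum.wrapping_add(black_box(data[i]));
        i += 1;
    }
    println!("sequential: sum={}, time={:?}", seq_sum, t.elapsed());

    let t = Instant::now();
    let mut strided_sum = 0u64;
    for start in 0..STRIDE {
        // Visit every element exactly once, jumping STRIDE slots at a time.
        let mut i = start;
        while i < N {
            strided_sum = strided_sum.wrapping_add(black_box(data[i]));
            i += STRIDE;
        }
    }
    println!("strided:    sum={}, time={:?}", strided_sum, t.elapsed());
}
```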
What we need to measure also depends on which target we are optimizing for.

- A "generic" target - it may be x86, or ARM, or WASM, or whatever. For this, I guess we need something like the number of optimized LLVM IR instructions before they enter the target-specific LLVM backend.
- Most of the compiler runs are probably going to be on x86_64 though, so we should probably optimize for that. Ok, x86_64 it is, but which microarchitecture? Let's optimize for "generic" x86_64. The exact mapping from `instructions:u` to `cycles:u` will depend on the x86 microarchitecture (I'm not even sure which one the benchmark machine uses). This is the only case where I think `instructions:u` makes sense as the metric?
- Let's optimize for x86_64/Haswell-or-something. AFAIK, Intel microarchitectures haven't changed that much in recent years; not sure about AMD. Then we probably need `cycles:u` on that specific microarchitecture.
- No specific conclusions, besides that `cycles:u` should probably be made the default.