Problem running rustc-perf on local machine

I am doing some experimentation with discriminants and want to run a lot of perf-runs without bothering people in a PR.

The problem is that I get wildly different results on my machine than on https://perf.rust-lang.org/

I have an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, nothing unusual.

My results are of low variance between runs on my local machine.

Basically my workflow is:

  1. ./x.py build --stage 2
  2. cd ~/git/rustc-perf
  3. ./target/release/collector bench_local ~/git/rust/build/x86_64-unknown-linux-gnu/stage2/bin/rustc <COMMIT_HASH>

The results are different enough that they are worthless for evaluating any changes. Any ideas?

Any time you're measuring perf, absolute measurements don't really matter. What matters is relative changes in measurements.

(instcount measurements flatten this rule somewhat, but I think it still applies.)

So if your local measurements have the same "shape" as the reference measurements (that is, relative measurements between benchmarks are roughly the same), I wouldn't worry about the absolute measurements being different locally. Instead, capture your reference baseline locally, then compare against that.
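A minimal sketch of that kind of comparison, assuming you have already collected one instruction count per benchmark for a local baseline build and for a local build with your change. The benchmark names and numbers here are made up for illustration, and `relative_changes` is a hypothetical helper, not part of rustc-perf:

```python
def relative_changes(baseline, changed):
    """Return the percent change per benchmark present in both runs."""
    return {
        name: (changed[name] - baseline[name]) / baseline[name] * 100.0
        for name in baseline
        if name in changed
    }


# Hypothetical instruction counts, not real measurements.
baseline = {"bench_a": 1_000_000, "bench_b": 2_000_000}
changed = {"bench_a": 990_000, "bench_b": 2_020_000}

for name, pct in sorted(relative_changes(baseline, changed).items()):
    print(f"{name}: {pct:+.2f}%")
```

Both runs come from the same machine and the same build configuration, so the absolute offset from the reference numbers cancels out of the percentages.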

Thanks for your answer. The problem is that the relative comparison is different.

I am comparing how many benchmarks show an improved instruction count for UPSTREAM_COMMIT vs MY_CHANGED_COMMIT.
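As a rough illustration of that count, here is a sketch that buckets benchmarks into improved, regressed, and unchanged. It assumes one instruction count per benchmark for each commit. The 0.2% noise threshold is an arbitrary assumption for the example, and `classify` is a hypothetical helper, not something rustc-perf provides:

```python
def classify(upstream, mine, threshold_pct=0.2):
    """Count benchmarks whose instruction count moved by at least threshold_pct."""
    counts = {"improved": 0, "regressed": 0, "unchanged": 0}
    for name, base in upstream.items():
        if name not in mine:
            continue  # skip benchmarks missing from one of the runs
        pct = (mine[name] - base) / base * 100.0
        if pct <= -threshold_pct:
            counts["improved"] += 1  # fewer instructions than upstream
        elif pct >= threshold_pct:
            counts["regressed"] += 1  # more instructions than upstream
        else:
            counts["unchanged"] += 1  # within the assumed noise band
    return counts


# Hypothetical instruction counts, not real measurements.
upstream = {"bench_a": 1_000_000, "bench_b": 2_000_000, "bench_c": 500_000}
mine = {"bench_a": 990_000, "bench_b": 2_020_000, "bench_c": 500_100}

print(classify(upstream, mine))
```

Running the same function on the numbers from the local runs and on those from perf.rust-lang.org makes it easier to see whether the two disagree on direction or only on magnitude.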

I will try and do a reference run for an existing perf-run and report back.

perf.rust-lang.org benchmarks the builds produced by rustc's CI, specifically for the x86_64-unknown-linux-gnu target. For that target, our CI currently does several things that a local build may not match (and likely won't by default); I think this is a mostly complete list:

  • PGO for rustc
  • ThinLTO + PGO for LLVM (if you use download-ci-llvm = true on x86_64-unknown-linux-gnu, you likely get most of the benefits here)
  • std is built with codegen-units=1

All of these will definitely make perf's instruction counts and absolute numbers differ from what you see locally. I wouldn't try to reproduce the above locally -- local benchmarking, particularly e.g. with cachegrind (which is less sensitive to environmental differences and noise), should give a fairly decent proxy for what you'll see as a relative change on perf. This is not always true -- for example, PGO can mean that your loop/condition reordering or whatever was already applied by LLVM -- but in the general case, locally you should be able to reproduce results fairly well. If you can't, we may not be able to do anything but we'd like to hear about it -- feel free to drop by #t-compiler/performance on Zulip and ask questions if something isn't working as you expect.

Thanks for your great answer. I will repeat the local benchmark for my next PR and see if I have something concrete to report.