TL;DR: I’m concerned about benchmark functions being optimized away, and about debuginfo’s impact on optimization.

I noticed a couple of optimization issues while benchmarking, but I’d like to get feedback before I go filing bugs. Here is the `bench.rs` source I’m using for discussion:
```rust
#![allow(unstable)]
extern crate test;

fn run() -> u32 {
    range(0, 1000).fold(0, |old, new| old ^ new)
}

#[bench]
fn function(b: &mut test::Bencher) {
    b.iter(run);
}

#[bench]
fn closure(b: &mut test::Bencher) {
    b.iter(|| run());
}

#[bench]
fn black_box_function(b: &mut test::Bencher) {
    fn bb() { test::black_box(run()); }
    b.iter(bb);
}

#[bench]
fn black_box_closure(b: &mut test::Bencher) {
    b.iter(|| test::black_box(run()));
}
```
I gathered from the testing guide that returning a value to `Bencher::iter` is supposed to prevent it from optimizing the code away entirely, and that `test::black_box` is another option. However, with these variants I get:
```
$ rustc -O --test bench.rs && ./bench --bench

running 4 tests
test black_box_closure  ... bench: 149 ns/iter (+/- 24)
test black_box_function ... bench: 148 ns/iter (+/- 13)
test closure            ... bench: 148 ns/iter (+/- 7)
test function           ... bench: 0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
```
So it appears that `function` got optimized away! It’s especially surprising to me that `closure` and `function` are different, since they are semantically similar. But I can see the difference with `objdump -d`, looking at the end of the timing loops:
```
10530:  89 4c 24 04             mov    %ecx,0x4(%rsp)
10534:  48 ff c8                dec    %rax
10537:  75 f7                   jne    10530 <function::hc9a899682a75f808Baa+0xb0>
10539:  e8 f2 03 06 00          callq  70930 <time::precise_time_ns::hf6018795dbd9ef3eykn>
```

vs.

```
10657:  48 ff c1                inc    %rcx
1065a:  66 0f 7e 54 24 04       movd   %xmm2,0x4(%rsp)
10660:  48 39 c1                cmp    %rax,%rcx
10663:  75 9b                   jne    10600 <closure::h289fba2f9882faccNaa+0x60>
10665:  e8 c6 02 06 00          callq  70930 <time::precise_time_ns::hf6018795dbd9ef3eykn>
```
That is, `function` computes the answer only once and then spins in a tight `dec` loop, while `closure` jumps all the way back to the beginning of the computation every time. So I’d guess that the optimizer figured out that the bare function is pure/const, but couldn’t tell the same for the equivalent closure, at least not yet; optimizers get smarter all the time. So does this mean that returning a value to `Bencher::iter` is not a reliable way to avoid optimization? Or should it do more to prevent this scenario?
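For what it’s worth, the pattern that seems most robust to me is to black-box the computation’s *input* as well as its result, so the optimizer cannot treat the whole thing as a compile-time constant. Here’s a minimal standalone sketch of the idea, outside the benchmark harness; it uses `std::hint::black_box` standing in for `test::black_box`, and the `limit` variable is just my illustration, not part of the benchmarks above:

```rust
use std::hint::black_box;

fn main() {
    // Hiding the loop bound behind black_box means the optimizer cannot
    // constant-fold the fold; hiding the result means it cannot delete
    // the loop as dead code, even if it proves the loop body pure.
    let limit = black_box(1000u32);
    let result = (0..limit).fold(0, |old, new| old ^ new);
    println!("{}", black_box(result));
    // Prints 0: XORing 0 through 999 happens to cancel out completely.
}
```

Even if a future optimizer proves the fold pure, it still has to execute it each iteration, because the bound is opaque.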
The second issue I have is that adding debuginfo, while still optimizing, makes things dramatically slower than what we saw above:
```
$ rustc -O -g --test bench.rs && ./bench --bench

running 4 tests
test black_box_closure  ... bench: 1376 ns/iter (+/- 43)
test black_box_function ... bench: 1340 ns/iter (+/- 68)
test closure            ... bench: 1313 ns/iter (+/- 24)
test function           ... bench: 0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
```
Again peering at the disassembly: before, the loop had been vectorized into SSE, but once `-g` is added it only uses plain registers and operations. I strongly believe that debuginfo should have NO effect on code generation; otherwise release-with-debug builds aren’t nearly as useful. For comparison, gcc has the `-fcompare-debug` option to compile both with and without debuginfo, and it raises an error if there’s any difference in the IR.
All of this output is from x86_64 Linux using `rustc 1.0.0-nightly (4e4e8cff1 2015-01-24 22:14:14 +0000)`.
Comments welcome!