Benchmark optimization woes


#1

TL;DR I’m concerned about optimizing away benchmark functions, and debuginfo impact on optimization.

I noticed a couple optimization issues while benchmarking, but I’d like to get feedback before I go filing bugs. Here is the bench.rs source I’m using for discussion:

#![allow(unstable)]

extern crate test;

fn run() -> u32 {
    range(0, 1000).fold(0, |old, new| old ^ new)
}

#[bench]
fn function(b: &mut test::Bencher) {
    b.iter(run);
}

#[bench]
fn closure(b: &mut test::Bencher) {
    b.iter(|| run());
}

#[bench]
fn black_box_function(b: &mut test::Bencher) {
    fn bb() { test::black_box(run()); }
    b.iter(bb);
}

#[bench]
fn black_box_closure(b: &mut test::Bencher) {
    b.iter(|| test::black_box(run()));
}

I gathered from the testing guide that returning a value to Bencher::iter is supposed to prevent it from optimizing code away entirely, and test::black_box is another option. However, with these variants I get:

$ rustc -O --test bench.rs && ./bench --bench

running 4 tests
test black_box_closure  ... bench:       149 ns/iter (+/- 24)
test black_box_function ... bench:       148 ns/iter (+/- 13)
test closure            ... bench:       148 ns/iter (+/- 7)
test function           ... bench:         0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

So it appears that function got optimized away! It’s especially surprising to me that closure and function are different, since they are semantically similar. But I can see the difference with objdump -d, looking at the end of the timing loops:

   10530:	89 4c 24 04          	mov    %ecx,0x4(%rsp)
   10534:	48 ff c8             	dec    %rax
   10537:	75 f7                	jne    10530 <function::hc9a899682a75f808Baa+0xb0>
   10539:	e8 f2 03 06 00       	callq  70930 <time::precise_time_ns::hf6018795dbd9ef3eykn>

vs.

   10657:	48 ff c1             	inc    %rcx
   1065a:	66 0f 7e 54 24 04    	movd   %xmm2,0x4(%rsp)
   10660:	48 39 c1             	cmp    %rax,%rcx
   10663:	75 9b                	jne    10600 <closure::h289fba2f9882faccNaa+0x60>
   10665:	e8 c6 02 06 00       	callq  70930 <time::precise_time_ns::hf6018795dbd9ef3eykn>

That is, function computes the answer only once and then goes into a tight dec loop, while closure jumps all the way back to the beginning of the computation every time. So I’d guess that the optimizer figured out that the bare function is pure/const, but couldn’t tell for the equivalent closure – yet! Optimizers get smarter all the time. So does this mean that returning a value to Bencher::iter is not a reliable way to avoid optimization? Or should it do more to prevent this scenario?

The second issue I have is that adding debuginfo, yet still optimized, it’s dramatically slower than what we saw above:

$ rustc -O -g --test bench.rs && ./bench --bench

running 4 tests
test black_box_closure  ... bench:      1376 ns/iter (+/- 43)
test black_box_function ... bench:      1340 ns/iter (+/- 68)
test closure            ... bench:      1313 ns/iter (+/- 24)
test function           ... bench:         0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Again peering at disassembly, before it had vectorized the loop into SSE, but once -g is added it only uses plain registers and operations. I strongly believe that debuginfo should have NO effect on code generation, or else release-with-debug builds aren’t nearly as useful. For comparison, gcc has the -fcompare-debug option to compile with and without, and it raises an error if there’s any difference in IR.

All of this output is from x86_64 Linux using rustc 1.0.0-nightly (4e4e8cff1 2015-01-24 22:14:14 +0000).

Comments welcome!


Let's talk about parallel codegen
#2

My understanding is test::black_box is the supported way to inhibit optimization for benchmarking, and Bencher::iter is not.


#3

Well, note that Bencher::iter is actually using black_box itself:

pub fn iter<T, F>(&mut self, mut inner: F) where F: FnMut() -> T {
    self.dur = Duration::span(|| {
        let k = self.iterations;
        for _ in range(0u64, k) {
            black_box(inner());
        }
    });
}

And yet it still figures out that the value from my function never changes throughout that iteration!


#4

The black_box is only being applied to the output value, not any inputs, i.e. the compiler is absolutely correct in seeing that that final value never changes, it is allowed to compute it once and then feed it into the black_box repeatedly. To avoid this you need to make parts of the input opaque (or otherwise unpredictable), e.g. range(0, test::black_box(1000)).fold(0, ...). Applying that fix gives:

test black_box_closure  ... bench:       133 ns/iter (+/- 22)
test black_box_function ... bench:       133 ns/iter (+/- 7)
test closure            ... bench:       134 ns/iter (+/- 9)
test function           ... bench:       136 ns/iter (+/- 3)

As for the difference between function and closure in the original code… well the optimiser is somewhat of a black box, and slight differences in the unoptimised code result in sadness. It’s probably worth filing a bug about this, “passing function as closure optimises better than equivalent manual closure” to see if we can improve the handling of the closure.

Accepting a return value is aiming to fix one thing: (parts of) the body of the iter being removed by dead code elimination because the compiler sees that it isn’t ever used (if it has no sideeffects).

This is a bug with LLVM, not rustc itself: it doesn’t have good support for debuginfo + optimisations. Definitely something that would be useful, though.

Incidentally, rustc supports the --emit=asm to dump out assembly directly.


#5

Why not change Bencher::iter to use black_box(inner)() (or whatever version of that compiles) then?


#6

Interesting; I suspect that may improve some situations (especially if used inside and out: black_box(black_box(inner)())), but the type-based approach of how Rust implements closures and functions may mean it doesn’t always work, the compiler still sees that the same code is executed.

(In fact, I would think it doesn’t change the case here, but I haven’t tested it.)


#7

Ok, I see. That’s tough, because now we’ve convinced the compiler that this code isn’t dead, but it decided to only run it once, which is still useless for benchmarking purposes.

Putting a black_box on some input is a reasonable hack though. If you didn’t notice, this range snippet is the example used in TRPL’s testing section. Maybe that should be expanded to demonstrate black_boxed inputs too.

The exact types could be defeated by casting inner into a Box<FnMut>, no? I guess that will prevent it from ever being inlined into the loop. Not sure if an indirect call is desirable here, but maybe that’s relatively small enough (and easily cpu-predicted) that it would be ok.


#8

I’m under the impression that it does? The last code block in the testing page is

extern crate test;

b.iter(|| {
    let mut n = 1000_u32;

    test::black_box(&mut n); // pretend to modify `n`

    range(0, n).fold(0, |a, b| a ^ b)
})

Yes, that would work, but, as you say, it would introduce a virtual call; maybe it’s not so bad.


#9

Oh, I see. The description just didn’t make it clear to me what’s going on. The part before that demonstrates returning the output to iter, then it says, “Or, the other option is …”. Really the second part is combining effects, still returning the output value but also applying black_box to one of the inputs, so the optimizer will consider it a unique computation every time.


#10

Oh, that seems like a bug in the docs; I didn’t notice the second last code block. :smile: