I came across this post from a while back and decided to run with it. It started out as a quest to see whether dynamic dispatch really has that large an overhead, but I ended up finding some weird behavior when benchmarking generic functions: they do have an overhead at runtime. My source is at this playground and my bench results are below.
running 7 tests
test tests::set_baseline ... bench: 84 ns/iter (+/- 9)
test tests::using_enum_dispatch ... bench: 82 ns/iter (+/- 7)
test tests::using_boxed_enums ... bench: 81 ns/iter (+/- 6)
test tests::using_generics_without_dispatch ... bench: 158 ns/iter (+/- 14)
test tests::using_bounded_generics ... bench: 160 ns/iter (+/- 25)
test tests::using_impltrait_generics ... bench: 159 ns/iter (+/- 17)
test tests::using_dynamic_dispatch ... bench: 464 ns/iter (+/- 43)
I suspect the placement of black_box is the difference in these benchmarks, as it will inhibit optimizations. In the enum cases, black_box surrounds the entire match, but in the generic cases black_box is on each arm.
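A minimal sketch of why the placement matters, using a hypothetical `work` function in place of the playground's `e_impl`: a `black_box` on each result forces every call to be materialized per iteration, while a single `black_box` around the accumulated sum leaves the optimizer free to inline, fold, or vectorize the whole loop first.

```rust
use std::hint::black_box; // stable equivalent of test::black_box

// Hypothetical stand-in for e_impl.
fn work(x: u64) -> u64 {
    x * 2 + 1
}

// black_box on each arm's result, as in the generic benches:
// every call result must be materialized.
fn sum_per_arm(xs: &[u64]) -> u64 {
    xs.iter().map(|&x| black_box(work(x))).sum()
}

// black_box around the whole accumulated result, as in the enum benches:
// the loop itself can still be optimized freely.
fn sum_whole(xs: &[u64]) -> u64 {
    black_box(xs.iter().map(|&x| work(x)).sum())
}

fn main() {
    let xs = [1u64, 2, 3];
    // Same value either way; only the generated code differs.
    assert_eq!(sum_per_arm(&xs), sum_whole(&xs));
    println!("{}", sum_whole(&xs));
}
```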
#[bench]
fn using_impltrait_generics(b: &mut Bencher) {
    let mut xs: Vec<F> = Vec::new();
    for _ in 0..N {
        xs.push(F::B(B));
        xs.push(F::C(C));
        xs.push(F::D(D));
    }
    assert_eq!(xs.len(), N * 3);
    b.iter(|| {
        let mut sum = 0;
        for x in &xs {
            sum += black_box((|| {
                match x {
                    F::B(a) => e_impl(a),
                    F::C(c) => e_impl(c),
                    F::D(d) => e_impl(d),
                }
            })());
        }
        sum
    });
}
With black_box wrapped around the whole match like this, the generic case matches the baseline:
test tests::set_baseline ... bench: 83 ns/iter (+/- 13)
test tests::using_impltrait_generics ... bench: 82 ns/iter (+/- 21)
Weird, I've seen that --emit=asm hack mentioned a few times lately. If that's just to reduce parallel codegen, you can more directly use -C codegen-units=1. This can also be set in your manifest's [profile.*] sections.
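For reference, the manifest form of that setting looks something like this (the bench profile inherits from release):

```toml
# Cargo.toml
[profile.release]
codegen-units = 1
```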
Also, not sure if you're already taking this into account, but the fact that two of the tests use Boxes kind of muddles the results: those tests will encounter many more cache misses than the baseline tests, since boxes are scattered across the heap while a vec of enums occupies one contiguous slice.
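A quick sketch of that layout difference (hypothetical names, not the playground's): a `Vec` of enums stores every payload inline and back-to-back, while a `Vec` of boxes stores only pointers, with each payload in its own heap allocation that may land anywhere.

```rust
use std::mem;

// Hypothetical enum standing in for the benchmark's F.
enum F {
    B(u32),
    C(u32),
    D(u32),
}

fn main() {
    // Inline storage: adjacent elements are exactly size_of::<F>() apart.
    let inline: Vec<F> = vec![F::B(1), F::C(2), F::D(3)];
    let a = &inline[0] as *const F as usize;
    let b = &inline[1] as *const F as usize;
    assert_eq!(b - a, mem::size_of::<F>());

    // Boxed storage: the vec holds pointers (one usize each); the
    // pointed-to values need not be anywhere near each other.
    let boxed: Vec<Box<F>> = vec![Box::new(F::B(1)), Box::new(F::C(2))];
    assert_eq!(mem::size_of_val(&boxed[0]), mem::size_of::<usize>());
}
```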
@cuviper
Neat manifest trick, I'll keep that in mind.
@PoignardAzur
That's actually what I initially wanted to test: how much of the dyn dispatch slowness comes from cache misses, and how much from the dispatch itself. If you look at the test results, I think this program is small enough to fit entirely in my cache: the boxed dispatch and enum dispatch are the same speed, while dynamic dispatch is slower.
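For anyone following along, the two dispatch styles being compared look roughly like this (illustrative names, not the playground's): enum dispatch branches on a statically known set of variants, so each arm's callee can be inlined, while dyn dispatch goes through a vtable pointer on every call.

```rust
trait Speak {
    fn speak(&self) -> u64;
}

struct B;
impl Speak for B {
    fn speak(&self) -> u64 { 1 }
}
struct C;
impl Speak for C {
    fn speak(&self) -> u64 { 2 }
}

enum E {
    B(B),
    C(C),
}
impl E {
    // Enum dispatch: a match over a closed set of variants; each arm's
    // callee is statically known and inlinable.
    fn speak(&self) -> u64 {
        match self {
            E::B(b) => b.speak(),
            E::C(c) => c.speak(),
        }
    }
}

fn main() {
    let by_enum: Vec<E> = vec![E::B(B), E::C(C)];
    // Dyn dispatch: each call is an indirect jump through the vtable
    // stored in the fat pointer; the compiler can't devirtualize it here.
    let by_dyn: Vec<Box<dyn Speak>> = vec![Box::new(B), Box::new(C)];

    let s1: u64 = by_enum.iter().map(|e| e.speak()).sum();
    let s2: u64 = by_dyn.iter().map(|d| d.speak()).sum();
    assert_eq!(s1, s2);
    println!("{}", s1);
}
```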