Generics are not zero cost?

Hey compiler team,

I came across this post from a while back and decided to take off with it. It started out as a quest to see whether the dynamic dispatch really had a that huge of an overhead but I ended up finding some weird behavior when benchmarking generic functions. They do have an overhead at runtime. My source is at this playground and my bench results are below.

running 7 tests
test tests::set_baseline                    ... bench:          84 ns/iter (+/- 9)
test tests::using_enum_dispatch             ... bench:          82 ns/iter (+/- 7)
test tests::using_boxed_enums               ... bench:          81 ns/iter (+/- 6)
test tests::using_generics_without_dispatch ... bench:         158 ns/iter (+/- 14)
test tests::using_bounded_generics          ... bench:         160 ns/iter (+/- 25)
test tests::using_impltrait_generics        ... bench:         159 ns/iter (+/- 17)
test tests::using_dynamic_dispatch          ... bench:         464 ns/iter (+/- 43)

I suspect the placement of black_box is the difference in these benchmarks, as it will inhibit optimizations. In the enum cases, the black_box is surrounding the entire match, but in the generic cases you have the black_box on each arm.

1 Like

Hmm, would that be the same case for set_baseline? I compiled with --emit=asm and all my runtimes increased except the enum ones.

rsnakard-r7 $ RUSTFLAGS="--emit=asm" cargo bench
    Finished bench [optimized] target(s) in 0.03s
     Running target/release/deps/bench-0bbc44ddd0aaa06c

running 7 tests
test tests::set_baseline                    ... bench:         154 ns/iter (+/- 32)
test tests::using_enum_dispatch             ... bench:          81 ns/iter (+/- 10)
test tests::using_boxed_enums               ... bench:          81 ns/iter (+/- 7)
test tests::using_generics_without_dispatch ... bench:         289 ns/iter (+/- 22)
test tests::using_bounded_generics          ... bench:         286 ns/iter (+/- 49)
test tests::using_impltrait_generics        ... bench:         280 ns/iter (+/- 80)
test tests::using_dynamic_dispatch          ... bench:         461 ns/iter (+/- 60)

That was it, thanks Josh.

#[bench]
fn using_impltrait_generics(b: &mut Bencher) {
    let mut xs: Vec<F> = Vec::new();
    for _ in 0..N {
        xs.push(F::B(B));
        xs.push(F::C(C));
        xs.push(F::D(D));
    }
    assert_eq!(xs.len(), N * 3);
    b.iter(|| {
        let mut sum = 0;
        for x in &xs {
            sum += black_box( (|| { match x {
                F::B(a) => e_impl(a),
                F::C(c) => e_impl(c),
                F::D(d) => e_impl(d),
            }})());
        }
        sum
    });
}

.

test tests::set_baseline                    ... bench:          83 ns/iter (+/- 13)
test tests::using_impltrait_generics        ... bench:          82 ns/iter (+/- 21)

And -emit=asm seems to be a coin toss

rsnakard-r7 $ RUSTFLAGS="--emit=asm" cargo bench
    Finished bench [optimized] target(s) in 0.5s
     Running target/release/deps/bench-0bbc44ddd0aaa06c

running 7 tests
test tests::set_baseline                    ... bench:          85 ns/iter (+/- 12)
test tests::using_bounded_generics          ... bench:          84 ns/iter (+/- 7)
test tests::using_boxed_enums               ... bench:         159 ns/iter (+/- 20)
test tests::using_dynamic_dispatch          ... bench:         457 ns/iter (+/- 51)
test tests::using_enum_dispatch             ... bench:         154 ns/iter (+/- 18)
test tests::using_generics_without_dispatch ... bench:          82 ns/iter (+/- 7)
test tests::using_impltrait_generics        ... bench:          79 ns/iter (+/- 12)

Weird, I've seen that --emit=asm hack mentioned a few times lately. If that's just to reduce parallel codegen, you can more directly use -C codegen-units=1. This can also be set in your manifest [profile.*] sections.

2 Likes

Also, not sure if you're already taking this into account, but the fact that two of the tests use Boxes kind of muddles the results, since these tests will encounter a lot more cache misses than the baseline tests (since boxes will be spread over memory somewhat, while a vec of enums occupies a contiguous slice).

@cuviper Neat manifest trick, I'll keep that in mind.

@PoignardAzur That's actually what I initially wanted to test. How much of the dyn dispatch slowness comes from cache misses and how much comes from dyn dispatch. If you look at the test results I think this program is small enough to fit entirely in my cache, the boxed dispatch and enum dispatch are the same speed while dynamic dispatch is slower.

2 Likes