Generics are not zero cost?

rsnakard-r7 · January 23, 2020, 1:02am

Hey compiler team,

I came across this post from a while back and decided to run with it. It started as a quest to see whether dynamic dispatch really has that much overhead, but I ended up finding some weird behavior when benchmarking generic functions: they appear to have an overhead at runtime. My source is at this playground and my bench results are below.

running 7 tests
test tests::set_baseline                    ... bench:          84 ns/iter (+/- 9)
test tests::using_enum_dispatch             ... bench:          82 ns/iter (+/- 7)
test tests::using_boxed_enums               ... bench:          81 ns/iter (+/- 6)
test tests::using_generics_without_dispatch ... bench:         158 ns/iter (+/- 14)
test tests::using_bounded_generics          ... bench:         160 ns/iter (+/- 25)
test tests::using_impltrait_generics        ... bench:         159 ns/iter (+/- 17)
test tests::using_dynamic_dispatch          ... bench:         464 ns/iter (+/- 43)

cuviper · January 23, 2020, 1:23am

I suspect the placement of black_box is the difference in these benchmarks, as it will inhibit optimizations. In the enum cases, the black_box is surrounding the entire match, but in the generic cases you have the black_box on each arm.
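To make the two placements concrete, here is a minimal sketch. The playground source is not reproduced in this thread, so the trait `Value`, the bodies of `B`, `C`, `D` and the helper `make_input` are hypothetical stand-ins, and it uses the stable `std::hint::black_box` where the original `#[bench]` code presumably used the nightly `test::black_box`. Both functions return the same sum; only the amount of code hidden from the optimizer differs.

```rust
use std::hint::black_box;

// Hypothetical stand-ins for the playground's types (not shown in the thread).
trait Value {
    fn value(&self) -> u64;
}
struct B;
struct C;
struct D;
impl Value for B { fn value(&self) -> u64 { 1 } }
impl Value for C { fn value(&self) -> u64 { 2 } }
impl Value for D { fn value(&self) -> u64 { 3 } }

enum F { B(B), C(C), D(D) }

// Statically dispatched generic function, monomorphized per concrete type.
fn e_impl(x: &impl Value) -> u64 {
    x.value()
}

// black_box wraps the whole match: one opaque value per loop iteration.
fn sum_box_around_match(xs: &[F]) -> u64 {
    let mut sum = 0;
    for x in xs {
        sum += black_box(match x {
            F::B(b) => e_impl(b),
            F::C(c) => e_impl(c),
            F::D(d) => e_impl(d),
        });
    }
    sum
}

// black_box inside every arm: three separate opaque values per match.
fn sum_box_per_arm(xs: &[F]) -> u64 {
    let mut sum = 0;
    for x in xs {
        sum += match x {
            F::B(b) => black_box(e_impl(b)),
            F::C(c) => black_box(e_impl(c)),
            F::D(d) => black_box(e_impl(d)),
        };
    }
    sum
}

// Hypothetical helper: n repetitions of the B, C, D pattern.
fn make_input(n: usize) -> Vec<F> {
    let mut xs = Vec::with_capacity(n * 3);
    for _ in 0..n {
        xs.push(F::B(B));
        xs.push(F::C(C));
        xs.push(F::D(D));
    }
    xs
}

fn main() {
    let xs = make_input(100);
    // Same result either way; the placement only changes what the optimizer may see.
    assert_eq!(sum_box_around_match(&xs), sum_box_per_arm(&xs));
    println!("sum = {}", sum_box_around_match(&xs));
}
```

Timing the two functions under the same harness is one way to check whether the placement alone accounts for the gap.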

rsnakard-r7 · January 23, 2020, 1:27am

Hmm, would that be the same case for set_baseline? I compiled with --emit=asm and all my runtimes increased except the enum ones.

rsnakard-r7 $ RUSTFLAGS="--emit=asm" cargo bench
    Finished bench [optimized] target(s) in 0.03s
     Running target/release/deps/bench-0bbc44ddd0aaa06c

running 7 tests
test tests::set_baseline                    ... bench:         154 ns/iter (+/- 32)
test tests::using_enum_dispatch             ... bench:          81 ns/iter (+/- 10)
test tests::using_boxed_enums               ... bench:          81 ns/iter (+/- 7)
test tests::using_generics_without_dispatch ... bench:         289 ns/iter (+/- 22)
test tests::using_bounded_generics          ... bench:         286 ns/iter (+/- 49)
test tests::using_impltrait_generics        ... bench:         280 ns/iter (+/- 80)
test tests::using_dynamic_dispatch          ... bench:         461 ns/iter (+/- 60)

rsnakard-r7 · January 23, 2020, 1:32am

That was it, thanks Josh.

#[bench]
fn using_impltrait_generics(b: &mut Bencher) {
    let mut xs: Vec<F> = Vec::new();
    for _ in 0..N {
        xs.push(F::B(B));
        xs.push(F::C(C));
        xs.push(F::D(D));
    }
    assert_eq!(xs.len(), N * 3);
    b.iter(|| {
        let mut sum = 0;
        for x in &xs {
            sum += black_box( (|| { match x {
                F::B(a) => e_impl(a),
                F::C(c) => e_impl(c),
                F::D(d) => e_impl(d),
            }})());
        }
        sum
    });
}

test tests::set_baseline                    ... bench:          83 ns/iter (+/- 13)
test tests::using_impltrait_generics        ... bench:          82 ns/iter (+/- 21)

rsnakard-r7 · January 23, 2020, 1:54am

And --emit=asm seems to be a coin toss:

rsnakard-r7 $ RUSTFLAGS="--emit=asm" cargo bench
    Finished bench [optimized] target(s) in 0.5s
     Running target/release/deps/bench-0bbc44ddd0aaa06c

running 7 tests
test tests::set_baseline                    ... bench:          85 ns/iter (+/- 12)
test tests::using_bounded_generics          ... bench:          84 ns/iter (+/- 7)
test tests::using_boxed_enums               ... bench:         159 ns/iter (+/- 20)
test tests::using_dynamic_dispatch          ... bench:         457 ns/iter (+/- 51)
test tests::using_enum_dispatch             ... bench:         154 ns/iter (+/- 18)
test tests::using_generics_without_dispatch ... bench:          82 ns/iter (+/- 7)
test tests::using_impltrait_generics        ... bench:          79 ns/iter (+/- 12)

cuviper · January 23, 2020, 4:28am

Weird, I've seen that --emit=asm hack mentioned a few times lately. If that's just to reduce parallel codegen, you can more directly use -C codegen-units=1. This can also be set in your manifest [profile.*] sections.
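As a sketch of the manifest approach: `cargo bench` builds with the `bench` profile, so the setting would go in that section of `Cargo.toml`. Whether it removes the run-to-run variation seen above is not confirmed anywhere in this thread.

```toml
# Cargo.toml
# Compile each crate as a single codegen unit when benchmarking.
[profile.bench]
codegen-units = 1
```

The one-off equivalent from the command line is `RUSTFLAGS="-C codegen-units=1" cargo bench`.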

PoignardAzur · January 23, 2020, 9:16am

Also, not sure if you're already taking this into account, but the fact that two of the tests use Boxes kind of muddles the results. Those tests will encounter a lot more cache misses than the baseline tests, because boxes will be spread over memory somewhat, while a vec of enums occupies a contiguous slice.
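The layout difference can be observed directly. The sketch below uses hypothetical types (`Shape`, `Square`, `Rect`, `AnyShape`) that are not from the playground. It checks that the elements of a `Vec` of enums sit back to back, then prints where each boxed payload lives, which is up to the allocator.

```rust
use std::mem::size_of;

// Hypothetical types for illustration; they carry data so that Box really allocates.
trait Shape {
    fn area(&self) -> u64;
}
struct Square(u64);
struct Rect(u64, u64);
impl Shape for Square { fn area(&self) -> u64 { self.0 * self.0 } }
impl Shape for Rect { fn area(&self) -> u64 { self.0 * self.1 } }

enum AnyShape {
    Square(Square),
    Rect(Rect),
}

impl AnyShape {
    fn area(&self) -> u64 {
        match self {
            AnyShape::Square(s) => s.area(),
            AnyShape::Rect(r) => r.area(),
        }
    }
}

// Byte distance between each pair of consecutive slice elements.
fn strides<T>(xs: &[T]) -> Vec<usize> {
    xs.windows(2)
        .map(|w| (&w[1] as *const T as usize) - (&w[0] as *const T as usize))
        .collect()
}

fn main() {
    let inline = vec![
        AnyShape::Square(Square(2)),
        AnyShape::Rect(Rect(2, 3)),
        AnyShape::Square(Square(1)),
    ];
    // The enum values themselves are stored back to back in the Vec's buffer.
    assert!(strides(&inline).iter().all(|&s| s == size_of::<AnyShape>()));
    println!("inline total area: {}", inline.iter().map(|s| s.area()).sum::<u64>());

    let boxed: Vec<Box<dyn Shape>> = vec![
        Box::new(Square(2)),
        Box::new(Rect(2, 3)),
        Box::new(Square(1)),
    ];
    // The Vec stores only fat pointers (data pointer + vtable pointer)...
    assert_eq!(size_of::<Box<dyn Shape>>(), 2 * size_of::<usize>());
    // ...and each payload lives in its own heap allocation.
    for b in &boxed {
        let payload = &**b as *const dyn Shape as *const u8;
        println!("payload at {:p}, area {}", payload, b.area());
    }
}
```

With only a handful of small allocations made in a row, the payload addresses often land close together, so a small benchmark may not show the cache effect at all.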

rsnakard-r7 · January 23, 2020, 4:59pm

@cuviper Neat manifest trick, I'll keep that in mind.

@PoignardAzur That's actually what I initially wanted to test: how much of the dyn dispatch slowness comes from cache misses and how much comes from the dispatch itself. Looking at the test results, I think this program is small enough to fit entirely in my cache. The boxed enum dispatch and the plain enum dispatch run at the same speed, while dynamic dispatch is slower.
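The comparison described here can be sketched as follows. The types and helper names are hypothetical, since the playground code is not shown in the thread. Both collections pay one pointer indirection per element; only the second one also goes through a vtable, so timing the two against each other is one way to separate the indirection cost from the dispatch cost.

```rust
// Hypothetical stand-ins for the playground's types.
trait Value {
    fn value(&self) -> u64;
}
struct B(u64);
struct C(u64);
impl Value for B { fn value(&self) -> u64 { self.0 } }
impl Value for C { fn value(&self) -> u64 { self.0 * 2 } }

enum F {
    B(B),
    C(C),
}

impl F {
    // Enum dispatch: a match on the discriminant, then a static call.
    fn value(&self) -> u64 {
        match self {
            F::B(b) => b.value(),
            F::C(c) => c.value(),
        }
    }
}

// Pointer indirection per element, but no vtable.
fn sum_boxed_enums(xs: &[Box<F>]) -> u64 {
    xs.iter().map(|x| x.value()).sum()
}

// Pointer indirection per element plus a vtable call.
fn sum_dyn(xs: &[Box<dyn Value>]) -> u64 {
    xs.iter().map(|x| x.value()).sum()
}

// Hypothetical helpers that build matching inputs for both variants.
fn make_boxed_enums(n: u64) -> Vec<Box<F>> {
    let mut xs = Vec::new();
    for i in 1..=n {
        xs.push(Box::new(F::B(B(i))));
        xs.push(Box::new(F::C(C(i))));
    }
    xs
}

fn make_dyn(n: u64) -> Vec<Box<dyn Value>> {
    let mut xs: Vec<Box<dyn Value>> = Vec::new();
    for i in 1..=n {
        xs.push(Box::new(B(i)));
        xs.push(Box::new(C(i)));
    }
    xs
}

fn main() {
    let enums = make_boxed_enums(3);
    let dyns = make_dyn(3);
    // Same data, same result; only the dispatch mechanism differs.
    assert_eq!(sum_boxed_enums(&enums), sum_dyn(&dyns));
    println!("sum = {}", sum_dyn(&dyns));
}
```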

