Performance of Iterator::chain and std::iter::once

Recently I needed to return multiple elements from a flat_map closure, so I used once(x).chain(once(y)), which turned out to be slower than returning vec![x, y] instead. That was surprising, given that std::iter::Chain has fold and try_fold implementations, so internal iteration should be fast.
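To make the comparison concrete, here is a minimal sketch of the two flat_map variants being compared (the function names and the x + 1 mapping are my own, just for illustration):

```rust
use std::iter::once;

// Variant 1: yield two items per input element via chained once() iterators.
fn chained(data: &[u32]) -> Vec<u32> {
    data.iter()
        .flat_map(|&x| once(x).chain(once(x + 1)))
        .collect()
}

// Variant 2: yield the same two items via a heap-allocated Vec.
fn with_vec(data: &[u32]) -> Vec<u32> {
    data.iter().flat_map(|&x| vec![x, x + 1]).collect()
}

fn main() {
    let data = [1u32, 3, 5];
    // Both variants produce the same output; only performance differs.
    assert_eq!(chained(&data), with_vec(&data));
    println!("{:?}", chained(&data)); // [1, 2, 3, 4, 5, 6]
}
```

One would normally expect the once/chain version to win, since it avoids the allocation that vec! performs per element.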

So I copied the implementations of the std::iter::Once and std::iter::Chain types into my crate and used those instead, and without changing them at all, the chain version became as fast as using a vector. Weirdly enough, using either of the standard library’s Once or Chain types prevents the speedup – I have to copy both types into my own crate for it to be fast.
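For readers who haven't looked inside std: the copies were taken verbatim from the standard library, but a simplified sketch of their shape (the names MyOnce, MyChain, and my_chain_pair here are my own) looks roughly like this:

```rust
// Simplified sketch of locally-copied iterator adapters; the real std
// versions also implement try_fold, size_hint, DoubleEndedIterator, etc.
struct MyOnce<T>(Option<T>);

impl<T> Iterator for MyOnce<T> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        self.0.take()
    }
}

struct MyChain<A, B> {
    a: Option<A>,
    b: Option<B>,
}

impl<A, B> Iterator for MyChain<A, B>
where
    A: Iterator,
    B: Iterator<Item = A::Item>,
{
    type Item = A::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // External iteration: must re-check which half is active each call.
        if let Some(a) = self.a.as_mut() {
            match a.next() {
                Some(x) => return Some(x),
                None => self.a = None,
            }
        }
        if let Some(b) = self.b.as_mut() {
            match b.next() {
                Some(x) => return Some(x),
                None => self.b = None,
            }
        }
        None
    }

    // Internal iteration: fold drains each half in one go, without the
    // per-item state checks that next() needs. This is the specialization
    // that should make chained iterators fast under flat_map.
    fn fold<Acc, F>(self, mut acc: Acc, mut f: F) -> Acc
    where
        Self: Sized,
        F: FnMut(Acc, Self::Item) -> Acc,
    {
        if let Some(a) = self.a {
            acc = a.fold(acc, &mut f);
        }
        if let Some(b) = self.b {
            acc = b.fold(acc, &mut f);
        }
        acc
    }
}

// Convenience constructor standing in for once(x).chain(once(y)).
fn my_chain_pair<T>(x: T, y: T) -> MyChain<MyOnce<T>, MyOnce<T>> {
    MyChain {
        a: Some(MyOnce(Some(x))),
        b: Some(MyOnce(Some(y))),
    }
}

fn main() {
    assert_eq!(my_chain_pair(1u32, 2).fold(0, |a, x| a + x), 3);
    assert_eq!(my_chain_pair(1u32, 2).collect::<Vec<_>>(), vec![1, 2]);
}
```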

Here are some benchmarks:

test benches::bench          ... bench:   3,507,853 ns/iter (+/- 433,627)
test benches::bench_my_chain ... bench:   3,502,656 ns/iter (+/- 384,330)
test benches::bench_my_once  ... bench:   3,482,771 ns/iter (+/- 373,445)
test benches::bench_my_both  ... bench:     635,188 ns/iter (+/- 82,182)
test benches::bench_vec      ... bench:     641,547 ns/iter (+/- 42,438)

Here’s the benchmarking code. I’ve excluded the implementations of irrelevant traits and methods of the Chain and Once types for brevity; they did not affect the benchmarks. Note that you’ll have to run them locally, since the playground doesn’t support benchmarks.

Any idea how this was able to cause a speedup? The types themselves seem to be fine, but something is stopping an optimization from happening. I thought that maybe the issue was that some methods aren’t being inlined, but adding the #[inline] attribute to the methods of Chain and Once that are being called didn’t change the results.


It may be something to do with link-time optimization. Do you have it turned on?

I just tried it and it didn’t help, unfortunately. I don’t think that’s necessary when #[inline] is used anyway?

Check the output on Godbolt (make sure to add the -C opt-level=3 flag!), and if that doesn’t look good, file a bug against Rust.

Interestingly, I can’t really reproduce it there – no matter what I try, the generated assembly is always the same. Which by itself is totally expected, but it isn’t able to explain my benchmark results. Any idea what I could try to reproduce the problem in Godbolt?

Oh, did you try enabling LTO or changing the number of codegen units? They’re known to interfere with inlining.
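For reference, both of those knobs can be set in a Cargo profile, e.g. for benchmarks (a minimal sketch; the exact profile to change depends on how the benchmarks are run):

```toml
# Cargo.toml — the two settings being suggested
[profile.bench]
lto = true          # enable link-time optimization
codegen-units = 1   # single codegen unit, maximizing cross-function inlining
```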

I hadn’t yet tried setting the number of codegen units to 1, but unfortunately neither change affects the results.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.