Performance of Iterator::chain and std::iter::once

timvermeulen · March 2, 2019, 6:13pm

Recently I needed to return multiple elements from a flat_map closure, so I used once(x).chain(once(y)), which turned out to be slower than returning vec![x, y] instead. That was surprising, given that std::iter::Chain has fold and try_fold implementations, so internal iteration should be fast.
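For illustration, a minimal sketch of the two variants being compared. The input slice and the closure body (yielding x and x + 1) are assumptions made for the example, not the code from the original benchmark.

```rust
use std::iter::once;

// Variant 1: build the two-element iterator from `once` and `chain`.
// The closure body (x and x + 1) is a hypothetical stand-in.
fn with_chain(input: &[u32]) -> u32 {
    input.iter().flat_map(|&x| once(x).chain(once(x + 1))).sum()
}

// Variant 2: allocate a small Vec per element and let flat_map iterate it.
fn with_vec(input: &[u32]) -> u32 {
    input.iter().flat_map(|&x| vec![x, x + 1]).sum()
}

fn main() {
    let input: Vec<u32> = (0..1_000).collect();
    // Both variants must produce the same result; only their speed may differ.
    assert_eq!(with_chain(&input), with_vec(&input));
}
```

Both functions consume the iterator through sum, which is built on fold, so they exercise the internal-iteration path mentioned above.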

So I copied the implementations of the std::iter::Once and std::iter::Chain types to my crate and used those instead. Without any changes to the code, that made the chain version as fast as the vector version. Weirdly enough, using even one of the standard library’s Once or Chain types prevents the speedup: I have to copy both types to my own crate for it to be fast.
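A rough sketch of what crate-local versions of these two types can look like. This is a simplified stand-in, not a copy of the standard library source: the names MyOnce, MyChain, my_once and my_chain are hypothetical, only next and fold are implemented, and nothing here is claimed to reproduce the speedup.

```rust
// Hypothetical local stand-in for std::iter::Once.
struct MyOnce<T>(Option<T>);

impl<T> Iterator for MyOnce<T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        self.0.take()
    }

    // Internal iteration: call `f` at most once.
    fn fold<B, F>(self, init: B, mut f: F) -> B
    where
        F: FnMut(B, T) -> B,
    {
        match self.0 {
            Some(x) => f(init, x),
            None => init,
        }
    }
}

// Hypothetical local stand-in for std::iter::Chain.
// `a` is set to None once the first iterator is exhausted.
struct MyChain<A, B> {
    a: Option<A>,
    b: Option<B>,
}

impl<A, B> Iterator for MyChain<A, B>
where
    A: Iterator,
    B: Iterator<Item = A::Item>,
{
    type Item = A::Item;

    fn next(&mut self) -> Option<A::Item> {
        if let Some(a) = &mut self.a {
            match a.next() {
                Some(item) => return Some(item),
                None => self.a = None,
            }
        }
        self.b.as_mut()?.next()
    }

    // Internal iteration: fold the first half, then the second,
    // without checking which half is active on every element.
    fn fold<Acc, F>(self, init: Acc, mut f: F) -> Acc
    where
        F: FnMut(Acc, A::Item) -> Acc,
    {
        let mut acc = init;
        if let Some(a) = self.a {
            acc = a.fold(acc, &mut f);
        }
        if let Some(b) = self.b {
            acc = b.fold(acc, &mut f);
        }
        acc
    }
}

fn my_once<T>(x: T) -> MyOnce<T> {
    MyOnce(Some(x))
}

fn my_chain<A, B>(a: A, b: B) -> MyChain<A, B> {
    MyChain { a: Some(a), b: Some(b) }
}

fn main() {
    let total: u32 = (0..4u32)
        .flat_map(|x| my_chain(my_once(x), my_once(x + 1)))
        .sum();
    assert_eq!(total, 16);
}
```

Only fold is overridden here because overriding try_fold on stable Rust is not possible; the standard library types override both.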

Here are some benchmarks:

test benches::bench          ... bench:   3,507,853 ns/iter (+/- 433,627)
test benches::bench_my_chain ... bench:   3,502,656 ns/iter (+/- 384,330)
test benches::bench_my_once  ... bench:   3,482,771 ns/iter (+/- 373,445)
test benches::bench_my_both  ... bench:     635,188 ns/iter (+/- 82,182)
test benches::bench_vec      ... bench:     641,547 ns/iter (+/- 42,438)

Here’s the benchmarking code. I’ve excluded the implementations of irrelevant traits and methods of the Chain and Once types for brevity; they did not affect the benchmarks. Note that you’ll have to run them locally, since the playground doesn’t support benchmarks.
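For readers without the original benchmark code, a comparison of this kind can be approximated on stable Rust with a hand-rolled timing loop. This is only a sketch under assumptions: the helper time_it, the workload size and the run count are hypothetical, and a loop like this is far less rigorous than the bench harness that produced the numbers above.

```rust
use std::hint::black_box;
use std::iter::once;
use std::time::{Duration, Instant};

// Hypothetical helper: run `f` `runs` times and return the mean duration.
fn time_it<F: FnMut() -> u64>(mut f: F, runs: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..runs {
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed() / runs
}

fn sum_chain(n: u64) -> u64 {
    (0..n).flat_map(|x| once(x).chain(once(x + 1))).sum()
}

fn sum_vec(n: u64) -> u64 {
    (0..n).flat_map(|x| vec![x, x + 1]).sum()
}

fn main() {
    // Hide the input size from the optimizer as well.
    let n = black_box(100_000u64);
    println!("chain: {:?}", time_it(|| sum_chain(n), 20));
    println!("vec:   {:?}", time_it(|| sum_vec(n), 20));
}
```

Build with optimizations (for example cargo run --release) before comparing timings; debug builds say little about iterator performance.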

Any idea how this was able to cause a speedup? The types themselves seem to be fine, but something is stopping an optimization from happening. I thought that maybe the issue was that some methods aren’t being inlined, but adding the #[inline] attribute to the methods of Chain and Once that are being called didn’t change the results.

RustyYato · March 2, 2019, 6:34pm

It may have something to do with link-time optimizations. Do you have those turned on?

timvermeulen · March 2, 2019, 7:18pm

I just tried it and it didn’t help, unfortunately. I don’t think that’s necessary when #[inline] is used anyway?

kornel · March 4, 2019, 4:57am

Check output on http://rust.godbolt.org/ (make sure to add -C opt-level=3 flag!) and if that doesn’t look good, file a bug against Rust.

timvermeulen · March 4, 2019, 1:21pm

Interestingly, I can’t really reproduce it there – no matter what I try, the generated assembly is always the same. Which by itself is totally expected, but it isn’t able to explain my benchmark results. Any idea what I could try to reproduce the problem in Godbolt?

kornel · March 4, 2019, 2:04pm

Oh, did you try enabling LTO or changing the number of codegen units? They’re known to interfere with inlining.

timvermeulen · March 4, 2019, 3:20pm

I hadn’t yet tried setting the number of codegen units to 1, but unfortunately neither that nor LTO changes the results.

system · June 2, 2019, 3:20pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
