Add .dedup() for Iterator


#1

Currently, the .dedup() method only exists on Vec. But it can be implemented on any Iterator as well: https://gist.github.com/lfairy/0f952345ddf07c26eabe

Much of the complexity in the implementation boils down to one issue: how do we yield values as early as possible? The naive algorithm consumes a whole run before yielding, which is elegant but falls flat on infinite sequences (e.g. repeat(1)). It is also suboptimal when the underlying .next() is slow, say when we’re streaming data from the network or from disk.

Why can’t we yield the first element straight away? Because then we’ll have no element to compare with on the next iteration, since the previous element has already been moved out.

There are a few ways to approach this problem:

  1. .clone() the value before we return it. This is nice and simple but requires a Clone bound and an extra copy.

  2. Use Peekable as a one-element buffer. We .peek() at the next item before returning the current one, effectively doing the comparison one step in advance. This has the advantage of not requiring Clone, but the drawback of consuming an extra element.

  3. Return a reference to the element, rather than moving it out. This would be the ideal solution, but requires a StreamingIterator interface.

My implementation uses solution (2). It’s unlikely we’ll get the trait needed for (3), and copying is often more expensive than consuming an extra element.

I haven’t run benchmarks yet, but the Vec version should be faster as it works in-place. We could write the in-place .dedup() in terms of the iterator one – it’s safe because we yield fewer values than we consume – but I’m not sure if that’s worth the trouble.

So, any questions, comments? In particular, is an iterator .dedup() worth adding to the standard library?


#2

I like the design, but is there a need for it in the standard library, rather than just making it a crates.io repository?


#3

I just found it strange that .dedup() was implemented on Vec, but not Iterator.

I guess given that .concat() and .connect() aren’t on Iterator either, there’s precedent for not adding iterator versions when they’re less efficient than the specialized ones.

And since there are multiple solutions to this problem, all with different tradeoffs, perhaps it’s better to iterate (hehe) in a separate library.

(It seems the itertools crate already has a Dedup implementation, which uses the naive solution.)


#4

itertools’ dedup uses the solution you describe as (2). Dedup itself is easy to add to the iterator, but I understand why it’s on the vec – it makes much more sense to use it together with sort.


#5

Not quite. Consider the code:

repeat(1).dedup().next()

My implementation returns Some(1) straight away, whereas itertools’ loops forever.

This is because itertools’ dedup consumes a whole run of equal values before it yields, whereas my implementation delays consuming the remainder until the next iteration.

It’s subtle I admit, but since infinite iterators are a valid use case I think it’s important to handle them properly.

I’ll be happy to submit a patch for itertools if you’re interested.

Agreed.


#6

ok, good point, that example makes it clear. If you want to patch it you are very welcome.