Non-truncating, more usable zip()


#1

I’ve found that when I use iter.zip() it’s almost always to iterate over same-length arrays.

Unfortunately, when I don’t intend any of the iterators to be shorter or empty, .zip()'s graceful handling of different lengths hides and propagates errors.

I think it would be good for code robustness if the expectation of iterators being same length could be made explicit and enforced (like zip_eq).
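
To make the concern concrete, here is a minimal sketch using slices. The helper names sum_pairs_truncating and sum_pairs_checked are made up for illustration and are not part of std or itertools.

```rust
/// Adds two slices element-wise with plain `zip`.
/// If the lengths differ, the surplus elements are silently dropped.
fn sum_pairs_truncating(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b.iter()).map(|(x, y)| x.wrapping_add(*y)).collect()
}

/// Same operation, with the same-length expectation made explicit.
/// Returns `None` instead of a truncated result when the lengths differ.
fn sum_pairs_checked(a: &[u8], b: &[u8]) -> Option<Vec<u8>> {
    if a.len() != b.len() {
        return None;
    }
    Some(sum_pairs_truncating(a, b))
}
```

With inputs of length 3 and 2, the first function quietly returns 2 elements, while the second reports the mismatch to the caller.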

Also, .zip() gets awkward quickly when iterating over more than two iterators. So perhaps both problems could be solved by deprecating zip and adding an alternative, more convenient solution: something that assumes same lengths by default (for varying-length iterators there should be a separate function, e.g. zip_shortest) and easily supports more than two iterators?
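
As an illustration of the awkwardness with three inputs, a small sketch using only std (sum3 is a hypothetical name):

```rust
/// Sums three slices element-wise using std's two-way `zip`.
/// Each extra iterator adds another layer of tuple nesting in the closure.
fn sum3(a: &[i32], b: &[i32], c: &[i32]) -> Vec<i32> {
    a.iter()
        .zip(b.iter())
        .zip(c.iter())
        .map(|((x, y), z)| x + y + z)
        .collect()
}
```

A fourth input would turn the pattern into (((w, x), y), z), and so on.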


#2

A zip that stops at the shortest input is the right default; see Python’s zip:

>>> zip([1,2], "abc")
[(1, 'a'), (2, 'b')]

And Haskell:

Prelude> zip [1,2] "abc"
[(1,'a'),(2,'b')]

Regarding multiple arguments, I agree, but Rust doesn’t support functions with a variable number of arguments. Itertools solves this problem with an input tuple (though having multiple arguments would be better):

https://docs.rs/itertools/0.6.0/itertools/fn.multizip.html


#3

I don’t think other languages merely doing the same thing is a sufficient justification. What if all three are copying the same mistake from each other?

I see that Python would prefer more flexibility and convenience rather than robustness, but from Rust I expect more reliability.

In image processing I have things like:

let image3: Vec<_> = image1.zip(image2).map(|(a, b)| a + b).collect();

and when image1 has a different stride than image2, I get a garbage result, and maybe an out of bounds error somewhere later due to the unexpected truncation. That’s unusual for Rust, because in other cases it catches most of my errors quite reliably and early.

I’ve started this thread after seeing a discussion on Twitter about Intel’s AMT vulnerability, which was caused by a similarly unexpected early termination in strncmp. Rust’s equivalent of that would be:

computed_response.zip(user_response).all(|(c,u)| c==u)

For basic string comparison Rust of course has ==, but if one wanted to implement any fancier comparison (constant-time, case-insensitive), they’d probably use .zip() and risk that mistake.
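
A minimal sketch of that mistake and one possible guard, assuming both responses are byte slices. The function names are hypothetical, and neither version is constant-time; the point is only the length check.

```rust
/// Naive comparison: `zip` stops at the shorter input, so any prefix of the
/// expected value (including an empty one) is accepted.
fn matches_naive(expected: &[u8], supplied: &[u8]) -> bool {
    expected.iter().zip(supplied.iter()).all(|(e, s)| e == s)
}

/// Checks the lengths first, then compares every byte pair.
fn matches_checked(expected: &[u8], supplied: &[u8]) -> bool {
    expected.len() == supplied.len()
        && expected.iter().zip(supplied.iter()).all(|(e, s)| e == s)
}
```

With an empty supplied value, the naive version has no pairs to compare, so .all() returns true.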


#4

Stopping at the shortest seems most important for (pseudo-)infinite iterators. For example, Iterator::enumerate suggests .zip(0..) if you want to enumerate with a type other than usize.
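
A short sketch of that pattern (enumerate_u32 is a made-up helper name):

```rust
/// Pairs each item with a `u32` index by zipping with an unbounded range.
/// This relies on `zip` stopping when the finite side runs out.
fn enumerate_u32<'a>(items: &[&'a str]) -> Vec<(&'a str, u32)> {
    items.iter().copied().zip(0u32..).collect()
}
```

An equal-length zip could not be used here, since 0u32.. has no matching end.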


#5

Couldn’t disagree more :slight_smile:

A lot of my own bugs taught me a rule of thumb: “if the line with zip(xs, ys) is not preceded by assert len(xs) == len(ys) or by if len(xs) == len(ys), then this line contains a bug”. Of course, there are a few use cases where I do want to zip iterators of different lengths, and I always place an # XXX: xs and ys may be of different lengths comment there :slight_smile: And reading other people’s code taught me that everybody makes this mistake.

I :heart: :heart: :heart: the zip function from Jane Street’s Core library for OCaml ('a t means List<A>):

val zip : 'a t -> 'b t -> ('a * 'b) t option
val zip_exn : 'a t -> 'b t -> ('a * 'b) t

@kornel how would you like to implement .zip_eq? The problem with it is that you don’t always know the length of an iterator up front, and because of this there’s no single reasonable place to report an error. Core dodges the issue by returning a list. Perhaps we can restrict this only to ExactSizeIterators?
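
One way the ExactSizeIterator restriction could look, as a rough sketch. zip_exact is a hypothetical name, and the Option return mirrors Core's zip rather than any existing Rust API.

```rust
/// Hypothetical helper: zips two exact-size iterators only if their
/// reported lengths match, so a mismatch is reported before iteration starts.
fn zip_exact<A, B>(a: A, b: B) -> Option<std::iter::Zip<A::IntoIter, B::IntoIter>>
where
    A: IntoIterator,
    B: IntoIterator,
    A::IntoIter: ExactSizeIterator,
    B::IntoIter: ExactSizeIterator,
{
    let (a, b) = (a.into_iter(), b.into_iter());
    if a.len() == b.len() {
        Some(a.zip(b))
    } else {
        None
    }
}
```

Because the check happens up front, the caller gets a single place to handle the error, and the returned iterator is the ordinary std Zip.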


#6

I agree with you. There should be two functions: one which is named something like zip_shortest() and another that’s named something like zip_same_length(), and the current zip() should be deprecated. Further, like @matklad said, it makes sense to have zip_same_length() return Result<> instead of panicking when the lengths do not match.


#7

Sure, I agree that dropping extra elements is not usually what you want. But it’s an easy-to-use interface… I actually think a.zip_eq(b) is about as fluent an interface as you can get for adding the length check.

(New proposed names are cute, but I don’t see any active harm with the old name, so deprecation is not needed.)

Good news: .zip() uses specialization to offer better codegen when zipping slice iterators and a few other ones.

Bad news: stable Rust libraries can’t compete with that, so zip_eq and multizip are neat, but potentially slower, which is not intuitive.

I guess for zip_eq it’s understandable that:

  1. There is no way to tell the length of an iterator, so if neither reaches its end, the iterators are of course not verified to be of equal length. So zip_eq is lazy and won’t yell if you don’t iterate to an end.
  2. A panic case in the loop body is not associated with great optimizations :frowning2:
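
Both points can be seen in a simplified sketch of such an adapter. This is an illustration written for this thread, not the itertools implementation; ZipEq and zip_eq here are local, hypothetical definitions.

```rust
/// Hypothetical adapter sketching a lazy, panicking "equal length" zip.
struct ZipEq<A, B> {
    a: A,
    b: B,
}

impl<A: Iterator, B: Iterator> Iterator for ZipEq<A, B> {
    type Item = (A::Item, B::Item);

    fn next(&mut self) -> Option<Self::Item> {
        match (self.a.next(), self.b.next()) {
            (Some(x), Some(y)) => Some((x, y)),
            (None, None) => None,
            // The mismatch only becomes visible once one side runs out.
            _ => panic!("zip_eq: iterators have different lengths"),
        }
    }
}

fn zip_eq<A: IntoIterator, B: IntoIterator>(a: A, b: B) -> ZipEq<A::IntoIter, B::IntoIter> {
    ZipEq { a: a.into_iter(), b: b.into_iter() }
}
```

Taking only the first two pairs of a 3-element and a 2-element input never reaches the panic branch, so the mismatch goes unnoticed. The panic branch also sits inside next(), i.e. in the loop body.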

#8

But can we use an ExactSizeIterator bound for that? I would guess that it’ll cover most of the use cases for zip_eq.


#9

ExactSizeIterator::len can be implemented incorrectly. TrustedLen was added for unsafe code whose safety depends on an accurate length: https://github.com/rust-lang/rust/pull/37306
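
For illustration, a deliberately wrong implementation that compiles without any unsafe code (Liar is a made-up type):

```rust
/// Hypothetical iterator whose `len()` lies: it yields 2 items but claims 5.
struct Liar(u32);

impl Iterator for Liar {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.0 < 2 {
            self.0 += 1;
            Some(self.0)
        } else {
            None
        }
    }
}

impl ExactSizeIterator for Liar {
    // Overrides the default `len`, which is derived from `size_hint`.
    fn len(&self) -> usize {
        5
    }
}
```

A length check built on len() would accept Liar as a 5-element iterator, even though it only ever yields 2 items.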


#10

But that would be a bug? That is, zip_eq_result may panic or return the wrong result, but nothing bad will happen.


#11

In addition to the “stop at the shortest” and “require equal length” behaviors already mentioned, numeric programming environments (e.g. R, NumPy, Matlab) often offer “recycling”, in which the shorter vectors are repeated to fill out the length of the longest one.
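
In Rust, a rough equivalent of recycling can be sketched with cycle(). add_recycled is a hypothetical helper that assumes the first argument is the longer one.

```rust
/// Adds `short` to `long` element-wise, repeating `short` as needed
/// (the "recycling" rule). Assumes `long` is the longer slice.
/// If `short` is empty, the result is empty, because `cycle` yields nothing.
fn add_recycled(long: &[i32], short: &[i32]) -> Vec<i32> {
    long.iter()
        .zip(short.iter().cycle())
        .map(|(x, y)| x + y)
        .collect()
}
```

Note that this still leans on zip stopping at the shortest side, since the cycled iterator never ends.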

I don’t think this can be one-size-fits-all.


#12

For sure! The problem with shortest by default is not that it is sometimes inconvenient; it is that it almost always (in my personal experience) leads to hard-to-debug bugs.