`size_hint`, correctness, reproducibility and documentation


#1

We have Iterator::size_hint, ExactSizeIterator, and TrustedLen (still unstable and also unusable without specialisation) all doing roughly the same thing.

To quote the documentation of Iterator::size_hint:

It is not enforced that an iterator implementation yields the declared number of elements. A buggy iterator may yield less than the lower bound or more than the upper bound of elements.

size_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the iterator, but must not be trusted to e.g. omit bounds checks in unsafe code. An incorrect implementation of size_hint() should not lead to memory safety violations.

That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait’s protocol.

ExactSizeIterator appears to imply that the exact length is known, but not that the reported length can be trusted (for memory safety at least). See also The Trusted Iterator Length Problem.

TrustedLen appears to be intended to allow optimal collect() implementations by trusting that the reported length is correct (and violating memory safety if not).


But what can be done with these reported lengths/hints?

  1. They cannot be relied upon for memory safety (except TrustedLen).
  2. Is it allowable to ignore elements beyond the claimed maximum, e.g. in collect()? This does not appear to be specified anywhere, but the current implementations will not drop extra elements (in fact, with TrustedLen, they would instead write beyond the end of the allocated block if the maximum was too low; otherwise the claimed maximum is ignored).
  3. Is it allowable to incorrectly return None when randomly choosing an element from an iterator whose length is less than the claimed lower bound? This is analogous to the previous problem, so presumably also no, unless TrustedLen is implemented.
  4. Is it allowable to introduce bias in the result when the claimed bounds are incorrect — e.g. in the above problem to retain the last element in case there are less than the claimed minimum, thus potentially selecting the last element with higher probability than expected? This problem is probably specific to random selection, and still returns a valid result, but perhaps not the correct one.
  5. Is it allowable for size_hint to affect the result in other ways so long as the result is still correct? This is very specific to randomness: normally only one result is correct but with random sampling, many results may be correct (valid and without bias). Additionally, the number of random values extracted from the random number generator may not be specified. This goes against the usual assumption that optimisations cannot affect the result, and is something we’re trying to cover for the Rand library, but we’re not fully there yet.

I’m trying to tease out what we can and can’t do to optimise sampling based on size_hint (see https://github.com/rust-lang-nursery/rand/issues/511). Personally I think we should yes to the last question, 5. I’m less sure about question 4, but without allowing this I don’t think we can make any optimisations anyway.

Presumably since TrustedLen allows memory safety violations on incorrect size_hint, any of the rest is allowable too. For ExactSizeIterator however, it doesn’t appear that the trait implies anything more than size_hint alone (except that the lower and upper bounds are the same; there have been suggestions that this trait could be removed).


#2

Hm, I think the contract here is basically the same as in Ord? You can’t rely on Ord being implemented correctly in unsafe blocks, but, in safe code, you needn’t think about incorrect Ord at all.

So, if you use unsafe to implement sampling, you can use neither size_hint, nor ExactSizeIterator: these are safe traits, so they might have bugs which will triger UB in your unsafe code. You won’t be abtle to use TrustedLen either, because it is unstable.

In safe code, you can assume that size_hint indeed returns correct lower and upper bounds (which might be just 0 and None). You can also assume that ExectSizeIterator also returns what it says on the tin.

So, there’s nothing special about size_hint. Formally, every safe method of every safe trait should have “don’t assume that this is correct in unsafe code” printed in a small font. I think the docs call out the unsafety problem explicitly just because this method has a high chance of being used incorrectly to bypass bounds-check.

TL;DR assume that size_hint returns correct upper and lower bounds (which are not guaranteed to be tight, that’s why it is a hint). If you detect a violation of this property though, it might be a good idea to explicitly panic.


#3

↑ Assume incorrect bounds are a bug, and panic if you detect this (but avoid UB). ↑

That’s actually a really good basis to go on.


#4

Unrelated to the OP, but I’ve just realized that “guarantees memory safety in the presence of bugs” is a nice way to describe Rust to C/C++ folks, as opposed to just “guarantees memory safety”, which often gets “sure, just don’t trip with scissors” response :smiley:


#5

unsafe == Rust’s scissors though :slight_smile:


#6

Yes. Garbage in, garbage out: of someone gives you an iterator that doesn’t implement Iterator correctly, you’re also allowed to not behave correctly. The only thing you’re not allowed to do is be memory-unsafe.

Some examples:

  • if the size_hint says there will be N elements, you can OOM for the N even if it was actually empty

  • if it claims to be ExactSizeIterator, you can loop to the len calling .next().unwrap(), which might panic or skip some (but you cannot call .next().unwrap_or_else(unreachable_unchecked))