`scan` is too complicated, suggestion for new iterator methods

optozorax · November 26, 2020, 6:25pm

AFAIK scan iterator method is the only iterator method that provides a mechanism to share some mutable state across iterations. But I think this method is too complicated and wide, therefore it should be replaced by more convenient and simple methods.

Let's look at scan usage example:

let a = [1, 2, 3];

let mut iter = a.iter().scan(1, |state, &x| {
    // each iteration, we'll multiply the state by the element
    *state = *state * x;

    // then, we'll yield the negation of the state
    Some(-*state)
});

assert_eq!(iter.next(), Some(-1));
assert_eq!(iter.next(), Some(-2));
assert_eq!(iter.next(), Some(-6));
assert_eq!(iter.next(), None);

Observations:

We must provide an initial value of the state, we can't get it from the first iteration.
state is provided by &mut reference, so we must change it by using *state = ... syntax, not by returning a value. So, we always need curly brackets. But it's a powerful tool: we can compute a return value before changing state, or after.
Then we return Some() of next value. And by this we do as same as filter_map. Because of this Scan doesn't implement ExactSizeIterator, so we can't use it to allocate exact space when using the collect method.

I think:

We should able to provide initial value either by telling an exact value or from first iteration.
state should be changed not by mutable reference, but by function Fn(State, &Item) -> State, because it is more simple and don't require {} in simple cases.
There shouldn't be filter_map or map or filter behavior inside this iterator, it should be done by the next iterator adaptors.

So, I suggest these new methods to replace scan: state_before, state_after with equal signatures.

Example:

let a = [1, 2, 3];
assert_eq!(a.iter().state_before(None,    |state, &item| state * item).eq(&[(1, 1), (2, 2), (6, 3)]));
assert_eq!(a.iter().state_before(Some(4), |state, &item| state * item).eq(&[(4, 1), (8, 2), (24, 3)]));
assert_eq!(a.iter().state_after(None,     |state, &item| state * item).eq(&[(1, 1), (1, 2), (2, 3)]));
assert_eq!(a.iter().state_after(Some(4),  |state, &item| state * item).eq(&[(4, 1), (4, 2), (8, 3)]));

First argument of these methods - is an initial value. If it is None, then the initial value is yielded from the first iteration. If it is Some(...), then it initialized by the inner value.
Second arguments of this methods - is function Fn(State, &Item) -> State. That function receives current state and current item by reference and it must return next state value.
The main name is state because these methods provide state across iterations.
Suffix before means that the state will be modified before returning the element. Suffix after means that the state will be modified after returning the element.
These methods return an iterator with Item = (State, OldItem) by analogy as enumerate.
These methods return an iterator that implements ExactSizeIterator.
Then, we can apply filter_map next and get the same behavior as a scan.

What do you think?

quaternic · November 26, 2020, 7:14pm

AFAIK scan iterator method is the only iterator method that provides a mechanism to share some mutable state across iterations.

Most of the iterator adaptors taking closures take an FnMut, so you can provide mutable state via the closure. The example can also be implemented like this, for example:

let mut state = 1;
let mut iter = a.iter().map(|x| {
    state *= x;
    -state
});

dodomorandi · November 26, 2020, 7:36pm

Even if I don't agree with some points of the OP, I share the same sentiment that there is no way to use an alternative version of scan that always returns an item and, therefore, can impl ExactSizeIterator. This makes scan a non-zero-overhead abstraction (because you are paying for something you could not use), and maybe it's something that could be improved

matklad · November 28, 2020, 10:58am

I must confess, after about five years of using Rust, I've never needed to use scan even once. I feel like this method doesn't pull its weight. There used to be successors-shaped whole in the iterator API which could have been filled by a scan (if one tried hard enough), but it isn't there any more.

I am not even sure we need fold and try_fold (other methods with explicit state) -- for_each and try_for_each seem to be more rustic.

RustyYato · November 28, 2020, 2:18pm

I've used fold and try_fold on multiple occasions, so they pull their weight for me, but I agree that scan is pretty much dead weight. I always forget that it even exists.

matklad · November 28, 2020, 2:44pm

Do you have a quick example to share? I'd love to see how fold compares to for_each side-by-side. I am interested in this because, abstractly, fold is the most complicated of basic iterator methods in terms of cognitive complexity (up to the point of being removed from builtins in Python3), while for_each is probably the simplest one.

RustyYato · November 28, 2020, 3:54pm

I made a lexer and as a part of it, I wanted nested block comments, so I did this, try_fold version

let end = memchr::memchr2_iter(b'*', b'/', haystack)
    .try_fold((Prev::None, 0_u32), |(prev, depth), pos| {
        let item = haystack[pos];

        match (item, prev) {
            (b'*', Prev::ForSlash(p)) if p == pos => Ok((Prev::None, depth + 1)),
            (b'*', _) => Ok((Prev::Asterix(pos + 1), depth)),
            (b'/', Prev::Asterix(p)) if p == pos => {
                if let Some(depth) = depth.checked_sub(1) {
                    Ok((Prev::None, depth))
                } else {
                    Err(pos + 1)
                }
            }
            (b'/', _) => Ok((Prev::ForSlash(pos + 1), depth)),
            _ => unreachable!(),
        }
    })
    .err();

try_for_each version

let prev = Prev::None;
let depth = 0_u32;
let end = memchr::memchr2_iter(b'*', b'/', haystack)
    .try_for_each((Prev::None, 0_u32), |pos| {
        let item = haystack[pos];

        match (item, prev) {
            (b'*', Prev::ForSlash(p)) if p == pos => {
                prev = Prev::None
                depth += 1;
            }
            (b'*', _) => prev = Prev::Asterix(pos + 1),
            (b'/', Prev::Asterix(p)) if p == pos => {
                if let Some(depth) = depth.checked_sub(1) {
                    prev = Prev::None;
                } else {
                    return Err(pos + 1)
                }
            }
            (b'/', _) => prev = Prev::ForSlash(pos + 1),
            _ => unreachable!(),
        }
        
        Ok(())
    })
    .err();

Personally, I could change most uses of try_fold to try_for_each, but I would prefer the try_fold versions, because it makes it clear that something more involved is going on here.

mbrubeck · November 28, 2020, 4:36pm

Here's an example of one sort of use case where fold can be simpler than for_each:

fn checksum0(input: &[u8]) -> u8 {
    input.iter().fold(0, u8::bitxor)
}

fn checksum1(input: &[u8]) -> u8 {
    let mut checksum = 0;
    input.iter().for_each(|x| checksum |= x);
    checksum
}

or, more generally, any time where the combining operation makes sense as a standalone function in its own right, especially if this function already exists for other purposes.

steffahn · November 28, 2020, 4:57pm

One advantage of scan is that it doesn’t implement DoubleEndedIterator, so you can be sure that the closure really gets called in the correct order. The extra state argument and the ability/obligation to return Option<Item> both are pretty useless though. The problem with .map(..).rev() applying the function in reverse order can be avoided by introducing something called ".forward()" or ".irreversable()" that just removes the DoubleEndedIterator impl, so that we can e.g. chain .forward().map(...).

I’m not of the opinion that extra state arguments are useless in general, but I prefer them by value (since getting an owned value back out of the &mut St reference, or out of a captured variable in an FnMut can be tricky). @matklad: this is IMO one advantage of fold over for_each.

dhm · November 28, 2020, 7:10pm

One usage of .scan() is as a stable replacement of .map_while()

scottmcm · November 28, 2020, 9:42pm

The important difference is in ownership. You're absolutely correct that it's better to use for_each with an FnMut closure if you just need to mutate something, rather than using fold and threading it through all the calls -- not only is it simpler, it's faster too. (And here's the corresponding clippy issue to suggest as such.)

But if you need owning access to the accumulator, the version using for_each means you're storing it in an Option and doing the .take().unwrap() dance. So in those situations the fold is superior since you're given the value moved.

AngelicosPhosphoros · December 1, 2020, 8:41pm

I think, fold is needed. for_each is too imperative for my taste and fold allows running functions without any state in them.

let result = iter.fold(initial, my_pure_function);
// vs
let mut result = initial;
iter.for_each(|x|result = my_pure_function(result, x));
let result = result; // to remove mutability

I think, it is more about missed optimization in rustc or LLVM or std.

peterjoel · December 28, 2020, 6:36pm

Mutating captured state with map feels very dirty to me. I got used to iterator combinators from Haskell, where mutating external state is just not possible (or definitely a code smell if you work around it with MVar or similar), and when I see map in Rust I would find it surprising if the mapping function was not pure. I'd go as far as to say it was poor style.

If it really is slower to use fold than map-with-mutation, then this should be addressed in the compiler.

RustyYato · December 28, 2020, 6:45pm

It's not possible to express "pure" as a constraint in Rust. Even normal functions are stateful! How so you ask? By mutating global variables.

fn not_pure() -> i32 {
    thread_local! { static STATE: Cell<i32> = Cell::new(0); }
    
    fn do_something(x: &Cell<i32>) -> i32 { ... }
    
    STATE.with(do_something)
}

Also, Rust doesn't pursue purity in the same way that Haskell does. It's perfectly fine to mutate a value, so long as you have unique access to that value. The most obvious way this manifests is in references, where &mut T is unique and &T is shared and as a consequence, only &mut T can mutate T directly ^{_{(ignoring UnsafeCell and friends)}}. But this permeates every API in Rust, for example, Rc<T> normally doesn't let you mutate T, but if you can prove that it is unique, you can use Rc::get_mut to mutate it!

All this, because mutating things in place is far more efficient/easy to optimize than copying things around. This is why a mutating map is faster than a copying fold. (moving is just copying and preventing access to the old value).

peterjoel · December 28, 2020, 6:49pm

No disagreement on that. My comment was more about good style - global mutable state is usually not considered good style if it can be avoided.

I think it is actually a good design choice that map accepts a FnMut, but it should be encouraged to use fold. If fold is slower then that is undermined.

scottmcm · December 28, 2020, 11:36pm

I think this is comparing the wrong things, since map is an adapter and fold is a consumer. I would generally discourage using mutable state in the map closure (because of things like not being able to tell whether the item being mapped is the front or the back of the iterator). But using for_each+mutable vs fold+immutable are both totally acceptable.

Would it be nice if both optimized perfectly? Sure. But using mutable state here is no different from all the functions that people write taking &mut instead of doing move-in-and-move-out-again, where everyone agrees that the &mut version is fine, and often better.

RustyYato · December 29, 2020, 9:51pm

This topic is a dup of a github issue from earlier this year, apparently I was a part of that issue and forgot about it

Johannesd3 · December 30, 2020, 6:25pm

Once generators arrive, something like this might be a good scan replacement (this works in nightly, try it out):

use std::ops::{Generator, GeneratorState::*};

fn generator_scan<I, G>(mut iter: I, mut gen: G) -> impl Iterator<Item = G::Yield>
where
    G: Generator<I::Item> + Unpin,
    I: Iterator,
{
    iter::from_fn(move || match Pin::new(&mut gen).resume(iter.next()?) {
        Yielded(x) => Some(x),
        Complete(_) => None,
    })
}

The usage would be like this:

generator_scan(0.., |mut n| {
    let mut fac : usize = 1;
    loop {
        n = yield fac;
        fac *= n;
    }
}).take(20).for_each(|x| println!("{}", x));

What do you think?

RustyYato · December 30, 2020, 6:34pm

That generator isn't Unpin, so it will be a little less ergonomic than that.

system · March 30, 2021, 6:34pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Provide an alternative to scan (which is like a cumulative fold) that works as a cumulative reduce libs	4	683	September 4, 2022
Pre-RFC: Iterator::extend_into libs	17	1738	March 25, 2019
Iterators v2 libs	9	1043	May 8, 2021
[crazy] Replace Most of the Collections API with ranges and iterators libs	12	4549	March 25, 2019
Idea: &mut I where I:Iterator<Item=T> should be usable in for loops language design	4	671	May 27, 2019

`scan` is too complicated, suggestion for new iterator methods

Related topics