More about step_by

In inner loops (two or more levels deep) I now avoid using step_by, especially if the step value isn't a compile-time constant (extra note: LLVM is often unable to remove the assert!(step != 0)). I think a range loop with a step is too basic and important a feature to carry this cost; I'd like to use it mindlessly, as a zero-cost abstraction over a while loop.

But perhaps this is not possible. So perhaps we could add a Range::step method (and one for RangeToInclusive etc.) that takes a generic value of type Idx (the same type as the Range). This is special-casing, but combined with a compiler (not Clippy) deprecation/suggestion lint steering people away from step_by in that common Range situation, it could improve the current state of things.
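A rough sketch of what such a method could look like as a user-side extension trait. The names `RangeStepExt` and `StepRange` are invented here, not a std API, and overflow handling is simplified:

```rust
// Hypothetical sketch: a direct translation of
// `while start < end { yield start; start += step }`.
use std::ops::Range;

trait RangeStepExt {
    fn step(self, step: usize) -> StepRange;
}

struct StepRange {
    start: usize,
    end: usize,
    step: usize,
}

impl RangeStepExt for Range<usize> {
    fn step(self, step: usize) -> StepRange {
        assert!(step != 0); // still checked, but only once, at construction
        StepRange { start: self.start, end: self.end, step }
    }
}

impl Iterator for StepRange {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        if self.start < self.end {
            let res = self.start;
            // saturating_add keeps the overflow handling simple for a sketch
            self.start = self.start.saturating_add(self.step);
            Some(res)
        } else {
            None
        }
    }
}
```

With this, `(0..10).step(3)` yields 0, 3, 6, 9.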


I recently landed an update to the Step trait, which is responsible for the Iterator impl for Range* types. Is the behavior you're looking at from before or after that PR landed, and was it impacted by that PR?

(I suspect it probably has no impact, but it might've and is worth asking.)


Side note: using NonZeroUsize doesn't really help, as LLVM doesn't know it's nonzero.
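To illustrate: even if the caller proves the step nonzero in the type, the invariant is erased at the `.get()` boundary, because step_by takes a plain usize (the wrapper function below is just a hypothetical example):

```rust
use std::num::NonZeroUsize;

// The caller proves `step` is nonzero in the type system, but `step_by`
// takes a plain `usize`, so the invariant is erased at `.get()` and
// LLVM sees an ordinary integer again.
fn sum_with_step(end: usize, step: NonZeroUsize) -> usize {
    (0..end).step_by(step.get()).sum()
}
```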

I don't know whether your update has improved the situation for me, but what I'm describing is based on tests I ran yesterday, so I'm afraid your change isn't enough yet. That's why adding a dedicated step function just for ranges is perhaps one of the few solutions.

And regarding NonZeroUsize, I reported a related issue in the past:

The intrinsic problem with step_by currently is that it's implemented roughly as

use std::iter;

fn step_by(mut start: usize, end: usize, step: usize) -> impl Iterator<Item = usize> {
    let mut first = true;
    iter::from_fn(move || {
        if start < end {
            if first {
                first = false;
                let res = start;
                start += 1;
                Some(res)
            } else {
                start += step - 1;
                let res = start;
                start += 1;
                Some(res)
            }
        } else {
            None
        }
    })
}

whereas for ranges specifically, it would be better implemented as roughly

use std::iter;

fn step_by(mut start: usize, end: usize, step: usize) -> impl Iterator<Item = usize> {
    iter::from_fn(move || {
        if start < end {
            let res = start;
            start += step;
            Some(res)
        } else {
            None
        }
    })
}

(Both of these examples are complicated by proper handling of overflow.)
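For comparison, here is a plain while-loop formulation, which yields the same items as the library step_by for inputs that stay in range (saturating_add sidesteps the overflow edge cases mentioned above):

```rust
// A while-loop formulation of stepping, the baseline the sketches
// above are trying to match.
fn while_step(start: usize, end: usize, step: usize) -> Vec<usize> {
    assert!(step != 0);
    let mut v = Vec::new();
    let mut i = start;
    while i < end {
        v.push(i);
        i = i.saturating_add(step);
    }
    v
}
```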

This is, however, explicitly allowed for by the step_by documentation (thanks to @Emerentius, via Issue#52065 and PR#52193).

I do not know if doing this optimization with specialization is allowed under today's policy, as the result is semantically observable. However, we have the space to make this optimization within step_by.

Alternatively, we could add a "next_and_advance" method to Iterator and always use it, which would work even without specialization. (EDIT: wait, no, this can only be provided in general for FusedIterator types, so it would require specialization in any case.)

I'm not sure if I understood this correctly, but if the repeated check on the first: bool flag is what's supposed to be problematic, then the underlying problem is not really just the implementation of StepBy, but rather the fact that for loops for some reason still have this horribly inefficient translation in which they use .next() instead of something based on .try_fold().

The other problem, that start != res all the time (they're always off by one, and I don't know for sure, but the compiler could well fail to exploit that in any optimization), of course doesn't go away.


I'd be curious what exactly the problem you were describing in your original post is, @leonardo. You mention the step != 0 assertion being in an inner loop, which sounds odd, since that check only happens once each time you create the StepBy iterator, and I'd assume there's an even more inner loop that iterates over said iterator.

Like I already said above, I'm unhappy with the translation of for, and since that will perhaps not change quickly (or at all), in the meantime I'm also curious whether your situation improves if you switch the inner loop from a for loop over the iterator to for_each or something like that (provided you didn't already try that).

Interestingly, Rust originally used a try_fold-like mechanism, and try_fold was reintroduced for performance reasons. For convenience, here are the three reasons @veedrac gave for Rust choosing external iteration for for loops:

  1. Iterators adaptors used to be hideous
  2. The compiler can’t reason about termination
  3. Internal iterators are inflexible

(1) is pretty much handled by better, more modern design with the Iterator trait. (3) is handled by the fact that you can still call next.

But (2) is a problem. Rust cannot desugar [playground]

let x;
for i in iter {
    x = i;
    break;
}

to use try_fold [playground]

let x;
{iter}.try_fold((), |acc, i| {
    x = i;
    None
});

because control flow of try_fold is controlled by the user and thus unreliable. I can (probably; passing around the opaque accumulator makes it difficult) implement for_each in a way that breaks control flow.
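To make that concern concrete, here is a deliberately pathological iterator whose internal iteration disagrees with next. Since overriding try_fold requires the unstable Try bound, this sketch overrides for_each instead, which exhibits the same problem:

```rust
// `for i in Chatty { .. }` runs zero iterations, while
// `Chatty.for_each(f)` calls `f` three times: the `for` desugaring
// cannot blindly trust user-provided internal iteration methods.
struct Chatty;

impl Iterator for Chatty {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        None // external iteration sees an empty iterator
    }
    fn for_each<F: FnMut(u32)>(self, mut f: F) {
        // invents items that `next` never yields
        f(1);
        f(2);
        f(3);
    }
}
```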


(It's also handled by the fact that .next() is equivalent to .try_for_each(Err).err().)

While I mostly agree, the iterator is also coming from the user. So it's not obvious to me that it would be more wrong for Rust to trust it to implement try_fold reasonably than it already is to trust that it implements next reasonably (and .into_iter() reasonably).

Well, in this case of at most one iteration, using .next() wouldn't have any performance disadvantage anyway: the compiler should emit a single call to .next() and not even produce a loop. I'm not saying that unconditional use of try_fold is the way to go; rather, some static analysis would determine the right method to use among .next(), .for_each(), and .try_for_each(). Note that for_each() can have even better performance than the try_ variant, since it doesn't need to keep internal state for the possibility of stopping in the middle of the iterator. I haven't thought this through 100%, but I'm fairly positive that every for loop could be implemented with a single call to the iterator interface.

Expanding on my previous answer, of course there are more complex scenarios that I haven’t thought about too much yet. I’ll try to come up with a translation of this one:

struct S;
impl Drop for S { /* ...  */ }
let s = S;
let x0 = 0;
let mut ref_to_x = &x0;
let x;
for i in /*iter*/ {
    let locked_mutex = /*...*/;
    if some_condition() {
        drop(s);
        x = i;
        ref_to_x = &x;
        break;
    } else {
        stay_in_loop();
    }
}

Edit: One way I can think of (the details below are admittedly a bit magical) would use try_fold after all

// for reference
fn try_fold<B, F, R>(&mut self, init: B, f: F) -> R
    where
        F: FnMut(B, Self::Item) -> R,
        R: Try<Ok = B>,

with a type for B that offers no methods, and in particular no Clone impl, to the implementor of the Iterator. Hence all they can do (without unsafe) is pass around a single B as long as the function f: F does not return the error case; in other words, they cannot really break the way loops work, and in particular they cannot continue executing the loop after the first time it did break or return. Thus the compiler could hide the current stack pointer (where the for loop is at) inside the type that gets used for B (or, perhaps better, make B actually zero-sized and put the pointer inside the F instead) and implement the function for the loop body in an unsafe way, directly accessing the original stack.

This approach would mean that every for loop would be translated into either a try_fold (in case the loop contains break and/or return) or as a for_each (in case the loop contains neither break nor return; the for_each translation could use the same technique of passing the stack-pointer).

Here's how you can desugar it to try_fold

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=50074ac773c67cfed8c29bc9f11fa683

The only unusual case is where we need to initialize a variable, but since this is a compiler desugaring, that shouldn't be an issue; we can just point to an uninitialized variable. (continue can be handled as return Ok(state);)

edit: for labeled break/continue we can use an autogenerated enum that specifies which label to jump to and use something like this,

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=ce85e59e3b02223e4ca8764dcbcbe4e9


Yeah, thanks for carrying this out. I'm not quite convinced by the moving around of s. It seems rather expensive, and it wouldn't scale to the case where there's another path to a break without a drop(s) leading up to it. Other than that, this looks a lot like what I (very roughly) had in mind already.

An alternative IMO would be something along the lines of let mut s = ManuallyDrop::new(S); with an extra drop_flag (that’s needed for s anyways). Then your State consists exclusively of references into the stack of iter_to_for_desugared.

Which in my opinion would perhaps be even nicer to collapse into a single stack pointer. The compiler would know the correct offsets to use anyway. But I'm no expert on LLVM; perhaps it is difficult to express this kind of behavior in LLVM correctly, especially without ruining optimization. Speaking of optimization, this closure and the call to try_fold would be inlined in most cases anyway, so it doesn't really matter what happens, as long as inlining and the optimizer are smart enough to remove all the indirections again.


And on a final note, here’s a version with a zero-sized state, which means less passing around stuff inside the try_fold impl: LINK

In this case, the references are captured by the closure. I guess that making the closure smaller by combining references to the same stack frame into one would be a possible optimization independent of this for loop translation. → Closure size in the example.


Yeah, this could definitely be improved if it were implemented in the compiler (for example by using a single pointer to the stack frame); I just wanted to show a proof of concept based on your description. Whether this is a desugaring we should do is a completely different question.


@CAD97 I already created a PR (#51601) that specialized StepBy<Range<_>> and which resulted in better code generation, but it was reverted later due to #55985.

Like I wrote in that issue, I think that the solution is to simply specialize more methods so that they behave consistently.


I only did a specialization of Iterator (I believe DoubleEndedIterator probably needs to be specialized as well), but I sketched out what specializing more of the methods could look like:

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=88e742a487481bd3a61c93c7189593af

This is an observable use of specialization, though, so I don't actually know if it should be allowed in the stdlib before the required part of specialization is stable.

You can, however, pull out just the specialized impl and use it as a custom StepBy in your own code. It should (I did not actually measure) generate better assembly.