Extend for loop syntax to support more complex cases

[EDIT]

Everything I said previously is irrelevant. @huon made an excellent point which makes this all irrelevant.


Note: I’m not prepared to turn this into an RFC PR. At most I might make this into an RFC issue. I tried to make this clear. I hope it is.

This is a take on what a more flexible for loop syntax might look to support some more complex cases a regex parser would require. This is also difficult to setup for a regex because the syntax must be extremely flexible. I think more than this would be required. I don’t know if the concept would fit into Rust though. I was mainly just trying to find a reasonable syntax.

cc @P1start (I think you were looking at iterators once)


In a regex parser you might want to parse something like r"(dave)" or r"([abc]+)". Each different character set such as (), [], and + all will require differently parsing, and thus will likely be separated into separate functions. This makes using iterators difficult because more specialized parsing cannot be delegated to separate functions

// Doesn't work but even if it did, it's repetitive
// repeating the `for` for every new expansion.
fn parse_more(v: Iterator) {
    for j in v {
        println!("{}, {}", i, j);
    }
}

fn main() {
    let v = 0..10;
    for i in v {
        // Cannot send `v` into new function to continue parsing.
        parse_more(v);
    }
}

This leaves you to manually iterate over the sets via indexing a Vector with loop and such. This is problematic because every time you need to specialize parsing a group of characters you must redo indexing again which is repetitive and ignores iterators.

struct S {
    index: usize,
    vec: Vec<char>,
}

impl S {
    fn curr(&self) -> char { self.vec[self.index] }
    fn new(v: &'static str) -> S {
        S { index: 0, vec: v.graphemes() }
    }
    fn next(&self) -> bool {
        self.index += 1;
        if self.index < self.vec.len()
    }
    fn parse_paren(&self) {}
    fn parse_everything_else(&self) {}
}

fn main() {
    let s = S::new(r"(dave)");
    while s.next() {
        match s.curr() {
            '(' => s.parse_paren(),
            _   => s.parse_everything_else(),
        }
    }
}

This proposes multiple things which I think are all required to handle this:

1: for match: combine for and match

The problem is by calling parse_* you are required to pass the interators current state outside of the for to handle iterating which cannot be done. Since you aren’t modifying the iterator, it would be nice to continue using the iterator while allowing logic separation.

fn paren() {}
fn remains() {}

for c in r"(dave)".graphemes() {
    match c {
        // Won't work because requires passing
        // the iterator state outside the `for` loop.
        '(' => paren(),
        _   => remains(),
    }
}

What about supplying functionality connected to the match body. This is problematic because the functionality depends on which iteration. The first ( match state that all the following iterations will go to that function branch until recalled but there’s no reason a match have any foreknowledge of which iteration it’s on.

let mut m = vec![];
for c in r"(dave)ted".graphemes() {
    match c {
        // If `(`, call `paren` after `match` close. Otherwise, push char
        // and move to `remains` *function branch* next iteration.
        '(' => to.paren(),
        _   => m.push(c),
    } paren {
        ')' => to.normal(), // Recall to normal `match` for remaining iterations.
        _ => m.push(c),
    }
}

By combining for and match, the iterator count could be known. For an AST which could be recursive, each function branch would need to have it’s own information so they could eventually be returned to be appended at the end.

use AST::*;

enum AST {
    Char(char),
    Group(Vec<AST>),
}

// Trying to parse into: `Vec(Char('a'), Group(Char('b'), Char('c')), Char('d'))`
let chars = r"(a(b,c)d)";
let mut m: Vec<AST> = vec![];

for match c in chars.graphemes() {
    // Pass to `paren` functionality next iteration. Initialize `inner` so that
    // `paren` can immediately start pushing to it. When `paren` calls `leave`,
    // return to normal `match` branching. Values can be returned via `leave`.
    '(' => { let inner = vec![];
             paren() },
} paren {
    // If inside `paren` and finds another `paren`, go recursive. `inner` must
    // be kept track of separately.
    '(' => { let inner = vec![];
             paren() }, // Recursive
    _ => {
        // Check if already has been run recursively. If so,
        // push result onto vector.
        if child.exists() { inner.push(child) }
        // Finds `)` then return. Variable is saved to `child`.
        if v == ')' { leave(Group(inner)) }
        inner.push(Char(c));
    },
// Not sure the best way to return something here though.
} m.push(leave); // push completed AST onto `m` when complete.

2: peek() and next() available inside a loop

peek() must be available inside a for loop. Maybe it should take a range and yield a slice. Not sure. next() would take an integer. Names could be changed. Maybe related to continue.

use AST::*;

enum AST {
    Char(char),
    Dot,
    Ellipsis,
}

// If `.` is anycharacter and `...` is something else then they must
// be considered separately.
let chars = r"...";
let mut m: Vec<AST>;

for match c in chars.graphemes() {
    '.' => match peek(1..3) { ".." => { m.push(Ellipsis);
                                        next(2); },
                              _    => m.push(Dot), },
    '_' => m.push(c),
}

I’m having a hard time understanding why the iterator-less approach is not acceptable. I guess I don’t know what you mean by “redoing indexing.” I usually define a method called bump() -> char which returns the current character and moves the parser to the next input character. (Along with other helpers, like peek.)

Parsing a regex is also a strange use case here because regexes are small, so it’s pretty cheap to just toss it into a Vec and use indexing. A stronger use case for iterators might be parsing something that is very large, so the laziness of iterators gets you a performance benefit without extra work.

I also find your proposed solution a bit hard to read. e.g., is paren() a function call? Does it share the same stack frame as the calling function? What are its semantics? (You usually don’t want a regex parser to be recursive, since it opens you up to blowing the stack pretty easily.)

What you want seems like it’s already quite easy to achieve with a for-loop: just define an iterator adaptor, which takes an iterator over say ‘char’, and gives back an iterator over some custom type which implements ‘peek()’ and the other methods you require within the for loop.

I can see the case for a combined match/loop statement like “while let”, but where the inside of the loop is a refutable pattern match, although you can get similar behaviour like this:

    let mut a = 10;
    while match a {
    0 => false,
    _ => {a = a - 1; true}
    }{}

It’s definitely not a big issue. I just think the for loop looks very elegant compared with while s.next() match curr(). So I was trying to see if I could discover a syntax which would make it valid to use iterators in much more complex use cases. This is as close as I got.

“redoing indexing” meant requiring each function to call while ... match .... Technically, if main is iterating forward and a sub-function wants to iterator forward, maybe only one needs to go forward.

Ug. This seemed niftier an hour ago. They were meant to be like function calls and the meaning a function call would have in this position is really messy. Returning is also problematic. Though, treating them like function calls seemed necessary for recursion as I was going for.

It had occurred to me that a function would need some type info but I wasn’t concerned with it nor am I now. I was only concerned with whether this is a valid concept or not. The specific details like that didn’t concern me.

for match c in "12345".graphemes() {
    _ => { let mut inner = vec![]; // 1 4
           branch(); }
} branch {
    3 => { branch(); }     // 3
    _ => { inner.push(c) } // 2 5 (4 new function call)
}

The previous is really no different from this.

use E::*;

enum E { A, B }

let e = A;
for c in "12345".graphemes() {
    match (e, c) {
        (A,_) => { let mut inner = vec![];
                   e = B;
        (B,3) => { e = A; }
        (B,_) => { inner.push(c) }
    }
}

@burntsushi @Diggsey Thanks for comments. I find @Diggsey’s interesting too.

Another technique for the first one is to continue using an iterator, but use while let and references, e.g.:

fn parse_more(v: &mut Iterator) {
    for j in v {
        println!("{}, {}", i, j);
    }
}

fn main() {
    let mut v = 0..10;
    while let Some(i) = v.next() {
        parse_more(&mut v);
    }
}

Of course, this doesn’t solve the fact that loops are repeated, but it allows one to stay in iterator-space, rather than falling back to indexing.

1 Like

@huon That’s clever and solves the main thing I was trying to workaround. I was just trying to wrap up the duplicating loops into the solution.

I’m aware the performance difference is probably small but getting it to work with iterators seemed kinda cool.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.