Rust generators: exploration of potential syntax

I have some ideas on what generators on stable Rust should look like.

There are many open questions, and one of them is syntax. Thankfully, we can explore that space without having to patch rustc itself. If you're interested, please file PRs or issues against this repo with your thoughts!

16 Likes

One of the limitations of Iterator currently is not being able to have a streaming iterator that reuses a buffer. e.g.

struct ReadFile<'a> {
    reader: std::io::BufReader<std::fs::File>,
    buffer: [u8; 1024],
    _phantom: std::marker::PhantomData<&'a ()>
}
  
impl<'a> Iterator for ReadFile<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        if let Ok(n @ 1..) = self.reader.read(&mut self.buffer) {
            Some(&self.buffer[0..n])
        }
        else {
            None
        }
    }
}

cannot infer an appropriate lifetime for lifetime parameter in function call due to conflicting requirements

note: ...so that reference does not outlive borrowed content

I believe the solution would be to add the lifetime to the self reference but this isn't allowed because it contradicts the trait definition. Could generators resolve this issue?

I don't think that's related to generator syntax. It just needs new traits for so-called "lending (previously streaming) iterators" and "lending streams".

1 Like

As far as syntax goes, the thing that leaps to mind is most analogous to async fn:

yield fn my_generator(...) -> i32 { ... }

which desugars to a function returning impl Iterator<Item = i32>, and

async yield fn my_stream(...) -> i32 { ... }

for impl Stream<Item = i32>.

2 Likes

My proposal is a contextual keyword gen and

gen generator() -> i32 { ... }
async gen async_generator() -> i32 { ... }

I see using -> instead of a different token or keyword has some support. Aren't people concerned about potential user confusion when seeing that? Maybe I'm being overly pessimistic, but I like to increase syntactic distance as much as possible to make it easier to notice and teach.

I think this is greatly reduced given that there will be a yield keyword somewhere in the body, and some keyword (maybe also yield) in the function definition.

Rustdoc can always hide whether it was written using generators in what it shows, if desired...

1 Like

This is less a response to the bikeshed request (though it contains some bikeshed for communication purposes), but actually a semantics argument. Apologies, this is only lightly copy-edited, and maybe could be (should be?) its own blog post. I have some previous discussion on a very draft RFC.

I don't want trait Generator<Args> { type Yield; type Return; }. I want trait FnMut<Args> { type Output; }. (And, well, FnPinMut for semicoroutines[1] which require pinning.)

My core argument here is that semicoroutines[1] and FnMut have the exact same interface from the caller's perspective:

  • You have some state object
  • You call a function with a unique reference to said state object
  • You receive a return value, and the state object is potentially mutated
    • In FnMut case, FnMut::Output
    • In Generator case, GeneratorState<Generator::Yield, Generator::Return>

These are the same API, thus, imho, should be represented by the same trait. (This is similar to how Future was originally designed with Result baked into it, with both an Item and Error type, but today was stabilized with just a singular Output type. With today's Generator, it's GeneratorState<Yield, Return>, which would be unified into a singular Output.)

In this world, the only difference between yield and return would be where execution resumes after execution. Whether you yield or return (or implicitly return with e.g. ? or tail expression), the yielded/returned value is returned as-is to the caller. If you yielded, then execution upon next call resumes after the yield. If you returned, then execution upon next call resumes at the top of the function.

This also solves the -> Return versus -> yield Yield return Return problem nicely; there's still only a single "return" type which is returned to the caller to care about. It's the exact same relationship, so it uses the same symbol.

Per OP, I don't want to get into full semicoroutines, which could potentially accept arguments to each call. Instead, we'll stick to just that required to implement a generators feature: Args = (), and Output = Option<T>.

In my envisioned semicoroutine feature, merge_overlapping_intervals could potentially look like this:

fn merge_overlapping_intervals(input: impl Iterator<Interval>) -> impl Iterator<Item = Interval> {
    //         !1              !2
    std::iter::from_fn(move || loop {
        //                        !3
        let mut prev = input.next()?;
        for i in input {
            if prev.overlaps(&i) {
                prev = prev.merge(&i);
            } else {
                yield Some(prev); // !4
                prev = i;
            }
        }
        yield Some(prev); // !4
        yield None;       // !5
    });
}

Notes:

  1. Our semicoroutine just produces impl FnMut() -> Option<Interval>. To turn this into an iterator, we use iter::from_fn.
  2. We wrap the body in a loop. This is for convenience: otherwise, the implicit tail expression would return a (), which is not Option<Interval>, leading to a type error. As loop evaluates to type !, this coerces to our closure's return type just fine. Alternatively, we could have written the last line as a tail None. Note that return None; would not work[4], as we would still have the implicit tail expression returning () after it.
  3. ? works here. As our closure returns Option<Interval>, ? just works as normal. If input.next() yields a None, we resume from the top of the function the next time the closure is called.
  4. We yield Some(interval) here, because this is a -> Option<Interval> closure.[5] Through from_fn, this fills the Iterator contract of returning items via Some(interval).
  5. We yield None to indicate the end of the iterator stream. If and only if input is fused, we could instead leave this off, looping back to the start of the closure, and input.next()? would yield None; however, because we don't require input to be fused, we have to yield it ourselves to ensure None is yielded before calling next on the input iterator again. It also just helps with code clarity to do so explicitly. This could also be written as a tail-return None rather than the closure being a big loop (see point 2).

Having closures with yield in them implicitly become FnMut (or FnPinMut as necessary) when they yield is fine; closures are already FnMut objects if they capture state mutably. However, item fn are never FnMut; they're always stateless fn() (which is more restrictive than even Fn, which can have state). As such, item fn obviously cannot just contain yield; that'd change their signature, and the signature is important to be fully prescriptive[6].

So, we do the same thing async does. async fn f(Args...) -> Ret { body... } desugars (modulo lifetime specifics) to fn f(Args...) -> impl Future<Output=Ret> { async move { body... } }. Similarly, we would now have yield fn f(Args...) -> Ret { body... } desugar (modulo lifetime and pinning specifics) to fn f(Args...) -> impl FnMut<Output=Ret> { move || { body... } }.

Note that this does not return impl Iterator, it returns impl FnMut (or impl FnPinMut). This is, I think, the sharpest point in this model. This is the breakage point that could potentially convince me against this. We don't get gen fn to declare impl Iterator in a declarative form; we get semicoroutines, which are more powerful, but put a bit more salt in the way of the desired end goal of straightline generators.

This gap can fairly easily be bridged by a macro or other syntactic transform on top: all you have to do is: error on return, replace yield $expr with yield Some($expr), and add a tail expression None. (Or, if you want poisoning[7], a tail yield None; loop { panic!("exhausted generator"); }.)

In the name of proposing a minimized subset, rather than the fully general semicoroutines, the minimal stabilization target would be yield fn. This is a bit more than just supporting creating impl Iterator, but still I think represents a nicely contained surface level feature that stands on its own merit, even without "yield closures" to go with it. (And if yield closures are bundled with yield fn, for the love of Ferris, limit it to argumentless closures, to avoid the massive second bikeshed over semicoroutine resume argument handling.[8]) I'd even be happy to see it severely limited to only semicoroutines that are FnMut, or being conservative and only generating FnPinMut semicoroutines, so long as the path to providing both remains open. (Perhaps a blanked impl FnMut for impl FnPinMut + Unpin? Adding the ability to make FnPinMut when previously limited to FnMut is a lot easier, though the auto pinning requirement adds another auto trait like inferred trait to the return type.)


[1] A bit of formal terminology, paraphrased:

  • "Subroutines" are normal single-entry/single-exit[2] routines: function calls. When you call them, execution starts at the top of the routine, and exits back to the caller upon return, potentially with a value.
  • "Coroutines" are multiple-entry[3]/multiple-exit routines, though structured enough to not just be unstructured goto. Multiple-entry[3] is familiar enough, since we've familiarized ourselves with yield and async/generators enough: when you resume a coroutine, it resumes at the last place it yielded, rather than the top of the routine. Multiple-exit is the surprising part: rather than returning control flow to a single caller (control flow is a tree), the coroutine itself chooses which coroutine to yield control flow to next (control flow is a directed graph). This can allow for very powerful state machines to be written clearly (and I believe basic block IR to be a form of coroutines).
  • "Semicoroutines" pull back the full freedom of coroutines to maintain tree structured control flow, with multiple-entry[3]/single-exit. As this functionality is often used to implement iterables, it's also commonly called "generators." Specifically: when you call a semicoroutine, it resumes execution at whatever its last yield point was, and whenever the semicoroutine yields control flow, it returns to its caller.

I use the word "semicoroutine" because it semantically separates the concept of yieldable control flow from the idea of creating an iterable/generator, which yields some Result<Yielded, Finished>.

[2] Note that the single-entry/single-exit principle does not actually mean "only one return, no break, continue, etc" like many dogmatists try to claim. Rather, it refers to the core property of functions that we take for granted today: that they only "enter" at one place (callers always call the start of the function, and don't jump into the middle; functions don't share asm tails), and only "exit" to one place (we always ret to the caller). (Fun note: jmp for tail calls breaks this principle if you have more than one tail call option, as your function could exit to more than one other label! But from an outside perspective, the principle is held, as it will eventually ret.) This is about structured control flow and encapsulation of subroutines; the callee knows exactly how it's going to be called, and the caller knows that it will regain control in an expected manner.

[3] I use multiple-entry because formally it is correct -- the routine resumes in the middle -- but it's important to note that you still call the semi/coroutine in a uniform way, which then automatically picks the correct resume point for you. This form of "structured multiple-entry" is isomorphic to a single-entry which immediately checks the state and jumps to the correct execution point. (Similarly, multiple-exit can be implemented via returning to a trampoline which calls the next routine as indicated.)

[4] However, it can easily be made to work. Because the tail expression is unreachable, the type checker could be made to ignore the tail return (and only the tail return) if it is statically unreachable. This would allow an explicit tail return to typecheck, as the implicit return is unreachable, thus suppressed. However, an unreachable explicit return should still be a typecheck error. I believe this matches the behavior of straightline functions, so should be the default behavior. Then why have I made it a footnote extension instead? I guess maybe it makes my return point clearer? ¯\_(ツ)_/¯

[5] This could even maybe be composed with try. While try blocks wouldn't be helpful (yield/return targets the function directly, rather than the block, and you can't yield a block), a try closure could provide Ok-wrapping (Some-wrapping) for yield/returned values, and yeet[9] could specify the end case (None). This only works for generator-like semicoroutines, though, where you have many yields terminated by a single yeet.

[6] Well, auto traits for existential (impl Trait) return types…. These are, in my opinion, the exception that proves the rule. Still need to find a solution for trait functions, though…

[7] I consider explicit control over poisoning semantics a benefit of this scheme, by the way.

[8] For what it's worth, I believe this model works best with the "magic mutation" syntax, though I despise that it's called that. Rather, you just get a new set of argument bindings each time the closure is resumed. The old ones are just dropped or shadowed (I honestly don't know which is better semantics).

[9] throw, bail, etc: https://areweyeetyet.rs/

15 Likes

I think that ship sailed with async fn foo(...) -> i32 desugaring to fn foo(...) -> impl Future<Item = i32>. I wasn't thrilled with the proposal at the time for similar reasons, but I'll say that it's worked out fine, and it hasn't been a big problem in practice.

I guess you could make the argument that regular -> T return and impl Future<Item =T> return are semantically quite similar, in that they both ultimately end up returning a single value, whereas yield fn ... -> T looks like it might return a single value, but actually returns multiple. My proposal tries to mitigate that by putting a big fat yield up front to give context for the whole function signature.

On the other hand, I also know how easy it is to get myopic about reading just the part of the signature you're focusing on, so it might be easy to overlook the yield fn at the front if you're focusing on the return value.

But I think -> T is very strongly associated with "returns this type somehow", even if the specific details of how it's returned is more nuanced and depends on the rest of the signature. I know I often do a quick eyeball scan for -> to get at least a sense of what a function is returning; I expect that would break for a while if I also need to look for yields.

So maybe to get the best of both something like fn foo(...) -> yield i32 { ... } is a possibility (which isn't that far from fn* ... yields T). But to my eye that does bury the "I am a generator" as a detail in the middle of the signature rather than stating it up front, and it's pretty inconsistent with prefix async fn.

fn* foo(...) yields i32 { ... } covers everything by having the fn* prefix and a yields at the back, but looks pretty alien as Rust syntax.

5 Likes

I've only skimmed your post, so apology if you mention this, but that equivalence is only valid if the generator essentially returns !.

You allude to this by saying the macro could add loop { panic!("exhausted generator");, but I think it could be more explicit.

With the current Generator trait, you call it, and you get GeneratorState<Yield, Return>. My thesis is that we don't have to bake GeneratorState into the design of semicoroutines; they should return an arbitrary Output type to the caller (the same way we removed Result from being baked into Future).

Generators as used for impl Iterator (which is all that this thread is supposed to be about) in todays Generator trait have Yield=T, Return=(), which is equivalent to a single Output=GeneratorState<T, ()>, which is isomorphic to Ouput=Option<T>.

What happens to a current Generator trait design after you've yielded the Return value? Either it resumes from the top (loop body), or it poisons/panics (loop { panic!() } tail).

2 Likes

I get it but I don't understand why.
For me there's a sort of continuity between plain functions and generators: the latter (as defined in nightly) have arguments and a yield type, but they can also have a return type (and a resume type). So how will you integrate the return type in the future if you already used -> for the yield type ?

I understand that decorating an fn with yield and resume types will make it quite heavy, that would mean it would have:

  • pub
  • < generics >
  • where clause
  • ( arguments )
  • -> return type
  • ... with optional impl
  • yield type
  • resume type
  • async specifier
  • maybe someday a throw type

but maybe that's the price to pay for flexibility, and I don't find it too high.

1 Like

I think the auto-wrapping and single type for gen makes sense here (rather than separate Yield and Return types). Something I could forsee:

async gen fn async_gen() -> u32 { // impl Future<Output=impl Generator<Output=u32>>
    // A future with a generator type inside.
}

gen async fn gen_async() -> u32 { // impl Generator<Output=impl Future<Output=u32>>
    // A generator of futures
}

So you can't have both Yield and Return types ?

I was convinced by this post up-thread. If you have uniform Yield and Return types, it's just noise. If you have a () return type, you just use Option<Yield>. The same logic behind removing the Future::Error type makes sense for Generator::Return to me. Basically if you have different Yield and Return types, use Result<Yield, Return>; the language already has algebraic types, so why special case Generator by forcing a sumtype if Option<T> just ends up being more idiomatic anyways?

1 Like

There's two problems I forsee with async yield fn as -> impl Future<Output = impl FnMut() -> T> and yield async fn as -> impl FnMut() -> impl Future<Output = T>, though:

  1. The obviousness/teachability of ordering the two keywords for different semantics isn't great. It's definitely pure in an academically attractive fashion, but not great for code clarity. (Maybe type safety mitigates this?)

  2. Stream isn't Iterator<Item=impl Future>. It's not even LendingIterator<Item=impl Future>. This complicates approximately everything.

Actually, IIRC, the proposed syntax is gen fn generator() -> Yield -> Return right? How is this to be parsed?

gen fn generator() -> fn() -> InnerYield -> InnerReturn -> OuterReturn

Or would we have:

gen fn generator() -> gen fn() -> InnerYield -> InnerReturn -> OuterReturn

Of course, parentheses may be required in such cases, but I think that probably runs afoul of some of the goals of Rust's grammar with respect to ambiguous parsings either way.

Maybe there's room in the syntax for gen(async) fn and async(gen) fn to help make it obvious? Then one also has async(async) to consider…

That makes sense yes. Good sugar is still needed though.

My problem with the fn* syntax is that it can't be reused for generator blocks. The nice thing about async is that the same keyword can be used for marking functions and for async blocks.

I'm not sure how I feel about the gen keyword, but it's definitely better in this regard...