Pre-RFC: CPS transform for generators

Continuing the discussion from Async/Await series:

I’ve been thinking about the questions posed in @withoutboats’ conclusion, of how stable async functions will work. More specifically, how they might handle resumption into nested call stacks. Pin offers an interesting change we could make there that, I think, would give us a more solid conceptual framework (continuations) and a faster and more straightforward implementation (directly resuming nested futures).

Here’s a simple example using strawman syntax, which I will refer back to:

// spawn a task running f
tokio::run(async || {
    println!("{:?}", f("a"));
});

// await three callee async fns
async fn f(s: &str) -> io::Result<T> {
    let a = fetch(&format!("https://example.com/{:?}.txt", s)?;
    let b = fetch(&format!("https://example.com/{:?}.txt", a))?;
    let c = fetch(&format!("https://example.com/{:?}.txt", b))?;
    let r = process(a, b, c)?;
    Ok(r)
}

// some async fn that presumably suspends several times
async fn fetch(url: &str) -> io::Result<String> { /* ... */ }

The current design, as I understand it, has the executor hold a single allocation for the outer-most generator (the closure passed to run), which contains the inner generators. Each call to poll walks into the inner-most generator and then back out. For example, on the first invocation:

  • The closure calls f, f calls fetch.
  • fetch starts the first request and gives the Waker to the event loop.
  • fetch saves its state and returns; f saves its state as "awaiting fetch" and returns; the closure saves its state as "awaiting f" and returns.

Starting at the outer-most future every time a task resumes may be useful in some cases, but it is unfortunate for these typical straight-line async fns:

  • The entire call stack is (de)serialized for every call to poll, including frames that won’t be used that time.
  • Every suspension point is a loop in the generator, making the resulting binary larger.

One solution comes from framing this in terms of “continuations.” A continuation represents “the rest of” some suspended program. You invoke it and the program resumes directly from the point it was suspended. A program can be transformed into continuation-passing style (CPS), usually by a compiler, so that functions never return, and instead invoke their caller’s continuation explicitly.

Using continuations, instead of resuming generators from the outside in, the task resumes directly into the inner-most generator, because that’s the continuation from the last suspension. When that generator suspends again, it is the only one that is touched. When it completes, it resumes its caller (via a tail call) because that’s its continuation.

I encountered this fusion of generators and continuation-passing style in Kotlin, where it’s enabled by a slightly different model for “coroutines,” basically a generalization of Rust “generators.” Using those terms:

  • Like a generator, a coroutine is compiled into an anonymous type’s resume method, which dispatches on its current state and returns at suspension points.
  • Unlike a generator, a coroutine is initialized with a reference to its caller coroutine, which it saves until it completes, when it resumes it.
  • Unlike a generator, suspending a coroutine is done with a call/cc-like intrinsic. Call/cc takes a block, runs it with the current continuation as an argument, and then suspends the coroutine, yielding the value of the block. On resume, call/cc appears to return the passed-in value. This has two uses:
    • To yield, the block saves the continuation and returns the yielded value.
    • To await another coroutine, the block initializes the callee with the continuation and resumes it via a tail call.

To make this work in Rust, we need a few things:

One is a way to tell the executor where to resume, because now all the resume points are after the calls to the awaited functions, which are no longer resumed in loops. This could perhaps be done by including that information with the Waker on suspend; or else by storing it in the outer-most generator somehow.

Another is a way for the outer-most generator to complete, since it will have no generator to resume. This might be done with a stub implementation of the continuation interface, that bridges the two worlds by immediately returning to its resumer. There may be a better way, or perhaps some optimizations that can be applied in some situations.

The last thing is a way for callee generators to store their caller’s continuations, as a replacement for the current (de)serialization between a) the caller generators’ state in the loop that resumes the callee and b) the return address and self: Pin<Self> saved on the call stack. This is where Pin comes in! The callee generator can now store a pointer to its caller’s state block, for use when it completes and tail-calls into its caller.

This approach does have a few downsides. Of course, it’s still relatively vague as far as actual types and signatures go, so some problems may turn up there. Here is what I’ve thought about so far:

One issue is that generator completion involves dynamic dispatch. This is not actually a performance problem: it’s the same as returning from a function (minus any call/ret prediction the CPU normally does), and with MIR inlining before state machine generation it can be optimized the same way. It is perhaps a type problem, unless it can all be kept post-type-checking somehow.

A bigger issue is that it makes all generators immovable, as opposed to just the ones that keep references to local variables across suspension points. First, because callees have pointers to their callers, which are in the same allocation. Second, because Wakers are accompanied by references to the generators. This is unfortunate because AIUI avoiding that was part of the reason for that design to begin with.

6 Likes

Maybe add a way for users to inject a Place for the stack at yield points?

I’m not sure what you mean by that- could you clarify? There is currently no allocation or stack copying at suspension points, and we don’t want any- one of the core features of Rust’s futures design is that the whole tree of nested futures all gets put in one value and thus one allocation.

In fact, that’s probably why we ended up with this “rebuild and tear down the stack every time” problem- all previous async/await implementations* have been in languages with garbage collection, and so they “by default” allocate each generator separately, making this CPS transform much more natural for them. (I singled out Kotlin because it actually uses the “continuation” terminology and is fairly principled about it.)

* Except C++, which also allocates each generator separately, but for different reasons.

1 Like

There is currently no allocation or stack copying at suspension points, and we don’t want any- one of the core features of Rust’s futures design is that the whole tree of nested futures all gets put in one value and thus one allocation.

Oh absolutely, this is an extremely important property of rust generators and the raison-detre over more ergonomic stackful coroutine designs.

I'm not overly familiar with the exact structure of Rust generators, my terminology may have been off. The idea would be that instead of returning impl Generator on the stack a Generator function could be placement new'ed into a location such as a box. So while the Generator itself may not be moveable, how common is it to move it after it has been allocated somewhere? It seems like we mostly just need the ability to decide if the Generator is on the stack or on the heap.

Ah, I see. That’s solved independent of what I describe here- the result of the Pin work is that you can move the impl Generator around, and it only becomes immovable once you resume it. So we don’t actually need Place, it’s just an optimization around the usual idiom of returning the impl Generator by move when it’s first created.

The downside you quoted is that, while currently the only immovable generators are ones that take references to “stack” variables, with this CPS transform that would grow to include generators that await inner generators, because awaiting would create a reference to the “stack” itself

1 Like

I’d like to note that this post uses CPS but doesn’t explain what the acronym is for those that don’t know.

(I assume you mean Continuation Passing Style)

3 Likes

Ah, thanks! I’ve tweaked the post to clarify a couple things.

2 Likes

The entire call stack is (de)serialized for every call to poll, including frames that won’t be used that time.

I don't understand what this means; to make this all work, the data needs to live in the heap at the start, and never move. AFAIK there's no "serialization" involved.

I think it means to say ”traversed”; in the sense that if an event unparks a task, the nested futures have to call the inner ones, one “nesting” a time, instead of jumping directly to the instruction where the blocking call originated. Compare this to a call stack that can be “ frozen in time”, and jumped directly back into without any traversal.

But that’s not the case when using generators (i.e. async/await): they compile down to a flat state machine.

What I mean by "serialization" is that every generator has two forms it can take: When it is running, its state is encoded in the instruction pointer, the registers, and the host thread's stack. When it is suspended, its state is encoded in its state block.

Before it suspends, it "serializes" itself- the instruction pointer determines the state number to be written into the state block, and live values are saved into the state block.

When it resumes, it "deserializes" itself- it switches on the state number to restore the instruction pointer, and it reloads live values from the state block.

Some values may remain in the state block, like values behind references and especially immovable ones like nested generators' state blocks, but the key is the conversion back and forth between state number and instruction pointer.

This is not quite true, and that's the case where the (de)serialization becomes a problem. A single generator is a flat state machine, yes. But a generator awaiting another generator is two state machines in the same allocation- the inner generator's state block is one of the outer generator's "live values."

When a generator is resumed in the middle of an await, it restores its instruction pointer into a loop where it calls resume on the inner generator again. This call pushes the restored instruction pointer onto the stack as a return address, and then the inner generator restores its own instruction pointer.

Thus, for each active await, the task switches on a state number to restore the instruction pointer and pushes it onto the stack, and then later returns to it and converts it back to a state number.

1 Like

To put it another way, resume ends either by yielding or completing, and we currently require both go to the same place- the return address on the host thread’s call stack, which points into the awaiting caller generator’s resume method.

But really we want yielding to go directly out to the executor, and completing to go back to the awaiting caller generator. So we need to save that return address (or rather something equivalent that will last until we complete) in the suspending callee generator’s state block, just like any live value.

And, since the caller generator may have more state on the host thread stack below that return address, including another return address, it needs to save it and clean up after itself before resuming the callee generator- which makes that call to resume a tail call.

1 Like

This (the core of your motivation) seems like it ignores inlining optimizations. Inlining these generators into one generator seems like the obvious solution to this problem to me.

Inlining is great, and definitely helps, but it’s not a complete solution. (It also helps this proposal, as I mentioned in the first post.)

First, unless it’s done before state machine generation (i.e. on MIR), or LLVM gets extremely lucky and is able to combine the state numbers and simplify the resulting control flow, there will still be multiple nested state machines.

Second, inlining doesn’t always happen. Sometimes it’s better to leave a function out of line, like when it’s large and/or called from many different locations. (Technically, because generators can’t be recursive, we could just always fully inline them, so maybe “doesn’t” should be “shouldn’t.”)

Third, generators aren’t the only things you might want to await. Hand-written futures cannot be inlined the same way, because they’re already state machines. Hand-written futures could still take part in a CPS transform if we specify the interface right.

(Edit I: Fourth, there’s also debugging. If you’re running your program in a debugger and have to repeatedly step in and out of an unoptimized stack of generators, that will get old fast. There’s also the problem of debug builds being too slow to be useful. Reducing await's reliance on the optimizer will help both these cases.)

(Edit II: Fifth, a CPS transform would work across dynamically-dispatched generator calls while inlining wouldn’t. I don’t know how common e.g. async trait object method calls will be, but it would be great not to have to do the dynamic dispatch on every resume!)

I expect both situations 2 and 3 will be relatively common- larger functions that wouldn’t be inlined in synchronous code (including functions that are only large after a round of inlining), and hand-written combinators that generators await (e.g. for things like awaiting multiple futures in parallel).

If we truly want to call async/await a zero-cost abstraction, it ought to compile down to something that doesn’t have the “(de)serialize nested generators” problem.

2 Likes

Out of curiosity, do we have any evidence of the performance impact here?

I had some similar concerns when we were designing futures, since if you use a large tree of combinators, you end up with a nested enum representation, and on each resumption go through a cascade of matches. I spent some time trying to measure the overhead, and had to create ridiculous levels of nesting to measure any overhead at all.

2 Likes

Oh, one other question: I don’t fully grasp what you’re proposing here, but I think that the most pressing question is whether you think that async/await syntax based on the current design is forward compatible with such an interpretation later. We are pushing hard to land async/await for the 2018 edition, and definitely aren’t in a position to consider a change of this magnitude to make that schedule.

It's hard to fairly measure the performance of something that isn't implemented. We do know that C#, Kotlin, and Javascript all use CPS, while Python uses nested resumption, so we may be able to learn something there.

But really, zero-cost abstraction doesn't just mean nested resumption is "fast enough" for most use cases- after all, people write async code in all of those languages just fine. Zero-cost abstraction means you will never need to replace it with a hand-written equivalent, because its output is the same (or better!).

So while I don't know what scenarios you used to test combinator performance, or what you compared them to, those benchmarks aren't really my motivation. I want to avoid the scenario where someone builds something on generators, and then finds a scenario where that overhead does matter, but they can't do anything about it other than a rewrite. This is the sort of thing that happens to game developers using C# and its GC, for example- it's not that GC is "too much overhead" (in fact it's often faster than naive malloc/free), but that it's not a zero-cost abstraction.

To be more specific, today the focus for async/await is entirely on its use for networking. The conventional wisdom is that networking code is IO bound- again, people use much slower languages than Rust just fine in this space. But networking isn't the only application for generators. They could potentially be great for GUIs, for games, for kernels, for embedded. These new domains certainly have different performance characteristics where the cost of resumption could easily matter much more.

This is about future-proofing the language and the applications written in it, at least as much as it is about performance.

(I have to head to work, but I have a response to the timing question as well.)

4 Likes

Can I clarify anything in particular? Here is a quick sketch of the actual changes I'm proposing:

  • Generators are constructed with and save a caller: *mut dyn Generator.
  • In the generated resume methods, replace return GeneratorState::Return(x) with (conceptually) become self.caller.resume(x).
  • Replace this (approximation of await!) in the generator source:
    let f = f(args..);
    loop {
        match f.resume() {
            GeneratorState::Yielded(x) => yield x,
            GeneratorState::Complete(x) => break x,
        }
    }
    
    with this code in the generated resume method:
    self.f = F::new(self, args..);
    self.state = next; // also save live values, etc.
    become self.f.resume();
    next:
    
  • Use only a single Future to wrap an entire stack of generators. The leaf generator yields a reference to itself as a *mut dyn Generator, and poll saves that and resumes it on its next invocation. Generators awaiting Futures (as opposed to directly awaiting other generators) keep using the loop technique (and thus keep (de)serializing that stack frame every time).

Based on the changes above, I don't believe it is forward compatible as-is, but it could be made compatible with some tweaks and selective stabilization. I believe there are two things to address:

  • The interface and calling convention between awaiters and awaitees needs to remain an implementation detail of the compiler. If we switch to the "single Future per stack of generators" model that seems doable- forbid directly invoking a generator without awaiting it, and instead provide a standard library API to drive a whole stack of them.
  • The generator-driving Future needs to be !Unpin, as it would eventually hold a reference into the generator. Generators that await other generators also need to be !Unpin, as they would also eventually hold references to their caller generators.

The idea that generators directly await other generators without going through a Future doesn't have to make awaiting Futures much more cumbersome. Give Future an async method that contains the polling loop and await that, instead of baking the loop into the definition of await.

Interestingly, this is how Kotlin works- their generators are also not possible to invoke directly without awaiting, they also provide standard library APIs to drive whole stacks of them at once because they don't pass through their Future interface at every level, and they also provide helper async methods on Futures for when you do want to await one from within a coroutine. Then, because you can't ever invoke a generator without also awaiting it, they drop the await keyword altogether.

Thus, instead of this:

async fn foo() -> Result<T, E> {
    // ...
    await some_other_generator()
    // ...
    await some_future
    // ...
}

fn main() {
    let the_future = foo();
    tokio::run(the_future);
}

you write this:

async fn foo() -> Result<T, E> {
    // ...
    await some_other_generator()
    // ...
    await some_future.poll_wait()
    // ...
}

fn main() {
    let the_future = future_from_generator(foo);
    tokio::run(the_future);
}

(Or leave off the awaits if you like, because some_other_generator() on its own is meaningless).

This also neatly sidesteps the question of "but if foo() really returns a Future why isn't Future in its type signature!?" by making foo() not return a Future.

4 Likes

Does this mean this proposal is blocked on become or something morally equivalent? (I would expect so, since CPS transform leads to tons of tail calls.)

2 Likes

Not at the source level, but yes. We would need a way for the generated resume methods to jump to other resume methods without growing the stack.

Fortunately, I don’t believe this is necessary for the forward-compatible version of the current implementation- tail calls can wait until we actually go to implement the CPS transform.

1 Like