Inlining is great, and definitely helps, but it’s not a complete solution. (It also helps this proposal, as I mentioned in the first post.)
First, unless it’s done before state machine generation (i.e. on MIR), or LLVM gets extremely lucky and is able to combine the state numbers and simplify the resulting control flow, there will still be multiple nested state machines.
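To make that concrete, here's roughly the shape you end up with today when one generator awaits another. This is only a sketch: `SimpleFuture`, `Poll`, and the state enums below are simplified stand-ins (no `Waker`, no pinning), not actual compiler output.

```rust
// Simplified stand-ins, for illustration only.
enum Poll<T> {
    Ready(T),
    Pending,
}

trait SimpleFuture {
    type Output;
    fn poll(&mut self) -> Poll<Self::Output>;
}

// Roughly what an inner generator lowers to: a small state machine.
enum Inner {
    Start,
    Waiting, // waiting on some leaf event
    Done,
}

impl SimpleFuture for Inner {
    type Output = u32;
    fn poll(&mut self) -> Poll<u32> {
        match self {
            Inner::Start => {
                *self = Inner::Waiting;
                Poll::Pending
            }
            Inner::Waiting => {
                *self = Inner::Done;
                Poll::Ready(42)
            }
            Inner::Done => panic!("polled after completion"),
        }
    }
}

// Roughly what an outer generator that awaits `Inner` lowers to: a second
// state machine that *contains* the first. Every resume re-dispatches on the
// outer state just to reach the inner dispatch; this is the nesting LLVM
// would have to be lucky enough to flatten.
enum Outer {
    Start,
    AwaitingInner(Inner),
    Done,
}

impl SimpleFuture for Outer {
    type Output = u32;
    fn poll(&mut self) -> Poll<u32> {
        loop {
            match self {
                Outer::Start => *self = Outer::AwaitingInner(Inner::Start),
                Outer::AwaitingInner(inner) => match inner.poll() {
                    Poll::Ready(v) => {
                        *self = Outer::Done;
                        return Poll::Ready(v + 1);
                    }
                    Poll::Pending => return Poll::Pending,
                },
                Outer::Done => panic!("polled after completion"),
            }
        }
    }
}
```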
Second, inlining doesn’t always happen. Sometimes it’s better to leave a function out of line, like when it’s large and/or called from many different locations. (Technically, because generators can’t be recursive, we could just always fully inline them, so maybe “doesn’t” should be “shouldn’t.”)
Third, generators aren’t the only things you might want to await. Hand-written futures can’t be inlined the same way, because they’re already state machines; they could still take part in a CPS transform if we specify the interface right.
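This is the kind of thing I mean by a hand-written future: a combinator along the lines of the sketch below (reusing the toy `SimpleFuture` trait from above; `Join2` is just an illustrative name) is written directly as a struct full of state, so there's no generator body for a pre-state-machine inlining pass to flatten. The only way it can participate is through whatever interface we specify.

```rust
// A hand-written "await two futures in parallel" combinator, using the toy
// SimpleFuture / Poll definitions from the previous sketch. There is no
// generator here at all: the state is ordinary fields, managed by hand.
struct Join2<A: SimpleFuture, B: SimpleFuture> {
    a: A,
    b: B,
    a_out: Option<A::Output>,
    b_out: Option<B::Output>,
}

impl<A: SimpleFuture, B: SimpleFuture> SimpleFuture for Join2<A, B> {
    type Output = (A::Output, B::Output);

    fn poll(&mut self) -> Poll<Self::Output> {
        // Drive each child that hasn't finished yet.
        if self.a_out.is_none() {
            if let Poll::Ready(v) = self.a.poll() {
                self.a_out = Some(v);
            }
        }
        if self.b_out.is_none() {
            if let Poll::Ready(v) = self.b.poll() {
                self.b_out = Some(v);
            }
        }
        // Only complete once both children have produced a value.
        match (self.a_out.take(), self.b_out.take()) {
            (Some(a), Some(b)) => Poll::Ready((a, b)),
            (a, b) => {
                self.a_out = a;
                self.b_out = b;
                Poll::Pending
            }
        }
    }
}
```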
(Edit I: Fourth, there’s also debugging. If you’re running your program in a debugger and have to repeatedly step in and out of an unoptimized stack of generators, that will get old fast. There’s also the problem of debug builds being too slow to be useful. Reducing await's reliance on the optimizer will help in both of these cases.)
(Edit II: Fifth, a CPS transform would work across dynamically-dispatched generator calls while inlining wouldn’t. I don’t know how common e.g. async trait object method calls will be, but it would be great not to have to do the dynamic dispatch on every resume!)
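As a tiny illustration of that last point (again with the toy trait from above): in the polling model, a parent that awaits a boxed trait object pays for the virtual call on every single resume, not just once when the call is made.

```rust
// The awaited future is only known through a vtable, so each time the parent
// is resumed and re-polls it, that's another dynamic dispatch.
struct AwaitsDyn {
    inner: Box<dyn SimpleFuture<Output = u32>>,
}

impl SimpleFuture for AwaitsDyn {
    type Output = u32;
    fn poll(&mut self) -> Poll<u32> {
        // virtual call, repeated on every resume until `inner` completes
        self.inner.poll()
    }
}
```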
I expect both situations 2 and 3 will be relatively common: larger functions that wouldn’t be inlined in synchronous code (including functions that are only large after a round of inlining), and hand-written combinators that generators await (e.g. for awaiting multiple futures in parallel).
If we truly want to call async/await a zero-cost abstraction, it ought to compile down to something that doesn’t have the “(de)serialize nested generators” problem.