An opportunity to improve performance of generators and futures

It could be possible to improve performance of async code by reducing amount of indirections associated with boxed futures. Description of the idea and toy benchmarks can be found here:

3 Likes

Could this also be a more general optimization of the layout of non-public enums that have only a single match statement globally that processes them? Also, you don’t mention generic Futures, but I guess that every monomorphic instance of those would get its own set of function pointers. The condition of “only a single match statement globally” isn’t true anymore in the presence of destructors. Which makes me wonder: How do you envision destructors to work with your implementation?

In theory it could be done by special-casing Box<dyn Future> and Box<dyn Generator> on compiler level and adjusting code generation accordingly.

That would be a non-trivial change, though. The problem is, Box<dyn Future> has a layout roughly equivalent to:

struct Box<...> {
    data: &FutureData,
    vtable: &__vtable_Future,
}

struct __vtable_Future {
    poll: fn(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output>
}

The problem is, poll in the vtable takes a (pinned) type-erased reference to the future data; it does not take a reference to the entire Box, which means it has no way to modify the vtable data when stopping at a different yield point.

I suspect finding a way to overcome this would require substantial changes.

Yes, essentially FSM transition function takes pointer to a state (it could be on the heap or on the stack) and assumes that state behind the pointer is valid for it. Compiler just has to track the maximum size of the state (and alignment requirements), otherwise it can view states and transition functions from libraries as completely opaque objects. I guess it could significantly reduce amount of analysis which has to be done by compiler, since it does not have to explicitly construct the huge state enum anymore.

Yes, you are correct.

Running destructor is essentially just another state transition dependent on the input data. We could modify the function signature to:

fn drive(f: *mut StateTransFn, state: &mut T, drop_flag: bool);

By calling the state transition function with drop_flag = true, it will handle cleaning objects inside the state (since only transition function can properly interpret data inside the state), while the caller would deallocate the state memory from the heap if needed.

Yes, I understand. And I wrote that I don't have enough knowledge to estimate how complex such change will be in practice. Although note that "special-casing" here means that we will bypass the usual memory layout for boxed traits entirely and in a certain sense will use a specialized calling convention.