Blog series: Dyn async in traits (continues)

I agree in the general case but I want to point out that "we cannot do heap allocations in this context" really is a bright line. For instance, @y86-dev recently brought up placement by return in another thread, in the context of instantiating mutexes in kernel mode, where allocations might be either lexically forbidden ("this code runs during early boot and we don't have a heap yet") or dynamically forbidden ("this code could be called from an interrupt handler").

7 Likes

This can't be done until Niko's point about Sized bounds is resolved one way or another.

I'll elaborate more later, but I think that the way to resolve this is to drop the idea that the type dyn Trait implements Trait. Instead, we would extend MIR with the idea of a "dyn call" as a new thing (right now we have only inherent calls and trait calls, and there is a special "dynamic dispatch impl" of the trait). In that case, the signature of methods when invoked through dyn Trait can be different than what is defined in the trait itself -- in particular, async fn and -> impl Trait would be rewritten to return dyn values.
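
To illustrate the shape of that idea, here is a rough sketch (illustrative only, not actual compiler behavior; the names are just for exposition):

trait AsyncIterator {
    type Item;
    // As declared in the trait: each impl returns its own concrete, sized future.
    async fn next(&mut self) -> Option<Self::Item>;
}

// Hypothetical signature used for a "dyn call" through dyn AsyncIterator:
//
//     fn next(&mut self) -> Pin<Box<dyn Future<Output = Option<Self::Item>> + '_>>;
//
// i.e. the async fn / -> impl Trait return is rewritten to return a dyn value.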

2 Likes

That's very gracious of you. =)

No, that's a good summary.

It's funny, actually, I'd started to write a blog post to lay out the "placement return" solution to async traits, but you mostly hit the same points more concisely.

"Not carried by the time alone"? I think you made a typo.

(And I don't think a pass-the-allocator ABI would be necessary, at least not for RPITIT)

Do you mean "its associated type to be Sync"?

Huh, that's a good point.

That implies that RPIT in dyn traits wouldn't be a straight desugaring with the current rules, and would require additional chalk support.

I don't know. The analogy I made in the zulip thread was "when you pull in a pathfinding library and call find_path_with_astar(my_graph), you usually don't expect to control whether the A* implementation is using Vecs or VecDeques."

I would expect most library writers would be fine boxing futures if needed, and if someone wants their library to be used in no_alloc contexts, then they'd go the extra mile.

Also, for the specific use-case of async iterators, we could create an InlineDynAsyncIterator which implements the inline strategy (details in the zulip thread).

1 Like

Nah, it works if you assume that traits with async methods cannot be made into dyn traits.

(which is not ideal, but again, good enough for the MVP)

That locks out certain possibilities, like deciding that async trait methods allow returning unsized futures even when the implementing type is Sized. For example, struct WrappedDyn(Box<dyn AsyncTrait>) might want to implement AsyncTrait by delegating to its field.

1 Like

"Can return non-Sized only if non-Sized" is the kind of negative reasoning that could lead to semver hazards, no?

I don't know if I would agree with this; it is nice to be able to reason that some tight loop doesn't perform any dynamic allocations... An example might be the render loop of a game, where you allocate and free memory while loading/switching levels, and in between, during the main gameplay loop, you only mutate the previously allocated memory.

A less drastic measure is to decide that T: AsyncTrait actually means T: AsyncTrait where <T as AsyncTrait>::FutureType: Sized, and something like T: AsyncTrait where <T as AsyncTrait>::FutureType: ?Sized is available for those who really want it. But the rules around Sized bounds are complicated enough as is…
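
For comparison, here's a minimal, runnable analogue of that default/opt-out pattern using today's rules for type parameters, where Sized is implied unless ?Sized is written (the FutureType associated type above is hypothetical, so this only mirrors the shape of the proposal):

use std::fmt::Display;

fn takes_sized<T: Display>(value: &T) {
    // T: Sized is implied here.
    println!("{value}");
}

fn takes_maybe_unsized<T: Display + ?Sized>(value: &T) {
    // Explicit ?Sized opt-out relaxes the implied bound.
    println!("{value}");
}

fn main() {
    let s: &str = "hello";
    // takes_sized(s); // would not compile: str is not Sized
    takes_maybe_unsized(s); // fine
}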

Maybe with unsized_{locals, fn_params, etc}, dealing with unsized types will become painless enough that implied Sized bounds can be removed from certain contexts across an edition? I doubt this will ever be feasible, but maybe I am wrong about that?

1 Like

Would that mean that a wrapper function would be placed in the vtable?

Also, I think that under this new definition, dyn Trait still impls Trait, since its return value implements Trait (recursively).

There’s one thing I find missing from the discussion. C++ implemented coroutines with an implicit heap allocation quite a while ago. What’s the industry experience with that approach? Do we have any case studies, blog posts, etc?

6 Likes

I think Rust has made many ergonomic sacrifices in order to have full control of memory and explicit allocations. I believe many of these design decisions can be significantly more challenging for beginners than this particular one (e.g. string literals being &str instead of String).

I would argue that by the time new users learn async Rust, they have already come to understand Rust's zero-cost abstraction philosophy, and will probably have come to terms with having to explicitly opt into an allocation. I don't think this feature is worth breaking those principles.

In fact, I would argue that simplifying async APIs has been a mistake in the past. In particular, Context and Waker were made thread-safe to simplify the implementation of work-stealing runtimes, at the expense of single-threaded and thread-per-core runtimes.

Is it not possible to use thread-per-core runtimes on Rust? That seems like a pretty useful case that C++ is already using coroutines for.

It is possible, but they have to be prepared for one of their Wakers to be activated from a different thread. The runtime is still free to decide where the awakened future is run, though. In particular, it can be !Send and tied to a particular thread if the runtime supports that use case.
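
To make that concrete, here is a minimal sketch (assuming the tokio crate, with its current-thread runtime and LocalSet) of a !Send future pinned to one thread, even though its Waker remains Send + Sync:

use std::rc::Rc;
use tokio::task::LocalSet;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let local = LocalSet::new();
    local.block_on(&rt, async {
        let data = Rc::new(42); // Rc is !Send, so this future is !Send
        tokio::task::spawn_local(async move {
            println!("value: {data}");
        })
        .await
        .unwrap();
    });
}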

There’s one thing I find missing from the discussion. C++ implemented coroutines with an implicit heap allocation quite a while ago. What’s the industry experience with that approach? Do we have any case studies, blog posts, etc?

First of all, C++ coroutines can be implemented without heap allocation. It's a painful path to walk: it involves forcing a linker error to detect cases where an allocation would have occurred (had the program linked), which makes tracing the error back to its root cause painful, and it means detecting too-large coroutines at run-time, which is sub-optimal in general, and plain forbidden in safety-critical code.

Secondly, C++ coroutines require 3rd-party libraries. The standard library itself only offers the means to implement the coroutine handles (which is what coroutines effectively return). On top of that, implementations have been lagging for C++20 in general, so adoption of coroutines in C++ has been even more sluggish than usual.

With that said, from what I've gathered from r/cpp, there's essentially 3 camps:

  • People who use coroutines for I/O nigh exclusively: :+1:. Their previous solutions typically involved a few memory allocations here and there already, which is fast compared to the I/O they are doing, and thus the reception of coroutines as-is has been fairly positive due to the ergonomics gains.
  • People who use coroutines outside of I/O: :neutral_face:. They appreciated the gain in ergonomics at first, and then performance reports started kicking in. Performance is great when the compiler inlines the coroutine, but if it fails to at a critical juncture (inner loop), then it's an infuriating experience to try to force inlining, or to disable the memory allocation. I do not think the situation applies to Rust, as non-dyn Futures would not lead to boxing, offering tight control to the user.
  • People who wish to use coroutines in restricted environments (embedded, kernels, safety-critical): :-1:. They are plainly disappointed; it's hell for them. Linker errors are NOT a friendly reporting mechanism. Runtime detection of OOM may be forbidden. As usual, they feel left out.
14 Likes

That's the conclusion I came to after sleeping on it.

Dynamic dispatch already requires specifying bounds for associated types; if I use a regular trait, the compiler complains otherwise:

// error[E0191]: the value of the associated type `Item`
// (from trait `Iterator`) must be specified
fn dyn_use(iterator: &mut dyn Iterator) {
    for item in iterator {}
}

The same would apply to the original example, from the blog post:

async fn use_dyn(di: &mut dyn AsyncIterator) {
    di.next().await; // <— this call right here!
}

Needs to be rewritten as:

async fn use_dyn(di: &mut dyn AsyncIterator<Item = ??>) {
    di.next().await; // <— this call right here!
}

That is, use_dyn is runtime polymorphic about the particular implementation of the trait, but still requires compile-time information about a number of its properties. A single dyn_use cannot, at runtime, handle a mix of i32 and String.

Do we really want to try and have use_dyn be runtime polymorphic over the various future types that various implementations of AsyncIterator<Item = X> could have?

This seems at odds with the fact that Item needs to be pinned to a specific type.

And if that requirement is lifted, then suddenly things fall into place (with a hypothetical FutureType associated item):

async fn use_dyn(
    di: &mut dyn AsyncIterator<Item = X, FutureType = Box<dyn Future<Output = Option<X>>>>
) {
    di.next().await; // <— this call right here!
}

It can be made much more palatable with some library goodness:

trait DynAsyncIterator {
    type Item;
    type SizedFuture<T: ?Sized>: Future<Output = Option<Self::Item>>;

    fn poll_next(&mut self) -> Self::SizedFuture<dyn Future<Output = Option<Self::Item>>>;
}

//  std only, not core.
//  Make part of prelude, for a seamless experience.
trait AsyncIteratorExt: AsyncIterator {
    type Boxed: DynAsyncIterator<Item = Self::Item, SizedFuture = Box>;

    //  Similar to the "fused" adaptor existing on Iterator.
    fn boxed(self) -> Self::Boxed;
}

impl<T: AsyncIterator> AsyncIteratorExt for T {
    type Boxed = /* todo */;

    fn boxed(self) -> Self::Boxed { todo!() }
}

Which enables a nifty:

async fn make_dyn<AI: AsyncIterator>(ai: AI) {
    //  Explicit choice of strategy.
    use_dyn(&mut ai.boxed()).await;
}

async fn use_dyn(di: &mut dyn DynAsyncIterator<Item = X, SizedFuture = Box>) {
    di.poll_next().await;
}

It looks pretty good, if I say so myself:

  • Designing DynAsyncIterator is a one-off cost for the designer, and a one-off cost for each implementation, of which there'll probably be relatively few -- there's not that many strategies available.
  • AsyncIteratorExt is a one-off cost for the designer.
  • Even if talking about the return type of poll_next were possible, DynAsyncIterator is more user-friendly than having to specify the horrendously long type.
  • No unstable compiler features were harmed in this sample -- with GATs stabilized -- though AsyncIterator may still require impl Trait in return position in traits by itself, of course.
  • Compatible with no_std: the boxed adapter is purely optional, living in std itself. Users in no_std will instead pick a different strategy, possibly a Box<T, InlineStorage<N>>.

And of course, the usage is fairly neat. A simple call to .boxed() is both succinct enough not to be a bore, and explicit enough to spot allocations.

Maybe I missed something, but can someone point me to a resource as to why alloca isn't compatible with using Rust futures?

async fn futures compile into state machines. When the future is not actively being polled and making progress, all its local variables are stored inside the future. But the future is a value with a defined size, so all its contents need to have a defined size as well. So, to store a local variable inside a future, you need to know how much space it will take up at the point you create the future.
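
As a rough, hand-written sketch (the names are illustrative; the real compiler-generated type is anonymous and more involved), an async fn and the state machine it roughly desugars to look like this:

use std::future::Future;

async fn example(io: impl Future<Output = ()>) -> u32 {
    let buf = [0u8; 64]; // local variable that lives across the .await
    io.await;            // suspension point
    buf.len() as u32
}

// Conceptually, the compiler generates something along these lines:
enum ExampleFuture<Io> {
    Start { io: Io },
    // buf is stored inline in the future across the suspension point,
    // so its size must be known when the future is created.
    Waiting { buf: [u8; 64], io: Io },
    Done,
}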

OK, I went ahead and wrote up a new blog post diving into call-site selection as I see it now.

6 Likes

Yeah, that's an interesting idea. Neat.