Blog series: Dyn async in traits (continues)

There’s one thing I find missing from the discussion: C++ implemented coroutines with an implicit heap allocation quite a while ago. What’s the industry experience with that approach? Do we have any case studies, blog posts, etc.?

6 Likes

I think Rust has made many ergonomic sacrifices in order to have full control of memory and explicit allocations. I believe many of these design decisions can be significantly more challenging for beginners than this particular one (e.g. string literals being &str instead of String).

I would argue that by the time new users learn async Rust, they have already come to understand Rust's zero-cost abstraction philosophy, and will probably have come to terms with having to explicitly opt into an allocation. I don't think this feature is worth breaking those principles.

In fact, I would argue that simplifying async APIs has been a mistake in the past. In particular, Context and Waker were made thread-safe in order to simplify the implementation of work-stealing runtimes, at the expense of single-threaded and thread-per-core runtimes.

Is it not possible to use thread-per-core runtimes in Rust? That seems like a pretty useful case that C++ is already using coroutines for.

It is possible, but they have to be prepared for one of their Wakers to be activated from a different thread. The runtime is still free to decide where the awakened future is run, though. In particular, it can be !Send and tied to a particular thread if the runtime supports that use case.
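
For illustration, here is a minimal sketch of that arrangement; ChannelWaker and waker_for are made up for this example. The waker must be Send + Sync (as std requires to build a Waker from an Arc via the Wake trait), but all it does cross-thread is report a task id back to the owning thread, so the future itself never needs to be Send:

use std::sync::{mpsc::Sender, Arc, Mutex};
use std::task::{Wake, Waker};

// Hypothetical waker for a thread-per-core runtime: it may be invoked from
// any thread, but it only reports "task N is ready" back to the thread
// that owns (and will poll) the future.
struct ChannelWaker {
    task_id: usize,
    ready: Mutex<Sender<usize>>, // Mutex makes the Sender usable as Sync
}

impl Wake for ChannelWaker {
    fn wake(self: Arc<Self>) {
        // May run on any thread; the owning thread decides what to poll,
        // so the future itself can stay !Send.
        let _ = self.ready.lock().unwrap().send(self.task_id);
    }
}

fn waker_for(task_id: usize, ready: Sender<usize>) -> Waker {
    Waker::from(Arc::new(ChannelWaker { task_id, ready: Mutex::new(ready) }))
}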

There’s one thing I find missing from the discussion: C++ implemented coroutines with an implicit heap allocation quite a while ago. What’s the industry experience with that approach? Do we have any case studies, blog posts, etc.?

First of all, C++ coroutines can be implemented without heap allocation, but it's a painful path to walk: it involves forcing a linker error to detect cases where the allocation would not have been elided, which makes tracing the error back to its root cause painful, and it means detecting too-large coroutines at run-time, which is sub-optimal in general and plain forbidden in safety-critical code.

Secondly, C++ coroutines require 3rd-party libraries to be useful: the standard library itself only offers the means to implement coroutine handles (which is what coroutines effectively return). On top of that, compiler and library implementations of C++20 have been lagging in general, so adoption of coroutines in C++ has been even more sluggish than usual.

With that said, from what I've gathered on r/cpp, there are essentially 3 camps:

  • People who use coroutines for I/O nigh exclusively: :+1: . Their previous solutions typically involved a few memory allocations here and there already, which is fast compared to the I/O they are doing, so the reception of coroutines as-is has been fairly positive, thanks to the ergonomics gains.
  • People who use coroutines outside of I/O: :neutral_face: . They appreciated the gain in ergonomics at first, and then the performance reports started kicking in. Performance is great when the compiler inlines the coroutine, but when it fails to at a critical juncture (an inner loop), it's an infuriating experience to try to force inlining or disable the memory allocation. I do not think this situation applies to Rust, as a non-dyn Future would not lead to boxing, leaving tight control to the user.
  • People who wish to use coroutines in restricted environments (embedded, kernels, safety-critical): :-1: . They are plain disappointed; it's hell for them. Linker errors are NOT a friendly reporting mechanism, run-time detection of OOM may be forbidden, and as usual they feel left out.
15 Likes

That's the conclusion I came to after sleeping on it.

Dynamic dispatch already requires specifying bounds for associated types; if I use a regular trait, the compiler complains otherwise:

fn dyn_use(iterator: &mut dyn Iterator) {
    for item in iterator {}
}
// error[E0191]: the value of the associated type `Item`
// (from trait `Iterator`) must be specified
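
Specifying the associated type makes it compile; for instance, arbitrarily pinning Item down to u32:

fn dyn_use(iterator: &mut dyn Iterator<Item = u32>) {
    for item in iterator {}
}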

The same would apply to the original example, from the blog post:

async fn use_dyn(di: &mut dyn AsyncIterator) {
    di.next().await; // <— this call right here!
}

Needs to be rewritten as:

async fn use_dyn(di: &mut dyn AsyncIterator<Item = ??>) {
    di.next().await; // <— this call right here!
}

That is, use_dyn is runtime polymorphic over the particular implementation of the trait, but still requires compile-time information about a number of its properties. A single use_dyn cannot, at runtime, handle a mix of i32 and String items.

Do we really want to try and have use_dyn be runtime polymorphic over the various future types that various implementations of AsyncIterator<Item = X> could have?

This seems at odds with the fact that Item needs to be pinned to a specific type.

And if that requirement is lifted, then suddenly things fall in place (with a hypothetical FutureType associated item):

async fn use_dyn(
    di: &mut dyn AsyncIterator<Item = X, FutureType = Pin<Box<dyn Future<Output = Option<X>>>>>
) {
    di.next().await; // <— this call right here!
}

It can be made much more palatable with some library goodness:

trait DynAsyncIterator {
    type Item;
    type SizedFuture<T: ?Sized>: Future<Output = Option<Self::Item>>;

    fn next(&mut self) -> Self::SizedFuture<dyn Future<Output = Option<Self::Item>>>;
}

//  std only, not core.
//  Make part of the prelude, for a seamless experience.
trait AsyncIteratorExt: AsyncIterator {
    type Boxed: DynAsyncIterator<Item = Self::Item, SizedFuture = Box>;

    //  Similar to the "fused" adaptor existing on Iterator.
    fn boxed(self) -> Self::Boxed;
}

impl<T: AsyncIterator> AsyncIteratorExt for T {
    type Boxed = /* todo */;

    fn boxed(self) -> Self::Boxed { todo!() }
}

Which enables a nifty:

async fn make_dyn<AI: AsyncIterator>(ai: AI) {
    //  Explicit choice of strategy.
    use_dyn(&mut ai.boxed()).await;
}

async fn use_dyn(di: &mut dyn DynAsyncIterator<Item = X, SizedFuture = Box>) {
    di.next().await;
}

It looks pretty good, if I say so myself:

  • Designing DynAsyncIterator is a one-off cost for the designer, and a one-off cost for each implementation, of which there will probably be relatively few -- there aren't that many strategies available.
  • AsyncIteratorExt is a one-off cost for the designer.
  • Even if naming the return type of next were possible, DynAsyncIterator is more user-friendly than having to spell out the horrendously long type.
  • No unstable compiler features were harmed in this sample -- with GATs stabilized -- though AsyncIterator itself may still require impl Trait in return position in traits, of course.
  • Compatible with no_std: the boxed adapter is purely optional, living in std itself. Users of no_std will instead pick a different strategy, possibly a Box<T, InlineStorage<N>>.

And of course, the usage is fairly neat. A simple call to .boxed() is both succinct enough not to be a bore, and explicit enough to spot allocations.

1 Like

Maybe I missed something, but can someone point me to a resource as to why alloca isn't compatible with Rust futures?

async fn futures compile into state machines. When the future is not actively being polled and making progress, all its local variables are stored inside the future. But the future is a value with a defined size, so all its contents need to have a defined size as well. So, to store a local variable inside a future, you need to know how much space it will take up at the point you create the future.
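
As a rough illustration (hand-waving pinning, wakers, and the exact lowering; wait_for_timer is a made-up stand-in chosen so the inner future has a concrete type), the compiler turns an async fn into an enum whose variants store the locals that are live across each .await, so every variant, and hence the whole future, must have a statically known size:

use std::future::{ready, Ready};

// A stand-in leaf future with a known, concrete type.
fn wait_for_timer() -> Ready<()> {
    ready(())
}

async fn example() -> u32 {
    let x: u64 = 42;        // lives across the await below
    wait_for_timer().await; // suspension point
    x as u32
}

// Conceptually lowered to a state machine like this:
enum ExampleFuture {
    Start,                                  // created, not yet polled
    Suspended { x: u64, timer: Ready<()> }, // locals live across the await
    Done,
}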

OK, I went ahead and wrote up a new blog post diving into call-site selection as I see it now.

6 Likes

Yeah, that's an interesting idea. Neat.

I have a very annoying question: right now, all the boxing suggestions (whether Boxing::new(), .box, or implicit boxing) seem to assume infallible allocation. Is there any hope of keeping the designs compatible with fallible allocation?

Over in Rust-for-Linux we have some experiments using async/await in kernel programming, and we have a heap, but all allocation failures must be handled.

17 Likes

That seems easier to handle with call-site allocation. You could have something like FallibleAllocator::with(|| foo.next())?.await, I imagine.
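
For what it's worth, the fallible shape already exists for plain boxing on nightly via Box::try_new (behind the allocator_api feature), so a call-site mechanism that returns a Result does not seem far-fetched:

#![feature(allocator_api)] // nightly-only as of this writing

use std::alloc::AllocError;

fn fallible_box(v: u32) -> Result<Box<u32>, AllocError> {
    // Returns Err(AllocError) instead of aborting the process
    // when the allocation fails.
    Box::try_new(v)
}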

3 Likes

Thinking more about this idea:

One thing I've wanted to do for a while is make it possible to have a dyn Trait where you don't specify the values of all the associated types. As a silly example, you could have a dyn Iterator (without any Item specified) and still use it to invoke count or something else where the Item is not part of the interface. (If you're familiar with Java, it'd be vaguely like having an Iterator<?>.)

One of the vexing parts of that is that dyn Iterator doesn't implement Iterator, then, because we can't actually compile generic functions unless we know the value of Item (since they may invoke next). Of course, if we drop that requirement, that's fine: we can say that dyn Iterator<Item = X>: Iterator, but not dyn Iterator on its own.

Your proposal seems to fit into that -- you can think of the next future on dyn AsyncIterator as being unknown -- in this case, not completely unknown, but changed to a dyn. And then you can specify different values for them and, if those values are sized, the type implements the trait.

Definitely some details to work out, but neat.

2 Likes

Right now, for example, if you have a trait with a generic method, you get an error when you try to create a dyn Trait value, telling you that you cannot create a dyn Trait from a trait with a generic method. But it may well be clearer to get an error at the point where you try to call that generic method, telling you that you cannot call generic methods through dyn Trait.

It would be very unfortunate if object safety were checked only at method use time. As far as I understand, it doesn't expand the expressiveness of trait objects in a meaningful way (since you can always add bounds which exclude the offending methods). On the other hand, object safety is quite unintuitive and easy to violate, and it would be best if the person writing the trait were notified of a violation as soon as possible, rather than somewhere at a use site, with a confusing error message and the design around trait objects already ossified.


Aside: it would be really great if there were some support for object-safe "generic" methods. Obviously they cannot really be generic in the common sense, but if we're already opting into dynamic dispatch, perhaps generic methods could be automatically monomorphized for corresponding trait objects in some cases?

For example, in this method

trait Foo {
    fn foo<T>(&self, _: T);
}

there is nothing you could do to T other than move it around and drop it. We could accept any type dynamically by using a trait AnyVal which provides type layout and drop (is there already such a trait? it's not exactly dyn Drop). Trait objects dyn AnyVal of this trait could be safely manipulated by Foo::foo via dynamic dispatch. (We have the issue that dyn AnyVal is unsized, so just monomorphizing to T = dyn AnyVal probably won't work... or would it?)
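
Under today's Rust, the closest safe approximation is to erase the argument yourself. Here is a sketch using Box<dyn Any> as a crude stand-in for the proposed AnyVal (FooDyn and foo_dyn are made up); unlike AnyVal, it costs a heap allocation per call:

use std::any::Any;

trait Foo {
    fn foo<T>(&self, value: T);
}

// Object-safe companion trait: the generic method is monomorphized
// exactly once, at the single erased type `Box<dyn Any>`.
trait FooDyn {
    fn foo_dyn(&self, value: Box<dyn Any>);
}

impl<F: Foo> FooDyn for F {
    fn foo_dyn(&self, value: Box<dyn Any>) {
        self.foo(value); // T = Box<dyn Any>
    }
}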


The problem is that the AsyncIterator trait defines next as returning impl Future<..>, which is actually shorthand for impl Future<..> + Sized, but we said that next would return dyn Future<..>, which is ?Sized. So the dyn AsyncIterator type doesn’t meet the bounds the trait requires.

We can always safely tighten the bounds on the returned type in the implementation. What if AsyncIterator::next desugared to returning impl Future<..> + ?Sized, with the Sized bound required separately when Self: Sized? That would mean that code which works with concrete, sized async iterators would work as usual, while code that wants to allow dynamic dispatch would have to jump through some extra hoops.

An issue which immediately springs to mind is that Self: ?Sized doesn't mean that Self = dyn AsyncIterator. It could be a slice, or some unrelated trait object, and it is unclear why someone would want to return unsized futures in those cases. Perhaps the unsized bound could be made to apply only to dyn AsyncIterator via some clever trait bound hack?

async fn use_dyn(di: &mut dyn AsyncIterator) {
    di.next().box.await;
    //       ^^^^
}

That is a huge violation of separation of concerns. It means that any async function which wants to use dynamic dispatch has to write the .box boilerplate, even though 99% of the time it doesn't care where and how that dyn future is allocated. On the other hand, the top-level code (e.g. the executor), which almost certainly would care about the precise allocation strategy for futures, is left without any means to affect it.

Punting this to compiler optimizations doesn't work. The optimizer is notoriously unreliable for hard performance requirements, and can never be relied on for correctness. There is also no way an optimizer would use some local arena for that. This means that writing something like Go's async model, with separate segmented stacks, will be impossible, or at least would get no help from the language, and a bit of obstruction.

It's not meaningfully better than "implicitly allocate everywhere". The real problems are just as unsolved, and the async fns have to deal with boilerplate.

Aside: if the .box operator is ever added, I sure hope it is overloadable and gets its own trait.
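
If it were overloadable, such a trait could plausibly look like the following. This is entirely hypothetical (name and shape included), mirroring how Deref backs the * operator:

trait BoxOp {
    type Boxed;
    fn box_op(self) -> Self::Boxed;
}

//  `expr.box` would then desugar to `BoxOp::box_op(expr)`, letting a
//  runtime or application pick plain boxing, arena allocation, inline
//  storage, etc. for its own future types. A default impl might be:
struct MyFuture;

impl BoxOp for MyFuture {
    type Boxed = Box<MyFuture>;
    fn box_op(self) -> Box<MyFuture> {
        Box::new(self)
    }
}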

To sum up, I think for most users this design would work like so…

I really don't understand how the proposed design is any better than adding the #[async_trait] macro to the standard library. What does the proposed design allow one to do that the macro doesn't?


Part of the problem is that async stack frames are stored in structs, and thus we cannot support something like alloca (at least not for values that will be live across an await, which includes any future that is awaited).

Why? I proposed it above, but let me ask again: what's stopping us from treating the backing allocation of the future as essentially a stack, and performing an alloca within it? I.e. an instance of dyn Future + ?Sized is allowed to be any size determined at runtime, most likely based on the size of the storing allocation (e.g. if you want to store Box<dyn Future + ?Sized>, you decide upfront how much memory you're willing to provide it, and the future is constrained to that memory).

This requires the future to know the size of its allocation, but I believe that information can be put in the future's vtable.

This also means that the future may overflow its allocation during execution, but it's not any different from overflowing the stack. If you're executing arbitrary dynamically provided code, then you're already fine with that code potentially exhausting your stack and killing the process. It is unfortunate, but it's just the way of life, and can be dealt with via standard methods. If you know all trait implementations, you can verify their stack consumption by hand. If you can't or if the implementations are untrusted, then you probably shouldn't be calling them dynamically to begin with. At least an overflow of a user-provided buffer is a relatively benign and easily recoverable error, compared to a real stack overflow.

The checks incur some runtime costs, but that's just a layout computation and a bounds check, which should be fast enough. Certainly faster than the allocate-everything model. Also, dynamic dispatch will incur runtime costs either way.

This means that the executor is free to choose the strategy for those allocations. It may put them all into an arena. It may allocate huge stacks for each future, or small stacks if it's feeling confident. You can implement any possible async stack model with that approach. Side stacks, segmented stacks, execute everything on the system stack, whatever.

We will probably require some sanity bounds on the relevant unsized locals (there should be some reasonable way to compute the alloca size before filling it with data), but it is a separate problem, and there are many reasonable bounds on unsized return values that one could enforce (RVO being the simplest one).

It would also give a simple way to write recursive async functions, as an extra benefit.
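
(For contrast, here is what that recursion costs today: the recursive call must go through a boxed, type-erased future, because the async fn's state machine would otherwise contain itself and have infinite size.)

use std::future::Future;
use std::pin::Pin;

//  Today's workaround: return a boxed future so the state machine type
//  doesn't recursively contain itself.
fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()>>> {
    Box::pin(async move {
        if n > 0 {
            countdown(n - 1).await;
        }
    })
}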

2 Likes

You say that dyn Foo would define inherent methods based on the trait methods, but exclude non-dyn-safe methods of the trait. I'm interested in how this would interact with #51402: method resolution for trait object types does not prefer inherent methods.

I've never really liked the "dyn Foo's trait methods 'are' inherent methods" aspect because it blocks valid use cases like those in the issue, and would rather dyn Foo act more like any other type generally. However, the cases that I generally care about / usually matter are the cases where the method is non-dyn-safe. If the compiler is going to gain the ability to override some non-dyn-safe trait methods with an inherent method having a different signature, I would like that ability as a Rust programmer too.

I think it is also forwards compatible with allowing inherent methods that shadow implemented trait methods. But if it's decided that will never happen, it would be better if writing such an inherent method gave an error itself, instead of just being confusingly non-callable. (Even better yet would be reducing some special cases around dyn Foo's method resolution, if possible.)

Potentially related RFC: inherent trait implementations.


I'm fine with dyn Foo being more like a normal type in the sense that it doesn't necessarily implement Foo. In fact, it has never been the case that dyn Trait always implements Trait. The first time I saw that was on URLO, so people do run into it in practice.

3 Likes

Something like the capabilities proposal, combined with the storages proposal, could be applicable here. The executor could provide a capability for the Storage it prefers, which would then be passed to any Box::new calls.

In the normal stack, if one function uses a large amount of stack space A, and another uses a small amount B, the total amount of stack space used is A + B, which is not much bigger than A. If an async executor decides to give all dynamically sized futures enough space to do A, then if one dynamically sized future does A and another does B, the total space used is 2A. Either the executor wastes a lot of space by giving every future enough to do the maximum, or you are likely to OOM on some future, because unlike normal stack space which is shared between all functions, each dynamically sized future would get its own "stack" and can't share "extra" with more needy futures. Because of this, it's almost certainly better to just box your dynamically sized values in futures.

1 Like

Is there a fundamental reason "allocate maximal space" couldn't work across crates? Could the ABI be modified so that crates provide a table of maximal space required per trait method? That seems like it mightn't require much analysis.

Edit: Somewhat answering my question, it wouldn't really work in the (currently uncommon) case of dynamically loaded Rust libraries.

So now it seems a lot more appealing to me, and I’m grateful to Olivier Faure for bringing it up again.

Recognized!

There is one complication. Today in Rust, every dyn Trait type also implements Trait. But can dyn AsyncIterator implement AsyncIterator? In fact, it cannot! The problem is that the AsyncIterator trait defines next as returning impl Future<..>, which is actually shorthand for impl Future<..> + Sized, but we said that next would return dyn Future<..>, which is ?Sized. So the dyn AsyncIterator type doesn’t meet the bounds the trait requires. Hmm.

Is the solution you mentioned in this thread, about adding a special Sized if Self is Sized bound to RPITs, not on the table?

Because if it is, then dyn MyTrait would still implement MyTrait.

4 Likes

One as-yet-unmentioned possibility is to have some kind of Storage (or similar) with specialized behavior for Sized values. So Box<T, SizedOnStack>::new would place its contents on the stack iff T: Sized. That way, you can write code that is generic over both impl AsyncTrait and dyn AsyncTrait with no performance penalty in the sized cases.
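
Something vaguely in that spirit can be hand-rolled today by dispatching on the two cases ourselves, without a Storage API. MaybeBoxed below is made up for this sketch; the Unpin bound and the fixed Output type keep it short and safe:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Hand-rolled stand-in for a `SizedOnStack`-style storage: sized futures
// stay inline (no allocation), type-erased ones go on the heap.
enum MaybeBoxed<F> {
    Inline(F),
    Boxed(Pin<Box<dyn Future<Output = ()>>>),
}

impl<F: Future<Output = ()> + Unpin> Future for MaybeBoxed<F> {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // `F: Unpin` (and `Pin<Box<_>>` being Unpin) lets us project
        // through the Pin safely here.
        match &mut *self {
            MaybeBoxed::Inline(f) => Pin::new(f).poll(cx),
            MaybeBoxed::Boxed(f) => f.as_mut().poll(cx),
        }
    }
}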

1 Like