Blog series: Dyn async in traits (continued)

Note that it would be perfectly possible (and sensible) to release an "async traits MVP" feature without boxing, with the caveat that async traits aren't dyn-safe.

That would probably be good enough for most people.

4 Likes

Hmm, ok, that's fair. I admit I don't think placement new is very promising, but I should invest more time in thinking it through. Let's dig in and I'll try to write a follow-up blog post once we're in agreement about how it would work and the pros/cons.

First off, placement new could be a lot of things. In full generality, it'd have to be some sort of ABI change that allows the callee to make a callback to the caller once it knows how much memory it needs to return. That would be needed if you wanted the "full generality" of allowing you to return a ?Sized type from a function. We might then support this by saying (a) function return types can be ?Sized but (b) when you call a function, we typically require a Sized return type unless the call is in a context (e.g., foo().box, or &foo().box in a non-async function) where we can allocate an arbitrary amount of space. I'm going to handwave a bit over how those contexts are defined for now.

I think the idea would be that the return type of the next async function would be dyn AsyncIterator when it is being called with Self = dyn AsyncIterator. (Note that for the more limited version of placement new, where the size is known up front, we need additional information not carried by the time alone, so a really general .box operator would require more of the "pass in an allocator"-style ABI.)

Then at the call site, at least when using dynamic dispatch, one would write something like this...

let value = iter.next().box.await;

...and we would thread the allocator into the next call. Without the built-in syntax, that might be...

let value = Box::new_with(|| iter.next()).await;

Let me check-in here: does that sound about right? I'm probably getting some details wrong, or perhaps you have alternate proposals I've forgotten about.

Let me add in a few other observations then.

Is the Future Sized? There is one bit of conflict I see. The current plan is that async fn translates to -> impl Future, so we get a trait roughly like this:

trait AsyncIterator {
    type next<'a>: Future<Output = Option<Self::Item>> + 'a;

    fn next(&mut self) -> Self::next<'_>;
}

The idea is that T::next gives you the "real future" but if T = dyn AsyncIterator, you get a dyn* Future type. Under the design as I described it above, when T = dyn AsyncIterator, you would get a dyn Future type -- but that's not valid! next requires the value of its associated type to be Send. So we have to loosen that to type next<'a>: ?Sized + Future. But now code like this will not compile...

fn foo<AI: AsyncIterator>(ai: &mut AI) {
    let next_future = ai.next(); // Error: `AI::next` is `?Sized`
}

In order to accept that code, we need a bound that says "next is Sized iff Self is Sized". We don't have those sorts of bounds at all right now, though I think in principle chalk could support it. Make sense?

Where is the right spot to choose the strategy? There is this other question, of whether the call-site is the right place to do the allocation. I think there's no one right answer here for all programs. But I do think it's fairly common to have some code that wants to be reused across different contexts, and that the code is not in a good position to choose the right strategy. But other times you'll have one long-lived object, and some of its calls will be more perf sensitive than others. In general, I think that the point of coercion is a good spot to intervene, at least in part because some of the logical strategies I think would be useful require storing data on the iterator itself. But overall this is a good topic to dig into on its own, and this comment is too long, so I'll drop it here.

8 Likes

No, you can't, because the second function never makes more than one call to any given iterator. You can only caller-side cache when you have repeated calls to the same method from the same object (of course, if you knew that all the objects led to the same kind of underlying type, you could cache across objects, but we don't carry enough info in the type system to make that conclusion).

Example where you COULD cache:

    while let Some(mut iter) = iters.pop_front() {
        while let Some(next) = iter.next().await { // <-- each of these calls could be cached
           process(next);
        }
    }
1 Like

I agree that the Clone/Copy split isn't really correct, but I do want to point out that this is not really apples-to-apples.

Right now, the "clone" line is basically: anything but memcpy, and I don't think that's sufficiently transparent for users in practice. There is often a big difference between incrementing a reference count and cloning a vector, and yet both of them look like value.clone(). If you add in interior mutability, it gets worse, because now the difference between cloning an Rc<Cell> and cloning a Cell also affects how the program behaves in a logical sense.

I also agree that, in practice, we allow you to Copy things that will cause performance issues, and we force you to flag allocations that won't. The heuristics are not perfect. But that's a much harder line to draw.

In short, I do think there's a better set of distinctions possible, but I'm not sure exactly what they are, and the current set does seem to serve us fairly well. (I actually wrote a bit about this this morning.)

4 Likes

Minor correction: This particular situation can be handled under the existing proposal as I wrote it, but it would require authoring your own adapter that boxes one function and "inlines" the other. It's clearly more convenient with placement new.

Yeah, I saw that :frowning: I have to go back and try to fix it.

We might be able to use a dyn* type here too; I have to think about that. The idea would be that the return type is dyn* and so the callee actually returns a pointer to the thing being returned. But this wouldn't work if we think of things as returning dyn, clearly, so that probably means this is not as general as "you can return any ?Sized type", and it would be more of a way to adapt dyn trait return values specifically.

...on a related note, in the particular case we are talking about here, the fully general version isn't needed -- i.e., when building the vtable, we have a Sized return type, so we could conceivably store the size in the vtable and pre-allocate the result space, but it's not obvious how to fit that into Rust's type system.

Anyway, got to run for now, but I'll re-read the Zulip thread too. (Other links?)

I would argue that the distinction is qualitative, rather than quantitative.

That is, the reason to avoid allocations is not inherently about performance (quantitative); it's about capabilities (qualitative). Allocating memory requires a specific capability that is not available on all systems or in all libraries (e.g., embedded, no-std), and therefore a distinction is made.

The overlap with performance can be argued to be accidental.

I wonder how much dynamism is necessary, here.

In theory, the computation of how much space is necessary could be arbitrarily complicated, involving network calls, etc... for example, allocating a Vec out of the results of a select on a database.

There are various degrees here:

  • Compile-time sizes only: the caller knows, at compile time, how much it will request.
  • Immediate only: the caller only requires immediate calculations (no await/IO involved).
  • ...

I note that for the case at hand, the size of the state to box is known at compile time by the callee; the caller's issue is just that there are lots of callees, each with their own size.

12 Likes

I believe the allocate-maximal-size strategy could be further extended (or split in two) to avoid the inter-procedural analysis downside.

That is, if the trait method is annotated with #[future_size(128)] (or enforces a StackFuture return type), then each implementation can verify at compile time that it does not need to allocate more.

In such a scenario, it's up to the callee to decide on a strategy to preserve its state in a way that fits within the requirement, and a specific callee may decide to box part of its stack state.

There are obvious downsides, especially for generic traits, but given how heavy a downside inter-procedural analysis is (outright impossible with dynamic linking), I think it's worth mentioning as an alternative.
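A rough sketch of what such a size budget could desugar to on today's Rust (MAX_FUTURE_SIZE and SmallState are made-up names; a real #[future_size] attribute would presumably generate a check like this per implementation):

```rust
// Hypothetical per-trait budget that each implementation must fit within.
const MAX_FUTURE_SIZE: usize = 128;

// Stand-in for one implementation's generated state machine.
struct SmallState {
    buf: [u8; 64],
    pos: usize,
}

// Compilation fails here if the state ever grows past the budget.
const _: () = assert!(std::mem::size_of::<SmallState>() <= MAX_FUTURE_SIZE);

fn main() {
    assert!(std::mem::size_of::<SmallState>() <= MAX_FUTURE_SIZE);
}
```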

2 Likes

Ah, I hadn't realized that you could write an adapter specific to one trait. In that case, adapters might have one advantage over placement: when you are creating the trait object, you know what the concrete type is and how much space the future will need, so you may be in a better position to judge optimal allocation strategy.

That's basically the "which straw breaks the camel's back" question. There is no hard line that can be drawn on memcpys at the language level. 1GB is obviously a no-go, 1 byte is obviously trivial, but in between? Some contexts may not care about a one-off 1MB memcpy, while for others 64 bytes copied in a tight loop is a killer. At least the line between "arbitrary code execution" and "a simple memcpy" is obvious and useful.


I think this post merges two distinct problems into a single issue. Which is fair, because it is required for dyn async traits, but also counterproductive, because they are separate and both deserve a proper solution. Those two problems are 1. return position impl Trait in trait objects (which are only briefly mentioned), and 2. the desugaring of async blocks in the presence of trait objects.

If we focus on return position impl Trait first, then we see that all of the options described in Niko's post are really just two options: write the return value to some caller-provided allocation, or "inline" the return type into the trait; and the second option just doesn't make sense for general traits. So there is really only one option: return some specific dyn-safe type hidden behind the impl Trait, and place it where the caller tells you to. There is only one type which can be returned in the generic context, and that is dyn Trait. Of course, it is unsized, but if we can negotiate in some way the allocation size with the caller, then the ABI of functions returning unsized types may be made to include a hidden out-pointer to the provided allocation, which is then filled by the callee.

This case is quite important on its own. Proper support for safe self-referential and other non-movable types must include a way to return them, and this will require a safe way to deal with an out-pointer to the place of that type. We also need placement return in order to support creating large types directly on the heap. I think that RVO like in RFC 2884 is the wrong way to do it and would prefer some type-based way to return unsized values, but whatever the solution, we need to solve that problem.

Note that a dyn Trait is unsized, but its data object may also be unsized, so simply including a size of the returned data in the ABI is insufficient for the general case.


Which brings me to the problem 2: the desugaring of async blocks. If we accept that the proper way to deal with RPIT is to return dyn Trait, then we have a problem of having unsized objects inside of async state machines. This is also a problem encountered by the unsized locals feature.

My understanding is that the current expectation is to forbid unsized locals from living across an .await or yield point. But that is inconsistent with ordinary functions, limiting, and unfortunate in the light of the dyn Trait discussion. So, what if we just bite the bullet and solve that problem?

It would mean that the async state machine (StM) may contain unsized values, which makes it unsized itself. That would be undesirable in certain contexts, but those may put a Sized bound on the futures they operate on (which is also present at the moment, and included by default in structs containing futures). How can we handle an unsized async (or generator) StM in general?

Well, we know how to deal with it if it exists purely on the stack, and if the .await calls instantly return. In that case that's just an ordinary function with unsized locals alloca'd on the stack. There is a risk of stack overflow, because the stack usage cannot be verified at compile time, but it seems to be an acceptable cost. Following Niko's remark that a boxed future is a kind of segmented side stack, we can apply the same approach to futures living at an arbitrary place. We will treat its backing allocation as basically a stack, with the caveat that we only need to store the data which is held over the .await.

Now, of course this means that the allocation may be too small to fit the resulting state at runtime. This is unavoidable given the design that we want to get, and no worse really than the case of OS-provided stack. The caller must ensure that the allocation is big enough, and may choose whatever strategy to deal with it. It may preallocate a huge buffer, or it may have some external bound on the size of all futures, or use some static analysis to get a better bound, or perhaps we could support some API for segmented growing allocations (e.g. using bumpalo as the async stack), or it may forego using dyn async traits entirely and restrict itself to F: Future + Sized (which will likely be the preferred solution in embedded and high-assurance contexts).

Note that a possibility of OOM (or its cousin, stack overflow) is inevitable when working with dyn Trait, since you can never be sure about the size of the value returned by the function (and if you have a bound, it works just as well for async dyn traits).

However, this means that the future must have some way to know the size of its backing allocation at runtime, so that it may panic if that "async stack" overflows due to a too big alloca. I believe that information may be included in the vtable for Box<dyn Future + ?Sized>.

Thoughts?

2 Likes

I suspect the cure is worse than the disease in this instance. Supporting alloca across await would require a lot of effort, just to permit something that users should be avoiding anyway. Better to use a StackBox or similar, to be explicit about the possibility of OOM; also, the future itself will have a better idea of how much space it is likely to need, compared to the caller, who would just have to guess.

2 Likes

@tmandry and I spent an hour talking through the "choose at call-site" option. I can certainly see the appeal. I'm going to try to write up something diving into it and laying out some of the implications, but I think it will take me a few days.

To elaborate:

#![feature(unsized_locals, unsized_fn_params)]

trait Trait {}

async fn something() {}

fn frobnify_dyn_trait(foo: dyn Trait) {
    let local = foo;
}

fn frobnify_impl_trait(foo: impl Trait) {
    let local = foo;
}

async fn async_frobnify(foo: dyn Trait) {
    let local = foo;
    something().await;
    std::mem::drop(local);
}

When is frobnify_dyn_trait likely to overflow the stack? Approximately when frobnify_impl_trait would.

When is async_frobnify likely to overflow the stack? I have no idea.

I agree in the general case but I want to point out that "we cannot do heap allocations in this context" really is a bright line. For instance, @y86-dev recently brought up placement by return in another thread, in the context of instantiating mutexes in kernel mode, where allocations might be either lexically forbidden ("this code runs during early boot and we don't have a heap yet") or dynamically forbidden ("this code could be called from an interrupt handler").

7 Likes

This can't be done until Niko's point about Sized bounds is resolved one way or another.

I'll elaborate more later, but I think that the way to resolve this is to drop the idea that the type dyn Trait implements Trait. Instead, we would extend MIR with the idea of a "dyn call" as a new thing (right now we have only inherent calls and trait calls, and there is a special "dynamic dispatch impl" of the trait). In that case, the signature of methods when invoked through dyn Trait can be different than what is defined in the trait itself -- in particular, async fn and -> impl Trait would be rewritten to return dyn values.

2 Likes

That's very gracious of you. =)

No, that's a good summary.

It's funny, actually, I'd started to write a blog post to lay out the "placement return" solution to async traits, but you mostly hit the same points more concisely.

"Not carried by the time alone"? I think you made a typo.

(And I don't think a pass-the-allocator ABI would be necessary, at least not for RPITIT)

Do you mean "its associated type to be Sized"?

Huh, that's a good point.

That implies that RPIT in dyn traits wouldn't be a straight desugaring with the current rules, and would require additional chalk support.

I don't know. The analogy I made in the zulip thread was "when you pull in a pathfinding library and call find_path_with_astar(my_graph), you usually don't expect to control whether the A* implementation is using Vecs or VecDeques."

I would expect most library writers would be fine boxing futures if needed, and if someone wants their library to be used in no_alloc contexts, then they'd go the extra mile.

Also, for the specific use-case of async iterators, we could create an InlineDynAsyncIterator which implements the inline strategy (details in the zulip thread).

1 Like

Nah, it works if you assume that traits with async methods cannot be made into dyn traits.

(which is not ideal, but again, good enough for the MVP)

That locks out certain possibilities, like deciding that async trait methods allow returning unsized futures even when the implementing type is Sized. For example, struct WrappedDyn(Box<dyn AsyncTrait>) might want to implement AsyncTrait by delegating to its field.

1 Like