Blog post: A formulation for scoped tasks

I just posted "A formulation for scoped tasks in Rust" (Tyler Mandry) about scoped async tasks in Rust. From the introduction:

In this post, I attempt to state as precisely as possible the essential features that we might want from scoped spawn APIs and where those run into fundamental language constraints. In particular, by essential here I mean that removing the feature:

  • Makes a meaningful difference in the user experience, but
  • Also makes the problem tractable in Rust today.

My hope is that this creates a useful framework for understanding and reasoning about the problem space.

Let's use this thread for discussion of the post.

(NOT A CONTRIBUTION)

The language changes described as unexplored options here are non-starters for a ton of reasons, but one thing I never see discussed when people suggest the leak decision was a mistake is that futures would need to implement Leak to be spawned. When you spawn a future on a multitasking executor, it stuffs the state of that future into an Arc to form the body of the Waker that will be used to re-enqueue it. To do that safely, that future would need to be leakable, and this whole notion of unleakable types would not even be relevant.
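
To make the Arc point concrete, here is a minimal std-only sketch of how a multitasking executor typically ties a task to its Waker. It is not the code of any real runtime; `Task`, `BoxFuture`, `YieldOnce`, `run_to_completion` and `demo` are names invented for this illustration. The thing to notice is that `Waker::from(Arc<W>)` requires `W: Send + Sync + 'static`, and that the task is kept alive by reference counting rather than by any scope.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical alias: note the `'static` bound on the stored future.
type BoxFuture = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

// The task state lives in an Arc; the Waker is just another handle to it.
struct Task {
    future: Mutex<Option<BoxFuture>>,
    queue: Mutex<Sender<Arc<Task>>>,
}

impl Wake for Task {
    // Waking re-enqueues the task by sending its Arc back to the run queue.
    fn wake(self: Arc<Self>) {
        let queue = self.queue.lock().unwrap().clone();
        let _ = queue.send(self);
    }
}

// A future that returns Pending once and immediately asks to be polled again.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// Runs every future to completion and returns the number of polls performed.
fn run_to_completion(futures: Vec<BoxFuture>) -> usize {
    let (tx, rx) = channel::<Arc<Task>>();
    for fut in futures {
        let task = Arc::new(Task {
            future: Mutex::new(Some(fut)),
            queue: Mutex::new(tx.clone()),
        });
        tx.send(task).unwrap();
    }
    drop(tx);
    let mut polls = 0;
    // The loop ends once every Task (and therefore every Sender) is gone.
    while let Ok(task) = rx.recv() {
        // `Waker::from` needs `Task: Wake + Send + Sync + 'static`.
        let waker = Waker::from(task.clone());
        let mut cx = Context::from_waker(&waker);
        let mut slot = task.future.lock().unwrap();
        if let Some(mut fut) = slot.take() {
            polls += 1;
            if fut.as_mut().poll(&mut cx).is_pending() {
                *slot = Some(fut);
            }
        }
    }
    polls
}

fn demo() -> (usize, usize) {
    let counter = Arc::new(AtomicUsize::new(0));
    let mut futures: Vec<BoxFuture> = Vec::new();
    for _ in 0..3 {
        let counter = counter.clone();
        futures.push(Box::pin(async move {
            YieldOnce(false).await;
            counter.fetch_add(1, Ordering::SeqCst);
        }));
    }
    let polls = run_to_completion(futures);
    (polls, counter.load(Ordering::SeqCst))
}
```

Calling `demo()` drives three tasks, each of which is polled twice. Because the Waker owns a strong reference to the task, nothing stops a Waker from outliving whatever scope spawned the task, which is the concern raised above.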

I'm also not sure what "unsafe contract that it will be polled to completion" means in the context of passing the future to a reactor and waiting for that reactor to re-enqueue it. Even if these weren't unworkable for backwards compatibility reasons, I suspect they may not be workable period.


The way I’m imagining it, it would have to go through a couple of control layers that would make it significantly less ergonomic to use than “just capture a reference in your async block”. I’m also not sure yet if it would be sound. But it might be worth tinkering with, if only to see more clearly the limits of what the language can express

I doubt there's a zero-overhead sound solution for borrowing with parallelism; I think in the end the only sound thing to do would be to put them in Arcs, and then they're 'static so you've just arrived back at spawn. You should explore your ideas though because I could be wrong.


I think the best solution is to distinguish between spawns that introduce parallelism by creating a task that can be stolen separately from this one and spawns that don't but don't have to be 'static. In effect users already do this when we use both spawn and FuturesUnordered. The performance footguns in FuturesUnordered should be investigated, and one should ask if something like a scoped API would be less foot gunny.

Need this be the case? If the future is scoped, then can't its state (and body of the Waker) reside in/be owned by the scope's stack frame, avoiding an Arc?

(NOT A CONTRIBUTION)

Then what do you put into the Waker, so that the task gets re-enqueued? The Waker needs to be 'static. Probably some sort of index to identify which task is getting reenqueued, but if the waker outlives the scope and is awoken later you're getting a runtime panic, because the waker isn't tied to that scope at all. And you need some kind of reference counting there too so that the index doesn't get reused while the waker is still alive, causing the waker to enqueue a totally different task. No one ever talks about how this aspect of the system would work in the magic world where the soundness issue was allegedly resolved by a type system change.

Similarly, the idea that you could have an unsafe poll method which has a contract "will poll to completion" is way too vague to be viable. First of all, you can't have a soundness invariant that something WILL happen, only that it WON'T happen - in extremis, an asteroid could hit the earth and then the poll will not finish. Of course what this requirement actually wants to state is that some other code will not run unless this future's poll method has already returned Ready. Actually trying to specify what code will not run until the poll returns ready and making that work with the underlying task/wake model is where I expect serious problems would re-emerge.

Maybe this would all be possible with a completely redesigned async system or completely redesigned ownership model, but we just don't know. But then because the surface level problem can be resolved by saying that the type system would just work differently, people toss it around as if it's a viable alternative we just didn't pursue out of ignorance or poor design. What people don't seem to understand is that it's not like we have a working alternative model waiting to go if we just broke compatibility and switched to it.

(NOT A CONTRIBUTION)

Sorry to spam a bit but I've thought a bit more about this:

The way I’m imagining it, it would have to go through a couple of control layers that would make it significantly less ergonomic to use than “just capture a reference in your async block”. I’m also not sure yet if it would be sound. But it might be worth tinkering with, if only to see more clearly the limits of what the language can express

It's helpful to relate scoped tasks to io-uring. It's not a coincidence that people think they would both be solved by some sort of linear typing / guaranteed destructors: the soundness issue underlying them is exactly the same. In both cases, you want to share a reference with another process (in a CSP/system model sense) - in io-uring that process is the kernel, in scoped tasks that process is another thread in your program. The fundamental problem is that Rust's static lifetime analysis can't accommodate dynamic process scheduling.

However, io-uring is an instructive example, because there are 3 ways I know of to make that data sharing sound.

The simplest is just passing ownership. You can pass it back when the task is done, too (e.g. by returning it from the future or passing it back through a channel). Of course this means only the parallel task can have access to the data until it's done with it.
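
As a rough illustration of the ownership round trip, using plain threads rather than io-uring or an async runtime (the function name `sum_on_worker` is made up for this sketch):

```rust
use std::thread;

// Hypothetical helper: hand a buffer to a worker thread and get it back.
fn sum_on_worker(buffer: Vec<u64>) -> (Vec<u64>, u64) {
    // `move` transfers ownership of `buffer` to the worker, so the parent
    // cannot touch it while the worker is running.
    let handle = thread::spawn(move || {
        let total: u64 = buffer.iter().sum();
        // Ownership is passed back as part of the return value.
        (buffer, total)
    });
    handle.join().expect("worker panicked")
}
```

The parent only regains access to the buffer through the value returned from `join`, which is the "only the parallel task can have access until it's done" restriction described above.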

Then there's shared ownership. You put the data into an Arc and both tasks hold it. Now they can both access it in parallel. If they need exclusive access you use a runtime coordination primitive like a Mutex.
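
A small sketch of the shared-ownership variant, again with plain threads for simplicity (`parallel_increment` is an invented name):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: several workers mutate one counter through Arc<Mutex<_>>.
fn parallel_increment(workers: usize, per_worker: u32) -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each worker holds its own strong reference, so the data is
            // effectively 'static from the worker's point of view.
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // The Mutex provides exclusive access at runtime.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

Nothing here is borrowed across threads, which is why this shape already works with an ordinary 'static spawn.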

Finally, there's the trick used to make something like a BufReader work for io-uring. The parent task would hold ownership of the data it was sharing with the child task by reference, but use a runtime check to make sure it never accesses it again until the child returns. Note that this means the scope object on the parent task needs to own the data, even though it passes it to the child(ren) by reference. And it would also need to be at least pinned, or maybe that won't work and it must be heap allocated (the BufReader definitely just relies on the fact the buffer is known to be heap allocated to deal with dropping the owner handle).

The child(ren) tasks would give back their lease on the data as their handle to it drops. To get the data back on the parent task, there would be some accessor on the scope that awaits all the children's handles dropping. If a child task's handle leaks, too bad: the data is leaked as well and this accessor will await indefinitely.
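
A very rough sketch of the lease idea, with heavy caveats: it uses threads and a blocking `reclaim` instead of an async accessor, it heap-allocates the data in an Arc, and `Scope`, `Lease`, `lease`, `get` and `reclaim` are all names invented here. It only shows the shape: the scope owns the data, children get handles, and the parent can take the data back only once every handle has been dropped.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;
use std::thread;

// Handle given to a child. Fields drop in declaration order, so the data
// reference is released before the parent can observe the disconnect.
struct Lease<T> {
    data: Arc<T>,
    _returned: Sender<()>,
}

impl<T> Lease<T> {
    fn get(&self) -> &T {
        &self.data
    }
}

// The scope owns the data and tracks outstanding leases through the channel.
struct Scope<T> {
    data: Arc<T>,
    lease_tx: Sender<()>,
    lease_rx: Receiver<()>,
}

impl<T> Scope<T> {
    fn new(data: T) -> Self {
        let (lease_tx, lease_rx) = channel();
        Scope { data: Arc::new(data), lease_tx, lease_rx }
    }

    fn lease(&self) -> Lease<T> {
        Lease { data: Arc::clone(&self.data), _returned: self.lease_tx.clone() }
    }

    // Blocks until every lease has been dropped. If a lease is leaked
    // (for example with mem::forget), this never returns.
    fn reclaim(self) -> T {
        let Scope { data, lease_tx, lease_rx } = self;
        drop(lease_tx);
        // Nothing is ever sent, so recv only returns once all Senders
        // (one per lease) are gone.
        let _ = lease_rx.recv();
        Arc::try_unwrap(data).ok().expect("every lease has been dropped")
    }
}

fn demo() -> (Vec<u32>, Vec<u32>) {
    let scope = Scope::new(vec![1u32, 2, 3]);
    let workers: Vec<_> = (0..2)
        .map(|_| {
            let lease = scope.lease();
            thread::spawn(move || lease.get().iter().sum::<u32>())
        })
        .collect();
    let data = scope.reclaim();
    let sums = workers.into_iter().map(|w| w.join().unwrap()).collect();
    (data, sums)
}
```

Leaking a `Lease` here is safe but makes `reclaim` hang forever, which mirrors the "accessor will await indefinitely" behaviour described above.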

I think this is probably basically what tmandry had in mind.

I ran into this issue recently as well and was surprised to see that this is not possible with Rust today. I'm not sure, but it seems there can't be a fundamental problem with scoped async, since it is possible with "normal" threads as well and the underlying problem is the same.

The reason scoped tasks work in non-async Rust is because one can simply use the fact that linear control flow guarantees that a function will not simply stop halfway. In async Rust, due to cancellation, this can happen at any await point. One possible solution (at the language-level) that I wonder could maybe do the trick is some kind of async-compatible defer mechanism. A deferred statement would run at the end of the scope, even if the future is cancelled. This way, there's no need to rely on Drop at all and the solution is more similar to how it is done in non-async Rust. Not sure if I'm missing something, could this work?

Future cancellation is Drop; there's no difference. The guarantee which is needed for scoped async to be sound is exactly that the future isn't forgotten; cancellation/drop is perfectly acceptable.
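
A small std-only sketch of that distinction, assuming nothing beyond the standard library (`NoopWake`, `SetOnDrop` and `cleanup_ran` are illustrative names): a future is polled once so that it is suspended at an await point, then either dropped (cancelled) or forgotten.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

// A waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Records whether its destructor ran.
struct SetOnDrop(Arc<AtomicBool>);

impl Drop for SetOnDrop {
    fn drop(&mut self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

// Polls a future once, then either drops it or forgets it, and reports
// whether the guard held across the await point was cleaned up.
fn cleanup_ran(forget_it: bool) -> bool {
    let flag = Arc::new(AtomicBool::new(false));
    let guard = SetOnDrop(flag.clone());
    let mut fut = Box::pin(async move {
        let _guard = guard;
        // Suspend forever; the guard is live across this await.
        std::future::pending::<()>().await;
    });

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());

    if forget_it {
        // Safe code, but no destructor inside the future ever runs.
        std::mem::forget(fut);
    } else {
        // Cancellation: dropping the future drops everything it holds.
        drop(fut);
    }
    flag.load(Ordering::SeqCst)
}
```

`cleanup_ran(false)` reports that the guard's destructor ran, while `cleanup_ran(true)` reports that it did not, and both paths are entirely safe code.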

The following is adapted from the linked blog post.

Scoped futures are fundamentally unsound so long as they can be forgotten.

The fundamental difference with scoped threads is that a thread scope blocks when you exit it.

let data = ..;
thread::scope(|s| {
    s.spawn(|| worker_thread());
    s.spawn(|| { victim(&data); });
    // here we block until all scoped threads have finished
});
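
For reference, a runnable version of the same shape with std::thread::scope (stable since Rust 1.63); the worker bodies here are placeholders chosen for the sketch:

```rust
use std::thread;

fn demo() -> (u32, usize) {
    let data = vec![1u32, 2, 3];
    let result = thread::scope(|s| {
        // Both threads borrow `data` without any 'static bound.
        let sum = s.spawn(|| data.iter().sum::<u32>());
        let len = s.spawn(|| data.len());
        (sum.join().unwrap(), len.join().unwrap())
        // On exit, `scope` blocks until every spawned thread has finished.
    });
    // Fine: no scoped thread can still be running here.
    drop(data);
    result
}
```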

The same API would actually be sound for async:

task::scope(|s| {
    s.spawn(worker_task());
    s.spawn(async { victim(&data).await; });
    // here we block until all scoped tasks have finished
});

but only if task::scope is synchronous; this is, in effect, the equivalent of writing

task::block_on(async {
    join! {
        worker_task(),
        async { victim(&data).await; },
    }
});

and hey, this is sound and allowed in all implementations of block_on! But blocking obviously isn't what we want, and if we return a future from scope instead, I can forget it:

let tasks = task::scope(|s| {
    s.spawn(worker_task());
    s.spawn(async { victim(&data).await; });
    // tasks continue running until the scope is awaited
});
// oops, I didn't await the scoped tasks
forget(tasks);
// the scope lifetime is over but the tasks are still running
// and now I can cause all sorts of UAF havoc, like just
drop(data);

Well, alright, make it a macro and include .await in the macro to ensure it gets awaited. Unfortunately, we've only temporarily deferred the issue:

let tasks = Box::pin(async {
    task::scope!(|s| {
        s.spawn(worker_task());
        s.spawn(async { victim(&data).await; });
        // tasks are awaited here
    });
});
// advance `tasks` far enough to spawn the subtasks
runtime::poll_once(tasks);
// oops I forgot to drive the future to completion
forget(tasks);
// subtasks are still running, lifetime over, mayhem time
drop(data);

The only way in which spawning a scoped task can be sound is if the root future is 'static[1] (thus leaking it is also leaking any resources internal scoped tasks borrow), or if you somehow require the futures to be polled to completion and/or dropped.

Of course, this doesn't matter all that much, actually, because the real scoped task concurrency is just join!. If polls are relatively short and don't block, there's not much difference between join!(spawn(a), spawn(b), spawn(c)) and join!(Box::new(a), Box::new(b), Box::new(c)); that's the entire point of async: concurrency without spawning.
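
To illustrate "concurrency without spawning" with only the standard library, here is a hand-rolled sketch. `Join2`, `YieldOnce`, `ThreadWaker` and `block_on` are simplified stand-ins written for this example, not the real join! macro or any runtime's block_on. The two joined futures borrow a local Vec, so neither is 'static.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Minimal two-way join: polls both children from a single task.
struct Join2<A: Future, B: Future> {
    a: Pin<Box<A>>,
    b: Pin<Box<B>>,
    a_out: Option<A::Output>,
    b_out: Option<B::Output>,
}

// The children are boxed and the outputs are never pinned, so this is fine.
impl<A: Future, B: Future> Unpin for Join2<A, B> {}

impl<A: Future, B: Future> Join2<A, B> {
    fn new(a: A, b: B) -> Self {
        Join2 { a: Box::pin(a), b: Box::pin(b), a_out: None, b_out: None }
    }
}

impl<A: Future, B: Future> Future for Join2<A, B> {
    type Output = (A::Output, B::Output);
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        if this.a_out.is_none() {
            if let Poll::Ready(v) = this.a.as_mut().poll(cx) {
                this.a_out = Some(v);
            }
        }
        if this.b_out.is_none() {
            if let Poll::Ready(v) = this.b.as_mut().poll(cx) {
                this.b_out = Some(v);
            }
        }
        if this.a_out.is_some() && this.b_out.is_some() {
            Poll::Ready((this.a_out.take().unwrap(), this.b_out.take().unwrap()))
        } else {
            Poll::Pending
        }
    }
}

// Returns Pending once so the two children actually interleave.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Blocking driver with no 'static bound: the future cannot outlive this call.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => thread::park(),
        }
    }
}

fn demo() -> (u32, usize) {
    let data = vec![1u32, 2, 3];
    let sum = async {
        YieldOnce(false).await;
        data.iter().sum::<u32>()
    };
    let len = async { data.len() };
    let out = block_on(Join2::new(sum, len));
    out
}
```

Both children make progress inside one task, and forgetting the joined future would simply stop them; nothing keeps running after the borrow of `data` ends.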

I'm somewhat tempted to implement a task::scope-looking interface around FuturesUnordered to prove a point here[3]. Yeah, it's unfortunate that a single task having an unexpectedly long poll prevents making progress on the others in the same cluster, but that's the ticket price for multiplexing many tasks on one system thread.


  1. This is actually an approach I've rarely seen mentioned, and that was in the scope of async Drop rather than scoped spawned tasks. This would be nearly as invasive to adopt as a Leak auto trait (but only for async-related code), but I do believe an unsafe trait UnsafeFuture supertrait to Future with the requirement of being polled to completion or dropped before its lifetime ends, where futures containing scoped spawns implement UnsafeFuture instead of Future, does make scoped task spawns sound, and UnsafeFuture + 'static can complete back to Future. The justification is basically that since they're 'static, I could safely record all of them that have been started and poll them again whenever, justifying the progress made on the subtasks. (This justification doesn't quite cover Pin<&'short mut UnsafeFuture + 'static>, but even if our loan expires, it's still either there or dropped, theoretically able to be polled by someone.) While you're making UnsafeFuture and getting everyone to thread it around, though, might as well also include a guarantee of async Drop[2] as well, so those use cases can also benefit. ↩︎

  2. And just a bit of a tangential thought: async Drop is fine if not doing it leaks, and it's just a resource optimization over sync Drop; the soundness issues arise when Drop (whether async or not) being executed is relied on for soundness; it's all different versions of the same issue of relying on sub-'static leak freedom. My observation here is purely that leaking specifically 'static values cannot cause any soundness problems caused by lack of leak freedom. ↩︎

  3. Oh, and this API is beneficial even for 'static tasks which can actually be spawned, if the spawn function is provided by the async Context, since it provides a poll point to pull it out and hand it to the async world. ↩︎

Future cancellation is Drop; there's no difference. The guarantee which is needed for scoped async to be sound is exactly that the future isn't forgotten; cancellation/drop is perfectly acceptable.

How does that work from inside an async context? I was thinking about this from the perspective of already async code spawning an async task, for example:

pub async fn some_async_function(data: &Data) {
    let shared_guard = SharedGuard::new(); // NOTE: when dropped it will wait (blocking) for the lock
    let task = task::spawn_local(async {
        let _scope_guard = shared_guard.lock();
        victim(data).await;
    });
    task.await;
    shared_guard.lock(); // NOTE: this is blocking
}

In the above example, if the caller forgets the some_async_function future, is that unsound? Either the future some_async_function is cancelled, in which case everything is fine, or it is forgotten, in which case it will keep running and eventually reach shared_guard.lock() and the lifetime constraint is upheld. What am I missing here? :thinking:

Okay, after some thinking I realised that the above code is just as wrong. Even though it is guaranteed that some_async_function will reach the blocking lock on the guard at some point, assuming that it is driven to completion, the caller (or more usually the runtime) could in theory forget the future without running it to completion at all. The only way to prevent that is to block in place on the task: this effectively locks out the runtime/caller and prevents them from forgetting the future, sidestepping the issue (and defeating the purpose of async tasks).

Of course, this doesn't matter all that much, actually, because the real scoped task concurrency is just join!. If polls are relatively short and don't block, there's not much difference between join!(spawn(a), spawn(b), spawn(c)) and join!(Box::new(a), Box::new(b), Box::new(c)); that's the entire point of async: concurrency without spawning.

I ran into the scoped future issue when building a custom Future that offloads all of its work to a single worker thread (I posted on r/rust about this a couple of weeks ago: Reddit - Dive into anything). Since spawning work on the inner worker thread requires a static lifetime bound, I'm running into an issue similar to this scoped task problem (but not entirely the same).

I'm surprised there's no way for getting around this issue. Especially since the unsafety arises only in the pathological case. I opted for simply putting a big fat warning on my code (which is not open source for now anyway) not to forget the future.

I'm really hoping that someone will be able to come up with a solution. If only there was a way to just panic and crash the program when the pathological case comes up, that would be enough (but being able to do that seems as hard as solving the issue itself, if I understand correctly).

First, I'm sorry for being slow to respond on this thread. I'll tweak my settings to make sure I see replies here sooner. It's been a while since I had the context of the original post fresh in my mind, but I'll try to answer to the best of my ability.

What I'd like to say for the purpose of this post is that any memory borrowed from the Future will not be invalidated without some code running: either it will be polled to completion, or the destructor will be run, before memory is invalidated. So something like the Pin drop guarantee, but applied to the fields of the Future/async block.

It's possible this might be unworkable (or insufficient) for some reason or another.

That's a salient point; most executors are implemented using reference counting today.

Perhaps it should be okay to silently drop a wake-up of a task that's no longer around. Wakeups are allowed to be spurious and essentially no-ops, so this doesn't automatically raise red flags for me; in fact I think I know of at least one async executor that does this.

I think the second problem you mentioned could be solved using generational indices instead of reference counting.
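
A minimal sketch of what generational indices could look like for this purpose. `TaskKey` and `TaskTable` are hypothetical names and this is not taken from any existing executor; the point is only that a stale key is detected by a generation mismatch rather than by reference counting.

```rust
// What a waker would carry instead of an Arc: a slot index plus the
// generation the slot had when the task was inserted.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TaskKey {
    index: usize,
    generation: u64,
}

struct Slot {
    generation: u64,
    occupied: bool,
}

#[derive(Default)]
struct TaskTable {
    slots: Vec<Slot>,
    free: Vec<usize>,
}

impl TaskTable {
    // Registers a task, reusing a free slot if there is one.
    fn insert(&mut self) -> TaskKey {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index];
            slot.occupied = true;
            TaskKey { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, occupied: true });
            TaskKey { index: self.slots.len() - 1, generation: 0 }
        }
    }

    // Removes a task and bumps the generation so old keys become stale.
    fn remove(&mut self, key: TaskKey) {
        if let Some(slot) = self.slots.get_mut(key.index) {
            if slot.occupied && slot.generation == key.generation {
                slot.occupied = false;
                slot.generation += 1;
                self.free.push(key.index);
            }
        }
    }

    // Returns true if the key refers to a live task (which would be
    // re-enqueued), false if the wake-up is stale and should be ignored.
    fn wake(&self, key: TaskKey) -> bool {
        match self.slots.get(key.index) {
            Some(slot) => slot.occupied && slot.generation == key.generation,
            None => false,
        }
    }
}
```

After a slot is reused, a wake through the old key returns false instead of enqueueing the unrelated task that now occupies the slot.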

Yep, I was thinking of something along those lines!

(NOT A CONTRIBUTION)

I've thought about it and I think what you could do is design an executor which has single ownership of the tasks and hands out weak references as the waker. The executor also needs to track these weak references to know if a future has gotten stuck and should be cancelled (since it won't naturally fall out of the future failing to register a waker and dropping its strong count to 0). And you would probably need async drop for this system to work well at all, which runs into other problems with async drop.
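
A tiny sketch of just the weak-waker part, under the assumption that the executor is the sole strong owner of each task (`TaskState`, `WeakWaker` and `demo` are illustrative names; tracking stuck futures and async drop are not modelled):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Weak};
use std::task::{Wake, Waker};

// Stand-in for a task owned solely by the executor.
struct TaskState {
    wakeups: AtomicUsize,
}

// The waker only holds a weak reference, so it never keeps the task alive.
struct WeakWaker {
    task: Weak<TaskState>,
}

impl Wake for WeakWaker {
    fn wake(self: Arc<Self>) {
        // If the task is gone, the wake-up is silently dropped.
        if let Some(task) = self.task.upgrade() {
            task.wakeups.fetch_add(1, Ordering::SeqCst);
        }
    }
}

fn demo() -> (usize, usize, bool) {
    let task = Arc::new(TaskState { wakeups: AtomicUsize::new(0) });
    let waker = Waker::from(Arc::new(WeakWaker { task: Arc::downgrade(&task) }));

    waker.wake_by_ref();
    let wakeups = task.wakeups.load(Ordering::SeqCst);
    // The waker did not add a strong reference to the task.
    let strong = Arc::strong_count(&task);

    let observer = Arc::downgrade(&task);
    // The "executor" drops the task while a waker is still outstanding.
    drop(task);
    // Stale wake: nothing to upgrade, so nothing happens.
    waker.wake();
    (wakeups, strong, observer.upgrade().is_none())
}
```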

But the important thing is that this would make non-leaking futures incompatible with the existing async ecosystem. It would be a very disruptive shift, even if you could do it without something like adding a Leak trait and making a language level change.

That guarantee is not enough because you can leak the type with an Rc cycle without invalidating its memory, therefore the type's memory is not invalidated, but the lifetime is allowed to end without polling it to completion or running its destructor.
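
For readers who haven't seen it, this is the kind of leak meant here, in entirely safe code (`Node` and `destructor_ran` are names made up for the example):

```rust
use std::cell::{Cell, RefCell};
use std::rc::Rc;

// A node that records whether its destructor ran.
struct Node {
    dropped: Rc<Cell<bool>>,
    next: RefCell<Option<Rc<Node>>>,
}

impl Drop for Node {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

// Creates a node, optionally makes it point at itself, drops our handle,
// and reports whether the destructor ran.
fn destructor_ran(make_cycle: bool) -> bool {
    let dropped = Rc::new(Cell::new(false));
    let node = Rc::new(Node {
        dropped: Rc::clone(&dropped),
        next: RefCell::new(None),
    });
    if make_cycle {
        // The node now keeps itself alive: its strong count never hits 0.
        *node.next.borrow_mut() = Some(Rc::clone(&node));
    }
    drop(node);
    dropped.get()
}
```

With the cycle, the memory is never invalidated and the destructor never runs, yet any lifetime the node had captured would be free to end.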

I'll also note that my original plan was to make poll unsafe and just include an invariant that users cannot move it between calls to poll. We quickly got the sense that this would have been really hard for users to validate and developed Pin. For all the complaints people have about Pin, it would have been way worse to put the burden of validating this very broad invariant on anyone writing a poll method. I would worry about any proposal to create an unsafe poll method with a user-checked invariant even if it weren't an effectively breaking change.

Can you say more about this? I don't immediately see the connection between weak wakers and async drop.

Yes, it would be disruptive. I would want to find a way such that it's possible to upgrade most code in a way that preserves API compatibility with old code, needing new APIs in only a small number of cases, but I'm not sure if that's possible without additional language features since leakability is very effect-like. It would also require a migration across the ecosystem for users to benefit.

None of these are true non-starters for me, given how many places this problem appears (not just in async), but they do represent a very high bar to clear to justify why it's worth it. Put another way, I think perhaps Rust can handle the level of change of a Leak trait, but only 1 or 2 of them per decade.

(NOT A CONTRIBUTION)

If you drop a scoped task handle, it has to block the thread until the child task completes. This is obviously not acceptable, so you need a way for it to block only this task. That's async drop.

Async drop is essentially just a cooperative cancellation mechanism. If you can't non-cooperatively cancel tasks (which is inherent in making parallel scoped tasks work), you need a way to cooperatively cancel them. This is true for both combinators cancelling tasks (like selecting them) and for the executor; it's not tied to the different waker system.

There are a few other situations that the executor also needs to handle in relation to this which current ones don't need to consider:

  1. If the task returned Pending but it didn't register a waker, the executor needs to cancel the task or else it will leak, since it will definitely not be woken again. Do they async drop it or just drop it in this case?
  2. The same question, but it returned Pending in async drop with a waker count of 0, rather than in its normal poll. Clearly, this should just be dropped. This has some similarity to panicking while unwinding.
  3. When a waker drops, it also needs to check the waker count. If the last waker drops, it needs to enqueue the task to be async dropped. But the waker also needs to be able to tell if the task was in async drop already, because if it was it needs to signal the executor to just drop the task. Again, somewhat like panicking while unwinding.

This consideration of the executor suggests (unless someone has a better idea) that the idea that async drop is never guaranteed might be fundamental; even with !Leak types, you still only guarantee that drop will be run, and async drop is a best-effort optimization. So blocking drop always needs to act as a fallback, blocking if necessary (but this would never happen in correctly written code). This suggests poll_drop_ready is the right approach.

I think the only viable way to do this is to have a Leak trait, and I think there's no way to do that that's not a breaking change. I think even if you found some technically non-breaking way to do it, it would be as disruptive as a breaking change. I think Rust can handle 1 or 2 of these in its lifetime, not per decade. And frankly I don't think the project currently has a track record of shipping and building consensus around a vision such that I think it could present a compelling 2.0, which is the bigger problem.

I do suspect that the leak decision should have gone the other way. If we were in that situation, but still had an async ecosystem just like it exists, that's built around shared ownership of Leak futures, someone could come out today and publish a new runtime that takes non-Leak futures and has scoped tasks and it would be a real competition between materially different APIs. But since the decision went the other way, I'm not convinced that it would be positive for Rust to make the transition to having leak just to support this API.

Whenever I read these blogs about scoped tasks, this feels like the elephant in the room. Especially since I consider io-uring to be more important than scoped threads, and there is a lot of prior art on this topic. I personally believe Sabrina Jewson's Completion Future is the best solution for this problem, and I do think it is ultimately worth it.

The point is that the unsafety has to be somewhere. It can be in the task::scoped() call, in the poll() call, or in the forget() call. These cannot all be safe for our code to be sound, and we have to pick one. Now, given that the compiler is so good at abstracting our poll() calls with .await, wouldn't it be great if poll was the unsafe one?

This doesn't need to be a backwards incompatible change. We can have async blocks implement both Future and CompletionFuture. If all awaited items in the block implement Future then the whole block implements Future, otherwise it only implements CompletionFuture. This implicit structural typing is already used for Send and Sync. Futures would have a blanket implementation for CompletionFuture. The only problematic trait I can think of is IntoFuture.
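
A deliberately simplified sketch of the blanket-implementation part of this idea. The trait below is a hypothetical stand-in written for this post: it omits cancellation entirely and should not be read as the actual API of the completion crate. The automatic "async block implements one or both traits" behaviour would need compiler support and is not shown.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical trait: like Future, but polling is unsafe.
pub trait CompletionFuture {
    type Output;

    /// # Safety
    /// Placeholder contract for this sketch: once polled, the caller must
    /// keep driving the value to completion before any lifetime it
    /// captures ends.
    unsafe fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

// Every ordinary Future is trivially a CompletionFuture: it has no extra
// requirements, so the unsafe contract is vacuous for it.
impl<F: Future> CompletionFuture for F {
    type Output = F::Output;

    unsafe fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        Future::poll(self, cx)
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Polls an ordinary async block through the CompletionFuture interface.
fn poll_ready_via_blanket_impl() -> Poll<i32> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(async { 7 });
    // SAFETY: `fut` is a plain Future, so there are no obligations beyond
    // the ones safe code already upholds.
    unsafe { CompletionFuture::poll(fut.as_mut(), &mut cx) }
}
```

Code written against the weaker trait can accept both kinds of future, while code that needs the freedom to drop or forget keeps asking for Future.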

In terms of compatibility with the async ecosystem, the major benefits would be that io-uring based runtimes would be able to implement the async io-traits, improving compatibility that way. The drawback is that we would end up with two traits for Future.

(NOT A CONTRIBUTION)

I wrote extensively about io-uring and how it could be supported with the leakable futures model in 2020. Notes on io-uring - Without boats, dreams dry up

This is obvious. However, the problem is that you need to define a safety contract that is guaranteed to be sound. The contract as written on CompletionFuture is in my opinion flawed:

Once this function has been called and the type does not also implement Future, the user must not drop or forget the future until it has returned Poll::Ready or panicked.

This contract fails to define "drop or forget" in a way that can be guaranteed. This is exactly the problem that I raised with @tmandry's definition: what about leaking the value through an Rc cycle? Clearly that also needs to count as forgetting. So you can't define "dropping or forgetting" as simply invalidating the memory, you need something broader.

But a problem also emerges when a future "gets stuck," returning Pending while also not registering a waker (e.g. if you literally just await std::future::pending). With the executors that exist today the future naturally gets dropped and its resources cleaned up. But if I can't drop or leak the future (in a broader sense that is sound enough for scoped tasks and io-uring), what can I possibly do here? I could busy poll indefinitely hoping it eventually gets unstuck? I could abort the process? I can't see another choice here.

I really don't see how io-uring IO objects could implement the async-io traits if their read and write futures needed to be CompletionFuture and not Future.

My assumption is that we would have unsafe fn poll_read requiring the same guarantees of the caller; which is the major downside to me, now every single async-io implementor is having to write unsafe code (not to mention all the future combinators too). And I don’t see any way to just pay a performance cost to have a non-unsafe implementation (like is commonly done with requiring Fut: Unpin and boxing when needed).

(NOT A CONTRIBUTION)

To me that sounds like we would need a second Completion* version of every async trait, and every combinator would need to deal with unsafety to guarantee they never leak their interior values, and all of this so that CompletionFutures can't be executed on tokio or async-std and need a new runtime anyway. I would not call this proposal very compatible with the existing async ecosystem.

This is why I've always agreed with Patrick Walton's tweet:

Again, if the leakpocalypse had gone the other way, I would think all of this would be worthwhile experimentation. You'd just have some futures that impl !Leak and figure out a runtime that could support them, and people could experiment and maybe the ecosystem would naturally shift toward that and maybe it wouldn't.

But Rust decided that values of every type can be leaked. And unlike the move operator, I can't imagine a Pin-like trick to "trap" a value so that it can't be leaked - I tried this while looking into io-uring. So you need to specify the liveness guarantee that !Leak represents in an inverted way, tied to calling poll, and make executors and combinators manually guarantee that they meet it, and you're totally incompatible with the async ecosystem. It just doesn't seem like it has legs to me.

Yeah, I guess that something closer to what we really mean would be "the lifetimes captured by the future must last until the value returns Ready", but "not forgetting or dropping" is a clearer, albeit imperfect way to say that.

Well, here pending() is a future, not a completion future, so it would return Ready(()) in its first call to poll_cancel and it would get dropped. Getting stuck should not be a problem as long as the executor implements poll_cancel on their completion futures.

I mean, there is an incompatibility regardless of whether we adopt completion futures. IO-uring runtimes need to build all their ecosystem from scratch because they cannot implement the usual io-traits. But I see the point.

It seems to me that the challenge of slotting in CompletionFuture "above" Future is similar to that of slotting in LendingIterator "above" Iterator. You want to retrofit the ecosystem to support the weaker version of a trait, without breaking uses of the more powerful version.
