Async/Await - The challenges besides syntax - Cancellation

This is my second document about async/await:

This document describes the design choices around cancellation in the async/await/Futures environment, and how the design impacts library/application development.

I think we already discussed this topic in a few places (e.g. here and here). However I felt that the topic is important enough to warrant a full write-up and further exploration.


JavaScript doesn’t have standardized types for cancellation, so libraries need to define their own.

The Web API has an experimental AbortController, which functions almost identically to C#'s CancellationTokenSource.

A webservice might create requests to a variety of other services that need to be answered before a response for the actual user can be generated. However, from time to time one of those requests might take an exorbitant amount of time (e.g. due to network failures). Since the user won’t wait multiple minutes for the request to finish anyway, the sub-operations should instead be cancelled if no response has been received within a timeout window, and an error should be sent back to the user.

This actually touches on an example I have mentioned elsewhere about a service I implemented using C#'s async/await. The service was an HTTP API frontend to a backend that provided an AMQP API; it had an endpoint that took in an operation to run, sent a message to the backend, then waited for a response from the backend to send to the client. These operations took a relatively long time (from a couple of seconds to ~30 seconds max) and had high resource usage while running.

The HTTP server provided a CancellationToken to represent the client disconnecting; because of the high resource usage, we wanted to cancel the backend operation when this happened. In C# this was relatively simple: when the CancellationToken was cancelled, we asynchronously sent another AMQP message to cancel the operation and waited for a response before returning.

As far as I can tell, implementing something similar in Rust is less feasible. The standard way for the HTTP server to indicate client disconnection would be to drop the future responding to it; we could then synchronously send an AMQP message for the cancellation and block on the response, or spawn a separate task to send it. Obviously, sending synchronously would be a bad idea, as it would block the executor. Spawning a separate task is also bad because it disconnects the accounting of outstanding operations from the HTTP server: any kind of standard rate limiting would not be usable, as it would now need to understand that these outstanding cancellations should also be counted against the account's limits.


For the synchronous cancellation issues: would async Drop handle this? I’m playing with a “minimal trait effects” sketch, and I think it could allow this, so long as these types are only allowed in async contexts.

As for defer, I think it’d be greatly useful in synchronous code as well. People don’t consider panic correctness enough; synchronous code is not run-to-completion because of panics. (Of course, you can make the argument that it matters less and propagates the panic correctly, which I’ll grant is sometimes true.)

Every use of a custom scoped RAII guard or scope_guard call could be a defer call instead. Swift benefits from defer even when no exceptions are expected to happen.

Because of this, I don’t think defer would make async more foreign, rather it’d make sync and async better together.
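As an illustration of why this matters even in synchronous code, here is a minimal hand-rolled scope guard (ScopeGuard is an invented stand-in for the scoped RAII guards and scope_guard calls mentioned above); its cleanup closure runs even when a panic unwinds through the scope:

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

static CLEANED_UP: AtomicBool = AtomicBool::new(false);

/// Minimal scope guard: runs the stored closure when dropped,
/// including while a panic is unwinding the stack.
struct ScopeGuard<F: FnMut()>(F);

impl<F: FnMut()> Drop for ScopeGuard<F> {
    fn drop(&mut self) {
        (self.0)();
    }
}

fn main() {
    let result = panic::catch_unwind(|| {
        let _guard = ScopeGuard(|| CLEANED_UP.store(true, Ordering::SeqCst));
        panic!("something went wrong mid-function");
    });
    assert!(result.is_err());
    // The guard's closure ran during unwinding, just like a `defer` would.
    assert!(CLEANED_UP.load(Ordering::SeqCst));
}
```

A `defer` statement would express the same thing without the helper type, which is the point being made: the benefit is not async-specific.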



In general there exists the possibility to use a CancellationToken-like approach in Rust too. E.g. we can pass a Channel or ManualResetEvent as a parameter into each child async fn, and select! on it in order to determine the cancellation state. If cancellation is signaled, the method can run to completion.

However, that will only work well in environments where one owns the complete code. If there is anything in the stack which forces cancellation, then no guarantees are provided that methods run to completion.
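This cooperative model can be illustrated without any runtime. Everything below is invented for the sketch: the CancelToken type (a shared flag standing in for the Channel/ManualResetEvent one would really select! on), the Work future, and the busy-poll block_on. The point is that the future observes cancellation at a poll boundary and still runs its own code to completion:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical cancellation token: just a shared flag.
#[derive(Clone)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// A future that cooperatively observes the token: it notices the
/// cancellation signal on its next poll and still completes normally,
/// returning an "aborted" result instead of being dropped mid-work.
struct Work {
    token: CancelToken,
    steps_left: u32,
}

impl Future for Work {
    type Output = Result<(), &'static str>;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.token.is_cancelled() {
            // Cleanup would go here; the method runs to completion.
            return Poll::Ready(Err("cancelled"));
        }
        if self.steps_left == 0 {
            return Poll::Ready(Ok(()));
        }
        self.steps_left -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Minimal no-op waker and busy-poll executor, just enough for the demo.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn block_on<F: Future + Unpin>(mut fut: F) -> F::Output {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = Pin::new(&mut fut).poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    // Cancellation is signalled: the future observes it and finishes cleanly.
    let token = CancelToken::new();
    token.cancel();
    assert_eq!(block_on(Work { token, steps_left: 100 }), Err("cancelled"));

    // No cancellation: the future runs all its steps to completion.
    assert_eq!(block_on(Work { token: CancelToken::new(), steps_left: 3 }), Ok(()));
}
```

Note that this only works because nothing in the stack drops the Work future; an executor or combinator that drops it defeats the scheme, which is exactly the caveat above.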

I think your example is a good one for something where one wants to perform even asynchronous cleanup work - by continuing to run all async fns after the main task has been cancelled. This is indeed even harder than the synchronous cleanup that was part of my problem. It can be done in C# or Kotlin by suppressing the cancellation for the cleanup subtask.

There exist workarounds for it in Rust:

  • As you described, you can synchronously queue the cancellation work in the destructor and spawn another task which performs it.
  • In order to make sure the rate limiting/accounting on the server still works, the HTTP handler could acquire an async Semaphore permit for each request. When the request finishes normally, the permit is released directly. When cleanup is required, ownership of the permit is forwarded to the cancellation task, which then has to release it.

This should work - although it arguably separates one logical execution strand into two separate ones for implementation reasons.
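The permit-forwarding idea can be sketched with plain std, using a hand-rolled blocking limiter and a thread in place of an async Semaphore and task (Limiter, Permit, and try_acquire are all invented for the illustration):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical counting limiter standing in for an async Semaphore.
struct Limiter {
    available: AtomicUsize,
}

/// RAII permit: releases its slot back to the limiter on drop.
struct Permit {
    limiter: Arc<Limiter>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limiter.available.fetch_add(1, Ordering::SeqCst);
    }
}

impl Limiter {
    fn try_acquire(this: &Arc<Limiter>) -> Option<Permit> {
        let mut cur = this.available.load(Ordering::SeqCst);
        while cur > 0 {
            match this.available.compare_exchange(
                cur, cur - 1, Ordering::SeqCst, Ordering::SeqCst,
            ) {
                Ok(_) => return Some(Permit { limiter: Arc::clone(this) }),
                Err(seen) => cur = seen,
            }
        }
        None
    }
}

fn main() {
    let limiter = Arc::new(Limiter { available: AtomicUsize::new(2) });

    // The handler acquires a permit when the request starts.
    let permit = Limiter::try_acquire(&limiter).expect("capacity available");
    assert_eq!(limiter.available.load(Ordering::SeqCst), 1);

    // The request is cancelled: instead of releasing the permit in the
    // handler, ownership moves into the cleanup task, so rate limiting
    // still accounts for the outstanding cancellation work.
    let cleanup = thread::spawn(move || {
        // ... send the AMQP cancellation message, wait for the ack ...
        drop(permit); // released only once cleanup has finished
    });
    cleanup.join().unwrap();
    assert_eq!(limiter.available.load(Ordering::SeqCst), 2);
}
```

The design point is that the permit, not the task, carries the accounting: whichever strand of execution ends up owning it is the one counted against the limit.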


Other environments utilize variations of this pattern:

Another example is in OS kernels. In most kernels, if you kill a process, any threads currently running userland code are immediately terminated, but threads which are running kernel code – because they’re in the middle of a syscall – are kept alive until the syscall completes. That way they can release any kernel locks they’re holding and perform other necessary cleanup. To ensure that such threads exit promptly, certain blocking calls within the kernel, such as waiting on a mutex or condition variable (including as part of blocking I/O), check whether the current thread has been signaled and return an error code if so, e.g. EINTR in Unix kernels. The scheduler also knows how to abort those calls if they’re already in progress, with the same result (error returned to the caller).
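The kernel's EINTR pattern can be mimicked in plain Rust. Waitable, its interrupted flag, and the "interrupted" error value below are all illustrative, not a real API; the shape is the same, though: a blocking wait that checks whether it has been signalled and returns an error instead of blocking forever.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// EINTR-style blocking primitive: wait() returns Err("interrupted")
/// if the waiter has been signalled for cancellation.
struct Waitable {
    state: Mutex<WaitState>,
    cond: Condvar,
}

struct WaitState {
    ready: bool,
    interrupted: bool,
}

impl Waitable {
    fn new() -> Self {
        Waitable {
            state: Mutex::new(WaitState { ready: false, interrupted: false }),
            cond: Condvar::new(),
        }
    }

    fn wait(&self) -> Result<(), &'static str> {
        let mut st = self.state.lock().unwrap();
        loop {
            if st.interrupted {
                return Err("interrupted"); // analogous to EINTR
            }
            if st.ready {
                return Ok(());
            }
            st = self.cond.wait(st).unwrap();
        }
    }

    fn interrupt(&self) {
        self.state.lock().unwrap().interrupted = true;
        self.cond.notify_all();
    }
}

fn main() {
    let w = Arc::new(Waitable::new());
    let w2 = Arc::clone(&w);
    let waiter = thread::spawn(move || w2.wait());

    thread::sleep(Duration::from_millis(50));
    // "Kill" the waiter: the blocking call aborts with an error, giving
    // the waiting thread a chance to release locks and clean up.
    w.interrupt();

    assert_eq!(waiter.join().unwrap(), Err("interrupted"));
}
```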

On the other hand, a partial counter-example to the pattern exists in Unix userland, in the form of pthread cancellation. Like the examples you noted, pthread cancellation is not immediate and only takes effect when the thread calls one of various OS functions that are considered “cancellation points”. But by “takes effect” I don’t mean the OS function returns an error code; instead, it just terminates the thread without ever returning to the caller! However, it first runs any “cleanup routines” that you registered using pthread_cleanup_push (there’s also pthread_cleanup_pop).

Importantly, though, pthread cancellation is considered badly designed and is rarely used in practice.

In Rust

I think it’s important to preserve synchronous Drop as an option, because it provides low-level control of how coroutines run and maximal flexibility regarding allocation.

However, especially considering the CancelIoEx issue, it sounds like some async runtimes might want to adopt a model where Drop is simply never used for cancellation. Instead, all async functions would be expected to run to completion, and cancellation would be handled via an entirely separate mechanism.

If so, this would have important consequences! Any functions implemented using await should preserve the “never drop” property naturally. However, some combinators would need to be changed. For example, futures 0.3’s try_join! macro polls multiple futures in parallel, and if any returns an Err, it drops all of them. An “async-drop-safe” approach would have to be different. It could perhaps work by continuing to poll the other operations, but using a modified Context object that indicates cancellation. Low-level “blocking” async operations could then check the context object and return an error code.
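To make the try_join! behaviour concrete, here is a hand-rolled, std-only stand-in. try_join_once, Slow, and the single polling pass are simplifications invented for the sketch; the essential behaviour matches the macro: when one branch fails, the other future is dropped mid-flight, which is exactly what an "async-drop-safe" variant would have to avoid.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

static SLOW_DROPPED_EARLY: AtomicBool = AtomicBool::new(false);

/// Never completes; records whether it was dropped before finishing.
struct Slow;

impl Future for Slow {
    type Output = Result<(), ()>;
    fn poll(self: Pin<&mut Self>, _: &mut Context<'_>) -> Poll<Self::Output> {
        Poll::Pending
    }
}

impl Drop for Slow {
    fn drop(&mut self) {
        SLOW_DROPPED_EARLY.store(true, Ordering::SeqCst);
    }
}

/// Minimal no-op waker, just enough to build a Context for the demo.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Hand-rolled stand-in for try_join!'s core behaviour: poll the
/// branches, and if one returns Err, drop the other immediately.
fn try_join_once<A, B>(mut a: A, b: B) -> Result<(), ()>
where
    A: Future<Output = Result<(), ()>> + Unpin,
    B: Future<Output = Result<(), ()>> + Unpin,
{
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    if let Poll::Ready(Err(e)) = Pin::new(&mut a).poll(&mut cx) {
        // The sibling is cancelled-by-drop right here, mid-flight.
        drop(b);
        return Err(e);
    }
    unreachable!("demo: `a` always fails on its first poll")
}

fn main() {
    let outcome = try_join_once(std::future::ready(Err(())), Slow);
    assert_eq!(outcome, Err(()));
    // The still-pending future was dropped, not run to completion.
    assert!(SLOW_DROPPED_EARLY.load(Ordering::SeqCst));
}
```

An "async-drop-safe" version would replace the `drop(b)` with continued polling of `b` under a cancellation-signalling context, as described above.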

You’d also want to have a mechanism to detect accidental drops at runtime. This can be done today by just sticking a local variable that panics on drop into all your async fns. But the language could potentially make it more convenient by adding, say, a #[panic_on_drop] attribute, or even a way to make it the default within a scope.
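The "local variable that panics on drop" trick can be sketched as follows; CompletionGuard and defuse are hypothetical names for the illustration:

```rust
use std::panic;

/// A "drop bomb": placed at the top of an async fn, it panics if the
/// future is dropped before reaching the defuse() call at the end,
/// turning an accidental cancellation-by-drop into a loud failure.
struct CompletionGuard {
    armed: bool,
}

impl CompletionGuard {
    fn new() -> Self {
        CompletionGuard { armed: true }
    }
    fn defuse(mut self) {
        self.armed = false;
    }
}

impl Drop for CompletionGuard {
    fn drop(&mut self) {
        if self.armed && !std::thread::panicking() {
            panic!("future was dropped before running to completion");
        }
    }
}

fn main() {
    // Happy path: the "async fn" runs to its end and defuses the guard.
    let ok = panic::catch_unwind(|| {
        let guard = CompletionGuard::new();
        // ... body of the async fn ...
        guard.defuse();
    });
    assert!(ok.is_ok());

    // Simulated cancellation: the guard is dropped early and panics.
    let cancelled = panic::catch_unwind(|| {
        let _guard = CompletionGuard::new();
        // A `.await` point would be here; the future is dropped instead
        // of being resumed, so the guard never gets defused.
    });
    assert!(cancelled.is_err());
}
```

A #[panic_on_drop] attribute would essentially generate this guard for you.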

Anyway, this design could be implemented for just a particular async runtime, but you’d have to be careful to avoid any code that uses try_join! or other combinators from the “standard” world. On the other hand, because future combinators are not currently in std, it’s also not too late to standardize the never-drop pattern: say, add a standard is_cancelled() method to std::task::Context, and change futures-rs to make what I just described the default behavior.

Is that actually a good idea, though? I don’t know.

Type safety?

Even if it is implemented only for a particular async runtime, it would be nice if the compiler could guarantee that we don’t accidentally call any “normal” futures code, while still allowing us to take advantage of the .await syntax. To do this it would have to provide a way to use a different trait to distinguish “our” futures.

Maybe we could have a trait like

pub trait GenericFuturePlzBikeshedName<SomeContext> {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut SomeContext) -> Poll<Self::Output>;
}
I’d like to say that Future could then become an alias for GenericFuturePlzBikeshedName<std::task::Context>, but that doesn’t actually work as written, because std::task::Context has a lifetime parameter, and there’s no way to make MyContext something that can accept a lifetime parameter without some cumbersome use of GATs :expressionless:

But it could still be a separate trait which the async/await desugaring recognizes as an alternative to Future. To handle lifetimes, you’d want to allow things like

async fn foo() -> impl for<'a> GenericFuturePlzBikeshedName<MyContext<'a>>

I’m new to this topic and I’m trying to understand this problem,

But one workaround could be to have an “async trashbin” into which you could escape the underlying async function, I suppose. Then you could poll the trashbin to completion whenever you’d like (now and then, idk), and it could still hold the &mut buffer that could still receive OS writes.

I think it’s worth linking this issue (written by OP) for additional context:

To me personally zero-cost compatibility with completion-based APIs looks more important than potential surprises around future cancellations, especially considering that io_uring may become The Way to do async IO on Linux (and IOCP is already The Way for Windows).

I also thought about something similar, but IIUC one of the problems is that buffers passed to the kernel may be part of one future. I.e. let’s imagine one big future which does select! on two smaller sub-futures. One of the smaller sub-futures gets dropped and the big future’s execution continues as usual. But while executing the dropped sub-future we passed a buffer residing on the big future’s “stack” to the kernel, so now this space can be written to both by user-space code (during the big future’s execution) and by the kernel (during completion of the dropped sub-future’s IO). Also, you can always poll a future manually and drop it at any time, so relying on the executor here means safe code could trigger memory-safety issues.

BTW does anyone know if io_uring provides any cancellation API?


I heard that term a few times, but what would it be? Has there already been a definition/idea somewhere?
If it is Futures where drop can be an async fn, then we again have the problem that this async fn must be run to completion, and the same cancellation issues might apply inside it.

I think that’s the model I outlined at the end of the document. I also toyed with your idea of making run-to-completion optional (e.g. via a second trait) while also keeping the current definition of synchronously cancellable Futures. But I think that might just lead into compatibility hell, since then there would exist three types of functions:

  • run_to_completion async fn can call run_to_completion async fn, async fn and fn
  • async fn can call async fn and fn
  • fn can only call fn

And that doesn’t even take async Streams into account, which are already incompatible with async fn.

Obviously implementing run-to-completion Futures would also be harder than the current ones, since one would need to make sure to poll all potential inner Futures to completion as well. Potentially that even means that the new trait needs to be unsafe. Not sure whether that’s the biggest downside, since safely implementing Futures by hand that contain other Futures is already hard in the presence of Pin, but it certainly adds complexity.

If that route is ever investigated, I think adding a type to Context that represents a Future and can be select!ed on would make sense, in order to listen for cancellations. E.g. a Channel or ManualResetEvent.

The question is who moves things into the trashbin. If the application developer has to explicitly do it, then it’s basically the same as just spawning the remaining work as a new task. The “workaround” for implementing IOCP APIs also follows the trashbin (or: outstanding work queue) idea: when the user cancels the IO, the IO runtime keeps ownership of all associated resources and continues to drive the IO to completion. Since this is now outside of the user task’s visibility, it requires an owned buffer.

I just checked the APIs and haven’t found anything. But I guess as with most IO APIs one can cancel by closing the fd concurrently.


async Drop is very much in the rough sketch stage right now, and the extent of the specification is basically async fn drop.

I’m unsure what the transition from async Drop to "sync Drop" would look like. But depending on how much control we can exert over the use of async Drop types, it may be able to enforce run-to-completion for the drop. I’m not certain, though, because the sketch ends here. (I’m still slowly expanding the sketch, and with additive effects on Drop initially forbidden for this very reason.)

I think a kind of Future that panics if it is dropped before completion would help.

To help others use this API correctly, we’d need a #[must_await] attribute for async functions. If there’s a chance that the returned future is cancelled, the compiler should emit a warning. But I’m not sure how hard it would be to implement this warning.

That’s… actually not too bad of an idea, though definitely not exactly fine-grained.

Just to sketch out what it would look like:

Annotate an impl Future block with #[must_await] and the compiler will warn when a consumer calls Future::poll on a concrete value of this type, instead directing the user to use fut.await in an async context. An async block/fn that awaits a #[must_await] future inherits this warning. Note that erasing the future type (such as to put it on a generic executor) will also erase the warning. This also includes using a future as a generic parameter to a function which is not async.

The big limitation of this method is that anything that erases the future type necessarily loses the warning. Additionally, a large async function would warn on potential cancellation even if cancellation only matters during a small, fast critical section.

Actually, rereading this, it fails to allow a warning for creating a Select future, i.e. the root of almost all cancellation. The lint would have to be formulated to warn when selecting on a #[must_await] future in addition to direct polling.


I’m sorry, I’m not very familiar with all the details of async/await.

I think for the compiler warning, some false positives would be acceptable, if they can be silenced with #[allow(...)].

The idea that our story around cancellation is bad is confusing to me (but it’s not the only idea of this thread, of course). The fact that all futures are trivially cancellable is a key advantage of our design. But of course people will write bugs if they depend on two actions on opposite sides of an await point being logically atomic, instead of writing code that uses destructors to properly maintain state.

However, the problem with what we have so far is that destructors can only call synchronous code, but the “set up” is done in an async context, and so there are good reasons to want the “clean up” to be async as well. AsyncDrop is on the list for the future of async/await. The rest of my post is some notes about this.

Probably it is entirely the responsibility of the executor library to guarantee that AsyncDrop is called - probably by spawning a new task in the destructor of its task wrapper. I don’t know how we could guarantee async drop to be called, given that we don’t even provide a mechanism in the standard library to execute futures at all.

std would want to provide these two APIs:

trait AsyncDrop {
    async fn drop(&mut self);
}

async fn drop_async<T>(item: T) { /* .... */ }

drop_async would be a magic function that destroys the type properly, including calling the drop glue as necessary on its fields and so on. AsyncDrop would have the same magic limits as Drop: its impl must cover all valid instances of the type, you can’t call the method directly, and so on. (It’s not technically necessary to make Drop and AsyncDrop mutually exclusive to implement, I believe, but maybe it’s a good idea.)

Then executors would use their own spawn API to spawn(drop_async(task)) when a task gets dropped (whether because it resolved or because it has been cancelled).


What about combinators like try_join which internally drop? I assume they would have to use drop_async as well?

I guess this would have to become part of the Future contract, requiring the use of AsyncDrop over Drop.

EDIT: Oops, I mean things like FuturesUnordered, not try_join. Other examples would be a theoretical timeout combinator.

It’s unclear to me how FuturesUnordered would best handle asynchronous dropping, because it doesn’t have access to an API for spawning new tasks.

I don’t think it was mentioned anywhere that it was “bad”. What was mentioned was that a certain number of tradeoffs exist in the design (which happens for most designs). Exceptions in C++ are also not universally good or bad. People will weigh the pros and cons differently. For a lot of people the universal cancellation might outweigh everything. However, there was also a discussion in the Discord channel that a lot of very low-level operations (e.g. access to DMA engines, GPUs, NICs, hypervisors) resemble the completion model and are not cancellable. People who work in those areas and want to use coroutines/async-await for them might find the model not ideal.

“People do it wrong” is rarely a good argument. Of course people make errors. The job of good tools is to minimize the possibility of making an error, and the consequences, as much as possible. That’s a key design goal of Rust.

I am not sure whether I understand how it would work or help: it would move the Future to another task and execute the code there? Then

  • It doesn’t seem to help the completion-based API use-case, since the original async call / Future (and potentially its parents) would still be immediately dropped, and thereby any references inside that Future might already be dangling. Continuing to do something with the Future would be unsafe.
  • It doesn’t seem to help users who forgot to add a Drop implementation for doing cleanup on cancellation. Like Drop, the AsyncDrop implementation would also have to be added manually. If the goal is just to run a task on cancellation, then a combination of defer/scope_guard and task::spawn could do the same:
    async fn xyz() {
        let mut is_completed = false;
        defer {
            if !is_completed {
                // ...
            }
        }
        sub_fun().await; // Can be cancelled
        is_completed = true;
    }
  • It would probably not work due to lifetime issues: drop(&mut self) would reference the dropped object on a new task - however, its original location will likely be on the old task, probably on its “virtual” stack, and this location is already gone. Typically we also can’t move the object away, since it would be pinned.

If AsyncDrop would force the asynchronous destructor to run to completion directly inside the original task instead of in a new one, things would be different. However, that would mean each drop(future) is an await point, and would actually need to be drop(future).await. And dropping inside a combinator would require the same. The outcome seems to be lots of additional await points that force everything to no longer be synchronous. That would be very similar to having a different type of Future which needs to be run to completion, only that this time it’s encoded in the destructor instead of in the type.


Yes this was in response to the general conversation being had in this and other venues, not your particular gist (which is quite thorough and fair).

(And I also have no response to the comments about completion-based APIs, which I haven’t personally spent much time thinking about at all.)

Isn’t this exactly the same issue that scoped thread spawning used to have?

In that API, the problem is that a thread could continue running and still access data outside its lifetime, while here it’s that an I/O request running in parallel (in hardware or in the OS) could do the same.

Even if you add AsyncDrop, there is no way to guarantee that it will be called instead of having the future forgotten, so it’s memory-unsafe.

The only fix I see is the same as for the scoped thread spawning problem, i.e. providing an async function that takes a closure receiving an “I/O scope” that lets you create completion-based futures.


I can see the reason why some futures might not want to be cancellable. Actually, this one feels like one of the cases:

However, adding an async destructor (an in-compiler magical trait) seems like a heavy solution for something that feels a bit niche to me. Would some kind of workaround instead be good enough for such a use case? I don’t know, but maybe spawn could have a way to spawn the future separately and give back a handle that would propagate the result, but would not propagate the drop in the other direction?