Can we reduce the burden of cancel-correctness for async Futures?

In an async context, each await point introduces a suspend point of the async routine, at which point the routine can be a) resumed (happy path) or b) cancelled/dropped (“surprising” path).

Consideration of not holding locks (and similar) is easily explained as that async is long-running. As well as any other long-running sync call that doesn’t depend on the lock, you shouldn’t be holding the lock.

The cancellation problem is harder, though. The closest sync equivalent is probably panic-correctness (which is just as important to consider! but often forgotten as the “crash path”).

Having to worry about it is even more common as well, as async code implies synchronization of progress of some amount, so manipulation of synchronization primitives is more likely in async contexts.

First: can this create issues with purely safe code (while also being useful; of course you could just spinloop a CAS anyway)? I think anything manipulating primitives in a useful manner will require being within an unsafe barrier, which should suggest the need for greater scrutiny to begin with.

Second: is there much we can do beyond making it clear in intermediate level teaching that await introduces a cancellation edge? It seems similar to ? in this fashion, but in the case of ? you’re getting the exit edge by request, by await you’re getting it as an extra tag along from the primary effect.

10 Likes

I’m coming from C#, no practical Rust experience here. I just searched out this post after reading yours: http://www.randomhacks.net/2019/03/09/in-nightly-rust-await-may-never-return/

Based on that it seems like locks using RAII guards, for example, wouldn’t fall victim to the cancellation issue… is that correct? I’ve seen mutexes mentioned a few times in relation to the cancellation issue on here, but Rust uses RAII guards for those, so I’m not really sure.

5 Likes

Yes, the correct way to handle this is RAII guards, the same as panic safety.

You need to make sure that all cleanup is done in Drop implementations, and that the Drop run at any given point correctly cleans up the current state.

This applies equally for panic safety in sync code as it does for both panic safety and cancellation correctness in async code, it’s just very easy to pretend that panic safety doesn’t matter (though this is very incorrect!), and cancellation correctness is a normal expected condition for futures.

4 Likes

Crazy proposal: when an async fn is canceled, make the drops of its local variables act like panic drops, where Mutexes and other lock-like types get poisoned. This could also be extended to things like RefCell.

(This might be nontrivial to implement, since the poisoning mechanism currently relies on calling std::thread::panicking().)

3 Likes

What would you do if you’ve properly handled your resources properly and you don’t want the poisoning/panic behavior on cancellation (but do on a panic)?

Poisoning mutexes works because it’s effectively propogating the panicked thread’s panic to the other threads synchronizing with it via the mutex. It feels wrong somehow to propogate a panic for an expected condition.

Then again, the logic of poisoning (the value may be in an unsafe state) is still valid for unplanned cancellation, and the panicking check was for “non end of scope” drops in the first place so… I guess I’m undecided.

1 Like

It’s a good topic to talk about! I also started doing a writeup about this topic, since it’s one of the most peculiar things about Rusts async/await support.

It’s pretty obvious that RAII types avoid potential pitfalls - and they are already often used to perform the necessary cleanup work. E.g. for the async Mutex type provided by futures-rs or the async Mutex and Semaphore types in my futures-intrusive library everything is safe. However the challenge is more along the lines that users didn’t actually know/expect that a RAII type is needed there because the exit path was invisible. Maybe that can be solved with a substantial amount of documentation.

Maybe there are also good tooling solutions. But in general the invisible returns might make things hard for tools too. E.g. code coverage tools will typically not show that the cancellation path had never been tested, since it’s on the same line as the normal code.

For people who are already aware that additional cancellation/cleanup logic is required, having additional language support for that might be useful. E.g. in the spirit of Go’s defer or some async finalizers that run after an async fn had been cancelled. Those all sound not great because the constructs are not required in synchronous Rust code. However e.g. a library-based ScopeGuard won’t work for asynchronous functions (since the scope-guard borrowing the same values as the potentially still executing async fn will be rejected by the borrow-checker), and manually writing RAII wrappers each time a cleanup is required is just a lot of overhead.

Finally, there is also a “solution” to redefine the Future and async fn semantics to always run-to-completion and thereby to avoid the pitfalls and to better resemble the behavior of synchronous methods. However I’m not sure how many people would really be interested in still exploring a path that different.

6 Likes

At least branch coverage should be aware of this, but even still, cancellation is difficult to test. (Anything IO is difficult to test to start with.) I suppose you can always just .poll the correct number of times with a mocked IO provider that always is pending the first try, blocking the second, but it's still not easy.

What's the problem here? I'm fairly certain that something like the following should work (assuming that the scope_guard function functions like the others and is possible to make available):

async fn foo() {
    let s = String::from("hello, async");
    let _scope = scope_guard::on_cancellation(|| { dbg!(&s); });
    await!(aio::print(&s));
    drop(_scope); // automatic
}

(Of course, this example is silly, but example.)

The whole point of async fn (and async { }) is enabling this kind of cross-await-point borrowing.


Maybe the answer is to introduce a std::thread::is_panicking alternative for async cancellation? So you have a std::task::is_cancelling magic function that returns false in a sync context and in an async context returns whether the code is currently dropping a Future.

This would probably require setting a thread local whenever a future is being cancel dropped, though, so it's not exactly a zero cost.

Your example is bad because in most cases, you will need to get a &mut ref to do cleanup, which locks the object for as long as the scope guard is alive.

3 Likes

Yes, that’s the problem I run into. In my case I even required ownership of the object in the scope guard, since I need to give it back to the actual owner of the resource.

That is safe to do, since at the end of scope the cancelled async function obviously doesn’t require the object anymore. However the borrow-checker won’t understand that.

What would the rules around a language defer be?

(Potential weak keyword usage of defer: make it a prefix keyword that takes a closure-like argument: defer || { .. }. This avoids the issue of defer { } being ambiguous when defer can be a structure name. Using a real keyword, make it a final { .. }.)

I’m not sure exactly how this would be modeled in HIR/MIR, but I think the semantics we’d want is that the code is stuck into the “drop glue” section of the cleanup path before MIR borrowcheck is run, assuming MIR borrowcheck even looks at drop glue, so that borrowcheck can see that the mutable borrow doesn’t require a lock for the whole function, just that the place exists and isn’t aliased on drop.

2 Likes

Actually, scopeguard handles this situation.

let handle = scopeguard::guard(self, |this| cleanup(this));
let this = &mut *handle;

A scopeguard::guard wraps and derefs to a value which is given to the cleanup closure to support this case.

7 Likes

Oh, that’s clever. I hadn’t thought of moving the value inside.

1 Like

Is it possible to notify the awaiting call about the cancellation eventually? For example, return an error to the awaiting function to tell the cancellation:

let result = future.await?.other_operation();
                         ^ 
                         Cancellation is returned here and can 
                         be handled like any other ResultError
let some_other = result.await? // So this line will never be executed

You might be interested in this thread which proposes a similar syntax.


Also, if ultimately Result will be returned then it would be possible to match:

let result = match future.await {
    Some(x) => x,
    Err(cancel) => fallback(),
};

which don’t has any sense

1 Like

Don't know why. I mean, like what if I want to tell user something about it? For example:

let result = match future.await {
    Some(x) => x,
    Err(cancel) => {
        print_error!("Sorry, Operation XXX has timed out, please retry. Error:", cancel)
        ....
    },
};

Having to do it using Drop can be a bit troubling, especially when you have more than one await.

Also see a post from withoutboats Async/Await - The challenges besides syntax - Cancellation

4 Likes

Wouldn't this look like a short-circuiting or operation?

2 Likes

Ah, missed that. You are definitely correct.

I don’t know if there’s a way to make defer a contextual keyword that wouldn’t be super clunky at this point.

The (works today) solution seems to be to use the version of scopeguard that wraps an owned value and borrow from that in order to communicate the ownership to the borrow checker. I still think a first class scopeguard could be useful, but it’s not necessary.

Could we use a callbacks here?

async fn foo() {
    bar().on_drop(|| {/* cleanup */}).await;
}

Is it feasible to implement a lint against calling await while a holding a MutexGuard? It seems like that would almost always be a bad idea. However it’s not obvious how easy it would be to detect.

2 Likes