Surprising soundness trouble around `PollFn`

RalfJung · October 4, 2022, 11:09am

This Zulip thread led me down an interesting rabbit hole... here's some interesting async code that causes a use-after-free:

$ ./miri run async2.rs --edition 2021 -Zmiri-disable-stacked-borrows -Zmiri-disable-validation
error: Undefined Behavior: pointer to alloc872 was dereferenced after this allocation got freed
  --> async2.rs:39:15
   |
39 |     let _ = { *problematic_variable };
   |               ^^^^^^^^^^^^^^^^^^^^^ pointer to alloc872 was dereferenced after this allocation got freed
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
   = note: BACKTRACE:
   = note: inside closure at async2.rs:39:15
   = note: inside `<std::future::from_generator::GenFuture<[static generator@async2.rs:32:20: 40:2]> as std::future::Future>::poll` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/library/core/src/future/mod.rs:91:19
note: inside closure at async2.rs:14:9
  --> async2.rs:14:9
   |
14 |         unsafe { Pin::new_unchecked(&mut future) }.poll(cx)
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   = note: inside `<std::future::PollFn<[closure@async2.rs:13:47: 13:56]> as std::future::Future>::poll` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/library/core/src/future/poll_fn.rs:61:9
   = note: inside `<std::boxed::Box<std::future::PollFn<[closure@async2.rs:13:47: 13:56]>> as std::future::Future>::poll` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2074:9
note: inside `main` at async2.rs:29:13
  --> async2.rs:29:13
   |
29 |     let _ = Pin::new(&mut pinned2).as_mut().poll(&mut cx);
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The code does contain an unsafe block, but that block looks fine at first sight:

    let mut future = trouble();

    let mut pinned = Box::pin(future::poll_fn(move |cx| {
        unsafe { Pin::new_unchecked(&mut future) }.poll(cx)
    }));

future is not moved, so we can just re-create a pinned reference to it each time.

Except, unfortunately, later the future is being moved. This is possible because PollFn<F> is always Unpin, even if F: !Unpin. And the move in the closure means that future actually lives inside the closure environment, so when we move the PollFn, we invalidate a self-referential generator.

This causes trouble even without actually moving the PollFn -- the example in this Miri bug report violates aliasing rules, since as part of the PollFn wrapper, mutable references are being created, and due to PollFn: Unpin, those are truly unique mutable references, thus the fact that they alias with the self-referential generator causes Stacked Borrows UB.

I have not found a way to cause UB without unsafe, so there is strictly speaking not a bug here -- but I feel like this is a big footgun, so at least the poll_fn docs should be quite upfront about the fact that PollFn is always movable, so you better be careful about what you do in that closure. Maybe it would have been better to not make PollFn unconditionally Unpin; I think the regular auto trait impl for Unpin would fix the aliasing violation (since it would propagate the aliasing rule exemption we have for self-referential generators) and would also stop the use-after-free example from compiling.

In fact maybe Unpin should have been an unsafe trait after all...

jack · October 4, 2022, 12:35pm

The snippet in the linked bug report is reduced from the expansion of tokio::join/futures::join, so programmers can run into this aliasing issue without (to their knowledge) writing unsafe at all; e.g.:

use std::{future::Future, task::Poll};

#[tokio::main]
async fn main() {
    futures::join!(yield_now(), trouble());
}

async fn trouble() {
    let lucky_number = 42;
    let problematic_variable = &lucky_number;

    yield_now().await;

    // problematic dereference
    let _ = { *problematic_variable };
}

fn yield_now() -> impl Future<Output = ()> {
    let mut yielded = false;
    std::future::poll_fn(move |cx| {
        if core::mem::replace(&mut yielded, true) {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    })
}

RalfJung · October 4, 2022, 1:11pm

Looks like they have the same poll_fn with the same Unpin instance. They don't actually move the closure though and users cannot directly get hold of it, so a use-after-free is probably impossible with join.

If we had a proper handle on the aliasing of self-referential generators, this would not be a problem: wrapper types cannot 'opt out' of UnsafeCell, and similarly if self-referential generators internally used an UnsafeAlias kind of a construct that allows mutable reference aliasing, then PollFn<...> would still behave as expected since its field would still be wrapped in UnsafeAlias.

But for now we are using the Unpin trait to control aliasing (both in the LLVM codegen backend and in Miri), and that means the fact that PollFn is always Unpin is relevant -- and ideally futures::join would be changed to not re-pin the future inside a movable closure.

matklad · October 4, 2022, 9:00pm

Can we just fix this as bug? poll_fn is fresh, so I wouldn't expect that to break too much code in the wild, and we allow breaking bug fixes.

Or am I wrong that changing

impl<F> Unpin for PollFn<F> {}

to

impl<F: Unpin> Unpin for PollFn<F> {}

would fully solve the issue?

afetisov · October 5, 2022, 12:57am

This isn't in any way specific to poll_fn. The bug happens because pin projection doesn't give the required guarantees in this case. Note that the bug isn't even in any way specific to async Rust, since the same closures and pinning may be used for self-referential types in any place.

The projection PollFn<F> -> F isn't structural, so the closure isn't pinned.
If the closure was pinned, then it wouldn't be a closure, since Pin<&mut F>: !FnMut. Looks like we should really have a separate concept of "pinned closure" which would implement

trait FnPinnedMut<Args> {
    type Output;
    fn call_pinned_mut(Pin<&mut Self>, args: Args) -> Self::Output;
}

However, this also wouldn't actually solve anything, because projection of Pin<&mut F> to its fields (i.e. captures of the closure) cannot be implemented syntactically in the current Rust. We really would need something like the field projection RFC or its equivalent to solve this.
In principle, the compiler could autogenerate the required projections, like an implicitly applied #[pinned_project]. There is the question which fields should be projected structurally, though. I think the safe bet would be to project Unpin fields non-structurally, and !Unpin fields structurally. I'm not sure if it's possible to do currently, though, since macro expansion runs before trait resolution, and pin projection cannot be implemented via the current trait machinery.
Again, I would reiterate that there is nothing specific to poll_fn in this issue. For example, nothing at the language level stops me from writing

let future = fut();
let mut f: impl FnMut() -> Poll = move || {
    if rand::random::<bool>() {
        unsafe { Pin::new_unchecked(&mut future) }.poll(cx)
    } else {
        mem::replace(&mut future, pending());
        Poll::Pending
    }
};
f();
f(); // BOOM

Superficially, it looks legitimate. The only proper way to solve the issue is to take capture pinning in your own hands, and declare let future = Box::pin(fut()); before moving it inside the closure.

I think we should have a Clippy deny-by-default lint which forbids pinning the closure's captures on the stack. It's possible to imagine code which does it soundly, but it's much more likely to be an error like in the OP. One case where pinning the captures could be sound is FnOnce, since the pin will certainly last until the end of closure, and it will never be called again.
Also, perhaps we should have an "useless pinning" Clippy lint, which would warn about Box::pin(poll_fn(..)). Since PollFn<F>: Unpin, it's likely that the user falsely believes that they have pinned some inner field, when that is not the case. Unfortunately the pinned form may be required due to some function signature, which makes it hard to avoid false positives.

RalfJung · October 5, 2022, 9:30am

Yeah, I think that would fix this bug. But while poll_fn in the stdlib is fresh, its version in futures is old.

I agree the issue comes up any time you mix FnMut and pinning, but poll_fn could shield the user from the problem instead of forwarding the problem.

Your code much more obviously moves the future via replace, so I would disagree with the statement that this looks superficially legitimate.

pcpthm · October 5, 2022, 10:30am

The alternative way of creating a struct, and think about whether some field is structurally pinned, feels too heavy for an ad-hoc future only used in one place. I imagine poll_fn is used because of that reason.

With RFC 2781, one can easily write an ad-hoc future without any unsafe, like:

let future = fut();
let f: impl FnPin(&mut Context) -> Poll<()> = move |ctx| {
    pin_mut!(future);
    loop {
        yield future.poll(ctx);
    }
};
let f: impl Future = poll_fn_pin(f);

RalfJung · October 5, 2022, 12:10pm

The example that was brought to Miri can already be written without unsafe using stack pinning:

fn main() {
    let mut future = pin!(trouble());

    let mut pinned = Box::pin(future::poll_fn(move |cx| {
        future.as_mut().poll(cx)
    }));

    let waker = futures::task::noop_waker();
    let mut cx = Context::from_waker(&waker);

    let _ = pinned.as_mut().poll(&mut cx);
    let _ = pinned.as_mut().poll(&mut cx);
}

This avoids moving the self-referential future into the closure, and thus the extra Unpin is not a problem.

afetisov · October 5, 2022, 12:27pm

You can't really shield from that problem since it's unrelated to poll_fn. You can cover up, so it's harder to hit, but it would also be harder to debug. In my view it's better to amplify errors, so that one hits them sooner, when they're easier to solve, and more reliably, so one can learn from the errors.

Obviously that's just a simple example. I could write a more convoluted one with the same bug, where the move would not be immediately obvious.

The unsafe is hidden in pin_mut!, and it's literally the same unsafe as in the OP. However, the generator machinery will make its usage safe, because you won't be recreating the pin on each iteration.

The root of the problem is that if you have f: fn(p: &mut T), then pinning p inside of f is fundamentally unsound. You don't own the borrowed value, and you have no idea what someone will do to it once the function returns. At the same time, the contract for Pin states that, for T: !Unpin, you must never move the referenced value once you pin it. Those requirements are incompatible, which means that at the very least f must be unsafe. The correct approach is to not do pinning of non-owned values altogether, because the API is such that it's very likely to be misused.

For the case of a closure, the &mut borrow comes from FnMut::call_mut(&mut self, args), which also can't be unsafe. This means that pinning !Unpin captures within a closure is similarly fundamentally unsound, and must be forbidden. It may work accidentally, but that's not the standard of safety which we strive for in Rust.

We probably can't make it an error in rustc, because of backwards compatibility, because it is possible to pin captures without introducing UB, under certain conditions, and because the pinning is unsafe anyway. But this should be an error lint, in Clippy if not in rustc itself.

Either the value can be safely pinned outside the closure, in which case the fix is easy, or the closure is expected to be moved to a different scope, in which case the lint saves you from an error.

I think this error is in the same category as

let ptr = CString::new("Hello").expect("CString::new failed").as_ptr();

It's technically not an issue with the language or the standard library, but it's a very easy footgun which can be linted away.

RalfJung · October 5, 2022, 12:55pm

This is backwards. There's no good reason for PollFn to be Unpin, and this doesn't amplify errors in any useful way. It silently turns code into UB.

Actually moving things is always fishy when dealing with Pin, but there is the unsafe { Pin::new_unchecked } in the closure to signal this trouble. But silently wrapping the user state in Unpin is inviting users to move the PollFn at a completely different place in the code, and that's a completely unnecessary footgun.

If you control caller and callee, you know exactly what will happen to it. That is the case in these examples.

I think you need way more evidence to make claims like this. (a) poll_fn is, as far as we know, entirely sound (you need unsafe code elsewhere misunderstanding its API to cause UB), so "fundamentally unsound" is just not true. (b) The version of poll_fn that preserves Unpin avoids both the aliasing trouble and the "far-away code accidentally moves pinned data captured in the closure" trouble.

afetisov · October 5, 2022, 3:11pm

It produces UB on simpler code, which is still better than producing it only in convoluted cases.

It has as much reason to be Unpin as any other combinator. It's sound to have PollFn<F>: Unpin, it allows to pass it as generic parameter where Unpin is expected, it means that you are free to move it between calls. Which you are, if you use a sound closure.

That's the "just don't make errors" approach. The thread started because someone didn't know exactly what happened to it. Because writing correct code is hard, and it's oh so easy to miss a move. Good thing we have the type system to prevent such errors. That is, assuming you write sound code, which the closure in the example is not.

I have repeated that several times, didn't I?

The closure is fundamentally unsound. You can't soundly pin a non-owned variable.

What evidence? That's a mathematical fact. Pinning a captured variable inside of FnMut or Fn closure is unsound, because moving a closure between calls will invalidate the pin. We need to help people avoid that unsoundness, instead of passing the buck to innocent code.

Remember that you are proposing to break an API which has been stable for two weeks, unstable for 2 years, and stable in futures even longer. You can't even say that it fixes unsoundness, because poll_fn is sound. The unsoundness is in the user's closure, and there are plenty of other places where the same unsound closure could be used.

CAD97 · October 5, 2022, 5:59pm

No, it's not.

pin_mut! moves from its expression in order to prevent it from being moved after the call to pin_mut!.

Problematic:

let mut fut = async_fn();
// captures fut, incorrectly assumes it's pinned in the captures
move || {
    let fut = Pin::new_unchecked(&mut fut);
}

Sound:

// captures &mut fut, assumes it's pinned to the creator's stack
|| {
    let mut fut = Pin::new_unchecked(&mut fut);
}
// captures fut, pins it to closure's stack
|| {
    pin_mut!(fut);
}
// equivalent
|| {
    let mut fut = fut;
    let mut fut = Pin::new_unchecked(fut);
}
// equivalent
|| {
    let mut fut = pin!(fut); // std pin! expression macro
}

The difference is that the sound closures which capture fut directly are only FnOnce.

CAD97 · October 5, 2022, 6:17pm

Given the purpose of poll_fn, I think it's reasonable to expect that the closure's captures would be pinned; this is the behavior for async futures, after all. What we currently have means this isn't the case, though.

We either should change PollFn to only be Unpin when the closure's captures are all Unpin or deprecate poll_fn and introduce unpinned_poll_fn and pinned_poll_fn to make the difference painfully obvious.^[1]

This at its core is the same issue of pinging between &mut T and Pin<&mut T> causing issues. The proper solution would be FnPinMut, so you could write e.g.

poll_fn(|cx| {
    pin_mut!(fut);
    loop {
        yield fut.poll(cx);
    }
}

and avoid having to drop from Pin<&mut PollFn<Closure>> to &mut Closure and use unsafe to repin at all.

And in the latter case, have a lint for incorrectly treating Unpin captures as pinned in a closure not syntactically immediately passed to pinned_poll_fn. ↩︎

comex · October 6, 2022, 2:31am

I agree with @afetisov and @CAD97 that the 'correct' solution is to have the closure be pinned, and have the captured variable be effectively a structurally pinned member of it.

Which makes this yet another way in which things would have ended up simpler if we had never created Pin, and instead bitten the bullet and implemented !Move. Grump.

comex · October 6, 2022, 3:45am

For what it's worth: tokio::join! and futures::join! are unsound, but std::future::join! is not. tokio::join! moves the input futures into a lambda, similar to the original example, and so does futures::join! (whose implementation is more complicated). In contrast, std::future::join pins the input futures outside the lambda and only moves Pin<&mut T> objects into the lambda.

Here is a playground link, pieced together from some of the previous snippets, that triggers a Miri error using tokio::join! and no unsafe. If you change tokio::join! to futures::join! it also triggers an error.

And... here is a playground link that actually triggers a miscompilation using futures::join!. The code should print 44, but in release mode it prints 43. This doesn't work with tokio::join! simply because it generates more complicated control flow that confuses LLVM.

RalfJung · October 6, 2022, 10:11am

@afetisov not sure if we are even talking about the same thing. What I am saying is there are two reasonable ways to resolve this:

We document explicitly in poll_fn that the closure is not pinned, which then makes it quite clear that the original code is wrong. The fact that this is not documented currently makes this somewhat of a footgun. In general, I agree that FnMut closures cannot rely on staying in the same place, but for this specific case it would be worth pointing that out again to make it painfully obvious. (Well, it's still a footgun since people might not read the docs, but this is about the best we can do without changing the API.)
We change poll_fn to impl<F: Unpin> Unpin for PollFn<F>, and document that the closure is actually pinned in place, even though the type signature cannot say so due to a lack of FnPinMut. This mostly disarms the footgun.

Yeah I agree that that is a good solution -- but @afetisov has been arguing against my position above. Or maybe we just misunderstood each other.

@comex hits again. Nice! I'm always worried when we see Stacked Borrows UB that it might be terribly unrealistic, so I am always relieved when you find an actual miscompilation.

afetisov · October 6, 2022, 10:56am

I don't have any objections to updating documentation. It's always useful to highlight important information.

I don't consider changing impl Unpin for PollFn a good solution, because it breaks an old and widely used API. That could be warranted if we really fixed some unsoundness, but there is no unsoundness in poll_fn, it doesn't fix the issue with the closure (which can easily surface in other contexts), and it doesn't even fix the error while using poll_fn, because a slightly more complex inner closure will trigger the same UB.

The proper solution is to prevent people from writing bugged closures. As it stands, we can only do it through a lint, since we don't produce compiler errors for potential unsoundness.

If a function argument is &T or &mut T and T: !Unpin and the function is safe, then pinning it is an error.
It is an error to pin a capture of type T: !Unpin in an Fn or FnMut closure.

The fix is simple: pin the value outside of the closure.

Having a concept of "pinned closure" could also be a solution, but it doesn't help the current buggy APIs, and it will take a very long time to get into the language, if ever.

RalfJung · October 6, 2022, 11:23am

That is our disagreement then. With the changed Unpin, I think one can treat the FnMut closure as if it fields were pin-projected. This does need unsafe code obviously, but it's not fundamentally different from other cases of manual pin projection, just less explicit.

In particular, this definitely does fix some of the issues with the closure -- both the Miri UB and the miscompilation should be fixed by this. So it increases the set of closures that work correctly with poll_fn. Specifically, if the closure doesn't itself move the same thing that it pins, things are fine. That is much easier to check for than "does any user of the closure move the thing that the closure pins": checking that within the closure, each captured variable is either pinned or moved is much easier than coordinating closure body and caller.

(FWIW coordinating callee and caller when they are literally defined in the same block of code is orders of magnitude away from the C++ approach of "just don't make errors", so I strongly disagree with your earlier statement here. You were putting up a strawman by considering a free-standing function with an &mut T argument, called by arbitrary unknown code. This is just not the situation we are in here.)

It would still be nice to also get help for pin-projections in closure, here we are in agreement.

RalfJung · October 6, 2022, 11:38am

I am proposing the Unpin change in poll_fn and Unpin: fix pinning by RalfJung · Pull Request #102737 · rust-lang/rust · GitHub.

Nemo157 · October 6, 2022, 11:40am

For interest, here's @comex's miscompilation using the same poll_fn structure as the OP https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=ccd5143456429c51b25ac74ceddd3735.

I feel like I have seen proposals in the past to always make closures Unpin, which would mean having PollFn inherit the unpinness wouldn't help.

Topic		Replies	Views
Safe stack pinning without macro libs	21	2247	March 13, 2023
First experience and thoughts about `Pin` API libs	31	2783	April 22, 2019
Idea: Enhance match ergonomics to match on pinned enums without unsafe language design	9	1764	May 3, 2019
Unsoundness in `Pin` language design	113	41966	April 9, 2020
Pin ergonomics	29	1478	October 28, 2024

Surprising soundness trouble around `PollFn`

Related topics