Will it ever be possible to optimize away trivial futures?

I'm just curious whether an optimization like this is possible, or whether there's some fundamental reason it can't be done in practice. Consider this code:

use futures::executor::block_on;

async fn foo_async() -> i32 {
    return 5;
}

pub fn foo() -> i32 {
    return block_on(foo_async());
}

Theoretically this could be a very small program, but if you compile it on the Rust Playground and look at the assembly or LLVM IR output, you still get a very large program. Are optimizations for this kind of thing coming, or is this just not doable?

Most of that assembly comes from block_on, which has a functioning Waker implementation. If you use noop_waker instead, it optimizes down to a single mov: Playground Link.
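For anyone who wants to try this without the futures crate, here is a minimal sketch of the same idea using only std. The hand-written no-op waker below is my own stand-in for futures' noop_waker, not its actual source, and the names noop_raw_waker and noop_waker are just what I chose here. Newer toolchains may also offer std::task::Waker::noop, which I have not used so the sketch builds on older compilers.

```rust
use std::future::Future;
use std::pin::pin;
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hand-rolled no-op waker (an assumption for illustration, not the
// futures crate's implementation). Every vtable entry does nothing.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    // The vtable is a constant static, so the compiler can see through it.
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(ptr::null(), &VTABLE)
}

fn noop_waker() -> Waker {
    // SAFETY: all vtable functions ignore the data pointer, so the
    // RawWaker contract is upheld for the null pointer used here.
    unsafe { Waker::from_raw(noop_raw_waker()) }
}

async fn foo_async() -> i32 {
    5
}

pub fn foo() -> i32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Pin the future on the stack and poll it exactly once.
    let mut fut = pin!(foo_async());
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(v) => v,
        // Only valid because foo_async contains no await points.
        Poll::Pending => unreachable!("foo_async never returns Pending"),
    }
}
```

Note that this only works as a replacement for block_on when the future is known to be ready on the first poll; a future that actually waits would hit the Pending arm.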

I think OP’s question is still relevant. The general API might indeed need to block, but a particular caller has a Future that’s already Ready. Inlining will get us to the example in the original post, and it would be great™ if the optimizer could go further.

Turning the problem around (and pushing it back towards users instead of internals), is there anything in block_on that could be implemented differently to handle that? Is it something all Wakers have to do for themselves? Given the signature of poll it’s hard to not make the Waker ahead of time, but maybe block_on should poll with a no-op waker first and then poll again with a real one?
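To make the "poll with a no-op waker first" idea concrete, here is a rough sketch built on std only. It is not how futures' block_on is written; block_on_fast, ThreadWaker and YieldOnce are hypothetical names I made up, and the slow path is a simple park/unpark loop rather than a real executor.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::ptr;
use std::sync::Arc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Wake, Waker};
use std::thread::{self, Thread};

// Hand-rolled no-op waker (illustrative, not the futures crate's code).
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(ptr::null(), &VTABLE)
}

// "Real" waker for the slow path: unparks the blocked thread.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Hypothetical block_on variant: one cheap poll first, real waker second.
pub fn block_on_fast<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);

    // Fast path: no allocation, nothing opaque for the optimizer.
    // SAFETY: the vtable functions never touch the data pointer.
    let noop = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut noop_cx = Context::from_waker(&noop);
    if let Poll::Ready(v) = fut.as_mut().poll(&mut noop_cx) {
        return v;
    }

    // Slow path: poll again right away so the future registers the real
    // waker, then park until woken. Spurious wakeups just cause a re-poll.
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
        thread::park();
    }
}

// Test helper: returns Pending once, then Ready(7).
pub struct YieldOnce(pub bool);

impl Future for YieldOnce {
    type Output = i32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        if self.0 {
            Poll::Ready(7)
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```

The cost of this design is that a future which is not ready gets polled one extra time, and it relies on the future re-registering the latest waker on every poll, which well-behaved futures are expected to do.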

I'm not surprised that LLVM struggles to optimize async completely away; the waker API necessarily involves going through an erased vtable, so inlining requires devirtualization.

Ultimately, it's very likely that when you run a future it will involve at least one wait, so optimizing for const propagation is pretty unnecessary in the general case.

If you look for declare in the LLVM IR, you'll see these futures_* functions:

<futures_executor::local_pool::ThreadNotify as futures_task::arc_wake::ArcWake>::wake_by_ref
futures_task::waker_ref::WakerRef::new_unowned
futures_executor::enter::enter
<futures_task::waker_ref::WakerRef as core::ops::deref::Deref>::deref
<futures_executor::enter::Enter as core::ops::drop::Drop>::drop
<futures_executor::enter::EnterError as core::fmt::Debug>::fmt

All of these functions are being imported from already compiled machine code in upstream crates, and without turning on LTO (or changing the code to use #[inline]) they can't be optimized away.

Some of them make sense to not get cross-crate codegen (like the fmt::Debug impl for the error), but not most of them.
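As a rough illustration of what "changing the code to use #[inline]" means for a library author, here is a sketch with a made-up wrapper type. Handle and read_through_handle are hypothetical and are not taken from futures_task; the point is only where the attribute goes on small, non-generic public functions so that downstream crates can see their bodies.

```rust
use std::ops::Deref;

/// Hypothetical stand-in for a small wrapper type exported by a library crate.
pub struct Handle<'a> {
    value: &'a i32,
}

impl<'a> Handle<'a> {
    // Non-generic, so the attribute is what makes the body available
    // for inlining into other crates without LTO.
    #[inline]
    pub fn new(value: &'a i32) -> Self {
        Handle { value }
    }
}

impl Deref for Handle<'_> {
    type Target = i32;

    #[inline]
    fn deref(&self) -> &i32 {
        self.value
    }
}

/// Imagine this function living in a downstream crate: with the attributes
/// above, the wrapper can be inlined and reduced to a plain load and add.
pub fn read_through_handle(x: &i32) -> i32 {
    *Handle::new(x) + 1
}
```

Within a single crate the optimizer can already see these bodies, so the difference only shows up when the wrapper and its caller are compiled separately.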

Looking through though, seems like #[inline] is needed for futures_task::waker_ref::WakerRef:

While the rest could use it, they seem to be all leaves (i.e. likely not blocking optimizations):


Rust doesn't have any closed-world assumptions (at least yet AFAIK) wrt virtual calls, so devirtualization is only const-prop (if we include "constant-folding loads from constant globals" under the same umbrella).

And the only things that can block const-prop are either true dynamism (e.g. switching executors at runtime) or hidden static knowledge. Always look for the latter, because it's much more common to forget an #[inline] (and/or mislead LLVM into thinking some value can mutate, though I don't see that here), especially when there's no hint of true dynamism from the source code.

Out of curiosity I checked the rust-lang/futures-rs repo and saw:

Thanks, @xfix, that was quick!
