Theoretically this could be a very small program, but if you compile this on Rust playground and look at the assembly or llvm-ir output, you still get a very large program. Are optimizations for this type of thing coming, or is this just not doable?
Most of that assembly comes from block_on, which has a functioning Waker implementation. If you use noop_waker, it optimizes down to a single mov: Playground Link.
I think OP’s question is still relevant. The general API might indeed need to block, but a particular caller has a Future that’s already Ready. Inlining will get us to the example in the original post, and it would be great™ if the optimizer could go further.
Turning the problem around (and pushing it back towards users instead of internals), is there anything in block_on that could be implemented differently to handle that? Is it something all Wakers have to do for themselves? Given the signature of poll it’s hard to not make the Waker ahead of time, but maybe block_on should poll with a no-op waker first and then poll again with a real one?
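To make the "poll with a no-op waker first" idea concrete, here is a minimal std-only sketch. It is not how futures' block_on is implemented; block_on_fast_path, ThreadWaker and YieldOnce are hypothetical names made up for illustration, and the slow path is a simple park/unpark loop rather than the real executor machinery.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Wake, Waker};
use std::thread::{self, Thread};

// Hand-rolled no-op waker: every vtable entry is a visible, trivial function.
fn noop_raw_waker() -> RawWaker {
    fn no_op(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, no_op, no_op, no_op);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn noop_waker() -> Waker {
    // SAFETY: the vtable functions never touch the (null) data pointer.
    unsafe { Waker::from_raw(noop_raw_waker()) }
}

// Hypothetical "real" waker for the slow path: unparks the blocked thread.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Hypothetical block_on variant: one poll with a no-op waker, then fall back.
fn block_on_fast_path<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);

    // Fast path: an already-ready future never sees the real waker.
    {
        let noop = noop_waker();
        let mut cx = Context::from_waker(&noop);
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }

    // Slow path: poll again right away with a real waker, so a wake-up
    // delivered to the no-op waker in between cannot be lost.
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(), // spurious wake-ups just re-poll
        }
    }
}

// Hypothetical test future: Pending on the first poll, Ready(7) on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.0 {
            Poll::Ready(7)
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```

Calling block_on_fast_path(async { 42 }) returns on the fast path, while block_on_fast_path(YieldOnce(false)) falls through to the real waker. The cost is one extra poll for every future that is not immediately ready, so whether this is a net win is an open question rather than something the sketch settles.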
I'm not surprised that LLVM struggles to optimize async completely away; the waker API necessarily involves going through an erased vtable, so inlining requires devirtualization.
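For anyone who hasn't looked at what "erased vtable" means here, this is roughly what a hand-written Waker looks like using only std. The names (WAKE_COUNT, COUNTING_VTABLE, counting_waker) are made up for the example. Every wake, clone, and drop on the resulting Waker is an indirect call through a function pointer loaded from the vtable, so the optimizer can only inline it if it can prove which vtable is behind the pointer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{RawWaker, RawWakerVTable, Waker};

// Global counter so the effect of a wake is observable.
static WAKE_COUNT: AtomicUsize = AtomicUsize::new(0);

fn counting_clone(data: *const ()) -> RawWaker {
    RawWaker::new(data, &COUNTING_VTABLE)
}

fn counting_wake(_data: *const ()) {
    WAKE_COUNT.fetch_add(1, Ordering::SeqCst);
}

fn counting_drop(_data: *const ()) {}

// The type-erased vtable: four plain function pointers
// (clone, wake, wake_by_ref, drop).
static COUNTING_VTABLE: RawWakerVTable =
    RawWakerVTable::new(counting_clone, counting_wake, counting_wake, counting_drop);

fn counting_waker() -> Waker {
    // SAFETY: none of the vtable functions dereference the data pointer.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &COUNTING_VTABLE)) }
}
```

When counting_waker() is visible in the same codegen unit as the poll call, LLVM can in principle see the constant vtable and fold the indirect calls; once the Waker is produced behind an opaque function, that knowledge is gone.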
Ultimately, it's very likely that when you run a future it will involve at least one wait, so optimizing for const propagation is pretty unnecessary in the general case.
If you look for declare in the LLVM IR, you'll see these futures_* functions:
<futures_executor::local_pool::ThreadNotify as futures_task::arc_wake::ArcWake>::wake_by_ref
futures_task::waker_ref::WakerRef::new_unowned
futures_executor::enter::enter
<futures_task::waker_ref::WakerRef as core::ops::deref::Deref>::deref
<futures_executor::enter::Enter as core::ops::drop::Drop>::drop
<futures_executor::enter::EnterError as core::fmt::Debug>::fmt
All of these functions are being imported from already compiled machine code in upstream crates, and without turning on LTO (or changing the code to use #[inline]) they can't be optimized away.
Some of them make sense to not get cross-crate codegen (like the fmt::Debug impl for the error), but not most of them.
Looking through though, seems like #[inline] is needed for futures_task::waker_ref::WakerRef:
honestly, every function in that module should get #[inline], ironically the only one that already has it (waker_ref) doesn't need it because it's already generic
in particular, these block optimizations because they obfuscate access to a Waker, which contains a vtable, so any virtual calls through the Waker held in a WakerRef will be completely unknown to LLVM
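A rough illustration of the kind of change being suggested, using a made-up wrapper rather than the actual WakerRef source. BorrowedWaker, Counter, and demo_wake_through_wrapper are hypothetical names, and the wrapper is deliberately simpler than the real type. The point is only that a non-generic constructor and Deref impl marked #[inline] can be inlined into downstream crates without LTO, so the compiler gets a chance to see which Waker sits behind the wrapper.

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

/// Hypothetical non-generic wrapper around a borrowed `Waker`.
pub struct BorrowedWaker<'a>(&'a Waker);

impl<'a> BorrowedWaker<'a> {
    // Without #[inline], a non-generic function like this is typically
    // only available to other crates as an opaque call.
    #[inline]
    pub fn new(waker: &'a Waker) -> Self {
        BorrowedWaker(waker)
    }
}

impl Deref for BorrowedWaker<'_> {
    type Target = Waker;

    #[inline]
    fn deref(&self) -> &Waker {
        self.0
    }
}

// Small waker used only to show that calls go through the wrapper.
struct Counter(AtomicUsize);

impl Wake for Counter {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Wakes once through the wrapper and returns the observed count.
pub fn demo_wake_through_wrapper() -> usize {
    let counter = Arc::new(Counter(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let wrapped = BorrowedWaker::new(&waker);
    wrapped.wake_by_ref(); // resolves via Deref to Waker::wake_by_ref
    counter.0.load(Ordering::SeqCst)
}
```

Inside a single crate the attribute makes no visible difference; it only matters when BorrowedWaker lives in an upstream crate. #[inline] is also just a hint, so you'd still have to check the generated IR to confirm the calls actually disappear.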
While the rest could use it, they seem to be all leaves (i.e. likely not blocking optimizations):
sets the same thread_local! Cell<bool> back to false
while the AtomicBool or thread_local! state could in theory be important for optimizations, it can only block optimizing out branches, which is a much smaller issue compared to the devirtualization blocking
Rust doesn't have any closed-world assumptions (at least yet AFAIK) wrt virtual calls, so devirtualization is only const-prop (if we include "constant-folding loads from constant globals" under the same umbrella).
And the only things that can block const-prop are either true dynamism (e.g. switching executors at runtime) or hidden static knowledge. Always look for the latter, because it's much more common to forget an #[inline] (and/or mislead LLVM into thinking some value can mutate, though I don't see that here), especially when there's no hint of true dynamism from the source code.