I realize Rust has many benefits thanks to its borrow checker, but I hope it's clear that the true power Rust brings is async/await.
As you probably know, the way this is done in C++ is terrible (usually with hacky callbacks); I have a whole blurb about this @ Evolution of asynchronous programming - Balanced Thoughts.
Rust's future state machine, which allows allocation on the stack, is next-level - it can make async programming in embedded systems magical
Unfortunately, there are a couple of small things in future state machine creation that result in binary bloat (especially important in embedded programming). Most of this is around APIs of the form "we made this API async just in case you need to wait on something, but often you don't and the code is simply non-blocking"; Rust does poorly at optimizing that latter case.
Here are the two big examples of this that I came across in my short time coding in Rust (my hope is to make Matter, the smart home standard, shift to using Rust):
rust-lang/rust issue, opened 24 Jul 2019:
Generator optimization in #60187 by @tmandry reused generator locals, but arguments are still duplicated whenever they are used across yield points.
For example ([playground](https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=af3e5e9233c45d6397c7b4b3e671f092)):
```rust
// async/await is stable on modern Rust; the original issue predates
// stabilization and used `#![feature(async_await)]`.
async fn wait() {}

async fn test(arg: [u8; 8192]) {
    wait().await;
    drop(arg);
}

fn main() {
    println!("{}", std::mem::size_of_val(&test([0; 8192])));
}
```
Expected: 8200
Actual: __16392__
When futures are passed in as arguments, the overall future size can grow exponentially ([playground](https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=02a3e685de2b2bf868733e40e56e192b)):
```rust
// (Originally gated on `#![feature(async_await)]`; async/await is now stable.)
async fn test(_arg: [u8; 8192]) {}

async fn use_future(fut: impl std::future::Future<Output = ()>) {
    fut.await
}

fn main() {
    println!(
        "{}",
        std::mem::size_of_val(&use_future(use_future(use_future(use_future(use_future(
            use_future(use_future(use_future(use_future(use_future(test(
                [0; 8192]
            ))))))
        ))))))
    );
}
```
Expected: 8236
Actual: __8396796__
I didn't find any existing note on this, but given how commonly arguments are used, I think it would be useful to include them in the optimization.
rust-lang/rust issue, opened 04 Feb 2026:
Trying to get `core::future::Ready<>` handling to be more optimal (this is useful when supplying an async API that is smart enough, at compile time, not to take the extra overhead if the API isn't actually async).
_Originally posted by @gmarcosb in [#62958](https://github.com/rust-lang/rust/issues/62958#issuecomment-3807931515)_
I tried to optimize with the following tricks in the hope that it would work (see [full usage in commit](https://github.com/project-chip/rs-matter/pull/366/changes/11ab6a1219848812f48d7d6ad818baec2d77280f)); unfortunately, even in this straightforward case, the Rust compiler still generates all the "cruft" for the state machine:
```rust
// Note: the `default fn` below relies on (unstable) specialization and
// therefore requires a nightly compiler.
#[inline(always)]
pub const fn extract_ready_check<const B: bool>(_: ReadyCheck<B>) -> bool {
    B
}

#[macro_export]
macro_rules! process_maybe_async {
    ($source:expr) => {
        match $source {
            fut => {
                #[allow(unused_imports)]
                // use $crate::dm::{MaybeReady, IsReady, IsNotReady, NotReadyFallback};
                use $crate::dm::{MaybeReady, IsReady, IsNotReady};
                let is_ready: bool = $crate::dm::extract_ready_check((&&fut).get_check());
                // even with a check for `is_ready` vs `true`, there's still binary size bloat!
                if true {
                    fut.get_ready()
                } else {
                    fut.await
                }
            }
        }
    };
}

pub struct ReadyCheck<const B: bool>;

pub trait MaybeReady: core::future::Future {
    fn get_ready(self) -> Self::Output;
}

// 1. The general case: any Future
impl<F: core::future::Future> MaybeReady for F {
    default fn get_ready(self) -> Self::Output {
        const {
            panic!("This future is not a Ready<T> type!");
        }
    }
}

// 2. The specialized case: specifically Ready<T>
impl<T> MaybeReady for core::future::Ready<T> {
    fn get_ready(self) -> T {
        self.into_inner()
    }
}

pub trait IsReady<T> {
    #[inline(always)]
    fn get_check(&self) -> ReadyCheck<true> {
        ReadyCheck
    }
}
impl<T> IsReady<T> for &&core::future::Ready<T> {}

pub trait IsNotReady<T> {
    fn get_check(&self) -> ReadyCheck<false> {
        ReadyCheck
    }
}
impl<T, F: core::future::Future<Output = T>> IsNotReady<T> for &F {}
```
I then have:
```rust
#[inline(always)]
fn fn_1(o: &SomeObject) -> core::future::Ready<String> {
    core::future::ready(o.do_stuff())
}

#[inline(always)]
fn fn_2(o: &SomeObject, s: String) -> core::future::Ready<String> {
    core::future::ready(o.do_other(s))
}
```
I would expect **A**:
```rust
async fn my_example_a(o: &SomeObject) -> String {
    let s = process_maybe_async!(fn_1(o));
    process_maybe_async!(fn_2(o, s))
}
```
To result in the **exact same binary** as **B**:
```rust
async fn my_example_b(o: &SomeObject) -> String {
    let s = o.do_stuff();
    o.do_other(s)
}
```
But it doesn't; instead, it results in roughly the same binary bloat (surprisingly, even more) as **C**:
```rust
async fn my_example_c(o: &SomeObject) -> String {
    let s = fn_1(o).await;
    fn_2(o, s).await
}
```
See the full compiling code [in the playground's ASM output](https://play.rust-lang.org/?version=nightly&mode=release&edition=2024&gist=868616cb1526de786b7f47c81899a115), where the binary delta between the 3 methods is clear.
Of course it would be nice if A, B, and C all resulted in the exact same compilation; but at a minimum, with all the hints given, A and B should result in the same binary, and preferably with `if is_ready {` as well rather than just `if true {`.
I've also brought this up on the Rust internals Discourse: https://internals.rust-lang.org/t/async-await-optimizations-could-make-language-even-more-powerful/23973
This has been brought up previously, too:
I'm just curious if an optimization like this is possible, or if there's some reason why it's theoretically not practical. Consider this code:
```rust
use futures::executor::block_on;

async fn foo_async() -> i32 {
    return 5;
}

pub fn foo() -> i32 {
    return block_on(foo_async());
}
```
Theoretically this could be a very small program, but if you compile this on Rust playground and look at the assembly or llvm-ir output, you still get a very large program. Are optimizations for this type of thing co…
josh replied (February 4, 2026):
Those are definitely optimizations we'd want to see happen, if someone is up for working on them.
This is a problem that we have been eyeing for a solution. #135527 could be a way to alleviate some of the pain. We have been thinking about cooperating with the codegen backend(s) better, so that better code can be emitted when the backend understands a coroutine dialect.
For that reason, we are proposing a project goal to enable a survey and experimentation.
I've touched base with @dingxiangfei2009 (we're colleagues) and we're going to see if I can't help him make some magic and get these optimized; stay tuned!