Questioning the design of the "Future" trait, and the cost implied by it

Background

In other languages, asynchronous programming is usually implemented via completion callbacks, where when a sub-task finishes, the logical control flow is returned to the direct requester by calling the continuation function of the direct requester. The callback is done either directly (in which case the physical call-stack grows), or deferred by placing it in a task queue so that it can be called in the next event-loop iteration (in which case the physical call-stack "resets"). In either case, the async call-stack is never realized on the physical call-stack. This is the approach used by libuv, Boost.Asio and Javascript. Javascript's Promise is an abstraction of this model, but the underlying principle is the same.

However, Rust's Future follows a different model. When a sub-task finishes, the root task, rather than the direct requester, is awakened. The root task would have to recursively poll its sub-tasks until reaching the actual point of reentrance, realizing the async call-stack onto the physical call-stack in the process. This would incur a cost linear to the depth of the async call-stack.

Example

Here is an example demonstrating the behavior:

use futures::future::{BoxFuture, FutureExt};
use std::time::Instant;

fn recurse(n: isize) -> BoxFuture<'static, isize> {
    async move {
        tokio::task::yield_now().await;
        if n > 0 {
            1 + recurse(n - 1).await
        } else {
            0
        }
    }
    .boxed()
}

async fn run_test(n: isize) {
    let start = Instant::now();
    recurse(n).await;
    println!("{}", (Instant::now() - start).as_micros());
}

#[tokio::main]
async fn main() {
    for i in 0..500 {
        run_test(i * 10).await;
    }
}

And here is a Javascript equivalent:

const { performance } = require('perf_hooks');
const fs = require('fs');

const fd = fs.openSync('perf_node.txt', 'w');

async function recurse(n) {
    await Promise.resolve().then(() => {});
    if (n > 0) {
        return 1 + await recurse(n - 1);
    } else {
        return 0;
    }
}

async function run_test(n) {
    const start = performance.now();
    await recurse(n);
    fs.writeSync(fd, (performance.now() - start) * 1000 + '\n');
}

(async () => {
    for (let i = 0; i < 500; ++i) {
        await run_test(i * 10);
    }
})();

Performance

Wall-time comparison (unit: microseconds): perf_rust perf_node

Notice the quadratic growth in Rust's approach versus the linear growth in Javascript's approach. The Rust example can be fixed by spawning a new root task for each recursive call and immediately joining it. However, it feels like a hack and would probably incur more overhead than directly using a completion callback based Future design. Rust's current future design indeed minimizes dynamic allocation costs, but would inevitably introduce other costs like this. Any thoughts on this?

Thanks.

10 Likes

I am not very much involved with the design but from what I have read it seems to be an anti-pattern to create a lot of runtimes and then call block on.

Furthermore the general use of asynchronous seems to be a macro on top of an asynchronous main and then calling await there.

So I would guess your time is much more dependent on the creation of these runtimes whereas your JS implementation only creates one of them.

It's not related to the runtime creation. The runtime creation is just for collecting independent test samples. I edited the code to use only one runtime, updated the plot, and the result is the exactly the same. Please see the edit history of the main post for the differences.

1 Like

What happens if you use a different executor, e.g. executors?

An interesting language to compare with would be Kotlin. IIRC, they avoid callbacks as well (or maybe just do less callbacks?).

In this particular benchmark, I am wondering if the effects of boxing actually dominate recursive dispatch in polling? Typically, the dispatch to the appropriate future doesn't invoke a virtual call and might be optimized? It seem plausible that the compile can optimize a series of nested switches to just one switch.

1 Like

I also wonder if the yield is impacting the measurement. In JS IIRC, awaiting a promise that immediately resolves doesn't hit the executor at all, and is done entirely straightline. With tokio, an async yield will always actually yield for one turn of the async executor.

Even a 1000 frame deep async callstack seems pretty over the top to me, and the graph makes it hard to see the details at those more reasonable sizes.

This is also causing a lot of allocations, the deepest allocation-less variant I managed to compile on the playground was 50 frames deep, which gave a runtime of 17µs, compared to 142µs for the allocating variant. The JS code has the benefit of being able to JIT compile and potentially eliminate a lot of the intermediate allocations involved.

The fact that compiling async stacks over 50 frames deep is slow enough to not work on the playground really shows that a 5000 frame stack isn't really an appropriate benchmark.

Though, building a 1000 frame deep variant on my local machine seems to show the time scaling worse than the allocating version for some reason past 100 frames :thinking:

depth macro alloc
10:      1    2
20:      2    3
30:      2    3
40:      2    4
50:      5    7
60:      7   11
70:      7   16
80:     11   14
90:     13   28
100:    16   20

100:    23   26
200:    85   67
300:   199  133
400:   351  257
500:   557  389
600:   840  560
700:  1194  774
800:  1600  987
900:  2092 1283
1000: 2770 1555
1 Like

LLVM probably doesn't know how to efficiently compile state machines that large, so it comes up with some wacky codegen.

It seems to me this could equally well be taken as additional evidence for OP's point that the async design has problems. Why shouldn't I be able to use deeply nested recursion together with async, if that's the most natural way to express the algorithm I want to implement?

2 Likes

Even in non-async code you can't always use recursion. It's just a known technical limitation of the chosen design. It doesn't mean the design is wrong in general.

Rust's futures need to have shallow call stacks, ideally with a statically known depth. This is often an acceptable limitation, bit it is particularly unfriendly to deep recursion. Rust's design has its advantages too: non-recursive futures require only a single allocation for the whole call graph, rather than a new allocation for each async call.

5 Likes

Here is a more sensible example that doesn't explicitly use recursion. In this example, we chain a large amount of futures together using the "then" combinator provided by the futures crate. Again, Rust is quadratic while Javascript is linear. Omitting the plots as they look the same as in the OP.

Rust example:

use futures::future::{BoxFuture, FutureExt};
use std::time::Instant;

fn chain_many(n: isize) -> BoxFuture<'static, isize> {
    let mut future = async { 0 }.boxed();
    for _ in 0..n {
        future = future.then(|x| async move {
            tokio::task::yield_now().await;
            x + 1
        }).boxed()
    }
    future
}

async fn run_test(n: isize) {
    let start = Instant::now();
    chain_many(n).await;
    println!("{}", (Instant::now() - start).as_micros());
}

#[tokio::main]
async fn main() {
    for i in 0..500 {
        run_test(i * 7).await;
    }
}

Javascript equivalent:

const { performance } = require('perf_hooks');
const fs = require('fs');

const fd = fs.openSync('perf_node_chain.txt', 'w');

function chain_many(n) {
    let future = Promise.resolve(0);
    for (let i = 0; i < n; ++i) {
        future = future.then(async (x) => {
            // "then" does a yield in Javascript.
            return x + 1;
        });
    }
    return future;
}

async function run_test(n) {
    const start = performance.now();
    await chain_many(n);
    fs.writeSync(fd, (performance.now() - start) * 1000 + '\n');
}

(async () => {
    for (let i = 0; i < 500; ++i) {
        await run_test(i * 7);
    }
})();
4 Likes

This test is useful because it allows us to separate the allocation of the future from the running, to see which is the cost. What does the graph look like in that case?

async fn run_test(n: isize) {
    let fut = chain_many(n);
    let start = Instant::now();
    fut.await;
    println!("{}", (Instant::now() - start).as_micros());
}

And no, no work is done until .await is called. Rust futures are "lazy" in that they don't do any work until you start polling (.await) them, unlike JS's (which act like threads; start running when created, and awaiting is just waiting for it to finish).

I suspect that this is the actual cause of the discrepancy. tokio::task::yield_now().await will always jump back to the tokio event loop for a turn. You say that then causes a yield to the event loop in JS, but I don't know that for sure, and it's quite possible that the JIT sees through your shenanigans and is making (parts of) your test straightline and not yield.

Calling and immediately awaiting a no-await async function in JS is (allowed to be) implemented as straightline code that doesn't yield to the executor. My gut says the JS runtime is "cheating" in a similar way here. But even if it isn't, do note that JS executors have been heavily tuned to JS promises' semantics, and tokio is still quite young in comparison. The quadratic blowup could be a tokio issue, not a Rust issue. I'm not sure -- it sure seems like this should be linear -- but I'm more willing to blame Tokio or the allocator than that this is a fundamental limitation to Rust's Future semantics.


Actually, no, you're fully right. Because of the boxed indirection at every step, the optimizer can't strip out the calls to each subfuture's Future::poll method.

In the common case, you aren't boxing each future, so the step of recursing down the poll tree is optimized to just be a jump table. But every indirection point inhibits this.

A more fair comparison is to actually do what the JS runtime is doing (before JIT shenanigans). Instead of encapsulating another boxed future, spawn it on the executor directly -- Rust gives you the choice!

Going back to the original recursion:

async fn recurse(n: isize) -> isize {
    tokio::task::yield_now().await;
    if n > 0 {
        1 + tokio::spawn(recurse(n - 1)).await;
    } else {
         0
    }
} 

It's quite unfortunate that this footgun exists, and that there's no exexutor-agnostic way of not stumbling into the footgun, but this should be linear instead of quadratic.

This is exactly due to the eager/lazy semantics. The solution is to be eager and spawn the recursion onto the executor rather than encapsulating it as a BoxFuture member.

EDIT: ew that's an ugly error as written. You should get my semantic point, though: use tokio::spawn for deeply nested boxed futures.

Execution
Close
Standard Error
   Compiling playground v0.0.1 (/playground)
warning: unused import: `BoxFuture`
 --> src/main.rs:1:23
  |
1 | use futures::future::{BoxFuture, FutureExt};
  |                       ^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

error[E0391]: cycle detected when computing type of `recurse::{opaque#0}`
 --> src/main.rs:4:31
  |
4 | async fn recurse(n: isize) -> isize {
  |                               ^^^^^
  |
note: ...which requires borrow-checking `recurse`...
 --> src/main.rs:4:1
  |
4 | async fn recurse(n: isize) -> isize {
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: ...which requires processing `recurse`...
 --> src/main.rs:4:1
  |
4 | async fn recurse(n: isize) -> isize {
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: ...which requires processing MIR for `recurse`...
 --> src/main.rs:4:1
  |
4 | async fn recurse(n: isize) -> isize {
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: ...which requires unsafety-checking `recurse`...
 --> src/main.rs:4:1
  |
4 | async fn recurse(n: isize) -> isize {
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: ...which requires building MIR for `recurse`...
 --> src/main.rs:4:1
  |
4 | async fn recurse(n: isize) -> isize {
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: ...which requires type-checking `recurse`...
 --> src/main.rs:4:1
  |
4 | async fn recurse(n: isize) -> isize {
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  = note: ...which requires evaluating trait selection obligation `impl futures::Future: std::marker::Send`...
  = note: ...which again requires computing type of `recurse::{opaque#0}`, completing the cycle
note: cycle used when checking item types in top-level module

Here's a version that works:

fn recurse(n: isize) -> BoxFuture<'static, isize> {
    async move {
        tokio::task::yield_now().await;
        if n > 0 {
            1 + tokio::spawn(recurse(n - 1)).await.unwrap()
        } else {
            0
        }
    }
    .boxed()
}
4 Likes

See also: Pre-RFC: CPS transform for generators

At the time, the team working on async/await didn't consider this particular performance pitfall to be a problem. That is, the potential to do inlining (including before the generator transform), combined with common tasks having relatively shallow stacks, means they did not expect this to cause problems.

5 Likes

This is well-known: the advantage of the Rust design is that it's much more memory and cache efficient in the common shallow cases, since you only need one memory allocation for the whole future/task chain without any pointers to the next tasks and the data is contiguous in memory reducing cache misses. The CPU time for going a few levels deep is negligible compared to the CPU time needed to send I/O to the kernel.

Of course you need to avoid any sort of unbounded recursion: fortunately, it's easy to avoid doing it accidentally since it will not compile unless you explicitly box the futures, which is an indicator that you are probably doing something wrong.

If you really need it, then spawning on an executor is the correct approach as you stated. It's not an "hack" and should not incur more overhead than in JavaScript.

If you want to fully emulate the JavaScript design in Rust you in addition to spawning tasks use the Shared adapter, which will give something that behaves exactly like a JavaScript promise (which obviously is a bad idea in general since it's inefficient).

3 Likes

I had to reduce the number of samples to make this take a reasonable amount of time to benchmark, but here's results showing a linear correlation when using tokio::spawn:

use criterion::{criterion_group, criterion_main, Bencher, BenchmarkId, Criterion};
use futures::future::{BoxFuture, FutureExt};

fn recurse(n: isize) -> BoxFuture<'static, isize> {
    async move {
        tokio::task::yield_now().await;
        if n > 0 {
            1 + tokio::spawn(recurse(n - 1)).await.unwrap()
        } else {
            0
        }
    }
    .boxed()
}

fn benchmark(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let mut group = c.benchmark_group("bench");
    for &i in &[1, 2, 5, 10, 50, 100, 500, 1000, 5000] {
        group.bench_with_input(
            BenchmarkId::from_parameter(i),
            &i,
            |b: &mut Bencher, &input| {
                b.to_async(&rt).iter(|| recurse(input));
            },
        );
    }
    group.finish();
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

It got noisy at the top end.

Edit: for completeness, switching to OP's implementation, without the tokio::spawn:

bench/1                 change: [-94.705% -94.527% -94.330%] (p = 0.00 < 0.05)
bench/2                 change: [-92.955% -92.808% -92.661%] (p = 0.00 < 0.05)
bench/5                 change: [-95.333% -95.309% -95.285%] (p = 0.00 < 0.05)
bench/10                change: [-93.543% -93.486% -93.425%] (p = 0.00 < 0.05)
bench/50                change: [-75.750% -75.592% -75.428%] (p = 0.00 < 0.05)
bench/100               change: [-61.842% -61.640% -61.447%] (p = 0.00 < 0.05)
bench/500               change: [+21.922% +22.437% +22.979%] (p = 0.00 < 0.05)
bench/1000              change: [+139.97% +140.97% +142.02%] (p = 0.00 < 0.05)
bench/5000              change: [+1399.6% +1412.6% +1426.0%] (p = 0.00 < 0.05)

Even at 100 layers of boxing, it's still "worth it" to not tokio::spawn and just keep it in one task, even with Future::poll chaining down the recursive rabbit hole. Somewhere around 400ish levels it becomes worth it to tokio::spawn everything.

Most algorithms that are recursive don't typically need 100+ layers. (Recursion works best for expressing O(log n) divide-and-conquer type algorithms.) Other tasks that like a lot of boxing can consider using task::spawn at logical places to flatten the task stack a bit. (You can still return impl Future by just returning async move { join_handle.await }.)


I think the key takeaway is that Rust gives you a choice, and unfortunately you have to make the right one, because the choice isn't made for you. JS style futures don't give you a choice: you're allocating each new promise. (Though to be fair: on a GC heap optimized for shortlived objects, complete with the ability to move generations and all that fancy stuff, not the general malloc heap.)

17 Likes

It's actually more complicated. JS has two loops, one for "tasks" and one for "events". When a promise resolves immediately, the continuation is executed at the next task, but in the same event.

1 Like

Should be in the BOOK (filed an issue). very instructive. 101 of Rust futures

1 Like