Unify async+sync into pure async

As far as I know, async does not have many use cases. In fact, I'm only aware of one: networking with a huge number of network requests.

Given that most people never need async, it seems wrong to force it on everyone. Async comes with a burden. You need to have a runtime, for example.

4 Likes

Embassy uses async on microcontrollers as a lightweight replacement for threads. And Zed uses it in its GUI framework for waiting on background threads to finish without blocking, as any wait longer than a single frame may be noticeable by the user. Iced does something similar.

7 Likes

After the above discussion, I think defaulting to always using async is probably not a good idea; a better approach is to be generic over sync/async, which is currently being worked on.

But I don't think that networking with a huge number of network requests is the only use case. Anytime you have a huge workload you want to finish ASAP, you need to shard it (i.e. distribute the workload across multiple cores so that several cores process it rather than a single core doing all the work alone). Coordinating those threads is a great use case for async.

CPUs have basically reached the limit of how much processing power a single core can deliver, so hardware manufacturers are nowadays resorting to increasing the number of cores as a means of increasing a computer's processing power. I think async is important for unifying the processing power of those numerous cores.

You need async to yield back to an executor so it can do more work while you're waiting on something, and waiting happens often (because you're doing lots of tiny reads/writes, waiting on unpredictably slow remote hosts, etc.).

For big compute workloads you want to shard into large work items to avoid synchronization overhead and use queues to maintain throughput. So what would normally be an await point should instead be popping something off a queue every now and then which will have work ready.

So async/await shouldn't be needed. Rayon, queues, continuations, join handles and the like should be sufficient most of the time. Some mild concurrency to mask IO latency might be needed, but that can usually be solved by throwing a few dedicated IO threads at the problem that keep the input queue saturated.

3 Likes

I think it's not just important for that, it makes this significantly easier. Also consider that there are other applications besides networking that can benefit from non-blocking code, which may be easier to write with async, especially considering Rust's thread safety guarantees:

  • Hardware I/O / peripherals: Embassy was already mentioned. In embedded applications you often end up writing one main loop that implements a state machine, effectively polling all input pins regularly, much like it is done in async Rust.
  • Network connections: This was already mentioned, but even for applications that only send a smaller amount of requests you often want them in parallel, not blocking the UI or other computation. Yes, this can be implemented without async, using threads, but I think many find the async/await approach easier to comprehend (at least as long as you're not using too many other things that don't fully support async code yet).
  • File I/O: A lot of applications need to read from/write to the file system. For example games: Ideally you wouldn't have to show a loading screen every time you need to save the game's state, or keep it to a minimum. Again: This is possible with threads but can be easier to implement/understand with async functions.
  • Basically anything that has a UI has some part that is responsible for doing things and a part/thread responsible for providing a responsive interface. Without async you tend to build some kind of MVC system around it, using one thread exclusively for the UI (View) and message passing to the background tasks. With async you can effectively say "do this and notify me once done", regardless of how many threads are used in the background (see the sketch after this list). Note: Ideally there would be some way to prioritize tasks differently, then it could become even more useful for this.
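
To make the "do this and notify me once done" pattern concrete, here is a minimal sketch, assuming a Tokio runtime; save_game_state is a hypothetical stand-in for the background work:

async fn save_game_state() { /* hypothetical: write save files, etc. */ }

async fn on_save_clicked() {
    // "Do this and notify me once done": spawn the work, keep the UI
    // task responsive, and await the handle whenever convenient.
    let handle = tokio::spawn(save_game_state());
    // ...keep rendering frames / handling input here...
    handle.await.expect("save task panicked");
}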

Every time you interface with the GPU you effectively have some kind of asynchronous code. Not the async you have in Rust, but I wouldn't be surprised if async Rust makes interacting with the GPU easier (though there still needs to be some lock/ordering around anything that communicates with the GPU).

Limiting yourself to threads in strategic places or using rayon (which cannot be used everywhere) can limit quite a bit how much you can scale with more CPU cores, whereas async code tends to be a lot more flexible with regard to how many cores you have. I'd argue that's the biggest benefit of async in the first place: changing the programming model from synchronous to asynchronous where it makes sense.

2 Likes

I completely agree with that, but I think many applications are not this big, or cannot be split up this easily.

For high performance computing that can be split up in such a way async probably isn't the best choice :+1:

Async has costs. You need a runtime, the Send issues, cancelability, object safety and all that crud.

If your network IO can be handled by a handful of threads, that's fewer moving parts and thus easier to comprehend.

IMO the main reason for using async networking code is because some library that you want to use only has an async interface. That's not a particularly good reason, but whatever.

> File I/O: A lot of applications need to read from/write to the file system. For example games: Ideally you wouldn't have to show a loading screen every time you need to save the game's state, or keep it to a minimum. Again: This is possible with threads but can be easier to implement/understand with async functions.

File IO is just threads and queues under the hood anyway, since most of the OS APIs are sync. At least until there's enough io_uring adoption. With uring, yes, async could be beneficial to keep IO queue depth > 1.

On the other hand if you're dealing with spinning rust you do not want more parallelism, you want to optimize your workload for big sequential accesses.

> Note: Ideally there would be some way to prioritize tasks differently, then it could become even more useful for this.

Well, there are thread priorities. Afaik there's no async runtime that has per-task priorities; instead you'd spawn work onto a different runtime or thread pool. But you'll need something other than just async/await. Perhaps it's submitting the task to a different runtime and using a oneshot channel to get the response. But in the end you're again doing more structured concurrency than just async/await.
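
As a rough sketch of that last idea (all names hypothetical; assumes a second Tokio runtime whose worker threads were created with a lower OS thread priority):

use tokio::sync::oneshot;

async fn expensive_work() -> u64 { 42 } // placeholder

// Submit work to the low-priority runtime and hand back a oneshot
// receiver that the caller can await on its own runtime.
fn submit_low_priority(low_prio: &tokio::runtime::Handle) -> oneshot::Receiver<u64> {
    let (tx, rx) = oneshot::channel();
    low_prio.spawn(async move {
        let _ = tx.send(expensive_work().await);
    });
    rx
}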

2 Likes

Note that async doesn't imply parallelization. You still need to explicitly use APIs like tokio::spawn to achieve that, but that's not all that different from an equivalent sync API.
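
For example, a sketch with hypothetical fetch_one/fetch_two futures, assuming Tokio:

async fn fetch_one() -> u32 { 1 } // placeholders for illustration
async fn fetch_two() -> u32 { 2 }

async fn demo() {
    // Concurrency without parallelism: both futures are interleaved
    // on the *current* task; no extra thread is involved.
    let (a, b) = tokio::join!(fetch_one(), fetch_two());

    // Only spawning hands a future to the runtime's scheduler, which
    // may run it on another worker thread in parallel.
    let c = tokio::spawn(fetch_one()).await.unwrap();
    println!("{a} {b} {c}");
}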

Async will likely not help there anyway.

3 Likes

With async you can write purely OS/hardware-independent (see no_std) libraries that need to wait/stop/sleep etc. Blocking on a future on Linux is trivial and essentially zero cost: you only need an AtomicUsize for state and to call thread::park and thread::unpark when needed. Any blocking call involves thread parking anyway, just implicitly, from the kernel side.

Executors are only needed if you want to center your whole application (or some part of it) around async and spawn a lot of tasks. If you don't need to spawn a lot of tasks, you really can just start a new thread and block on the future with thread::park/thread::unpark; the kernel will be your actual executor.

A Future is just a handy coroutine. That means it can be stopped and resumed by you, not only by the OS as in sync code. You are not really losing anything, just gaining more flexibility in order of execution and scheduling, plus allowing more abstractions.
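
For illustration, a minimal sketch of such a hand-rolled block_on, using std::task::Wake and thread parking rather than an explicit AtomicUsize state machine:

use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Waking simply unparks the thread that is blocked on the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Poll the future on the current thread, parking until woken;
// the kernel scheduler is effectively the executor.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            Poll::Pending => thread::park(),
        }
    }
}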

3 Likes

On PCs, spinning rust isn't relevant any more. Maybe it still is in servers; I don't know, I don't work with them.

I would love better support for io-uring. I would also love an executor optimised for mixed io/compute. And this isn't about "big science" style compute. I was recently trying to hash all files installed by the package manager and compare them to known good checksums (in order to find discrepancies). On a typical Linux desktop. That is a whole lot of file IO as well as compute. And async does not serve that use case well currently.

The really big use case for async that I see is in UIs. Both GUI and TUI can benefit from the async model. Unfortunately this is also underserved from what I can tell, though Zed is doing some interesting things there.

5 Likes

Does "spinning rust" refer to the usage of spinlocks implemented in Rust? I haven't encountered that term before.

EDIT: nvm, I found the answer

2 Likes

Are you thinking about an executor that recognizes when its threads are blocked and spawns new ones, or a mechanism to tell the executor if a future is doing heavy work on the CPU?

The first could result in a lot of threads, which effectively just add overhead, so I think it would have to be the latter. Then there is the question of how to handle futures that start out heavy on I/O and, as soon as that I/O is completed, do heavy CPU work, like in your example.

There might be a need to indicate to the executor whether the code after a .await is/can be heavy-CPU or is always light on CPU usage. That way the executor wouldn't have to guess when deciding if a future should be polled now (mainly I/O) or wait (because the CPU is already busy).

I think that'd be relevant in the "read lots of files and compute hashes" use case you mentioned: Let's say the executor has 100 open futures it can poll and 4 cores/8 threads or so. All futures are waiting for I/O to complete; 50 of them would continue waiting afterwards, the other 50 would start CPU-heavy computation. Suppose the executor (8 threads for execution) is currently processing 7 heavy-CPU futures, so the CPU is already busy. It can now do one of the following:

  • Wait until one of the CPU futures finishes - Thus blocking all other futures like for the UI, which would finish quickly, OR
  • Take any of the existing futures and hope it is one for I/O and not another one with heavy CPU usage.

Without knowing the difference between those futures there is no way to choose correctly. So unless I'm mistaken you'd need something like the following (either manually or automatically added by the compiler):

async fn do_something() {
    let data1 = read_io().await;
    let data2 = read_io().await_then_compute;
    // Alternative syntax suggestion (this syntax unfortunately conflicts with futures that return functions)
    let data2 = read_io().await(estimated_compute_time_that_follows_or_priority);
    let hash = compute_something(data1, data2);
}

The compiler could convert this into a state machine that has a will_do_compute_if_ready() -> usize function or so, indicating that if this future is polled and ready it will take a while to execute, maybe with some estimate of how long it thinks it will take.

With this knowledge the executor could look at the futures ready to be polled and decide which one to poll first if the CPU is already busy.
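
Purely as an illustration of the shape such an API could take (nothing like this exists today; every name here is made up):

use std::future::Future;

// Hypothetical: a future reports a rough estimate of the CPU work that
// polling it will trigger once its awaited I/O has completed, so an
// executor could deprioritize it while the CPU is saturated.
trait ComputeHint: Future {
    /// 0 = cheap to poll; larger = more CPU time expected after readiness.
    fn will_do_compute_if_ready(&self) -> usize {
        0
    }
}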

It's like back when there was one big computer for an entire university/company and you had to tell it how long you thought a job would take, so that it could schedule smaller jobs earlier and stop work once a job reached its limit. It might make sense to take some ideas from that time, as that was effectively a kind of cooperative multi-tasking, too.

I'm not sure. I notice that using rayon I don't utilise the CPU fully (only about 85-90 %), and this is even worse in the mode where I just compare times (not the full checksum). Unfortunately io-uring doesn't yet support directory walking as far as I can tell, otherwise rewriting around that could make sense.

So I was thinking of a mix of io-uring for making sure IO is fed, plus something rayon-like for the compute. Somehow unified into a cohesive whole.

Marking the futures somehow makes sense.

1 Like

Or you could just do what Unix does, and let applications take as long as they want to react to control-C.

That's only safe on Unix because it has kill -9 as a fallback if the application isn't cooperating. But you should be able to support a kill -9 equivalent: a way to end a process that isn't cooperative, but dodges the complexities of signal handling (at least from a userland perspective) because it forcibly ends the process.

From a kernel perspective, a very purely-async kernel might have a hard time with kill -9, if it entirely removed support for asynchronous exceptions or thread state save/restore. Removing those features would simplify the kernel a bit, and might allow you to make interesting guarantees to userland like "the CPU literally won't execute any instructions other than yours until you await". But if you support preemptive multitasking, then you couldn't remove those features or make those guarantees anyway. So supporting kill -9 shouldn't be much further of a reach.

2 Likes

I don't really think this makes sense. With async/await you're interleaving interchangeable units of work.

When doing IO and compute there is no point in "switching" between the IO part and the compute part. They're not interchangeable units of work, and there's no point in mixing them in the same function.

Sure, it would be convenient to do so, to pretend you're writing straight-line blocking code and have magical parallel and concurrent execution. But you're not going to achieve that, because you're not going to produce 2 units of IO, then 2 units of compute, then 2 units of IO, etc. If your loop is structured like that you're leaving throughput on the table, because the timing of one resource may leave the other idle longer than necessary.

In async terms, you want to have independent top-level tasks with their own loops running, connected to each other via bounded queues, maybe with a buffer pool to avoid allocator overhead.

But at that point you have:

  • a set of tasks that just loop, block on a queue for more work, compute. this is just a sync threadpool
  • a set of tasks that either block on IO or block on the queue being full. this is just a thread pool. or an io_uring submit/CQE-processing loop.

Neither of which needs async. Because when they block, they block because there is nothing else left for them to do, no further interchangeable units of work. And when they run they have no reason to yield, since yielding won't make anything else finish sooner.
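
Sketched with nothing but std (placeholder stages; the bounded channel is the back-pressure knob):

use std::sync::mpsc::sync_channel;
use std::thread;

fn read_chunks() -> Vec<Vec<u8>> { vec![b"example".to_vec()] } // placeholder IO
fn hash_chunk(chunk: &[u8]) -> u64 { chunk.iter().map(|&b| b as u64).sum() } // placeholder compute

fn main() {
    // Bounded queue: at most 64 buffers in flight.
    let (tx, rx) = sync_channel::<Vec<u8>>(64);

    // "IO" thread: blocks on reads, and blocks when the queue is full.
    let reader = thread::spawn(move || {
        for chunk in read_chunks() {
            if tx.send(chunk).is_err() { break; } // consumer gone
        }
    });

    // Compute thread: blocks on the queue only when nothing is left to do.
    let hasher = thread::spawn(move || {
        while let Ok(chunk) = rx.recv() {
            let _ = hash_chunk(&chunk);
        }
    });

    reader.join().unwrap();
    hasher.join().unwrap();
}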

You need more convenience APIs like rayon and scoped threads and what not to glue different tasks together and shovel data between them. Not async/await.

All this glue serves a purpose. Explicitly declaring a bounded queue says how many items you want to keep in memory concurrently, i.e. how much buffering you want to do. Declaring separate thread pools / top-level tasks / whatever essentially says how many IO threads and how many CPU threads make sense for your work. If you're working on HDDs you want 1 IO thread. If you're doing blocking IO on SSDs you want enough to keep your QD above 1. If you use io_uring then perhaps as many threads as the device has IO queues, assuming the compute part can keep up.

Maybe an executor could try to guess and infer these things. But once your compute-io-device-graph gets complicated enough it'll probably start doing the wrong thing for inscrutable reasons.

2 Likes

You are probably right, sigh. However, doing the right thing for this currently is difficult, and it would be useful to have better abstractions.

In particular it is hard to integrate this with async in a different part of the program (the full task is to load a user config script describing the expected system state and compare it to what is actually on the system, then offer to either save the differences to the config or apply the config to the system, depending on mode). Here the scripting engine I use uses async, as do some other libraries that I depend on.

You can use tokio's spawn_blocking to wait on rayon jobs in a blocking helper thread, then wait asynchronously on that join handle. But that creates an extra thread just for that. Adding an extra IO thread pool feeding the compute thread pool (which at this point likely won't be rayon, I guess, but a custom thing) just makes this even more complicated.

On the other hand, the actual graph of data flow, and where we want back pressure via bounded queues etc., is actually quite simple in this case. If only there were an easy way to translate that directly into code (just connect various modules of code that take streams of input and produce streams of output). Perhaps some sort of actor/data-flow framework? But such things are often made for servers, where you have a steady state until the end of the program, not phases of execution after which you are done with the parallelism.

And the main thread has to kick off various jobs and wait for them at other points (also straightforward in theory, just a dependency graph). Can be modelled via spawning and joining tasks/threads and having tasks wait for other tasks.

After writing this I feel like the main problem (from a usability POV, not necessarily performance) is that integrating the various concurrency/parallelism approaches in Rust is not nice. There isn't enough convenient glue to go between them, and it is too easy to end up with an architecture that is a confusing mess.

(Oh and build times and binary size suffer because you now have so many dependencies on alternate underlying libraries that solve the same issue (possibly in multiple versions). Why do I need both log and tracing, why both nix and rustix, etc? And for a hobby project, doing it all from scratch is not viable, I don't want to yak shave but to solve my original problem.)

Can I ask why tokio_rayon's approach isn't suitable? This doesn't use an extra thread; instead, it wraps the function you're sending into Rayon (as your compute thread pool) with code to send the result through a tokio::sync::oneshot channel back to the async caller.

You do end up with an I/O thread pool (for Tokio), and a compute thread pool (for Rayon), but you don't need spawn_blocking because you're using oneshot::send to send results from the compute pool back to the async caller.

1 Like

Because I hadn't found that library. Thanks! That makes things a bit better (at the cost of yet another dependency).

It's also a very simple approach to replicate without adding it as a dependency. In your async code that fires off compute, instead of a "simple" rayon::spawn(|| my_code()), you write:

{
    let (tx, rx) = tokio::sync::oneshot::channel();
    // Catch panics in the compute job so they reach the async caller
    // instead of tearing down a Rayon worker thread.
    rayon::spawn(move || {
        let _ = tx.send(std::panic::catch_unwind(|| my_code()));
    });
    rx.await
}

All the crate does is bundle that up nicely for you.

This is also a useful technique to have in mind for FFI code that does something in the background then calls a callback - make the callback send the result down a oneshot channel, and now you've got a neat technique for making it async, or for making the wait happen nicely in sync Rust (using a different channel type - maybe std::sync::mpsc::channel). No need for Rust to start a thread, and a quick unwinding of the FFI stack frame back into pure Rust.
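
A sketch of that FFI variant, assuming a hypothetical C API start_work(cb, user_data) that invokes the callback exactly once on a background thread:

use std::os::raw::c_void;
use tokio::sync::oneshot;

extern "C" {
    // Hypothetical C library function, for illustration only.
    fn start_work(cb: extern "C" fn(i32, *mut c_void), user_data: *mut c_void);
}

extern "C" fn on_done(result: i32, user_data: *mut c_void) {
    // Recover the boxed sender smuggled through `user_data`, complete the
    // future, and let the FFI stack frame unwind immediately.
    let tx = unsafe { Box::from_raw(user_data as *mut oneshot::Sender<i32>) };
    let _ = tx.send(result);
}

async fn work() -> i32 {
    let (tx, rx) = oneshot::channel();
    let user_data = Box::into_raw(Box::new(tx)) as *mut c_void;
    unsafe { start_work(on_done, user_data) };
    rx.await.expect("callback never fired")
}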

3 Likes

tokio::sync::oneshot::Receiver has a blocking_recv method for waiting for the value outside async contexts, so this can be done with the same channel type. Not sure about performance overhead, however (if any).
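
For instance (note that blocking_recv consumes the receiver and panics if called from inside an async context):

fn sync_caller() -> i32 {
    let (tx, rx) = tokio::sync::oneshot::channel();
    rayon::spawn(move || {
        let _ = tx.send(42);
    });
    // Same channel type as the async version; just block this thread.
    rx.blocking_recv().expect("sender dropped")
}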