Hi everyone,
For the past couple of weeks, I've been implementing a research database storage engine in Rust to learn the language and get up to speed with the latest academic designs. Since my professional interest lies in high-performance server software, I'm also trying to squeeze every last bit of speed from the hardware. Eventually, this led me to implement a custom memory management scheme for performance-critical data and thread pools for I/O (obviously), where I can schedule things the way I want them. But, damn it, I wanted to use async, like this:
```rust
struct Root {
    data: Index<String, u64>,
}

// ...

async fn merge_foo_and_bar() -> Result<u64> {
    database.run(|tx| async move {
        let root = tx.root();
        let foo_fut = root.data.get("foo");
        let bar_fut = root.data.get("bar");
        // Drive both lookups concurrently (futures::join!).
        let (foo, bar) = join!(foo_fut, bar_fut);
        root.data.insert("foobar", foo + bar)
    }).await
}
```
(This is an idealized example; real code will have to deal with things like serialization and multi-block allocation, but one can dream :-)).
Wouldn't that be neat? Initially, my thinking was that this probably wouldn't perform as well as hand-rolled scheduling inside the storage engine, but the ergonomic benefits are quite nice. But maybe I'm wrong? My goal right now is learning Rust and research, so I thought it would be worth a try.
But then I got to thinking: what if my data is scattered across storage devices connected to different sockets? I need to be able to: a) allocate memory for objects in memory local to the socket to which the PCIe disk is connected, and b) somehow communicate to the executor that I want my async functions to run on that specific socket.
Point a) is reasonably easy, at least for my use cases, because I don't need my memory allocation scheme to be compatible with standard collections or Box. What I'm missing, though, is placement new, but I know that's in the works.
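To make a) concrete, here's a minimal, Linux-only sketch of the kind of allocation I mean, using the libc crate: reserve anonymous pages with mmap() and bind them to one node with the mbind() syscall. alloc_on_node() is my own name, the MPOL_BIND constant is copied from <linux/mempolicy.h>, and real code would need size rounding, alignment, and a matching deallocation path:

```rust
use std::{io, ptr};

/// Best-effort allocation of `len` bytes backed by memory on NUMA node `node`.
unsafe fn alloc_on_node(len: usize, node: usize) -> io::Result<*mut u8> {
    let addr = libc::mmap(
        ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    if addr == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    const MPOL_BIND: usize = 2; // from <linux/mempolicy.h>
    let nodemask: u64 = 1 << node;
    // mbind(addr, len, mode, nodemask, maxnode, flags)
    let rc = libc::syscall(
        libc::SYS_mbind,
        addr,
        len,
        MPOL_BIND,
        &nodemask as *const u64,
        64usize, // maxnode: number of bits in the mask
        0usize,
    );
    if rc != 0 {
        libc::munmap(addr, len);
        return Err(io::Error::last_os_error());
    }
    Ok(addr as *mut u8)
}
```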
But the second point, b), is where I got stuck. I searched around the internets and found no way to accomplish the above using async/await.
What I think I need is an executor that's aware of the locality of resources used by the async functions. And a way of communicating such locality information from the functions to the executor. While I (kind of) know how to implement the former, I have no idea how to do the latter.
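Here's roughly how I picture the executor half, as a sketch only: one queue and one pinned worker thread per NUMA node, tasks naively driven to completion with futures::executor::block_on, and a made-up cpus_of_node() standing in for a real topology lookup (all names here are mine, nothing standard):

```rust
use std::{future::Future, pin::Pin, sync::mpsc, thread};

type Task = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

/// Sketch: one task queue and one pinned worker per NUMA node.
struct NumaExecutor {
    queues: Vec<mpsc::Sender<Task>>,
}

impl NumaExecutor {
    fn new(nodes: usize) -> Self {
        let queues = (0..nodes)
            .map(|node| {
                let (tx, rx) = mpsc::channel::<Task>();
                thread::spawn(move || {
                    pin_to_node(node);
                    for task in rx {
                        // Naive: drive each task to completion on this node's CPUs.
                        futures::executor::block_on(task);
                    }
                });
                tx
            })
            .collect();
        NumaExecutor { queues }
    }

    /// Sketch's gap: the caller must name the node explicitly.
    fn spawn_on(&self, node: usize, fut: impl Future<Output = ()> + Send + 'static) {
        self.queues[node].send(Box::pin(fut)).unwrap();
    }
}

/// Pin the calling thread to the CPUs of `node` (Linux-only).
fn pin_to_node(node: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        for cpu in cpus_of_node(node) {
            libc::CPU_SET(cpu, &mut set);
        }
        libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
    }
}

/// Hypothetical topology lookup, e.g. parsed from
/// /sys/devices/system/node/node<N>/cpulist.
fn cpus_of_node(_node: usize) -> Vec<usize> {
    unimplemented!()
}
```

The spawn_on() signature is exactly the problem: the caller has to know the node up front, and that's the locality information I don't see how to thread through async/await.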
Contemporary hardware platform topologies are getting increasingly complex. It's not uncommon nowadays to see servers equipped with many different types of compute resources (initiators) competing over a wide variety of memory resources (targets). In its simplest form, this means multi-socket servers with multiple NUMA (Non-Uniform Memory Access) domains. In such scenarios, each socket is its own NUMA domain, with low-latency access to directly attached local memory and higher-latency access to remote memory. But real life is rarely this simple: we also have to consider non-CPU devices connected through PCIe, such as disks, GPUs, RNICs, or FPGAs.
But wait, there's more. In the near future, we are going to have what is effectively commodity hardware with copious amounts of HBM (High-Bandwidth Memory), memory and compute resources attached through CXL (https://www.computeexpresslink.org/) or similar extremely low-latency interfaces, and even Persistent Memory, which promises to bridge the gap between storage and memory. One of the recent additions to the ACPI spec is the HMAT (Heterogeneous Memory Attribute Table), which enumerates the various initiator and target devices available in the system and provides information about the relative performance between them (see https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html).
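As the linked doc describes, the kernel surfaces the HMAT-derived numbers under /sys/devices/system/node/, so applications can already read them directly. A quick sketch (node_access0() is my name; the files simply won't exist on machines without HMAT):

```rust
use std::fs;

/// Read a node's best-initiator read bandwidth (MB/s) and read latency (ns),
/// as derived by the kernel from HMAT. Returns None if HMAT isn't present.
fn node_access0(node: usize) -> Option<(u64, u64)> {
    let base = format!("/sys/devices/system/node/node{node}/access0/initiators");
    let read = |attr: &str| -> Option<u64> {
        fs::read_to_string(format!("{base}/{attr}")).ok()?.trim().parse().ok()
    };
    Some((read("read_bandwidth")?, read("read_latency")?))
}
```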
To date, highly complex heterogeneous systems have mostly been a concern of the HPC (High-Performance Computing) world. Things like OpenMP can already be told to allocate different types of memory, and the scheduler takes that into account.
But that's beginning to change, and I expect that software will need to be written with heterogeneous memory and compute architectures in mind.
Typically, this problem is handled by the operating system's kernel, whose job is to make sure that compute is scheduled as close to the resources as possible. But it can only do so reactively (e.g., AutoNUMA in the Linux kernel), and in a very limited scope. Applications have vastly more information available to them to aid in intelligent scheduling of functions.
I hope I didn't bore anyone with this lengthy background information, but I felt it necessary given the niche topic.
With my relatively limited knowledge of Rust internals, I wouldn't presume to suggest any changes for the async/await API or the standard libraries.
I'm currently considering whether it would be possible to introduce a function, similar to mbind(), that would tell the scheduler to bind execution of a task to a particular socket (or set of sockets). Or even something like .await(cpu_mask), which would do a similar thing. But I think it would be better to have a generic abstraction over initiators and targets, where the application defines targets/resources and async functions communicate which resources they use. The executor can then use that information to map those async functions to the appropriate initiators (compute resources).
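I don't have a concrete design, but to show the shape of what I'm imagining (every name below is made up): resources advertise which target their data lives on, async work declares the targets it touches, and the executor resolves that set to an initiator:

```rust
use std::future::Future;

/// A memory or storage target: NUMA node, HBM, CXL device, PMEM region, ...
#[derive(Clone, Copy, Debug)]
struct Target(usize);

/// Anything whose backing data lives on a known target.
trait Locality {
    fn target(&self) -> Target;
}

/// Executor-side half: map a set of targets to the initiator (e.g. a socket)
/// best positioned to drive them, and run the future there.
trait LocalityAwareSpawn {
    fn spawn_near<F>(&self, targets: &[Target], fut: F)
    where
        F: Future<Output = ()> + Send + 'static;
}
```

With something like this, root.data.get("foo") could advertise the target of its backing storage, and an .await(cpu_mask)-style operator would desugar into a spawn_near() call.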
And maybe the answer is that this problem is out of scope for async? I think that would also be fair, since the benefits of async/await for database-style software are uncertain.
(I hope I posted this in the right category)
Thanks, Piotr