Support for heterogenous memory systems

pbalcer · May 29, 2020, 7:57pm

Hi everyone,

For the past couple of weeks, I've been implementing a research database storage engine in Rust to learn the language and get up to speed with the latest academic designs. Since my professional interest lies in high-performance server software, I'm also trying to squeeze every last bit of speed from the hardware. Eventually, this led me to my custom memory management scheme for performance-critical data and thread pools for I/O (obviously), where I can schedule things the way I want them. But damn it, I wanted to use async, like this:

struct Root {
   data: Index<String, u64>,
};

...

async fn merge_foo_and_bar() -> Result<u64> {
    database.run(||tx| async {
        let root = tx.root();

        let foo_fut = root.data.get("foo");
        let bar_fut = root.data.get("bar");

        let (foo, bar) = join!(foo_fut, bar_fut);

        root.data.insert("foobar", foo + bar)
    });
}

(this is an idealized example, real code will have to deal with things like serialization and multi-block allocation - but one can dream :-)).

Wouldn't that be neat? Initially, my thinking was that this probably wouldn't perform as good as hand-rolled scheduling inside of the storage engine, but the ergonomic benefits are quite nice. But maybe I'm wrong? My goal right now is learning Rust and research, so I thought it would be worth a try.

But then I got to thinking, what if my data is scattered around in storage devices connected to different sockets? I need to be able to: a) allocate memory for objects in the memory local to the same socket to which the PCIe disk is connected and b) somehow communicate to the executor that I want my async functions to run on that specific socket.

The point a) is reasonably easy, at least for my use cases, because I don't need my memory allocation scheme to be compatible with standard collections/Box. What I'm missing, though, is placement new, but I know that is in the works.

But the second point, b), is where I got stuck. I searched around the internets and found no way to accomplish the above using async/await.

What I think I need is an executor that's aware of the locality of resources used by the async functions. And a way of communicating such locality information from the functions to the executor. While I (kind of) know how to implement the former, I have no idea how to do the latter.

Contemporary hardware platform topologies are getting increasingly complex. It's not uncommon nowadays to see servers equipped with many different types of compute resources (initiators) competing over a wide variety of different memory resources (targets). In the simplest form, this takes the form of multi-socket servers with multiple NUMA (Non-Uniform Memory Access) domains. In such scenarios, each socket is its own NUMA domain, with low latency access to directly attached local memory, and higher latency access to remote memory. But real life is rarely this simple - we also have to consider other non-CPU devices connected through PCIe, such as disks, GPUs, RNICs or FPGAs.

But wait, there's more. In the near future, we are going to have what is effectively commodity hardware with copious amounts of HBM (High-Bandwidth Memory), memory and compute resources attached through CXL (https://www.computeexpresslink.org/), or similar extremely low-latency interfaces, and even Persistent Memory, that promises to bridge the gap between storage and memory. One of the recent additions to the ACPI spec is HMAT (Heterogeneous Memory Attributes Table), which enumerates the various initiator and target devices available in the system, and provides information about relative performance between them (see https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html).

To date, highly complex heterogenous systems were mostly a concern in the HPC world (High-Performance Computing). Things like OpenMP can already be told to allocate different types of memory, and the scheduler takes that into account.

But that's beginning to change, and I expect that software will need to be written with heterogeneous memory and compute architectures in mind.

Typically, this problem is handled by the operating system's kernel, whose job is to make sure that compute is scheduled as close to the resources as possible. But it can only do so retroactively (e.g., AutoNUMA in the linux kernel), and in a very limited scope. Applications have vastly more information available to them to aid in intelligent scheduling of functions.

I hope I didn't bore anyone with this lengthy background information, but I felt it necessary given the niche topic.

With my relatively limited knowledge of Rust internals, I wouldn't presume to suggest any changes for the async/await API or the standard libraries.

I'm currently considering if it would be possible to introduce a function, similar to mbind(), that would tell the scheduler to bind execution of a task to a particular socket(s). Or even something like .await(cpu_mask), which would do a similar thing. But I think it would be better to have a generic abstraction over initiators and targets, where the application defines targets/resources and async functions communicate which resources they use. The executor can then use that information to map those async functions to the appropriate initiators (compute resources).

And maybe the answer is that this problem out of scope for async? I think that would also be fair since the benefits of async/await for database-style software are uncertain.

(I hope I posted this in the right category)

Thanks, Piotr

passcod · June 3, 2020, 11:33pm

Perhaps only tangentially related, but I recall Web Assembly is due to be getting multiple memories, some unique, some shared. There may be insights on handling heterogeneous memories coming from there?

pbalcer · June 4, 2020, 8:42am

Thanks for the suggestion. I looked around the proposed changes to WebAssembly memory model, and from what I can tell, the shared memory buffers are, in fact, private from the OS perspective (MAP_PRIVATE), i.e., they are copy-on-write across processes. In Rust's nomenclature, they are simply arrays that are Send. I'm assuming, based on web workers documentation, that this memory is intended to be "shared" between threads. Not between processes, as is the case with real shared memory.

Having said that, if Rust code would wish to dynamically allocate from WebAssembly.Memory, it would likely need a custom memory allocator that could treat the existing WebAssembly.Memory as its heap. And so, if you have multiple different WebAssembly.Memory objects, you need multiple heaps, meaning we can no longer rely on a single global allocator. If the function of the memory is different, then applications need to be able to intentionally allocate from the heap with the desired properties. In this case, I'd expect that the structs that the Rust code wishes to be Send, need to be allocated from the "shared" WebAssembly memory.

I found wee_alloc, which does what I describe, but it relies on the allocator being global, at least as far as I can tell.

If that's what you meant, that yes - this is indeed a very similar problem to the one I've described. In the NUMA case, we want to allocate memory with desired properties (Low Latency or High Bandwidth or with Huge Pages or Slower but with higher capacity or even local to a specific device), and then use that memory with the appropriate I/O initiators.

But that problem, I believe, can be simply solved by libraries, similar to wee_alloc, that allow the creation of non-global heaps. Such heaps could be parameterized with a custom page allocation function, instead of relying on a fixed lower-level OS interface (e.g., mmap on Linux).

To fully address the entire problem I described in my original post, we need to look beyond merely allocating resources. Below I describe my current thinking.

All allocated memory objects could implement a Target trait, something like:

trait Target {
    fn relative_performance<T>(&self, initiator: &T) -> PerformanceMetric where T: Initiator;
}

But other things could also implement Target, e.g., file descriptors or threads. This could allow us to allocate memory of similar locality to the given Target, like this:

  let file = fs::OpenOptions::new()
         .open("/mnt/disk_on_socket0/file")?;

  let data = unsafe {
    let ptr = find_best_allocator_for_target(&file).alloc(Layout::new::<MyData>());
    *ptr = MyData::default();
    Box::from_raw(ptr)
  }

And then, ideally, the executor could be made aware that the async functions are using some resources that implement Target, and would schedule it on the initiator that has the best average relative performance (best locality) for the resources on which it will operate.

  let file = ...;
  let data = ...;
  ...
  let future = async {
    uses!(file, data);

    let some_other_data =
        find_best_allocator_for_target(thread::current())
            .alloc(...); 
         /*
          * The executor could probably expose the best allocator for
          * the scheduling domain, instead of relying on threads...
          */
        ...
  };

But it would probably be better to associate that usage information with the Future itself so that the executor has a priori knowledge of the locality of the future resources.

These ideas are all half-baked, but this is roughly the direction I intend on heading if I get that far in my project

PoignardAzur · June 4, 2020, 9:43am

I feel like there's some ambiguity in this discussion, so I want to clear one thing up: WebAssembly with multiple memories isn't at all equivalent to mmap. Each different memory has its own "pointer space", so to speak, with no trivial way to index one of several memories based on a runtime value.

In other words, if you want 20 different memories in your wasm instance to represent eg different sockets, then you will need 20 copies of every socket-manipulating function, which in practice will probably involve generics.

pbalcer · June 4, 2020, 10:11am

That's interesting, how then does one dereference a pointer inside a pointer space (which I'm assuming is a type of Logical Address Space)? If there's no common address space, then I'm assuming that any WASM instruction that operates on a memory location must be parametrized with some identifier of the WebAssembly.Memory, which, as you seem to be suggesting, must be known compile-time?

Even if that's true, I still think there are some parallels to be drawn between the WASM and native use cases. A good abstraction would hopefully be able to cover both.

HerrJemine · June 4, 2020, 10:48am

Would it make sense to represent the different memory spaces in the type system?

Similar to Box<T> a library could provide a HBMBox<T> or a 3DXpointBox<T> type with suitable conversion functions between them which move the underlying data between memories and the user then has full control where they want to place their data.

pbalcer · June 4, 2020, 11:13am

Code typically shouldn't care about the exact specifics of the hardware that it is running on. It might care about properties of that hardware, but not whether it's HBM or something else. So instead of allocating specifically HBM, the application should instead indicate how it wants to use the object. Let's say you are allocating a B+ tree, you might want your internal nodes and keys to be located on the fastest memory available, but values, which are accessed less frequently, can be located on memory of which there's more in the system (like PMEM).

The abstraction that I described in my earlier post allows for that by simply enabling you to enumerate the characteristics of a given "Target Memory", and selecting the most appropriate allocator for the intended use of the resulting memory allocation. So, in the B+ Tree example, for internal nodes, you'd want to find the fastest target available, but for leaf nodes, you'd want to find a target which is larger, but ideally still connected through the same socket. And obviously, this should all still work on single target/initiator platforms.

The Target trait that I described above is too simplistic, and a real implementation would likely need to be able to describe things like read/write bandwidth, latency and the theoretical max capacity (but I'm not 100% sure about that one).

You also mentioned 3D XPoint, which is a marketing name for one specific technology. The general term is Persistent Memory. And while you can use Persistent Memory as a capacity tier of volatile memory (because it's larger and slower), for example, to allocate leaf nodes in the B+ Tree example. But that wouldn't take advantage of the fact that the memory is persistent. Normal memory allocators are not designed to survive across application restarts, and typically are not fail safe. So a PMEMBox<T> wouldn't do much - we'd need a whole new memory allocator and transactions to go with it, and at that point, it wouldn't be even similar to a normal Box<T>. I intend for the storage engine that I'm working to enable semi-transparent use of PMEM (while taking advantage of its persistence), HBM and other types of memory in Rust. Hence this topic

bill_myers · June 4, 2020, 1:13pm

As far as I can tell, all you need is a threadpool-based executor crate that can spawn multiple threadpools, and where you can set the affinity of the threadpool's threads to a specific NUMA node, so you can then spawn the future on the executor for the NUMA node you want.

Not sure whether the existing runtimes support this, but it should be easy to modify them to do so.

You'll also probably want a Rust global allocator that can be configured, in a thread-local way, to allocate on a specific NUMA node, which I'm also not sure if it has already been written, but that should also be easy to write.

Alternatively, you can run an OS process for each NUMA node and set it to have both CPU and memory affinity to it using Linux cgroups and then it will work with no need to change the programs (but of course you'll now have a cluster instead of a single process).

pbalcer · June 4, 2020, 2:16pm

All I need is mbind[1] and pthread_setaffinity_np on some threads. If I want to run those threads inside of my library. But I don't, I want to let the downstream applications to chose whatever executor they want. That's the point of decoupling the executors from the runtime, after all.

I want my code to scale from small consumer-grade PCs to large multi-socket servers, so I don't want to hardcode assumptions about the hardware topology.

And yes, I could forgo NUMA awareness (but the problem that I'm trying to highlight goes beyond that), and make my library multi-process safe. But it's arguable which problem is easier to solve for an embedded database library.

[1] - some types of memory need to be mapped from a file, instead of just being allocated from a NUMA node through something like libnuma, so that's not entirely true - but let's leave it at that.

PoignardAzur · June 5, 2020, 9:12am

Yup. My understanding is that it's the intended semantic.

bill_myers · June 5, 2020, 11:29am

I suppose you could return a NumaFuture struct with a future and an optional NUMA node id (or mask/set) and let the library user handle executing it with proper NUMA affinity (or ignore the affinity), presumably using an executor crate that you would provide along with the crate publishing the NumaFuture definition.

pbalcer · June 5, 2020, 4:21pm

Right, creating a resource-locality-aware Future is one part of the solution that I've outlined in one of my earlier posts. Allocation of local resources inside of an async function is another.

But I don't think it's as simple as associating cpumask with the Future. It's the executor that should have a list of initiators (NUMA nodes), and match it with the list of Targets in the Future. There's one pragmatic reason for that - mapping the list of target NUMA nodes to optimal initiator NUMA nodes requires the information located in HMAT (https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf, 5.2.27). Loading and parsing that table during future creation would be slow. The executor can just load HMAT once into a convenient data structure, and quickly derive the best cpumask.

Also, simply providing a mask may needlessly restrict the executor design which might negatively impact performance. The abstraction should merely provide more information to the executor.

system · September 3, 2020, 4:22pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Blog series: Dyn async in traits (continues) language design	69	5465	December 24, 2022
Blog post: Async Cancellation language design	10	6722	February 18, 2022
RFC IDEA: Rust Database API libs	9	1634	March 25, 2019
[Feature Request] Add async indexing language design	7	856	January 22, 2022
Compacting memory for C/C++/Rust?	4	1005	January 21, 2020

Support for heterogenous memory systems

Related topics