[Pre-RFC] Split stack codegen option

Summary

The proposal is to add a new codegen option, -C split-stack, that causes the compiler to emit the function prologue required to support (potentially) discontiguous stacks that can be grown at will at runtime.

This proposal also includes a new function attribute, #[no_split_stack], which is a per-function opt-out of the split-stack function prologue.

Motivation

Split stacks, also called “segmented stacks” or “linked stacks”, are a technique for growing the stacks of threads while a thread is running. Instead of reserving a large amount of memory up-front for a thread’s stack (and crashing if that limit is ever exceeded), one can instead reserve a small amount of memory and expand it as needed.

Split stacks were useful enough that even Rust used them for a time - the old pre-1.0 "libgreen" runtime used them to implement a green-threaded runtime atop libuv. Rust was hardly the first to do so: the Go programming language made heavy use of split stacks for goroutines, until it switched to contiguous (but still growable) stacks - which still require the compiler support detailed in this RFC [1]. Microsoft's Midori project also made use of split stacks to great effect [2].

While Rust has departed from the "libgreen" days in both philosophy and implementation, as a systems programming language it still stands as a potential C++ replacement for implementing runtimes. Even if Rust does not use split stacks for its own runtime, it is still useful for Rust code to be compiled to use split stacks, in order to interoperate with (or even implement) a runtime that uses them. Both C and C++ can use split stacks via the clang/gcc option -fsplit-stack.

Detailed Design

LLVM provides the compiler support required for split stacks [3], while libgcc provides the runtime support [4]. One could also supply the runtime support oneself, by implementing the API described in [4].

For us, as users of LLVM, the work consists of:

  1. Adding a new codegen option, -C split-stack, that, when enabled, adds the "split-stack" function attribute to all functions (except those marked with the attribute below). This ensures that LLVM emits the proper function prologue for split-stack functions.
  2. Adding a new attribute, #[no_split_stack], that, when adorning a function, prevents that function from receiving the split-stack prologue. When writing a split-stack runtime, this attribute would be used on all code paths involved in allocating new stack segments (see the sketch after this list).
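
A minimal sketch of the intended usage, assuming the option and attribute proposed above land as described (the function names and bodies here are purely illustrative, not part of the proposal):

    // Compiled with `-C split-stack`, so every function gets the
    // split-stack prologue unless it opts out.

    // This function runs while the current stack segment is exhausted,
    // so it must not itself trigger a stack grow: it opts out.
    #[no_split_stack]
    fn allocate_stack_segment(_min_size: usize) -> *mut u8 {
        // ...carve a segment out of a preallocated arena, mmap, etc.
        unimplemented!()
    }

    // Ordinary code gets the prologue and may recurse deeply without
    // reserving a large stack up front.
    fn sum_to(n: u64) -> u64 {
        if n == 0 { 0 } else { n + sum_to(n - 1) }
    }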

It’s worth noting that rustc has emitted split-stack code in the past and it is known to work, so this work is not entirely unprecedented.

How We Teach This

Split stacks (and the general problem of stack switching) are difficult to teach and understand, even for those with a background in computer science. A few blog posts about split stacks, with images, would go a long way toward illustrating what's actually happening behind the scenes of a split-stack runtime. We can also lean on existing documentation for split-stack runtimes (mostly Go's) to illustrate what is occurring behind the scenes.

Drawbacks

Like -C panic=abort, -C split-stack requires that all crates linked into the crate being compiled also be compiled with -C split-stack in order to be most effective. Tools like xargo [5] make this quite doable.

LLVM does not support split stacks for every target. We can check for support using the target specification JSON [6].

Unresolved Questions

libgcc allocates memory itself for stack segments, which may be inappropriate for no_std scenarios. no_std users of -C split-stack could certainly provide their own implementation of __morestack and friends. What should we do with no_std?
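
For illustration, a rough sketch of what the Rust-visible half of such a custom no_std runtime might look like. This is a sketch under assumptions: the real __morestack has a nonstandard calling convention and must be written in assembly, and the hook name below is hypothetical, not an established interface:

    use core::sync::atomic::{AtomicUsize, Ordering};

    const SEGMENT_SIZE: usize = 64 * 1024; // arbitrary segment size
    const ARENA_SIZE: usize = 64 * SEGMENT_SIZE;

    static mut ARENA: [u8; ARENA_SIZE] = [0; ARENA_SIZE];
    static NEXT: AtomicUsize = AtomicUsize::new(0);

    // Hypothetical hook, called from hand-written `__morestack` assembly
    // when the current segment is exhausted. Returns the base of a new
    // segment of at least `min_size` bytes.
    #[no_mangle]
    pub extern "C" fn __rust_split_stack_grow(min_size: usize) -> *mut u8 {
        assert!(min_size <= SEGMENT_SIZE);
        let offset = NEXT.fetch_add(SEGMENT_SIZE, Ordering::Relaxed);
        assert!(offset + SEGMENT_SIZE <= ARENA_SIZE, "out of stack segments");
        // Hand out the next fixed-size segment from the static arena.
        unsafe { core::ptr::addr_of_mut!(ARENA).cast::<u8>().add(offset) }
    }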


With 47 bits of address space on 64-bit platforms, I hardly see the need for this. You can easily reserve (not allocate) 64 MB of address space per stack (ought to be enough for everyone ;)) and map in additional pages in that space as needed.

OTOH, if you have a legitimate need for split stacks it’d be neat if the compiler can do the correct codegen. This seems pretty noninvasive.

There are reasons other than that to want split stacks - a GC can get creative when the stack is partitioned into separate regions. For a while (and I think they still do), Go inserted stack barriers into its stacks so that it could perform portions of a goroutine's GC root stack scan while the goroutine was running. Rust code compiled with -C split-stack could do that as well. If the compiler is capable of interoperating with split-stack code (e.g. implementing such a GC) noninvasively, I don't see how it's a bad thing for it to be able to do so.

Also, the compiler support for split stacks doesn't mean that the stacks are necessarily split - it's the runtime portion in libgcc that does that. One could use the compiler support to allocate a large, expandable, contiguous stack. I believe the 1.0 Rust compiler even used it to detect when a task had overflowed its stack (since __morestack gets called when the task is out of stack space).

I would not mind making it possible to experiment with split stacks in Rust again. It’s definitely something I would consider completely experimental, and I wish our feature gating could talk about the difference between stable-track features and experiments.

In general I think it’s totally possible to convert Rust back into a green-threading language again, for at least limited scenarios, as a compile-time option.

The old split-stack implementation in Rust on Windows had the unfortunate property of relying on undocumented system registers and mucking about with thread state in a way that isn't supported at all, and that would actually cause many system functions to crash. If we ever want to support split stacks on Windows again, we'd really need to find a safer, more supported way.


Having worked on a port of Go to another platform, I can say you really don't want stack barriers or similar mechanisms. They're quite painful to implement portably and make it much harder to debug the runtime itself (in my opinion). In fact, Go just recently removed them entirely, having eliminated the need to rescan stacks.


I must admit being curious too.

Go used to have split stacks, but I seem to remember it switched to a single growing stack instead because of performance issues when the "edge" of the split stack landed inside a tight loop (the "hot split" problem). Of course, moving the stack is a tad complicated in Rust, because pointers are not tracked.

On 64-bit platforms, with 47 bits of address space for user space, split stacks don't seem very interesting. You can fit a million 64 MB stacks in 46 bits, leaving half of user space for heap allocations. The only requirement is to physically allocate 8 KB (two OS pages) per stack - one for the actual stack and one for the guard page - which requires 8 GB of physical memory for this million stacks.
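
Spelling out the arithmetic behind those figures:

    2^20 stacks x 64 MB (2^26 B) reserved each  = 2^46 B of address space
    2^20 stacks x  8 KB (2^13 B) committed each = 2^33 B = 8 GB of physical memory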

Performance is likely to be much better than with split stacks, while the benefits are mostly similar.

Is there a concrete use-case that motivates split stacks?


Regarding the RFC itself, there’s an issue with native function calls, which Go has as well.

Each time one calls into a native function, the stack needs to be big enough for that function:

  • users cannot be expected to annotate each and every extern function with #[no_split_stack], which means that all extern functions must "gain" this attribute by default.
  • I expect that most stacks will call an extern function at some point, if only because memory allocation functions call into malloc…

How do you propose to solve this conundrum? Does anyone remember how it was solved in the libgreen era?


In the olden days of libgreen, you had to annotate all functions that called into C code with something that would cause the runtime to make sure there was some large amount of stack space available.

users cannot be expected to annotate each and every extern function with #[no_split_stack], which means that all extern functions must "gain" this attribute by default.

This is true by default - when compiling with -C split-stack, the default is to put the "split-stack" attribute on all functions in the current compilation unit, at the LLVM level. Extern functions won't have this attribute, so LLVM can observe exactly where split-stack functions call non-split-stack functions and insert the correct code (usually a call to __morestack_non_split, which can do a number of things: switch to a larger stack, allocate a new stack segment, etc.). #[no_split_stack] is an opt-out for the current compilation unit and does not need to adorn external functions.
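
A small illustration of that boundary (some_c_function is just a stand-in for any external symbol):

    // Compiled with `-C split-stack`: `calls_into_c` carries the LLVM
    // "split-stack" attribute; the extern declaration does not. The
    // toolchain can therefore see the split -> non-split boundary at
    // this call site and ensure a large-enough stack there (e.g. via
    // `__morestack_non_split`).
    extern "C" {
        fn some_c_function(len: usize) -> i32; // no split-stack attribute
    }

    fn calls_into_c(len: usize) -> i32 {
        // split-stack prologue emitted at the top of this function
        unsafe { some_c_function(len) }
    }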

You're right that code will allocate often and as such could hit performance problems around malloc. Allocating in Rust is not nearly as common as it is in other languages, so I don't think this will be a problem. In cases where it does become a problem, one could write an allocator in Rust and compile it with -C split-stack. If people are writing performance-sensitive experimental code, I don't think it's out of the realm of reasonableness to write a split-stack-aware allocator (especially since that allocator may be allocating stacks itself).

Though now that I think about it, the __rust_allocate symbol has a stable ABI and can't ever be split-stack aware, so that's a problem.


Realistically, as @matthieum so eloquently summarised, there’s really no point to doing this on 64-bit. However, this may be very interesting for anything 32-bit and less, particularly in embedded applications.

As for Go, it solves the issue of native function calls by ensuring that they always happen on the "system stack" - the stack the OS gave the thread when it started. However, as you can imagine, this carries a significant performance penalty, since the active stack has to be "switched out" on every native function invocation and "switched back" upon return.


Out of curiosity, what does “switching out” the stack (and back) involve? Do you just have to save the stack pointer and restore it, with the performance hit being due to cache misses? Or is it more complicated than that?

Basically, switching out the stack means saving the current register state, copying any stack-passed arguments over to the new stack (some architectures/platforms require this), and then making the function call. And then obviously, when the function returns, performing roughly the reverse.

For example, see the innermost part of that implementation for Go on Linux amd64.


The only requirement is to physically allocate 8 KB (two OS pages) per stack - one for the actual stack and one for the guard page - which requires 8 GB of physical memory for this million stacks.

This does prevent the use of hugepages for stacks, though. And page faults are not exactly free.

For the record, guard pages don't require physical memory backing, so it would be 4 KB of physical memory per thread. Which is still at least an order of magnitude more than needed, in many cases…
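
For concreteness, here's a sketch of the reserve-then-fault scheme being discussed, for Linux, using the libc crate (the sizes are just the ones from this thread):

    use libc::{mmap, mprotect, MAP_ANONYMOUS, MAP_FAILED, MAP_NORESERVE,
               MAP_PRIVATE, PROT_NONE, PROT_READ, PROT_WRITE};

    const STACK_SIZE: usize = 64 << 20; // 64 MB of address space
    const PAGE_SIZE: usize = 4096;

    /// Reserves a stack; returns (lowest usable address, initial stack
    /// pointer). Only address space is consumed up front - physical
    /// pages are faulted in on first touch.
    unsafe fn reserve_stack() -> (*mut u8, *mut u8) {
        // Reserve the whole range with no access rights and no swap
        // accounting; this costs address space, not memory.
        let base = mmap(std::ptr::null_mut(), STACK_SIZE, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        assert_ne!(base, MAP_FAILED, "mmap failed");
        // Everything above the lowest page becomes usable; the lowest
        // page stays PROT_NONE and faults on overflow (stacks grow down).
        let usable = (base as *mut u8).add(PAGE_SIZE);
        let rc = mprotect(usable.cast(), STACK_SIZE - PAGE_SIZE,
                          PROT_READ | PROT_WRITE);
        assert_eq!(rc, 0, "mprotect failed");
        (usable, (base as *mut u8).add(STACK_SIZE))
    }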


The problem is not physical backing. A hugepage can be as large as 1 GB (on x86-64), and a hugepage-backed mapping faults in at hugepage granularity - you can't fault in a single 4 KB page of it as the stack grows.

So this means you’ll have to stick to 4k pages to map the stack. For a million stacks it’s definitely not free - page table manipulation requires locking (in the kernel) and creates pressure on TLBs.
