Did Rust ever consider cactus stacks?

Recently I've been thinking about how Rust might be able to benefit from cactus stacks. In particular, if Rust used cactus stacks it might help solve multi-threading issues where a thread's stack grows into a different thread's stack.

But cactus stacks are an old idea, and I know that at least a few of the core developers are aware of them, so I have to ask, did Rust ever consider using cactus stacks? If they were considered and rejected, what were the reasons for doing so?

This doesn't seem particularly related to the Rust language? I don't think there's anything in the language that would preclude a compiler from using this. See e.g. https://crates.io/crates/stacker. However, on modern 64-bit memory address architectures, there's plenty of address space available to practically avoid the problem of stacks running into each other.
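For illustration, stacker's maybe_grow is the relevant entry point: it checks the remaining stack and, if you're inside the red zone, runs the closure on a freshly allocated segment (the sizes below are arbitrary example values):

```rust
// Grow the stack on demand before a deep recursion.
fn visit(depth: u64) -> u64 {
    // Ensure ~32 KiB of headroom, allocating 1 MiB segments as needed.
    stacker::maybe_grow(32 * 1024, 1024 * 1024, || {
        if depth == 0 { 0 } else { visit(depth - 1) + 1 }
    })
}
```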

@jethrogb you're right that this isn't Rust language related, but it is Rust compiler related. That's why I posted.

As for stacker, I appreciate the link, but I suspect that psm is closer to what would be needed here. That said, manually manipulating the stack is probably not the best idea, especially when the compiler could be handling this issue.

Finally, while you are largely correct about modern 64-bit memory address architectures, embedded Rust is a thing, and 8-bit embedded systems are still being sold. So while I agree that on modern desktop/laptop machines you can choose to separate your thread stacks by a large amount, that leaves smaller systems out in the cold.

I hope that all makes sense as to why I posted.


Rust the language doesn't know and doesn't care about the stacks. It's just some default place in memory where local variables go. There is literally nothing about stacks in the definition of the language (although the heap is, strangely, present, because Box must allocate on the heap). The specific implementation of the stack is provided, I believe, by the codegen backend. I think that you could write a LLVM target which would have cactus stacks, and then Rust would automatically use them.

You don't need a cactus stack for that -- there is no need for different threads to share parts of their stacks.

Cactus stacks are for a different, more complicated scenario, where you don't want to drop a stack frame as soon as you exit the function, because you want to keep references into the stack frame. The frames are then garbage collected when they are no longer needed.

Rust doesn't allow references into the stack frame to outlive the stack frame, so it doesn't need this.

What you might want for multi-threading is a simpler data structure: stacks that are discontinuous in virtual memory, but still with no sharing, so no garbage collection is needed. That is a good approach; the only alternative is allocating each thread's stack upfront in virtual memory when the thread is created (which is the approach currently taken).

Note that on x86-64 the virtual address space is only 48 bits wide, which means that if you statically allocate 16 GB per stack (which might still be less than physical RAM), you will use up all the virtual space after 16k threads (OK, that is a lot, but not unthinkable).

Incidentally this also means that 64-bit pointers are always wasting at least 25% of their bits on x86-64.
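To spell out the arithmetic (just a sanity check of the numbers above):

```rust
fn main() {
    let vaddr_bits = 48u32;
    let space = 1u64 << vaddr_bits; // 256 TiB of virtual address space
    let stack = 16u64 << 30;        // 16 GiB statically reserved per stack
    println!("max threads: {}", space / stack);            // 16384, i.e. 16k
    println!("unused pointer bits: {}", 64 - vaddr_bits);  // 16 of 64 = 25%
}
```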


5-level page tables and 57-bit pointers are coming. At the very least, the mainline Linux kernel already has support for them.

Though with the ubiquity of NaN boxing (which leaves at most 52 bits to hide a pointer in), I don't know if we'll ever see it enabled by default in a consumer OS...
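For anyone unfamiliar, NaN boxing hides payloads in the mantissa of a quiet NaN, so pointers must fit in those bits. A minimal sketch (the tag layout here is made up for illustration; real engines differ):

```rust
// Quiet-NaN pattern: exponent all ones plus the quiet bit.
const QNAN: u64 = 0x7ff8_0000_0000_0000;
// Illustrative tag: use the sign bit to mark "this NaN carries a pointer".
const PTR_TAG: u64 = 0x8000_0000_0000_0000;
// 48-bit pointer mask; a 57-bit address would no longer fit.
const PTR_MASK: u64 = 0x0000_ffff_ffff_ffff;

fn box_ptr(p: *const u8) -> u64 {
    QNAN | PTR_TAG | (p as u64 & PTR_MASK)
}

fn unbox_ptr(v: u64) -> *const u8 {
    (v & PTR_MASK) as *const u8
}
```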


I expect that would look like the 3GB large-address-aware flag in 32-bit Windows, but that was when processes were really getting down to the wire. I can't imagine what workload would need more than 48 bits of address space in one process, let alone 52 bits.

My understanding is that 57-bit virtual addresses still only allow 52-bit physical addresses, i.e. 4 PB of RAM -- the same limit as 48-bit virtual addresses, which also allow up to 52-bit physical addresses.

It is convenient to have some slack in virtual space because it simplifies memory allocation, allowing some fragmentation. If pointers already have 64 bits anyway, you might as well use all the bits. An example is given by this discussion -- it allows simpler memory management for multi-threaded stacks.

Note that the 48-, 52-, or 57-bit limits are defined by currently manufactured hardware and OS implementations. The x86_64 ISA itself allows memory addresses up to the full 64 bits. That's why loosening the limit shouldn't break any userland programming interfaces. Of course, there is software that assumes such limits will never change, but that's more of a short-sighted bug, like using an int type to represent a Unix timestamp.


Rust used to have segmented stacks back when we had green threads, but they were removed as we moved away from green threads and added support for running without an OS (Allow linking against crates with #[no_std] -- reopening by alexcrichton · Pull Request #7924 · rust-lang/rust · GitHub), as segmented stacks need an OS to allocate memory.

In order to have a true "runtimeless" executable, we can't emit segmented stacks. By emitting segmented stacks, we currently then place a dependence on libmorestack, which itself has a dependence on librustrt for allocating new stacks, and the whole point of this is to get away from the runtime. For this reason, I added a new -Z no-segstk option to disable LLVM emission of segmented stacks.

Another issue with segmented stacks is that they add overhead to every call, to check that the stack hasn't been exhausted yet. In addition, when making calls in a loop with the current segment almost exhausted, every call will allocate a new stack segment and deallocate it on return, which is slow (the "hot split" problem).
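In pseudo-Rust, the per-call check looks roughly like this (illustrative only; rustc emitted the real check in the function prologue via LLVM, calling __morestack on the slow path):

```rust
/// Returns true when the call must take the slow path and
/// allocate a new segment before running the function body.
fn needs_new_segment(sp: usize, frame_size: usize, segment_limit: usize) -> bool {
    // Stacks grow downward: the frame fits only if sp - frame_size
    // stays above the current segment's limit.
    sp.wrapping_sub(frame_size) < segment_limit
}
```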


Ah cool!

This could be implemented by catching and handling a page fault, so that the overhead only happens rarely when you exhaust a segment.
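A sketch of that approach (Unix-only, assuming the libc crate; error handling omitted): map a segment with an inaccessible guard page at its downward end, so overflowing the segment raises a fault the runtime can catch, instead of paying a check on every call.

```rust
// Allocate `pages` usable pages plus one PROT_NONE guard page.
unsafe fn alloc_segment(pages: usize) -> *mut u8 {
    let page = 4096;
    let len = (pages + 1) * page; // +1 for the guard page
    let base = libc::mmap(
        std::ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANON,
        -1,
        0,
    ) as *mut u8;
    // Stacks grow downward, so make the lowest page inaccessible;
    // touching it raises SIGSEGV, which a runtime could handle by
    // chaining in a fresh segment.
    libc::mprotect(base as *mut _, page, libc::PROT_NONE);
    base
}
```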

This can be solved by hysteresis: don't deallocate a memory segment for a while even after it has become empty. For instance, you could wait until you have two unused segments, and then deallocate one. It's the same idea as with a Vec implementation that shrinks capacity: it can be done in a way that amortizes well.
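A minimal sketch of that hysteresis rule (SegmentPool is a made-up name; the point is just "keep one spare, free the second"):

```rust
struct SegmentPool {
    spare: Option<Vec<u8>>, // at most one cached empty segment
}

impl SegmentPool {
    /// Called when a segment becomes empty.
    fn release(&mut self, seg: Vec<u8>) {
        if self.spare.is_none() {
            self.spare = Some(seg); // first unused segment: keep it
        }
        // second unused segment: let `seg` drop here, freeing it
    }

    /// Called when the stack needs to grow again.
    fn acquire(&mut self, size: usize) -> Vec<u8> {
        // Reuse the spare if we kept one, avoiding an allocation.
        self.spare.take().unwrap_or_else(|| vec![0u8; size])
    }
}
```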

You would still encounter a lot of page faults with that solution if you keep bouncing between segments; maybe there is a way to solve that too.


I think having a standard and boring usage of the thread stack is an important feature for Rust as a low-level systems programming language. Predictability and compatibility are important there.


The problem with this approach, though, is that it burdens the programmer with deciding on a stack size that they have no way of bounding, because the stack frame layout is outside their control.


That's an OS/architecture problem. High-level languages with VMs, green threads and segmented stacks are free to work around and abstract it away. Rust as a systems programming language should expose the system as it is.

To me segmented stacks are in the same boat as GC. A great idea in general, a usability improvement for majority of users, good fit for nearly every high-level program, but also not suitable for Rust.

So many good replies in this thread, thank you all!

Someday, if I ever learn enough about rustc and how to write a proper backend, I think I'd like to investigate this because of something @tczajka and @bjorn3 were talking about.

Strangely enough, this is close to the solution that I was kicking around in the back of my mind. You wouldn't allocate space on a per-frame basis, but on a per-page basis: the stack grows within a page until it needs more space than the page can provide, at which point a new page is allocated, with a pointer back to the previous page, turning the whole thing into something that looks like a linked list. If for some reason you need to grow the stack by more than a single page, you jump out to the global stack frontier and allocate as many contiguous pages as you need. If this sounds like stack and frame pointers, but very slightly twisted, you're right, that is where I was getting my inspiration from.
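Here's a rough sketch of that structure (all names hypothetical; a real version would live in the codegen backend, not in library code):

```rust
const PAGE_SIZE: usize = 4096;

struct StackPage {
    data: Box<[u8; PAGE_SIZE]>,
    used: usize,                  // bytes consumed in this page
    prev: Option<Box<StackPage>>, // link back to the previous page
}

struct PagedStack {
    top: Option<Box<StackPage>>,
}

impl PagedStack {
    fn push_frame(&mut self, frame_size: usize) {
        // Frames larger than a page would instead go to the global
        // stack frontier as a contiguous multi-page allocation.
        assert!(frame_size <= PAGE_SIZE);
        let need_new_page = match &self.top {
            Some(p) => p.used + frame_size > PAGE_SIZE,
            None => true,
        };
        if need_new_page {
            let prev = self.top.take();
            self.top = Some(Box::new(StackPage {
                data: Box::new([0; PAGE_SIZE]),
                used: 0,
                prev,
            }));
        }
        self.top.as_mut().unwrap().used += frame_size;
    }
}
```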

As for why I was thinking about this, it's because I wanted to leverage the MMU as far as possible, so that cold parts of the stack get swapped out for the hot parts efficiently. Not sure if that is a good idea, but it was what was kicking around in the back of my head.

At this point I'm reaching the limits of my understanding of how things are done 'under the hood', so if what I'm about to say is dead wrong, please forgive (and correct!) my misunderstanding.

What about futures? As I understand it, right now you need to move or copy anything that the future references into the future itself, for exactly the reasons you just mentioned. If we had a cactus stack, would it be possible to have something like Arc<StackMutex<T>> that prevents the collection of a stack frame until all futures drop it? I think this might be safe to do on a stack, as the stack structure should form a DAG, so there'd be no problem with circular garbage never being collected[1].
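To make the shape concrete (StackMutex is purely hypothetical -- nothing like it exists; this is just the API I'm imagining):

```rust
use std::sync::Arc;

// Hypothetical: a handle that pins a cactus-stack frame's storage
// until every clone of the handle has been dropped.
struct StackMutex<T> {
    value: T, // would really live in the pinned stack frame
}

fn hand_to_future(frame_data: Arc<StackMutex<u64>>) {
    // Each clone would keep the frame's segment alive, the way Arc
    // keeps a heap allocation alive; the frame is collected only
    // when the count reaches zero.
    let _kept_by_future = Arc::clone(&frame_data);
}
```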

That said, I acknowledge what @bjorn3 said earlier about running without an OS. I don't know enough to know if it would be possible to create a data structure that 'looked like' a stack to all threads/tasks that would work without OS support.

Now I'm starting to wish that Rust had an equivalent to GlobalAlloc for the stack so that we could experiment with different ideas!


  1. That is just speculation on my part, until and unless someone really puts together a proper proof that this is true I wouldn't trust it. ↩︎

At that point it is already too late to switch the stack. It is also way too heavy, and with a true segmented stack it is rather common to exhaust a segment.

I understand what you're saying, and agree that if it requires a massive runtime or an OS to actually work, then it's inappropriate for Rust. What I'm not yet convinced of is that a runtime/OS is actually required. I will fully admit that I'm at the very edge of my knowledge, so it may well be impossible without a runtime or OS, but I'm not yet convinced of that.

Why is it too late to switch the stack at that point?

Also, do you have any performance numbers to point to for this being too heavy? I'm genuinely curious (and now I really, really want the stack equivalent of GlobalAlloc so that I can run experiments!)

Because there may already be existing references to the current stack frame, which means you can't move it. But if the stack is too small, you would have to move the current stack frame to the new stack segment. These two requirements conflict.
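A tiny example of why the frame can't move once it's borrowed:

```rust
fn caller() {
    let x = 5u32;
    let r = &x; // `r` holds the address of `x` inside this frame
    callee(r);  // if callee() needed more stack and moving this frame
                // were allowed, `r` (and any copies of it) would dangle
}

fn callee(_v: &u32) {}
```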

But... isn't part of the point of cactus stacks to solve that issue? The current stack frame won't be moved; you jump to a new segment that is large enough to handle the allocation request. Or do you mean that you don't know a priori how much stack space you need (you don't know how large your frame is going to be), a different thread/task allocates the next page, and then you can't grow the current frame?

I'm sorry if I'm being a little dense; I'm really trying to pin down the precise problem so I can fully understand the limitations. Once we know them completely, we can look for solutions.