Official support for a new coroutine runtime in Rust

A truly scalable async I/O system not just needs but actually requires the capability of balancing the load between processor cores. While it is already possible to use async I/O in Rust none of those approaches fulfill this requirement.

As part of rust-lang/rust#33368 @zonyitoo and me realized that our project coio-rs might probably one of the fastest I/O libraries right now, but will never safely work with Rust due to the way the stdlib works right now - all possible solutions outlined in the linked issue require larger, non-trivial changes.

At the same time we still think that it’s a good idea for Rust’s developers to investigate into a “libgreen v2” in the near future, because while it was kind of a bad choice for most software at the beginning it’s still an integral part for server systems.

Would it thus be feasible to “fork” the current libstd and provide a (semi-)official one with support for coroutines? We were able to gather quite some experience with Go’s runtime and how to port it over to Rust, but continuing our work means that we need a tighter integration into libstd and the only way we see forward is a “tighter integration” with the Rust core and lib teams.

We’d of course be willing to take the burden of writing all the necessary code, but we fear that this can only happen under such a semi-official guidance, because otherwise all our work could be for nothing, whenever some truly breaking change happens in the stdlib etc.

What do you guys think? :slight_smile:

4 Likes

I am still enthusiastic about coroutines in Rust and want to reduce barriers to making progress.

I do think a friendly fork is a good solution. Even if we don’t add support for coroutines to mainline Rust in the near future, it is totally feasible to implement a library that looks like std, but built on top of coroutines, and we should want to make that happen. To make this possible is one of the reasons we’ve worked so hard to subdivide std into reusable pieces already.

We need to figure out how you can experiment with a coroutine-std out of tree that shares much of the code with std, but doesn’t impose a great burden on the std maintainers.

Refactoring std could be a part of the story here. So could config-specific compilation scenarios.

Edit: based on the two above links, the way I imagine a fully-couritinified Rust working at a high level is that std is implemented on top of a platform abstraction layer (std-pal) that already has a rough definition in std today. Coroutine-Rust implements it’s version of std-pal that stock std links to, and simultaneously enables a global feature flag indicating that the entire world is compiled with coroutine support. The ecosystem can slowly be audited for compatibility and opt into this world with a yet-undesigned mechanism. Although that doesn’t allow you to mix coroutine-Rust with regular Rust, it seems like a good direction to head in that’s well aligned with existing goals.

3 Likes

If I understand you correctly this std-pal would be what you linked as refactoring std, right? If that’s case I do agree that this would be the best (and cleanest) solution for this.

What we had in mind is to add a feature flag, allowing you to import additional, coroutine-specific functionality (like spawning coroutines, setting the size of the thread pool, etc.), and a flag for the main function to automatically insert the setup code of the scheduler. We would have added our coroutine specific code into libstd directly by simply using #[cfg(...)] attributes to swap out entire modules or specific implementations. Splitting it up using that std-pal would obviously be better though of course.

So what’s the way forward? Do you want to wait for this std-pal idea to be implemented first? Or do you want to give it a try right now? We’d be ready to work on anytime now and we’re basically only waiting for at least some kind of endorsement by “the official” Rust developers - not because we need the publicity but only because of the simple reasons I outlined in the original post. So I’d be cool if you, @brson, could talk with the others about this sometime soon and tell us if this entire thing is feasible and what your general requirements for this project would be, or whether you have any wishes and so…

P.S.

We’ll soon start writing a library similiar to ASIO for Rust here (including the proactor design etc.), due to the longstanding issues with mio’s bad code quality and it’s rather inactive maintainer (*). If coroutines will not be adopted, this library could then at least be used as a stable, high level networking library, or if they are be useful as the runtime backend for it (which handles the polling).

(*) I’m sorry if that sounds mean - I’m saying this as professionally as I can, but mio has really proven itself to be a serious bottleneck and causes a lot frustration in greater parts of the Rust community.

Yes.

There's probably multiple things we can do to ease the work, and refacting std is a big project itself. I do prefer to avoid doing experimental coroutine work directly in tree - rather would add hooks and refactorings to make it easier to do the work out of tree, perhaps with a small patch set against std. When you are getting into 'swap out entire module' territory that does sound to me like it calls for some work making it easy to do that swap, which leans in the 'std-pal' direction.

It's hard to say without a better understanding of what modifications to std are needed. At the least, it requires a different TLS implementation right? What are the other big pieces? Can we limit the scope to start such that we don't need to make std::io coroutine aware? As in, let's fix the horrible problems that make coroutines in Rust not work at all first, then move on to making std itself built on top of an event loop so that I/O doesn't block.

We don't need to get a full pal working to move on this, can move in that direction piecemeal. As just a guess about what's needed I might expect to pull the entire TLS implementation into its own crate. Then out-of-tree you can maintain a build that replaces the TLS crate with coroutine-aware TLS. That refactoring can feed into the larger pal as it evolves.

cc @alexcrichton, @carllerche and @pcwalton who have some experience and/or interest in the matter.

My personal opinion is that coroutines are an evolutionary dead-end and lower-overhead “large array of state machines” has a better future, at least for projects that don’t try to run away from the bare metal.

I don’t have anything against people using Go-like frameworks (with or without forms of GC), but I wouldn’t and AFAIK e.g. Servo or high-throughput servers also wouldn’t.

1 Like

As far as I'm aware there is only a single required modification for this to work: Making the accessor methods of the thread_local!() macro #[inline(never)] - simple as that. This obviously leads to a heavy slow down for software which makes heavy use of TLS, but it should be fine for a coroutine based server. The biggest hit will probably take the poison checker for the Mutex, but it remains to be actually benchmarked to decide wether there is any relevant slowdown at all in an average production application. A coroutine-aware TLS is something a lot of people recommend (i.e. to redirect the TLS accessors to a coroutine-local storage), but I think this would a bad idea because it would be in the order of magnitudes slower and doesn't improve the actual use-case for TLS: concurrency without locking.

All other changes to the stdlib are mostly related to the std::net module, where some code has to be added to catch a possible Err(WouldBlock) return value from reading/writing and to park the current coroutine in that case. Basically a loop and a wait() call. While the std-pal approach is preferrable the alternative wouldn't be that hard either.

The reason we don't just swap out the TLS implementation is that the concern came up that if our coroutine library provides an API which is functionally equivalent to the (blocking) stdlib API but not exactly the same (i.e. not actually the stdlib) it still will only be either a fully opt into our library by using our API or otherwise they have to use something entirely different. Most people will choose the latter to gain a better compatiblity I believe. So modifying the stdlib to transparently use coroutines under the hood seemed like a good idea.

So no: We can't just swap out TLS alone, because this will only make coroutine libraries possible. While this would be great, it'll not be that useful. Also for Rust to not contain a officially sanctioned way to easily build scalable servers in 2016 does seem a bit backwards to me… So I strongly believe we should build something great together to pinch a piece off of Go's glory - it's long overdue.

My personal opinion

The equivalent to a state machine is in fact a stackless coroutine. They are basically identical - it's just a matter of expressing yourself differently in your code. Stackful coroutines have the same abilities, but add a lot of benefits you'll never get with state machines, while incurring a small performance hit. I mean… you're using coroutines instead of state machines for e.g. parsing some data (i.e. in a tight loop) you're doing it wrong anyways and it's obvious that you'll get a slowdown in the order of a magnitude thanks to the missing inlining.

If you're using them like they're intended to thoigh (for server systems) they allow you to move local state and it's data between threads, which is something extremely important. This enables you to provably build scalable server systems, since it allows you to ensure that all your CPU cores share the load equally without some of them running out of work spuriously (which really is hard to prevent otherwise – in the order of the compexity of writing good cryptography library). There are also huge concerns regarding unfairness and jitter in industries where it matters (for instance the financial sector), all of which can safely and easily be prevented by using stackful coroutines and a good scheduler. There are a lot of papers and studies out there analyzing good approaches for this in great detail. You should read them. :slight_smile:

Well maybe you can also achieve it using state machines or promises but to get the same safety benefits you have to then go and sprinkle your code with "annotations", telling the runtime where it can move your local state between threads. Oh and then you notice that your code is actually slower even though it has the benefit of inlining, because coroutines have the benefit of cache locality.... and... and you're back at square one: a matter of style and indistinguishable performance.

P.S.: Using context-rs which we wrote for this we get a very good performance for context switches. It's so good that a context switch is actually faster than a function call on my i3770 using Linux x86_64.

I’ve seen how fast context switches can be, libfringe can do them in 2.5ns by using inline assembly to get LLVM to generate the minimal amount of instructions necessary to spill registers in use and reload them later.

What I’m worried about is the memory footprint and allocation overhead of stackful coroutines limiting scalability.

However, until we have the ability to build a relatively complex backend server in both and actually benchmark them against each-other (complete with performance counters - instructions executed, cache misses - energy efficiency measurements, etc.), the discussion is somewhat single-sided.

Ergonomic stackless “coroutines” might not happen this year, although impl Trait + @alexcrichton’s futures library seem to be a beachhead into that space, with @erickt’s stateful experiment and more integrated versions of generators and/or “resumable functions” filling the ergonomic void later.

Specifically, I don’t want to oppose any effort in either direction, but I would like to sort of keep both out of mainline Rust before we can make an informed decision.

@brson @aturon Do you think forking the compiler is a good way to test alternative strategies? At this point, it should be possible to experiment with state machine transforms on the MIR, although blocking that on getting it into Rust proper may require going through lots of bikeshed before a chance for experimentation on real-world scenarios.

EDIT: This sort of happened with impl Trait where my initial implementation got some use by one or two people in the community.

I think the point of the std scenario stuff is to long-term minimize forking the ecosystem over these sort of things by maximizing the potential for “accidental portability” to be exploited.

Also, if I may self-advertise here :slight_smile: I have a souped-up fork of fringe and a long standing desire with (warning, currently bit-rotted and incomplete) https://github.com/RustOS-Fork-Holding-Ground/rust-net to build a network stack which is agnostic to stack vs stackless coroutines. The APIs would is closures/callbacks/stacklass (currently done the hard way but yay await!()), and with further modifications to libfringe would allow effectively reifying the current continuation into a closure to fake stackless from stacked.

So tl;dr stacklass and stack need not be enemies.

Important to clarify: are we talking about swapcontext-style coroutines with a large (1MB+) stack and guard page, or GHC/Go style coroutine stacks that start out ~4KB and are movable?

I fear this thread might be digressing from the original point I wanted to make - maybe this discussion is good, but personally I'm really not all that interested in having a "C1M" server capable of handling millions of connections. What I and many others in the Rust community want instead is a stable, by the Rust developers endorsed library, which is simple to use and can easily be integrated into libraries and applications. Honestly a "c10k" would be enough at this point, as long as something stable comes out of it, which we all can rely on.

The reason I choose stackful coroutines is because they can provide an interface which is mostly identical to the already existing synchronous I/O facilities. Furthermore they are very likely to provide safety guarantees which no other solution can provide.

If stackful coroutines are really that unacceptable, it would probably also be fine to use stackless ones (async/await), or to use simple Promise/Futures. Because if we don't have a good official/endorsed solution soon some people will settle with library A, while others will choose B and so on. And this will very likely ultimatively create the same fractured library situation which C++ has - or "had", because even they seem to adapt ASIO as the official solution soon enough. I like Rust and I'd do a lot to not see this happening here.

I personally feel like asynchronous I/O is a highly underappreciated topic for the Rust developers, since progress on this seems far slower than it should be in the age of millions of connections and the IOT.

Now onto the offtopic:

context-rs is actually a bit faster (~25%), all while being capable of additionally transferring a usize over. It can also invoke a callback function after a switch (after which you could for instance safely take out the coroutine handle from the processor). :wink:

Movable stacks can only exist with languages, which use a garbage collector. So Rust won't have that. Go uses stack copying which is a lot better than segmented stacks but also only possible with a garbage collector.

I was previously talking about "large" stacks with guards. Those stacks would allocate memory using mmap of course, possibly using MAP_NORESERVE to prevent the OOM killer from killing your application. Thanks to virtual memory they in fact also start out at 4KB physical memory. We could provide options to manually specify the size of a stack and even whether it should have a guard page. I've observed my small HTTP server (similiar to those for C1M demos) to use about 16KB of physical memory in average per connection which would result in 16GB physical and in my case 128GB virtual memory usage. I think it would be a good thing to have segmented stacks in Rust sooner or later anyways and then you could swap between using those, depending on your needs.

One could argue now that state machines and the like use less memory - which is true - but then again you will have to fight the issue that every second connection doesn't receive any data because they are coincidentally treated more unfairly. Your average software developer will then have to fiddle around with the code a long *** time before he figures out that he continues reading from a socket right out of it's completion callback, which makes other connections starve randomly, because the OS accidentially has always data ready for the same couple sockets.

People thankfully don't underestimate the complexity of developing safe crypto functions anymore. Issues like timing attacks etc. are simply things developers normally don't think about but are very critical there. I recommend to not do the same mistake in server systems to underestimate the effect of randomness. Making it safe to use for average joe without forcing him to think about every little detail should IMO have a higher importantance here.

BTW: I invite everyone to join https://gitter.im/rusty-io/asio to discuss pros/cons of any approach and their possible solutions in detail. While I do feel a strong preference for stackful coroutines I’m open to better suggestions, but I still want to go forward actually doing something as soon as possible. :slight_smile:

Oh and I should note that it’s not my intention for this to be actually in mainline Rust. rust-lang-nursery or at least some sort of cooperation is more of what I had in mind, since this won’t be fully stable for at least 1-2 years anyways. Oh and of course: A want a working solution. Yesterday if possible. :smile:

The swapcontext ones are the only option for Rust atm, which doesn't have a relocating GC.
The stack may be allocated as a few MB in virtual memory but it doesn't have to use that much of physical memory and I believe linux won't even preallocate the entirety of regular thread stacks.

You still run into issues with page table overhead.

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.