Scoped threads in the nursery (maybe with rayon?)

nikomatsakis · March 10, 2017, 4:55pm

So @steveklabnik in the Crossbeam: request for help thread wrote the following (and @bstrie / @Amanieu also commented on this):

I also agree that we want some nice scoped-threads library in the nursery. However, I don't think it should be crossbeam's scoped threads -- or at least not only that. I've been thinking about proposing rayon and rayon-core move there.

The key difference is one of philosophy: rayon's scoped threads are organized around a global thread-pool and are targeting data parallelism. Basically, CPU bound tasks where you have a lot of data to get through. I think this is very common and in particular I think it's the case where you want references into the stack most acutely. In such cases, you don't want to request a worker thread that is guaranteed to be in a distinct OS thread. You want to tell the library where parallelism is possible and let the runtime do load-balancing. This is what Rayon aims to do.

I think it would be a mistake to offer only the "make a real O/S thread primitive" in the nursery, since people will use it and performance will suffer as a result. This is true even if people write their own thread-pools: in particular, a common pathology in C++ code is that every library has its own thread-pool, and hence the CPU is wildly oversubscribed. Having a global thread-pool abstraction (like Rayon) in the platform might help us avoid that fate.

That said, if there is a strong reason, it'd be possible to grow Rayon with a "fixed O/S thread" scoped-threading mechanism. But I'm not sure I'd want to. The main use case is blocking I/O, and I think it'd be better to code on a Rayon-style threadpool and use futures (which Rayon now supports, at least in preliminary form) to manage your I/O.

On a related note, we recently factored the rayon crate into two crates (not yet post to crates.io). The rayon-core crate contains only the core threading APIs: join (which forms and joins 2 threads) and scope (which forks and joins any number of threads), as well as the ability to create thread pools. (Rayon also lets you inject "asyncronous" tasks into the threadpool, but they must have a 'static bound.) These are pretty fundamental APIs for any thread-pool.

The rayon core layers on top the parallel iterator APIs. The idea is that rayon-core will live at 1.x forever, but we can make more fundamental changes to the iterator and user-facing APIs. The reason for this is that if we ever have two major versions of rayon-core, that will introduce two threadpools, which would be inefficient and can create deadlocks.

steveklabnik · March 10, 2017, 5:22pm

I would certainly be happy to see rayon in the nursery regardless of a specific “scoped threads” crate.

When investigating this, I found that scoped_threadpool's implementation was much, much smaller, and gives you basically the same thing. It does require a pool, and so it’s not the same interface, but it seems like if we were to do it, that approach would be nicer.

However, it sounds like rayon-core might fit too.

Given the small surface area of such a crate, I’m largely interested in something we could get up to snuff fairly quickly; I think we’ve held off doing this for far too long for various reasons, and keeping the interface simple is good.

cuviper · March 10, 2017, 5:50pm

That's a great ideal, but the blocking is not always under your control, especially if you're dealing with FFI. So even with the best Rust code, there may still be reasons to punt a non-CPU-bound task (or even mixed CPU/IO) to a different thread for a while.

I'm not sure what integration into rayon-core looks like here. Maybe as some kind of yield? i.e. "Run this in its own thread, and if we're part of the threadpool then try to steal some tasks while we wait." It's important to continue its participation in the threadpool to avoid re-entrant deadlocks, as we've discussed before.

bstrie · March 11, 2017, 7:41am

Would it would be useful to have Niko and Aaron produce a document containing a high-level overview of what sort of concurrency story they envision for Rust between crossbeam and rayon? Concurrency is one of Rust’s greatest strengths, and I’d love for there to be a more unified blessed solution to demonstrate that.

nikomatsakis · March 11, 2017, 6:06pm

That's fair. It'd be good though to drill down on precisely what problem we are trying to solve. For example, if it's really about wacky FFI things, one possibility might be to create a special purpose thread-pool that just executes the FFI call and then returns. You can then wrap up this call into a Future so that, in Rayon land, you spawn a future into this "async pool" and then invoke rayon_wait() to wait for it.

(I agree there is some danger of deadlock, but it seems fairly remote. You'd have to have this blocking FFI call wind up blocked on a rayon computation -- it seems more likely to be blocked on external I/O or something like that. But, in any case, having the ability to dynamically grow and shrink the number of threads in the threadpool still seems like a good long-term ambition.)

jethrogb · March 13, 2017, 5:08pm

I’m using scoped threads extensively to build my own scheduler just to make the lifetime issues go away. I’m not opposed to a higher-level abstraction but I think the lower-level primitive of an OS thread that is guaranteed to have exited before some point needs to be provided too.

gnzlbg · March 14, 2017, 12:56pm

I agree with the goal of your proposal, and think that it would actually be a great idea to have parallel iterators, a thread-pool scheduler aimed at data-parallelism, and scoped threads in the nursery.

However, say I need to use my own scheduler for “reasons” (e.g. the same reasons why one might have different reactors when using futures). I haven’'t used rayon for a while, is there now a way to:

switch the rayon scheduler for a binary (e.g. such that all libraries used by the binary switch to that scheduler)?
choose the scheduler for a particular parallel iterator invocation?

If these things are easily possible I wouldn’t have any objections about moving rayon into the nursery.

Otherwise, I would be worried that we would be kind of standardizing a particular scheduler that we cannot change or configure afterwards.

nikomatsakis · March 14, 2017, 7:11pm

No, there is no ability to customize a scheduler. I too would like this but I think there is no good mechanism to achieve it that I see. I think however that this need not be a blocker to having support in the nursery.

I guess I think that:

I want there to be a default scheduler in the nursery. Ideally, it would also be configurable, but i’m not sure how best to tackle that problem without incurring dynamic overheads.
- In particular, I don’t want fork to invoke dynamic code or something like that.
- Some kind of generic “require-provides” mechanism like we use for the default allocator might be a good fit here.
I think if we only include in the nursery various low-level libraries (e.g. scoped threads) then we will get the “too many schedulers” problem.

So perhaps another way to put it is that I view rayon-like interfaces (e.g., join, scope+spawn) are really kind of a primitive in the language in much the same way as an allocator. I’d like there to be a way for lots of people to write code against those interfaces while still providing the ability for applications to control the “default scheduler” at a coarse-grained level.

gnzlbg · March 14, 2017, 7:41pm

I think we need a path forward for a way of switching the global scheduler before making one "the defacto" scheduler.

Even when it was impossible to change the global memory allocator, there was a path forward to a solution. IMO both problems look too much alike, and I've also had the need of switching a global "logger" as well (on an MPI application with 100k processes you don't want 100k log files, but that all or some processes write to a single file), so this might be a more general pattern than just something special to allocators.

I also would expect that the parallel iterators are actually invoked with .par_iter_on(scheduler) that allows the user to select the scheduler that should run the parallel computation. In this world, .par_iter() would just call .par_iter_on(global_scheduler). Either that or they always return a computation that one needs to pass to a scheduler to obtain a future.

rayon-like interfaces (e.g., join, scope+spawn) are really kind of a primitive in the language in much the same way as an allocator.

I think that the scheduler is the primitive that is like the allocator (e.g. like a std::collection). OTOH rayon-like interfaces are like algorithms on slices or iterators, they collection (or in this case scheduler) agnostic (up to some point).

So... if rayon would be scheduler agnostic, I would be ok with moving rayon into the nursery, in the same way that futures could be in the nursery, but mio probably won't be anytime soon.

I still would like to have a default scheduler, but I fear that if we offer a defacto one without a way to change it, as more crates depend on it the ecosystem will split into crates that one can use when one needs a custom scheduler and crates that one cannot use because they pull in a possibly very-heavy weight scheduler as a dependency that competes with the one the application wants to use.

Do I make sense?

EDIT:

Ideally, it would also be configurable, but i'm not sure how best to tackle that problem without incurring dynamic overheads. In particular, I don't want fork to invoke dynamic code or something like that.

Ideally, one would configure it at compile time just like the allocator: there can only be one. We can then go and provide different schedulers, e.g. like a "type-erased" scheduler that could be switched dynamically at run-time (does anybody actually need this?) or something that can be set up at run-time, like what OpenMP does: code against a table of function pointers, and on program initialization, set those function pointers to a particular scheduler once, based on an environment variable. My point being, if we could switch the scheduler, we could play with these alternative solutions while having a default scheduler that works 99% of the time.

nikomatsakis · March 25, 2017, 12:35pm

So, I share your concerns, although I think I balance them differently. In particular, I think I would be ok with moving Rayon as it is – inflexible scheduler and all – and assume that we will find ways to make it possible to change the default scheduler in the future, while retaining the basic rayon-core interface as stable (i.e., join and scope, primarily).

But let’s put that aside for a second. I want to just dump a few thoughts I’ve had about what it would take to make the scheduler in Rayon configurable. I’m not sure of the best approach right now. I think there are some challenges.

Why not dynamic dispatch?

The most obvious approach would be to let people specify a scheduler as a kind of “dynamic code path”. For example, when you create a thread-pool, you might supply a “custom scheduler hook” as a kind of trait object or something. This is very flexible (you can pick your scheduler at runtime!) but it comes with some downsides.

First off, the performance will suffer. I don’t have strict numbers here, but I feel like it has the potential to be quite a bit: the goal with Rayon (only partially achieved thus far) has always been that one can freely introduce join() calls (and parallel iterators) even for cases where there may not be a lot of data, because the overhead if no parallelism occurs is very small. I think exciting possibilities like Tapir – which integrates a join operation directly into LLVM IR – might help move the needle there, and the “statically selected global scheduler” design would fit very well into that.

To get more concrete, the way that Rayon works now, when you call join(a, b), we check if you are on a worker thread (that is intended to be the common case). If you are, we push a pointer to the closure b into a thread-local deque, and then start executing a. We then try to pop b from that deque and – if successful – call it.

The key point here is that we want it to be very cheap to call join(a, b) even when you will wind up executing both a and b yourself. If we can inline things like “push onto the deque” (should be just a few instructions) and “pop from deque” (likewise), that is quite achievable (indeed, we could do better than we do, but that’s a different topic (see some random thoughts below)). However, if we have to check for the possibility of a dynamic scheduler, that implies virtual calls, which will not be cheap enough, in my mind.

Now, maybe we can privilege the default scheduler by having some ifs that prefer the staticall selected path (and leave the custom scheduler to the cold path), but that doesn’t seem to achieve the full goals. I don’t want to hobble custom schedules!

Why not traits?

I basically want the scheduler to be a trait, but the problem is that I don’t want the functions like join etc to be customized by a trait. There might be some routines here that would work (e.g., I’ve been contemplating the idea of making it possible to have modules parameterized by type parameters, sort of like applicative ML functors), but we don’t have such mechanisms in the language today. And once we add them I thnk it’d be important that to get the “default scheduler”, you would still be able to just type rayon::join() and have it work (in other words, I’d want to make it possible to add this backwards compatibly).

What about something like the allocator?

I do think that having some kind of general “crate dependency inversion” mechanism in Rust – possibly just for a pre-selected number of dependencies, like the allocator, thread-scheduler, panic handler, and so forth – makes a lot of sense. We’ve seen a number of instances (I just cited 3, I suspect I forgot some). It’s a bit of design work. But it also seems like something that we can clearly insert after the fact. That is, it need not block offering APIs like join and scope.

Note in particular that custom allocators came quite a long time after collections that use the default allocator and so forth.

Precedent

Basically every serious language has moved to offer a “default scheduler” in some way. The JVM offers ForkJoinPool, Microsoft offers various APIs as well as LINQ. C++ doesn’t have anything in the language itself but there are certainly contenders, e.g. TBB, and I’ve heard anecdotally at many conferences how moving from many custom thread-pools to TBB has been a big win.

Random thoughts on how to make `join()` cheaper

I’ve not done a ton of work on micro-optimizing join() in Rayon, but the basic design is definitely aimed at making it the cost of join(a, b) to be quite close to the cost of a(); b();. Pushing onto the deque is highly optimized, as is popping from it. The closure itself is stack-allocated and we only have to push two words onto the deque anyhow (data pointer, code pointer). In the case where a() and b() wind up happening on the parent thread, those calls are statically dispatched and hence can be readily inlined.

That said, the original cilk work goes quite a bit further. For example, when it decided that it was unlikely that your jobs would be stolen, it would compile distinct versions of functions where join(a, b) is just hard-coded to a(); b(); for the case where it thinks that other CPUs are busy – but it does help to keep overheads down. It may be possible to do this in a library but it would require some deep hacking (e.g. tweaking return pointers on the stack and so forth).

I think a much more promising approach is moving knowledge of fork-join out of a library and into the compiler itself. This is where projects like Tapir come in. Hopefully we can build on that work (or that style of work).

I am imagining a compiler intrinsic like:

fn try_fork_join<F, G, H>(f: F, g: G, h: H)
    where F: FnOnce(), G: FnOnce(), H: FnOnce(F, G)

which will execute F and G in parallel (and join them afterwards) using the “native parallelism”. If no “native parallelism” is available on the current backend, it falls back to invoking h. Hence Rayon’s join() could be like this:

fn join<F, G>(f: F, g: G)
    where F: FnOnce(), G: FnOnce(),
{
    // try to use native parallelism
    try_fork_join(f, g, |f, g| {
        // fall back to 'emulated' parallelism, like we do today
    });
}

Obviously there is more work to do here (e.g., figuring out how scope interacts), but this all suggests that having some concept of a “global scheduler” would be a big win.

nikomatsakis · March 25, 2017, 12:45pm

Just a few things I wanted to mention:

I’d love to make some concrete progress towards extracting what the interface to a scheduler would even look like. I think there are two options:

A safe interface. That probably looks like rayon-core, honestly. It’s pretty general by now.
An unsafe interface. That might look something like rayon’s registry, but that was never really intended to be public, so it’s probably got lots of rough edges.

(I guess where rayon-core is lacking is that it doesn’t offer the ability to spawn out things that guaranteed to be O/S threads. That wouldn’t be a terribly difficult addition, I don’t think, but I’ve not thought hard about what it should look like. It’d be useful if you plan to do blocking I/O, though as I said earlier I think in that case you should use futures. But maybe you have your reasons for wanting blocking I/O.)

I think any interface that aims to allow things like join() to be implemented on top of it in any efficient way must however be unsafe, since the lifetime constraints are hard to capture otherwise. But maybe I’m just not being imaginative enough.

nikomatsakis · March 25, 2017, 12:59pm

OK, one more parting thought than I really have to go. If what we want to do is to make it possible for parallel iterators, in particular, to be retargeted to other “core libraries” – that might be relatively easy (as opposed to making raw calls to join retargetable).

gnzlbg · March 26, 2017, 10:41am

C++ has achieved consensus in a schedulers proposal: P0443R1 - A Unified Executors Proposal for C++ (2017-01-06). That's the second revision of the combined proposal, but in the global scheme of things is revision ~20. It basically forces you to pass the scheduler around, but default function arguments and overloading are used to provide the default scheduler. It plays well with ASIO (C++'s tokio), Cilk/Cilk+ fork/join parallelism, GPGPU computing, TBB, OpenMP, OpenCL, SIMD...

Not that C++'s solution can be applied 1:1 to Rust, but I think that the solution they have is very flexible, and makes a lot of sense. Without this it would have been impossible to satisfy Google (which wanted dynamic schedulers), Nvidia (which wanted GPGPU and SIMD support, thousands of execution against and bulk execution), Intel (which wants to add Cilk+ to the standard), IBM (which want them to be compatible with OpenMP), and the Networking Working Group (which wanted zero-overhead, no allocations, ...).

gnzlbg · March 26, 2017, 11:00am

I guess this is what I meant before with .par_iter(scheduler).... Can't we make a Scheduler trait that has join and scope, and parametrize all parallel algorithms on those? I would then be fine with all of the "scheduler-independent" algorithms being moved into the nursery, the current rayon::join/scope/scheduler being also moved into the nursery as a separate crate, and then provide a third crate in the nursery that just provides .par_iter() which invokes par_iter(rayon::scheduler).

nikomatsakis · March 26, 2017, 11:06am

It's not as simple as that. There are many parallel iterators. I would not want to move them into a separate crate. However, we could probably make a ParallelIterator combinator like with_scheduler(), so you would write something like this:

foo.par_iter()
    .with_scheduler(sched)
    .map()
    .collect()

and if you don't write with_scheduler, then you get the default (rayon-core).

nikomatsakis · March 26, 2017, 11:08am

Cool, I hadn't seen that! I'll take a look and let you know what I think. =) Always happy to pilfer good ideas from C++ (they have many...).

(To be clear, I don't claim to be an expert when it comes to parallel runtimes -- I've thought a fair amount about it, and done a modest amount of research, but I know there is lots of work out there I'm not aware of.)

gnzlbg · March 26, 2017, 1:19pm

and if you don't write with_scheduler, then you get the default (rayon-core).

Sounds good to me

Cool, I hadn't seen that! I'll take a look and let you know what I think.

Just keep in mind that this paper is more a specification than a rationale discussion, and also, that the problem it solves isn't necessary what rayon aims to solve.

Bit of history: The design has evolved from the Networking WG which started standardizing Christopher Kohlhoff's Boost.ASIO (~2012-2013). At the same time, Google wanted to standardize thread-pool-like things, and provide a dynamic API to abstract over those:

N3747: A universal model for asynchronous operations (Kohlhoff, 2013, the original proposal of what we currently have).
N4414: Executors and schedulers, revision 5 (Google, 2015, not very relevant).

It turned out that network people hated virtual calls (big surprise). Anyhow at the same time the Parallel STL technical specification was being approved (the parallel STL is C++'s rayon). It turned out that nvidia wanted the parallel STL to work on their 10.000 core GPUs, and Intel wanted to be able to implement it using Cilk+, and also, to be able to use SIMD, so here things start getting out of hands, now executors (schedulers) need to solve all these problems too. So they become a feature of their own:

P0113R0: Executors and Asynchronous Operations, Revision 2 (Kohlhoff, 2015)

And everybody joins to the party complaining that executors don't solve their problems:

N4406: Parallel Algorithms Need Executors (nvidia, 2015, very rayon relevant: why should parallel algorithms be parametrized over a scheduler, superseded by P0058R1, read that instead).
P0076R2: Vector and Wavefront Policies (Intel, 2016, kindish relevant)
P0058R1: An Interface for Abstracting Execution (Nvidia, 2016, rayon relevant (in particular the task_block part))
P0072R1: Light-Weight Execution Agents (Nvidia, 2016, rayon relevant).
N4411: Task Block (formerly Task Region) R4 (Intel/Microsoft, 2015, rayon relevant, fork-join parallelism).
... and many others, Intel with more Cilk+ stuff, AMD with the Heterogeneous Systems Architecture stuff, IBM with OpenMP, and the SIMD proposals...

At the same time, ~2015-2016, Microsoft wrote a full specification for C++ co-routines, implemented those in MSVC and Clang (the LLVM coroutines), and wanted to get this into C++17, so now executors needed to solve co-routine problems as well, particularly, in the context of networking:

P0286R0 A networking library extension to support co_await-based coroutines (Kohlhoff, 2016)
N4482: Some notes on executors and the Networking Library Proposal (Kohlhoff, 2015).

And the specification we currently have, is the result of executors going full circle, from networking and thread pools, to data parallelism in its 1000 flavors, and finally through coroutines back to networking. I don't think this design makes everybody happy (in particular, Google), but it doesn't make anybody specially unhappy, which is what ISO standardization is all about. Rust can definitely do better.

EDIT: I've added the Intel paper on fork-join parallelism, since that is only tangentially addressed by the executors proposals for algorithms.

ckaran · February 25, 2019, 11:07pm

I want to signal boost this type of application; when I first heard about rust, my first thought was 'thank God I don't have to learn GPGPU programming, rust will eventually be able to handle it!' This was in part due to my knowledge of projects like RLSL, and in part because I knew that safe rust has an incredible amount of information available to the compiler, which means that it might be possible to make SPIR-V a backend for rustc. I don't know what the final API that is pulled into the library will be, but I'm hoping that whatever is chosen will make it easy to use GPUs (and any other object that is effectively a coprocessor) from within the language/standard library.

ckaran · February 26, 2019, 2:36pm

I appreciate the RFC that @anon19897381 has written, but I personally prefer rayon’s approach of task parallelism as it frees me from having to think about the number of threads I need to spin up. Getting the load balancing right can be difficult, but rayon does a pretty good job of that via work stealing. I also think that a rayon-like API would be easier to adapt to GPGPU programming, where the number of cores I can use seems to increase every 5 seconds.

ckaran · February 26, 2019, 3:18pm

Re-reading, and quoting lots of quotable people:

All of these posts got me thinking about the log crate, and how it explicitly states that it is a 'lightweight logging facade'. Instead of pulling something into the standard library, can we create a facade for scoped threads, and also provide one (or more) submodules/subcrates (forgive me, I'm still learning rust's terminology) that provide default implementations of the facade? The advantage of this is that if in the future rust starts targeting stuff like Vulkan/SPIR-V, then people can provide crates that implement the facade, and users can use it as a drop-in replacement (my assumption is that even if rust gains the ability to compile code to SPIR-V, someone will still need to write a crate that understands the Vulkan compute APIs to move the compiled code onto the GPU. Same for CUDA, OpenCL, some kind of distributed architecture across a compute farm, or even combinations of all three).

I would also like to see an architecture where we can compose different concrete implementations together to handle bigger problems. For example, rust could have the equivalent of python's SCOOP at a high level, then on each machine there may be scheduler that understands how to feed jobs to the CPU & GPU (or GPUs), and even lower-level schedulers for each CPU or GPU core.

OK, I've come up with the big idea, now someone go make it happen! (please take that as a tongue-in-cheek joke, and not as an insult...)

Topic		Replies	Views
Crossbeam: request for help	54	14165	March 25, 2019
Parallelizing Query Execution - All Hands 2018 Discussion Summary compiler	6	1454	March 25, 2019
Parallelizing rustc using Rayon compiler	44	19170	March 25, 2019
Blog post: observatonal equivalence and unsafe code Unsafe Code Guidelines	27	4948	March 25, 2019
Post: streams concurrency	6	1151	March 23, 2020