So, I share your concerns, although I think I balance them differently. In particular, I think I would be ok with moving Rayon as it is – inflexible scheduler and all – and assuming that we will find ways to make it possible to change the default scheduler in the future, while keeping the basic rayon-core interface (`join` and `scope`, primarily) stable.
But let’s put that aside for a second. I want to just dump a few thoughts I’ve had about what it would take to make the scheduler in Rayon configurable. I’m not sure of the best approach right now. I think there are some challenges.
Why not dynamic dispatch?
The most obvious approach would be to let people specify a scheduler as a kind of “dynamic code path”. For example, when you create a thread-pool, you might supply a “custom scheduler hook” as a kind of trait object or something. This is very flexible (you can pick your scheduler at runtime!) but it comes with some downsides.
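To make that concrete, such a hook might look roughly like the following. This is a minimal sketch, and all of these names (`SchedulerHook`, `PoolConfig`, and so on) are hypothetical rather than an actual Rayon API:

```rust
// Hypothetical sketch of a runtime scheduler hook; none of these
// names exist in Rayon today.
pub trait SchedulerHook: Send + Sync {
    /// Hand a spawned task to the custom scheduler, which decides
    /// where and when it runs.
    fn spawn(&self, task: Box<dyn FnOnce() + Send>);
}

pub struct PoolConfig {
    // Stored as a trait object, so every use implies a virtual call.
    scheduler: Option<Box<dyn SchedulerHook>>,
}

impl PoolConfig {
    /// Install a custom scheduler when building the pool.
    pub fn with_scheduler(mut self, hook: Box<dyn SchedulerHook>) -> Self {
        self.scheduler = Some(hook);
        self
    }
}
```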
First off, the performance will suffer. I don’t have hard numbers here, but I suspect the cost could be significant: the goal with Rayon (only partially achieved thus far) has always been that one can freely introduce `join()` calls (and parallel iterators) even for cases where there may not be a lot of data, because the overhead when no parallelism occurs is very small. I think exciting possibilities like Tapir – which integrates a `join` operation directly into LLVM IR – might help move the needle there, and the “statically selected global scheduler” design would fit very well into that.
To get more concrete, the way that Rayon works now, when you call `join(a, b)`, we check if you are on a worker thread (that is intended to be the common case). If you are, we push a pointer to the closure `b` onto a thread-local deque, and then start executing `a`. We then try to pop `b` from that deque and – if successful – call it.
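In code, that fast path looks roughly like this. This is a deliberately simplified sketch: the real deque is a lock-free work-stealing deque, the job is stack-allocated rather than boxed, and `join_fast_path` is a name I made up for illustration.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Thread-local deque of pending jobs, standing in for the real
// per-worker work-stealing deque.
thread_local! {
    static LOCAL_DEQUE: RefCell<VecDeque<Box<dyn FnOnce()>>> =
        RefCell::new(VecDeque::new());
}

fn join_fast_path<A, B>(a: A, b: B)
where
    A: FnOnce(),
    B: FnOnce() + 'static,
{
    // Push `b` where an idle worker could steal it...
    LOCAL_DEQUE.with(|d| d.borrow_mut().push_back(Box::new(b)));
    // ...then execute `a` inline on this thread.
    a();
    // If `b` was not stolen in the meantime, pop it and run it here.
    if let Some(job) = LOCAL_DEQUE.with(|d| d.borrow_mut().pop_back()) {
        job();
    }
    // (In real Rayon, if `b` *was* stolen, the caller waits for the
    // thief, stealing other work in the meantime.)
}
```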
The key point here is that we want it to be very cheap to call `join(a, b)` even when you wind up executing both `a` and `b` yourself. If we can inline things like “push onto the deque” (should be just a few instructions) and “pop from the deque” (likewise), that is quite achievable (indeed, we could do better than we do today, but that’s a different topic; see some random thoughts below). However, if we have to check for the possibility of a dynamic scheduler, that implies virtual calls, which will not be cheap enough, in my mind.
Now, maybe we can privilege the default scheduler by having some ifs that prefer the statically selected path (and leave the custom scheduler to the cold path), but that doesn’t seem to achieve the full goals. I don’t want to hobble custom schedulers!
Why not traits?
I basically want the scheduler to be a trait, but the problem is that I don’t want functions like `join` to be customized by a trait. There might be some routes here that would work (e.g., I’ve been contemplating the idea of making it possible to have modules parameterized by type parameters, sort of like applicative ML functors), but we don’t have such mechanisms in the language today. And once we add them, I think it’d be important that, to get the “default scheduler”, you would still be able to just type `rayon::join()` and have it work (in other words, I’d want to make it possible to add this backwards compatibly).
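To illustrate why, here is a hypothetical sketch of what a trait-based design might look like; notice that the scheduler type now leaks into every call site, which is exactly what I’d like to avoid:

```rust
// Hypothetical: if the scheduler were a trait, `join` would become an
// associated function, and call sites would have to name a scheduler type.
trait Scheduler {
    fn join<A, B>(a: A, b: B)
    where
        A: FnOnce() + Send,
        B: FnOnce() + Send;
}

struct DefaultScheduler;

impl Scheduler for DefaultScheduler {
    fn join<A, B>(a: A, b: B)
    where
        A: FnOnce() + Send,
        B: FnOnce() + Send,
    {
        // Sequential execution, just to make the example self-contained.
        a();
        b();
    }
}

fn caller() {
    // No longer plain `rayon::join(...)`:
    DefaultScheduler::join(|| println!("a"), || println!("b"));
}
```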
What about something like the allocator?
I do think that having some kind of general “crate dependency inversion” mechanism in Rust – possibly just for a pre-selected set of dependencies, like the allocator, thread scheduler, panic handler, and so forth – makes a lot of sense. We’ve seen a number of instances (I just cited three, and I suspect I forgot some). It’s a bit of design work, but it also seems like something that we can clearly insert after the fact. That is, it need not block offering APIs like `join` and `scope`.
Note in particular that custom allocators came quite a long time after collections that use the default allocator and so forth.
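For comparison, here is how global allocator selection works in Rust, together with a purely hypothetical scheduler analogue (no `#[global_scheduler]` attribute exists; it is sketched here only to show the shape of the idea):

```rust
use std::alloc::System;

// Real mechanism: one crate in the final binary statically selects
// the global allocator; every `Box`, `Vec`, etc. then uses it.
#[global_allocator]
static GLOBAL: System = System;

// A scheduler analogue might look like this (hypothetical):
//
//     #[global_scheduler]
//     static SCHEDULER: MyScheduler = MyScheduler;
//
// with `rayon::join` and `rayon::scope` compiled against whatever
// the final binary selects.

fn main() {
    let v = vec![1, 2, 3]; // allocated via the selected allocator
    println!("{:?}", v);
}
```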
Precedent
Basically every serious language has moved to offer a “default scheduler” in some way. The JVM offers `ForkJoinPool`, and Microsoft offers various APIs as well as PLINQ. C++ doesn’t have anything in the language itself, but there are certainly contenders, e.g. TBB, and I’ve heard anecdotally at many conferences how moving from many custom thread pools to TBB has been a big win.
Random thoughts on how to make `join()` cheaper
I’ve not done a ton of work on micro-optimizing `join()` in Rayon, but the basic design is definitely aimed at making the cost of `join(a, b)` quite close to the cost of `a(); b();`. Pushing onto the deque is highly optimized, as is popping from it. The closure itself is stack-allocated, and we only have to push two words onto the deque anyhow (data pointer, code pointer). In the case where `a()` and `b()` wind up happening on the parent thread, those calls are statically dispatched and hence can be readily inlined.
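To make “two words” concrete, here is a simplified sketch of such a job representation; the names are hypothetical and the details differ from Rayon’s actual internals:

```rust
// A job is just (data pointer, code pointer): two words on the deque.
struct JobRef {
    data: *const (),                  // points at the stack-allocated closure
    execute_fn: unsafe fn(*const ()), // monomorphized entry point
}

// The closure lives on the caller’s stack; only the two-word JobRef
// is pushed onto the deque.
struct StackJob<F: FnOnce()> {
    func: Option<F>,
}

impl<F: FnOnce()> StackJob<F> {
    fn new(func: F) -> Self {
        StackJob { func: Some(func) }
    }

    fn as_job_ref(&mut self) -> JobRef {
        // Monomorphized per closure type: a thief invokes the job
        // through this pointer, while the parent thread can call the
        // closure directly (and thus have it inlined).
        unsafe fn execute<F: FnOnce()>(data: *const ()) {
            let job = unsafe { &mut *(data as *mut StackJob<F>) };
            (job.func.take().unwrap())();
        }
        JobRef {
            data: self as *mut Self as *const (),
            execute_fn: execute::<F>,
        }
    }
}
```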
That said, the original Cilk work goes quite a bit further. For example, when it decided that it was unlikely that your jobs would be stolen – i.e., for the case where the other CPUs are busy – it would compile distinct versions of functions where `join(a, b)` is just hard-coded to `a(); b();`. That gives up some flexibility, but it does help to keep overheads down. It may be possible to do this in a library, but it would require some deep hacking (e.g. tweaking return pointers on the stack and so forth).
I think a much more promising approach is moving knowledge of fork-join out of a library and into the compiler itself. This is where projects like Tapir come in. Hopefully we can build on that work (or that style of work).
I am imagining a compiler intrinsic like:
```rust
fn try_fork_join<F, G, H>(f: F, g: G, h: H)
    where F: FnOnce(), G: FnOnce(), H: FnOnce(F, G)
```
which will execute `f` and `g` in parallel (and join them afterwards) using the “native parallelism”. If no “native parallelism” is available on the current backend, it falls back to invoking `h`. Hence Rayon’s `join()` could look like this:
```rust
fn join<F, G>(f: F, g: G)
    where F: FnOnce(), G: FnOnce(),
{
    // try to use native parallelism
    try_fork_join(f, g, |f, g| {
        // fall back to 'emulated' parallelism, like we do today
    });
}
```
Obviously there is more work to do here (e.g., figuring out how `scope` interacts), but this all suggests that having some concept of a “global scheduler” would be a big win.