Scoped threads in the nursery (maybe with rayon?)

@anon19897381 Do you have any suggestions for what kinds of primitives would work well together?

Grand Central Dispatch?


@notriddle Do you have much experience with GCD? If you do, do you know whether it is amenable to throwing tasks onto the GPU, or to extensions that can do that?

I know I'm a broken record here, but I want to make sure that whatever is developed can easily be extended to non-traditional compute resources (e.g., GPUs, or clusters of machines). I do not expect that we'd support such platforms out of the box, or even that they would ever be supported directly, but having the option would be nice. Maybe a facade for GCD, with libdispatch shipped with rust? Or would it be better to develop something new from scratch?


GPGPU (General-Purpose computation on a Graphics Processing Unit) and SMP (Symmetric Multi-Processing, like your "four core server CPU") are fundamentally different concurrency models. A GPU is basically a bunch of ALUs all driven by the same control unit, so that you can perform the same operation on millions of pixels simultaneously. It's Single-Instruction-Multiple-Data (SIMD). This is awesome for some use cases, but it's different from a multi-core application CPU, which runs completely different processes at the same time. That is, an SMP system is Multiple-Instruction-Multiple-Data (MIMD).

While I don't have much experience with GCD, since I develop mostly on Linux and Windows, I can read its documentation well enough to notice that it's all built around queuing up closures to run (like Travis CI, but at the level of a single machine instead of a cluster). That doesn't sound like a GPU-friendly way to do concurrency (a GPU-friendly model would be something like a "parallel for"), and the docs don't mention support for GPUs, so I assume the answer is "no".
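To make the contrast concrete, this is the shape of work I mean by a "parallel for", expressed here with rayon on the CPU; a GPU could in principle execute the same shape, since it is one pure operation applied uniformly to every element:

```rust
use rayon::prelude::*;

fn main() {
    // GPU-friendly shape: the same pure operation applied to every element,
    // with no blocking and no cross-element dependencies.
    let pixels: Vec<u32> = (0..1_000_000).collect();
    let brightened: Vec<u32> = pixels
        .par_iter()
        .map(|p| p.saturating_add(16))
        .collect();
    assert_eq!(brightened.len(), pixels.len());
}
```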

I also don't see any problem with the standard library providing a thread pool abstraction that only works on the CPU. OpenCL is nice, and I have no actual objection to bundling something like it in std, but OpenCL imposes a lot of restrictions on the application in order to achieve portability.

I would like to have std provide a thread pool library that only works on the CPU, and by placing this restriction, the thread pool library can be flexible enough that tokio, rayon, and crossbeam can all use it. Anything OpenCL-esque can then be layered on top of that, splitting work between the CPU (as provided by the GCD-esque thread pool library) and the GPU (as provided by whatever GPU-specific abstraction we're using).
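As a rough sketch of what I mean (all names invented, and far simpler than what rayon or tokio would actually need), a CPU-only pool in std could be as small as a channel of boxed closures:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

struct ThreadPool {
    tx: mpsc::Sender<Box<dyn FnOnce() + Send>>,
}

impl ThreadPool {
    fn new(workers: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..workers {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the lock only long enough to pull one job off the queue.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // sender dropped: pool is shutting down
                };
                job();
            });
        }
        ThreadPool { tx }
    }

    fn execute(&self, job: impl FnOnce() + Send + 'static) {
        self.tx.send(Box::new(job)).unwrap();
    }
}

fn main() {
    let pool = ThreadPool::new(4);
    for i in 0..8 {
        pool.execute(move || println!("job {i}"));
    }
    // A real pool would join its workers on drop; this sketch just
    // closes the channel when `pool` goes out of scope.
}
```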


Absolutely true, which is why I hated trying to figure out how to use it; I kept making mistakes in how I'd order things, resulting in sequential operations instead of parallel ones. That was part of why I quoted @nikomatsakis' comment about tapir; my suspicion is that rust is sufficiently strongly typed that it can extract the available parallelism on its own, figuring out how to deal with the differences between GPU architectures and CPU architectures. However, just having rustc --target=spirv isn't sufficient; an end user needs to be able to say 'this chunk of code should be parallelized because it will never block on io', and 'this chunk of code is listening to the game controller, and will be blocked for long periods'. This is where the crate's design becomes important, and why I think we should design a facade that permits these different use cases (more below).

I understand where you're coming from with this, but my hope is that we'll be able to figure out something that would allow rust to expand outwards as needed. That was part of my interest in creating a facade, and an implementation of the facade in the standard library. If in the future someone figures out how to compile rust -> GPU (OpenCL, SPIR-V, magical pink unicorns, etc.), it would be nice if they could also implement the facade for what they've done, making it easy for everyone else to use their work. When I see rayon's par_iter(), I see someone creating their own version of a parallel for, one that could just as easily shove work towards a GPU. I can also see a tree of concrete implementations, where higher nodes in the tree shovel work down to lower nodes (cluster -> computers -> CPU/GPU). I just want a facade that is open to all of this.
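The facade could be tiny. Here's one possible shape (every name here is invented for illustration, not a concrete proposal): callers write against the trait once, and each backend, be it a CPU pool, a GPU, or a cluster node, implements it natively:

```rust
use std::ops::Range;

trait ParallelBackend {
    // A "parallel for" over a range of indices: the smallest primitive
    // that every kind of backend could plausibly implement.
    fn par_for(&self, range: Range<usize>, body: &(dyn Fn(usize) + Sync));
}

// Trivial reference implementation: runs everything on the current thread.
struct SingleThread;

impl ParallelBackend for SingleThread {
    fn par_for(&self, range: Range<usize>, body: &(dyn Fn(usize) + Sync)) {
        for i in range {
            body(i); // a real backend would split the range across workers
        }
    }
}

fn main() {
    // Caller code is identical no matter which backend sits underneath.
    SingleThread.par_for(0..4, &|i| println!("chunk {i}"));
}
```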

There is an unsafe spawn_unchecked function on nightly, if you haven't seen it. Not sure why it hasn't been stabilized yet, though.
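For reference, usage looks roughly like this on nightly (behind the thread_spawn_unchecked feature gate, if I'm reading the docs right):

```rust
#![feature(thread_spawn_unchecked)]
use std::thread;

fn main() {
    let message = String::from("borrowed, not 'static");
    let handle = unsafe {
        // Unsafe because the compiler no longer enforces 'static on the
        // closure: *we* must guarantee `message` outlives the thread,
        // which joining before it drops ensures.
        thread::Builder::new()
            .spawn_unchecked(|| println!("{}", message))
            .unwrap()
    };
    handle.join().unwrap();
}
```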

But scoped threads can be safe, which is the point. I pushed for this addition to the stdlib in order to use the function in crossbeam::scoped, which makes it possible to avoid some allocations and lifetime hacking. I just didn't have the time to push things further, since I was occupied with my master's thesis (which, coincidentally, is about cooperative and preemptive suspension in the context of work stealing :wink: )
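For anyone following along, this is what the safe version looks like with the crossbeam crate today:

```rust
use crossbeam::thread;

fn main() {
    let data = vec![1, 2, 3];
    // The scope joins every spawned thread before returning, so the
    // threads may borrow `data` directly: no 'static bound, no Arc.
    thread::scope(|s| {
        s.spawn(|_| println!("first: {}", data[0]));
        s.spawn(|_| println!("sum: {}", data.iter().sum::<i32>()));
    })
    .unwrap();
}
```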


OK, I think that there are two parts here.

First, I think that mimicking rayon's plumbing (see plumbing docs and README) is the way to go. At a high level, you can have specialized consumers for IO, CPU, GPU, blocking tasks, etc., each used for a different purpose. For example, the IO consumer might really be looking to iterate over a set of readers, and its parallel iterator would be optimized to behave like a select() or poll() (rayon's find_any() seems like a good fit for this), while the CPU/GPU versions expect to never block and are really optimized for parallel execution (they implement all of the ParallelIterator trait, they just aren't optimized over some bits of it).

Above that are schedulers, which are both producers and consumers. They act as the interior nodes of a tree whose leaves are the consumers I just described. A scheduler's task is to keep all of its children full of work, and to gather the results as they come in. Thus, you might have a CPU+GPU scheduler that knows how to distribute work over a set of CPUs and GPUs (a child asks its parent for work; the parent steals work for that child from its other children, or asks its own parent for more). You could then layer another scheduler on top of the CPU+GPU scheduler that knows how to distribute work across a cluster of compute nodes. The depth of layering is up to the programmer. All of this handles moving work up and down the layers, but it doesn't allow jumping between nodes, which brings up my next point.
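In code, the tree might look something like this (again, every name is invented; a real design would do work stealing rather than round-robin):

```rust
use std::cell::Cell;

struct Work; // placeholder unit of work

// Leaves of the tree: things that actually execute work.
trait Consumer {
    fn run(&self, work: Work);
}

struct CpuLeaf;
impl Consumer for CpuLeaf {
    fn run(&self, _work: Work) {
        // a real leaf would hand the work to a CPU pool or a GPU queue
    }
}

// Interior node: a scheduler is itself a Consumer, so schedulers stack:
// cluster -> machines -> CPU/GPU leaves.
struct Scheduler {
    children: Vec<Box<dyn Consumer>>,
    next: Cell<usize>,
}

impl Consumer for Scheduler {
    fn run(&self, work: Work) {
        // Naive round-robin stands in for real work stealing.
        let i = self.next.get();
        self.next.set((i + 1) % self.children.len());
        self.children[i].run(work);
    }
}

fn main() {
    let tree = Scheduler {
        children: vec![Box::new(CpuLeaf), Box::new(CpuLeaf)],
        next: Cell::new(0),
    };
    tree.run(Work);
}
```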

std::sync::mpsc already defines both channel and sync_channel. In the example you gave above, the IO task could create a new channel and hand the send side over to rayon. On the IO side, it would create a vector of closures, one of which listens to the receive side of the channel, while the other does the IO work. This vector is then executed using par_iter().for_each(|x| x()), which means that the task blocked waiting for rayon doesn't prevent the IO task from executing (or vice versa).
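A runnable version of that pattern might look like this (using into_par_iter so the one-shot closures can be consumed by value):

```rust
use std::sync::mpsc;
use rayon::prelude::*;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // One closure plays the IO task feeding the channel; the other listens
    // on the receive side. Note this relies on the pool having at least two
    // threads; with only one, the listener could block forever, which is
    // exactly the blocking hazard this thread is about.
    let tasks: Vec<Box<dyn FnOnce() + Send>> = vec![
        Box::new(move || {
            for i in 0..5 {
                tx.send(i).unwrap(); // stand-in for real IO work
            }
            // `tx` drops here, closing the channel so the listener exits.
        }),
        Box::new(move || {
            for msg in rx.iter() {
                println!("got: {msg}");
            }
        }),
    ];

    tasks.into_par_iter().for_each(|task| task());
}
```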


I was thinking about what both I and @anon19897381 said earlier, and I realized that we may already have most of the machinery we need to make a rayon-like library work with both blocking and non-blocking tasks. We already have the std::marker traits, which allow the compiler to decide if something can cross thread boundaries, etc. What would happen if we added a marker std::marker::Blocking? Anything that could be a blocking task would implement the trait; if you have some kind of executor that can't correctly handle work that may block, it just puts a bound of !Blocking on any work that is fed to it. The compiler can then decide if the bounds are met. All IO in the standard library could then be marked with Blocking, as well as anything else that could cause headaches.
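Since stable Rust has no !Trait bounds, the closest sketch I can write today inverts the polarity into an auto trait on nightly (all names here are hypothetical, mirroring how Send/Sync work):

```rust
#![feature(auto_traits, negative_impls)]

// Hypothetical marker: implemented for everything automatically,
// unless a type explicitly opts out.
auto trait NonBlocking {}

struct StdinRead; // stands in for std IO that may block
impl !NonBlocking for StdinRead {}

struct Matrix; // pure CPU work: stays NonBlocking automatically

// An executor that can't tolerate blocking work simply bounds on the marker.
fn spawn_compute<T: NonBlocking + Send + 'static>(_task: T) {
    // ... hand the task to a fixed-size CPU pool ...
}

fn main() {
    spawn_compute(Matrix);
    // spawn_compute(StdinRead); // rejected at compile time
}
```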

There may need to be some kind of dynamic check as well, but I haven't thought far enough ahead yet to know if that is needed.

