Single-source GPU support


#1

Hi!

I am new to both Rust and the forum, so bear with me if something’s not right.

I am a physicist doing my PhD and 90% of my day is coding C++ for GPU calculations, more specifically experimenting with advanced Template Metaprogramming that does some pretty neat things in both host-side and device-side code. TMP propagating all the way through to kernels is just pure awesome. For this we’re using portable tools, C++AMP (with Clamp under Linux) and we are SYCL beta testers. (No CUDA) C++AMP uses DirectCompute on Windows, while Clamp uses SPIR and HSAIL as a back-end. SYCL is capable of generating SPIR and flat OpenCL C kernels.

A few of us in our group have come across Rust and it simply blew our mind. Rust is the better C++ we need. Rust’s type system is far more powerful (and human readable) than what I can express with TMP. Concepts are a long way, before it is a widely adopted C++ feature. (Not before C++17, plus implementation time.) I saw that there is a Rust port to the OpenCL API (most likely making excessive use of C compatibility), and I even found the RustGPU project, from where I contacted Eric Holk, a member of the group implementing that proof of concept.

Due to the infancy of SPIR at the time, RustGPU was implementeg using NVPTX as the kernel intermediate, and OpenCL for the host-side stuff. Read more in this blog post. OpenCL 2.0 with it’s matching SPIR 2.0 however have received numerous refinements since the first provisional specification and is close to being finalized, with 2 implementations already at hand (Intel, AMD). SPIR 2.0 has support for function pointers (called ‘blocks’ in OpenCL C), multiple levels of Shared Virtual Memory, just to name the most important stuff.

Following up on the brief mailing with Eric about this RustGPU pilot project, I have the feeling that all the underlying headaches have been cleared out. LLVM already supports OpenCL memory namespaces, as I was following the LLVM list a while back too.

I understand that getting Rust 1.0 is top priority now, and that Rust will have subsequent backward-compatible updates in the future. My questions are:

  • How open is the community for bringing single-source GPU programming to Rust?
  • If yes, is it something of an ‘explicit goal’ or more like ‘we’re not against it’?
  • Are there people with the necessary skills and time available to pull something like this off?

While I would be very much excited to work on a project like this, I fear my agenda in the next 1-2 years (while my PhD is running, beside a full-time job and my newborn child (the first)) will not allow me to take on such an endeavor. I do know however that I would be one of the most enthusiastic users of the feature, and I’d even be willing to beta test it with all it’s headache, similar to what we’re doing with SYCL now.

Thoughts? Ideas? Comments?

Cheers, Máté

ps.: second but similar question: substitute HSA and HSAIL for OpenCL and SPIR.


#2

I’m very interested in this idea (though I have limited GPU experience and haven’t taken a compilers class yet). Just one question: how do you define “Single-source GPU”? I’m not familiar with the term.


#3

http://blog.theincredibleholk.org/blog/2012/12/05/compiling-rust-for-gpus/

was written long ago, is this the kind of thing you’re talking about? A lot has changed in Rust-land since then, but I don’t see why it’s not fundamnetally possible somehow.


#4

By Single-source I mean GPU programming, where both the host-side and the device-side code reside in the same source file. Pretty much all GPGPU APIs apart from OpenCL, which requires the user to load the kernel code from a separate file. (It is not mandated, but it is best practice, as the compiler expects a const char* as kernel source.)

While the term single-source does not automatically imply that the same compiler has to generate the device code as the one doing host compilation (as is the case with SYCL, where the standard leaves this fact to be an implementation detail), but usually it is.


#5

(Reading the post takes only ~10 minutes, but writing it took 2+ hours.)

In my opinion, should a project like this be revived (which I think would do only good), the first things that need to be decided:

  • What mechanisms to use to decorate GPU functions, and how to compile them?
  • How to annotate memory namespaces?
  • What features of Rust does one wish to propagate through to kernels?

The spectrum of tools available in C++ for kernel calling are:

  1. CUDA style language extension. GPU functions are decorated with qualifiers resembling calling conventions (device void myFunc(…)) and calling the kernels use a special triple-bracket operator (myFunc<<<dimX,dimY>>>(…)).
  2. C++AMP style language extension. GPU functions are decorated with qualifiers resembling concepts, the original version that is (void myFunc(…) restrict(cpu,amp)). This is mixed with a bit of library magic, because calling kernels is possible through the invocation of a specific function (concurrency::parallel_for_each(…)).
  3. SYCL style library magic. GPU functions are ordinary host-side functions. Calling kernels is possible through special libarary functions, that await conforming functors (cl::sycl::parallel_for(…)).
  4. OpenACC/OpenMP style pragma directives. Ordinary language for loops being decorated with #pragma directives.

The spectrum of tools available in C++ for annotating memory namespaces are:

  1. CUDA style language extension with API support. 1.1. On host side, one calls special cuda_malloc(…) functions, which return seemingly pointers, but in truth handles to device memory, which pointers are only meaningful when used inside API functions. With SVM rising, the value of pointers can have actual meaning when passed raw to kernels, but without physically the same memory, the API handles address translation. 1.2. On device side, variables are decorated with qualifiers (shared) to annotate memory namespaces.
  2. C++AMP style language extension with library magic. 2.1. On host side, one uses templated library containers that can be passed to kernel functions. 2.2. On device side, variables are decorated with qualifiers (tile_static) to annotate memory namespaces.
  3. SYCL style library magic. 3.1. On host side, one uses templated library containers that can be passed to kernel functors. 3.2. On device side, one uses templated library classes, to obtain pointers of various memory namespaces (cl::sycl::accessor<…, cl::sycl::access::local>)
  4. OpenACC/OpenMP style pragma clauses with compiler magic. 4.1. On host side, regular C/C++ memory allocation is performed for host. Memory movement is implicit on kernel launch. 4.2. On device side, there are special #pragma caluses that control shared memory allocation.

All of the above single-source GPGPU languages/APIs restrict language features to a subset of C98/C11/C++98/C++11/C++14. Usually anything dynamic is forbidden, but the newest version of all APIs are becoming more and more relaxed in terms of restrictions. Which are the following:

  1. CUDA 6.5 allows dynamic memory allocation inside kernels (overallocating/segfaulting results in failure reported to host), usage of function pointers, but no direct references to host side variables, no dynamic polymorhpism and no exceptions.
  2. C++AMP 1.0 allows any C++ feature to be used (that the compiler understands), as long as types in kernel lambdas are amp-compatible. That means types used in kernels must not have virtual functions, kernels must not dynamically allocate memory and cannot hold references to any other types than concurrency::array<> (the templated containers created on the host), no recursion, no RTTI, no exceptions, same old, same old. The constructor of concurrency::array decorated with restric(cpu) only, so as one would expect, only host can allocate device memory.
  3. SYCL 1.2 allows any C++11 technique to be used apart from same old, same old.
  4. Allow me to omit this part. Never used OpenACC/OpenMP4.0, and never will.

Violating any of the preconditions can always be detected at compile-time. Note, that non of this is a complete list, just a showcase to get the feeling.

This is a rough overview of how single-source GPGPU is done in C/C++. I have no idea what would be the best way to do something similar in Rust. I have ideas, but I have not followed the evolution of Rust, and I am no compiler dev or software engineer (although I do tend to feel like one). Just a brief quote from the linked blog post:

…I added a -Zptx flag to rustc and started making minor changes to the translation pass. Functions that have the #[kernel] attribute get compiled to use the ptx_kernel calling convention, which tells NVPTX to add the .entry line. According to Patrick, we should probably use a new ABI setting instead, as arbitrary attributes aren’t part of the function’s type.

I have no idea how function attributes work, neither do I know how it would look like in code to introduce new ABI setting for a function. I am also unsure what would be the best way to launch kernels (see earlier: special functions, new operators…), what fits best the Rust phylosophy. Some of the above mentioned techniques are very much intrusive, others are very subtle. CUDA is intrusive in the sense, that it introduces many new things, and without understanding of it, the code is meaningless. C++AMP is also intrusive, but it introduces a new type-system enhancement orthogonal to the language, and thus integrates a bit more elegantly. SYCL only uses library classes that behave in special ways; this while may be counter-intuitive, code without understanding of SYCL constructs is valid C++, and actual implementations of the special classes can exist that even make the code correct. OpenACC and OpenMP code can be written in a way, that a compiler not understanding the pragmas, and as such simply omitting them, will result in valid AND correct serial code.

I do not know what the core Rust developers think about introducing GPU parallelism into the language. I know that the C++ committee is very reluctant to touch upon GPU parallelism, as they believe GPUs are evolving too fast to be standardized and brought into the language either via a language features, or just STL features. If something makes it into the C++ standard, essentially it never makes it out in a non-backward-compatible manner. If Rust wishes to be defensive about it, it will introduce GPU parallelism as an optional feature (compiler flag), or anything else, that later can be turned off, or deemed deprecated, should GPU’s fly by in terms of features (or existence).

It seems to me, that heterogenous computing will most certainly be the future: meaning that latency optimized cores (CPU) and throughput optimized cores (IGP) will share memory with increasing level of transparency. OpenCL 2.0 abstracts the notion of SVM, and the highest possible level of memory sharing is atomically correct shared virtual memory. AMD Kaveri already supports it, while coming Intel Skylake is known to be supporting it. Mobile chips usually don’t implement such costly features, as they are considered too expensive in the mobile world. Discreet GPUs plugged into PCI-extensions are questionable, whether they will survive in 5-10 years time, or the latency of the BUS that essentially kills of atomic correctness will eventually be such a burden that game developers will not want to tolerate. The HSA architecture neatly abstracts out all these entities and features that devices can (and must) implement in some way, while OpenCL tends to have bit more focus on actual use, but not features such as QoS and preemption. (HSA has no associated language to it, so OpenCL is one ‘front-end’ to HSA, C++AMP being another one)

One needs to set a goal of features that would be imported into such a single-source Rust GPGPU project, and what features would be left out. OpenCL 2.0 is capable of dynamic parallelism (kernels launch new kernels (in various ways)), even if that is on a different device, even the CPU. In this case however, the function launched on the CPU must be a kernel itself and thus abide by all the rules and restrictions imposed on kernels. Is atomically correct memory and pointer value validity on both host and device are considered mandatory? (Most likely much easier to make a compiler for such a case.) If one goes further, HSA backend is not restricted to kernels only being able to launch kernels. In HSA kernels can launch ordinary CPU functions and await their result, launch them async and what not. The only reason this is not available to programmers, because no ‘front-end’ capable of compiling to HSAIL allows such things, and result in compile-time error, even though both the HW and the IL support it; all this because neither OpenCL, nor C++AMP were designed for such high level of interoperability betwen devices. Where does one want to draw the line?

I believe the order in which things must be decided:

  1. What would be the scope of single-source GPGPU?
  2. What back-ends (HW/SW) are capable of implementing the required features?
  3. What language feature would be needed to give a neat interface to the Rust user?
  4. How on Earth would that be compiled?

I am always up to discussions like this, might even be willing to help in interface design, but I am essentially a Rust noob (though a seasoned C++ programmer can pick up Rust fairly fast, if the new way of thinking in Rust has been picked up). As for actual coding for rustc, I am definately not the person to contact, but I have already said that.


#6

I think the GPU interface should not be part of the language, but rather part of a library.


#7

Well, the whole point is that the device code needs to be compiled too and the point here is to have it compiled by the same compiler. So the compiler has to be involved.

Question is what is what needs to be added to the core and what can be implemented as compiler plugin. Preferably the plugin interface should just allow enough control over code generation and the actual extensions should all be in the plugin.


#8

This is RFC issue #614. Check the link for the past discussion.


#9

So I only have a passing familiarity with GPU/OpenCL programming, so I’ll try to avoid making assumptions. I’m pretty familiar with most of the compiler internals though.

Adding an ABI is fairly easy, most of the compiler only cares about one or two of the ABIs (in the relatively rare case that it cares at all), and so adding a new ABI would only require updating a small number of locations in the compiler. As @pcwalton said, this has the nice advantage of being part of the type.

For actually compiling the kernel functions, it should be enough to 1) collect the functions as we see them, translating them in a separate context and 2) “calls” to these functions generate the appropriate code for executing them on the appropriate device. A separate library would be the most appropriate place to implement the bulk of the code for doing that though. I’m not familiar with how you go about executing kernels and what the requirements are, so I may be very much off the mark on this.

For language features, there isn’t much I can see be off-limits. From a runtime perspective, Rust’s features are almost identical to C. The main exceptions being closures (especially boxed) and trait objects. It’s the library code that is important, especially the “language items” (special items recognised by the compiler and used to implement various features like operator overloading and stack unwinding). Being able to use at least the bulk of libcore seems necessary for this feature to be useful overall.

I guess I need to know the restrictions, limitations and requirements on writing and executing GPU code to answer the questions better. That said, it’s clear to me that the biggest issue isn’t going to be the GPU code itself, but instead the interaction between GPU-Rust and Non-GPU-Rust.


#10

You are not off the mark at all. Calling GPU kernels is no black magic. In fact, it cannot get much simpler as it is, even in the C APIs. Trying to reflect on your comments:

  1. Marking functions to be compliled against a new ABI, and compiling them as the compiler meets them “in a seperate context” (whatever that may mean) seems simple enough. 2.1) Calls to these functions seems yet again simple enough to me. Step 1) ultimately outputs SPIR intermediate. Before the kernels can be callled on a given device, one needs to create actual binaries that the devices understand. That is done using clCreateProgramWithBinary (OpenCL 2.0 spec page 181) where one specifies a list of devices and the SPIR “binary”. The result is a cl_program object that holds handles to callable kernels. (A cl_program object allows many queries to be made even at runtime using clGetProgramInfo())From this, one can extract cl_kernel objects (handles to compiled kernels) using clCreateKernel() (page 206) which takes a cl_program and the const char* name of the function that serves as the entry point to a kernel, as input. 2.2) The actual calling of these functions only require one to pass function arguments via API functions clSetKernelArg() which takes pointers and number of bytes as input. (No typecheck done in naked OpenCL! clSetKernelArg takes void* as input, but naturally the kernels param may be any pointer type. Not even the C++ wrappers can help us here.) Parameters can be buffers (the abstract objects of cl_buffer, that serve as handles to GPU memory arrays) or any POD datastructure, in which case all instances of a kernel obtain a copy of the object. If one wishes to pass a reference to an object as kernel parameter, the object need to be wrapped in a buffer that is either an SVM buffer (OpenCL 2.0 way, page 163), or a buffer created with CL_MEM_USE_HOST_PTR (page 93), in which case all accesses of the variable will be done through a host memory pointer (which is a rather costly access if the PCI-E BUS is involved).

As you said, the tricky part is getting the GPU part neatly interact with the Non-GPU part. All C++ GPU APIs allow eplicit control over device selection, which is a must, and it should not be masked away. Most of them employ lazy memory movement to copy things at the latest possible time, but also allow explicit control, if one knows when they wish to move data.

The trickiest part comes here.

GPU kernels are best launched asynchronously, and only wait upon them when their output is needed (either for a new kernel, or to be copied back to host). OpenCL uses cl_event objects as syncing handles that may be used by the user explicitly, and cl_command_queue objects that inside handle correct syncing inside themselves when parametrized correctly. C++AMP’s concurrency::parallel_for_each() is a synchronous function, though it can be wrapped into a std::async() call to retrieve a std::future which allows for user control.

The tricky part is how does Rust guarantee at compile time, whether or not I use buffers in an invalid way. (Writing the same buffer on 2 seperate devices at the same time, etc.) OpenCL defines, that in such a case, it is implementation defined how a program explodes. These are the kind of cases one would expect Rust to detect statically. Both SYCL and AMP introduce entities that can be used for such detection in Rust also. cl::sycl::accessor<> holds in it’s type whether it is a reading/writing/read-write pointer in a buffer, and AMP as array_views which can be const (and also change the logical indexing of the array).

Partly relevant is how would Rust make the distinction of regular buffers vs. buffers pointing to host memory and SVM. Regular kernel arguments vs. single element buffers (this is a rare thing to do, ugly and slow, yet valid code). Which one should Rust allow and how to map them to Rust types?


#11

Thanks, that’s actually really helpful! I actually have a better idea of the general parts needed here.

By “context”, I mean an LLVM context. I forgot that most people aren’t familiar with LLVM :stuck_out_tongue: LLVM uses contexts for compilation, so you’d have to compile the OpenCL functions in a different context to be able to compile them to a different target.

So what I see is that we would compile all the OpenCL functions, producing an appropriate “binary” in SPIR format. That would get added to the regular binary as a global constant. Another global variable would be generated to store the cl_program, which we create from the SPIR binary in the main function. Calling the SPIR functions can then pull the appropriate kernel out of the cl_program and do the right thing.

Rust uses a number of analyses to ensure that data isn’t modified concurrently. The most obvious one is that a &mut reference cannot alias with any other reference (though things like re-borrowing complicate this definition). Another is that most values “move” by default (in fact, you have to opt-in to copy behaviour), which in Rust is simply a shallow memcpy that leaves the original value unusable (enforced at compile time). It should be easy to use these features to enforce the constraints that OpenCL imposes (but cannot easily enforce).

A few questions I don’t know how to answer is how the language-level “API” should work. Dealing with the asynchronous aspect seems like the more difficult part. We could wrap the return type from the kernel so a declaration like extern "opencl" fn foo() -> Bar would actually return a ResultHandle<Bar> when called (and you’d be able to wait on it for results). Another option could be that we block by default, but provide a way to “convert” it into an asynchronous form. The second option seems better to me, since it doesn’t require rewriting types.

I think that a lot of the functionality should be provided via library code though, even if it means that doing things like passing a &[T] uses a host-memory buffer, and that using a GPU memory buffer requires explicitly requesting one. In practice, I don’t think this would be much of an issue though, especially since you could use iterators to construct one (even use FromIterator so it’s as simple as let data : cl::Buffer<Foo> = my_iter.collect()). Using iterators also allows the buffer to be constructed lazily anyway (with the “move all the data” method being public).

I might go an do some of my own research on OpenCL and the C++ abstractions you have mentioned. As with many things, the challenge here isn’t the functionality itself, it’s presenting it to users.


#12

Thank you for the thorough explanation. Indeed, I am not much familiar with LLVM. I know what it’s for, and roughly what it is, but have never layed hands on it. I’m pretty much an ordinary (but eager) end-user.

Constructing a cl_kernel on every call from a cl_program is considered bad form, but ultimately that is an optimization issue.

Lucky for us, all GPGPU APIs forbid returning values from kernels, so the async return is not an issue. Kernels must always return void. If one needs output, that must always be provided through buffers or SVM.

Putting that aside, it might more elegant to alleviate this restriction, but that is another issue we might not want to tackle just yet. While in some corener cases it would be intuitive (such a parallel reduce, aka. std::accumulate), but more often than not it is confusing. In this case, only one instance of a kernel could return; all others must break, and that is quite ugly.

If you would like to educate yourself, the OpenCL 2.0 specs is a good start. The introductory part is useful, the rest can pretty much be scanned through. After that I would suggest reading about SYCL, as it has many 1:1 mapping to OpenCL entities. AMP is an elegant, yet different approach. But by this time, you will have the feeling of “same old, same old”, specially on the restriction parts. The 2 best parts of AMP is the “requires(amp,cpu)” part, which allows one to overload practically everything on the device-ness of functions in a manner orthogonal to the type system, but interacting intuitively with it, and the 3-dimensional extents and indices, which with the array_views which let you conveniently reindex arrays. Other than that, it’s pretty much SYCL.


#13

What with the advent of the Khronos group’s Vulkan efforts, their slated changes to support compiling to a uniformly cross-platform IR (SPIR-V) via readily-made-available tools or library calls (which may be easily wrapped into a tool), and the relative simplicity of taking a limited subset of the Rust language and transforming it into OpenCL C, it seems to me that soon the lowest hanging (and not-too-hard-to-reach) fruit will be a procedural macro that invokes a future OpenCL C compiler against OpenCL C code translated from constrained Rust code, dumps the resulting IR into a string literal, and provides a support library with generated code hooks for easy calling of the GPGPU function. Type safety could possibly still be enforced by having the macro output the raw constrained Rust code alongside the stringified IR, thus allowing the type checker to go over the written code; perhaps such would be a dirty hack, but it would still likely be capable of getting the job done.

(I hope the run-on-sentence-police don’t find me)


#14

I wanted to add the same thing.

SPIR-V as an IR is as straightforward as it can be. If you take a look at the specs, it cannot be too hard to generate, and the IR will provide the neccessary constraints on the language features. Rust could follow a similar path as what SYCL is taking. Ordinary host side code is compiled with some compiler (rustc), and there are special library classes which when encountered by another compiler (would be rustc also) trigger special compilation paths, generating host-side code in case of classes like cl::sycl::buffer and generating device-side SPIR-V in the body of kernel functions. The host side code could be either OpenCL 2.1, but to obtain a wider set of devices (mostly not lose those damned Nvidia devices), Vulkan API calls could be generated.