(Reading the post takes only ~10 minutes, but writing it took 2+ hours.)
In my opinion, should a project like this be revived (which I think would do only good), the first things that need to be decided are:
- What mechanisms to use to decorate GPU functions, and how to compile them?
- How to annotate memory namespaces?
- What features of Rust does one wish to propagate through to kernels?
The spectrum of tools available in C++ for calling kernels is:
- CUDA style language extension. GPU functions are decorated with qualifiers resembling calling conventions (`__global__ void myFunc(...)`), and kernels are launched with a special triple angle bracket operator (`myFunc<<<grid, block>>>(...)`).
- C++AMP style language extension. GPU functions are decorated with qualifiers resembling concepts (the original version of concepts, that is): `void myFunc(...) restrict(cpu,amp)`. This is mixed with a bit of library magic, because kernels are invoked through a specific library function (`concurrency::parallel_for_each(...)`).
- SYCL style library magic. GPU functions are ordinary host-side functions. Kernels are invoked through special library functions that accept conforming functors (`cl::sycl::parallel_for(...)`).
- OpenACC/OpenMP style pragma directives. Ordinary for loops are decorated with `#pragma` directives.
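Of these four styles, pure library magic is the only one expressible without language extensions, which makes it a natural first experiment for Rust. A minimal CPU-only sketch (all names hypothetical; a real implementation would enqueue work on a device rather than loop on the host) of what a SYCL-flavoured launch interface could look like:

```rust
// Hypothetical sketch of a SYCL-style "library magic" kernel launch in Rust.
// Nothing here touches a GPU; the closure simply runs on the host. That is
// exactly what makes this style attractive: the code stays valid Rust even
// for a compiler that knows nothing about GPUs.

/// Hypothetical launch primitive: runs `kernel` once per index in
/// 0..global_size. A real backend would dispatch these as work-items.
fn parallel_for<F: FnMut(usize)>(global_size: usize, mut kernel: F) {
    for idx in 0..global_size {
        kernel(idx);
    }
}

/// Element-wise addition, written the way a SYCL user writes a kernel launch.
fn vec_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; a.len()];
    parallel_for(a.len(), |i| c[i] = a[i] + b[i]);
    c
}

fn main() {
    let c = vec_add(&[1.0; 8], &[2.0; 8]);
    println!("c = {:?}", c);
}
```

Note that the borrow checker already enforces more here than any of the C++ variants do: the kernel closure cannot alias its output buffer with its inputs.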
The spectrum of tools available in C++ for annotating memory namespaces is:
- CUDA style language extension with API support.
1.1. On the host side, one calls special `cudaMalloc(...)`-style functions, which return what appear to be pointers but are in truth handles to device memory; these pointers are only meaningful when used inside API functions. With SVM on the rise, pointer values can have actual meaning when passed raw to kernels, and even without physically shared memory, the API handles the address translation.
1.2. On the device side, variables are decorated with qualifiers (`__shared__`) to annotate memory namespaces.
- C++AMP style language extension with library magic.
2.1. On the host side, one uses templated library containers that can be passed to kernel functions.
2.2. On the device side, variables are decorated with qualifiers (`tile_static`) to annotate memory namespaces.
- SYCL style library magic.
3.1. On the host side, one uses templated library containers that can be passed to kernel functors.
3.2. On the device side, one uses templated library classes to obtain pointers into the various memory namespaces (`cl::sycl::accessor<..., cl::sycl::access::local>`).
- OpenACC/OpenMP style pragma clauses with compiler magic.
4.1. On the host side, regular C/C++ memory allocation is performed. Memory movement is implicit on kernel launch.
4.2. On the device side, special `#pragma` clauses control shared memory allocation.
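On the Rust side, the host containers of the C++AMP/SYCL style could plausibly become ordinary generic types, with shared and unique borrows standing in for the read/write access modes that C++ encodes in template parameters. A CPU-only sketch with hypothetical names (a real backend would own device memory and schedule transfers):

```rust
/// Hypothetical device buffer in the spirit of concurrency::array<> or
/// cl::sycl::buffer<>. Here it merely wraps a host-side Vec.
struct Buffer<T> {
    host_copy: Vec<T>,
}

impl<T: Clone> Buffer<T> {
    fn from_slice(data: &[T]) -> Self {
        Buffer { host_copy: data.to_vec() }
    }

    /// Read accessor: a shared borrow is how Rust could express
    /// an access::mode::read-style annotation without templates.
    fn read(&self) -> &[T] {
        &self.host_copy
    }

    /// Write accessor: a unique borrow stands in for access::mode::write.
    fn write(&mut self) -> &mut [T] {
        &mut self.host_copy
    }
}

fn main() {
    let a = Buffer::from_slice(&[1, 2, 3, 4]);
    let mut c = Buffer::from_slice(&[0; 4]);
    // The "kernel" reads through one accessor and writes through another.
    for (dst, src) in c.write().iter_mut().zip(a.read()) {
        *dst = src * 10;
    }
    println!("{:?}", c.read());
}
```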
All of the above single-source GPGPU languages/APIs restrict language features to a subset of C99/C11/C++98/C++11/C++14. Usually anything dynamic is forbidden, but the newest versions of all the APIs are becoming more and more relaxed in terms of restrictions, which are the following:
- CUDA 6.5 allows dynamic memory allocation inside kernels (over-allocating or segfaulting results in a failure reported to the host) and the use of function pointers, but no direct references to host-side variables, no dynamic polymorphism, and no exceptions.
- C++AMP 1.0 allows any C++ feature to be used (that the compiler understands), as long as the types in kernel lambdas are amp-compatible. That means types used in kernels must not have virtual functions, kernels must not dynamically allocate memory and cannot hold references to any types other than `concurrency::array<>` (the templated containers created on the host), no recursion, no RTTI, no exceptions; same old, same old. The constructor of `concurrency::array` is decorated with `restrict(cpu)` only, so, as one would expect, only the host can allocate device memory.
- SYCL 1.2 allows any C++11 technique to be used, apart from the same old, same old.
- Allow me to omit this part. I have never used OpenACC/OpenMP 4.0, and never will.
Violating any of these preconditions can always be detected at compile time. Note that none of this is a complete list; it is just a showcase to give a feeling for the landscape.
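Rust already has a mechanism for exactly this kind of compile-time restriction: marker traits. A hypothetical sketch of how "kernel-compatible" types could be enforced by the type system, rather than by a `restrict(cpu,amp)`-style extension (the trait name and the hand-written impls are my invention; a real implementation would presumably auto-derive them in the compiler):

```rust
/// Hypothetical marker for types that are legal inside a kernel, playing the
/// role of C++AMP's "amp-compatible" rules. Implemented by hand here; a GPU-
/// aware compiler would implement it automatically for eligible types.
trait KernelSafe {}

impl KernelSafe for f32 {}
impl KernelSafe for u32 {}
// Deliberately NOT implemented for String, trait objects, etc., so using
// such a type in a kernel fails at compile time rather than at run time.

/// Kernel launches only accept element types that are KernelSafe.
fn launch<T: KernelSafe + Copy, F: FnMut(T) -> T>(data: &mut [T], mut kernel: F) {
    for x in data.iter_mut() {
        *x = kernel(*x);
    }
}

fn main() {
    let mut data = [1.0f32, 2.0, 3.0];
    launch(&mut data, |x| x * 2.0);
    println!("{:?}", data);

    // let mut bad = [String::from("no")];
    // launch(&mut bad, |s| s); // error: String does not implement KernelSafe
}
```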
This is a rough overview of how single-source GPGPU is done in C/C++. I have no idea what would be the best way to do something similar in Rust. I have ideas, but I have not followed the evolution of Rust, and I am no compiler dev or software engineer (although I do tend to feel like one). Just a brief quote from the linked blog post:
> ...I added a `-Zptx` flag to rustc and started making minor changes to the translation pass. Functions that have the `#[kernel]` attribute get compiled to use the `ptx_kernel` calling convention, which tells NVPTX to add the `.entry` line. According to Patrick, we should probably use a new ABI setting instead, as arbitrary attributes aren't part of the function's type.
I have no idea how function attributes work, nor do I know what introducing a new ABI setting for a function would look like in code. I am also unsure what would be the best way to launch kernels (see earlier: special functions, new operators...) and what best fits the Rust philosophy. Some of the above-mentioned techniques are very intrusive, others are very subtle. CUDA is intrusive in the sense that it introduces many new things, and without an understanding of them the code is meaningless. C++AMP is also intrusive, but it introduces a type-system enhancement orthogonal to the language, and thus integrates a bit more elegantly. SYCL only uses library classes that behave in special ways; while this may be counter-intuitive, code read without an understanding of SYCL constructs is still valid C++, and implementations of the special classes can exist that even make the code correct. OpenACC and OpenMP code can be written in such a way that a compiler that does not understand the pragmas, and as such simply omits them, will still produce valid AND correct serial code.
I do not know what the core Rust developers think about introducing GPU parallelism into the language. I know that the C++ committee is very reluctant to touch upon GPU parallelism, as they believe GPUs are evolving too fast to be standardized and brought into the language, either via language features or via STL features. If something makes it into the C++ standard, it essentially never makes it out in a non-backward-compatible manner. If Rust wishes to be defensive about it, it will introduce GPU parallelism as an optional feature (a compiler flag), or anything else that can later be turned off or deemed deprecated, should GPUs fly by in terms of features (or existence).
It seems to me that heterogeneous computing will most certainly be the future: latency-optimized cores (CPUs) and throughput-optimized cores (IGPs) will share memory with an increasing level of transparency. OpenCL 2.0 abstracts the notion of SVM, and the highest possible level of memory sharing is atomically correct shared virtual memory. AMD Kaveri already supports it, and the upcoming Intel Skylake is known to be supporting it. Mobile chips usually do not implement such costly features, as they are considered too expensive in the mobile world. For discrete GPUs plugged into PCIe slots, it is questionable whether they will survive in 5-10 years' time, or whether the latency of the bus (which essentially kills off atomic correctness) will eventually become such a burden that game developers will no longer want to tolerate it. The HSA architecture neatly abstracts out all these entities and features that devices can (and must) implement in some way, while OpenCL tends to focus a bit more on actual use, but lacks features such as QoS and preemption. (HSA has no associated language of its own, so OpenCL is one 'front-end' to HSA, C++AMP being another.)
One needs to set a goal for which features would be imported into such a single-source Rust GPGPU project, and which would be left out. OpenCL 2.0 is capable of dynamic parallelism (kernels launching new kernels, in various ways), even if that is on a different device, even the CPU. In that case, however, the function launched on the CPU must be a kernel itself and thus abide by all the rules and restrictions imposed on kernels. Are atomically correct memory and pointer-value validity on both host and device considered mandatory? (It is most likely much easier to write a compiler for such a case.) If one goes further, an HSA backend is not restricted to kernels only being able to launch kernels. In HSA, kernels can launch ordinary CPU functions and await their result, launch them asynchronously, and what not. The only reason this is not available to programmers is that no 'front-end' capable of compiling to HSAIL allows such things; they result in compile-time errors even though both the HW and the IL support them, because neither OpenCL nor C++AMP was designed for such a high level of interoperability between devices. Where does one want to draw the line?
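To make the scope question concrete: dynamic parallelism simply means the launch primitive is itself callable from inside a kernel body. A serial, CPU-only Rust sketch (hypothetical names; no real device queues involved) of a parent kernel enqueueing child launches:

```rust
/// Hypothetical launch primitive; a real backend would enqueue work on a
/// device queue, possibly a *different* device than the parent runs on.
fn enqueue<F: FnMut(usize)>(global_size: usize, mut kernel: F) {
    for idx in 0..global_size {
        kernel(idx);
    }
}

/// Parent kernel: each work-item enqueues a child launch over its own row,
/// which is the shape of launch that OpenCL 2.0 dynamic parallelism allows.
fn scale_rows(matrix: &mut [Vec<f32>], factor: f32) {
    let rows = matrix.len();
    enqueue(rows, |r| {
        let cols = matrix[r].len();
        // Child launch issued from inside the parent kernel body.
        enqueue(cols, |c| matrix[r][c] *= factor);
    });
}

fn main() {
    let mut m = vec![vec![1.0f32; 3]; 2];
    scale_rows(&mut m, 2.0);
    println!("{:?}", m);
}
```

Whether something like the inner `enqueue` is allowed at all, and on which devices, is precisely the scope decision the project would have to make up front.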
I believe this is the order in which things must be decided:
- What would be the scope of single-source GPGPU?
- What back-ends (HW/SW) are capable of implementing the required features?
- What language features would be needed to give a neat interface to the Rust user?
- How on Earth would that be compiled?
I am always up for discussions like this, and might even be willing to help with interface design, but I am essentially a Rust noob (though a seasoned C++ programmer can pick up Rust fairly fast, once the new way of thinking in Rust has been picked up). As for actually coding on rustc, I am definitely not the person to contact, but I have said that already.