As compiling Rust for GPU targets is progressing, we’re more often hitting the problem that we want a name that describes a GPU feature in a concise and clear way.
There are a couple of existing APIs for GPUs that unfortunately do not agree on naming, but they can serve as inspiration. Some discussion has happened in #146181, so I’ll take this as a starting point. Somewhat-agreed names are written in bold. None of this is stable, so everything can be changed and any input is welcome.
Once we have some basic agreements, I think we should write this down in the Rust documentation, also to make it “updatable”.
Maybe as 7.4 GPU Target Nomenclature in the Targets section of the rustc book?
Or as 16.2 GPU Target Nomenclature in the Platform Support section?
Many things are listed here but they don’t need to be solved at the same time, we can tackle them part by part.
To get it out of the way, (Rust) GPU code that can be launched on the GPU is called a gpu-kernel (from the extern "gpu-kernel" Rust ABI).
Other APIs
- CUDA/HIP/OpenCL/SYCL call this kernel
- GLSL/DirectX call this (compute) shader
Hierarchy
Threads on a GPU are organized in a hierarchy. From largest to smallest unit, the GPU hierarchy is
- Launch (also called dispatch): The size that is started from the CPU (it is possible to start things from the GPU, though this is seldom used so far) (in the millions of threads)
- Workgroup (also called block or thread-group): The threads that have access to the same workgroup-shared memory (usually up to 1024 threads)
- Wave/Wavefront/Warp/Sub-group: The threads executed in SIMD fashion (usually 32 threads, sometimes less or 64)
- Thread: One instance of a gpu-kernel, the simplest name here
Other APIs
- Launch
- CUDA/HIP/OpenCL/SYCL use launch and dispatch as synonyms, though launch is used more often and appears in the API function names. Sometimes, grid is used to describe all launched threads
- GLSL/DirectX call this dispatch
- Workgroup
- CUDA/HIP call this block
- OpenCL/SYCL call this work-group
- GLSL calls this work group
- DirectX calls this thread-group
- Wave/Wavefront/Warp
- CUDA/HIP call this warp (in a few instances, HIP also calls it wave or wavefront)
- OpenCL/SYCL/GLSL call this sub-group
- DirectX calls this wave
- Thread
- DirectX calls this lane
- Everyone else calls this thread (and sometimes also mentions lane)
IMO lane is the hardware part and thread is the software concept running on a lane. Analogously, the SIMD unit is the hardware part and the wave/… the software concept, or the CPU core is the hardware part and the thread the software concept.
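For intuition, here is a minimal numeric sketch of how these levels relate, assuming typical (but by no means mandated) sizes of 256-thread workgroups and 32-thread waves:

```rust
// Sketch of the launch hierarchy with typical sizes; GPUs and APIs vary.
fn main() {
    let launch_threads: u64 = 1 << 20; // ~1M threads started from the CPU
    let workgroup_size: u64 = 256;     // threads sharing workgroup memory
    let wave_size: u64 = 32;           // threads executing in SIMD lockstep

    let workgroups_per_launch = launch_threads / workgroup_size;
    let waves_per_workgroup = workgroup_size / wave_size;
    println!(
        "{} workgroups * {} waves * {} threads",
        workgroups_per_launch, waves_per_workgroup, wave_size
    );
}
```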
Address Spaces
There are multiple different memory regions (or address spaces in compiler speak) on a GPU (see also the references in the other APIs section for details). Apart from workgroup-shared memory, these address spaces appear only in rather specific cases. Still, the Rust compiler/standard library needs to know about and support them at least internally.
- The generic address space for everything (0 in LLVM)
- The global address space for VRAM
- The constant address space for constant memory that can be optimized better
- The workgroup-shared address space
- The per-thread/private/local address space for thread-private data, i.e. the stack (frame)
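For reference, the two LLVM GPU backends already assign numbers to these regions. The values below are what I understand the LLVM NVPTX and AMDGPU usage documentation to say (worth double-checking); the lookup functions are purely illustrative, not a proposed API:

```rust
// LLVM address-space numbers as used by the nvptx backend (per the LLVM docs).
fn nvptx_addrspace(name: &str) -> Option<u32> {
    Some(match name {
        "generic" => 0,
        "global" => 1,
        "shared" => 3, // workgroup-shared
        "const" => 4,
        "local" => 5,  // per-thread
        _ => return None,
    })
}

// LLVM address-space numbers as used by the amdgpu backend (per the LLVM docs).
fn amdgpu_addrspace(name: &str) -> Option<u32> {
    Some(match name {
        "flat" => 0, // generic
        "global" => 1,
        "local" => 3, // workgroup-shared (LDS)
        "constant" => 4,
        "private" => 5, // per-thread
        _ => return None,
    })
}

fn main() {
    // The numbering happens to line up between the two backends.
    assert_eq!(nvptx_addrspace("generic"), amdgpu_addrspace("flat"));
    assert_eq!(nvptx_addrspace("shared"), amdgpu_addrspace("local"));
    println!("ok");
}
```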
Other APIs
The generic address space
- CUDA/HIP do not seem to mention this explicitly
- SYCL calls this generic
- OpenCL/GLSL/DirectX do not have this concept
The global address space
- CUDA/HIP/OpenCL/SYCL call this global
- GLSL/DirectX do not have this concept
The constant address space
- CUDA/HIP/OpenCL/SYCL call this constant
- GLSL/DirectX do not have this concept
The workgroup-shared address space
- CUDA/HIP/GLSL call this shared memory
- OpenCL/SYCL call this local memory
- DirectX calls this groupshared memory
The private/local address space
- CUDA calls this local (or private)
- HIP calls this local or per-thread
- OpenCL/SYCL call this private memory
- GLSL/DirectX do not seem to have an explicit name for this concept
Some references:
Intrinsics for Ids/Sizes
GPUs are rather similar, so we can expose a common standard library for many parts.
For every stage in the hierarchy of a GPU launch, a gpu-kernel can query its size and its id. The sizes and ids at most stages are three-dimensional: there is an x, y, and z coordinate.
Potentially core::gpu would be a fitting place for shared parts?
I thought about a good name to get the number of launched workgroups, the number of threads in a workgroup, etc. It would be something like workgroup_threads_x/y/z() for the number of threads in a workgroup and then launch_workgroups_x/y/z() for the number of workgroups in a launch (a bad name, as it doesn’t launch workgroups), or maybe workgroups_per_launch_x/y/z(), but then it would also need to be threads_per_workgroup_x/y/z().
But then I thought, why not take a more structured approach? (I couldn’t resist the pun) That is, have different structs for the launch, workgroup, and wave scope with methods to query sizes and ids:
```rust
fn launch() -> LaunchMetadata;

// The following code lists only the _x component where a real implementation
// would have _x, _y and _z variants.
struct LaunchMetadata {}

impl LaunchMetadata {
    /// Get data inside a workgroup
    fn workgroup() -> WorkgroupMetadata;
    /// Get data inside a wave
    fn wave() -> WaveMetadata;
    /// Global id (x/y/z) (not sure if we should expose this)
    /// `workgroup_x() * workgroup().threads_x_len() + workgroup().thread_x()`
    fn thread_x() -> u64;
    /// Workgroup id (x/y/z)
    fn workgroup_x() -> u32;
    /// Number of launched workgroups (x/y/z)
    fn workgroups_x_len() -> u32;
    /// Number of total threads (x/y/z) (not sure if we should expose this)
    /// `workgroups_x_len() * workgroup().threads_x_len()`
    fn threads_x_len() -> u64;
}

struct WorkgroupMetadata {}

impl WorkgroupMetadata {
    /// Number of waves in a workgroup
    fn waves_len() -> u32;
    /// Number of threads in the workgroup (x/y/z)
    fn threads_x_len() -> u32;
    /// Thread id inside workgroup (x/y/z)
    fn thread_x() -> u32;
}

struct WaveMetadata {}

impl WaveMetadata {
    /// Number of threads in the wave
    /// ~= Number of lanes in a SIMD unit
    fn len() -> u32;
    /// Thread/lane id inside wave
    fn lane_id() -> u32;
}
```
Then again, thread_in_workgroup_x() is faster to write than launch().workgroup().thread_x().
amdgpu/nvptx can expose intrinsics under their respective names (core::arch::*) and both implement the common interface.
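To get a feel for the ergonomics, here is a host-side mock of the proposed API (the field names and the concrete values are made up; a real implementation would back the methods with amdgpu/nvptx intrinsics rather than fields):

```rust
// Host-side mock of the proposed metadata API; values are invented.
struct LaunchMetadata {
    workgroup_x: u32,
}
struct WorkgroupMetadata {
    thread_x: u32,
    threads_x_len: u32,
}

impl LaunchMetadata {
    fn workgroup(&self) -> WorkgroupMetadata {
        // Pretend this thread is number 7 in a 128-thread workgroup.
        WorkgroupMetadata { thread_x: 7, threads_x_len: 128 }
    }
    /// Global thread id, composed as in the doc comment of the sketch above.
    fn thread_x(&self) -> u64 {
        let wg = self.workgroup();
        self.workgroup_x as u64 * wg.threads_x_len as u64 + wg.thread_x as u64
    }
}

fn main() {
    let launch = LaunchMetadata { workgroup_x: 3 };
    println!("{}", launch.thread_x()); // 3 * 128 + 7 = 391
}
```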
Other intrinsics
This is more in progress than the above sections. As sources of inspiration there is clang’s gpuintrin.h, amdgpuintrin.h and nvptxintrin.h. stdarch#1976 for amdgpu intrinsics is a related PR.
Some general ones:
- barrier/sync_threads (may be complicated, I think some of these have slightly different semantics between nvptx and amdgpu)
- sleep (amdgpu: s_sleep, nvptx: nanosleep)
- exit (could be core::intrinsics::exit?)
- abort (is core::intrinsics::abort)
Intrinsics that work within a wave:
- ballot (bool → mask of threads that have true)
- sync_wave (similar to barrier/sync_threads but only within a wave)
Intrinsics that work within a wave and can be generic over some types, these may make sense as proper #[rustc_intrinsic]s:
- read_first_lane (for which types can we define this?)
- match_any/match_mask (value → mask of threads in the wave having the same value)
- match_all (value → bool if all threads in the wave have the same value)
- lane_scan/wave_prefix_reduce
- lane_sum/wave_reduce
And, honorable mention, the gpu_launch_sized_workgroup_mem intrinsic to access workgroup-shared memory.
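To pin down the intended semantics of the wave-wide intrinsics, here is a CPU simulation for a small wave (the names and shapes are assumptions taken from the list above, not a settled API; a real implementation maps to hardware instructions, not loops):

```rust
// CPU simulation of wave-wide intrinsic semantics for a 4-lane "wave".

/// ballot: each lane contributes a bool, the result is a lane mask.
fn ballot(preds: &[bool]) -> u64 {
    preds
        .iter()
        .enumerate()
        .fold(0, |mask, (lane, &p)| if p { mask | (1 << lane) } else { mask })
}

/// match_any: for each lane, the mask of lanes holding the same value.
fn match_any(vals: &[u32]) -> Vec<u64> {
    vals.iter()
        .map(|v| ballot(&vals.iter().map(|w| w == v).collect::<Vec<bool>>()))
        .collect()
}

/// wave_reduce (sum flavor): every lane receives the sum over the wave.
fn wave_sum(vals: &[u32]) -> u32 {
    vals.iter().sum()
}

fn main() {
    // Lanes 0, 2, and 3 vote true.
    assert_eq!(ballot(&[true, false, true, true]), 0b1101);
    // Lanes 0/2 share the value 7, lanes 1/3 share the value 9.
    assert_eq!(match_any(&[7, 9, 7, 9]), vec![0b0101, 0b1010, 0b0101, 0b1010]);
    assert_eq!(wave_sum(&[1, 2, 3, 4]), 10);
    println!("ok");
}
```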