Naming GPU things in the Rust Compiler and Standard Library

As compiling Rust for GPU targets progresses, we increasingly run into the problem of needing names that describe GPU features in a concise and clear way.

There are a couple of existing GPU APIs that are unfortunately not unified in their naming, but they can serve as inspiration. Some discussion has happened in #146181, so I’ll take that as a starting point. Names that are somewhat agreed upon are written in bold. None of this is stable, so everything can be changed and any input is welcome.

Once we have some basic agreements, I think we should write this down in the Rust documentation, also to make it “updatable”. Maybe as 7.4 GPU Target Nomenclature in the Targets section of the rustc book? Or as 16.2 GPU Target Nomenclature in the Platform Support section?

Many things are listed here but they don’t need to be solved at the same time, we can tackle them part by part.

To get it out of the way, (Rust) GPU code that can be launched on the GPU is called a gpu-kernel (from the extern "gpu-kernel" Rust ABI).

Other APIs
  • CUDA/HIP/OpenCL/SYCL call this kernel
  • GLSL/DirectX call this (compute) shader

Hierarchy

Threads on a GPU work in a hierarchy. The GPU hierarchy from largest to tiniest unit is

  1. Launch (also called dispatch): The size that is started from the CPU (it is possible to start things from the GPU, though this is seldom used so far) (in the millions of threads)
  2. Workgroup (also called block or thread-group): The threads that have access to the same workgroup-shared memory (usually up to 1024 threads)
  3. Wave/Wavefront/Warp/Sub-group: The threads executed in SIMD fashion (usually 32 threads, sometimes less or 64)
  4. Thread: One instance of a gpu-kernel, the simplest name here
Other APIs
  1. Launch
  • CUDA/HIP/OpenCL/SYCL use launch and dispatch as synonyms, though use launch more often and use launch in the API function names. Sometimes, grid is used to describe all launched threads
  • GLSL/DirectX call this dispatch
  2. Workgroup
  • CUDA/HIP call this block
  • OpenCL/SYCL call this work-group
  • GLSL calls this work group
  • DirectX calls this thread-group
  3. Wave/Wavefront/Warp
  • CUDA/HIP call this warp (in a few instances, HIP also calls it wave or wavefront)
  • OpenCL/SYCL/GLSL call this sub-group
  • DirectX calls this wave
  4. Thread
  • DirectX calls this lane
  • Everyone else calls this thread (and sometimes also mentions lane)

IMO lane is the hardware part and thread is the software concept running on a lane. Analogously, SIMD is the hardware part and wave/… the software concept. Or CPU core the hardware part and thread the software concept.
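To make the thread/lane distinction concrete, here is a tiny CPU-side sketch of how a thread’s linear position within a workgroup maps to a wave and a lane. The wave size of 32 and the function name are assumptions for illustration, not proposed API:

```rust
// A thread's (linear) id within a workgroup determines which wave it runs in
// and which lane of that wave's SIMD unit it occupies.
// Wave size 32 is an assumption here (AMD hardware may use 64).
const WAVE_SIZE: u32 = 32;

/// Returns (wave index within the workgroup, lane index within the wave).
fn wave_and_lane(thread_in_workgroup: u32) -> (u32, u32) {
    (thread_in_workgroup / WAVE_SIZE, thread_in_workgroup % WAVE_SIZE)
}

fn main() {
    assert_eq!(wave_and_lane(0), (0, 0));
    // Thread 37 of the workgroup runs as lane 5 of wave 1.
    assert_eq!(wave_and_lane(37), (1, 5));
}
```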

Address Spaces

There are multiple, different memory regions (or address spaces in compiler speak) on a GPU (see also the references in the other APIs section for details). Apart from workgroup-shared memory, these address spaces appear only in rather specific cases. Nevertheless, the Rust compiler/standard library needs to know about and support them, at least internally.

  1. The generic address space for everything (0 in LLVM)
  2. The global address space for VRAM
  3. The constant address space for constant memory that can be better optimized
  4. The workgroup-shared address space
  5. The per-thread/private/local address space for thread-private data, i.e. the stack (frame)
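For concreteness, here is a small sketch of how these five regions map to LLVM address-space numbers on the two main GPU backends, as I understand the AMDGPU and NVPTX backend conventions (the two backends happen to agree on the numbering of these five, while the backend-specific names differ):

```rust
/// The five address spaces listed above, with the LLVM address-space numbers
/// used by the AMDGPU and NVPTX backends (numbers per the LLVM backend docs;
/// the backend-specific names differ, see comments).
#[derive(Clone, Copy, Debug, PartialEq)]
enum GpuAddrspace {
    Generic,   // AMDGPU: "flat", NVPTX: "generic"
    Global,    // VRAM
    Constant,  // AMDGPU: "constant", NVPTX: "const"
    Workgroup, // AMDGPU: "local" (LDS), NVPTX: "shared"
    Private,   // AMDGPU: "private", NVPTX: "local" (the stack)
}

fn llvm_addrspace(space: GpuAddrspace) -> u32 {
    match space {
        GpuAddrspace::Generic => 0,
        GpuAddrspace::Global => 1,
        GpuAddrspace::Constant => 4,
        GpuAddrspace::Workgroup => 3,
        GpuAddrspace::Private => 5,
    }
}

fn main() {
    // The generic address space is always 0; the others are non-zero.
    assert_eq!(llvm_addrspace(GpuAddrspace::Generic), 0);
    assert_eq!(llvm_addrspace(GpuAddrspace::Workgroup), 3);
}
```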
Other APIs

The generic address space

  • CUDA/HIP do not seem to mention this explicitly
  • SYCL calls this generic
  • OpenCL/GLSL/DirectX do not have this concept

The global address space

  • CUDA/HIP/OpenCL/SYCL call this global
  • GLSL/DirectX do not have this concept

The constant address space

  • CUDA/HIP/OpenCL/SYCL call this constant
  • GLSL/DirectX do not have this concept

The workgroup-shared address space

  • CUDA/HIP/GLSL call this shared memory
  • OpenCL/SYCL call this local memory
  • DirectX calls this groupshared memory

The private/local address space

  • CUDA calls this local (or private)
  • HIP calls this local or per-thread
  • OpenCL/SYCL call this private memory
  • GLSL/DirectX do not name this concept explicitly (it is just function-local variables)


Intrinsics for Ids/Sizes

GPUs are rather similar, so we can expose a common standard library for many parts.

For every stage in the hierarchy of a GPU launch, a gpu-kernel can query the size and its id. The sizes and ids at most stages are three-dimensional: there are x, y and z coordinates.

Potentially core::gpu would be a fitting place for shared parts?

I thought about a good name for getting the number of launched workgroups, the number of threads in a workgroup, etc. It could be something like workgroup_threads_x/y/z() for the number of threads in a workgroup and launch_workgroups_x/y/z() for the number of workgroups in a launch (a bad name, as it doesn’t launch workgroups), or maybe workgroups_per_launch_x/y/z(), but then it would also need to be threads_per_workgroup_x/y/z().

But then I thought, why not take a more structured approach? (I couldn’t resist the pun)
And have different structs for launch, workgroup and wave scope with methods to query sizes and ids.

fn launch() -> LaunchMetadata;

// The following code lists only the _x component where a real implementation would have _x, _y and _z variants
struct LaunchMetadata {}
impl LaunchMetadata {
	/// Get data inside a workgroup
	fn workgroup() -> WorkgroupMetadata;
	/// Get data inside a wave
	fn wave() -> WaveMetadata;

	/// Global id (x/y/z) (not sure if we should expose this)
	/// `workgroup_x() * workgroup().threads_x_len() + workgroup().thread_x()`
	fn thread_x() -> u64;

	/// Workgroup id (x/y/z)
	fn workgroup_x() -> u32;

	/// Number of launched workgroups (x/y/z)
	fn workgroups_x_len() -> u32;

	/// Number of total threads (x/y/z) (not sure if we should expose this)
	/// `workgroups_x_len() * workgroup().threads_x_len()`
	fn threads_x_len() -> u64;
}

struct WorkgroupMetadata {}
impl WorkgroupMetadata {
	/// Number of waves in a workgroup
	fn waves_len() -> u32;

	/// Number of threads in the workgroup (x/y/z)
	fn threads_x_len() -> u32;

	/// Thread id inside workgroup (x/y/z)
	fn thread_x() -> u32;
}

struct WaveMetadata {}
impl WaveMetadata {
	/// Number of threads in the wave
	/// ~= Number of lanes in a SIMD unit
	fn len() -> u32;

	/// Thread/lane id inside wave
	fn lane_id() -> u32;
}

Then again, thread_in_workgroup_x() is faster to write than launch().workgroup().thread_x().

amdgpu/nvptx can expose intrinsics under their respective names (core::arch::*) and both implement the common interface.

Other intrinsics

This is more in progress than the above sections. As sources of inspiration there is clang’s gpuintrin.h, amdgpuintrin.h and nvptxintrin.h. stdarch#1976 for amdgpu intrinsics is a related PR.

Some general ones:

  • barrier/sync_threads (may be complicated, I think some of these have slightly different semantics between nvptx and amdgpu)
  • sleep (amdgpu: s_sleep, nvptx: nanosleep)
  • exit (could be core::intrinsics::exit?)
  • abort (is core::intrinsics::abort)

Intrinsics that work within a wave:

  • ballot (bool → mask of threads that have true)
  • sync_wave (similar to barrier/sync_threads but only within a wave)

Intrinsics that work within a wave and can be generic over some types, these may make sense as proper #[rustc_intrinsic]s:

  • read_first_lane (for which types can we define this)
  • match_any/match_mask (value → mask of threads in the wave having the same value)
  • match_all (value → bool if all threads in the wave have the same value)
  • lane_scan/wave_prefix_reduce
  • lane_sum/wave_reduce

And, honorable mention, the gpu_launch_sized_workgroup_mem intrinsic to access workgroup-shared memory.
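To pin down the intended semantics of some of these wave intrinsics, here is a scalar CPU emulation over a slice standing in for the lanes of a wave. The names and shapes here are illustrative only, not the proposed signatures:

```rust
// Scalar emulation of a few wave intrinsics over a hypothetical wave of up to
// 32 lanes, to pin down the intended semantics.

/// ballot: each lane contributes a bool; the result is a mask with bit i set
/// iff lane i contributed `true`.
fn ballot(lanes: &[bool]) -> u32 {
    lanes
        .iter()
        .enumerate()
        .fold(0u32, |m, (i, &b)| if b { m | (1 << i) } else { m })
}

/// match_any: for a given lane, the mask of all lanes holding the same value.
fn match_any(lanes: &[u32], lane: usize) -> u32 {
    let v = lanes[lane];
    lanes
        .iter()
        .enumerate()
        .fold(0u32, |m, (i, &x)| if x == v { m | (1 << i) } else { m })
}

/// wave_reduce (sum variant): every lane observes the sum over all lanes.
fn wave_reduce_sum(lanes: &[u32]) -> u32 {
    lanes.iter().sum()
}

fn main() {
    // Lanes 0, 2, 3 vote true -> bits 0, 2, 3 set.
    assert_eq!(ballot(&[true, false, true, true]), 0b1101);
    // Lanes 0 and 2 hold the same value as lane 0.
    assert_eq!(match_any(&[7, 3, 7, 1], 0), 0b0101);
    assert_eq!(wave_reduce_sum(&[1, 2, 3, 4]), 10);
}
```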

10 Likes

As a complete GPU novice, this was a helpful introduction to terminology.

My only thought is to prefer "wave" over "warp" (the sections lower down seem to prefer wave, too). I suppose both names can be both nouns and verbs, but to me "wave" is a more familiar noun (standing waves) that makes sense as a level above threads.

3 Likes

I presume the term "warp" here comes from weaving. It's a very apt metaphor, given that it's the set of longitudinal threads through which the other component, the weft, is drawn (ie. weaved/woven).

5 Likes

It's great that you initiated this discussion. I think this makes sense at this point, considering the multitude of ideas that have been mentioned.

In my opinion, for vendor-specific things we should stick to the vendor-specific naming. As soon as we want to name more general entities, it makes sense to use more general naming. We can use SYCL, OpenCL, etc. as a guide, but we should not consider their names to be the only options. From what I see, this has also been the approach taken by those involved so far.

One example where I personally would not take the naming from OpenCL/SYCL is local memory. In these frameworks it mostly stands for workgroup-shared memory. However, it is easily confused with CUDA/HIP local memory. Instead, here I would use thread local memory and workgroup memory as more verbose terms.

To your structured approach: I really like this idea and I would consider this approach also for gpu_launch_sized_workgroup_mem:

mod gpu {
    struct LaunchSizedWorkgroupMemory;

    impl LaunchSizedWorkgroupMemory {
        fn get_base_ptr<T>() -> *mut T;
    }
}

Then it could be called like this:

gpu::LaunchSizedWorkgroupMemory::get_base_ptr()

This way launch does not sound like a verb to me. However, for the code in core::intrinsics this approach was never taken, so I do not know if this approach is possible and desirable for rustc_intrinsics. So this might rather be the way for intrinsic wrappers.

Nevertheless, for the unstable intrinsics I do not think the naming discussion should block implementing these. Renaming should be possible later when we have found proper names / a proper structure.

5 Likes

I don't know if it's slightly out of scope, but in this MR review comment there was a need to put a name on what calls a kernel.

The "runtime" is a good name. The "driver" is also in the top contestants. Maybe there are other alternatives as well?

Perhaps this should be an item we standardize on as well?

3 Likes

Rust has a fundamental assumption of uniform memory (everything addressable is in address space zero), even more so than C, so this will always continue to be more of a problem for Rust on the GPU than just a naming issue. My 2¢ is that, if this really does appear only in rather specific cases, it should be resolved more like VolatilePtr and less like VolatileCell. (&mut VolatileCell<T> can't have the desired semantics.)

Although, hopefully extern type (or even just using ?Sized to prevent mem::swap) could be sufficient for the compiler to be able to soundly treat a type as existing in a separate address space? I haven't really thought about it too too much, despite extern type being a highly desired feature for me.

I am aware that using Rust on the GPU will probably always require some global concessions to the safety and guarantees provided by CPU Rust in the name of shader performance, hoping that shader validation and GPU architecture will limit the blast radius to only incorrect results in the affected thread cluster.

Picking names familiar to GPU devs is important here. So I think the best option is launch/workgroup/subgroup/thread.

core::arch::wasm is for the wasm target family (both wasm32 and wasm64), so "GPU stuff" could go to core::arch::gpu with cfg(target_family = "gpu").

On the other hand, core::arch items are generally meant to match the vendor intrinsic semantics and naming exactly, with a shared interface outside of arch, so core::gpu makes as much sense as anything else. The alternative would be os::gpu, but with os::ffi moved to top-level, top-level seems fine.

It could be argued that the non vendor intrinsic versions should live in "the GPU version of std," but I really don't know the build system concerns there.

If we look at std::thread for prior art, we would get an API like gpu::available_launch_parallelism() -> NonZero<usize>, gpu::available_workgroup_parallelism_x() -> NonZero<usize> (per-launch), gpu::available_subgroup_parallelism_x() -> NonZero<usize> (per-workgroup), etc.

available_parallelism returns the number per parent because the parent only has that much parallelism available to it, even if an ancestor has access to more parallelism. (Consider a CPU compute cluster with multiple CPUs available as an analogy.)

1 Like

On available_parallelism: It's important to draw a distinction between "my quota share of the entire GPU" (a combination of the hardware and OS policy on the process) and "how many workgroups are in the currently executing launch". The first can be called from the driver/prelaunch code, while the second can only be meaningfully queried from inside a kernel.

I think we need a language feature for "code is inside an extern "gpu-kernel". Perhaps safe #[target_feature(in_gpu_kernel)]?

I think we need a language feature for "code is inside an extern "gpu-kernel". Perhaps safe #[target_feature(in_gpu_kernel)]?

Do you mean textually inside an extern "gpu-kernel" or may only be called (perhaps transitively) from a kernel? Because the latter is implied by the target and we want to be able to use arbitrary no_std crates on the device.

1 Like

Oh, if it's cfg(target_) then we're all good :+1:

Thanks for the comments, that’s all very valuable input!

Extending on the name for the group of threads that runs on a SIMD unit, warp/wave/wavefront/subgroup

  • Warp was coined by Nvidia (it comes from weaving, as jdahlstrom mentioned above)
  • Wavefront was coined by AMD
  • Subgroup is from Khronos (the vendor-crossing APIs like OpenCL)
  • Wave is from DirectX
  • Simdgroup is from Apple/Metal
  • Maybe threadgroup? (but that means workgroup on Apple/Metal)
  • Maybe executiongroup?

I feel like none of these choices is particularly appealing.
Subgroup does not feel fitting (analogously, on CPUs, something running on a core is not called a subprocess, but a thread).
Warp doesn’t resonate with me so far (in its original meaning in weaving, it describes a single, orthogonal thread, but we want to describe a certain group of threads).
Simdgroup doesn’t make sense to me as it’s not a group of SIMDs (that sounds like a better name for an SM on nvidia or WGP on amd).

With the lack of self-explanatory names, I’m currently leaning towards wave as it’s short and unambiguous.

A slightly refined structured API, without nesting and extended with launch_sized_workgroup_mem:

/// Everything from launch scope
struct Launch {}
impl Launch {
	/// Workgroup id (x/y/z)
	fn workgroup_x() -> u32;

	/// Number of launched workgroups (x/y/z)
	fn workgroups_x_len() -> u32;
}

/// Everything from workgroup scope
struct Workgroup {}
impl Workgroup {
	/// Number of waves in a workgroup
	fn waves_len() -> u32;

	/// Number of threads in the workgroup (x/y/z)
	fn threads_x_len() -> u32;

	/// Thread id inside workgroup (x/y/z)
	fn thread_x() -> u32;

	/// Maybe makes sense to expose gpu_launch_sized_workgroup_mem() here
	/// as it is on the workgroup scope?
	fn launch_sized_mem<T>() -> *mut T;
}

/// Everything from wave scope
struct Wave {}
impl Wave {
	/// Number of threads in the wave
	/// ~= Number of lanes in a SIMD unit
	fn len() -> u32;

	/// Thread/lane id inside wave
	fn thread_id() -> u32;

	// Could include wave intrinsics like ballot, read_first_lane, reduces, etc. here
}
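To see how the refined API reads at a call site, here is a CPU-side mock with hard-coded sizes. This is purely illustrative (a real implementation would lower these calls to the vendor intrinsics):

```rust
// Purely illustrative CPU-side mock of the proposed scope structs, with
// hard-coded sizes, showing how a kernel would combine them into a global id.
struct Launch;
impl Launch {
    fn workgroup_x() -> u32 { 2 }           // pretend we are workgroup 2
    fn workgroups_x_len() -> u32 { 16 }     // 16 workgroups launched in x
}

struct Workgroup;
impl Workgroup {
    fn threads_x_len() -> u32 { 256 }       // 256 threads per workgroup in x
    fn thread_x() -> u32 { 7 }              // pretend we are thread 7
}

fn main() {
    // Global thread id in x: workgroup id * workgroup size + local id.
    let global_x = u64::from(Launch::workgroup_x())
        * u64::from(Workgroup::threads_x_len())
        + u64::from(Workgroup::thread_x());
    assert_eq!(global_x, 519);
    assert!(Launch::workgroup_x() < Launch::workgroups_x_len());
}
```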

As a place for writing this down, I think the module documentation of core::gpu would fit well.

New/changed naming:

  • workgroup-shared memory → workgroup memory
  • The thing that launches a gpu-kernel: runtime (i.e. an application calls into the runtime, which then launches the gpu-kernel on the GPU)

Regarding address spaces

I agree that something like AddrspacePtr<T, const ADDRSPACE: Addrspace /* enum? */> akin to VolatilePtr would be nice. I tried to take a look at implementing it but didn’t get very far. I didn’t find an easy way to define a type whose implementation lives in the compiler. We want an AddrspacePtr to be Sized, but the size depends on the generic ADDRSPACE argument (in that sense it’s different from extern type, which doesn’t have a size and cannot be stored by value). Adding it as ty::AddrspacePtr turned out to be a rather large change (maybe adding an addrspace argument to ty::RawPtr would make it easier?). But that can be a separate discussion.

Misc

In my opinion for vendor specific things we should stick to the vendor specific naming. […]

I agree.

However, for the code in core::intrinsics this approach was never taken, so I do not know if this approach is possible and desirable for rustc_intrinsics. So this might rather be the way for intrinsic wrappers.

Right, the core::gpu module should be a wrapper for (vendor specific) intrinsics. I think it’s something we can eventually stabilize (compared to (rustc)intrinsics, which are mostly not stabilized).

gpu::available_*_parallelism

I agree with the other answer above: the available parallelism doesn’t match what we want here (the name does make sense in std::thread for the ~CPU count). The GPU launch/workgroup sizes are specified on the CPU when launching a gpu-kernel, and the kernel wants to get those exact, specified sizes.

1 Like

The setup will probably be that you have some number of primitive/intrinsic types, one as an opaque pointer into each relevant address space. Then you have in library space something like

use core::marker::PhantomData;

pub struct GpuPtr<T: Sized, S: AddressSpace>(S::Addr, PhantomData<*mut T>);

#[unstable(feature = "gpu_ptr_internals")]
pub trait AddressSpace {
    type Addr;
    // etc
}

pub struct WorkgroupMemory;
impl AddressSpace for WorkgroupMemory {
    type Addr = intrinsics::workmem_ptr;
}
// or whatever is needed

if you want to have the address space signifier different from the raw pointer type. Or we could directly make the alt-address-space pointer be the type parameter directly.

I don't know how much LLVM optimization depends on keeping GPU pointers as `ptr` vs as `i32` in LLVM IR (or whatever the address size is); if just using the address works (especially as an initial implementation) then the pointer type theoretically could be almost entirely library side with LLVM intrinsic externs.

So my logic in using the available_parallelism name is that it represents the available parallelism as visible/available to the running code. The CPU available_parallelism numbers might not match the hardware numbers if virtualization or resource limiting of some sort is going on (even if the current impl happens to not respect such; I don't know either whether it does or not).

Is there a name for this concept in (standard, non-extension, or at least supported by multiple GPU vendors) Vulkan/Khronos APIs? AIUI those are intended to be at least somewhat target independent, so looking to them for target-independent naming ideas makes sense.

If this is all static information (or rather, always queried relative to the current execution context), I'd lean towards keeping them as just top level fn, though putting them in submods for grouping purposes is reasonable. I'd keep the actual struct for handles/ids (as much as those exist on the GPU) like thread::current().

1 Like

Thanks for the help @CAD97! I got AddrspacePtr<T, const ADDRSPACE: u32> working and opened a PR for it: Add AddrspacePtr for pointers to non-0 addrspaces by Flakebi · Pull Request #150452 · rust-lang/rust · GitHub

I don't know how much LLVM optimization depends on keeping gpu pointers as ptr vs as i32

inttoptr (ptrtoint %ptr) cannot be optimized away because of pointer provenance, so it would effectively alias everything with everything, which is not too great.
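The Rust-level analogue of that round trip is casting a pointer to an integer and back, which is exactly the pattern that blocks alias analysis. A small illustration (the runtime behavior is fine; the cost is in what the optimizer may assume):

```rust
fn main() {
    let mut x = [1u32, 2, 3];
    let p = x.as_mut_ptr();

    // Integer round trip: `ptrtoint` followed by `inttoptr` in LLVM terms.
    let addr = p as usize;
    let q = addr as *mut u32;

    // Runtime behavior is unchanged...
    unsafe {
        *q.add(1) = 5;
        assert_eq!(*q.add(1), 5);
    }
    assert_eq!(x[1], 5);
    // ...but because `q` was materialized from a plain integer, the optimizer
    // must assume it may alias any pointer whose address was exposed, which is
    // why lowering GPU pointers to plain `i32` values would "alias everything".
}
```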

Khronos uses sub-group. I’m fine with that name if the people here think we should go with it, but I’m not much of a fan. Sub-group could mean many things (it’s just a subset of a group, but which one?), so I think a distinct name would be better.

mod sounds good to me.

I don't think we want to expose LLVM's addrspace attribute to user code. It seems too LLVM-specific.

We may still use it for the implementation in std (or not), and we can expose more specific GPU-only pointer types.

Hi all! I see some discussion of names for different levels of the execution hierarchy. Have y'all considered OpenACC's names: gang, worker, and vector?

One of the risks of using hardware-specific terms like "warp" or "wavefront" is that it may lead to users assuming things about forward progress guarantees of the entities in that thing.

I find simdgroup the most intuitive naming. I parsed it as "a group of threads that form a SIMD vector", which gives a better intuition about what the hardware is doing.

Oh, I didn’t know the OpenACC names. I like the name “vector”, it rings nicely with SIMD unit (as the hardware concept a vector runs on)!

Reading documentation, I’m not quite sure what gang and worker are (it seems like vector=worker=warp/wavefront, workers=gang=workgroup, gangs=launch), so I’m inclined to keep the current launch+workgroup naming.

Although, the work in workgroup doesn’t really mean anything, so we could as well just call it “group”?

Then we would have

  1. Launch (launched as x/y/z groups)
  2. Group (launched as x/y/z threads)
  3. Vector (64/32/… threads within a group)
  4. Thread

Surely it is meant that way, and having simd in the name makes it quite nice :slight_smile:
But it breaks the system with other wording, including Metal’s own names:

workgroup → “group of work” (not used in Metal but by other APIs, potentially including Rust)
threadgroup → “group of threads” (used in Metal)
simdgroup → not “group of simds”

The terms "gang," "worker," and "vector" refer to OpenACC's software abstraction that exposes levels of parallelism. OpenACC implementations have freedom to map those to actual hardware in different ways.

For example, a gang might map to a CUDA thread block, but the mapping of workers and vectors "in" a worker might depend on whether that part of the OpenACC program ever enters vector-partitioned mode (see the "Execution Model" section in the Introduction chapter of the OpenACC Standard). If an OpenACC loop only ever uses gang and worker parallelism, then a reasonable mapping would be one CUDA thread per worker. Likewise, for a loop that only ever uses gang and vector parallelism, a reasonable mapping might be one CUDA thread per vector.

I'm not an expert on our implementation, but I can ask some colleagues if you have questions!

"SIMD" carries strong implications of a particular execution model. Users who assume that execution model will suffer bugs if it is ever relaxed. People who wrote code for pre-Volta GPUs found out about that when Volta came out. (Please refer to the Independent Thread Scheduling section of the CUDA Programming Guide.)

If I'm not misunderstanding something...
Guarantees that values are valid should be kept, otherwise returning a NonZeroU32 from the shader can start a blast on CPU side as well.

Having not seen it mentioned already in this thread, I think looking at what people have already done to abstract/unify multi-vendor CPU/GPU parallel computation in Rust could be interesting: GitHub - tracel-ai/cubecl: Multi-platform high-performance compute language extension for Rust.