Need input regarding very weak memory models (GPUs, VMs...)

With some help from the UCG group, I've recently been trying to improve the imperfect state of Rust's volatile by providing an alternative to ptr::[read|write]_volatile that builds on LLVM's atomic volatile loads and stores rather than non-atomic ones.

The expected benefits of this alternative are that...

  • Data races become unambiguously well-defined behavior, instead of being in this weird situation where LLVM claims that volatile data races are UB but happens to always compile them into something sensible in practice.
  • Accidental load/store tearing becomes a thing of the past, as asking for a load/store which is not supported at the target level is now a compile-time error.
  • The proposed API makes it hard to accidentally mix volatile and non-volatile accesses to a memory location, which is almost always a mistake in some volatile usage scenarios.
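To make the last point concrete, here is a minimal sketch of how a wrapper type can rule out mixed accesses. The name `VolatileCell` is hypothetical and this is not the proposed API; it also uses today's non-atomic `ptr::read_volatile`/`ptr::write_volatile` as a stand-in, since atomic volatile accesses are not exposed in stable Rust.

```rust
use std::cell::UnsafeCell;
use std::ptr;

/// Hypothetical wrapper (not the proposed API): the value can only be
/// reached through volatile accesses, so volatile and non-volatile
/// accesses to it cannot be mixed by accident.
#[repr(transparent)]
pub struct VolatileCell<T: Copy> {
    inner: UnsafeCell<T>,
}

impl<T: Copy> VolatileCell<T> {
    pub fn new(value: T) -> Self {
        Self { inner: UnsafeCell::new(value) }
    }

    /// Volatile load. Stand-in: non-atomic `read_volatile`.
    pub fn read(&self) -> T {
        // SAFETY: the pointer comes from a live UnsafeCell, so it is
        // valid and properly aligned for T.
        unsafe { ptr::read_volatile(self.inner.get()) }
    }

    /// Volatile store. Stand-in: non-atomic `write_volatile`.
    pub fn write(&self, value: T) {
        // SAFETY: same as in `read`.
        unsafe { ptr::write_volatile(self.inner.get(), value) }
    }
}
```

Because `UnsafeCell` is not `Sync`, this sketch cannot be shared between threads; making that sound is exactly where atomic volatile would come in.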

One open question, however, is whether the proposed API can fully replace ptr::[read|write]_volatile in all circumstances as currently specified, in which case the former API could just be reimplemented in terms of its successor.

What's unclear here is if there is a way to specify atomic volatile accesses so that they work on targets that have a very weak memory model without global cache coherence, where even Relaxed atomics require special load/store instructions or synchronization, such as GPUs (NVPTX, AMDGPU...) and virtual machines (WASM, JVM...).

I've tried to come up with atomic-but-super-weakly-ordered semantics which are morally equivalent to those of LLVM's unordered atomics (but without any explicit commitment to translating into them, since we don't want to commit to supporting LLVM's unordered at this point in time), in the hope that this is weak enough for such targets. But I'm not sure if that is enough, and I need input from people familiar with such targets:

  • Do these targets guarantee atomicity of native load/store instructions, in the sense that when a load and a store race against each other, the load can observe either the old or the new value of the target memory region but nothing else?
  • Has there been any study on whether LLVM's unordered atomic ordering specifically (since that's what we'll most likely use for the initial implementation) is implementable without using special load and store instructions or extra synchronization on those architectures?
    • I know that LLVM's unordered has specifically been designed to accommodate the needs of the JVM, so I guess the answer is yes in that case, but I don't know if it is also an appropriate model for other VMs like WASM, or for "normal" GPU loads & stores.

Have you checked the LLVM AMDGPU backend documentation?

It does not mention this explicitly anywhere, but it states that unordered atomics are supported for all address spaces. I don't know how those semantics could be supported everywhere if native loads/stores were able to tear. You might want to ping the amdgpu maintainers directly here, e.g., by opening an issue in LLVM's bugzilla, or by pinging them on IRC.

Thanks a lot, I didn't know about that docs page and it's pure gold. I am a little bit worried about this bit...

The memory model does not support the region address space which is treated as non-atomic.

...but aside from that, my interpretation of the table below the hardware-specific bullet points is that indeed, unordered is translated to the same code as non-atomic for all load and store operations supported by the memory model, which in turn suggests that basic loads and stores are atomic in the "cannot tear" sense on this hardware.

I'll give this page a more careful read when I have more time, and post questions like the above to llvm-dev (as that seems more appropriate than bugzilla for questions), but it seems that overall on AMDGPU it's fair to (ab)use unordered volatile as a better specified form of non-atomic volatile. Let's cross fingers that this is true for other targets...

For NVIDIA, the LLVM docs are comparatively super-scarce, and I have the impression that the closest thing that I can dig into is the PTX ISA documentation and in particular its memory consistency model section, which I don't have time to read right now but seems promising on a first skim.

There are also some interesting bits here and there in their Programming Guide, e.g. around the "Hardware Implementation" and "Compute Capabilities" sections. I probably need to dig a bit more into that as well.


Is "my processor is RV32IMC without the A extension and thus has no atomics" relevant, or am I misreading what you're talking about?

(The processor I write code for is a single-hart, no-atomics, in-order-execution thing, and I honestly have no idea where this thing sits on the memory model hierarchy.)

I'm not familiar enough with RISC-V to answer your question directly, but if it already has a Rust toolchain, a quick experimental test for finding out whether you have the kind of weird architecture that I'm interested in on your hands is to check whether...

  • AtomicUsize::load(Relaxed) and <*const usize>::read() compile into the same hardware instructions.
  • AtomicUsize::store(x, Relaxed) and <*mut usize>::write(x) compile into the same hardware instructions.

Formally speaking, what I'm looking for is targets that allow SMP implementations without guaranteeing cache coherence between compute cores. But since hardware specifications are not always super-explicit regarding memory models, you may need to resort to the experimental "look at what LLVM does" method mentioned above :wink:
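The comparison suggested above can be set up with four small functions, roughly as sketched below. The function names are made up for illustration; the idea is to build for the target of interest and diff the generated assembly (for example with `rustc --emit asm`), pairing the two loads and the two stores.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// `#[inline(never)]` keeps each function as a separate symbol so that
// its assembly is easy to find and compare.

#[inline(never)]
pub fn atomic_load(x: &AtomicUsize) -> usize {
    x.load(Ordering::Relaxed)
}

/// Safety: `x` must be valid for reads and properly aligned.
#[inline(never)]
pub unsafe fn plain_load(x: *const usize) -> usize {
    unsafe { x.read() }
}

#[inline(never)]
pub fn atomic_store(x: &AtomicUsize, value: usize) {
    x.store(value, Ordering::Relaxed)
}

/// Safety: `x` must be valid for writes and properly aligned.
#[inline(never)]
pub unsafe fn plain_store(x: *mut usize, value: usize) {
    unsafe { x.write(value) }
}
```

If the atomic and plain variants produce identical instructions, Relaxed atomics cost nothing extra on that target; a libcall or a special instruction in the atomic variant is the sign of the kind of target this thread is asking about.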

RV32IMC, different assembly:

RV32IMAC, same assembly:

This seems like a bit of a rabbit hole, actually. I'll take a closer look later. I kind of expected #2 to use the LL/SC instructions...

This seems quite reasonable to me. I've seen many unusual CPU and memory architectures, and the memory operations available on them, and I don't know of any where this would cause a problem.

So, after perusing the LLVM AMDGPU backend documentation and the NVPTX ISA documentation, I'm starting to see some common patterns which allow me to reach some tentative conclusions.

It is my understanding that compiler backends for both of these architectures provide volatile loads and stores with stronger synchronization guarantees than basic loads and stores for the target memory space, closer in spirit to C11's Relaxed cache-coherence guarantees:

If my understanding of this is correct (to be checked here and perhaps on llvm-dev as well), it means that for these two GPU architectures at least, word-sized volatile loads and stores always have Relaxed semantics and there's no harm in marking them atomic.

There is a small wrinkle though. This is about scalar word-sized loads, and the PTX reference quite clearly warns that vector (in the SIMD sense) loads/stores are not atomic, but should rather be modeled as an unordered set of scalar atomic loads/stores.

So SIMD vector loads and stores might actually be a first reasonable use case for keeping a non-atomic volatile load operation around, since they are not atomic as a whole (just piecewise atomic). From memory, the x86 situation is similar.

An alternative to non-atomic volatile ops in that case would be to have a piecewise-atomic SIMD load/store operation, in the spirit of LLVM's element-wise atomic memcpy, but I'm not sure how to integrate that in the API design that I'm currently investigating.
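As a rough illustration of what "piecewise atomic" means, the sketch below models a 128-bit vector load as four independent Relaxed scalar loads. The function name `load_piecewise` is hypothetical, and the sketch only captures the atomicity side: it says nothing about volatility, and nothing guarantees that the compiler merges the four loads into one vector instruction.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical piecewise-atomic load: each 32-bit lane is read
/// atomically, but the four lanes together are NOT one atomic
/// snapshot. A concurrent writer may be observed "halfway through".
pub fn load_piecewise(src: &[AtomicU32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for (dst, lane) in out.iter_mut().zip(src.iter()) {
        // Relaxed: no ordering between lanes, only per-lane atomicity.
        *dst = lane.load(Ordering::Relaxed);
    }
    out
}
```

Each lane is guaranteed not to tear, but a reader racing with a writer can still see a mix of old and new lanes, which matches the PTX description of vector accesses as an unordered set of scalar atomic accesses.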

The only thing that is clear to me at this point in time is that a volatile 128-bit SIMD load has to be somehow different from e.g. a pair of 64-bit volatile scalar loads, because volatile prevents load/store merging whereas we want such merging to happen here.

@mcy Oh-ho, that's interesting! Just to be sure, can you check if this __atomic_load_4 intrinsic expands into anything more complicated than the lw a0, 0(a0) that it replaces in the non-atomic ASM?

If not, I'm curious what that means. Should I understand from this that RISC-V guarantees cache coherence of SMP implementations in the presence of the A instruction set (where a regular lw is enough to get a Relaxed atomic load), but not otherwise?

I kind of expected #2 to use the LL/SC instructions...

Personally I wouldn't expect this. Why should a program need to carry out a write (the Store in SC's Store Conditional) in order to atomically read a value?

I have no idea. I'll try to remember to take a look later. I don't work with the A extension, and I don't have my printout of the RISC-V manual on hand.

I had a minor brain fart there, don't mind me...

Like @HadrienG mentions below, the question is how __atomic_load_4 is implemented. Did you manage to check that out? IIRC in RISC-V the {l,s}{b,h,w,d} instructions do not synchronize but they are atomic, so __atomic_load_4 should just lower to a single lw instruction. No idea why LLVM isn't optimizing it properly and producing the same assembly for both cases. You might want to file a bug.

What the A extension provides is synchronizing instructions (the lr, sc and fence "building blocks" as well as some instructions for the common cases).


I would intuitively expect it to be consistent however, as otherwise CUDA code with volatile variables would encounter performance regressions on newer GPUs.

Notice that volatile in CUDA C (at least) is used for synchronization purposes (e.g., check out the CUDA C programming guide and search there for volatile), e.g., in combination with __syncthreads, __threadfence, etc. and it is therefore quite different from C's volatile which does not synchronize.

Unless you are writing the GPU driver, there is no inter-process synchronization, MMIO, etc. going on in the GPU kernels themselves, so AFAICT volatile operations with LLVM's meaning of volatile are probably meaningless there, and CUDA's volatile is instead more similar to LLVM's "atomic volatile" operations - although I've no idea about their order, maybe that's why LLVM has intrinsics for them for NVPTX?

I actually looked into this... it looks like nightly refuses to compile on IMC at all?

Either way, I don't think it's a big deal. Though now I wonder what a compare/exchange instruction does without being able to emit lr/sc...

@gnzlbg Thanks for pointing this out! The meaning of volatile that is discussed in the CUDA C programming guide (which can be summarized as "bypass the SMs' L1 caches") seems consistent with that used by amdgpu's implementation of volatile, and IIUC that provides Relaxed ordering at least at device scope (system scope too apparently for sm7x GPUs).

@mcy Compare/exchange instructions most likely won't ever compile on IMC, since Rust atomics are guaranteed to be lock-free and IIRC compare/exchange cannot be implemented in a lock-free manner without hardware support.
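For code that has to build both with and without hardware compare/exchange, one option is to gate the CAS-based path on `cfg(target_has_atomic = "ptr")`. The sketch below uses a made-up helper, `fetch_increment`, and assumes that this cfg is unset on a target like RV32IMC, which has not been verified here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical helper: increments `x` with a CAS loop and returns the
/// previous value. Only compiled where pointer-sized atomics with
/// compare/exchange are available.
#[cfg(target_has_atomic = "ptr")]
pub fn fetch_increment(x: &AtomicUsize) -> usize {
    let mut current = x.load(Ordering::Relaxed);
    loop {
        match x.compare_exchange_weak(
            current,
            current.wrapping_add(1),
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            // The exchange succeeded: `previous` is the value we replaced.
            Ok(previous) => return previous,
            // Another writer got there first (or the weak CAS failed
            // spuriously): retry with the value actually observed.
            Err(actual) => current = actual,
        }
    }
}
```

On a target without CAS support the function simply does not exist, so callers get a compile-time error at the use site instead of a surprise at link time.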