Add volatile operations to core::arch::x86_64

There are a couple of discussions about how to do MMIO, interprocess-shared memory, etc. efficiently, and with guaranteed semantics without introducing undefined behavior.

The main issue with ptr::read/write_volatile<T> is that their semantics aren’t precisely specified yet. While we should fix that, the semantics do depend on the actual T being read, the architecture being targeted, whether the pointer is aligned or not (right now ptr::read/write_volatile<T> does not support unaligned read or writes, but we could add _unaligned variants to support that).

One of the options that we have available is to add intrinsics with guaranteed semantics for each architecture to core::arch. For example, for x86_64, it could look like this:

/// (Module documentation)
///
/// Semantics of volatile memory operations: volatile memory 
/// operations (reads and writes) must be emitted by the 
/// compiler and cannot be re-ordered across other volatile 
/// operations.
/// 
/// The result of a volatile read is frozen (`freeze` is applied to it). 
/// That is, reading uninitialized memory via a volatile read never 
/// returns uninitialized memory, the result of the read is picked
/// non-deterministically.
///
/// When volatile reads participate in data-races with any write 
/// operation, the result of what is read is picked non-deterministically. 
/// When volatile writes participate in data-races with other 
/// volatile-write operations, the result that is written is picked 
/// non-deterministically. If volatile writes participate in data-races
/// with other write operations, the behavior is undefined.
/// A race betweeen:
///
/// * a volatile atomic read with volatile atomic writes reads the 
///   content of memory either before or after any of the writes - the 
///   read will not observe partial modifications because the reads 
///   and the writes are atomic
/// * a volatile atomic write with other volatile atomic writes results
///   in the content of any of the writes being written to memory - the 
///   memory won't contain partial results of the different writes
///
/// Non-atomic volatile operation perform the reads and writes as a 
/// sequence of smaller volatile atomic reads or writes. Data races 
/// result in partial results being read from or written to memory.

/// Volatile atomic 8-bit load
///
/// 8-bit volaitle atomic load from `x`. 
///
/// If the load introduces a data-race, the result is picked 
/// non-deterministically.
unsafe fn volatile_atomic_load_u8(x: *const u8) -> u8; 

/// Volatile aligned atomic load u16/u32/u64
///
/// 16/32/64-bit volatile atomic load from aligned `x`.
///
/// If the load introduces a data-race, the result is picked 
/// non-deterministically.
///
/// If `x` is not aligned, the behavior is undefined.
unsafe fn volatile_atomic_load_u16(x: *const u16) -> u16;
unsafe fn volatile_atomic_load_u32(x: *const u32) -> u32;
unsafe fn volatile_atomic_load_u64(x: *const u64) -> u64;

/// Volatile unaligned load u16/u32/u64/u128/256/512
///
/// Volatile 16/32/64/128/256/512-bit unaligned load.
/// 
/// This operation is not necessarily a single atomic load. 
/// The memory is read in a data-race free way by performing 
/// either a single volatile atomic load, or multiple smaller volatile 
/// atomic loads in an unspecified order .
///
/// If the load introduces a data-race, the result is picked 
/// non-deterministically.
unsafe fn volatile_load_u16_unaligned(x: *const u16) -> u16;
unsafe fn volatile_load_u32_unaligned(x: *const u32) -> u32;
unsafe fn volatile_load_u64_unaligned(x: *const u64) -> u64;
#[target_feature(enable = "sse")]
unsafe fn volatile_load_u128_unaligned(x: *const u128) -> u128;
#[target_feature(enable = "avx")]
unsafe fn volatile_load_256_unaligned(x: *const [u8; 32]) -> [u8; 32];
#[target_feature(enable = "avx512")]
unsafe fn volatile_load_512_unaligned(x: *const [u8; 64]) -> [u8; 64];

/// Volatile atomic 8-bit write
///
/// 8-bit volaitle atomic write of `x` to `ptr`. 
///
/// If there is a data-race with another volatile atomic write to `ptr`, 
/// the memory written to `ptr` is picked 
/// non-deterministically.
unsafe fn volatile_atomic_write_u8(x: *mut ptr, x: u8); 

/// Volatile aligned atomic write u16/u32/u64
///
/// 16/32/64-bit volatile atomic wrote of `x` to `ptr`.
///
/// If there is a data-race with another volatile atomic write to `ptr`, 
/// the memory written to `ptr` is picked non-deterministically.
///
/// If `x` is not aligned, the behavior is undefined.
unsafe fn volatile_atomic_write_u16(ptr: *mut u16, x: u16);
unsafe fn volatile_atomic_write_u32(ptr: *mut u32, x: u32);
unsafe fn volatile_atomic_write_u64(ptr: *mut u64, x: u64);

/// Volatile unaligned write u16/u32/u64/u128/256/512
///
/// Volatile 16/32/64/128/256/512-bit unaligned write of `x` to `ptr`.
/// 
/// This operation is not necessarily a single atomic write. The 
/// memory is written to `ptr` in a data-race free way by performing
/// either a single volatile atomic write, or multiple 
/// smaller volatile atomic writes in an unspecified order .
///
/// If there is a data-race with another volatile write to `ptr`, 
/// the memory written to `ptr` is picked non-deterministically.
unsafe fn volatile_write_u16_unaligned(ptr: *mut u16, x: u16);
unsafe fn volatile_write_u32_unaligned(ptr: *mut u32, x: u32);
unsafe fn volatile_write_u64_unaligned(ptr: *mut u64, x: u64);
// Optionally:
#[target_feature(enable = "sse")]
unsafe fn volatile_load_128(x: *const [u8; 16]) -> __m128;
#[target_feature(enable = "avx")]
unsafe fn volatile_load_256(x: *const [u8; 32]) -> __m256;
#[target_feature(enable = "avx512")]
unsafe fn volatile_load_512(x: *const [u8; 64]) -> __m512;

Users could then either use these intrinsics directly to get guaranteed semantics, or use them to build more generic abstractions in a library.

For example, one could use these to build a generic read_volatile_atomic<T: SizeBoundt> API in a library that rejects reads not supported by the target, e.g., by only implementing the trait SizeBound for , e.g., u64, on targets that support the operation on that type.

1 Like

Your text is wider than the code block, making the comments in there really hard to read. Could you try to improve the formatting?

the result of what is read or written is indeterminate

“indeterminate” is not a word that we used in Rust so far I think, and given its history of confusion and ambiguity in C, I’d rather avoid it. “uninitialized” is the term I usually use. Or if you want to say it returns an unspecified sequence of bits, the result is “picked non-deterministically”.

When volatile reads and writes participate in data-races with other volatile operations

Your data race semantics only cover volatile-volatile pairs of accesses, why that? For volatile reads, we can say that if this is in a data race with any other write access, the result is NOT UB but instead some non-deterministically chosen but initialized data is returned. And in fact we need this for the use-case that triggered this.

The use of the term “atomic” is a bit confusing since the operation is not atomic in the sense of having an atomic memory ordering. I don’t have a good idea for a better name though.

And finally, note that implementing these intrinsics requires several things that LLVM does not do or at least not commit to:

  • Volatile reads always return frozen data. (This seems to be the case in current LLVM but I have seen no documentation that this is more than an accident.)
  • Volatile writes do not cause UB when being in a race with another write. (I wouldn’t even know how to check if LLVM currently does this the way we want.)

I don’t think we should accept intrinsics that we cannot implement with the desired semantics in a way that is explicitly guaranteed by the backend.

1 Like

Done.

“uninitialized” is the term I usually use. Or if you want to say it returns an unspecified sequence of bits, the result is “picked non-deterministically”.

I’ve changed that to “picked non-deterministically”. but maybe we should give these values a name (non-deterministic value). I suspect we would it at least use it for defining freeze. Are we using these terms in some other parts of the documentation already?

Your data race semantics only cover volatile-volatile pairs of accesses, why that?

Logic bug, fixed. EDIT: with “When volatile reads participate in data-races with any write operation, the result of what is read is picked non-deterministically. When volatile writes participate in data-races with other volatile-write operations, the result that is written is picked non-deterministically. If volatile writes participate in data-races with non-volatile write operations, the behavior is undefined.”

The use of the term “atomic” is a bit confusing since the operation is not atomic in the sense of having an atomic memory ordering.

I agree.

  • Volatile reads always return frozen data. (This seems to be the case in current LLVM but I have seen no documentation that this is more than an accident.)

You are assuming that we would lower these to volatile load/stores in LLVM-IR. While we could do that, we don’t have to. These are architecture specific, so we can always just insert the right machine code using asm!(... : "volatile"). That would have the defined semantics, although it might inhibit more optimizations than necessary. We could lower them to volatile load/store, and then “launder” the undef (e.g. by using freeze if we ever get that, or by using an asm! block just for that.

  • Volatile writes do not cause UB when being in a race with another write. (I wouldn’t even know how to check if LLVM currently does this the way we want.)

The documentation does not mention anywhere that write-write races are undefined behavior either. The only thing that AFAIK is currently documented is a load-write race makes the load return undef, which would be ok for us (see above). FYI I’ve opened an bug in the LLVM documentation about this, but I don’t expect this bug to be resolved any time soon: https://bugs.llvm.org/show_bug.cgi?id=42435

Worst case, we can prevent all re-ordering by using inline assembly as well to implement the writes.

1 Like

Where are you reading that the read is volatile, but the write is not ? AFAICT, the read is indeed volatile, but the write must be volatile as well for that use case to ever work.

EDIT: Expanding on that: if the write isn’t volatile, the thread of execution doing the write has UB, and it doesn’t really matter what the code doing the read does. If the compiler can prove that the write cannot be observed without invoking UB (e.g. due to a data-race), it can optimize the write away. While this misoptimization can’t break havok in the thread of execution doing the read (the compiler can’t optimize the volatile read away), undefined behavior is undefined.

1 Like

I agree finding a less unwieldy name would be good. There’s nothing special about the value. If I tell you that the relevant bitstring is 0b0001111, you cannot tell if it was picked non-deterministically by freeze or whether the user used the literal 15u8 in the source. It’s the process of picking the value that we have to find a name for.

This is in strong contrast to the initial stored in uninitialized memory, 0bUUUUUUUU (8 uninitialized bits). That’s a very special value, the process of picking it was rather mundane (the standard says fixes the initial content of all memory to be that).

This seems like a problem. That means if I do a volatile write and the untrusted thread does a non-atomic non-volatile write, we have UB. Didn’t we want to avoid that?

(Your next post indicates you were confused by what I was trying to say here.)

Fair.

That’s a very good point! They are UB in C++, so I assumed LLVM would copy that, but given that they diverge from C++ for read-write races, who knows that they do for write-write races.

However, to my knowledge, the formalizations of the LLVM memory model make write-write races UB. I will email some people and ask about this.

That would however mean leaving the space where we could have any hope of saying anything formal or definite (with the kinds of specs we currently have).

See above – I meant that our write is volatile, but the one we are racing with is not.

1 Like

I think this should be saying that the memory gets uninitialized. That’s how I read LLVM’s spec when it says

  • Otherwise, if there is no write to the same byte that happens before Rbyte, Rbyte returns undef for that byte.
  • Otherwise, if Rbyte may see exactly one write, Rbyte returns the value written by that write.
  • Otherwise, if R is atomic, and all the writes Rbyte may see are atomic, it chooses one of the values written. See the Atomic Memory Ordering Constraints section for additional constraints on how the choice is made.
  • Otherwise Rbyte returns undef .

Arguably, when “reading from a write-write race”, you see more than one write, and not all of them are atomic, so you end up in the last case. So all reads from a write-write race must return undef, which is achieved by putting undef in memory.

1 Like

It’s the process of picking the value that we have to find a name for.

Makes sense. Might be worth opening an issue in the UCG repo for this.

IIUC the use case correctly, we have a trusted thread of execution that only does volatile reads from shared memory (no writes to it), and untrusted threads of execution that write to that shared memory. So I don’t understand what you mean by “our write is volatile”.

The trusted thread of execution does not invoke UB doing volatile reads, but if an untrusted thread of execution uses a normal write to write to the shared memory, that write is unsynchronized and non-volatile. With what’s proposed here, that’s UB (e.g. as mentioned the compiler can just remove that write, or determine that all execution paths leading to that write are unreachable).

In the use case from @hsivonen, both threads of execution are part of the same program, so the whole program has UB.

To avoid the UB, we would need to define what the behavior of those racy non-volatile unsynchronized writes are. One way to do that would be to say that the write is not required to happen, but if it does, it writes a bit-pattern picked non-deterministically. That allows the compiler to optimize the writes away, but does not make the writes themseles UB, such that the code is still reachable.

That would however mean leaving the space where we could have any hope of saying anything formal or definite (with the kinds of specs we currently have).

The specified semantics would still be that volatile cannot be reordered across volatile. The implementation might be more strict, and never reorder anything across these intrinsics. We could relax that up to the specified semantics in the future.

I think this should be saying that the memory gets uninitialized. That’s how I read LLVM’s spec when it says

[…]

Arguably, when “reading from a write-write race”, you see more than one write, and not all of them are atomic, so you end up in the last case.

What the proposal intended to say (but failed) is that a volatile write racing with another volatile write, what gets written is picked non-deterministically, and that this race is not undefined behavior. Note that per the definition of volatile, LLVM must emit both writes.

Now, if there are reads, then we might have a data-race, and that’s UB. But if the reads are all volatile, the first line of the LLVM spec says:

  • If R is volatile, the result is target-dependent.

The intrinsics being proposed are target specific - we know to which instructions they lower to for this particular target, and we use that to guarantee that the value read is picked non-deterministically without invoking undefined behavior.

For other read operations, the behavior would probably be undefined.

1 Like

Rereading @hsivonen’s post, I think the writes coming from the untrusted thread are being performed by JITted code, which probably provides stronger guarantees about data races than we do.

Since the use case is “mutithreaded Wasm”, it sounds like there would be nothing preventing two Wasm threads from accessing the same memory simultaneously, as opposed to the mentioned scenario of one Wasm thread and one native thread. In that case, if the Wasm code were executed by an interpreter using regular Rust writes/reads, it would already be well into UB territory. Instead, an interpreter would have to use volatile or atomic accesses to be safe, while a JIT would have to ensure that the instructions it generates are safe in the face of data races – or at least that any misbehavior caused by them doesn’t escape the sandbox. (Not sure exactly what the Wasm spec requires here.)

Assuming the Wasm VM they’re using does provide such guarantees (because it would be completely unsafe if not), all we have to worry about is whether the trusted thread causes UB. In other words, the untrusted thread can be treated as if it were part of a different process even if it really isn’t.

Edit: That said, it would be nice to provide guarantees in the “volatile read / nonvolatile write” case even if it isn’t necessary in @hsivonen’s scenario.

Yes, it would be nice to guarantee this. It makes the wasm interpreter/JIT easier to implement and allows it to generate better code, and it is mandatory for other use cases like OS kernels where the adversary can use arbitrary machine code.

Many people are satisfied with the guarantee being de facto provided by all current implementations, but it is unsatisfying to have a memory modeal that does not really support those very important “supervisor code” real-world use cases and to have to defer to the implementation like this.

2 Likes

I’m not sure I follow what this has to do with WASM. These are core::arch::x86_64 intrinsics, which do not exist if the target is wasm32 (only core::arch::wasm32 exists there).

If you are modificating a shared buffer across two processes, at the end what matters is the semantics of the instructions being executed by the CPU. Rust provides guarantees based on Rust operations, WASM does the same for WASM. If those guarantees have some requirements, then you have to think how the WASM guarantees satisfy Rust requirements and vice-versa. If a Rust guarantee requires an atomic write from the WASM side, but WASM doesn’t have any instruction with that guarantee, then unless Rust’s defines the behavior for the case in which this does happen, all bets are off.

AFAICT, WASM supports optimizing machine code generators, so if you emit WASM code with two consecutive writes to the same memory, e.g., using LLVM volatile, the machine code generator might optimize one of the writes out. Also, all WASM writes support unaligned memory, and if that faults, the machine code generated will catch the CPU exception, and retry with a non-aligned write (or might just always emit unaligned writes, etc.). So if you want to write to shared memory from WASM, you probably need to use WASM atomics, and hope that two atomic writes to the same memory location don’t get optimized out.

Oh, I see. I thought the trusted thread would also want to write.

If the trusted thread only reads, then we don’t need any new write volatile intrinsics. So we might hold off on finalizing them until we know more about the constraints. Sometimes trusted threads also have to write to memory shared with untrusted code, so that does seem like a use-case to consider eventually.

That part has nothing to do with memory or volatile though. It also affects that thread doing out-of-bounds accesses or so. Whatever solution we have for that needs to apply to all of that, and I see no way in which we could do something specifically with volatile accesses that would help here. It’s just a separate discussion.

And what I am saying is I don’t see how to get those semantics from LLVM. But what we maybe can get is that if two volatile writes race, the memory in that memory becomes uninitialized.

1 Like

Btw, I have since then learned that LLVM developers do not consider write-write races UB. They consider them to basically leave the memory as undef (uninitialized).

To my knowledge, there is no theoretical/formal work at all studying the properties of this memory model.

2 Likes

An alternative to the design here might be to extend the AtomicXY types with volatile_load/store instructions. LLVM has an Unordered (https://llvm.org/docs/Atomics.html#unordered) ordering for atomics that might do what we want when combined with volatile.

3 Likes

@gnzlbg I would love that! If this lets us also remove the existing intrinsics (and implement the user-facing API another way), it would solve two problems:

  • Never again would we wonder what the interaction of concurrency and volatile is. The LLVM model is not very specified, but knowing it is Unordered is a lot better than not knowing anything, and there are access modes very similar to Unordered in some academic memory models (e.g. plain accesses in this paper).
  • And also we could finally answer the questions users have about tearing of volatile accesses. Not saying anything definite there makes the current volatile accesses basically unusable for MMIO.
1 Like

Hmm... as currently specified in C*, the compiler is not allowed to change the order of multiple volatile accesses, although it can reorder non-volatile accesses around volatile accesses. This sounds more like relaxed (a.k.a. monotonic) ordering than unordered. Although in theory we could provide a weaker guarantee than C, it would be a footgun for MMIO, and a backwards compatibility hazard if retrofitted onto the existing, stable volatile access functions.

In any case, using volatile atomic sounds good to me, and I think ideally we would allow specifying any ordering, just as LLVM does. But that would still leave some questions:

  • Should volatile loads also guarantee that they return a frozen value?

    • If so, should this be "an architecture-specific guarantee that happens to apply to all current architectures", or should it be a real guarantee (with the consequence that some hypothetical architecture might not be able to implement volatile)?
    • If not, what should people use for inter-process shared memory?
  • Should volatile loads make other guarantees depending on the architecture? For example:

    • Most architectures provide stronger memory ordering guarantees for normal load/store instructions than unordered or even relaxed. It's not clear to me how useful it would be to make such guarantees for volatile accesses, assuming that we do allow specifying any ordering and thus making the desired semantics explicit. But it's also hard for me to imagine what kind of optimization would violate such guarantees while staying compatible with the MMIO use case.

    • Invalid memory accesses are typically guaranteed to trap. This includes accesses where the corresponding page is unmapped, or where it's mapped with incompatible permissions (e.g. writing with only PROT_READ). Of course, questions of how to manage memory mappings and how to catch traps are OS- and architecture-specific, but do we guarantee that invalid accesses trap as opposed to being straight UB?

      • In particular, should we or should we not assume that volatile load addresses are well-aligned and non-null?

* From the C standard, emphasis added: "Actions on objects [declared as volatile] shall not be 'optimized out' by an implementation or reordered except as permitted by the rules for evaluating expressions."

2 Likes

@comex we have some motivation for volatile unordered atomic load / stores, but which use cases require volatile atomic load / stores with other orderings ?

I am not suggesting to weaken the volatile guarantees. But those reordering guarantees stem entirely from the fact that volatile accesses are observable external events of the program, which is very different from the way atomic memory accesses are specified. The compiler is also not allowed to reorder two write syscalls, not because write has anything to do with concurrency but because making a syscall is an externally observable event.

In contrast, Relaxed guarantees things that I do not see any reason for us to guarantee for volatile: if in thread A, a release fence is followed by a relaxed write, and then in thread B a relaxed read reads from A's write and is followed by an acquire fence---then we get a happens-before edge between the two fences. So, even Relaxed can be used for cross-thread synchronization when combined with fences the right way.

In contrast, the LLVM docs for Unordered say "This cannot be used for synchronization". Semantically, that seems exactly right for volatile.

I think it is a mistake to view the definitions of our concurrency and volatile memory models as being in terms of allowed transformations, such as reorderings. A language is not defined by the optimizations you can make, it is defined by an Abstract Machine, and that Machine determines which optimizations you can make. Experience shows that it is very easy to come up with a set of optimizations that all look reasonable on their own but miscompile intended-to-be-correct programs in practive---we need an Abstract Machine to make sure there is consistency. (This is basically the same argument that I am making in my most recent blog post.) I have proposed a semi-operational spec for volatile here and I think it should not be too hard to make this fully formal and operational, but it requires agreeing on some formal and operational spec for syscalls in general (where multiple options exist, so this is now a question of "is there a way to do this" but of "which is the best option for us").

I think reordering non-volatile accesses around volatile accesses can easily violate such guarantees---but I haven't tried to actually come up with a counter-example yet.


Another idea for the user-facing API for volatile accesses: we could add Ordering::Volatile. On Zulip it was mentioned that sometimes you need volatile RMW instructions (read-modify-write, e.g. to do a fetch_or on the page table); making Volatile an Ordering would mean we would not have to duplicate all those RMW methods.

On the other hand, the entire idea of telling people to use Atomic* types for volatile accesses could cause a lot of confusion; people are already "mixing up" volatile and synchronization too frequently.

2 Likes

Well, pretty much any lock-free data structure can in principle be placed in shared memory and coordinated between processes that don't trust each other.

A real-world example is a queue based on a circular buffer – for example, Linux's io_uring, which uses memory shared between the kernel and userland. The kernel accesses the queue from a dedicated thread, so this needs to be synchronized like any concurrent structure. As a writer, you first write data into the circular buffer, then update a variable (tail) to indicate how much data you wrote, where both the circular buffer and tail are in shared memory. The write to tail needs to be ordered after the writes into the circular buffer, so it should use release ordering (or an equivalent fence must be inserted).

edit: In that specific example I guess userland trusts the kernel, so it might not need volatile; a better example would be the reader side in the kernel, if it were written in Rust.

1 Like

Many lock-free data structures rely on all parties involved to follow a certain protocol, otherwise their correctness guarantees are not longer provided.

But there are certainly data structures that can work with untrusted peers. However, why would volatile be needed there?

Perhaps it shouldn't. *shrug* This is getting back to "C++ people think volatile is needed for shared memory, but it's not clear why".

I’m going to see if I can get JF Bastien, author of those papers that emphasized that shared memory should use volatile, to explain his thoughts in more detail.

2 Likes