In general, the optimizer can't do that for atomic loads, because the assumption wouldn't hold if another thread performed an atomic store in between (which, unlike a racing non-atomic store, is perfectly legal).
If you do a wide read followed by bytewise reads, then (unless there are other constraints preventing it) the optimizer is allowed to replace the bytewise reads with, e.g., "extract byte N from the SIMD register containing the previously-read wide value", and in that case of course it could assume that the values are consistent, since it's just math. But if the generated code performs multiple actual loads to shared memory, the optimizer can't assume that the returned values will be consistent, since an atomic store could have modified the memory in between. (It could assume that if you performed a non-atomic load, since then any kind of racing store is UB.)
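To illustrate the distinction, here's a minimal Rust sketch (the function names are made up; which byte the fold extracts depends on endianness):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Non-atomic case: the optimizer may fold the bytewise read into the wide
// one ("extract a byte of `wide`"), because any racing non-atomic store
// would already be UB.
unsafe fn wide_then_byte(p: *const u64) -> (u64, u8) {
    let wide = p.read();                // one wide load
    let byte = (p as *const u8).read(); // may be folded into `wide`
    (wide, byte)
}

// Atomic case: the compiler may coalesce the two loads into one, but if it
// emits two actual loads, it cannot assume they return the same value,
// since an atomic store from another thread between them is well-defined.
fn byte_twice(b: &AtomicU8) -> (u8, u8) {
    (b.load(Ordering::Relaxed), b.load(Ordering::Relaxed))
}
```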
There are also Valgrind and ASan, which are essentially software emulations of such a CPU.
Relaxed atomics don't compile to "atomic instructions" at the assembly level, just regular loads/stores, at least on existing architectures. That said, LLVM currently doesn't do much to optimize relaxed accesses ([1][2]), e.g. it doesn't even coalesce repeated reads to the same location, so using them in place of non-atomic accesses can cause slowness in practice.
An even weaker form of atomic access that allows for tearing reads (for whatever minimal optimization benefit that provides), and
Formally specifying the behavior of mixed atomic and non-atomic accesses (specifically inspired by seqlocks in the Linux kernel), such that atomic accesses that race with non-atomic accesses are not UB but do return an indeterminate value.
If adopted and implemented, I think that would do a lot to clear up the concerns here.
Right. Compiler barriers are fine, and what I'd expect those annotations to become. (I'd like them to have as little impact on the compiler's optimizations as possible.)
RCU currently has some operations that cast pointers to volatile and dereference them.
LLVM does not currently do this, though of course that's no guarantee when it comes to future versions. The relevant documentation is, as usual for LLVM, vague about what exactly it guarantees, but I don't think it says anywhere that that's UB.
From a C++ perspective (in the current standard, not the proposal I mentioned before), mixed accesses between, say, int * and atomic<int> * are UB just because mixed accesses of different types are UB in general (strict aliasing). However, this doesn't apply if at least the non-atomic access is of character type. The relevant section of the standard is somewhat vague about what happens then, but two key passages are:
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below.
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
So:
A non-atomic write followed by (not concurrent with) an atomic read, or vice versa, is probably fine, modulo the strict aliasing issue (which doesn't exist in Rust).
Races between atomic and non-atomic accesses, on the other hand, definitely cause UB (so things like seqlocks are currently not allowed, at least not without volatile). However, the standard does not address shared memory, so it's not clear whether a non-atomic write in one process, racing with an atomic read in another process, causes UB in both processes or only in the writer.
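For concreteness, here's a minimal sketch of the seqlock reader pattern in question (type and field names invented); the non-atomic read of the payload is exactly the mixed-access race that the current wording makes UB:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

struct SeqLock {
    seq: AtomicUsize,           // even = stable, odd = writer active
    data: UnsafeCell<[u8; 64]>, // non-atomic payload
}
// (a real version would also need `unsafe impl Sync for SeqLock {}`)

impl SeqLock {
    // Reader: retry until the sequence number is even and unchanged across
    // the payload read. The payload read may race with a writer's
    // non-atomic write, which the current C++ model calls a data race.
    fn read(&self) -> [u8; 64] {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // writer in progress
            }
            let data = unsafe { *self.data.get() }; // racy non-atomic read
            let s2 = self.seq.load(Ordering::Acquire);
            if s1 == s2 {
                return data;
            }
        }
    }
}
```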
I'm confused what you mean by this. If by "critical region" you mean something like acquiring a mutex, those include barrier semantics, and the compiler and CPU can't move even nonatomic accesses past barriers; otherwise mutexes would be unusable. And atomic accesses provide strictly stronger guarantees.
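A sketch of why this must be so, assuming nothing beyond std::sync::Mutex:

```rust
use std::sync::Mutex;

// Lock acquisition is an acquire operation and unlock is a release
// operation, so the plain, non-atomic access to the data cannot be hoisted
// out of or sunk past the critical section by the compiler or CPU. If it
// could, mutexes would not protect anything.
fn increment(counter: &Mutex<u64>) {
    let mut guard = counter.lock().unwrap();
    *guard += 1; // ordinary non-atomic read-modify-write, fenced by the lock
}
```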
By the way, mixed volatile and non-volatile accesses are an entirely separate issue, at least from a C/C++ perspective, and JF Bastien has written a proposal to fix those as well.
To clarify this bit, @hsivonen's motivating use case was a syscall-like scenario where OS kernels and WASM runtimes perform privileged operations based on whatever they read from a buffer to which untrusted threads are allowed to write concurrently.
I assumed that the correctness of these privileged operations could depend on the timing at which the reads occur. In that case, some kind of synchronization (including Acquire and Release barriers, but also good old "temporal" synchronization like mutexes) must be used in order to prevent the unordered reads from being reordered by the compiler + hardware into a place that would make the syscall incorrect.
But come to think of it, that's probably not a concern in the WASM scenario, where the client and host reside on the same thread and the evil writer can be assumed just to inject noise in the bytes of data that the runtime/kernel will read.
Re LLVM and data races, the reasons why I have the impression that concurrent atomic and non-atomic ops are UB are that...
LLVM's documentation is full of "we do as the C++ memory model says lol" comments
My reading of the exact C++ standard wording that you managed to exhume gives me the same impression.
I certainly agree that a C++ standard or LLVM memory model revision which says that atomic reads and writes running concurrently with non-atomic reads and writes do not cause UB in the thread doing the atomic operations would bring much peace of mind to the kind of use cases discussed in this thread.
(The theoretical issue of attacker code writing mem::uninitialized() in the shared buffer would probably remain, though. Meh.)
I see that Rust's documentation says this, but I don't understand why. The first version of the C++ paper about deprecating useless parts of volatile lists "Shared memory with untrusted code" (i.e. my use case) as the first item on its list of legitimate use cases.
As I understand it, volatile guarantees that the compiler won't invent reads that were not in the source code and that the compiler doesn't assume that two volatile reads from the same memory location yield the same value. AFAICT, these constraints should be enough to rule out UB (but not garbage values) at least on ISAs where non-atomic loads/stores and relaxed loads/stores use the same instructions. Is the UB issue something that would only manifest on Itanium, or is there a more practical reason why concurrent accesses need to be UB with volatile? (I understand that volatile isn't well-suited for intentional communication between threads, but that's different from whether adversarial concurrency can cause UB (as opposed to mere garbage values) for the thread using volatile.)
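As a sketch of the two guarantees in question (function name made up):

```rust
use std::ptr;

// Property (a): the compiler may not invent reads of `*p` that aren't in
// the source. Property (b): two volatile reads of the same location are not
// assumed to return the same value, so this comparison cannot be folded to
// `true` at compile time.
unsafe fn observe(p: *const u32) -> bool {
    let first = ptr::read_volatile(p);
    let second = ptr::read_volatile(p); // not assumed equal to `first`
    first == second
}
```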
I think that optimization is permitted even if the function is visible outside the crate, because with C++11 atomics, the compiler is allowed to reduce the set of possible executions.
Ah, indeed. I gather that Itanium is the only architecture for which the relaxed memory order affects the generated instructions in addition to affecting optimization compared to non-atomic.
I don't need to support Itanium, and I'm very confident I won't have to start supporting Itanium. I have to care about x86, x86_64, armv7 and aarch64. Support for POWER8 and higher, MIPS, and RISC-V would be nice but not essential at this time.
How can that be, considering that unordered is supposed to model Java's non-volatile accesses (for Java's meaning of volatile)?
These are all stronger guarantees than what I need. Both volatile and relaxed are too strong as well: I'm pretty sure the optimization constraints I need are only 1) the compiler must not invent reads to the memory locations that the source code tells it to write to (i.e. the compiler can't use that memory as a spill space prior to writing what the programmer asked to write, since reading back the spilled values could go wrong) and 2) if the compiler generates instructions to read the same memory location twice, it must not assume that it gets the same value both times (but it is OK to optimize away the second read and reuse the value already read).
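As a sketch of what constraint 2 guards against (names invented), the classic defense is to fetch each untrusted location once into a local and validate that copy, avoiding the double-fetch bug where a hostile writer changes a value between check and use:

```rust
use std::ptr;

// Copy out at most `dst.len()` bytes from a shared buffer whose length
// field a hostile thread may rewrite at any time.
unsafe fn copy_out(shared_len: *const usize, src: *const u8, dst: &mut [u8]) -> usize {
    let len = ptr::read_volatile(shared_len).min(dst.len()); // fetch once
    for i in 0..len {
        dst[i] = ptr::read_volatile(src.add(i)); // garbage is fine, UB is not
    }
    len
}
```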
Relaxed is too strong, because it provides indivisibility. In addition to Rust not having a way to generate relaxed SIMD loads/stores, when using C to get LLVM to generate them, LLVM generates a library call that takes a lock to provide indivisibility.
Volatile is too strong, because it prohibits reordering of write operations within a sequence of writes and reads within a sequence of reads as well as combining operations to wider reads/writes. Optimizations like that should be fine for my scenario, since I don't care what garbage values an adversarial thread observes and in the presence of adversarial writes, I'm OK with reading garbage as long as the compiler doesn't use assumptions about the consistency of the garbage to eliminate any safety checks.
(Last week, I discussed this with SpiderMonkey and Cranelift developers under the premise that volatile would be UB with concurrency (and, therefore, excluded from consideration), and we settled on relaxed, with relaxed ruling out SIMD. Then I read the C++ paper about selectively deprecating volatile and saw it give my use case as the first item on the list of legitimate uses, so now I'm trying to understand if I could get SIMD back with volatile...)
In theory, you can use llvm.memcpy.element.unordered.atomic with element_size == 1 to avoid indivisibility, and it could optimize to regular loads/stores.
In practice... it looks like LLVM will currently perform that optimization only if the pointer is known to be aligned, even though it's perfectly valid even if the pointer is unaligned.
And of course, for now Rust doesn't even expose that intrinsic.
Taking a lock in that case is actually necessary, because you're requesting an atomic load/store, and SIMD load/stores are not guaranteed to be atomic at the architecture level. However, in theory, LLVM should be able to create SIMD loads/stores out of either llvm.memcpy.element.unordered.atomic, or a series of regular atomic loads/stores (e.g. loading from a pair of AtomicU64s in sequence, at least on x86 where atomicity is free at a 64-bit granule).
In practice... it doesn't know how to do that. So you're better off with volatile.
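For reference, the pair-of-AtomicU64s pattern looks like this; in theory the two relaxed loads could be fused into one 16-byte SIMD load on x86_64 without losing the 64-bit-granule atomicity, but current LLVM won't do it:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Two adjacent relaxed 64-bit loads. Tearing *between* the two words is
// acceptable for the shared-buffer use case; each word is still read
// indivisibly.
fn load_pair(words: &[AtomicU64; 2]) -> (u64, u64) {
    let lo = words[0].load(Ordering::Relaxed);
    let hi = words[1].load(Ordering::Relaxed);
    (lo, hi)
}
```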
I'm not aware of any way that a malicious process could cause volatile loads from a shared memory segment to misbehave on any architecture I've heard of. (Not that it matters, but that includes Itanium. Itanium has an 'uninitialized' bit for registers, but not for memory.)
The following might be more controversial, but since the definition of volatile is basically 'defer to hardware semantics', I believe that it would be illegal for any future version of LLVM to assume that concurrent volatile accesses to the same memory can't happen*, as long as it's targeting current architectures. After all, current architectures have semantics that explicitly allow concurrent loads and stores. Instead, I'd say that concurrent volatile accesses (that are not also atomic) are merely non-portable, since hypothetically there could be some bizarre architecture where they would trap or even cause broader UB.
* Not that there would be much benefit in LLVM making that assumption even if it could, considering that it's very limited in what kinds of optimizations it can perform on volatile accesses. But that's beside the point.
I think that in a language as young as Rust, where many important "implementation details" are still in flux, it's important to keep a clear distinction between "what is UB" (i.e. the language does not specify what happens, hardware catching fire is standard-compliant) and "what current compiler+hardware implementations do".
For this, I'll steal @comex's nice "in theory" vs "in practice" wording.
UB is a purely theoretical concept. The fact that hardware is allowed to catch fire does not mean that it will. Sometimes, compilers and CPUs provide stronger guarantees than programming languages alone do. The danger, of course, is that this aspect of hardware & compiler behaviour may not be covered by any future compatibility guarantee.
In theory, concurrent volatile accesses are UB because in the C++11 memory model, concurrent non-atomic access to memory is a data race (notice that volatile is not part of this definition), and data races are UB.
In practice, the purpose of volatile's existence is to enable memory-mapped IO, which is 1/outside of any sane programming language memory model and 2/a form of concurrency between the CPU and other hardware. So volatile accesses are unlikely to result in surprising behavior in current compilers + hardware, and the standard's definition of a data race should arguably be amended accordingly to allow concurrent volatile accesses too.
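As an illustration of that original purpose (the device address is made up):

```rust
use std::ptr;

// Memory-mapped UART transmit register at a made-up address. No other
// *thread* touches it, yet every access must really happen, in program
// order: the "concurrent" party is the device itself.
const UART_TX: *mut u8 = 0x1000_0000 as *mut u8;

unsafe fn putc(c: u8) {
    ptr::write_volatile(UART_TX, c); // must not be elided, merged, or reordered
}
```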
This paper leans heavily on an "in practice" point of view. One of its core arguments is that if large C/++ codebases like the Linux kernel assume volatile to mean a certain thing, then any compiler that understands it to mean something different breaks the large codebases, and will thus be deemed evil by programmers.
As a result, compilers are forced to follow the large codebases' world view in practice, and one can clarify this by integrating that world view into the C standard (thus turning what is currently undefined behavior from the language's PoV into the reasonably well-defined behavior that it actually is from the implementations' PoV).
We should be wary of blindly applying such reasoning to Rust, though, because...
Rust is not C/++. It has already diverged from C/++ semantics in major areas such as struct layout, alias analysis and enum exhaustiveness, and may diverge further from those semantics in the future.
Rust is much younger than C/++. There are fewer established large Rust codebases out there, and therefore rustc still has some wiggle room for changing implementation details before developers start demanding that they be set in stone.
Rust only has one widely used compiler, which means that its semantics have been subjected to much less scrutiny than those of C/++.
For an example of the dangers of using C knowledge on Rust...
...notice that the world view of LLVM and Rust diverges from that of the C language here.
The C language has volatile variables and pointers, which come with the "no extra read" restriction that you have in mind. But Rust only has volatile accesses, which resemble accesses to volatile variables in C, but importantly do not forbid the compiler from introducing extra non-volatile accesses to the variable at hand.
As far as I know, said property can only be achieved through much discipline, e.g. making sure that no reference to the "volatile variable" is ever created (which turns simple things like accessing struct fields into unsolved language design problems).
Make sure you truly need this guarantee, as it will be painful to provide in Rust.
UB is a language property, it is not architecture dependent. It only means that the language spec doesn't specify what happens when a program does a certain thing. Sometimes, additional implementation knowledge of the compiler, OS and hardware will allow you to tell what happens in certain circumstances, but the UB is still there, as the behavior may still be different on a different compiler/OS/CPU.
Now, with that theoretical consideration out of the way, I fully agree with @comex's "practical" view here:
Without allowing some form of concurrent access, volatile isn't even good enough for its original use case of memory-mapped I/O. Therefore, "concurrent volatile is UB" language semantics are too weak and "effects of concurrent volatile are implementation-defined" would be better.
Is it allowed to create an infinite loop (which is UB) where there wasn't one before, though? I would personally find this surprising, but then again, "infinite loops are UB" is a C/++-specific concern and we're mostly discussing Rust here...
As @comex pointed out, the Itanium situation is even more complicated (it depends if the data is in RAM or in a register). But overall...
UB is not hardware-specific. Future compiler updates are free to break UB-ish code, if they can find a good reason to do so and can convince their user base.
All current compilers and hardware will do what you want.
So one reasonable stance would be to advocate for some reasonable wording about concurrent volatile access to be merged into the Unsafe Code Guidelines as life insurance, and be done with that.
Unfortunately, I don't know enough about Java's guarantees to comment on this. But in C++11, coalescing two consecutive relaxed reads of a single variable into one is legal in most cases, the exceptions being peculiar situations where it could break forward progress, e.g. by creating an infinite loop.
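As a sketch of that forward-progress exception (a toy spin loop, not from any particular codebase):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hoisting this relaxed load out of the loop (i.e. coalescing all the
// iterations' reads into one) would turn a finite wait into an infinite
// one, so the compiler may not coalesce the reads here.
fn spin_until_set(flag: &AtomicBool) {
    while !flag.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
}
```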
Going a bit faster...
...adding new writes to a shared memory location is generally forbidden in the C++11 memory model, for the reason that you state, so as long as the compiler correctly identifies your memory as shared (I guess that means there's an UnsafeCell somewhere in Rust) you don't need to worry about it.
I'm not sure if I understand, the parenthesis seems to contradict the original statement.
As @comex points out, excessive granularity is the problem which byte-wise atomic memcpy is meant to solve, but it may not immediately resolve the SIMD issue due to current LLVM/rustc limitations.
Unfortunately, here, we face the problem that volatile is about memory-mapped I/O and precise hardware control, which will always kill optimization. This is why I think unordered atomics would be a more promising direction.
Thanks. It would be great if the Rust docs said this instead of talking about UB without qualification about potential weird future architectures.
I'm aware of this. If something doesn't need to be UB, it's frustrating for it to be designated as UB, since something being UB gives the compiler a license to do unwanted things even if there isn't a concrete practical reason why it would need such a license in the case at hand.
Right, so given volatile's reason for existing, it seems bad to have changes not caused by the program using volatile be designated as "data races" for the purposes of the rule that data races are UB.
This should be easy enough to provide in my case, since the scenario involves getting a pointer and a length from SpiderMonkey, either reading stuff or writing stuff via the pointer in the Rust function and then returning without remembering the pointer. I don't see any reason why I'd need to materialize a reference from the pointer, though it's kinda annoying to have to create a manual "volatile slice" struct wrapping a pointer and a length to avoid materializing a normal slice.
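Such a manual wrapper might look roughly like this (a hypothetical type, not an existing API):

```rust
use std::marker::PhantomData;
use std::ptr;

// A "volatile slice" over memory owned by the embedder (e.g. SpiderMonkey):
// it stores only a raw pointer and a length, so no &[u8] is ever
// materialized over the shared, concurrently-writable memory.
struct VolatileSlice<'a> {
    ptr: *mut u8,
    len: usize,
    _marker: PhantomData<&'a mut [u8]>,
}

impl<'a> VolatileSlice<'a> {
    // Caller promises `ptr` is valid for `len` bytes for the lifetime 'a.
    unsafe fn new(ptr: *mut u8, len: usize) -> Self {
        VolatileSlice { ptr, len, _marker: PhantomData }
    }

    fn read(&self, i: usize) -> u8 {
        assert!(i < self.len);
        unsafe { ptr::read_volatile(self.ptr.add(i)) }
    }

    fn write(&mut self, i: usize, v: u8) {
        assert!(i < self.len);
        unsafe { ptr::write_volatile(self.ptr.add(i), v) }
    }
}
```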
This seems to be going into the weeds, but if we're talking about the return value false of questionable resulting in a loop controlled by the return value looping infinitely, it looks to me like that's one of the possible executions of the unoptimized questionable.
OK. I'll do that. Thanks!
I mean: if the compiler tells the CPU to do two reads, the compiler has to optimize on the assumption that the values received could be inconsistent. However, if the compiler generates only one read and uses whatever it has in a register twice, another thread of execution can't change what's already in a register, so reusing that value as a substitute for the second read is fine, since we're not trying to observe changes by another thread; we're just trying not to have a security bug in case there is a badly-behaved other thread.
Hopefully for my use case, the memory accesses in the source code will be close enough to what would be optimal to actually perform, so hopefully this won't be too awful.
My guess without actually measuring is that losing SIMD would be worse than losing compiler-performed reordering and widening. (Since volatile on normal architectures generates the kind of instructions that non-atomic generates, AFAICT, reordering by the CPU is not lost.)
But, indeed, something like relaxed memory order with tearing allowed (for lock-free SIMD even when unalignedly crossing cache lines) would seem better than either volatile or relaxed.
I have some general notes that are not specific to volatile...
(Hardware may very well catch actual, literal fire. While the CPU itself perhaps may not, it may be connected to a PLC which could receive faulty instructions due to UB and then cause centrifuges at a power plant to tear themselves apart. This is not science fiction; it has actually happened. If you are not connected to anything safety-critical the risk is smaller, of course, but it still exists.)
(N.B. at this point in time, the UCG is merely a draft and has research material. Until such time that the language team has approved something in the UCG, it is not life insurance.)
"Need to be" can be as simple as "we want to preserve future language design freedom".
Lock-free algorithms typically expect that you can access machine-word-sized (e.g. usize) machine-word-aligned values in a relaxed manner without any possibility of tearing. (Such algorithms do not expect the same of unaligned accesses, or of u8/u16.)
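As a sketch of that expectation (a toy example, not a real algorithm):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static WORD: AtomicUsize = AtomicUsize::new(0);

// The no-tearing expectation: a reader sees either the old word or the new
// word in full, never some bytes of each. This is expected of aligned,
// machine-word-sized accesses, but not of unaligned ones.
fn writer() {
    WORD.store(usize::MAX, Ordering::Relaxed); // stored as one unit
}

fn reader() -> usize {
    WORD.load(Ordering::Relaxed) // 0 or usize::MAX, never a mix
}
```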
Yeah, it's a known issue that there are ergonomic and correctness problems in that area. Relevant links:
Discussion about VolatileCell and how it interacts with LLVM dereferenceable (TL;DR: it's currently broken. If it weren't broken, you could safely cast to types like &[VolatileCell<i32>] rather than needing to use raw pointers.)
RFC for an operator to take a raw reference (TL;DR: currently, there's no way to go from a raw pointer to a struct to a raw pointer to one of its fields without creating a reference in between, which technically causes UB if the pointer is invalid; see the sketch below.)
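To make the second point concrete, here is the problem that RFC addresses, with a made-up struct:

```rust
struct Header {
    len: u32,
    flags: u32,
}

// Today the only way to derive a field pointer goes through a reference,
// which asserts validity (and, per the VolatileCell discussion,
// dereferenceability) of the whole struct even if `p` is dangling.
unsafe fn flags_ptr(p: *mut Header) -> *mut u32 {
    &mut (*p).flags as *mut u32 // a &mut u32 is created in between
}
```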
I meant better for my use case, not better for everything. The general problem with many language features in this area is that they are designed for multiple threads intending to communicate with each other. My use case involves having obvious semantics in the single-threaded case and being not a security bug in the presence of an adversarial second thread, so I don't care if the adversarial thread could observe tearing, since I'm not trying to communicate with it.
I don't think RCU has any problem there, the data reads are properly synchronized. I mean, when you access the data protected by a lock, you are also accessing shared memory without using atomic instructions.
No. The compiled program may only diverge (run into an infinite loop without any more observable behavior) if the source program also had that as a possible behavior. (When considering concurrency, you likely want to add "under a fair scheduler" as a restriction -- "fairness" exists as a formal concept, though several versions of it exist.)
Racing volatile accesses and UB
I added that. I did that because, to my knowledge, that's the current status of the C++ memory model, and Rust follows that model (despite us actually compiling through LLVM, which uses a slightly different model -- but the compilation from C++ to LLVM is okay, just the other direction is not).
If the compiler "detects" that two threads it is compiling race through volatile accesses, it is allowed to exploit that just like any other UB. "Implementation-defined" is not strong enough wording for that. It would mean that rustc, the implementation, should be able to specify what happens -- but we cannot specify that.
The only reason this does not lead to a problem with MMIO is that the two "threads" racing here are not in the same compilation unit.
This is why I don't think we should weaken the "it is UB" qualification in the docs of our volatile accesses.
I acknowledge that the de facto C++ memory model does not have UB here, only the standard does. But I am wary of relying on that for our own "standard-like" text.
I fully agree, we should reduce "unnecessary" UB. However we have to be careful as this is a one-way street -- if we later find out about new optimizations, we cannot go back.
For the concrete case of concurrency that we have here, there also is the problem that we don't even define the UB ourselves. We inherit it from C++. Moving away from the C++ atomic memory model to another one (like LLVM's) is a big decision, not just a small docs change.
If the C++ model changes though and exempts volatile accesses from data races, we inherit that "for free".