(This is an attempt to retry a topic that was previously closed without replies.)
There are cases where Rust code is used for implementing services for callers that are untrusted. My case at hand is about providing host services to multithreaded Wasm, but I imagine the same issue arises when implementing syscalls in an OS kernel.
The untrusted caller designates some range of memory for reading or writing by the Rust code that we’re writing. There is no synchronization issue with the calling thread of execution: it is suspended while our Rust code runs. That is, for concurrency purposes the case is as though the untrusted caller called our Rust code as a function.
Therefore, for concurrency purposes, it’s desirable for the Rust code to use regular loads and stores and not cause any stronger synchronization.
However, for optimization purposes regular loads and stores assume that the program is free of data races, and the compiler is allowed to optimize on that assumption. Here, the absence of another untrusted thread of execution is not guaranteed: Another untrusted thread of execution might read or write the same memory region.
What mechanism should be used in Rust to guarantee that such an untrusted thread of execution can’t cause UB to our Rust code? That is, to guarantee that our Rust code performs writes as “write and forget” and doesn’t invent reads that depend on the written data coming back as written, and that repeated reads of the same memory are not assumed to yield consistent results and there are no optimizations made on the assumption that subsequent reads should see data consistent with previous reads?
It is OK for our Rust code to inflict UB onto the other thread of execution: we don’t need to guarantee that it experiences our writes in any reasonable or consistent way. That is, if another thread of execution accesses the same memory concurrently, it is badly behaved (i.e. it’s not supposed to do that, but since it’s untrusted, we need to assume it might) and it is appropriate for it to experience UB.
Does Rust already have a mechanism for performing non-synchronizing loads and stores from memory that’s not guaranteed to be free of data races in a way that cannot contaminate the Rust code with UB?
TL;DR I believe that you want a combination of raw pointers and LLVM’s memcpy.element.unordered.atomic.
First of all, if it is not safe to insert arbitrary loads to a given memory region, Rust references cannot be used, because they are annotated with LLVM’s dereferenceable attribute, which allows just that. So the racy data must only be accessible via raw pointers.
Second, if there is a risk of a data race, then you must use atomic or volatile reads/writes. The two annotations provide different guarantees and it gets confusing quickly, but in a nutshell…
Relaxed atomic reads/writes inform the compiler of the possibility of a data race (which suppresses some optimizations, such as caching previous reads of a value). They are ~free on all widely used CPU architectures, up to pointer-sized data, as they only assume cache coherence. But be sure to pair them with appropriate Acquire/Release operations in order to prevent them from being reordered forward or backward into a dangerous region of the program.
volatile is meant for memory-mapped IO use cases, and provides stronger guarantees which you likely do not need, such as lack of reordering of volatile operations with each other or guaranteeing that four u8 reads cannot be coalesced into a u32 read.
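To make the Relaxed option concrete, here is a minimal sketch of word-sized Relaxed loads and stores performed through a raw pointer. The function names `load_racy_u32` and `store_racy_u32` are hypothetical, and the sketch assumes the pointer is suitably aligned for `AtomicU32` and that every other party touching the word also uses atomic accesses.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Reads one 32-bit word that another thread may be writing concurrently.
///
/// # Safety
/// `ptr` must be non-null, aligned for `AtomicU32`, and valid for reads and
/// writes for the duration of the call. Assumption: all concurrent accesses
/// to this word are atomic as well.
pub unsafe fn load_racy_u32(ptr: *mut u32) -> u32 {
    // AtomicU32 has the same in-memory representation as u32.
    let cell: &AtomicU32 = unsafe { &*ptr.cast::<AtomicU32>() };
    cell.load(Ordering::Relaxed)
}

/// Writes one 32-bit word without imposing any ordering on other accesses.
///
/// # Safety
/// Same requirements as `load_racy_u32`.
pub unsafe fn store_racy_u32(ptr: *mut u32, value: u32) {
    let cell: &AtomicU32 = unsafe { &*ptr.cast::<AtomicU32>() };
    cell.store(value, Ordering::Relaxed);
}
```

Each call performs exactly one atomic access of the stated width, which is the granularity restriction that the next paragraph is concerned with.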
Now, AtomicSomething::load/store and ptr::atomic_stuff do not do exactly what you want, because they impose a certain granularity on atomic memory accesses. And this is where LLVM’s memcpy.element.unordered.atomic comes into play.
What this does is to weaken the atomic contract by weakening the assumption of cache coherence (see the definition of “unordered” in LLVM’s memory model) and allowing an atomic read of any type T to be broken down into smaller atomic reads. Together, these allow “atomically” reading types of any size in a manner that is as cheap as a regular read.
Thanks. This matches my understanding. Specifically, that I need LLVM “unordered”, because the C++11/C11 atomics are only valid if paired (acquire with release) within the memory model.
Note that you still need something of that kind with "unordered", as far as I can tell. Otherwise, the compiler and CPU are free to move your reads/writes as far forward and backward as they like, which can move them out of the critical region and create new data races!
What kind of scenario would be a problem? Naively, it seems that the operations can’t be moved past the entry and return into the function that implements the host service for Wasm (or a syscall in a kernel).
I've been told (by a JS compiler developer) that none of the C++11/C11 atomic modes are valid to use when the other thread of execution isn't using them, too, for the same memory locations.
What guarantees "relaxed" atomic access not to be UB if another thread of execution accesses the memory locations via a mechanism that is not guaranteed to be a Rust/C11/C++11 atomic access?
Consider the following pseudocode, where “shared” is a shared value:
// Caller task
shared = 24;
call_host();
x = shared;
// Host task, possibly on a different CPU
shared = 42;
wake_caller();
While unordered atomics guarantee that the write of 42 to shared will occur at some point, in the absence of a Release barrier in wake_caller() and an Acquire barrier in call_host() that synchronize with each other, it is legal for x to contain 24 at the end of this program, which is likely not what you want.
The reason is that the compiler and CPU are allowed to reorder the write to shared inside or after the call to wake_caller(), and also allowed to reorder the read to x before or inside the call to call_host().
Now, I’m pretty sure that any sane implementation of call_host() and wake_caller() would feature these barriers. I just wanted to point out that you need them for correctness.
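As a rough Rust rendering of the pseudocode above, the sketch below models the host as a second thread and uses a hypothetical `done` flag to stand in for `wake_caller()` and for the wait inside `call_host()`. The function name `handoff` and the flag are illustrative assumptions, not part of any real host API.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs the caller/host exchange once and returns what the caller reads.
pub fn handoff() -> u32 {
    let shared = Arc::new(AtomicU32::new(24));
    let done = Arc::new(AtomicBool::new(false));

    let host = {
        let (shared, done) = (Arc::clone(&shared), Arc::clone(&done));
        thread::spawn(move || {
            // Host task: the data write itself carries no ordering.
            shared.store(42, Ordering::Relaxed);
            // Stand-in for wake_caller(): Release publishes the write above.
            done.store(true, Ordering::Release);
        })
    };

    // Stand-in for the wait inside call_host(): Acquire pairs with the Release.
    while !done.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // Ordered after the host's write because of the Release/Acquire pair.
    let x = shared.load(Ordering::Relaxed);
    host.join().unwrap();
    x
}
```

With the Release store and Acquire load in place, the caller cannot observe the stale value 24. Downgrading both flag operations to Relaxed would remove that guarantee.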
This seems to assume that "Caller task" and "Host task" could run on different OS threads. I'm assuming that they run on the same OS thread. Is there still a problem if they run on the same OS thread? (i.e. wake_caller() is a return.)
No, the CPU and compiler guarantee sequential consistency for single-threaded code. You actually don’t need atomics at all if the memory region is guaranteed to never be accessed from multiple threads (but maybe you cannot guarantee this on the caller’s side…).
Right, the whole issue is that I can't guarantee that there isn't another (badly-behaved) Wasm thread that's not the one calling into the host service.
Then I guess the question is, do you have host-side critical sections whose correctness relies on the fact that a read or write to/from shared memory occurs within a certain window of the program flow?
For example, given this code…
// Thread 1
shared = 24
// --- CALL TO HOST ---
y = shared
shared = do_something(y)
// --- RETURN TO CALLER ---
shared2 = 666
x = shared
// Thread 2
while shared2 != 666 {}
shared = 42
…it is possible to get x == do_something(42), if the compiler/CPU decides to reorder shared2 = 666 before the call to the host in thread 1, or to reorder shared = 42 before the busy-wait for shared2 == 666 in thread 2.
If your host code’s logic is okay with reading arbitrary garbage from shared memory regions (which it probably should be, given the constraints you operate under, where a malicious second thread may write to shared memory at any time), this is probably fine.
Wait… Sorry, I accidentally took you in the wrong direction here.
Your original question was, I believe, whether a program which only interacts with shared memory using unordered loads and stores is free of undefined behavior when facing an adversary that can access said memory in an arbitrary way.
And I think the answer to that question is no. For example, using unordered loads to access memory that was written to using non-atomic stores is still UB, and LLVM’s optimizer will not hesitate to destroy your code with undef if it ever finds out that you are doing this.
For unordered loads/stores to be strictly UB-free, you still need to guarantee that the adversary is using (at least) unordered atomic loads and stores to access the shared memory region.
In a WASM setting, this is actually something which you should be able to do, because you control the compiler that is used to compile the adversary code, I believe.
But in an OS kernel setting, reading from shared memory that’s accessed in a non-atomic way by some other code, no matter if it’s done with unordered or relaxed loads, is always UB. The only reason why it works is that LLVM cannot figure out what we’re doing, and all known hardware actually does something reasonable in that case (namely returning integers of the requested size with arbitrary bit patterns).
So…
If you want to be free of undefined behavior, you can do so today by tweaking the WASM compiler so that all reads and writes to memory that’s shared with the host are atomic and at least Relaxed, and then using Relaxed reads in the host code in the manner that @bill_myers has suggested.
If you need reads or writes to happen in a certain window of time (e.g. because the shared memory block was allocated at some point in the past and will be freed at some point in the future, possibly in a different thread), you must also use Acquire/Release barriers + synchronization to delineate the region of code within which they should occur.
If the reads and writes can truly be reordered arbitrarily far in the past or the future by the compiler and CPU, then you do not need such barriers.
Someday, you will be able to optimize the performance by using Unordered load/stores instead of Relaxed load/stores in both the WASM compiler and the host code. But you do not need this to achieve UB-freedom today. It just enables more compiler optimizations.
An OS kernel that reads from untrusted shared memory using LLVM’s atomic load/stores is actually triggering undefined behavior. But it works out in practice because LLVM cannot detect the UB and unlike LLVM, the hardware does something sensible.
If you truly want UB-freedom in that case, volatile should work out as it builds on the super-strong assumption that hardware is fiddling with memory behind your back. But then you’ll also pay performance for guarantees of volatile that you do not actually use…
Scratch that. LLVM’s documentation explicitly states that code which uses volatile on shared memory is UB. So even volatile won’t save you from LLVM’s notion of UB.
Thinking about it further, guaranteeing that the adversary code only performs atomic operations on shared memory is not enough to prove absence of UB. You must also ensure that the adversary is not allowed to write mem::uninitialized::<u8>() into bytes of the shared memory block. Because if it successfully did, it would also be undefined behavior for you to read these from LLVM’s perspective.
This all gives me a growing suspicion that LLVM’s memory model was just not designed for shared memory interactions with untrusted unsafe code. For those interactions to be safe, you need to rely on properties outside of LLVM’s design scope, e.g. absence of LTO between the host and the untrusted guest (to prevent LLVM from detecting and exploiting “harmless” UB when it can’t be prevented) and hardware with sane load/store semantics (no uninitialized value tracking, data races result in valid integers with unpredictable bit patterns, etc).
In a WASM context, it is probably easiest to design the WASM runtime in such a way that it’s impossible for malicious code to trigger undefined behavior in the host. So, only data for which arbitrary bit patterns are valid (like integers) can be shared, no non-atomic operations on shared memory, no access to mem::uninitialized (or equivalent), and no ability to deallocate memory that’s shared with the host. I think that by construction, WASM should guarantee some of these properties already, but you tell me.
EDIT: I think the following API sums it up
/// Build a slice of bytes that can be shared with arbitrary untrusted code
///
/// This is an alternative to std::slice::from_raw_parts[_mut] for cases where
/// the underlying bytes are shared with arbitrary and untrusted unsafe code.
/// Which is extremely dangerous, because said code has countless ways to make
/// us engage in Undefined Behavior, and LLVM will eat our laundry if it ever
/// finds out that we're doing that.
///
/// Remember that because the bytes are shared with untrusted code, you cannot
/// assume anything about the values of the bytes in the slice (e.g. that they
/// can be transmuted into a valid value of a certain type T), and you cannot
/// rely on the other code to use atomic memory orderings correctly for
/// synchronization.
///
/// # Safety
///
/// You are responsible for ensuring the following safety conditions:
///
/// 1. The "data" pointer is neither null nor dangling, and will remain safe to
///    dereference for the entire `'a` lifetime. You do not need to care about
///    alignment because `u8` is always well-aligned.
/// 2. The specified "length" does not overflow the allocation backing the
///    "data" pointer, and will never do so for the entire `'a` lifetime.
/// 3. No link-time optimizations are performed between you and the untrusted
///    codebase. This is necessary because some "harmless" forms of Undefined
///    Behavior cannot be avoided, and LLVM can miscompile your code if it
///    is exposed to that UB.
/// 4. The underlying hardware guarantees that reading a u8 from a valid pointer
///    always yields a valid u8 value, and that writing a u8 using a valid
///    pointer always succeeds.
///
/// These conditions have the following implications:
///
/// - The untrusted code must be unable to manipulate the shared allocation in
///   harmful ways. In particular, it should not be able to deallocate the
///   shared allocation, resize it, or unmap it from your virtual address space.
///   Ideally, it should only be able to manipulate the data inside of the
///   allocation; every other access to the allocation must be carefully vetted.
/// - This, together with condition 3, implies that the untrusted code is either
///   living in a separate OS process, or *both* linked to this process in an
///   LTO-hostile way and sandboxed in such a manner that it cannot manipulate
///   this process' memory allocations in dangerous ways.
/// - This abstraction is not suitable for exotic hardware architectures that
///   can track uninitialized memory, nor for accessing "exotic" memory such as
///   memory-mapped IO and memory-mapped hardware registers.
///
pub unsafe fn from_raw_parts_untrusted<'a>(data: *mut u8, length: usize) -> &'a [AtomicU8] {
    core::slice::from_raw_parts(data as *const AtomicU8, length)
}
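One way such an API might be used is to copy the shared bytes into private memory exactly once and then validate only the private copy. The sketch below repeats the function from the post so that it is self-contained. The helpers `snapshot` and `decode_flag` are hypothetical names introduced purely for illustration.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Same shape as the function proposed above; see its safety conditions.
pub unsafe fn from_raw_parts_untrusted<'a>(data: *mut u8, length: usize) -> &'a [AtomicU8] {
    unsafe { core::slice::from_raw_parts(data as *const AtomicU8, length) }
}

/// Copies the shared bytes into private memory, one Relaxed load per byte.
/// After this point the untrusted side can no longer change what we see.
pub fn snapshot(shared: &[AtomicU8]) -> Vec<u8> {
    shared.iter().map(|b| b.load(Ordering::Relaxed)).collect()
}

/// Validates a byte from the private copy instead of transmuting it:
/// only 0 and 1 are accepted as a boolean flag.
pub fn decode_flag(byte: u8) -> Option<bool> {
    match byte {
        0 => Some(false),
        1 => Some(true),
        _ => None,
    }
}
```

Working on the snapshot sidesteps the question of inconsistent repeated reads, at the cost of one copy. Any type with invalid bit patterns has to go through an explicit check like `decode_flag`.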
Yes, the assumption here is that the host code would read garbage from the Wasm heap if another Wasm thread (not the one calling into host services) writes concurrently to the memory region that the host service is reading from. AFAICT, compared to normal Rust reads, the constraint is that if the host service code reads the same memory location twice, the code must not be optimized on the assumption that the two reads yield consistent garbage: They might yield inconsistent garbage. For example, if I first do a wider (ALU word or SIMD) read and test the value such that I've proven some property about the value, if I then do byte-wise reads of the same memory locations, the code must not be optimized based on the test performed on the value obtained from the wider read. Concretely, if I've masked a 32-bit word such that it looks like at least one of its bytes has the high bit set, and I then loop over the bytes, the optimizer must not delete the loop termination on a max number of bytes on the assumption that the loop must find a byte with the high bit set and terminate on that condition, because another thread of execution might have zeroed out the memory locations.
Assuming that the memory region stays allocated (the Wasm runtime presumably is responsible for keeping the Wasm heap in existence), and considering that uninitialized memory is an optimizer construct and not a machine construct, I have a really hard time understanding why this kind of reading could not be done using plain read instructions as long as the optimizer is told to assume that each read's value is potentially independent of values previously read from the same location. (I think this is what "unordered" does.) I don't understand why I'd need the Wasm compiler to guarantee specific kinds of writes.
Considering that LLVM's optimizer operates statically and the other thread of execution is a run-time phenomenon whose code isn't available at the time the LLVM optimizer runs, how would the optimizer find out?
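The high-bit scenario can be sketched as follows. The function name `first_high_bit` is hypothetical, and the "wide read" is emulated here by OR-ing up to four Relaxed byte loads instead of a real word or SIMD load. The second pass re-reads the bytes and keeps its own bound, so it does not rely on the first pass.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Returns the index of the first byte observed with its high bit set.
pub fn first_high_bit(bytes: &[AtomicU8]) -> Option<usize> {
    for (chunk_index, chunk) in bytes.chunks(4).enumerate() {
        // First pass: combined check over the chunk (emulated wide read).
        let mut acc = 0u8;
        for b in chunk {
            acc |= b.load(Ordering::Relaxed);
        }
        if acc & 0x80 != 0 {
            // Second pass: byte-wise re-read, bounded by the chunk length.
            for (i, b) in chunk.iter().enumerate() {
                if b.load(Ordering::Relaxed) & 0x80 != 0 {
                    return Some(chunk_index * 4 + i);
                }
            }
            // The bytes changed between the two passes; keep scanning.
        }
    }
    None
}
```

If another thread zeroes the chunk between the two passes, the inner loop simply runs to its bound and the scan continues, instead of running off the end of the chunk.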
I have a really hard time understanding why this would be the case for reads. See above.
I admit that I understand less about the write case, but I don't understand why it would be insufficient to perform regular write instructions and tell the optimizer that it mustn't invent reads that'd read back anything that gets written. If a rogue thread of execution goes and reads the range of memory before caches have been synced up, I'd expect it to read some garbage, but that seems fine, since it would be inflicting UB (for the purpose of its source language) onto itself. That is, I don't understand how whatever it sees could not be explained by another thread having done some sequence of byte-wise writes, and since another thread might have done any sequence of byte-wise writes, how could the rogue thread experience anything worse than another Wasm thread having written some sequence of byte-wise writes before this observation and another sequence of byte-wise writes between now and the next observation if the rogue thread re-observes the memory after caches have been synced?
Since e.g. Itanium and aarch64 take the pairing of acquire and release all the way to the ISA, I understand why atomics need to be paired, but I still don't understand why my scenario needs anything but regular load/store instructions plus turning off optimizations that make assumptions about reads yielding values consistent with previous reads or writes to/from the same memory location.
How does Linux compiled with clang or xnu actually read or write user-process memory regions that are operands of syscalls? Do they use clang-generated instructions or are all such accesses using manually-written asm that doesn't participate in clang's instruction selection or optimization at all?
I thought allocated but uninitialized memory was exclusively an optimizer construct and that it's impossible to have machine instructions that would actually materialize uninitialized values into allocated memory at run time. What am I missing?
This seems trivially the case, because the untrusted code isn't compiled at the time the host code is compiled.
As a disclaimer, if it can help with your current confusion, what we're discussing here is, as far as I can tell, undefined behavior that LLVM will not be able to detect in any current use case, and that all current mainstream hardware should handle cleanly for any current code generation and optimization scheme. So we're definitely in pedantic compiler rule lawyering territory here.
Strictly speaking, you need atomic volatile (which afaik exists in LLVM IR but isn't exposed in Rust) in order to guarantee that all memory operations within your code will occur. With atomic alone, a compiler is still allowed to optimize out some memory operations. And with volatile alone, concurrent accesses to memory from multiple threads are UB.
There are some sanity restrictions on the extent to which the compiler can optimize atomics, though. For example, C11-style atomics guarantee forward progress by eventually propagating atomic writes to threads doing atomic reads. So if you have something like this...
fn spin_wait(b: &AtomicBool) {
    while b.load(Ordering::Relaxed) { do_stuff(); }
}
...then the compiler is not allowed to transform it into this:
fn bad_wait(b: &AtomicBool) {
    let x = b.load(Ordering::Relaxed);
    while x { do_stuff(); }
}
...but as far as I know, it can transform it into this:
fn ugly_wait(b: &AtomicBool) {
    while b.load(Ordering::Relaxed) {
        for _ in 0..100 {
            do_stuff();
        }
    }
}
Furthermore, since this is about guaranteeing forward progress, only infinite loops are subjected to this rule. So if you have something like this...
fn questionable(bs: &[AtomicBool]) -> bool {
    for b in bs {
        b.store(false, Ordering::Relaxed);
    }
    for b in bs {
        if b.load(Ordering::Relaxed) { return true; }
    }
    return false;
}
...then if it is able to prove that the return value of questionable() will never be used to control a loop (because it has inlined the function everywhere and it's not publicly visible outside of the crate), the compiler is allowed to optimize it into this:
fn questionable(bs: &[AtomicBool]) -> bool {
    for b in bs {
        b.store(false, Ordering::Relaxed);
    }
    return false;
}
If that kind of optimization is harmful in your case, then you must use atomic volatile or assembly.
Although it is indeed a pure optimizer construct on mainstream hardware (see my rule lawyering warning above), there are CPUs in the wild that can track uninitialized memory. Every low-level dev should thank AMD for the fact that Itanium is not a mainstream architecture anymore, but compiler and language authors cannot assume that no other CPU architecture will get the same "great" ideas in the future...
unordered does not tell LLVM to assume that each read's value is potentially independent of values previously read from the same location. That's closer to volatile's business. What unordered does is to tell LLVM that...
- Memory operations cannot be split (e.g. an unordered u16 read cannot be split into two u8 reads).
- Memory operations cannot be narrowed (e.g. it is not okay to turn a u16 write into a write to the 8 low-order bits even if only those bits changed from the optimizer's perspective).
- The optimizer cannot add new stores to the memory location on code paths where there weren't any (but it can add new loads, remove some loads and stores if it doesn't break forward progress, and reorder existing loads and stores).
Basically, unordered is C11's relaxed minus the "cache-coherence" assumption that all threads agree on a single total store order for all writes to a single memory location. This allows optimizations like reordering atomic stores with respect to each other (which is forbidden when using relaxed ordering).
You need the WASM compiler to guarantee specific kinds of writes in order to be free of undefined behavior from LLVM's perspective. But you may also decide that you don't care about this specific undefined behavior because you made every precaution to ensure that LLVM doesn't see it, you are confident that no sane code generation strategy can make anything bad out of it, and you trust all hardware you are targeting to do something sensible if it happens.
Personally, my point of view is that the memory model of LLVM (and perhaps that of some niche hardware like Itanium too, from what I gather) is simply not designed to allow two mutually untrusting processes to share memory safely, and that you need to operate outside of that model to some extent when doing so. So I wouldn't frown about relying on UB to be implemented in a certain way in this case. But people who care more about being UB-free than I do might ask you to implement this part in assembly just to be safe. Assembly is safer here, because it operates in the hardware's memory model instead of LLVM's.
You must be doing something really weird for this to happen, basically doing LTO between the host and the client in a fashion that recompiles the host's code.
I don't think that it can happen in statically compiled Rust, but it might happen in a different Rust implementation that re-optimizes code at runtime using a JIT (and yes, I know a crazy person who is experimenting around that sort of implementation).
So to summarize, it works on today's Rust, but it may not work on every possible implementation in the future. It's just a fact that you must be aware of: you're relying on implementation-specific behavior here, but it's behavior that's likely to stay the same for a long while so it may not be too bad.
"Data races are UB" is, as far as I can tell, a pure optimizer concept on current hardware when accessing physical RAM. All multi-core hardware that I am aware of guarantees that reading from initialized and correctly mapped RAM will return bytes, even if another core is concurrently writing to the same memory location. Those bytes which you are reading might be completely garbled and random, but for every read you do, you will receive correct bytes in return.
Now, that's for physical RAM, but I think there might be edge cases around weird uses of the memory bus, like operating system's memory-mapped I/O APIs (where any read from RAM traps to the OS and can therefore cause arbitrary behavior) or memory-mapped hardware registers (where what looks like operations on physical RAM is actually triggering arbitrary behavior in hardware other than the CPU). Which is why I said that those "special" memory types are, I believe, not that safe to share with untrusted code. You must only allow such sharing to happen in extremely controlled ways.
TL;DR Don't share weird memory with untrusted code without carefully examining the implications, and don't let the optimizer see that you are engaging in a data race, and as far as I can tell you should be fine on all current hardware and for all current optimization + code generation algorithms (which do not assume being aware of every thread in the program).
So, there are two different ways of looking at acquire and release, and it may sometimes be helpful to switch between the two points of views.
The first way to look at these is the language memory model view. From this perspective, acquire and release are about synchronizing non-atomic memory accesses by piggybacking on an atomic transaction (loads for acquire, stores for release).
The other way to look at these is the hardware view. From that perspective, acquire reads are a memory barrier that prevents subsequent loads and stores from being reordered before them in the current thread, and release writes are a memory barrier that prevents previous loads and stores from being reordered after them in the current thread.
Although the language memory models do not actually mandate this "hardware view" to be correct, it is pragmatically speaking the only way to implement the "language memory model view" as long as (1) memory barriers are the hardware's tool of choice for reordering control (true of all current hardware) and (2) compilers do not assume an omniscient view of all threads in existence and cannot elide the hardware barriers.
You can't rely on the first view when the other thread is not cooperating, but you may sometimes need to rely on the second view when doing low-level stuff where the timing/ordering of loads and stores is very important. By default, compilers can reorder loads and stores arbitrarily far away in the "past" and the "future" of the program, and this is not always correct in low-level use cases.
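The "barrier" reading of acquire and release can be sketched with standalone fences from `std::sync::atomic`. The statics `DATA` and `READY` and the functions `publish` and `consume` are hypothetical names. All loads and stores are Relaxed, and the ordering comes only from the two fences.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

/// Writer side: the Release fence keeps the DATA store from being
/// reordered after the READY store.
pub fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}

/// Reader side: the Acquire fence keeps the DATA load from being
/// reordered before the READY load that observed `true`.
pub fn consume() -> u32 {
    while !READY.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
    fence(Ordering::Acquire);
    DATA.load(Ordering::Relaxed)
}
```

This only helps when both sides cooperate. Against an uncooperative thread the fences still constrain where this thread's own accesses may move, but say nothing about what the other side observes.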
Linux operates outside of the C language's memory model, and a lot of what it does is undefined behavior according to modern C standards.
To make it all work, the Linux code uses a complex mess of asm statements to enforce reordering barriers for the compiler (via clobbers) and hardware (via asm instructions), together with careful use of volatile in order to prevent new loads and stores from being inserted when that's dangerous.
It's all very fragile, and GCC broke Linux several times by tweaking its optimizer. I also suspect that it's part of the reason why compiling Linux with clang is such a long-running effort: Linux relies on a lot of GCC-specific implementation details.
There are talks about trying to port the Linux code to the C11 memory model, but before that can be done, said memory model must be extended a bit in order to handle things like seqlocks (which cannot be cleanly expressed in terms of acquire loads and release stores) or RCU (which relies on "consume" memory ordering that no one knows how to specify at the language level).
TL;DR: You probably don't want to use Linux as an example of what to do unless you have no choice.
Note that people also use RCU in userspace, and it’s critically important to have some way to access shared memory (in aligned machine-word-sized pieces) without using atomic instructions.
Any memory model we apply to Rust needs to have some way to support such accesses. In the C standards, there was a strong push to just define the problem away and use much stronger ordering, and that got (correctly) shot down by developers working on scalable synchronization.
I'm not 100% sure if I fully understand what you mean. If we use RCU as an example, do you mean being able to read the RCUd memory block without atomic operations? Or being able to access the shared pointer to that memory block without atomic operations?
The former is already allowed by the C11/LLVM memory model, because you're not racing with any other thread concurrently writing to that memory block.
The latter, on the other hand, is more of a problem. One way or another, you need some sort of annotation to tell the compiler and hardware that some code optimizations that are valid in sequential code, such as speculatively prefetching data from an old version of the RCUd memory block before checking that the value of the shared pointer hasn't changed (if you check that at all), are dangerous in a multi-threaded environment and should not be carried out.
Now, that annotation can be C11-style atomics, or it can be the kind of manually hacked up compiler and hardware memory fences + volatile annotation abuses that people used before these were a thing. Or it can be something else entirely, maybe a programming language + hardware combo whose semantics are not so tightly bound to an assumption of sequential execution. But there has to be something, and ideally that something shouldn't depend on implementation details that would change from one compiler to another, or even from one version of a compiler to another.
Then if rustc is to continue using LLVM as a backend, serious work will need to go into improving LLVM's memory model so that it can accommodate those extra use cases.
That wouldn't be a bad thing, if you ask me. A memory model that unconditionally allows the compiler's optimizer to destroy your code when you read from untrusted bytes is just not a good fit for writing security-critical code.
But the difficulty of combining high control on what the hardware is doing with the high degree of automatic optimization that is expected of modern programming languages should not be underestimated. Things like volatile and atomic exist precisely because yesterday's compiler authors reached the conclusion that these two objectives just can't be satisfied with a single 100% unified memory access model.
If we disagree with them, then we must find out what the flaw in their reasoning was. If we don't disagree with them, then we may need to extend the memory model with more memory access modes that fit those kinds of use cases (I don't know, one could maybe call that untrusted pointers or something).
I mean both reading the pointer and reading the data through the pointer. However, it’s fine if doing so requires annotations on the accesses, as long as those annotations don’t add any runtime overhead such as atomic operations or unnecessary memory barriers.
Some accesses currently use volatile annotations to force read-once semantics. And RCU also uses “compiler barriers” extensively, which we ought to specify in some reasonable way.
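For reference, the closest standard-library tools today are probably `std::ptr::read_volatile` and `std::sync::atomic::compiler_fence`. The sketch below shows the shape of the idiom only. The helper names `read_once` and `barrier` are hypothetical, loosely modeled on kernel-style macros, and, as discussed earlier in the thread, a volatile access that actually races is not blessed by the current memory model.

```rust
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

/// Rough analogue of a "read once" access: a single volatile load.
///
/// # Safety
/// `slot` must be non-null, aligned, and valid for reads.
pub unsafe fn read_once(slot: *const usize) -> usize {
    unsafe { ptr::read_volatile(slot) }
}

/// Compiler-only barrier: restricts how the compiler may reorder memory
/// accesses across this point, without requesting a hardware fence.
pub fn barrier() {
    compiler_fence(Ordering::SeqCst);
}
```

Neither helper asks for an atomic read-modify-write instruction or a hardware memory barrier, which matches the "no runtime overhead" requirement stated above.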
I’m not asking for more than what C compilers (with extensions) support today, just that we should have standardized versions of enough of that in Rust.
I suppose you are thinking about hardware-level atomic RMW operations (like LOCK-stuff on x86) and memory barriers (like MFENCE on x86) here, am I correct? Because some of the things that compilers call atomics are pure optimization barriers, and do not intrinsically add any hardware-side overhead. All they do is to prevent the compiler's optimizer from transforming your code in some ways.
Unordered and Relaxed atomic loads and stores are two examples of this in LLVM's memory model. On any cache-coherent hardware (that is, any modern multicore CPU), these two are free at the hardware level, but they prevent some optimizations that are generally not desired in multi-threaded code. Kind of like volatile accesses to memory-mapped hardware.