Pre-pre-RFC: Exploring API design space for volatile atomics

(I fixed the link.)

I find that particular example not terribly persuasive, since it's using local variables -- which Rust might have put anywhere -- without passing their address outside of anything Rust can see, and it's passing &mut u64 to the volatile one, so it would be UB for anything else to change or even hold a reference to that memory, volatile or not. So I think that if the volatile matters there, it's already UB before you get to it.

Do you have an example that shows this in a more realistic situation, like an extern/exported static, or something using mmap/shm_open? What's the design you're trying to use with shared memory where you think that volatile is necessary?

If it's "I want to do exactly what x64 does", that's called asm!, not "I'm trying to turn off the optimizer by using volatile".

Wouldn't this mean that code getting an arbitrary *const T/&T would have to assume it might have gone through red_bikeshed, and hence never be able to optimize writes through them?

5 Likes

I don’t have any particular example of this optimization in a more realistic situation, but there’s no guarantee that it wouldn’t happen either. My own use case is mmap(), which probably shouldn’t be affected even without volatile, but I don’t think this “probably shouldn’t be affected” is codified in either the LLVM or C/C++ docs. LLVM’s semantics for volatile guarantee that accesses won’t be optimized out, which is what I actually want:

The optimizers must not change the number of volatile operations

In other words, I don’t think that this problem currently affects Rust programs dealing with shared memory, and I rely on the compiler being sensible enough not to optimize out accesses to weird pointers from FFI functions, but I’d like to stop relying on stuff that’s not actually written down anywhere.

One other possible option could be documenting that “Rust won’t optimize out accesses to pointers from FFI calls or to exported statics”, but I’m not sure how such a guarantee would actually be enforced.

My design in particular just uses shared atomics paired with futexes for interprocess synchronization, so I need to be sure that when I write to the atomic, the change is actually observed on the other side.

I avoided using volatile because, from my understanding, it would actually be an unnecessary optimization barrier for the original use case, where you basically just want to declare that another thread/process might have access to this memory.

pub unsafe fn some_fn(num: &core::cell::UnsafeCell<i32>) {
    *num.get() = 3; // removable: any concurrent observation of the 3 would be UB
    *num.get() = 4; // but this final store cannot be optimized out
}

It can still optimize out storing the 3 even if the reference went through red_bikeshed, because you would have UB otherwise. But it cannot optimize out storing the 4. Maybe observable is the wrong concept for what I meant.

3 Likes

I think it would really help if you could figure out what you actually rely on more precisely, like what the different memory orderings (https://marabos.nl/atomics/memory-ordering.html) do for atomics.

For example, if you're actually doing shared memory, I don't see how you could ever care about the compiler removing a read that you didn't use, and thus I can't see why you'd ever need a volatile-and-atomic read. And if you do two relaxed stores to the same place in shared memory one after the other, there's no way to reliably read the first store anyway, so I can't see why you'd need those writes to be volatile either.
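
For instance, in a sketch like this (my own illustration, not your code):

use std::sync::atomic::{AtomicU32, Ordering};

static FLAG: AtomicU32 = AtomicU32::new(0);

pub fn two_stores() {
    // a racing reader is never guaranteed to observe the 1, so
    // collapsing these into just the store of 2 is a legal optimization
    FLAG.store(1, Ordering::Relaxed);
    FLAG.store(2, Ordering::Relaxed);
}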

Basically, if what you want in shared memory is to have guarantees about how the other process can see the writes that happened from the first process, then that sounds to me like exactly what the atomic rules are all about. So if you need something more than that, it would be good to get details about what.

5 Likes

Okay, let’s take a simple mutex as an example. I acquire the lock, do something with the protected resource, and then do a release-store to switch it back into the unlocked state. I rely on the fact that this release-store actually happens and is not optimized out, so that the other process with access to the same atomic can do an acquire-swap, observe the unlocked state and acquire the lock.
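
Something like this sketch (names made up; the real version would wait on the futex rather than spin):

use std::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;

// `state` lives in memory mapped into both processes
pub fn lock(state: &AtomicU32) {
    // acquire-swap until we observe the unlocked state
    while state.swap(LOCKED, Ordering::Acquire) != UNLOCKED {
        std::hint::spin_loop();
    }
}

pub fn unlock(state: &AtomicU32) {
    // this release-store has to actually happen, or the other process
    // can never observe the unlocked state
    state.store(UNLOCKED, Ordering::Release);
}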

Let’s say that the first program unconditionally exits right after the release-store. The compiler is currently allowed to decide that this release-store is unobservable and remove it. In fact, it does so in this program: Compiler Explorer. Granted, this atomic is not exported, so this optimization is in fact valid, but there’s no documentation about which cases are “protected” from this optimization, so there’s no guarantee that this won’t happen when the change is actually somehow observable.

For example, consider this code: Compiler Explorer. Do you consider this atomic to be externally observable? It could be found by its section name, so I’d say it is, but the access got optimized out anyway.
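
Both examples have roughly this shape (a reconstruction with made-up names, not the exact godbolt code):

use std::sync::atomic::{AtomicU32, Ordering};

// not pub, so internal linkage; with or without the section name,
// the release-store below is removed as “unobservable”
#[link_section = ".shared_lock"]
static LOCK: AtomicU32 = AtomicU32::new(1);

pub fn unlock_and_exit() -> ! {
    LOCK.store(0, Ordering::Release);
    std::process::exit(0)
}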

To summarize,

  1. The compiler sometimes optimizes out atomic accesses, including stores, as “unobservable”;
  2. there’s no documentation about what counts as “observable”, except that volatile accesses certainly do;
  3. without either documented “safe cases” or volatile atomics, there’s no safe way to do an atomic store and be sure that it happened from the perspective of an external observer.

Consider also LLVM documentation on the matter:

Atomic and volatile in the IR are orthogonal; “volatile” is the C/C++ volatile, which ensures that every volatile load and store happens and is performed in the stated order.

LLVM Atomic Instructions and Concurrency Guide — LLVM 18.0.0git documentation (emphasis mine)

When dealing with external code that could observe actions of the current program, be it shared memory, MMIO or linker shenanigans, I want to ensure exactly this: that every load and store actually happens.

1 Like

Well, if you make that static pub then it's not optimized out (https://godbolt.org/z/dEKvbcef1), with or without the link_section.
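
That is, continuing the reconstructed example from upthread (same made-up names):

use std::sync::atomic::{AtomicU32, Ordering};

#[link_section = ".shared_lock"]
pub static LOCK: AtomicU32 = AtomicU32::new(1); // pub: external linkage

pub fn unlock_and_exit() -> ! {
    LOCK.store(0, Ordering::Release); // no longer optimized out
    std::process::exit(0)
}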

But I think this is a good direction: you care about the rules for whether, in LLVM terms, the static is a global or an internal global (https://llvm.org/docs/LangRef.html#linkage-types), and maybe you think link_section should impact that, rather than just whether it's pub.

Could your request here be solved with documentation that "if you want to use a static to communicate between different things, it better be pub so the linker knows they both should use the same thing"? If you don't tell the linker they're the same, like with the non-exported atomic from your first example, I'm not sure that even atomic-volatile would fix it because the linker wouldn't necessarily make them use the same address.

I don't think "add new intrinsics" is a solution to missing documentation, rather you have a request for clarity that you'd like a specification (https://spec.ferrocene.dev/values.html#syntax_staticdeclaration/https://blog.rust-lang.org/inside-rust/2023/11/15/spec-vision.html/...) to guarantee about how rust code and atomics work.

3 Likes

Could your request here be solved with documentation that "if you want to use a static to communicate between different things, it better be pub so the linker knows they both should use the same thing"?

That was only one possible example. There are many ways memory can be externally observed. Documentation could solve this problem if it could also make a guarantee that accesses to, e.g., any pointers that were leaked to FFI in any way (so pointers returned from mmap(), pointers passed to FFI functions, etc.) are not optimized out.

I’m not sure it would be correct to make this guarantee with the current implementation. I don’t think LLVM atomics provide it, as shown in the documentation I linked, which specifically says that the volatile and atomic modifiers are orthogonal. Basically, I think that while useful examples of code using atomics in shared memory do compile correctly today, LLVM is still allowed to miscompile them.

Having looked in on the t-opsem Zulip and the UCG repo, the consensus was that file-mmap and other things which allow cross-process communication must and currently do use FFI/ASM to "virtually spawn" one or more "AM threads" which represent the other processes in the AM (Abstract Machine). This allows for normal atomics to work in the usual manner across processes. Thus, so long as process termination is required to permit these extra threads to read the final values in this shared memory before completion, normal atomics should work correctly, IIUC.
Though of course actual t-opsem members should confirm/deny this opinion as necessary.

2 Likes

That’s one way to solve it. It would nicely provide a universal method to make any atomic IPC-capable, and, in fact, memory from mmap() would be IPC-capable from the start, since I think being returned from a foreign function call counts as a virtual thread spawn?

Some assorted notes:

Generally speaking, Rust considers any item which is not pub to be internal and not exported in any way.

Doing linker tricks is inherently unsafe and underdocumented. But volatile isn't the way to say other code could be looking at a static; pub and #[used] are.

If it's another process, then using volatile is proper.

AIUI, these become relevant only in the face of whole-program optimization. Given it's “smart enough,” optimization would be justified to notice that you only ever read this allocated object (e.g. from mmap), to replace all of your atomic reads with nonatomic reads, and to coalesce time-separated reads. If some other process results in that memory changing, you have UB.

You need the reads to be volatile such that the volatile quality can do the “abstract machine IO” equivalent of the other processes manipulating the visible memory.

Member of T-opsem, but not speaking for the team.

I believe the temperature is roughly that we do want access to volatile atomics, but comparatively speaking it's relatively low priority. The abstract op.sem is straightforward: do both the atomic “thing” and the volatile “thing” to guard the operation.

I believe volatile atomics are also a case where LLVM unordered may be useful semantically, as generally only the “cannot tear” part of atomics is desired, and the synchronization of even monotonic (our Relaxed) isn't necessary.

IIUC, because there's no simple way to restrict AtomicCell<T> to only primitive integer types with processor atomics support as a trait obligation. It could have been some trait Atomic over the relevant types that dispatched to the various intrinsics::atomic_* functions, but then you still have the follow-up question of why AtomicCell<IndexNewtype> can't work, even just with load/store. It's essentially the numeric trait design problem but worse.

AtomicNN was a “working enough” solution. And you can generalize over atomic sizes with associated types in a library, e.g. radium.
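
Roughly this pattern (a sketch of the idea, not radium's actual API):

use std::sync::atomic::{AtomicU32, AtomicU64};

pub trait HasAtomic {
    type Atomic; // the atomic type corresponding to this primitive
}

impl HasAtomic for u32 {
    type Atomic = AtomicU32;
}

impl HasAtomic for u64 {
    type Atomic = AtomicU64;
}

// generic code can then name the atomic as <T as HasAtomic>::Atomic
pub struct Shared<T: HasAtomic> {
    pub inner: T::Atomic,
}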

Also FWIW, &VolatileCell<T> as a library type is fundamentally broken and cannot be correctly papered over. It could be implemented with compiler magic, but that compiler magic is necessary to prevent spurious accesses, similar to the magic applied for UnsafeCell and UnsafePinned (async/!Unpin).

However, you also then have to ask the question of what are the semantics of &(u64, VolatileCell<u64>, u64), or other compound types having a volatile place sitting in their middle. It's not a question that lacks a reasonable answer, but it's much less self evident than just asking what it means for an access to be both volatile and atomic.

(Not said with any authority.) A write on the Abstract Machine is observable if that write could at any point be validly read without causing UB. Given whole program knowledge, any Rust Allocated Objects (i.e. allocated via Box/alloc::alloc::alloc (heap); let, function parameters (stack); or static (global)) are constrained to access within the Abstract Machine's vision unless

  • the memory is sourced from outside the AM (e.g. extern static or an extern fn originating pointer);
  • the static place is visible externally to the AM (e.g. it has a known export name (#[no_mangle]/#[export_name]) or is marked pub and #[used]); or
  • the memory is visible through a pointer which has been passed beyond the AM visibility (e.g. to an extern fn), whose provenance has not been invalidated, and a read through which would be sequenced with the write (i.e. it would not race with the write, which would be UB).
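
As a sketch of that last bullet (made-up names):

use std::sync::atomic::{AtomicU32, Ordering};

extern "C" {
    // unknown code, modeled as spawning “AM threads” that may do
    // any valid AM operation through the pointer
    fn register(p: *const AtomicU32);
}

static FLAG: AtomicU32 = AtomicU32::new(0);

pub fn publish() {
    unsafe { register(&FLAG) };
    // observable: a modeled external thread could do an acquire-load
    // that synchronizes with this store without racing
    FLAG.store(1, Ordering::Release);
}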

There's no one definition of “observable” because we're aiming at an operational specification of the Rust abstract machine (op.sem == operational semantics). This definition just falls out of the definition of external linkage as being unknown code that could possibly do any defined operation to the AM state (i.e. you could define the external operations as some number of threads doing some sequence of valid things that Rust could do). So this isn't exhaustive, and shouldn't be.

volatile then essentially turns *place = Read.volatile(pointer); into extern_arbitrary(); *place = Read(pointer); extern_arbitrary();. ...But it's unfortunately not that simple because the set of things which LLVM permits a volatile access to do (by LangRef) is actually smaller than the set of things which an arbitrary function call can do. Or at least, I think it is; this part is solidly out of my believed understanding. (Critically, w.r.t. atomic synchronization. Atomics are reasonably well studied. volatile is AFAICT still much more vibe based around “don't do this optimization” rather than operational.)

Yes, this is the current working model for any communication “out of” the Rust AM world; any operation done by code “outside” the AM is modeled as native AM threads doing the AM operations corresponding to whatever the external operations are, according to the implementation mapping semantics between the shared semantics (e.g. LLVM-IR for LTO, or x86 for the processor) and the AM semantics.

The one wrinkle is whether any such additional threads are allowed to be running when entering main(). They certainly are after the first extern fn call, as that call can be said to spawn all of the threads necessary to model the outside world.

3 Likes

I feel like that's a "tried to be smart but is buggy" problem, though, because it could only justify that if it was smart enough to know that mmap is an allocation, but somehow not smart enough to know that using MAP_SHARED with mmap means it's a shared allocation. I would say that doing that optimization for a mmap that's not MAP_PRIVATE is just a bug.

3 Likes

@CAD97 Thanks for your answer!

I believe the temperature is roughly that we do want access to volatile atomics, but comparatively speaking it's relatively low priority.

Can I do something to push it forward, or is it waiting on a t-opsem decision? The implementation seems relatively straightforward, although somewhat messy given the number of intrinsics involved.

  • the memory is visible through a pointer which has been passed beyond the AM visibility (e.g. to an extern fn), whose provenance has not been invalidated, and a read through which would be sequenced with the write (i.e. it would not race with the write, which would be UB).

Does a pointer returned from an extern fn count? As mmap() is not in the stdlib, this would make any mmap-ed pointer leaked outside the AM, which is what I want for my use case.

1 Like

I imagined Volatile<T> as not having Deref, but just providing .load()/.store()/atomic methods, which looks trivial to implement. It’s not so ergonomic though.
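
Something in this shape, say (just a sketch: the atomic methods would need the volatile-atomic intrinsics this thread is asking for, and per the caveat upthread a pure library version can’t prevent spurious accesses):

use core::cell::UnsafeCell;

#[repr(transparent)]
pub struct Volatile<T>(UnsafeCell<T>);

impl<T: Copy> Volatile<T> {
    pub fn load(&self) -> T {
        unsafe { core::ptr::read_volatile(self.0.get()) }
    }

    pub fn store(&self, value: T) {
        unsafe { core::ptr::write_volatile(self.0.get(), value) }
    }
}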

The case where you do need volatile atomics (or to give up and use asm) is the mmio case mentioned in a few places here - something like

read_user_memory(&mut packet.memory, user_buffer, length);
let idx = device.ring_buffer.insert(packet);
// (a) the writes above must be visible to the device before this store,
// and (b) this store is a side effect that must not be elided or merged:
device.mmio.send_queue.volatile_store(idx, Release);

where a) the various memory operations on packet/ring_buffer must be visible (to the device) before the send_queue store causes the device to read them (i.e. it needs to be atomic), and b) the store is a side effect and must not be combined with a later store or anything like that which would be a valid atomics optimization (i.e. it needs to be volatile).

A concrete example of an actual network driver that needs at least a) is e1000: the Release fence, the actual write. Given ring buffers, it might well actually be fine combining several writes, but I'd be surprised if there isn't a driver that definitely needs both.

In general code that needs mmio is likely to already need asm and often also be arch-specific though, so I wouldn't expect it to be prioritized.

4 Likes

I don't understand why this implies they must be atomic. It's just not jibing with my understanding of atomics. Sounds more like all you need is volatile and a memory barrier, unless there are multiple threads involved that I'm not seeing.

In the C11/etc atomics model, a memory fence does not do anything unless there are also atomics involved. The compiler is free to compile *foo = bar; fence(Release); ptr.write_volatile(baz); the exact same way it compiles *foo = bar; ptr.write_volatile(baz); fence(Release); (and then the compiler can also reorder the volatile before the normal write, or on weak archs just not emit an asm fence in between).

Meanwhile, for *foo = bar; fence(Release); atomic.store(baz, Relaxed);, it's roughly equivalent to *foo = bar; atomic.store(baz, Release);, and the compiler cannot do any problematic reordering.
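
In code, the second form looks like this (sketch):

use std::sync::atomic::{fence, AtomicU32, Ordering};

pub unsafe fn publish(foo: *mut u32, flag: &AtomicU32) {
    unsafe { *foo = 42 };             // plain write
    fence(Ordering::Release);         // takes effect through the atomic below
    flag.store(1, Ordering::Relaxed); // together ~ flag.store(1, Release)
}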

Okay, I guess that makes sense. But could you not just use volatile in both places if it's just ordering you care about? Using atomics for this just seems heavier than necessary.

Maybe this is one place where Rust could improve the memory model by, for instance, adding a volatile memory fence.

No; first of all, that's insufficient on platforms which require dma_wmb() in the Linux source to actually turn into a real fence instruction. Second, you'd want all the normal memory writes to have the typical optimizations applied to them - they don't need to be volatile, they just need to be finished before the magic MMIO write. That is exactly what atomics were created for in the first place.

You could just use normal stores, then asm for the fence (the compiler can make far fewer assumptions about an asm block than about a C11 fence()), and then a volatile store or asm for the MMIO write; that's the Linux kernel model.
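
Roughly, on x86-64 (a sketch with made-up names):

#[cfg(target_arch = "x86_64")]
pub unsafe fn ring_doorbell(desc: *mut u64, doorbell: *mut u32, val: u64, idx: u32) {
    unsafe {
        core::ptr::write(desc, val);              // normal store, freely optimizable
        core::arch::asm!("sfence");               // real fence, opaque to the compiler
        core::ptr::write_volatile(doorbell, idx); // the mmio write itself
    }
}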

2 Likes