Pre-RFC: core::ptr::simulate_realloc

That argument basically boils down to "it works because current compilers aren't smart enough", which doesn't say anything about future compilers... That said, I think we could probably get compilers to declare that atomics are enough (relaxed suffices; no acq/rel/seqcst is necessary unless you want to synchronize memory outside of the atomic load/store itself) -- thereby making it no longer UB if you use atomics.

Though, saying atomics are enough only works on platforms that have threads at all; some platforms don't, and there all atomic ops get converted to non-atomic ops.

No, my argument is that on any platform that has atomics, there is nothing the compiler can do to generate incorrect code here that wouldn't also break other use cases that have to be supported for atomics.

And the compiler can't know that the process doesn't share memory with another process, even from the very start. In fact it does on Linux, at least: the VDSO is mapped into every process and used to make some syscalls ultra cheap.

For example, to get the current time the process can read a seqlock-protected value in the VDSO instead of doing a syscall. So even from the first line of user code, no compiler can assume there isn't shared memory with atomics somewhere. As long as the memory wasn't created by means known to the AM, the compiler has to assume the worst.

On Linux that read is done by calling a function in the VDSO, but nothing forbids an alternative OS where you are supposed to read some value from the VDSO yourself.
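For the curious, the read side of such a seqlock looks roughly like the sketch below. This is only an illustration with hypothetical field names, not the real vDSO layout: the reader retries whenever the sequence counter is odd (writer active) or changed during the read.

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicU64, Ordering};

// Hypothetical layout; the real vDSO time data has more fields.
struct VdsoClock {
    seq: AtomicU32,   // odd while the kernel is mid-update
    nanos: AtomicU64, // stand-in for the actual timestamp data
}

fn read_clock(c: &VdsoClock) -> u64 {
    loop {
        let s1 = c.seq.load(Ordering::Acquire);
        if s1 & 1 == 1 {
            continue; // the kernel (the writer) is mid-update
        }
        let value = c.nanos.load(Ordering::Relaxed);
        fence(Ordering::Acquire); // keep the data read before the re-check
        if c.seq.load(Ordering::Relaxed) == s1 {
            return value; // no update raced with our read
        }
    }
}

fn main() {
    let c = VdsoClock { seq: AtomicU32::new(0), nanos: AtomicU64::new(123) };
    println!("now = {}", read_clock(&c));
}
```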

Platforms without atomics are an interesting case. I'm not aware of any platform with virtual memory but without atomics. But platforms lacking both, or lacking just virtual memory, certainly exist. There are even some multi-core platforms lacking both[1].


  1. The popular RP2040 has atomic load and store, but no CAS, LL/SC, etc. It has a finite number of fixed hardware mutexes and queues instead. As far as I remember, the CPU is in-order, scalar, and has no cache, so the difference between relaxed and the stronger orderings isn't actually relevant. ↩︎

Wasm is likely to be one such platform, since there are wasm variants without threads (wasm originally didn't have threads) and there are proposals for adding mmap-like functionality. On platforms without threads, LLVM intentionally converts all atomic ops to non-atomic ops.

Except that the VDSO is read-only from the process's viewpoint and doesn't act like memory with multiple aliases in one process. It instead acts as if the kernel had started some extra threads that just read/write it at the same addresses as every other thread. Being shared across all processes is just an invisible optimization, kind of like how .rodata can be shared across processes.

The VDSO does indeed act like read-only but changing memory, which you need to do atomic reads from. So you are in fact on the read side of shared memory.

Which is quite different from the .rodata case, where the contents don't change over time.

In that case we need volatile atomics for any sane shared-memory behaviour. This is not a thing that Rust has. I don't know what the holdup is for adding those.

Like with /proc/self/mem or unique ownership of file descriptors, this would thus likely fall under yet another case of a property which isn't strictly maintained as true by the target physical machine, but which everyone agrees to handle carefully to maintain the useful illusion of the stronger property.

On a target where relaxed atomic writes require no additional synchronization, it would be valid to model nonatomic writes done from outside the process as being exposed to the AM as a sequence of relaxed atomic writes. This reasoning from assembly-level equivalence can't be done without the context barrier, though, as it's not just the operations themselves that matter but also the assumptions which can be derived from the AM semantics.

Attempting to strictly formalize any such scheme will always involve a difficult mixing of abstraction layers, and rely on implementation details of how the AM is lowered to the target machine. But in the limit it just becomes a game of defining things in a useful way -- and, where possible, of scoping down to specific shared-memory techniques instead of all shared memory.

For reference: IRLO thread for volatile atomics

The last state was that I was able to make my case that volatile atomics are a concept that makes sense, but I didn't have the energy to push it further. I would be happy to see this revived and am willing to help with the implementation.

Based on my recollection, T-opsem agrees that volatile atomics are a thing that is needed. However, there's also a question as to whether volatile by itself should have some level of (relaxed) atomicity for "sufficiently small" writes that the target guarantees will not tear.

But there should be no blockers for volatile-atomic intrinsics other than just the implementation work.

Naively adding intrinsics for volatile atomics would require adding a huge number of intrinsics, given how they're organized. Each combination of operation * ordering (* second ordering, if any) currently has a separate intrinsic, leading to a combinatorial explosion. There are 101 atomic intrinsics currently, and adding volatile_ versions would double that. Some kind of redesign is probably needed there.
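To make the shape of the problem concrete, here is a rough sketch of the parameter-passing style such a redesign could take. The type and method are purely illustrative, not real std or intrinsic API:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical type: the ordering (and, in a real design, volatility)
// becomes a parameter instead of part of the intrinsic name, so each
// operation needs one entry point rather than one per combination.
struct VolatileAtomicU32(AtomicU32);

impl VolatileAtomicU32 {
    fn load(&self, order: Ordering) -> u32 {
        // A real implementation would lower to a volatile atomic load;
        // this stub just forwards to the plain atomic.
        self.0.load(order)
    }
}

fn main() {
    let v = VolatileAtomicU32(AtomicU32::new(42));
    assert_eq!(v.load(Ordering::Relaxed), 42);
}
```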

It's not about what the program can tell apart, it's about what can be described in the model. The model can account for "outside actors" acting on Rust-accessible memory, as long as those outside actors only perform actions that Rust code could also perform. That's how we rationalize atomic accesses being done from another process -- in essence, that process virtually exists as a non-Rust thread inside the Rust AM, doing actions that Rust could also do, and since optimizations have to be correct with arbitrary unknown "surrounding code", that suffices.

I was about to say, maybe there's a way here to model multiply-mapped pages (in different allocations) as pages that are kept in sync by some background thread. We'd have to carefully think this through to ensure it fully works out; Rust has no operation that says "notify me when an atomic write happens in this memory region", so the background thread has to do the sync without knowing which of the two allocations is being written to when. It is possible that the mere fact that a background thread could exist that "guesses" this correctly suffices, but this is getting into the territory of angelic non-determinism which, in general, is incompatible with some basic reordering optimizations -- so it's far from obvious that this will "just work".
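As a toy rendering of that model (not a proposal, and deliberately eliding the "guessing" problem), the two mappings become two atomics and the background thread copies one into the other:

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let alias_a = Arc::new(AtomicU8::new(0)); // "mapping 1" of the byte
    let alias_b = Arc::new(AtomicU8::new(0)); // "mapping 2" of the same byte

    let (a, b) = (Arc::clone(&alias_a), Arc::clone(&alias_b));
    // The fictional sync thread: here it copies a -> b, i.e. it has
    // "guessed" that writes go through mapping 1. A real model has no way
    // to know which alias a write targeted, which is where the angelic
    // choice sneaks in.
    let sync = thread::spawn(move || {
        for _ in 0..1_000 {
            b.store(a.load(Ordering::Relaxed), Ordering::Relaxed);
        }
    });

    alias_a.store(7, Ordering::Relaxed); // program writes via mapping 1
    sync.join().unwrap();
    println!("mapping 2 sees {}", alias_b.load(Ordering::Relaxed));
}
```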

I don't see how this would constrain the ordering more than atomics already do. If you have two relaxed accesses to different atomics they can obviously be reordered. If you use acquire/release semantics the reordering isn't possible (in specific directions). So it should be exactly equivalent to communicating with a different thread? And the fact that the fictional thread happens to get lucky really shouldn't affect anything.

If we had volatile atomics, that would however be the better option. It's a much easier case to reason about, for a start, especially when you have multiple threads involved.

This is about a very different kind of reordering than what atomics are concerned with. Angelic choice does not reorder around demonic choice. Currently it is easy to reorder malloc (which is demonically non-deterministic) around any other operation, but once we allow any part of the system to use angelic choice, that is no longer the case.

This is quite tricky to see and I'd have to give a lecture to explain it, so I'm afraid I won't spell it all out right now -- sorry. T-opsem is deep enough in the woods with me that they will know what I am talking about; I don't know if we have a nice explanation of this somewhere though. Sadly these discussions at some point tend to require pretty advanced PL theory to avoid accidentally constructing something unsound. :confused: (I guess it's not too surprising, discussions about bridge design also require advanced understanding of physics and materials at some point... there's a reason people spend years studying this kind of stuff.)

It suffices to say that it would be a lot better to find a model that does not rely on angelic choice.

I don't see how that helps here. Or do you mean using volatile atomics to access these kinds of "duplicated" pages? Yes, that would be trivially correct, but it also inhibits a bunch of optimizations that are correct on regular atomics.

For a normal SPSC ring buffer the implementation would be to have two atomics (read and write head pointers/indices). To write to the buffer you would:

  • Load both indices atomically (acquire)
  • Check that there is enough space
  • memcpy your data starting at the write head
  • Atomically update the write head pointer (release)

Similar but reversed for the read side. (As an optimisation, you don't need an acquire load for the index that only you yourself write.) A sketch of this write path follows below.
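The sketch assumes a power-of-two capacity, byte payloads, and monotonically increasing indices that are masked to get a buffer position; all names are illustrative:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct SpscRing {
    buf: Box<[UnsafeCell<u8>]>, // capacity must be a power of two
    read: AtomicUsize,          // advanced only by the consumer
    write: AtomicUsize,         // advanced only by the producer
}

// Safety: producer and consumer touch disjoint slots, guarded by the
// acquire/release index handshake below.
unsafe impl Sync for SpscRing {}

impl SpscRing {
    /// Producer-only. Returns false if there is not enough space.
    pub fn try_write(&self, data: &[u8]) -> bool {
        let cap = self.buf.len();
        // 1. Load both indices: acquire on the consumer-owned index pairs
        //    with its release store; our own index may be loaded relaxed
        //    (the optimisation mentioned above).
        let read = self.read.load(Ordering::Acquire);
        let write = self.write.load(Ordering::Relaxed);

        // 2. Check that there is enough space.
        if cap - write.wrapping_sub(read) < data.len() {
            return false;
        }

        // 3. Copy the data starting at the write head; with a single
        //    mapping the copy has to handle wrap-around itself.
        for (i, &byte) in data.iter().enumerate() {
            let slot = &self.buf[write.wrapping_add(i) & (cap - 1)];
            unsafe { *slot.get() = byte };
        }

        // 4. Publish the new write head (release pairs with the
        //    consumer's acquire load).
        self.write.store(write.wrapping_add(data.len()), Ordering::Release);
        true
    }
}

fn main() {
    let ring = SpscRing {
        buf: (0..8).map(|_| UnsafeCell::new(0u8)).collect(),
        read: AtomicUsize::new(0),
        write: AtomicUsize::new(0),
    };
    assert!(ring.try_write(b"hi"));
}
```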

Let's think about how this changes when the memory is mapped twice. On a machine level it really doesn't change, of course (except that you no longer need logic to split your memcpy if it happens to wrap around; the MMU handles it). You can instead do the wrap-around handling once, at the beginning or end, by offsetting your pointer by minus the size of the ring buffer.
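With such a double mapping the producer's copy becomes a single unconditional memcpy. A fragment, assuming `base` points at a region where the same physical pages are mapped twice back-to-back (setting up that mapping is OS-specific and elided here):

```rust
// Sketch: addresses cap..2*cap alias addresses 0..cap, with `cap` a power
// of two, so the copy may run past `cap` without any splitting.
unsafe fn write_at(base: *mut u8, cap: usize, write: usize, data: &[u8]) {
    // `write & (cap - 1)` may be near the end of the first mapping, but
    // the bytes past it land in the second mapping, i.e. at the start of
    // the buffer.
    let dst = base.add(write & (cap - 1));
    core::ptr::copy_nonoverlapping(data.as_ptr(), dst, data.len());
}
```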

On an AM level, is there actually anything that could break? Okay, it is UB because it isn't in the model. But it seems to me that it is UB because we haven't thought through what could break, as opposed to there actually being a concrete example of how it would break.

That said, you could construct a pathological program that writes one address and then immediately tries to read the clone of it. If this is done in a loop there is a good chance it will be optimised in a way that breaks it. But if you did that same operation with atomics (not relaxed, obviously)? Another thread could be writing, so I don't see how it could break.
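The pathological program could look something like this (hypothetical shape: `a` and `b` are the same physical byte through two mappings, but the compiler sees two pointers it may assume don't alias):

```rust
// With plain loads the compiler may cache `*b` across iterations, since as
// far as it knows `a` and `b` don't alias; with atomic loads it must allow
// for another thread having stored to `b` in the meantime.
unsafe fn pathological(a: *mut u8, b: *const u8) {
    a.write(1); // write through the first mapping
    while b.read() != 1 {
        // read through the second mapping: may never observe the write
        // after optimization
        std::hint::spin_loop();
    }
}
```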

What if every atomic operation OP in the code were converted to OP', defined as:

  • Perform OP in the current context.
  • Spawn a thread which performs OP on the corresponding address in the mirroring region (perhaps once per mapping?).

Would that work?

It is still theoretically possible for that to break with atomics.

Suppose you do an atomic store to A followed immediately by an atomic load from B, where the compiler thinks A and B can't alias. I believe the compiler could legally swap the order of those two accesses if the memory ordering is anything short of seq_cst.

In theory another thread could observe the store to A and then modify B -- but that's not guaranteed to happen fast enough. And if it's not guaranteed to happen, then the compiler is allowed to optimize the program into a form where it never happens.
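For concreteness, the shape of that example (names are hypothetical):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

static A: AtomicU32 = AtomicU32::new(0);
static B: AtomicU32 = AtomicU32::new(0);

// A release store followed by an acquire load of a (supposedly) unrelated
// location: nothing short of SeqCst on both accesses forbids hoisting the
// load above the store.
fn store_then_load() -> u32 {
    A.store(1, Ordering::Release);
    B.load(Ordering::Acquire)
}

fn main() {
    println!("B = {}", store_then_load());
}
```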

In practice, LLVM basically doesn't optimize atomics at all.

Relevant: N4455 No Sane Compiler Would Optimize Atomics

Abstract

False.

Compilers do optimize atomics, memory accesses around atomics, and utilize architecture-specific knowledge. This paper illustrates a few such optimizations, and discusses their implications.

The discussion is definitely useful in its own right, though as someone pointed out in one of the replies, it's probably more relevant for a library making use of the proposed helper than for the proposed helper itself.

Unless there are other things to address in terms of the actual proposal, I'm thinking of posting the updated version on GitHub this week. Though I still need to settle on a specific name; based on the discussion I quite like assume_moved. Do we have any arguments against it?

I do believe it is useful to discuss the potential use cases for a proposed feature to make sure they actually work in practice. Otherwise we risk having to go back to the drawing board and redo the feature.

That said, it seems at this point that what we might need is not a redesign of assumed_bikeshed_move, but possibly extra things in addition to it.