Pre-RFC: core::ptr::simulate_realloc

Oh, interesting. I didn't realize multiply-mapped pages would cause "concurrency shenanigans". I guess that means that a simulate_realloc that works with multiply-mapped pages actually has to contain some sort of fence to unambiguously order all memory accesses. That sounds like it is hardware-specific. So simulate_realloc could be used as an ingredient in a library that uses mmap like this, but it doesn't suffice on its own... that would need careful documentation.

I don't see how we could offer this as two different operations though. It's just that sometimes, the operation alone is not enough.

It does invalidate the old allocation, in the sense that it becomes UB to access. In the LLVM memory model, an allocation can only exist at a single place.

As far as I can tell, the answer is that people just don't care about this being an impossible operation in the memory model (and therefore UB). This is a typical example of a feature that was designed and shipped without a language-based perspective in mind, and problematic optimizations are rare enough that they didn't cause enough problems (and if they do cause problems, they were probably blamed on the compiler, rather than leading to the realization that this is a misuse of the language).

4 Likes

Really simply: we offer two functions:

  1. unsafe fn change_tag_bits, where it's an error (or maybe UB) to touch bits not masked out as part of VA tagging by this platform. This covers the Arm TBI + MTE, Intel LAM, AMD UAI cases. On Arm TBI and AMD UAI, change_tag_bits can error if you change anything other than the top 8 bits; on Intel LAM57, it's an error to change the top bit (sign bit), or anything after the 7th bit. And on Intel LAM48, it's an error to change the top bit or anything after the 16th bit.
  2. unsafe fn move_pointer_target, where it's not an error to touch bits not masked out as part of VA tagging, but you are on the hook for the extra fences etc needed to keep the AM and real machine views of memory in sync.
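For concreteness, here is a hypothetical sketch of the two signatures (the names are from the post above, but the bodies are mine: I'm using the strict-provenance `with_addr` to stand in for the real operation, and eliding the per-platform tag-bit validation entirely):

```rust
/// Hypothetical sketch: move the allocation's address in the abstract
/// machine. Any bits may change; the caller supplies whatever fences the
/// target needs to keep the AM and hardware views of memory in sync.
unsafe fn move_pointer_target<T>(ptr: *mut T, new_addr: usize) -> *mut T {
    // Strict-provenance address swap: keep `ptr`'s provenance, adopt
    // the new address.
    ptr.with_addr(new_addr)
}

/// Hypothetical sketch: like `move_pointer_target`, but only bits covered
/// by the platform's VA-tagging scheme (TBI/MTE, LAM, UAI) may differ.
/// A real implementation would validate `new_addr` against the platform's
/// tag mask before delegating; that check is elided here.
unsafe fn change_tag_bits<T>(ptr: *mut T, new_addr: usize) -> *mut T {
    // As noted below, implementing this as a call to move_pointer_target
    // is permissible, since it only gives up an optimization opportunity.
    move_pointer_target(ptr, new_addr)
}
```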

They take the same parameters, and have (nearly) the same effect on the abstract machine; indeed, it's permissible to implement change_tag_bits as a call to move_pointer_target, since the only benefit of calling change_tag_bits is that the implementation can assume that the output VA is the same as the input VA from the perspective of the hardware, whereas move_pointer_target makes no such promise.

Then, as a reviewer of code that uses these deeply unsafe low-level functions, I know that a call to change_tag_bits should only be touching tag bits, and if it's not, we've got deep problems here, while a call to move_pointer_target needs to be associated with some set of fences else we've got problems.

tbh if your target needs extra fences, I think move_pointer_target should contain those fences, since you commonly don't want to research which fences you need on every target whenever you use it in otherwise largely portable code (e.g. anything targeted with an mmap call). Maybe we can also have a move_pointer_target_unfenced for when you know you can omit the fence (not necessarily a replacement for change_tag_bits).

That's not the right thing for this low-level interface, because the fences don't necessarily go with move_pointer_target; for example, if you're doing the ringbuffer mapping trick, you need a Release ordering with writes to the ringbuffer, and an Acquire with reads from the ringbuffer. If you're making a thread-safe ringbuffer, you can put these ordering requirements with the reads and writes, and get them "for free" in the single-threaded case.

You can, of course, just dump an AcqRel fence into move_pointer_target, but then you're paying the cost of premature fencing if you're writing a thread-safe ringbuffer anyway, and even in the single-threaded case, you're possibly overdoing the fencing on platforms like x86, where no fence is needed, but an AcqRel fence is not free.
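The point about putting the orderings on the reads and writes themselves can be made concrete with a minimal SPSC ringbuffer sketch (hypothetical code, no mmap mirroring involved): the Release/Acquire pair sits on the index operations, so nothing extra is needed inside a pointer-moving primitive, and the single-threaded case gets the ordering for free.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAP: usize = 8;

/// Minimal SPSC ringbuffer: the Release/Acquire pair lives on the index
/// operations, so the element accesses can be plain reads and writes.
struct Spsc {
    buf: [UnsafeCell<u8>; CAP],
    head: AtomicUsize, // written only by the producer
    tail: AtomicUsize, // written only by the consumer
}

// Sound for exactly one producer thread and one consumer thread.
unsafe impl Sync for Spsc {}

impl Spsc {
    fn new() -> Self {
        Spsc {
            buf: std::array::from_fn(|_| UnsafeCell::new(0)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    fn push(&self, v: u8) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        if head - self.tail.load(Ordering::Acquire) == CAP {
            return false; // full
        }
        unsafe { *self.buf[head % CAP].get() = v; } // plain write to the slot
        self.head.store(head + 1, Ordering::Release); // publishes the slot write
        true
    }

    fn pop(&self) -> Option<u8> {
        let tail = self.tail.load(Ordering::Relaxed);
        if self.head.load(Ordering::Acquire) == tail {
            return None; // empty; the Acquire orders the slot read below
        }
        let v = unsafe { *self.buf[tail % CAP].get() }; // plain read
        self.tail.store(tail + 1, Ordering::Release); // frees the slot
        Some(v)
    }
}
```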

1 Like

I meant the fences needed for single-threaded consistency. E.g. if I do a relaxed atomic increment (or some other relaxed atomic op) on a thread, then move_pointer_target, then another relaxed atomic op on the same thread, I expect the second op to always see the changed value from the first, and not have virtual-address vs. physical-address shenanigans cause the atomic ops to not see each other, as you described earlier.
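That expectation does hold in Rust's memory model: later operations on the same atomic in program order must observe earlier ones (coherence), even at Relaxed. A trivial illustration, with no pointer-moving involved:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// A later relaxed op on the same thread must observe an earlier relaxed
/// op on the same atomic (program-order coherence), so this always
/// returns 1.
fn relaxed_coherence_demo() -> usize {
    let a = AtomicUsize::new(0);
    a.fetch_add(1, Ordering::Relaxed);
    a.load(Ordering::Relaxed)
}
```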

imo, it should also contain whatever fences are necessary for two relaxed atomics on different threads to behave as if you did the relaxed atomics to the same virtual address; acquire/release fencing should still only be necessary on the atomic operations themselves, if they need other memory with different physical addresses to be synchronized by the atomic ops (e.g. to ensure a pointer sent by the atomic ops can be dereferenced).

Atomic ops should be using instructions that are guaranteed to be atomic across threads anyway, and therefore won't see this, because those instructions are guaranteed to check store queues by physical address, not virtual (even though it's more expensive for the hardware to do this).

The problem comes in when you're not using atomics for everything; if you're using ptr::read and ptr::write to access ringbuffer elements (because hey, single-threaded, why do you need atomics?), the mmap trick needs fences where TBI does not, because the instructions used by ptr::read are permitted to only check store queues by virtual address, and you've got two VAs that, from the CPU's perspective, are unrelated, but which share the same PA underneath.
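As a sketch of that pattern (hedged: a real demonstration needs two mmap'd VAs over one physical page, which a self-contained snippet can't set up, so `alias` here is simply a second pointer, and the SeqCst fence is a portable stand-in for whatever target-specific barrier the mmap trick would actually need):

```rust
use std::sync::atomic::{fence, Ordering};

/// `alias` stands in for a second virtual mapping of the memory behind
/// `p` (in the real mmap trick these would be distinct VAs sharing a PA).
unsafe fn write_then_read_via_alias(p: *mut u8, alias: *mut u8, v: u8) -> u8 {
    p.write(v); // plain, non-atomic write through the first mapping
    // With TBI (same VA, different tag bits) nothing is needed here; with
    // two genuinely different VAs, a barrier is needed so the read below
    // is guaranteed to observe the write.
    fence(Ordering::SeqCst);
    alias.read() // plain, non-atomic read through the second mapping
}
```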

And that's why I don't want the fences inside move_pointer_target; you only need a fence in there at all if you're not accessing in a thread-safe way (no atomics involved, or wrong orderings if you're mixing atomic and non-atomic operations), and you expect (erroneously) that you can get away with ordinary reads and writes.

if those are a problem, then relaxed atomic load and store ops are almost certainly a problem too since they compile to the exact same instructions on almost all cpus. hence why I was saying that it should have the fences to make relaxed atomics still work on a single thread, because they do work on a single thread in Rust's semantics.

On the platforms I'm aware of where the difference matters, they must compile to different instructions, otherwise there is no atomicity between threads, either. If the compiler isn't doing that, then it's already buggy.

What platform(s) are those?

  • On x86 everything is somewhere between acqrel and seqcst for loads/stores, so here no barriers at all will be needed I guess?
  • I believe ARM (32 and 64) defaulted to at least consume ordering, so nothing special is needed here either?

I'm not really familiar with other architectures, but I thought pretty much everything was at least relaxed by default (for loads and stores)? Maybe Alpha or IA64 are not (since they are famously weak in their memory models)?

I once found a website listing the assembly lowering for atomics for many architectures, but I can't find the link any more. I think it was hosted on some university website or other. That would have been really useful about now.

(There could be special cases that are weaker, such as non-temporal stores, but I don't believe LLVM will normally generate those.)

To be clear, the problematic case isn't both places being used from the AM at the same time, correct? Since that would just be incompatible with the AM model, which relocated the allocated object from one address to the other. Rather, the problem would be write(p, …); let q = assume_realloc(p, …); read(q) where aliasing virtual memory may result in a cache incoherent result, but hardware aliased memory is guaranteed to remain properly cache coherent.

This feels weirdly like nontemporal stores in that it kind of fundamentally breaks the conventions by which the AM utilizes the concrete machine. We can't just put the AM in an invalid state and have the user fix it up after the fact. I don't see how the potentially cache incoherent version can be exposed to the AM without embedding the necessary synchronization in the intrinsic used to relocate the allocated object[1].

But just as the nonoverlapping pointer copy is the less nice API to use, if we expose both temporally strong and weak assume_realloc, it may be a good idea to give the more general ("safer") version the nicer name, to nudge people away from using the more constrained version without considering whether the added constraint is satisfied.


  1. MAYBE: Allocate a new object with uninit bytes. Start a thread to write the bytes of the old object to the new object non-atomically. Deallocate the old object. Require the user code to synchronize with the spawned thread's write in the appropriate platform specific manner before it can read the new object. ↩︎

1 Like

I'm under an NDA that prevents me from identifying the embedded platform in question; I would note that consume ordering only constrains loads, and the vendor's argument is about stores being allowed to do weird things per the architecture definition (the core of their argument is that the architecture does not ban them from splitting a single 32-bit write into two non-atomic 16-bit writes, which can thus tear if observed from a different core). I no longer work with that vendor, but I'm still trying to convince them to stop having this argument with their customers under NDA, and to start having it in the open, ideally in the form of a discussion with LLVM and GCC maintainers.

x86 is not affected here, since all stores are at least Release, which makes the ordering happen the way we need it to; RISC-V and ARMv7 onwards are unaffected, too, since they require aligned stores to happen atomically.

The problematic case is exactly the write(p, …); let q = assume_realloc(p, …); read(q); case, where from the hardware's perspective, p != q. Because they're different addresses, consume ordering doesn't come into play, and because the store buffer is indexed by virtual address, the load from address q is not guaranteed to observe the store through address p in a single thread. Obviously, going cross-thread has to solve this with proper ordering (e.g. Acquire and Release pairings), which also solve it for the single-thread case.

But that's why I'd like the mmap operation and the TBI operation to have different names, even if they share the same code; the mmap operation brings in the same hazards as switching thread, where the TBI one does not.

not necessarily; it is perfectly legal to transfer data cross-thread using only relaxed atomics, and you don't always need acquire/release. Afaik, for all modern cpus (so excluding IA64, maybe Alpha, and apparently whatever weird cpu you're stuck with), compilers compile relaxed atomic load/store to just a plain load/store: no special atomic instructions needed, and no fences needed. e.g.:

use std::num::NonZeroUsize;
use std::sync::atomic::{AtomicUsize, Ordering};

static A: AtomicUsize = AtomicUsize::new(0);
static B: AtomicUsize = AtomicUsize::new(0);
/// returns `v + 1` assuming `thread_1` and `thread_2` are only run once and on different threads.
pub fn thread_1(v: NonZeroUsize) -> NonZeroUsize {
    A.store(v.get(), Ordering::Relaxed);
    loop {
        if let Some(retval) = NonZeroUsize::new(B.load(Ordering::Relaxed)) {
            return retval
        }
    }
}

pub fn thread_2() {
    loop {
        if let Some(v) = NonZeroUsize::new(A.load(Ordering::Relaxed)) {
            B.store(NonZeroUsize::new(v.get() + 1).unwrap(), Ordering::Relaxed);
            break;
        }
    }
}

That's a big part of why I want simulate_realloc to contain whatever fences are required to make relaxed atomics and/or single-threaded load/store still work correctly.

1 Like

This may be a bit controversial, but I don't believe niche platforms under NDA matter to open source projects at all. If the vendor wants to be relevant and wants people to consider the quirks of their platform, they have to play ball.

Otherwise they will have to maintain their own compilers, runtimes, etc.

Since as far as I know, all platforms that actually matter have at least free relaxed ordering for all memory operations (up to the native pointer size), I believe that we shouldn't make Rust unnecessarily complicated for everyone else, just to make it slightly more efficient on a platform that Rust doesn't even support. In fact, it is better to put pressure on such vendors by showing them that if they don't cooperate, they get left behind.

As far as I understand, on all currently supported targets there will be no barriers needed. And there are no publicly documented targets where barriers would be needed.

(Rust cannot support a target that needs NDA: ""onerous" here is an intentionally subjective term. At a minimum, "onerous" legal/licensing terms include but are not limited to: non-disclosure requirements, [...]" , see Target Tier Policy - The rustc book)

4 Likes

I think it'd be useful to have those in principle, though this seems to me to be higher-level than the simulate_realloc I'm proposing. Since the underlying operation is the same, I think it would be much cleaner to build both change_tag_bits and move_pointer_target, with their respective error checking and potential fencing, on top of the very basic simulate_realloc, rather than have the actual primitive function contain a big cfg_if block for every tagging scheme under the sun. There's nothing stopping us from making move_pointer_target just be a call to simulate_realloc with its own documentation, no? Much like the Strict Provenance functions like with_addr are built on top of wrapping_offset.

Same with change_tag_bits. I originally started out with something similar but people rightly pointed out that making the underlying operation possible is a very different thing from adding a generic pointer tagging interface, hence we should first focus on the former.

1 Like

On the platform I'm aware of, relaxed atomics would need some form of CPU fence to work cross-thread, anyway, and would contain that fence.

I'm literally just arguing that ptr::write(p, value); let q = simulate_realloc(p, …); ptr::read(q) and equivalent non-atomics should not be guaranteed to read value if you're doing mmap games, because that's not architecturally guaranteed by ARM-v8a to begin with, whereas it is guaranteed if you use TBI.

Plus, if you're doing mmap games and using Release and Acquire orderings correctly, fencing in simulate_realloc is a pure performance cost - if you're doing mmap games (but not TBI), then effectively writes to the object before simulate_realloc occur in a different AM thread to writes after simulate_realloc.

I disagree; I think we should ensure that the door is open for niche vendors to get their heads out of their backsides and get involved, especially since we're talking about making guarantees over and above those provided by the architecture, and at a potential performance cost.

We are not talking about something that is safe to do on existing cores, after all - we're talking about imposing a cost on publicly available Cortex cores (a forced fence) that isn't needed in all cases, and whose necessity can be argued purely on the basis of ARM-v8a itself. The fact that niche platforms exist that need this fence more than ARM-v8a does is just icing, and you can ignore that platform.

1 Like

Hm, according to C/C++11 mappings to processors, relaxed loads/stores lower to just plain LDR/STR on both ARMv7 and ARMv8? That site has been a good resource in the past, but are you saying it is wrong, then?

But when I go and check godbolt I also get that result. So then that would be an LLVM miscompilation? I suggest you file a bug if you are convinced that is the case. However, I feel like people would have noticed by now if LLVM miscompiled something as simple as that.

That's not true. ARMv6 and earlier did impose restrictions on aliasing memory mappings to allow processor implementations to use virtually-indexed and/or virtually-tagged caches. However, in ARMv7 and later, including ARMv8-A, caches are required to act as if physically-indexed and physically-tagged, and so there are no restrictions. In other words, to quote from the latest architecture reference manual, "[t]he behavior of accesses from the same observer to different VAs, that are translated to the same PA with the same memory attributes, is fully coherent." (Mismatching memory attributes is generally not something you do on purpose.)

Speaking more broadly than ARM, it should be safe to assume that any processor that runs a fully-fledged OS is coherent with respect to aliasing mappings, but that may not be true for random embedded stuff.

That doesn't mean we don't care about random embedded stuff. We should. But it may be justified to have target-dependent guarantees about aliasing mappings. After all, the behavior of memory mapping is already fundamentally target-dependent: different targets have different page sizes, and some don't have MMUs at all.

On the other hand, I'm not sure whether we can safely make guarantees about aliasing mappings on any platform. What if the optimizer adds an assumption of the form "if two pointers compare non-equal, then accesses to them must be non-aliasing"? AFAIK, LLVM does not make any such assumption today, but are we sure it never will?

In any case, on platforms that do need a barrier for aliasing mappings, I don't think atomics should be the right way to express that barrier. Atomicity and mapping coherence are totally different properties. It sounds like your platform requires relaxed accesses to use a lock(?) or some other synchronization, which would coincidentally provide a strong enough barrier for accesses to aliasing mappings. But I'm sure there are other platforms where relaxed accesses don't need any special synchronization, yet accesses to aliasing mappings still need a barrier.

2 Likes

Maybe a silly question, but could you also remove the tag before accessing the memory in this case? And then have an LLVM optimization recognize that as unnecessary under the target feature in question, so it doesn't actually get emitted?

Then it'd Just Work™, but work better on some platforms than others.
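A sketch of what that could look like, assuming ARM TBI's top-byte tag (bits 63:56) and a 64-bit target; the mask and helper are hypothetical, not part of any proposed API:

```rust
/// Hypothetical: ARM TBI keeps the tag in the top byte (bits 63:56).
/// This constant assumes a 64-bit target.
const TAG_MASK: usize = 0xFF << 56;

/// Strip the tag before every access; on a TBI-enabled target an
/// optimizer could in principle recognize the mask as a no-op and drop it.
fn untagged<T>(p: *const T) -> *const T {
    p.with_addr(p.addr() & !TAG_MASK)
}
```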

2 Likes

My understanding of the ARMv8a TRM is that it's correct for inter-thread atomics, but does not guarantee ordering within a thread if you play mmap games; instead, the write is atomic, but you are not guaranteed that a read will see the result of a previous write in program order through a different mapping.

In the single-threaded case, however, you normally get an extra guarantee; if I write to location FOO, then I read location FOO, the processor will ensure that the read can observe the write (and the only way it will read a value other than the one written is if a different agent wrote to the location between the write and the read).

That is, because you have virtual aliases, it's permissible for a processor to delay the write until a barrier operation, since the read is reading a different virtual alias. You still get atomicity (the write will not tear on ARMv8a), but you can observe a stale value, because there's neither a barrier operation nor a consume order dependency between the write and the later read.

This is fine if you don't play mmap games, or if the bits you change are TBI bits, because the processor must guarantee that a read from a virtual address is able to observe all previous writes to that virtual address; but if you have virtual aliases to the same physical address, the core's permission to delay writes means that a write to FOO followed by a read from FOO_ALT in program order does not create a dependency between the read and the write (where a read from FOO would), and if the processor has not had a reason to start the write, it doesn't have any reason to know that FOO and FOO_ALT are both the same virtual address.

In a practical processor, like a Cortex-A53 or similar, this is a short window; it's the gap between the write "completing" in program order, and the TLB and associated structures coming up with the physical address needed to put the write into cache (since ARMv8 caches are always physically tagged). But it's there, and it needs to be accounted for if you're doing mmap games.

Effectively, this means that when you use simulate_realloc, you need to know if you're changing TBI bits, or just swapping between two virtual aliases of the same physical memory; in the former case, nothing is needed, because you're not changing virtual address; in the latter case, you need to treat writes from this thread to the object before simulate_realloc as-if they were writes by a different thread.

It sounds to me like this would break IPC, when two processes share memory. The processes have different mappings of the same page.

For atomics you need to look at the physical address for IPC to work. And since relaxed atomics compile to just plain loads and stores (on everything except IA64 and possibly Alpha) that means you need to use physical addresses always.

A lot of software would be broken if relaxed atomics didn't work in shared memory. I'm also fairly certain that the Linux kernel depends on consume ordering in the VDSO (for fast gettimeofday and clock_gettime), which I believe is also a normal load on ARM. (The kernel uses seqlocks for the time.) The VDSO is of course mapped at different addresses in different processes due to ASLR.

On a no-OS embedded system I could see you getting away with this, but then you normally don't have virtual memory to begin with (you might have an MPU, but generally no MMU).