Pre-RFC: core::ptr::simulate_realloc

Indeed you could - that LLVM optimisation would be useful in its own right and could coexist with this - but it's only good for one-off use cases. If you have many TBI pointers, having to mask them out on every access becomes very verbose and unergonomic, because the actual pointer you hold is always UB to access in that case. The kinds of use cases this is targeted at are allocator wrappers, where every pointer returned from the allocator would need to be manually masked out by the end user. With this proposal we can shift the heavy lifting from the user to the library.
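
To make the ergonomic cost concrete, here's a minimal sketch using the stable strict-provenance APIs (`map_addr`, `addr`). The tag value and the `untag` helper are my own illustration, not part of the proposal; it assumes a 64-bit target with the tag in the top byte, as TBI has it:

```rust
// Hypothetical helper: strip a TBI-style tag from the top byte (bits 56-63)
// of a 64-bit pointer. `map_addr` changes the address while keeping the
// pointer's provenance, so the untagged pointer is still valid to use.
fn untag(p: *mut u8) -> *mut u8 {
    p.map_addr(|a| a & 0x00FF_FFFF_FFFF_FFFF)
}

fn main() {
    let mut x = 5u8;
    let p: *mut u8 = &mut x;
    // Pretend an allocator handed us this pointer with tag 0x2A in the top
    // byte, as a TBI-aware allocator might.
    let tagged = p.map_addr(|a| a | (0x2A << 56));
    // Without hardware TBI (or a simulate_realloc-style intrinsic), every
    // single access has to go through the mask first:
    assert_eq!(untag(tagged).addr(), p.addr());
    assert_eq!(unsafe { *untag(tagged) }, 5);
}
```

It's exactly that per-access `untag` boilerplate that the proposal pushes down into the library.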


Can you explain how? The processes are each guaranteed to read either the previous value, or the next value, not a torn value; the only thing that's "broken" is that if you use a Relaxed read to read the result of a Relaxed write, you do not know if other writes by the thread that did the Relaxed write have happened or not.

I'm not saying that Relaxed atomics don't work in shared memory; I'm saying that if you have multiple mappings of the same physical address, then the ordering guarantees provided by ARMv8a don't guarantee that a previous write by this thread to one VA is visible when you read from another VA.

But that's also the statement made by Relaxed atomics in general; a write made by one thread is not guaranteed to be visible to another thread, even if, in program order, the invisible write comes before the visible write. It's just that mmap games make this something that can go wrong in a single thread, rather than an issue you can only see between two threads.
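
For readers less familiar with the orderings being discussed, here is a small sketch of the standard two-thread message-passing pattern (plain standard-library atomics, nothing from the proposal): with Release/Acquire the earlier write is guaranteed visible, while downgrading both operations to Relaxed is exactly the case where the "invisible write before the visible write" may not be seen.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let t = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // If this store were Relaxed instead of Release, the reader below
        // could in principle observe READY == true yet still read DATA == 0.
        READY.store(true, Ordering::Release);
    });
    // The Acquire load pairs with the Release store above, so once READY is
    // seen as true, the write to DATA is guaranteed to be visible too.
    while !READY.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
    t.join().unwrap();
}
```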

Hm, that is a fair point for relaxed atomics, but it would break consume ordering, I believe? Which, again, is something you get for free on ARMv8.

But @comex pointed out that:

Which seems pretty clear and final to me.

I'm not sure whether it's a typo or not, but the code does not work. Under TBI, if the tag is in the top byte, then both p and r will point to a. Unless the compiler loads the cached value of a, of course, but the memory location of a contains 5 by the time the assert runs. This snippet is definitely UB for that reason. The way it works in current C/C++ codebases is that the pointer is tagged when allocating and the untagged version is never touched or exposed afterwards; that is the 'safe' way to do it.
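
As a hedged sketch of that C/C++ pattern in Rust terms (the wrapper type, tag value, and mask constant are my own invention; on hardware without TBI the tagged pointer is not actually dereferenceable, so this only illustrates the shape of "tag at allocation, never expose the untagged pointer"):

```rust
use std::alloc::{GlobalAlloc, Layout, System};

// Hypothetical TBI-style allocator wrapper: the pointer is tagged at
// allocation time and the untagged address is never exposed to callers;
// the tag is stripped only internally on deallocation. Error handling
// (e.g. a null return from the system allocator) is omitted.
struct TaggingAlloc {
    tag: u8,
}

unsafe impl GlobalAlloc for TaggingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = unsafe { System.alloc(layout) };
        // Put the tag in the top byte; provenance is preserved by map_addr.
        p.map_addr(|a| a | ((self.tag as usize) << 56))
    }

    unsafe fn dealloc(&self, p: *mut u8, layout: Layout) {
        // Strip the tag before handing the pointer back to the real allocator.
        unsafe { System.dealloc(p.map_addr(|a| a & 0x00FF_FFFF_FFFF_FFFF), layout) }
    }
}

fn main() {
    let a = TaggingAlloc { tag: 0x2A };
    let layout = Layout::from_size_align(16, 8).unwrap();
    unsafe {
        let p = a.alloc(layout);
        // The caller only ever sees the tagged pointer.
        assert_eq!(p.addr() >> 56, 0x2A);
        a.dealloc(p, layout);
    }
}
```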

I don't think this can be used with stack variables safely at all. The Android use-case only tags heap allocations for one, and I doubt there's any way to make this safe if we can access a variable both through the local variable and also through the pointer. I'll add some notes to the RFC and the documentation about this.


I wrote a simple test program to see if you get stale reads with simulate_realloc: no memory barriers, a single thread, aliasing mmaps, and standard byte loads/stores. It needs Linux but should otherwise be portable to any architecture with an instruction named "nop" (it tests between 0 and 511 nops between the store and load). I tested it on a Pixel 4a (not the 5G variant) with aarch64-linux-android, and it didn't detect any stale reads.

use std::{arch::asm, ffi::{c_int, c_long}};

#[inline(always)]
fn nops<const N: usize>() {
    unsafe {
        asm!(
            ".rept {}",
            "nop",
            ".endr",
            const N,
        );
    }
}

#[inline(always)]
unsafe fn simulate_realloc(mut p: *mut u8, new_addr: usize) -> *mut u8 {
    // Swap in the new address while hiding the substitution from the
    // optimiser: the asm block takes both values in and claims to produce
    // the result, so the compiler cannot assume the output equals either.
    asm!("/* {} {} */", inlateout(reg) new_addr => p, in(reg) p);
    p
}

#[inline(never)]
unsafe fn test_case<const NOPS: usize>(a: usize, b: usize, out: &mut [u8]) {
    for (i, o) in out.iter_mut().enumerate() {
        let ptr = a as *mut u8;
        *ptr = i as u8;
        nops::<NOPS>();
        let ptr = simulate_realloc(ptr, b);
        *o = *ptr;
        simulate_realloc(ptr, a);
    }
}

#[inline(never)]
unsafe fn check<const NOPS: usize>(a: usize, b: usize) {
    println!("testing {NOPS} nop(s)...");
    let mut out = [0; 1024];
    for _ in 0..1000 {
        test_case::<NOPS>(a, b, &mut out);
    }
    for (i, o) in out.iter().enumerate() {
        assert_eq!(*o, i as u8, "failed with {NOPS} nop(s) for index {i}");
    }
}

const SIZE: usize = 0x10000;

fn main() {
    unsafe {
        let memfd = libc::syscall(libc::SYS_memfd_create, c"my_memfd".as_ptr() as c_long, libc::MFD_CLOEXEC as c_long) as c_int;
        libc::ftruncate(memfd, SIZE as libc::off_t);
        let addr1 = libc::mmap(
            core::ptr::null_mut(),
            SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            memfd,
            0,
        ) as usize;
        println!("{addr1:#x}");
        let addr2 = libc::mmap(
            core::ptr::null_mut(),
            SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            memfd,
            0,
        ) as usize;
        println!("{addr2:#x}");
        macro_rules! checks {
            ([], [$($bits:literal,)*]) => {
                check::<{
                    let v = 0usize;
                    $(let v = v * 2 + $bits;)*
                    v
                }>(addr1, addr2);
            };
            ([$first:literal, $($rest:literal,)*], [$($bits:literal,)*]) => {
                checks!([$($rest,)*], [$($bits,)* 0,]);
                checks!([$($rest,)*], [$($bits,)* 1,]);
            };
        }
        checks!([2, 2, 2, 2, 2, 2, 2, 2, 2,], []); // check 0 to 511 nops
    }
}

The fence that "acquires" the page table update has to sit in an inline asm block. Logically we say that it is that block that actually makes the allocation accessible.

That probably means the pointer returned by the "fake realloc" should go through that block as well, because logically it only really exists after that block.

IIUC, x86 requires a special fence after page table updates. So the fundamental problem here seems to affect widely-used platforms, or did I misunderstand something?

A normal fence is not enough for this, since normal fences only have atomicity model effects. This needs to be an inline asm block. I don't think we can do anything portable here as the requirements regarding what to do after a page table change are inherently architecture-specific -- and the portable memory model doesn't have this concept so it can't be used.


imo, changing which pages map to which physical addresses isn't what we were discussing, and the fact that changing page tables usually requires a special fence is not the issue (imo page table modifications aren't even part of Rust's AM). Instead we were discussing TBI-style stuff and what happens when two virtual addresses map to the same physical address.

To be clear, I'm not saying we can't talk about that, but you might have misunderstood: we were talking about the weird semantics you're left with well after mmap has finished, not about what mmap has to do internally.

I was thinking about when the same address is mapped twice. This can be used to optimise ring buffers for example. See e.g. Using black magic to make a fast circular buffer. (that is just the first decent Google result for me, but there are other examples of it too).

Here the fact that the same physical memory is mapped twice is static, and in user space. No direct manipulation of the page table is happening over time. This is just calling mmap twice to set things up initially.

Another use case for this is the modern Java GC: it uses bits 43-44 (IIRC) for tagging (in the mark-and-sweep GC). But other code doesn't need to mask off those bits, since the entire Java heap is mapped multiple times at suitable locations. Very clever!

Apparently some unspecified CPUs (unspecified due to NDAs) are incompatible with this, and check for address equality using the virtual address rather than the physical address. Within the same thread.

The argument has been "that makes no sense on ARM, since normal consume order memory load/store lowers to plain loads and stores, this can't possibly be allowed".


I'd like to point out an unrelated reason why simulate_realloc is, in my opinion, not a good name for this hypothetical intrinsic: The purpose of realloc in C is to change the size of a heap allocation. It may or may not also change the address, there's no guarantee either way, and if it does change the address it gives you no control over the new address. Therefore the name simulate_realloc will give people the wrong idea about what the intrinsic does and what it's useful for.

change_tag_bits and move_pointer_target as proposed by @farnz are better names, IMO, even if the semantics discussion concludes that one intrinsic can do both jobs, just because they are so much more descriptive of what they do.


I guess my previous quote from the ARM Architecture Reference Manual was not explicit enough, but here's the full quote (from DDI0487K, G5.10.1). It definitely does guarantee ordering in the case you're describing.

This means that the behavior of accesses from the same observer to different VAs, that are translated to the same PA with the same memory attributes, is fully coherent. This means these accesses behave as follows, regardless of which VA is accessed:

  • Two writes to the same PA occur in program order.

  • A read of a PA returns the value of the last successful write to that PA.

  • A write to a PA that occurs, in program order, after a read of that PA, has no effect on the value returned by that read.

The memory system behaves in this way without any requirement to use barrier or cache maintenance instructions.

There is an exception if the mappings have mismatched memory attributes. But as explained in section E2.9, "memory attributes" only refers to things like cacheability, where a mismatch is almost certainly a mistake and should be considered an unsupported configuration. mmap itself doesn't even let you control cacheability. The more common/legitimate scenario of mismatched permissions is not considered a mismatch in memory attributes.


Oh I see. Yeah that makes sense -- the actual page table doesn't change, so all we have to do is "logically" move around the allocation between its two mappings (basically, deciding which of the two mappings is "active"). And that doesn't require a fence on any tier 1 platform, or any ARM platform conforming with the manual @comex has been quoting. So we could offer simulate_realloc and it would work for "the mmap trick" on most targets.

Is it even possible to have an API that is portable to targets where the cache behaves in this strange way by keying on virtual addresses? Rust would need a new kind of fence intrinsic that is a NOP almost everywhere except for these odd targets. That seems hard to do without support from the backends, and AFAIK LLVM has no such fence.

I do seem to remember that GPUs do some weird stuff when it comes to memory mappings, so maybe that also includes caching on the virtual rather than the physical address? Cc @gnzlbg.

Consume order accesses are cursed anyway and (in their current form) cannot be properly supported by languages with an optimizing compiler, so I don't give much weight to this particular argument.

Turns out ARMv6 allows behaving in that strange way. IIRC ARMv6 is used in the Raspberry Pi 1, but I don't know if the cores it uses actually have that behavior...

This imo is a good point, thanks! Actually, what about just simulate_move? That way it's both clear from the name that it's only "simulating", and it's also clear that it should be thought of as moving from A to B and not accessing through A afterwards.


I'm not a fan of any name with either "move" or "simulate" in it tbh, because "move" sounds like the pointed-to data is actually being moved, and "simulate" sounds like this is a pseudo-NOP along the lines of black_box, neither of which is true.

For a single intrinsic, how about ptr::set_associated_bits? "Associated bits" -- bits that a pointer carries around but that don't affect the target of the pointer -- is at least in the vicinity of the right mental model, I think. But I still prefer separating the idea of altering pointer tag bits from the idea of shifting a pointer between two virtual maps of the same physical memory, because these are different tasks in the programmer's head, regardless of whether the abstract machine needs to treat them differently.


Well, this kind of is a pseudo-NOP in that it's just supposed to let the compiler know what's happening in case it would otherwise do something undesirable; once we get down to assembly, there's nothing special happening there at all. Maybe it depends on how you view it. It feels to me that all the names with change or set are kind of misleading too, because we're not actually modifying anything, just making a new pointer.

That being said, I don't have a super strong view on the naming so if the libraries team think we should go with a specific name I'm happy to just defer to them.

We're modifying the pointer itself, though, aren't we? Or do you have to do that separately?

With pointers being Copy, and given that the function takes the pointer by value, after the line below runs ptr does not change at all in a practical sense. tagged_ptr is a completely separate pointer with new_addr as its address. How we think of it changes, because accessing ptr would now be considered UB due to compiler optimisations, but neither the pointer itself nor the thing it points to has been modified in any real way.

let tagged_ptr = unsafe { simulate_realloc(ptr, new_addr) };

I don't see how. It says that the data cache cannot cause reordering, but it does not define "last successful write" with sufficient clarity to force ordering in the case I'm describing; in the case I'm describing, there has been no successful write in program order as yet, because the CPU is executing out-of-order. It thus reads the last successful write, not the most recent incomplete write, and gets confused as a result.

This is the core difference between an architecture guaranteeing acquire ordering and an architecture guaranteeing consume ordering on loads; because the write has not yet succeeded, it's in limbo at this point in time, and because there's no dependency between the write and the read as far as the CPU core can see, there is no read-after-write hazard to stall the read instruction until after the write instruction succeeds (or fails, but I'm assuming here that the write won't fail).

Once the write has been observed succeeding, the architecture prohibits this reordering.

To put it in assembly terms:

; assume 0x1000 and 0x2000 map to the same PA
MOV R2, #0x1000
MOV R3, #0x2000
MOV R1, #42
STR R1, [R2]
LDR R0, [R3]
; At this point R0 may, or may not, equal R1, because there is no hazard forcing LDR to happen-after STR R1, [R2]
LDR R0, [R2]
; At this point, R0 *must* equal R1, since the load through [R2] creates a hazard, forcing the LDR to wait until after STR R1, [R2] has completed

Not an expert on GPUs, but GPU memory is often mapped as non-temporal/write-combining for performance reasons as I understand it. Maybe that is what you are thinking of?

This is true, but showing that the assembly lowering of consume is the same as relaxed, which is the same as a normal non-atomic load/store, is a reasonable argument for why these weird CPUs-under-NDA don't make sense on ARM. I am of course not suggesting that Rust should support consume, merely pointing out that ARM tracks data dependencies at the CPU instruction level. And this must somehow work for IPC between processes (where there are multiple mappings of the same address).

Oh, I see what you're saying now. Borrowing from the "strict provenance" APIs, then, how about name(s) beginning with with_? ptr.with_tag_bits works well for me. I like ptr.with_moved_target a bit less, but I think the problem there is that "moved target" still doesn't really describe the operation very well. with_virtual_alias or to_virtual_alias might be better. ("Virtual" as in "virtual address space".)