Pre-RFC: core::ptr::simulate_realloc

Well, then, go to B2.3, which is more explicit about ordering relationships.

The keys here are the definitions of Location and Same Location.

Location:

A Location is a byte that is associated with an address in the physical address space.

Same Location:

An effect E1 and an effect E2 are to the Same Location if one of the following applies:

  • All of the following apply:
    • E1 is a Memory effect.
    • E2 is a Memory effect.
    • E1 and E2 are to the physical address (PA), or

[...snip other conditions relevant to non-Memory effects...]

I think this is missing the word "same" before "physical address". But the point is that the definition depends on physical address, not virtual address. More broadly, everything in B2.3 talks about physical rather than virtual addresses, except for one clause related to TLBI instructions.

That means that whatever path through B2.3's thicket of guarantees is normally responsible for guaranteeing same-thread coherence, it should apply with equal force whether the accesses are using multiple mappings or the same mapping.

Okay, but what is that path? B2.3 is extremely complicated, and this is my first time diving into it. There might be a simpler way to use its rules to prove same-thread coherence. But this is what I found:

Look at "Deriving Reads-from-memory and Coherence order from the Completes-before order" in B2.3.9.2. This section gives us two sets of sufficient conditions for an effect (in this case, the load) to "Read-from-memory" another effect (in this case, the store). Which set of conditions to use depends on which effect "Completes-before" the other. We know that one of the effects must Complete-before the other, because the Completes-before order is defined as "a total order" earlier in B2.3.9.2. I will show that regardless of which Completes-before the other, the store must "Read-from-memory" the load.

First suppose that the store Completes-before the load. Then we're on this set of conditions, where E2 is the store and E1 is the load:

[quote edited to replace bullets with numbers]

If all of the following apply:

  1. There is a Memory Write effect E2 (possibly representing the writing of the initial value to the Location).
  2. E2 and E1 are to the Same Location.
  3. E2 Completes-before E1.
  4. It is not the case that E2 Completes-before a Memory Write effect E3 and E1 is the local read successor of E3.
  5. There is no Memory Write effect E4 to the Same Location in the Completes-before order between E2 and E1.
  6. There is no Memory Write effect E5 to the Same Location in the Completes-before order between E2 and a Memory Read effect E6 to the Same Location that appears in program-order before E1.

then it must be that E1 Reads-from-memory E2.

Condition 1 is satisfied. Condition 2 is satisfied because the load and store use the same physical address (again, regardless of whether they use the same mapping). Condition 3 is satisfied by assumption. Conditions 4, 5, and 6 are satisfied because we're assuming that there are no other writes to the same location. (Or if there were any earlier writes, we're assuming that they appear earlier in the Completes-before order than any of the current operations. Proving correctness in the presence of multiple writes is possible but more complicated.)

Alternatively, suppose that the load Completes-before the store. Then we're on this set of conditions, again with E2 as the store and E1 as the load:

[quote edited to replace bullets with numbers]

If all of the following apply:

  1. There is a Memory Write effect E2.
  2. E1 is a Local read successor of E2.
  3. E1 Completes-before E2.
  4. There is no Memory Write effect E3 to the Same Location in the Completes-before order between E2 and an Explicit Memory Read effect E4 to the Same Location that appears in program-order before E1.

then it must be that E1 Reads-from-memory E2.

Condition 1 is satisfied. Condition 3 is satisfied by assumption. Condition 4 is satisfied because, again, we're assuming that any other writes must appear earlier in the Completes-before order. And condition 2 is satisfied by the definition of "Local read successor":

A Memory Read effect E2 of a Location is the Local read successor of a Memory Write effect E1 to the same Location if E1 appears in program order before E2 and there is no Memory Write effect E3 to the same Location appearing in program order between E1 and E2.

QED.
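
For concreteness, here is roughly the scenario the argument above covers, as a sketch (the names are made up; `a` and `b` are assumed to be two virtual mappings of the same physical byte, and the volatile accesses are only there to keep the compiler out of the picture, since B2.3 is about the hardware):

use std::ptr;

// `a` and `b` are two virtual addresses backed by the same physical byte.
unsafe fn same_thread_coherence(a: *mut u8, b: *mut u8) -> u8 {
    // Store through one mapping...
    ptr::write_volatile(a, 42);
    // ...then load through the other. Since "Same Location" is defined in
    // terms of physical addresses, the rules above say this load must
    // observe the 42 (assuming no other writes to the location).
    ptr::read_volatile(b)
}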

--

One thing about B2.3 is that it doesn't contain an explicit exception for mismatched memory attributes. I think that is a mistake. But on that subject, here is another quote to provide a third "proof" that ordering is guaranteed in your example. This "proof" is more informal and implicit than the others, but I thought it was still compelling enough to tack on. From B2.16, "Mismatched memory attributes":

When a memory Location is accessed with mismatched attributes, the only software visible effects are one or more of the following:

  • Uniprocessor semantics for reads and writes to that memory Location might be lost. This means:
    • A read of the memory Location by one agent might not return the value most recently written to that memory Location by the same agent.
    • Multiple writes to the memory Location by one agent with different memory attributes might not be ordered in program order.

[...snip other effects...]

Accessing the same Location with mismatched attributes essentially requires multiple mappings. Therefore, this language strongly implies that in the general case where we have multiple mappings but not mismatched attributes, "uniprocessor semantics" are not "lost". And in particular, "a read of the memory Location by one agent" does need to "return the value most recently written to that memory Location by the same agent". (As long as there are no relevant writes by other agents, an assumption which is not stated explicitly.)

4 Likes

One important question is whether two threads should be allowed to access different virtual aliases of the same memory concurrently.

The name with_virtual_alias suggests that the answer might be yes. The name simulate_realloc suggests that the answer is probably no.

I think the answer should be yes; otherwise, the circular buffer example from @Vorpal's post is pretty useless. Technically, the example code at that link doesn't support concurrency at all, but most real-world uses for circular buffers do need concurrency and would be implemented using atomics.
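
If the answer is yes, the kind of usage that would permit looks roughly like this (a sketch with made-up names, not code from that link; `alias_a` and `alias_b` are assumed to be two virtual mappings of the same physical byte):

use std::sync::atomic::{AtomicU8, Ordering};

// Thread 1 writes through one alias...
unsafe fn producer(alias_a: *const AtomicU8) {
    (*alias_a).store(1, Ordering::Release);
}

// ...while thread 2 concurrently reads through the other alias.
unsafe fn consumer(alias_b: *const AtomicU8) -> u8 {
    (*alias_b).load(Ordering::Acquire)
}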

Extra note about same-thread accesses

Actually, my first post in this thread was wondering if there were some conditions under which we could support accessing multiple virtual aliases of the same memory from the same thread without any barriers.

I thought other posts were asking for that, but I may have misunderstood.

Such accesses would definitely be a problem if one pointer were derived from the other using wrapping_offset, since LLVM is known to make no-alias assumptions in that case. But if LLVM doesn't know about the relationship between two pointers, it normally won't make no-alias assumptions. The question is: if two pointers start out having an unknown relationship, can we guarantee that they'll stay that way?

For example, might LLVM use the result of an earlier pointer comparison to introduce no-alias assumptions? That is, in code like this:

pub unsafe fn foo(x: *mut i8, y: *mut i8) -> i8 {
    if x != y {
        *x = 1;
        *y = 2;
        // If the compiler used `x != y` to conclude that the two stores
        // can't alias, it could fold this read to 1.
        *x
    } else {
        1
    }
}

Could LLVM's optimizer say: if x != y, then *x = 1 and *y = 2 don't alias, so the read from *x always returns 1? Currently it doesn't do this. But might it do so in the future?

I don't know. But I did find another problematic example. LLVM's ability to make aliasing assumptions based on wrapping_offset also applies to integer math. The following will be optimized to always return 1 (though interestingly, only at the assembly/MachineFunction level, not in LLVM IR):

pub unsafe fn foo(x: *mut i8) -> i8 {
    *x = 1;
    // Round-trip through an integer to store at a different address; LLVM
    // still assumes the two stores can't alias.
    *(((x as usize) + 123) as *mut i8) = 2;
    *x
}

You could arguably classify that as an LLVM bug, part of the longstanding umbrella of unsoundness around ptrtoint/inttoptr pairs. But arguably it's a separate mechanism. In any case, it further discourages me from trying to come up with a scheme in which barrier-free accesses could be guaranteed.

1 Like

Assuming simulate_realloc ends up marking the previous pointer as invalid to use, I think any with_* name is misleading, since it implies you're just creating a new pointer and not changing the old one, when in fact the old pointer and any related pointers are changed by having their provenance marked as deallocated.

2 Likes

I did propose moved() upthread, because no moving (active tense) is happening; it already happened by magic, and we're just telling the compiler about it.

Though we don't have to be this terse; I hope this isn't such a common operation.

That is how it is mapped on the CPU side. On the GPU side, shaders have access to memory with unconventional caching semantics; exactly how that works depends on the specific GPU.

1 Like

The answer is no. The Rust AM only supports a single location for every allocation.

To make the answer yes would require changing the LLVM memory model to support an allocation mapped in multiple places at once. Whether it's in the same thread or not makes no difference.

EDIT: Never mind, I guess these are fine, since they are about threads not seeing every allocation that other threads see (and seeing different allocations at the same virtual address), not about threads seeing the same allocation at different addresses:

I'm fairly certain that would break certain embedded dual-core systems. I seem to remember seeing one where some MMIO registers were only available from one core, or where certain peripherals were duplicated per core.

For an example of the latter: the RP2040 (the chip on the Raspberry Pi Pico) has core-local MMIO buses that contain, for example, an integer division peripheral per core (plus a bunch of other things critical to the operation of the chip, such as ways for one core to wake the other, which is needed for boot).

Another case is that of core-local RAM, for example on the PlayStation 3. It had different types of cores, and some of the helper cores had their own fast local RAM.

However, the following question remains, and it is one where multiple views of the same allocation could show up:

Also where is even the border between instances of the AM? You can't use terms like "OS" or "program" when answering this, because they only have meaning on std. I'm talking about the general case of an embedded system, with an MMU, but everything built into one image loaded onto the system at once. What exactly makes threads in that system part of the same AM or different AMs?

Such platforms are currently unsupported by Rust (and most C compilers), and this proposal doesn't change anything about that. But that's okay, this proposal doesn't have to solve all the problems in the world.

EDIT: Also, if you only use volatile to access something, the rules are different. volatile is basically a form of inline assembly, so the memory it accesses doesn't have to play by the rules of normal AM memory.
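
For illustration (a hypothetical example), this is the usual shape of such an access; under the "basically inline assembly" view, the compiler has to treat it as an opaque operation on memory it knows nothing about:

use core::ptr;

// Read a memory-mapped register. The compiler must not elide, duplicate,
// or reorder this access relative to other volatile accesses.
unsafe fn mmio_read(reg: *const u32) -> u32 {
    ptr::read_volatile(reg)
}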

Having different address spaces on different threads is a completely different nightmare, if you want to discuss that please take it to a new topic.

1 Like

To be a little more verbose, maybe ptr::assume_moved(ptr, addr), then? Somewhat referencing assume_init() here, to make it very clear to the user that the memory needs to already be accessible at the target address, and that they are the ones making a promise to the compiler rather than asking the compiler to do something.
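
A sketch of what the signature could look like (to be clear, this is just my suggestion, not an accepted API, and the body is a placeholder):

/// Promise to the compiler that the allocation behind `ptr` is now
/// accessible at `addr`. Placeholder body: a real version would need
/// compiler support to also retag the old pointer's provenance as
/// deallocated.
pub unsafe fn assume_moved<T>(ptr: *mut T, addr: usize) -> *mut T {
    ptr.with_addr(addr)
}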

1 Like

I don't think this answered the second point about where the borders between AM instances are, I opened an issue on the unsafe code guidelines repo to hopefully get some clarification on that: Where does one instance of the Rust Abstract Machine end and another one start? · Issue #543 · rust-lang/unsafe-code-guidelines · GitHub

Do you have a source for LLVM not supporting it?

For a (partial) counterexample, even this Rust code actually makes use of an allocation of virtual memory that maps to the same physical memory, at least on Linux:

// One gibibyte, zero-initialized; on Linux the kernel initially backs
// every page with the same shared zero page, copy-on-write.
let mut v = vec![0u8; 1 << 30];
std::hint::black_box(&mut v);
dbg!(v.iter().sum::<u8>());

All pages in the zero-initialized allocation are mapped to the same zeroed range of physical memory, with copy-on-write semantics.

But maybe this doesn't count since it isn't really observable by the program? (Although it is an easy way to get bogus benchmark results accidentally.)

LLVM doesn't document its memory model. But I see no indication at all that the compiler would allow pointers to change their meaning when being moved to a different thread.

Do you have a source for LLVM supporting it? Generally features must be documented, not their absence. :wink:

That's completely invisible to the program (other than performance) so it's an entirely different case.

I wasn't saying that at all. I was assuming the discussion is about accesses to the same memory via two distinct pointers, with different addresses.

Yeah, that's part of why I asked. I tried to find something in the LangRef, but the best I could find is how the pointer aliasing rules allow this:

An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM. Such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM.

... which sounds to me like two pointers returned from calls to mmap could well be associated with the same address range.

There are also the (experimental) non-integral pointer types, which seem like something where the compiler definitely wouldn't be allowed to infer that accesses don't alias when addr(p) != addr(q). What's less clear is whether that inference is currently allowed for normal pointers.

1 Like

Oh, sorry, I attached your question to the wrong subthread.

LLVM's alias analysis assumes that ptr.wrapping_offset(N) and ptr never alias (for N != 0). So here it's fairly clear -- LLVM definitely does not support the multi-mapping situation, and supporting it would come at a cost for the remaining code.

(Again, this is all only about non-volatile accesses. volatile is basically inline asm so the rules are entirely different.)
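
Spelled out, the assumption licenses something like this (a hypothetical example in the spirit of the earlier ones; the page-sized offset stands in for the distance between two mappings, and I haven't checked whether any current pass fires on this exact snippet):

pub unsafe fn foo(x: *mut u8) -> u8 {
    *x = 1;
    // LLVM assumes this store cannot alias `*x` (the addresses differ),
    // so it may fold the read below to 1 -- even if, via a second
    // mapping, both virtual addresses refer to the same physical byte.
    *x.wrapping_offset(4096) = 2;
    *x
}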

Yes, that's clear. But given p: *mut u8 and q: *mut u8, which are not based on the same pointer (say, they were returned from calls to some opaque function), does p.addr() != q.addr() imply (to LLVM) that accesses through them don't alias? With provenance, *q = 1 is not the same as *(p.with_addr(q.addr())) = 1.

That is, pointer p has provenance to access a byte at address p.addr(), and pointer q has provenance to access a byte at the different address q.addr(), and there is no pointer that could do an access at both addresses. Is it fundamentally not allowed for that byte of memory to be one and the same?

Equivalently, must two distinct addresses refer to two distinct bytes, even if those addresses can never both be accessed with the same provenance? It kind of feels like maybe that should be modeled to better handle shared memory and the likes.
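
To make the distinction concrete (hypothetical sketch):

unsafe fn demo(p: *mut u8, q: *mut u8) {
    // q's provenance at q's address: fine whenever q is valid for writes.
    *q = 1;
    // p's provenance at q's address: UB unless q.addr() happens to fall
    // within the allocation that p is allowed to access.
    *p.with_addr(q.addr()) = 2;
}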

3 Likes

This must be OK as long as you use atomics or volatile (either on its own should be fine), since in both of those cases LLVM cannot assume memory isn't changed behind its back. This means loads are fine. Similarly, it cannot elide atomic or volatile writes if another thread could see them.

I think what would really be needed to make this practical for ring buffers and the like, though, is some form of atomic_memcpy that is byte-wise atomic and allows tearing, but is permitted to use operands wider than a byte for efficiency (no one wants a byte-by-byte copy). I know this was discussed before, but I don't know what its status is.
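
As a strawman, the guarantee I mean is roughly the following, except that a real implementation would be allowed to use wider loads and stores internally (names made up; a byte-by-byte loop like this is exactly what no one wants):

use std::sync::atomic::{AtomicU8, Ordering};

// Copy `len` bytes; each byte is individually atomic, and tearing
// between bytes is allowed.
unsafe fn atomic_memcpy_bytes(dst: *const AtomicU8, src: *const AtomicU8, len: usize) {
    for i in 0..len {
        let b = (*src.add(i)).load(Ordering::Relaxed);
        (*dst.add(i)).store(b, Ordering::Relaxed);
    }
}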

I would very much assume so, yes.

What would it look like to have a coherent memory model in which two unequal pointers can affect each other, but wrapping_offset still always yields a pointer that cannot affect the original pointer? It would have to be something where we have two distinct allocated objects (with their own provenance, thus preventing usable wrapping_offset between them) that nonetheless share the underlying bytes stored inside them. I would be quite surprised if LLVM were compatible with such a model. Cc @nikic

For volatile, you are right, as already mentioned above. Atomics are not special; they live entirely inside the memory model, so it is not possible to do "unknown code"-style reasoning for them. You have to explain everything you do within the memory model.

That means you can't possibly do IPC over shared memory in Rust? Since you need atomics there, and it is being done from a different program.

That would make a lot of software UB. For example, everything talking with an X server or Wayland. I assume you run into similar issues on Mac and Windows, but I don't know those platforms well.

Also, it breaks zenoh, iceoryx2, and any other IPC library with shared-memory support.

And Rust doesn't support volatile atomics, so that isn't an option either.

So that can't be right.

1 Like

Being done from a different program is different from two aliases in the same program, since you can most likely model the other program as a separate set of threads that always use atomic ops on the shared memory, at the same addresses as your program, without the memory actually having aliases.
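
As a sketch of that modeling (made-up names): as far as the memory model is concerned, the other process might as well be the following thread, operating at the very same addresses:

use std::sync::atomic::{AtomicU8, Ordering};

// Stand-in for the other process: a "thread" that only ever touches the
// shared region through atomic ops, at the same addresses our program uses.
unsafe fn model_of_other_process(shared: *const AtomicU8) {
    while (*shared).load(Ordering::Acquire) != 1 {
        std::hint::spin_loop();
    }
    (*shared).store(2, Ordering::Release);
}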

1 Like

Maybe I'm being stupid or missing something, but to me it seems there is no way for the program to tell apart an atomic write done by another thread and one done by another process over IPC.

The compiler certainly doesn't track, or even know, which specific library or syscalls cause memory to be shared. As soon as it can't definitively track where memory comes from, it has to assume the worst. This is, as I understand it, called escape analysis.

That means that unless the compiler can see that memory for sure comes from a source it understands (stack, malloc, static variables) it must assume the worst (subject of course to the constraints of the type of reference/pointer, interior mutability, etc).

In the example of the ring buffer (memory from some mmap call) the compiler has to assume that writes from another thread or process can be interleaved between every instruction it emits.
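
For example (hypothetical sketch), given a pointer into such an mmap'd region:

use std::sync::atomic::{AtomicUsize, Ordering};

// Spin until some other thread -- or process -- advances the head index.
// The load cannot be hoisted out of the loop: the compiler has no idea
// who else can store to this memory.
unsafe fn wait_for_data(head: *const AtomicUsize, tail: usize) {
    while (*head).load(Ordering::Acquire) == tail {
        std::hint::spin_loop();
    }
}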

Whether it is another thread or another process makes no difference to the machine code that has to be generated; those cases are exactly the same, and the running program cannot tell them apart. It also can't distinguish itself writing to an alias of the memory elsewhere in RAM from another thread writing to that memory.

In fact, you couldn't tell these two cases apart:

  • A page is mapped twice in RAM; we perform a release atomic write through one mapping and then read the value back (with an acquire atomic access) through the other mapping.
  • There is another thread or process that reads our first atomic write and atomically copies the data to the second location, with an interleaving such that it happens to run between our write and our read. (This is obviously a valid ordering that can happen.)

So it seems ludicrous to me to say that one of these is UB and the other is not, when there is no observable difference between them. As long as only atomic accesses are used I do not understand how this could cause UB.

The MMU is equivalent to another thread on another core doing stuff concurrently with you in this case.

There is no possible sequence of instructions that the compiler could generate that would cause UB in the one case that wouldn't cause UB in the other.

1 Like