Include racy reads in Rust memory model with `MaybeInvalid<T>`

Disclaimer: This post has nothing to do with a previous post mentioning MaybeInvalid.

The Unsafe Code Guidelines contain a deliberate-UB section mentioning the SeqLock issue, i.e. the SeqLock algorithm isn't compatible with the Rust memory model. In fact, a SeqLock relies on a racy read, which is known to be valid only after a subsequent check.

The document mentions two possible solutions:

  • (a) adopt LLVM's handling of memory races (the problematic read would then merely return undef, instead of a data race being UB)
  • (b) add a bytewise atomic memcpy and use that instead of the non-atomic volatile load.

There is currently an RFC open for solution (b). I would like to explore a path closer to solution (a).

It would be materialized by the following types/functions:

// in core::mem
#[lang = "maybe_invalid"]
#[derive(Copy)]
#[repr(transparent)]
pub union MaybeInvalid<T> {
    invalid: (),
    value: ManuallyDrop<T>,
}

impl<T> MaybeInvalid<T> {
    pub unsafe fn assume_valid(self) -> T { /* .. */ }
}

// in core::ptr
pub unsafe fn read_maybe_invalid<T>(ptr: *const T) -> MaybeInvalid<T> { /* .. */ }

Concretely, it would mean deferring the UB of a racy read to the MaybeInvalid::assume_valid call. The SeqLock implementation would then become:

pub struct SeqLock<T> {
    seq: AtomicUsize,
    data: UnsafeCell<T>,
}

unsafe impl<T: Copy + Send> Sync for SeqLock<T> {}

impl<T> SeqLock<T> {
    /// Safety: Only call from one thread.
    pub unsafe fn write(&self, value: T) {
        self.seq.fetch_add(1, Relaxed);
        fence(Release);
        unsafe { ptr::write(self.data.get(), value) }
        self.seq.fetch_add(1, Release);
    }

    pub fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Acquire);
            let data = unsafe { ptr::read_maybe_invalid(self.data.get()) };
            fence(Acquire);
            let s2 = self.seq.load(Relaxed);
            if s1 & 1 == 0 && s1 == s2 {
                return unsafe { data.assume_valid() };
            }
        }
    }
}

For SeqLock, MaybeInvalid::assume_valid would be called after ensuring there was no data race. Compared to RFC 3301, I see the following advantages:

  • The API is simpler: one type, one method, one function.
  • It doesn't reuse MaybeUninit, as initialization is not the issue here, avoiding confusion.
  • Writes remain non-racing plain writes, so assume_valid cannot return torn values. This is in my opinion the biggest advantage, as it removes an entire class of problems (for example the Drop issue of RFC 3301).
  • The semantics are closer to what the SeqLock algorithm is built on: a read which may be invalid, and is assumed valid after an atomic check.
  • This model is de facto already supported by LLVM, so no change would be needed on that side.

Of course, the drawbacks are significant:

  • It requires modifying the Rust memory model, which is, I assume, quite a blocker by itself.
  • Modifying the Rust memory model might impact interoperability with C/C++, as the Rust memory model is inherited from C++.
  • (I realized this while writing the SeqLock implementation) not only racy reads but also plain writes would have to interact with fences the way atomics do; the consequences of including plain writes here may be bigger than I expected.

I'm not an expert in the domain, so I may be completely off the mark here. Moreover, the fence issue made me reconsider my desire for MaybeInvalid compared to RFC 3301. But this post at least has the merit of discussing option (a) of the unsafe code guidelines.

Writing this post gave me the idea of replacing AtomicPerByte from RFC 3301 with

struct AtomicCell<T>(UnsafeCell<T>);

unsafe impl<T: Send> Sync for AtomicCell<T> {}

impl<T> AtomicCell<T> {
    pub const fn new(value: T) -> Self { Self(UnsafeCell::new(value)) }
    /// # Safety
    ///
    /// Calls to `store` must be serialized,
    /// i.e. two threads shall not call `store` concurrently.
    pub unsafe fn store(&self, value: T, ordering: Ordering) { /* .. */ }
    pub fn load(&self, ordering: Ordering) -> MaybeUninit<T> { /* .. */ }
    pub const fn get_mut(&mut self) -> &mut T { self.0.get_mut() }
    pub const fn as_ptr(&self) -> *mut T { self.0.get() }
}

It keeps the biggest advantage of the MaybeInvalid proposal, i.e. no torn writes and hence no drop issue, while not requiring the memory model to be modified. The proposal is written up in a comment on RFC 3301.
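To make the shape concrete, here is a minimal, single-threaded sketch of such a cell. The `store`/`load` bodies below are plain placeholder reads and writes (the real versions would need the race-tolerant, bytewise-atomic semantics discussed in the RFC), the ordering parameter is ignored, and I restrict it to `T: Copy` for simplicity:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::Ordering;

pub struct AtomicCell<T>(UnsafeCell<T>);

unsafe impl<T: Send> Sync for AtomicCell<T> {}

impl<T: Copy> AtomicCell<T> {
    pub const fn new(value: T) -> Self {
        Self(UnsafeCell::new(value))
    }

    /// # Safety
    ///
    /// Calls to `store` must be serialized.
    pub unsafe fn store(&self, value: T, _ordering: Ordering) {
        // Placeholder: a plain write. The real implementation would be a
        // store that concurrent `load`s are allowed to race with.
        unsafe { self.0.get().write(value) };
    }

    pub fn load(&self, _ordering: Ordering) -> MaybeUninit<T> {
        // Placeholder: a plain read. The real implementation would return
        // possibly-torn bytes that the caller must validate (e.g. with a
        // SeqLock-style sequence check) before calling `assume_init`.
        MaybeUninit::new(unsafe { self.0.get().read() })
    }
}

fn main() {
    let cell = AtomicCell::new(1u64);
    // Single writer here, so the safety contract of `store` is upheld.
    unsafe { cell.store(42, Ordering::Release) };
    let v = cell.load(Ordering::Acquire);
    // Safe: this load did not race with any store.
    assert_eq!(unsafe { v.assume_init() }, 42);
}
```

In a real SeqLock, the sequence-number check would decide whether the returned `MaybeUninit<T>` may be assumed initialized.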

Now that my issues with AtomicPerByte are solved by AtomicCell, I have less interest in this MaybeInvalid proposal.

By "cannot return torn values" you mean that the cases where values would be torn, it'll just be UB instead, right?

Though I'm not sure exactly on your subsequent point: what is "the Drop issue of RFC 3301" that you're avoiding? (That's a long RFC and even longer discussion, so I have no idea what exactly you're referring to.)

By "cannot return torn values" you mean that the cases where values would be torn, it'll just be UB instead, right?

Exactly. Writes are still done through the UnsafeCell, so they must be serialized. A valid read, i.e. one not racing with writes, means a torn value cannot be observed.

Though I'm not sure exactly on your subsequent point: what is "the Drop issue of RFC 3301" that you're avoiding? (That's a long RFC and even longer discussion, so I have no idea what exactly you're referring to.)

RFC 3301's AtomicPerByte contains a MaybeUninit, so it cannot drop its content. So even if you use it with content that is always initialized, for example in a SeqLock, you still have to drop the content manually. This is mentioned as error-prone in the RFC's drawbacks, and the RFC even mentions limiting the content of AtomicPerByte to Copy types.

I'm afraid this will remain hypothetical until someone actually works out a memory model that can do this. Currently, all we have is a sketch. Getting rid of data race UB fundamentally alters the core of the model, so we don't know which of the existing results (DRF theorems, optimizations, lowerings to asm) remain valid. We definitely know it's not trivial; this paper explored a semantics that avoids such UB -- but it works very differently from the C++ memory model, so we can't just switch to that or we'd have trouble linking Rust code and C++ code.

The LB example from that paper (page 5) is particularly interesting:

Somehow we have to say that it is legal for that program to print 1 from both threads. Currently we achieve this by saying that the program is UB so anything can happen. If read-write races are no longer UB, this means non-atomic accesses have to participate in the memory model in a much deeper way than they do now, and I don't think anyone knows what that should look like.
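The paper's example itself isn't reproduced above, but the same load-buffering (LB) shape can be sketched in legal Rust using Relaxed atomics (the non-atomic variant under discussion cannot be written without UB today); the names here are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(0);

// One load-buffering execution: each thread reads one variable,
// then writes 1 to the other. With Relaxed atomics this is race-free.
fn lb_once() -> (u32, u32) {
    let t1 = thread::spawn(|| {
        let r1 = X.load(Relaxed);
        Y.store(1, Relaxed);
        r1
    });
    let t2 = thread::spawn(|| {
        let r2 = Y.load(Relaxed);
        X.store(1, Relaxed);
        r2
    });
    (t1.join().unwrap(), t2.join().unwrap())
}

fn main() {
    let (r1, r2) = lb_once();
    // Any combination of 0s and 1s is a permitted outcome under the model.
    println!("r1={r1} r2={r2}");
}
```

For this Relaxed version the C++/Rust model already allows even the outcome r1 == 1 && r2 == 1; the question in this thread is what the analogous program with plain, non-atomic loads should mean.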

So I'm afraid this idea is many years of research away from being realizable, unfortunately.


I've thought quite a lot about this, and I'm pretty sure that the correct model is along the lines of "a data race for which the value read affects observable program behaviour is undefined behaviour", while permitting, e.g., reading a value in a racy way but discarding it before making any use of it, as long as the read was done using a function specifically designed to permit races. (You can think of this as a "data races read as LLVM's poison value" sort of model, which clearly explains the "both threads print 1" example above, as both threads would be printing poison.) This is sufficient for sequence locks because you can read the value, then check whether the read could have been raced against, and discard the read value without doing anything with it if a race could have existed.

LLVM effectively implements a version of this (by making raced reads return undef, which is strictly more defined than poison is), and for all reads rather than just specially marked ones.

I tried (a while ago) to prove that any compiler that allowed separate compilation and opaque inline-asm blocks and that implemented the C++ memory model would also implement this memory model (in which case we could get an implementation of the memory model we wanted simply by using our existing compilers that follow the C++ memory model), via implementing the race-allowed read as an assembly-level read that was opaque to the compiler. Unfortunately, the result turned out to be false, but only in a fairly unlikely situation (in which the compiler uses the fact that a block of opaque code would race in one scenario in order to make assumptions about how it behaves in different scenarios). As such, I'm fairly confident that implementing a racy read using an inline asm that's opaque to the compiler would work in practice, despite there being a theoretical possibility of it breaking if LLVM adds an unlikely sort of optimisation.


Yes that's exactly what this thread is about: racy reads return undef. undef is a value where if the program behavior ever depends on what the value is, that's UB. (poison would work just as well for us here.)

However it's very hard to actually make both reads in my example return undef. One of the reads has to happen first, before any write even occurred, at which point it can only return 0 (the initial value of X and Y). Only later do we realize that actually there's a write that races with this read, now we have to... go back in time and change what the read returned? That's not even a meaningful thing to say. :wink: In the current semantics, we can just say "UB" at this point and we're fine. But in a world where data-races are not always UB, we need an odd form of time-traveling semantics. That can be done, but it is exceedingly complicated and nobody has managed to actually work with such a model so far.

You can also entirely forget about having a tool like Miri for such a semantics. So we can add that to the list of downsides of this model. (Miri of course already has the same problem for the variant of this program where the two loads are Ordering::Relaxed. But given that only a tiny fraction of reads in any given program use that ordering, I think it is a very different situation if we have that problem for all reads vs. only for a few of them.) Miri would become largely useless for finding data races if we adopted this model.


I would likewise expect that very few pieces of data in a program would use this kind of pseudo-cell that permits racy reads. By way of example, the Linux kernel has RCU, but the actual fraction of pointers managed by RCU is a small fraction of pointers in the kernel.

I would love to see a mechanism for this.

Making the kind of access a property of the location is a mistake that C made that we won't repeat -- there's no "atomic location", no "volatile location", and so also no "location that permits racy reads".

So I think you are suggesting we should introduce a new kind of read that has these semantics, while preserving the typical UB semantics for reads by default. That would greatly reduce the problems with Miri. It does nothing to fix the fact that no such model has actually been worked out in any level of detail (except for this which is a huge departure from how C++ does concurrency so we'd lose C++ interop, not to mention that the model also has various gaps and issues). In fact I don't think any model has been worked out where both "reads with UB races" and "reads with undef races" co-exist.

We haven't even solved the problems with the existing C++ model yet, and people can't agree on how they should be solved. :frowning: I'm referring to the out-of-thin-air problem, which in fact is closely related to the example I posted above. Hardware people say they can't live without load buffering, and at the same time load buffering breaks deeply-rooted compiler assumptions.


I don't see the problem here – there's no "which read happens first" because this is a concurrent context. Each read happens in a situation where there is no synchronisation mechanism preventing another thread from writing, thus they are both raced at the time they occur (with no retroactive change being needed).

The usual way to formalise this sort of model is to consider reads and writes to be only partially ordered – there are certain situations that can define an ordering (with the two main ones being program order for two events that happen on the same thread, and atomics), and whenever none of these situations define an ordering between two events, the two events are considered to be unordered with respect to each other. So in the example of "two threads each read a variable the other is writing", we have the read of A being sequenced before the write of B, and the read of B being sequenced before the write of A, but the events on the two threads are unsequenced with respect to each other. As such, because the read of A is unsequenced with respect to a write of A, it is raced on – likewise, because the read of B is unsequenced with respect to a write of B, it is raced on. This means that it is possible to determine that both are raced on without any time travel or retroactive reasoning, and they can be made to return undef as soon as it happens.

The C++ memory model (which Rust currently uses) is already formalised like this, so it wouldn't require any change in what we already have. (The only controversial part is in which specific things should cause a dependency and which specific things shouldn't.) At the hardware level, many processors also work like this: reads and writes don't occur at a single point in time but rather take some time to run, but the processors are designed to ensure that this sort of memory access always appears to start after everything it is ordered after, and always appears to finish before everything it is ordered before.

There are some examples that are interesting at the hardware level. For example (written in Rust notation, but imagine you're executing a compiled version of this on a processor):

fn example(x: &mut usize, y: &mut usize) -> usize {
    *x = 1;
    let a = *y;
    return a;
}

It is a valid implementation of this, at the hardware level, to start performing the read of a from *y in such a way that it won't see a value currently being written to *x (and this is usually more efficient in practice than waiting for the write to actually complete) – the two are effectively "simultaneous". However, that would violate the language semantics for the code (in which the write is supposed to happen first), so the processor inserts a check for ptr::from_mut(x) == ptr::from_mut(y) and just overrides the read to return 1 (the value being written) if it succeeds. This sort of thing makes it clear why the read and write are considered unordered with respect to each other from the point of view of other threads (because if those threads tried to detect which order the accesses occurred in, they might see the read as happening before the write).

(With respect to the out-of-thin-air problem, I'm pretty sure that one of the solutions which has already been suggested by others as potentially solving the problem – banning cycles formed out of "program order in same thread", dependencies implied by atomics, and semantic dependencies – is the correct solution for the problem, and the only reason that it's still a problem in practice is that it's difficult to formally define what a semantic dependency actually is.)

The usual way to formalise this sort of model is to consider reads and writes to be only partially ordered – there are certain situations that can define an ordering (with the two main ones being program order for two events that happen on the same thread, and atomics), and whenever none of these situations define an ordering between two events, the two events are considered to be unordered with respect to each other. So in the example of "two threads each read a variable the other is writing", we have the read of A being sequenced before the write of B, and the read of B being sequenced before the write of A, but the events on the two threads are unsequenced with respect to each other. As such, because the read of A is unsequenced with respect to a write of A, it is raced on – likewise, because the read of B is unsequenced with respect to a write of B, it is raced on. This means that it is possible to determine that both are raced on without any time travel or retroactive reasoning, and they can be made to return undef as soon as it happens.

I know all that. Most models of this sort can however still be built up in some "order", adding one node after the other to the graph. You can't just magically handwave the graph into existence, after all -- the per-thread semantics has a notion of time, and you have to define which executions you consider. You get into reasoning cycles if you ignore the temporal aspect of this; that's where OOTA comes from.

To my knowledge, the C++ memory model fundamentally relies on data-race UB for some of its key properties, like the DRF theorem. Nobody has demonstrated a language-level model written in that style where data races are not UB. And of course even if such a model exists, it's unclear whether it can be linked with the C++ memory model, which is required for cross-language interop.

It's easy to wave your hands and say "just do it", but if you actually get into the math, things aren't that easy. RC11 is the product of a line of work that spans more than a decade and involves a dozen of the smartest people I know. This kind of work requires a PhD-thesis-level of training and commitment.

So, the only problem with the solution is that it doesn't actually exist? :wink:


I am not suggesting at any point that we have a "location for which reads are racy by default". I do think it makes sense to have "location that permits explicitly performing racy read operations", so that writers know that's a possibility to account for, even if not every reader does it.

I also think, for instance, Rcu<T> makes sense, rather than having that be solely a property of the operation.

If by "these semantics" you mean

then I'd need to know more about what that implies before agreeing that's a sensible model. For instance, does "dereference the pointer" count as "program behavior ever depends on what the value is"?

The kind of thing I'd like to be able to build in Rust, and can build today in C, is the kind of data structure I wrote my dissertation and other papers on, where writers can overwrite a pointer to a struct with a pointer to another instance of the same struct, and readers will get the old or new pointer, until an operation occurs that waits for all existing readers to finish, at which point it's safe to assume that new readers cannot get the old pointer (and thus, for instance, it could be freed or reused).

There have been whole memory models built for RCU atop weak memory ordering models, including various formalizations and proofs. And, importantly, people were using RCU before that, too.

C and Rust use the same memory model (the one from C++) so I don't think there is anything you can build today in C that you can't build in Rust. I am curious why you think the "obvious" translation from C to Rust would not work.

I don't really know what you mean by this. But location-based models typically just behave poorly, especially in Rust where we have operations like get_mut which allow "normal" non-atomic accesses to what otherwise look like "atomic locations". C++ started out with a location-based model but even they are moving away from it with atomic_ref (though they still provide no way to mix concurrent atomic and non-atomic read accesses to the same memory; in Rust we do allow this).

I don't know which pointer you mean. But deref'ing an undef pointer is UB, yes. At this point I think we are entirely talking past each other though.

If you want to load a pointer in a way that allows races, we already have a way to do that: AtomicPtr::load (with Ordering::Relaxed if you want to get just a regular boring load instruction). This thread is about situations where the existing atomic APIs we have do not suffice, such as SeqLocks (which can be implemented neither in C nor Rust nor C++ at the moment; the intended way to fix that is atomic bytewise memcpy).
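For illustration, a race-allowed pointer load with the existing API might look like the following minimal sketch (names are illustrative; on mainstream targets the `Relaxed` load compiles to an ordinary load instruction):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

static VALUE: i32 = 42;

fn main() {
    // An AtomicPtr can be loaded concurrently with stores from other
    // threads without UB; Relaxed suffices if all we want is a plain,
    // race-free load of the pointer value itself.
    let shared: AtomicPtr<i32> = AtomicPtr::new(&VALUE as *const i32 as *mut i32);
    let p = shared.load(Ordering::Relaxed);
    // Dereferencing is a separate step, only safe if the pointee is valid.
    assert_eq!(unsafe { *p }, 42);
}
```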

RCU exists in many variants and at least some of them are entirely compatible with the C++ memory model and hence can already be implemented in Rust. For instance, this paper even proves one RCU implementation correct against the RC11 model.

I'm suggesting that it is potentially useful, from a type-system perspective, to tag things for which racy reads are permitted, so that writers of such things know that what they write may be subject to racy reads.

Then that doesn't seem like a useful model for racy reads, unless I'm missing something here.

I feel like we're definitely talking past each other at this point. People do implement seqlocks in C, and they're used regularly in production systems, including the Linux kernel. Is this the kind of "can't" where it actually works just fine in practice but compilers don't guarantee it despite that? I can appreciate the desire for clearer models and paths to support, but I think there's value in acknowledging things that do work in practice and are used in practice.

In C, func(i++, i++) may work in practice with gcc, but switching to clang gives another result. Hopefully, that is not the case for SeqLock implementations, but, like my short snippet, most of them still rely on undefined behavior at the language level. And relying on UB, while it sometimes works in practice, is not something we want to keep in the Rust/C++/C ecosystem, especially for well-known and widely used primitives like SeqLock.