Bit-wise reasoning for atomic accesses


#1

This issue uncovered the following very interesting method in bytes:

    #[inline]
    fn kind(&self) -> usize {
        // This function is going to probably raise some eyebrows. The function
        // returns true if the buffer is stored inline. This is done by checking
        // the least significant bit in the `arc` field.
        //
        // Now, you may notice that `arc` is an `AtomicPtr` and this is
        // accessing it as a normal field without performing an atomic load...
        //
        // Again, the function only cares about the least significant bit, and
        // this bit is set when `Inner` is created and never changed after that.
        // All platforms have atomic "word" operations and won't randomly flip
        // bits, so even without any explicit atomic operations, reading the
        // flag will be correct.
        //
        // This function is very critical performance wise as it is called for
        // every operation. Performing an atomic load would mess with the
        // compiler's ability to optimize. Simple benchmarks show up to a 10%
        // slowdown using a `Relaxed` atomic load on x86.

        #[cfg(target_endian = "little")]
        #[inline]
        fn imp(arc: &AtomicPtr<Shared>) -> usize {
            unsafe {
                let p: &u8 = mem::transmute(arc);
                (*p as usize) & KIND_MASK
            }
        }

        #[cfg(target_endian = "big")]
        #[inline]
        fn imp(arc: &AtomicPtr<Shared>) -> usize {
            unsafe {
                let p: &usize = mem::transmute(arc);
                *p & KIND_MASK
            }
        }

        imp(&self.arc)
    }

This got flagged by miri because of the transmute from an &UnsafeCell<T> to an &T (which is not okay, because the shared pointer expects to point to something frozen). But that’s not even the problem: The justification for the non-atomic access here just doesn’t hold. The C++11 model and also LLVM’s model say that a read-write race is either UB or yields undef, but it certainly won’t give you correct values for some bits. LLVM’s undef-on-race is explicitly happening on a byte-for-byte basis.

Any thoughts?

@carllerche One thing I am wondering, have you benchmarked using a relaxed load? EDIT: D’oh, you did, sorry for not reading properly. That is somewhat surprising, in particular on x86, but I guess LLVM is just too conservative around relaxed accesses.
Also, from the comment in your code, it is not clear to me whether you are saying that this is not UB, or whether you are saying that this is UB but does not cause trouble in practice. I am convinced that, if anything, it is the latter – and the comment should clearly say so.


#2

Note that compilers may turn non-atomic reads into multiple reads, like loading the value twice and expecting it (meaning the entire value) to be the same.

Moreover, for data races, it does not matter whether the value changes. The following is a race:

static mut GLOBAL: u8 = 0;

unsafe { rayon::join(
  || *GLOBAL = 0,
  || println!("{}", *GLOBAL)
) }

So even if LLVM defined its undef bitwise, that wouldn’t help here because the read still can see two different write events, and it doesn’t matter that they write the same value.


#3

Potentially relevant (including the preceding stuff). TL;DR this kind of trick would also be useful to optimize atomic into non-atomic reference counting when the object hasn’t ever escaped its creating thread, with a dynamic escape check; there is a paper investigating the performance impact for Swift (it is significant), and it seems to basically just assume that it’s sound. (I don’t know whether this is directly relevant to Rust – it’s at least not remotely obvious to me how one could go about adding/requiring those write barriers in library code.)

Are the problems all at the level of memory models? Is there anything where this would cause a problem at the level of hardware?


#4

@RalfJung

You are correct that it is a data race according to the LLVM model. It is assuming that, in practice, no chip will randomly set the bits to random values.

So, in your example with GLOBAL, while it is a data race, in practice you will never see a non-zero value.

Perhaps this assumption is incorrect. It would be interesting to see cases in which it does not hold.


#5

Hardware is not our enemy here, compilers are. :wink: Many of the strange effects of weak memory are introduced by compilers.

it is also assuming that, moreover, LLVM will not exploit the freedom it gets from the model during optimization. That is the part I am worried about.

If you would have written assembly code, I would agree with you that your code is fine. But anyway, in assembly, relaxed and non-atomic accesses are the same thing, so the question is moot there.


#6

Do you have an example of how it could be problematic?


#7

A hypothetical example? Yes, compilers are allowed to turn

x = 0;

into

x = 1;
x = 0;

Now, a realistic example… that will be harder. One could imagine a compiler moving an assignment out of a conditional if it knows that it will be overwritten later anyway:

if foo {
  x = 0;
} else {
  x = 1;
  bar();
}

could be turned into

x = 1;
if foo {
  x = 0;
} else {
  bar();
}

Still seems somewhat dump, but compilers do move assignments out of conditionals (and that is the reason for some of the quirks in relaxed accesses).

I will ask some people if they can come up with more realistic examples.


#8

I believe this sort of trickery could be OK under the Java memory model, which banishes undefined behavior but otherwise does not guarantee much about shared non-volatile memory. The LLVM memory model has a corresponding ordering unordered that’s weaker than relaxed ordering (called monotonic there). It still come with some optimizer restrictions though. On the other hand, I am not sure whether the observed slowdown of relaxed access is inherent or just some passes being more conservative than they would have to be.


#9

All mutations are done with atomic ops and will never set the bits in question to what is checked in the race. Any possible read of invalid bits happens before the racey read.


#10

I think it is mostly passes being conservative. But that does not help.

Oh, I had forgotten about that. Any reason why Rust doesn’t expose this (unstably, at least)?

Yeah, that does rule out my artificial examples.

Technically speaking, the very first mutation to initialized the AtomicPtr is done with a non-atomic write. Not sure if that makes any difference though.


#11

Right, but that write happens-before all reads.


#12

The read still sees a whole bunch of writes to pick from, at least one of them non-atomic. Some of those writes might happen-after the initial write, but they don’t happen-before the non-atomic read, so both the initial write and the atomic write are still candidates.


#13

It could pick from every single possible write in the history and still be correct as long as bits don’t get randomly set. Every single write ensures to not set the bits that get checked by is_inline_or_static.


#14

Take a u8 value with the LSB being only ever set on initialization. If the LSB is set, then the entire value is immutable.

  1. An initial value is set, with or without the LSB.
  2. All reads happens-before the initial write
  3. All writes are atomic, acquire the initial write, and write to the cell only if the LSB is not set (and they will never set the LSB).

This can only be a problem if:

  1. The LSB is unset on initialization (indicating mutability).
  2. An atomic write that happens-after initialization writes to the value, while ensuring the LSB remains unset.
  3. A read (unsynchronized) that happens-after the initialization can see the LSB set even though no atomic write will set it.

#15

I get that. That would require two changes to the LLVM model:

  • Define data races and reading undef on a per-bit instead of per-byte basis. I do not foresee any principal problems with this, but could be missing something.
  • Define non-atomic reads that can read from multiple identical writes to just use that value. So undef only happens where there are at least two different writes that the non-atomic read could pick from. This is the part where I am worried about losing optimizations. I just don’t know them well enough to estimate why the rule here is the way it is right now.

#16

I’m probably naive, but I just don’t see how the code can’t work unless LLVM injects made up values.


#17

I am approaching this not from the “what does LLVM do” perspective, but from the “what does LLVM promise” perspective. This may seem like language lawyering, but as a formal methods person I am more interested in figuring out if it is possible to write the “contract” between you and the compiler in a way that it covers your use-case, than in determining if LLVM currently happens to do the thing you want it to do. As mentioned above, I cannot currently come up with an optimization that would break your code, but there is a large and interesting space of programs that are UB but are not getting miscompiled (where the latter is based on experimental evidence and being unable to come up with a counter-example, not an in-depth analysis of the compiler), and you provided a very nice example for such a program. :slight_smile:

The thing is, LLVM is perfectly allowed to inject made-up values if it can show that that doesn’t change program behavior. I gave some examples of that above. And while that sounds like a stupid thing to do, there are circumstances where it can be beneficial to “speculatively” write some value (moving a write out of a conditional branch), and that can look a lot like injecting made-up values (because if we only look at the one execution trace where the speculation is wrong, that speculative write does come out of nowhere).

I’ve too often been surprised by what compilers do to just assume that they won’t do horrendously strange-looking transformations.^^


#18

One of my colleagues came up with an example that breaks at least “if all writes have the same value then use that value”. Using it to break byte requires some creativity.

Consider the following program (two threads running in parallel):

1 {
2 X_na = 1; // meaning "non-atomic write to X"
3 X_na = 0; // consider this the "initialization write"
4 Y_rel = 1; // release-write to Y
5 X_na = 0;
6 }
7 ||
8 {
9   if(Y_acq){ // acquire-read from Y
10      a = X_na;
11  }
12}

When executing line 10, we see two possible writes to pick from (the ones from lines 3 and 5). Both have the same value. However, LLVM can optimize the first thread to remove line 3: That write is redundant because it gets overwritten in line 5, and the fact that there is a release write does not matter because lines 3 and 5 are non-atomic accesses.


We can almost, but not quite get there with bytes. Something not too dissimilar from the above situation can still arise with byte. Like, when I do

let mut buf = BytesMut::with_capacity(1);
buf = BytesMut::with_capacity(1024);

and when the compiler inlines all of that and reuses the same storage for the second buffer, the buf.inner.arc will first be KIND_INLINE and then later KIND_VEC (matching lines 2 and 3 of the example above). If we then send a pointer to the buf away, that will look like lines 4 and 9.

The one thing we do not get, I think, is line 5. We can, I guess trigger an atomic write to buf.inner.arc by doing something with buf that would make it change the pointer, but I am not familiar enough with this data structure to say what that would be and I am also not sure if the optimization mentioned above is still legal when line 5 is changed to be atomic (I asked my colleague).

Still, the fact that you can overwrite an existing BytesMut with a new one means that the KIND of one of these pointers can change, if you consider the entire lifetime of the location it is stored into (which you have to). So it’s not like all writes have the same KIND, just the ones that the non-atomic read can choose from – and as the example above shows, that on its own is not sufficient to make your argument work. It might be that BytesMut does something else that would be a sufficient condition, but it’s certainly more complex than we thought.


#19

You are correct that it is a data race according to the LLVM model. It is assuming that, in practice, no chip will randomly set the bits to random values.

Ehm… what this actually is assuming is that LLVM will never be able to tell that this code has an unconditional data-race in a multi-threaded context and can therefore not be reached along with all code leading to it.

We should definitely fill in a missed optimization bug in LLVM with this example (spawning some threads) because all of it should compile to nothing.


We should also find out a way to make @carllerche’s example not have UB and still produce efficient code. The unordered ordering definitely looks worth exploring.


#20

Yes pretty much…


I thought more about it, and there is a way that I could probably restructure the internals of Bytes to avoid the race, but it would take a large chunk of work. Unless, in practice, there is a threat today, I don’t think I’ll be able to prioritize the work.