Reordering of writes via differently-typed pointers

hsivonen · August 13, 2018, 9:42am

In order to accelerate ASCII, encoding_rs reads and writes slices of u8 and u16 via pointers to usize (if SIMD is disabled) or pointers to u8x16 and u16x8 (if SIMD is enabled).

In particular, the writes need to happen in program order, because there are situations where the crate first does a larger-typed write (writes a usize, u8x16 or u16x8) and then partially overwrites the larger write using writes of a smaller type (u8 or u16). Such overwriting of the larger write with smaller writes doesn’t take place within the same function, but the overwriting always happens from the caller of the function that did the larger writes. These all get inlined, though.

All this works, but is there any guarantee of it working?

Is there a SIMD-specific rule guaranteeing that reads and writes of u8 and u8x16 (or u16 and u16x8) can’t be reordered relative to each other as the SIMD type is clearly related to the basic type?

Is there some rule that function scopes establish some write ordering barriers even when the functions get inlined? (This sort of thing could explain why the usize writes work.)

If there are no such guarantees and the code is now working by luck, does there exist some kind of thread-internal write barrier that I could explicitly use after a sequence of large writes to indicate that when the code is optimized after inlining, differently-typed writes must not be reordered across the barrier?

hanna-kruppe · August 13, 2018, 10:19am

As far as I am aware, Rust has never even seriously considered C/C++ style “strict aliasing” based on the pointee types. It certainly isn’t part of any of the proposals in the last year or two. That’s because

such rules are very annoying/restrictive in many kinds of system software (such as your example)
Rust has a much better source of even more accurate aliasing information: &mut T, and also &T freezing the memory it refers to except for UnsafeCell

kornel · August 13, 2018, 1:05pm

Rust has C-style unions how, which in C are the way to make such hacks legal, so maybe you could cast memory to a slice of unions and write through that?

Ixrec · August 13, 2018, 3:43pm

That gets into unsafe code guidelines; we haven’t really decided what the rules around unions should be yet. I think https://github.com/rust-lang/rfcs/pull/1897 was the last serious attempt to specify them.

RalfJung · August 13, 2018, 5:29pm

As @hanna-kruppe already said, Rust does not do type-based aliasing and (if I have my way) it will never do so. We have unions solely for the purpose of FFI. I see C using unions to “get around” TBAA as more of a hack, and it’s not even free – C has a bunch of really strange rules for accessing unions that are specifically needed to make TBAA work. (Namely, it is only okay to break TBAA if you do it “through the union”; just taking a pointer to a union field and then using that is not enough. This programs most likely has undefined behavior, and it probably does even if you inline read_float.)

So, the part where you first use a u8x16 pointer and then a u8 pointer to access the same memory is perfectly fine. The only thing you need to be worried about is the rules associated with the reference types – here is my latest proposal for these rules, but nothing is decided yet. If you are doing all accesses through raw pointers, there is no problem; with references, you have to take care that during the lifetime of an &mut (from their creation to their last use) no other access is made to the same locations.

RalfJung · August 13, 2018, 5:37pm

Specifically, align_to is actually safe if T and U are both integer types, including integer SIMD types. Maybe we should try to somehow express that with traits, to document this point more clearly? That function lets you “transmute” the aligned part of a slice if u8 to u8x16, for example. It has proper lifetimes so the rules for reference types cannot be violated.

hsivonen · August 14, 2018, 11:39am

I have caller-provided &mut [u8] or &mut [u16]. Then I call into a function that takes a subslice of the slice, calls as_mut_ptr() on the subslice and calls into another function that transmutes the pointer to a pointer to u8x16, u16x8 or usize and writes though the pointer or pointers derived from it via offset(). After returning back to the function that had the &mut [u8] or &mut [u16] mentioned in the first sentence of this paragraph, writes are made by indexing into the slice such that the u8 or u16 writes through the slice may overlap the memory that was written via pointers to the wider types.

Is this OK?

RalfJung · August 14, 2018, 12:23pm

Sorry I don’t follow… could you just show some code?^^

gnzlbg · August 14, 2018, 2:15pm

In C++ pointers to objects of different types are assumed not to alias [0], so the compiler can assume that a *mut u16 won't point to the same memory as *mut u16x8 and optimize accordingly.

In Rust only &mut T provides this guarantee, &T of different types can alias the same memory, and so can raw pointers. That is, a *mut u16 and *mut u16x8 are allowed to alias in Rust. The compiler is allowed to re-order reads and writes via this pointers, only as long as you are not able to tell the difference.

I have caller-provided &mut [u8] or &mut [u16] . Then I call into a function that takes a subslice of the slice, calls as_mut_ptr() on the subslice and calls into another function [...] Is this ok?

As long as there aren't in any given scope two live &mut T aliasing the same memory, then yes, that's ok. If at any point in your program you have two objects a and b of type &mut _ and you are able to access the same memory through both, that's probably undefined behavior.

[0] with the exception of char*, which is allowed to alias any other object - that's why memcpy works.

RalfJung · August 14, 2018, 2:50pm

To expand a bit on this, &mut provides this guarantee completely independent of the type. So two &mut u16 must not alias either, even though they have the same type. This really is a totally different guarantee.

hsivonen · August 16, 2018, 11:31am

Let's look at single-byte to UTF-8 decode:

    pub fn decode_to_utf8_raw(
        &mut self,
        src: &[u8],
        dst: &mut [u8],
        _last: bool,
    ) -> (DecoderResult, usize, usize) {
        let mut source = ByteSource::new(src);
        let mut dest = Utf8Destination::new(dst);
        'outermost: loop {
            match dest.copy_ascii_from_check_space_bmp(&mut source) {
                CopyAsciiResult::Stop(ret) => return ret,
                CopyAsciiResult::GoOn((mut non_ascii, mut handle)) => 'middle: loop {
                    // Start non-boilerplate
                    //
                    // Since the non-ASCIIness of `non_ascii` is hidden from
                    // the optimizer, it can't figure out that it's OK to
                    // statically omit the bound check when accessing
                    // `[u16; 128]` with an index
                    // `non_ascii as usize - 0x80usize`.
                    let mapped =
                        unsafe { *(self.table.get_unchecked(non_ascii as usize - 0x80usize)) };
                    // let mapped = self.table[non_ascii as usize - 0x80usize];
                    if mapped == 0u16 {
                        return (
                            DecoderResult::Malformed(1, 0),
                            source.consumed(),
                            handle.written(),
                        );
                    }
                    let dest_again = handle.write_bmp_excl_ascii(mapped);

So we get a read-only input slice and a writable output slice. We wrap these in ByteSource and Utf8Destination, which know our current position (index) when processing these slices.

At the top of the loop we do dest.copy_ascii_from_check_space_bmp(&mut source), which copies ASCII from src to dst (until the first non-ASCII or running out of space) and advances the position indices within ByteSource and Utf8Destination.

If we didn't run out of space, we get non_ascii (u8) and mut handle. The handle borrows the Utf8Destination and represents an assertion that the destination has space for at least on Basic Multilingual Plane character encoded as UTF-8 (i.e. at least 3 bytes of space). Writing through the handle consumes the handle to make sure there is at most one write transaction per space check.

On the last line quoted here, we do handle.write_bmp_excl_ascii(mapped), which looks like this:

    pub fn write_bmp_excl_ascii(self, bmp: u16) -> &'a mut Utf8Destination<'b> {
        self.dest.write_bmp_excl_ascii(bmp);
        self.dest
}

Which calls into this stuff:

    #[inline(always)]
    fn write_code_unit(&mut self, u: u8) {
        unsafe {
            // OK, because we checked before handing out a handle.
            *(self.slice.get_unchecked_mut(self.pos)) = u;
        }
        self.pos += 1;
    }
    #[inline(always)]
    fn write_ascii(&mut self, ascii: u8) {
        debug_assert!(ascii < 0x80);
        self.write_code_unit(ascii);
    }
    #[inline(always)]
    fn write_bmp(&mut self, bmp: u16) {
        if bmp < 0x80u16 {
            self.write_ascii(bmp as u8);
        } else if bmp < 0x800u16 {
            self.write_mid_bmp(bmp);
        } else {
            self.write_upper_bmp(bmp);
        }
    }
    #[inline(always)]
    fn write_mid_bmp(&mut self, mid_bmp: u16) {
        debug_assert!(mid_bmp >= 0x80);
        debug_assert!(mid_bmp < 0x800);
        self.write_code_unit(((mid_bmp as u32 >> 6) | 0xC0u32) as u8);
        self.write_code_unit(((mid_bmp as u32 & 0x3Fu32) | 0x80u32) as u8);
    }
    #[inline(always)]
    fn write_upper_bmp(&mut self, upper_bmp: u16) {
        debug_assert!(upper_bmp >= 0x800);
        self.write_code_unit(((upper_bmp as u32 >> 12) | 0xE0u32) as u8);
        self.write_code_unit((((upper_bmp as u32 & 0xFC0u32) >> 6) | 0x80u32) as u8);
        self.write_code_unit(((upper_bmp as u32 & 0x3Fu32) | 0x80u32) as u8);
    }
    #[inline(always)]
    fn write_bmp_excl_ascii(&mut self, bmp: u16) {
        if bmp < 0x800u16 {
            self.write_mid_bmp(bmp);
        } else {
            self.write_upper_bmp(bmp);
        }
}

So we see that the write after ASCII happens via *(self.slice.get_unchecked_mut(self.pos)) = u;, where self.slice is the dst slice from decode_to_utf8_raw.

How can this write overlap a usize write, then? Let's look at how the ASCII copying happens.

First, we call into this:

    #[inline(always)]
    pub fn copy_ascii_from_check_space_bmp<'b>(
        &'b mut self,
        source: &mut ByteSource,
    ) -> CopyAsciiResult<(DecoderResult, usize, usize), (u8, Utf8BmpHandle<'b, 'a>)> {
        let non_ascii_ret = {
            let dst_len = self.slice.len();
            let src_remaining = &source.slice[source.pos..];
            let dst_remaining = &mut self.slice[self.pos..];
            let (pending, length) = if dst_remaining.len() < src_remaining.len() {
                (DecoderResult::OutputFull, dst_remaining.len())
            } else {
                (DecoderResult::InputEmpty, src_remaining.len())
            };
            match unsafe {
                ascii_to_ascii(src_remaining.as_ptr(), dst_remaining.as_mut_ptr(), length)
            } {
                None => {
                    source.pos += length;
                    self.pos += length;
                    return CopyAsciiResult::Stop((pending, source.pos, self.pos));
                }
                Some((non_ascii, consumed)) => {
                    source.pos += consumed;
                    self.pos += consumed;
                    if self.pos + 2 < dst_len {
                        source.pos += 1; // +1 for non_ascii
                        non_ascii
                    } else {
                        return CopyAsciiResult::Stop((
                            DecoderResult::OutputFull,
                            source.pos,
                            self.pos,
                        ));
                    }
                }
            }
        };
        CopyAsciiResult::GoOn((non_ascii_ret, Utf8BmpHandle::new(self)))
    }

So the code first takes tail subslices of the source and destination but instead of passing those subslices onward, it passed the start pointers and the minimum of their lengths to ascii_to_ascii, which is generated by the macro invocation ascii_alu!(ascii_to_ascii, u8, u8, ascii_to_ascii_stride);.

It looks like this:

#[allow(unused_macros)]
macro_rules! ascii_alu {
    ($name:ident,
     $src_unit:ty,
     $dst_unit:ty,
     $stride_fn:ident) => (
    #[cfg_attr(feature = "cargo-clippy", allow(never_loop))]
    #[inline(always)]
    pub unsafe fn $name(src: *const $src_unit, dst: *mut $dst_unit, len: usize) -> Option<($src_unit, usize)> {
        let mut offset = 0usize;
        // This loop is only broken out of as a `goto` forward
        loop {
            let mut until_alignment = {
                // Check if the other unit aligns if we move the narrower unit
                // to alignment.
                //               if ::std::mem::size_of::<$src_unit>() == ::std::mem::size_of::<$dst_unit>() {
                // ascii_to_ascii
                let src_alignment = (src as usize) & ALU_ALIGNMENT_MASK;
                let dst_alignment = (dst as usize) & ALU_ALIGNMENT_MASK;
                if src_alignment != dst_alignment {
                    break;
                }
                (ALU_ALIGNMENT - src_alignment) & ALU_ALIGNMENT_MASK
                //               } else if ::std::mem::size_of::<$src_unit>() < ::std::mem::size_of::<$dst_unit>() {
                // ascii_to_basic_latin
                //                   let src_until_alignment = (ALIGNMENT - ((src as usize) & ALIGNMENT_MASK)) & ALIGNMENT_MASK;
                //                   if (dst.offset(src_until_alignment as isize) as usize) & ALIGNMENT_MASK != 0 {
                //                       break;
                //                   }
                //                   src_until_alignment
                //               } else {
                // basic_latin_to_ascii
                //                   let dst_until_alignment = (ALIGNMENT - ((dst as usize) & ALIGNMENT_MASK)) & ALIGNMENT_MASK;
                //                   if (src.offset(dst_until_alignment as isize) as usize) & ALIGNMENT_MASK != 0 {
                //                       break;
                //                   }
                //                   dst_until_alignment
                //               }
            };
            if until_alignment + ALU_STRIDE_SIZE <= len {
                // Moving pointers to alignment seems to be a pessimization on
                // x86_64 for operations that have UTF-16 as the internal
                // Unicode representation. However, since it seems to be a win
                // on ARM (tested ARMv7 code running on ARMv8 [rpi3]), except
                // mixed results when encoding from UTF-16 and since x86 and
                // x86_64 should be using SSE2 in due course, keeping the move
                // to alignment here. It would be good to test on more ARM CPUs
                // and on real MIPS and POWER hardware.
                while until_alignment != 0 {
                    let code_unit = *(src.offset(offset as isize));
                    if code_unit > 127 {
                        return Some((code_unit, offset));
                    }
                    *(dst.offset(offset as isize)) = code_unit as $dst_unit;
                    offset += 1;
                    until_alignment -= 1;
                }
                let len_minus_stride = len - ALU_STRIDE_SIZE;
                loop {
                    if let Some(num_ascii) = $stride_fn(src.offset(offset as isize) as *const usize,
                                   dst.offset(offset as isize) as *mut usize) {
                        offset += num_ascii;
                        return Some((*(src.offset(offset as isize)), offset));
                    }
                    offset += ALU_STRIDE_SIZE;
                    if offset > len_minus_stride {
                        break;
                    }
                }
            }
            break;
        }
        while offset < len {
            let code_unit = *(src.offset(offset as isize));
            if code_unit > 127 {
                return Some((code_unit, offset));
            }
            *(dst.offset(offset as isize)) = code_unit as $dst_unit;
            offset += 1;
        }
        None
    });
}

We see that when $stride_fn fails, it returns the number of ASCII bytes in the stride. The stride function looks like this:

        #[inline(always)]
        unsafe fn ascii_to_ascii_stride(src: *const usize, dst: *mut usize) -> Option<usize> {
            let word = *src;
            let second_word = *(src.offset(1));
            *dst = word;
            *(dst.offset(1)) = second_word;
            find_non_ascii(word, second_word)
        }

And there it is: The function writes the stride unconditionally and then examines it for non-ASCII. So in the case where non-ASCII was found, the usize writes made here are overlapped with writes of u8 made in

fn write_code_unit(&mut self, u: u8) {
        unsafe {
            // OK, because we checked before handing out a handle.
            *(self.slice.get_unchecked_mut(self.pos)) = u;
        }
        self.pos += 1;
    }

which we saw earlier.

hsivonen · August 16, 2018, 11:35am

(When researching the post above, I noticed that I had removed all other overlapping writes based on benchmarking. IIRC, this particular overlapping write was a perf win on ARMv7 without NEON.)

RalfJung · August 17, 2018, 4:10pm

Thanks! I think I mostly followed along.

I cannot see any problem with this. As already mentioned, the types of the accesses (the T part of &[mut] T) does not matter – what matters, however, is the & vs &mut and how the reference was created. But your unsafe code seems to use raw pointers entirely, and not have any long-lived references in there that “overlap” in their lifetime with a raw pointer with overlapping destination.

Did you have a look at my recent Stacked Borrows proposal? It essentially proposes as set of rules for the kind of aliasing that should matter for Rust (modulo some horrible details around pointer-integer-casts and “guessing” the integer address of an allocation, which do not seem relevant here). These are the rules I have been trying to apply to your code here, and could not find any issue. Expressed in terms of reordering, the intention is that the compiler can reorder to accesses if at least one of them happens through a reference (as opposed to a raw pointer) and neither ref/ptr is “derived from” the other. “Derived from” is essentially defines as &[mut] foo being derived from whatever went into computing foo.

hsivonen · September 5, 2018, 7:44am

Thank you.

I did, but afterwards I still wasn't confident I understood it well enough to assess how it applies to my code.

system · March 25, 2019, 8:30am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Exploit the padding?	50	2603	September 28, 2021
Safe trasnsmute for slices; e.g. &[u64] -> &[u32], particularly SIMD types libs	12	5327	March 25, 2019
Memcpy is backwards	59	14087	March 25, 2019
Stacked Borrows: An Aliasing Model For Rust Unsafe Code Guidelines	49	7564	March 25, 2019
How much more tricky is `unsafe` code than C code? (Aliases and `const_cast`) internals	5	1830	March 25, 2019

Reordering of writes via differently-typed pointers

Related topics