RISC-V has an annoying peculiarity with respect to misaligned memory operations. Technically, the ISA allows misaligned accesses through the usual scalar load/store instructions, but the spec does not guarantee anything about how they behave in practice. A misaligned access may run fine, be "extremely slow" (the hardware traps and the OS/firmware emulates the access), or even raise a fatal exception. Even with the new Zicclsm extension, misaligned operations are merely guaranteed to work; they may still be "extremely slow". And no extension, existing or planned, provides explicit misaligned load/store instructions. So, practically speaking, RISC-V does not support misaligned loads/stores for performance-sensitive code.
We can see the consequences of this situation in this snippet: Compiler Explorer. A misaligned load of a 64-bit integer takes 22 instructions on RV64! The generated code loads all 8 bytes separately and then combines them together.
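The linked snippet is roughly this (a hypothetical reconstruction; the exact code behind the link may differ):

```rust
/// A plain misaligned load, well-defined in Rust via `read_unaligned`.
/// On RV64 targets without fast misaligned access this lowers to eight
/// byte loads (`lbu`) plus shifts and ORs instead of a single `ld`.
pub fn load_u64(p: *const u64) -> u64 {
    unsafe { p.read_unaligned() }
}
```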
In my code I need to load a `&[u8; 128]` into a `[u64; 16]` using big-endian byte order, and to do it reasonably fast. The naive approach results in horrible codegen because of the issue above, so I had to work around it.
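The naive version was roughly this (a sketch; the exact code may have differed in details):

```rust
/// Naive big-endian block load: one `from_be_bytes` per 8-byte chunk.
fn load_block_naive(block: &[u8; 128]) -> [u64; 16] {
    let mut res = [0u64; 16];
    for (i, chunk) in block.chunks_exact(8).enumerate() {
        // The compiler cannot assume `block` is 8-byte aligned, so on
        // RV64 each of these loads expands into the byte-by-byte
        // sequence shown in the Compiler Explorer snippet above.
        res[i] = u64::from_be_bytes(chunk.try_into().unwrap());
    }
    res
}
```

The workaround I ended up with: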
```rust
use core::{arch::asm, mem::align_of, ptr};

pub fn load_block(block: &[u8; 128]) -> [u64; 16] {
    let p: *const u64 = block.as_ptr().cast();
    if p.is_aligned() {
        load_aligned_block(block)
    } else {
        load_unaligned_block(block)
    }
}

fn load_aligned_block(block: &[u8; 128]) -> [u64; 16] {
    let block_ptr: *const u64 = block.as_ptr().cast();
    debug_assert!(block_ptr.is_aligned());
    let mut res = [0u64; 16];
    for i in 0..16 {
        let val = unsafe { ptr::read(block_ptr.add(i)) };
        res[i] = val.to_be();
    }
    res
}

fn load_unaligned_block(block: &[u8; 128]) -> [u64; 16] {
    let offset = (block.as_ptr() as usize) % align_of::<u64>();
    debug_assert_ne!(offset, 0);
    // Shift amounts for splicing two neighboring aligned words:
    // `off1` drops the bytes before `block`, `off2` is its complement.
    let off1 = (8 * offset) % 64;
    let off2 = (64 - off1) % 64;
    // Round the pointer down to the previous 8-byte boundary.
    let bp: *const u64 = block.as_ptr().wrapping_sub(offset).cast();
    let mut left: u64;
    let mut res = [0u64; 16];
    unsafe {
        asm!(
            // left = ptr::read(bp);
            "ld {left}, 0({bp})",
            // left >>= off1; erases the bytes before `block`
            "srl {left}, {left}, {off1}",
            bp = in(reg) bp,
            off1 = in(reg) off1,
            left = out(reg) left,
            options(pure, nostack, readonly, preserves_flags),
        );
    }
    for i in 0..15 {
        let right = unsafe { ptr::read(bp.add(1 + i)) };
        res[i] = (left | (right << off2)).to_be();
        left = right >> off1;
    }
    let right: u64;
    unsafe {
        asm!(
            // right = ptr::read(bp.add(16));
            "ld {right}, 128({bp})",
            // right <<= off2; erases the bytes past the end of `block`
            "sll {right}, {right}, {off2}",
            bp = in(reg) bp,
            off2 = in(reg) off2,
            right = out(reg) right,
            options(pure, nostack, readonly, preserves_flags),
        );
    }
    res[15] = (left | right).to_be();
    res
}
```
It works fine, but I wonder about the two `asm!` blocks. Technically, they load bits outside of the `block` allocation, which would be instant UB if done in pure Rust. But, in my understanding, doing it in inline assembly should be fine. The out-of-bounds bits are intentionally erased inside the asm blocks using shifts, so they are never observed by pure Rust code; for example, with `offset = 1` the first `ld` reads one byte before `block` and the last one reads seven bytes past its end, and the shifts by `off1 = 8` and `off2 = 56` discard exactly those bytes. And since the `block` reference is guaranteed to be misaligned on this path, the `ld` instructions cannot trigger a page fault: an aligned 8-byte load never crosses a page boundary, and each of these loads covers at least one byte of `block`, so the whole load lands on a page that is known to be mapped (assuming correctness of the reference itself).
Is my understanding correct?