RISC-V has an annoying peculiarity with respect to misaligned memory operations. Technically, the ISA allows misaligned accesses through the usual scalar load/store instructions, but the spec does not guarantee anything about how they behave in practice. A misaligned access may run fine, be "extremely slow" (the hardware traps and the OS/firmware emulates the access), or even raise a fatal exception. Even with the new Zicclsm extension, misaligned operations are merely guaranteed to work; they may still be "extremely slow". And no extension, existing or planned, provides explicit misaligned load/store instructions. So, practically speaking, RISC-V does not support misaligned loads/stores for performance-sensitive code.
We can see the consequences of this situation in this snippet: Compiler Explorer. A misaligned load of a 64-bit integer takes 22 instructions on RV64! The generated code loads all 8 bytes separately and then combines them together.
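The linked snippet is roughly this (a hypothetical reconstruction; the exact code behind the link may differ):

```rust
/// A plain misaligned load, well-defined in Rust via `read_unaligned`.
/// On RV64 targets without fast misaligned access this lowers to eight
/// byte loads (`lbu`) plus shifts and ORs instead of a single `ld`.
pub fn load_u64(p: *const u64) -> u64 {
    unsafe { p.read_unaligned() }
}
```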
In my code I need to load a `&[u8; 128]` into a `[u64; 16]` using big-endian byte order, and to do it reasonably fast. The naive approach results in horrible codegen because of the issue above, so I had to work around it.
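The naive version was roughly this (a sketch; the exact code may have differed in details):

```rust
/// Naive big-endian block load: one `from_be_bytes` per 8-byte chunk.
fn load_block_naive(block: &[u8; 128]) -> [u64; 16] {
    let mut res = [0u64; 16];
    for (i, chunk) in block.chunks_exact(8).enumerate() {
        // The compiler cannot assume `block` is 8-byte aligned, so on
        // RV64 each of these loads expands into the byte-by-byte
        // sequence shown in the Compiler Explorer snippet above.
        res[i] = u64::from_be_bytes(chunk.try_into().unwrap());
    }
    res
}
```

The workaround I ended up with: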
```rust
use core::{arch::asm, mem::align_of, ptr};

pub fn load_block(block: &[u8; 128]) -> [u64; 16] {
    let p: *const u64 = block.as_ptr().cast();
    if p.is_aligned() {
        load_aligned_block(block)
    } else {
        load_unaligned_block(block)
    }
}

fn load_aligned_block(block: &[u8; 128]) -> [u64; 16] {
    let block_ptr: *const u64 = block.as_ptr().cast();
    debug_assert!(block_ptr.is_aligned());
    let mut res = [0u64; 16];
    for i in 0..16 {
        let val = unsafe { ptr::read(block_ptr.add(i)) };
        res[i] = val.to_be();
    }
    res
}

fn load_unaligned_block(block: &[u8; 128]) -> [u64; 16] {
    let offset = (block.as_ptr() as usize) % align_of::<u64>();
    debug_assert_ne!(offset, 0);
    // Shift amounts for splicing two neighboring aligned words:
    // `off1` drops the bytes before `block`, `off2` is its complement.
    let off1 = (8 * offset) % 64;
    let off2 = (64 - off1) % 64;
    // Round the pointer down to the previous 8-byte boundary.
    let bp: *const u64 = block.as_ptr().wrapping_sub(offset).cast();
    let mut left: u64;
    let mut res = [0u64; 16];
    unsafe {
        asm!(
            // left = ptr::read(bp);
            "ld {left}, 0({bp})",
            // left >>= off1; erases the bytes before `block`
            "srl {left}, {left}, {off1}",
            bp = in(reg) bp,
            off1 = in(reg) off1,
            left = out(reg) left,
            options(pure, nostack, readonly, preserves_flags),
        );
    }
    for i in 0..15 {
        let right = unsafe { ptr::read(bp.add(1 + i)) };
        res[i] = (left | (right << off2)).to_be();
        left = right >> off1;
    }
    let right: u64;
    unsafe {
        asm!(
            // right = ptr::read(bp.add(16));
            "ld {right}, 128({bp})",
            // right <<= off2; erases the bytes past the end of `block`
            "sll {right}, {right}, {off2}",
            bp = in(reg) bp,
            off2 = in(reg) off2,
            right = out(reg) right,
            options(pure, nostack, readonly, preserves_flags),
        );
    }
    res[15] = (left | right).to_be();
    res
}
```
It works fine, but I wonder about the two `asm!` blocks. Technically, they load bits outside of the `block` allocation, which would be instant UB if done in pure Rust. But, in my understanding, doing it in inline assembly should be fine. The out-of-bounds bits are intentionally erased inside the asm blocks using shifts, so they are never observed by pure Rust code; for example, with `offset = 1` the first `ld` reads one byte before `block` and the last one reads seven bytes past its end, and the shifts by `off1 = 8` and `off2 = 56` discard exactly those bytes. And since the `block` reference is guaranteed to be misaligned on this path, the `ld` instructions cannot trigger a page fault: an aligned 8-byte load never crosses a page boundary, and each of these loads covers at least one byte of `block`, so the whole load lands on a page that is known to be mapped (assuming correctness of the reference itself).
Is my understanding correct?