Unaligned SIMD (SSE2 in particular) loads/stores


#1

It appears that LLVM doesn’t have intrinsics for unaligned loads and stores, so the user of a high-level language can’t talk directly to LLVM to request unaligned loads and stores via something like the link_llvm_intrinsics feature. Instead, the compiler for the high-level language needs to provide the means to have the kind of LLVM IR generated that eventually compiles to unaligned load/store instructions.

Looking at emmintrin.h, e.g. the Intel-defined _mm_loadu_si128 SSE2 intrinsic doesn’t map to a __builtin call but to a dereference of a pointer to a single-member struct annotated with __attribute__((__packed__, __may_alias__)).

The simd crate uses the same pattern for the same purpose with a single-member #[repr(packed)] Rust struct. In debug builds, this works if the result of the dereference is assigned to a local variable before extracting the single member. Using an expression without the intermediate variable fails, though. Furthermore, AFAICT, debug mode doesn’t actually emit a MOVDQU instruction but accomplishes the results of the computation by other means.

At present (did it work pre-MIR?), that pattern fails in release mode. The load is emitted as MOVDQA, which requires 16-byte alignment.
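For reference, the single-member packed-struct pattern described above can be sketched roughly as follows. This is only an illustration: `Unaligned` and `load_u128_packed` are hypothetical names, and `u128` stands in for a 128-bit SIMD vector type so that the sketch does not depend on any SIMD crate. Whether the load ends up as MOVDQU, MOVDQA, or something else depends on the compiler version and optimization level.

```rust
// Hypothetical wrapper: `packed` lowers the struct's alignment to 1, so a
// pointer to it may legally point at any address.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct Unaligned(u128);

/// Reads 16 bytes starting at `bytes[offset]` without assuming any alignment.
/// `u128` is a stand-in for a SIMD vector type (an assumption of this sketch).
fn load_u128_packed(bytes: &[u8], offset: usize) -> u128 {
    assert!(bytes.len() >= 16 && offset <= bytes.len() - 16);
    // SAFETY: the range is in bounds (checked above) and `Unaligned` has
    // alignment 1, so dereferencing the cast pointer is valid.
    let wrapper: Unaligned = unsafe { *(bytes.as_ptr().add(offset) as *const Unaligned) };
    // Copy the single member out of the intermediate local variable.
    wrapper.0
}
```

Note that the dereference is assigned to a local first and the member is extracted afterwards, mirroring the form that is reported above to work in debug builds.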

Given the past and the clang approach, the obvious way forward would be to make the #[repr(packed)] pattern work with MIR. However, making things work for packed structs generally seems over-complex considering the narrower goal of accomplishing unaligned SIMD loads/stores and too much of an obscure incantation from the language user perspective.

From the language user perspective, it seems to me that having read_unaligned() on *const and *mut and write_unaligned() on *mut would be more obvious and would be consistent with read_volatile() and write_volatile().
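From the user's side, the proposed API might look something like the sketch below. It assumes a toolchain in which `read_unaligned()` and `write_unaligned()` exist as pointer methods with signatures analogous to `read_volatile()` and `write_volatile()`. The helper names are hypothetical, and `u128` is again only a stand-in for a SIMD vector type.

```rust
/// Loads a `u128` from `bytes[offset..offset + 16]` at any alignment.
/// Hypothetical helper; assumes `read_unaligned()` is available on `*const T`.
fn load_u128_unaligned(bytes: &[u8], offset: usize) -> u128 {
    assert!(bytes.len() >= 16 && offset <= bytes.len() - 16);
    // SAFETY: the 16-byte range is in bounds; no alignment is required.
    unsafe { (bytes.as_ptr().add(offset) as *const u128).read_unaligned() }
}

/// Stores `value` into `bytes[offset..offset + 16]` at any alignment.
/// Hypothetical helper; assumes `write_unaligned()` is available on `*mut T`.
fn store_u128_unaligned(bytes: &mut [u8], offset: usize, value: u128) {
    assert!(bytes.len() >= 16 && offset <= bytes.len() - 16);
    // SAFETY: the 16-byte range is in bounds; no alignment is required.
    unsafe { (bytes.as_mut_ptr().add(offset) as *mut u128).write_unaligned(value) }
}
```

The call sites read the same way as the volatile variants, which is the consistency argument made above.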

Looking at the LLVM IR clang generates for _mm_loadu_si128 vs. _mm_load_si128, it seems that the difference between eventual MOVDQU vs. MOVDQA instruction generation is annotating the LLVM load and store instructions with align 1 instead of align 16. It seems to me that it should be possible to add rust-intrinsics unaligned_load and unaligned_store next to volatile_load and volatile_store and make the new intrinsics generate LLVM load and store with align 1. Then these could be exposed on *const and *mut in the same manner as the volatile variants.

Does this seem like an OK way forward?


#2

Having read_unaligned() and write_unaligned() would be a nice convenience. It seems, though, that you can currently get unaligned loads by casting the source pointer to *const u8 and the target pointer to *mut u8. For example, see this.
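A minimal sketch of the byte-pointer approach, assuming the casts are combined with `std::ptr::copy_nonoverlapping` (the function named later in this thread). The helper name is hypothetical, and `u128` stands in for a SIMD vector type.

```rust
use std::mem;
use std::ptr;

/// Loads a `u128` from `bytes[offset..offset + 16]` by copying bytes, so the
/// source needs no particular alignment. Hypothetical helper for illustration.
fn load_u128_via_bytes(bytes: &[u8], offset: usize) -> u128 {
    assert!(bytes.len() >= 16 && offset <= bytes.len() - 16);
    let mut out: u128 = 0;
    // SAFETY: the source range is in bounds, the destination is a local of
    // exactly 16 bytes, and the two regions cannot overlap.
    unsafe {
        ptr::copy_nonoverlapping(
            bytes.as_ptr().add(offset),    // source viewed as *const u8
            &mut out as *mut u128 as *mut u8, // target cast to *mut u8
            mem::size_of::<u128>(),
        );
    }
    out
}
```

Because both pointers are `u8` pointers, the copy is only required to be byte-aligned; the optimizer is then free to pick a suitable unaligned load.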


#3

I have already opened an RFC for read_unaligned and write_unaligned.


#4

Thank you @jneem and @Amanieu. copy_nonoverlapping addresses my use case.