Misaligned access costs

AngelicosPhosphoros · July 3, 2022, 4:28pm

I came up with this idea while working on PR https://github.com/rust-lang/rust/pull/98824

Basically, implementing std::mem::swap using read_unaligned and write_unaligned greatly reduces the number of memory reads and writes, and it also reduces the generated code size: Compiler Explorer

Also it solves this problem: rust/swap-small-types.rs at b04bfb4aea99436a62f6a98056e805eb9b0629cc · rust-lang/rust · GitHub
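To make the idea concrete, here is a minimal sketch of a swap built on read_unaligned and write_unaligned. It is not the code from the PR: the name `swap_chunked` and the choice of `usize` as the chunk size are assumptions made for illustration, and the sketch assumes any pointers inside `T` sit at pointer-aligned offsets (so no packed structs).

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

/// Hypothetical helper (not the std implementation): swaps `a` and `b`
/// in `usize`-sized chunks using unaligned loads and stores, then
/// finishes the remaining tail one byte at a time.
pub fn swap_chunked<T>(a: &mut T, b: &mut T) {
    const CHUNK: usize = size_of::<usize>();
    let len = size_of::<T>();
    // Work on possibly-uninitialized bytes so padding is never "read" as u8.
    let pa = (a as *mut T).cast::<MaybeUninit<u8>>();
    let pb = (b as *mut T).cast::<MaybeUninit<u8>>();
    let mut i = 0;
    // SAFETY: `a` and `b` are distinct live `&mut T`, so both regions are
    // valid for `len` bytes of reads and writes and do not overlap.
    // Every access below stays within `0..len`.
    unsafe {
        while i + CHUNK <= len {
            let ca = pa.add(i).cast::<MaybeUninit<usize>>();
            let cb = pb.add(i).cast::<MaybeUninit<usize>>();
            // The chunk pointers may be under-aligned for usize,
            // hence the unaligned variants.
            let va = ptr::read_unaligned(ca);
            let vb = ptr::read_unaligned(cb);
            ptr::write_unaligned(ca, vb);
            ptr::write_unaligned(cb, va);
            i += CHUNK;
        }
        // Tail shorter than one chunk: swap byte by byte.
        while i < len {
            ptr::swap_nonoverlapping(pa.add(i), pb.add(i), 1);
            i += 1;
        }
    }
}
```

With this sketch, a `(u16, u16, u16)` on a 64-bit target is handled entirely by the byte tail, while a 19-byte array takes two 8-byte chunks plus three tail bytes. A real implementation would likely pick smaller power-of-two chunks for the tail instead of single bytes.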

Also, there is a 10-year-old article which claims that misaligned reads and writes are free on newer x86-64 processors: Data alignment for speed: myth or reality? – Daniel Lemire's blog

There is also an article which says that unaligned access is slow on ARMv7 and not supported at all on older ARM cores: The curious case of unaligned access on ARM | by Levente Kurusa | Medium

As I understand it, the main problem with misaligned accesses is that a single access may touch 2 cache lines instead of 1. But this concern is not very important for a std::mem::swap implementation, because we would load the whole values into exclusively held cache lines anyway.
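The cache line concern comes down to simple address arithmetic. The sketch below checks whether an access of a given size straddles a line boundary; the 64-byte line size and the name `crosses_cache_line` are assumptions for illustration (64 bytes is typical for current x86-64 parts, but it is not guaranteed).

```rust
/// Assumed cache line size in bytes; typical for x86-64, not universal.
pub const CACHE_LINE: usize = 64;

/// Hypothetical helper: returns true if an access of `size` bytes starting
/// at address `addr` touches more than one cache line.
pub fn crosses_cache_line(addr: usize, size: usize) -> bool {
    if size == 0 {
        return false;
    }
    // Compare the line index of the first and the last byte accessed.
    let first_line = addr / CACHE_LINE;
    let last_line = (addr + size - 1) / CACHE_LINE;
    first_line != last_line
}
```

For example, an 8-byte load at address 56 stays within one line, while the same load at address 60 covers bytes 60 through 67 and therefore touches two lines.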

So: should I implement our std::mem::swap for x86_64 using unaligned reads/writes?

Maybe we even have some compile-time check to put in a cfg guard, something like "is_misaligned_memory_access_fast", and I just don't know about it?
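In the absence of such a flag, the guard could be approximated with an architecture check. This is only a sketch: the function name `unaligned_access_assumed_fast` is made up, and treating x86_64 as the only "fast" target is an assumption taken from the question above, not a measured fact.

```rust
/// Hypothetical stand-in for an "is_misaligned_memory_access_fast" check.
/// Assumption: only x86_64 is treated as having cheap unaligned access.
pub const fn unaligned_access_assumed_fast() -> bool {
    cfg!(target_arch = "x86_64")
}
```

A swap implementation could branch on this constant, and the compiler would drop the unused path because the value is known at compile time.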

AngelicosPhosphoros · July 3, 2022, 4:37pm

Am I right that these 2 things are identical from LLVM's perspective?

let mut v: MaybeUninit<usize> = MaybeUninit::uninit();
copy_nonoverlapping(
       source as *const MaybeUninit<u8>,
       // `v` is a value, not a pointer, so take its address before casting
       v.as_mut_ptr() as *mut MaybeUninit<u8>,
       size_of::<usize>()
);

And

let v: MaybeUninit<usize> = read_unaligned(source as *mut MaybeUninit<usize>);
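For reference, here are both forms as self-contained functions that can be pasted into Compiler Explorer and compared. The function names are made up for this sketch, and it returns a plain `usize` rather than `MaybeUninit<usize>` so the results can be compared directly. It only demonstrates that the two forms load the same value; whether LLVM sees identical IR is exactly the open question.

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

/// Form 1: byte-wise copy into a local, then assume it is initialized.
///
/// # Safety
/// `source` must be valid for reading `size_of::<usize>()` initialized bytes.
pub unsafe fn load_via_copy(source: *const u8) -> usize {
    let mut v = MaybeUninit::<usize>::uninit();
    unsafe {
        ptr::copy_nonoverlapping(source, v.as_mut_ptr().cast::<u8>(), size_of::<usize>());
        v.assume_init()
    }
}

/// Form 2: a single unaligned typed read.
///
/// # Safety
/// Same requirement as `load_via_copy`.
pub unsafe fn load_via_read_unaligned(source: *const u8) -> usize {
    unsafe { ptr::read_unaligned(source.cast::<usize>()) }
}
```

Calling both on a pointer that is deliberately offset by one byte into a buffer yields the same value from each.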

scottmcm · July 3, 2022, 5:19pm

What I really want is Wishlist: allow adding intrinsics in a way that doesn't break every backend · Issue #93145 · rust-lang/rust · GitHub -- that way we could just go "oh, this is swap" in the backend, and emit something more appropriate.

For example, that might be i48 for (i16, i16, i16) in LLVM, or a loop of alignment-sized chunks for "too big" values. But in cranelift it might use a different approach, because the copy-by-u8s approach is terrible without vectorization.

AngelicosPhosphoros · July 3, 2022, 5:51pm

I don't know how to implement intrinsics or how they are handled in the compiler, sorry. Aren't they backend-implemented functions?

system · October 1, 2022, 5:51pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
