Misaligned access costs

I came up with this idea while working on PR https://github.com/rust-lang/rust/pull/98824

Basically, implementing std::mem::swap using read_unaligned and write_unaligned greatly reduces the number of memory reads and writes and also shrinks the generated code: Compiler Explorer
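To make the idea concrete, here is a minimal sketch of a swap built from unaligned `usize`-sized loads and stores plus a byte-wise tail. `swap_unaligned` is a hypothetical name used for illustration; it is not the code from the PR and not what std currently does.

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

/// Hypothetical helper: swaps `*a` and `*b` in `usize`-sized chunks using
/// unaligned loads/stores, then finishes the remaining bytes one at a time.
pub fn swap_unaligned<T>(a: &mut T, b: &mut T) {
    let len = size_of::<T>();
    let word = size_of::<usize>();
    let pa = a as *mut T as *mut u8;
    let pb = b as *mut T as *mut u8;
    let mut off = 0;
    // SAFETY: `a` and `b` are distinct `&mut T`, so both are valid for `len`
    // bytes and do not overlap. Chunks are typed as `MaybeUninit<_>` so that
    // padding bytes may be copied without being treated as initialized.
    unsafe {
        while off + word <= len {
            let x = pa.add(off) as *mut MaybeUninit<usize>;
            let y = pb.add(off) as *mut MaybeUninit<usize>;
            let tmp = ptr::read_unaligned(x);
            ptr::write_unaligned(x, ptr::read_unaligned(y));
            ptr::write_unaligned(y, tmp);
            off += word;
        }
        while off < len {
            let x = pa.add(off) as *mut MaybeUninit<u8>;
            let y = pb.add(off) as *mut MaybeUninit<u8>;
            let tmp = ptr::read(x);
            ptr::write(x, ptr::read(y));
            ptr::write(y, tmp);
            off += 1;
        }
    }
}
```

For a 19-byte value on a 64-bit target this does two word-sized exchanges and three byte-sized ones, regardless of the alignment of `T`.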

It also solves this problem: rust/swap-small-types.rs at b04bfb4aea99436a62f6a98056e805eb9b0629cc · rust-lang/rust · GitHub

Also, there is a 10-year-old article which claims that misaligned reads and writes are free on newer x86-64 processors: Data alignment for speed: myth or reality? – Daniel Lemire's blog

And there is also an article which says that unaligned access is slow on ARMv7 and is not supported at all on older ARMs: The curious case of unaligned access on ARM | by Levente Kurusa | Medium

As I understand it, the main problem with misaligned accesses is that one access may end up touching 2 cache lines instead of 1. But this concern is not very important for the std::mem::swap implementation, because we would load the whole values into exclusive cache lines anyway.
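The cache-line concern can be written down as a small check. This is only a sketch: the 64-byte line size is an assumption (typical for current x86-64 parts, but not guaranteed), and `crosses_cache_line` is a hypothetical helper name.

```rust
/// Assumed cache-line size in bytes. 64 is common, but it is an assumption here.
const CACHE_LINE: usize = 64;

/// Returns true if an access of `size` bytes starting at address `addr`
/// touches more than one cache line of `CACHE_LINE` bytes.
pub fn crosses_cache_line(addr: usize, size: usize) -> bool {
    if size == 0 {
        return false;
    }
    // Compare the line index of the first and the last byte accessed.
    addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE
}
```

An 8-byte load at offset 56 stays within one line, while the same load at offset 60 straddles two. A naturally aligned 8-byte load can never straddle a line, which is why alignment avoids the problem.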

So: should I implement our std::mem::swap for x86_64 using unaligned reads/writes?

Maybe we even have some compile-time check to put in a cfg guard, like "is_misaligned_memory_access_fast", and I just don't know about it?
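Failing a dedicated check, the guard could key off `target_arch` alone. A minimal sketch, assuming x86_64 is the only target we treat as having cheap misaligned access; `ASSUME_FAST_UNALIGNED` and `swap_strategy` are hypothetical names, not existing rustc cfgs or std items.

```rust
/// Hypothetical flag, not an existing rustc cfg: true on targets where this
/// sketch simply assumes that misaligned loads and stores are cheap.
pub const ASSUME_FAST_UNALIGNED: bool = cfg!(target_arch = "x86_64");

/// Picks which (hypothetical) swap strategy would be used on this target.
pub fn swap_strategy() -> &'static str {
    if ASSUME_FAST_UNALIGNED {
        "unaligned-chunks"
    } else {
        "aligned-chunks"
    }
}
```

Because `cfg!` expands to a constant, the untaken branch should be trivially removable by the optimizer. An attribute-level `#[cfg(target_arch = "x86_64")]` on two separate function definitions would be the alternative.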

Am I right that these two things are identical from LLVM's perspective?

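let mut v: MaybeUninit<usize> = MaybeUninit::uninit();
// Copy size_of::<usize>() bytes from `source` into `v`; a byte-wise copy has no alignment requirement.
copy_nonoverlapping(
    source as *const MaybeUninit<u8>,
    v.as_mut_ptr() as *mut MaybeUninit<u8>,
    size_of::<usize>(),
);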

And

let v: MaybeUninit<usize> = read_unaligned(source as *mut MaybeUninit<usize>);
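For comparison, here are both forms as standalone functions that can be pasted into Compiler Explorer side by side. This is a sketch with hypothetical names (`load_via_copy`, `load_via_read_unaligned`). It returns a plain `usize` for easy comparison, which assumes the source bytes are initialized. It only shows that the two forms yield the same value, not that LLVM receives identical IR for them.

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr::{copy_nonoverlapping, read_unaligned};

/// Loads a `usize` by copying its bytes into a local buffer.
///
/// # Safety
/// `source` must be valid for reads of `size_of::<usize>()` initialized bytes.
pub unsafe fn load_via_copy(source: *const u8) -> usize {
    let mut v: MaybeUninit<usize> = MaybeUninit::uninit();
    unsafe {
        copy_nonoverlapping(source, v.as_mut_ptr() as *mut u8, size_of::<usize>());
        v.assume_init()
    }
}

/// Loads a `usize` with a single unaligned read.
///
/// # Safety
/// Same requirements as `load_via_copy`.
pub unsafe fn load_via_read_unaligned(source: *const u8) -> usize {
    unsafe { read_unaligned(source as *const usize) }
}
```

Neither function requires `source` to be aligned for `usize`, so both can be called at any byte offset inside a buffer.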

What I really want is "Wishlist: allow adding intrinsics in a way that doesn't break every backend" (Issue #93145 · rust-lang/rust · GitHub). That way we could just go "oh, this is swap" in the backend, and emit something more appropriate.

For example, that might be i48 for (i16, i16, i16) in LLVM, or a loop of alignment-sized chunks for "too big" values. But in cranelift it might use a different approach, because the copy-by-u8s approach is terrible without vectorization.
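The "loop of alignment-sized chunks" option can be sketched in library code as follows. This is an illustration under assumptions, not what any backend actually emits; `swap_by_alignment` and `swap_chunks` are hypothetical names. It relies on the fact that `size_of::<T>()` is always a multiple of `align_of::<T>()`, so no byte tail is needed.

```rust
use std::mem::{align_of, size_of, MaybeUninit};
use std::ptr;

/// Swaps `bytes` bytes between `a` and `b` in aligned chunks of type `C`.
///
/// # Safety
/// Both pointers must be aligned for `C`, valid for `bytes` bytes, and
/// non-overlapping, and `bytes` must be a multiple of `size_of::<C>()`.
unsafe fn swap_chunks<C>(a: *mut u8, b: *mut u8, bytes: usize) {
    let a = a as *mut MaybeUninit<C>;
    let b = b as *mut MaybeUninit<C>;
    for i in 0..bytes / size_of::<C>() {
        unsafe {
            let tmp = ptr::read(a.add(i));
            ptr::write(a.add(i), ptr::read(b.add(i)));
            ptr::write(b.add(i), tmp);
        }
    }
}

/// Hypothetical helper: picks the chunk type from the alignment of `T`.
pub fn swap_by_alignment<T>(a: &mut T, b: &mut T) {
    let len = size_of::<T>();
    let pa = a as *mut T as *mut u8;
    let pb = b as *mut T as *mut u8;
    // SAFETY: `a` and `b` are distinct `&mut T`. Alignments are powers of two
    // and `len` is a multiple of `align_of::<T>()`, so each arm's chunk type
    // has an alignment no larger than `T`'s and a size that divides `len`.
    unsafe {
        match align_of::<T>() {
            1 => swap_chunks::<u8>(pa, pb, len),
            2 => swap_chunks::<u16>(pa, pb, len),
            4 => swap_chunks::<u32>(pa, pb, len),
            _ => swap_chunks::<u64>(pa, pb, len),
        }
    }
}
```

With this scheme (i16, i16, i16) is swapped as three `u16` chunks, which is exactly the case where a single i48-style access or an unaligned word access would need fewer instructions.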

I don't know how to implement intrinsics or how they are handled in the compiler, sorry. Aren't they backend-implemented functions?