That may be less efficient than what can already be generated. For (u32, u8), perhaps (since that's a 32-bit read/write plus an 8-bit read/write), but take (String, u8), which is four usize wide (three usize for the String, plus the u8 and padding). There it could easily be worse to emit a memcpy for only the first three usize followed by a separate 8-bit read, rather than simply a 4 * size_of::<usize>() memcpy: the latter can be a single AVX2 move or two SSE moves (and only one SSE move on 32-bit x86, where usize is 4 bytes), vs. one SSE move, one qword move, and one byte move for the split copy.
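
For concreteness, a quick check of the sizes involved (repr(Rust) tuple layout is not guaranteed, but current rustc produces exactly this):

```rust
use std::mem::size_of;

fn main() {
    // String is a pointer, capacity, and length: three usize.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    // The trailing u8 pads the pair out to a fourth usize,
    // since the pair's alignment is that of usize.
    assert_eq!(size_of::<(String, u8)>(), 4 * size_of::<usize>());
    // (u32, u8) is 4 + 1 bytes, padded to 8 by the 4-byte alignment.
    assert_eq!(size_of::<(u32, u8)>(), 8);
}
```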
Example codegen (edited to use Intel syntax instead of AT&T):
// Rust signature: extern "sysv64" fn((String, u8)) -> (String, u8)
movups xmm0, [rdi]        ; bytes 0..16 of the pair (String's ptr and cap)
movups xmm1, [rdi + 16]   ; bytes 16..32 (len, the u8, and padding)
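
To round out the fragment, a minimal sketch of both strategies side by side. The register assignments are illustrative only (a source pointer in rdi as in the fragment above, and a hypothetical destination/sret pointer in rsi); the real SysV lowering would pass a 32-byte aggregate on the stack and return it through a hidden sret pointer, but the move count is what matters here:

```asm
; illustrative only: source pair at [rdi], destination (sret) at [rsi]

; whole-pair copy: two SSE moves each way
mov    rax, rsi           ; SysV returns the sret pointer in rax
movups xmm0, [rdi]        ; load bytes 0..16
movups xmm1, [rdi + 16]   ; load bytes 16..32
movups [rsi], xmm0        ; store bytes 0..16
movups [rsi + 16], xmm1   ; store bytes 16..32
ret

; split copy being argued against: memcpy of the first 3 usize,
; then a separate byte access for the u8
mov    rax, rsi
movups xmm0, [rdi]        ; one SSE move: ptr and cap
mov    rcx, [rdi + 16]    ; one qword move: len
mov    dl,  [rdi + 24]    ; one byte move: the u8
movups [rsi], xmm0
mov    [rsi + 16], rcx
mov    [rsi + 24], dl
ret
```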