I'd like LLVM's `freeze` operation to be exposed at the Rust level. I've read through some past threads, such as this one, which discussed freezing arbitrarily sized memory regions — something LLVM's `freeze` is not designed for. I'm instead looking for the following operation, which seems to be a perfect fit for LLVM's `freeze` semantics:

```rust
unsafe fn freeze<T>(v: MaybeUninit<T>) -> T
```
It would be sufficient even to expose this on a limited set of types `T`, such as just the primitive types `u8`/`u16`/`u32`/`u64`/`u128`, plus the 256-bit and 512-bit vector register types, perhaps exposed via `T = [u8; 32]` and `T = [u8; 64]`, or via `T = std::simd::u8x32` and `T = std::simd::u8x64`.
My questions are:
- Do you have any initial feedback on adding such a function?
- How would I go about proposing such functions and implementing them in rustc?
- Can you confirm that my workarounds below (section "Workarounds") are sound?
Additionally, I'll provide some motivation for why I need this.
## Motivation
In high-performance branchfree code (typically SIMD, but sometimes also SIMD-within-a-register) we often want to use wide memory reads that read both initialized data and uninitialized data. While processing this data, we typically mask out the uninitialized data, e.g. by bitwise AND.
Two examples follow.
### Example 1: fast hashing/equality/memcpy/etc. on types with padding
Given an 8-byte type with internal padding, we can implement fast hashing/equality/memcpy on it by loading the 8 bytes into a u64 register and masking out the padding:
```rust
use std::mem::transmute;

#[repr(C)]
struct TypeWithPadding(u8, u32);

fn masked_load(x: &TypeWithPadding) -> u64 {
    // What we want to do, but sadly it's unsound (i.e. undefined behavior):
    let x = unsafe { transmute::<&TypeWithPadding, &u64>(x) };
    let mask = 0xFF_FF_FF_FF_00_00_00_FF; // little-endian machine
    (*x) & mask
}

fn fast_hash(x: &TypeWithPadding) -> u64 {
    some_u64_hash(masked_load(x)) // some_u64_hash: any u64 -> u64 hash
}

fn fast_eq(x: &TypeWithPadding, y: &TypeWithPadding) -> bool {
    masked_load(x) == masked_load(y)
}

fn fast_copy(dst: &mut TypeWithPadding, src: &TypeWithPadding) {
    let dst = unsafe { transmute::<&mut TypeWithPadding, &mut u64>(dst) };
    *dst = masked_load(src);
}
```
On x86, these hash/copy/eq implementations are typically faster than the variant that carefully avoids reading uninitialized data, because the careful variant has to issue two memory loads (one for the `u8` and one for the `u32`), whereas the fast variant issues just one memory load (the `u64`) and can then operate on wide registers.
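For comparison, here is a sketch of the careful variant I have in mind; it reads only the initialized fields and recombines them in registers (the helper name `careful_load` is mine):

```rust
#[repr(C)]
struct TypeWithPadding(u8, u32);

// The "careful" variant: two narrow loads (one u8, one u32), recombined
// in registers. Sound, but typically costs an extra load plus a shift/or.
fn careful_load(x: &TypeWithPadding) -> u64 {
    // Reproduces the little-endian masked layout: byte 0 = u8 field,
    // bytes 4..8 = u32 field, padding bytes read as zero.
    (x.0 as u64) | ((x.1 as u64) << 32)
}

fn careful_eq(x: &TypeWithPadding, y: &TypeWithPadding) -> bool {
    careful_load(x) == careful_load(y)
}
```

On a little-endian machine this produces the same value as the masked wide load would, so the two variants are interchangeable; the point is only that this one needs two loads.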
Sadly, the `masked_load` function doesn't appear to be possible to implement soundly in Rust, other than via the workarounds I list at the end of this post. With `freeze`, we could implement it soundly, as follows:
```rust
use std::mem::{transmute, MaybeUninit};

fn masked_load(x: &TypeWithPadding) -> u64 {
    // Sound, given a `freeze` with LLVM's semantics: the transmuted reference
    // is only ever read as `MaybeUninit`, and `freeze` turns the uninitialized
    // padding bytes into arbitrary-but-fixed values before we mask them out.
    let x = unsafe { transmute::<&TypeWithPadding, &MaybeUninit<u64>>(x) };
    let v: u64 = unsafe { MaybeUninit::freeze(*x) };
    let mask = 0xFF_FF_FF_FF_00_00_00_FF; // little-endian machine
    v & mask
}
```
### Example 2: variable-length arrays
When processing arrays of bytes, we often want to write vectorized loops, for example operating on 32-byte registers. But what do we do when the array is found at runtime to be <32 bytes long?
A pretty good approach is to arrange for the underlying buffer to be guaranteed >=32 bytes long, with the valid part of the array as a prefix and uninitialized data following. Then, in the <32-byte case, we do a 32-byte vectorized load, mask out the uninitialized data, and operate on the valid data using wide vectors.
We run into the same issue as above: we need to implement `masked_load`, but Rust doesn't seem to give us the tools to implement it soundly.
## Workarounds
Without the `MaybeUninit::freeze` primitive, is there any way to provide something similar in today's Rust?
One approach is inline assembly. For example:
```rust
use std::arch::asm;
use std::mem::MaybeUninit;

unsafe fn load_and_freeze(x: &MaybeUninit<u64>) -> u64 {
    let result: u64;
    asm!(
        "mov {result}, [{x}]", // load through the pointer, not the pointer value
        x = in(reg) x,
        result = lateout(reg) result,
        options(nostack, readonly),
    );
    result
}
```
I believe this is sound, since Rust/LLVM must assume that any inline assembly block could contain a freeze operation. Can you confirm that this is sound?
This workaround has disadvantages in portability, as well as performance: it hardcodes a specific addressing mode rather than allowing instruction selection to pick the best addressing mode for the context.
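Of course, there's also the straightforwardly sound approach that `freeze` is meant to outperform: don't read the uninitialized bytes at all, and instead copy the valid prefix into a zeroed full-width buffer. For completeness, a sketch (the function name is mine):

```rust
// Sound but slower fallback for the <32-byte case: materialize a zeroed
// 32-byte buffer, copy in the valid prefix, then do full-width operations
// on the buffer. The "mask" is implicit in the zero padding, at the cost
// of an extra copy on every load.
fn load_short_into_wide(data: &[u8]) -> [u8; 32] {
    debug_assert!(data.len() < 32);
    let mut buf = [0u8; 32];
    buf[..data.len()].copy_from_slice(data);
    buf
}
```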
Another possible approach is architecture-specific SIMD intrinsics. E.g.:
```rust
use std::arch::x86_64::{__m256i, _mm256_loadu_si256};
use std::mem::MaybeUninit;
use std::simd::u8x32; // nightly portable_simd

#[inline]
#[target_feature(enable = "avx2")] // caller must guarantee AVX2 is available
unsafe fn load_and_freeze_avx2(x: &MaybeUninit<[u8; 32]>) -> u8x32 {
    let v = _mm256_loadu_si256(x as *const _ as *const __m256i);
    std::mem::transmute::<__m256i, u8x32>(v)
}
```
The Intel intrinsics don't specifically document their semantics with respect to uninitialized memory, so I'm not sure whether this is sound, but I suspect it is not. Can you confirm that this is not sound?