`freeze(MaybeUninit<T>) -> MaybeUninit<T>` for masked reads

I'd like LLVM's freeze operation to be exposed at the Rust level. I've read through some past threads such as this one which discussed freezing on arbitrarily sized memory regions, which LLVM's freeze is not designed for. I'm instead looking for the following operation, which seems to be a perfect fit for LLVM's freeze semantics:

unsafe fn freeze<T>(v: MaybeUninit<T>) -> T

It would be sufficient even to expose this on a limited set of types T, such as just the primitive types u8/u16/u32/u64/u128 plus the 256-bit and 512-bit vector register types, perhaps exposed via T=[u8; 32] and T=[u8; 64] or via T=std::simd::u8x32 and T=std::simd:u8x64.

My questions are:

  1. Do you have any initial feedback on adding such a function?
  2. How would I go about proposing such functions and implementing them in rustc?
  3. Can you confirm that my workarounds below (section "Workarounds") are sound?

Additionally, I'll provide some motivation for why I need this.

Motivation

In high-performance branchfree code (typically SIMD, but sometimes also SIMD-within-a-register) we often want to use wide memory reads that read both initialized data and uninitialized data. While processing this data, we typically mask out the uninitialized data, e.g. by bitwise AND.

Two examples follow.

Example 1: fast hashing/equality/memcpy/etc on types with padding

Given an 8-byte type with internal padding, we can implement fast hashing/equality/memcpy on it by loading the 8 bytes into a u64 register and masking out the padding:

#[repr(C)]
struct TypeWithPadding(u8, u32);

fn masked_load(x: &TypeWithPadding) -> u64 {
  // What we want to do, but sadly it's unsound (i.e. undefined behavior):
  let x = unsafe { transmute::<&TypeWithPadding, &u64>(x) };
  let mask = 0xFF_FF_FF_FF_00_00_00_FF;  // Little endian machine
  (*x) & mask
}

fn fast_hash(x: &TypeWithPadding) -> u64 {
  some_u64_hash(masked_load(x))
}

fn fast_eq(x: &TypeWithPadding, y: &TypeWithPadding) -> bool {
  masked_load(x) == masked_load(y)
}

fn fast_copy(dst: &mut TypeWithPadding, src: &TypeWithPadding) {
  let dst = unsafe { transmute::<&mut TypeWithPadding, &mut u64>(dst) };
  *dst = masked_load(src);
}

On x86, these hash/copy/eq implementations are typically faster than the variant that carefully avoids reading uninitialized data, because the careful variant has to issue two memory loads (one for the u8 and one for the u32) whereas the fast variant can issue just one memory load (the u64) and then can operate on wide registers.

Sadly, the masked_load function doesn't appear to be possible to implement soundly in Rust, other than the workarounds I list at the end of this post. With freeze, we could implement it soundly, as follows:

fn masked_load(x: &TypeWithPadding) -> u64 {
  // What we want to do, but sadly it's unsound (i.e. undefined behavior):
  let x = unsafe { transmute::<&TypeWithPadding, &MaybeUninit<u64>>(x) };
  let v: u64 = unsafe { MaybeUninit::freeze(*x) };
  let mask = 0xFF_FF_FF_FF_00_00_00_FF;  // Little endian machine
  v & mask
}

Example 2: variable-length arrays

When processing arrays of bytes, we often want to write vectorized loops, for example operating on 32-byte registers. But what do we do when the array is found at runtime to be <32 bytes long?

A pretty good approach is to arrange for the underlying buffer to be guaranteed >=32 bytes long, with the valid part of the array being a prefix, and uninitialized data following. In that case, in the <32-byte case, we do a 32-byte vectorized load, then mask out the uninitialized data, and operate on the valid data using wide vectors.

We run into the same issue as we saw above: we need to implement masked_load, but Rust doesn't seem to give us the tools to implement it soundly.

Workarounds

Without the MaybeUninit::freeze primitive, is there any way we can provide something similar in today's Rust?

One approach is inline assembly. For example:

use std::arch::asm;
unsafe fn load_and_freeze(x: &MaybeUninit<u64>) -> u64 {
    let mut result;
    asm!(
        "mov {result}, {x}",
        x = in(reg) x,
        result = lateout(reg) result,
    );
    result
}

I believe this is sound, i.e. Rust/LLVM assume that all inline assembly might potentially have a freeze operation inside it. Can you confirm that this is sound?

This workaround has disadvantages in portability, as well as performance: it hardcodes a specific addressing mode rather than allowing instruction selection to pick the best addressing mode for the context.

Another possible approach is architecture-specific SIMD intrinsics. E.g.:

use std::simd::u8x32;
use std::mem::MaybeUninit;
use std::arch::x86_64::{_mm256_loadu_si256, __m256i};

#[inline(always)]
unsafe fn load_and_freeze_avx2(x: &MaybeUninit<[u8; 32]>) -> u8x32 {
    let v = _mm256_loadu_si256(x as *const _ as *const __m256i);
    std::mem::transmute::<__m256i, u8x32>(v)
}

The Intel intrinsics don't specifically document their semantics with respect to uninitialized memory, so I'm not sure on whether this is sound, but I suspect it is not. Can you confirm that this is not sound?

6 Likes

I think there's definitely a bunch of questions about that vs, say, impl MaybeUninit<T> { fn freeze(&mut self); }.

Alternatively, safe-transmute might get us the required "no validity nor safety invariants" that would allow a safe freeze: MaybeUninit<T> -> T.

There are also interesting possibilities about wrapping https://llvm.org/docs/LangRef.html#fast-math-flags for safe code using some kind of freeze -- one could do a bunch of operations on the MaybeUninit<f32>s that could be poison, but then have a safe freeze step at the end to get back a safe f32 (albeit perhaps a meaningless one).

> Blockquote

Ah, I see. I had side-stepped the "no validity or safety invariants" question by focusing on a limited set of types T for which that's already true, namely the u8/u16/u32/u64/u8x32/u8x64 types. But indeed, safe-transmute would allow more general choices of T. And your signature that keeps the type as MaybeUninit<T> also sidesteps that issue, by moving the unsafety to the subsequent assume_init() call.

To be clear: while validity of the bit pattern under type T is important, it's not my primary concern here. Even for a type such as u64 for which all bit patterns are valid, reading uninitialized data is undefined behavior in current Rust. It's that undefined behavior that I'm trying to avoid with the function freeze.

In light of this discussion, I'll revise my suggested type signature to be:

fn freeze<T>(v: MaybeUninit<T>) -> MaybeUninit<T>

which has the semantics of calling LLVM's freeze on the underlying byte storage, and returning a new object with the resulting byte storage. This has no hazards with respect to types T whose invariants must be satisfied; those hazards are all part of MaybeUninit::assume_init(), which already exists.

Interesting. I'm not aware of what this looks like in practice; would you care to share a concrete example?

2 Likes

Can this operation be implemented by the gcc and cranelift backends? (I guess that since cranelift doesn't have the concept of UB, this might be just a no-op identity function for cranelift. But what about gcc?)

At its simplest, fast_fmul(MaybeUninit<f32>, MaybeUninit<f32>) -> MaybeUninit<f32> intrinsics. At its full form, roughly struct ff32<const F: FFloatFlags>(MaybeUninit<f32>), ff32::<A> * ff32::<B> -> ff32::<{A | B}>, and ff32::get() -> /* freeze */ f32.

1 Like

SIMD intrinsics don't do what you want. Despite many of them looking like thin wrappers for assembly instructions, the compiler is not guaranteed to actually generate those instructions; it can and does optimize them like normal memory accesses. (This is true in C as well.)

asm! probably does have an implicit freeze, though I think this is not entirely uncontroversial.

1 Like

Freezing at the compiler level is not sufficient on its own due the the presence of things like MADV_FREE.

1 Like

Other use-cases of freeze aside, a masked load can be implemented within the current Rust semantics:

unsafe fn assume_init_masked(abs_byte: MaybeUninit<u8>, mask: u8) -> u8 {
  (if mask == 0 { 0 } else { abs_byte.assume_init() }) & mask
}

Except the branch should be removed at code-gen (current code-gen uses cmove).

This is specifically asking about freeze in the cases where it is sufficient to freeze a scalar.

fn freeze(&mut self) will not function in-and-of itsel, because of the aforementioned MADV_FREE, and any other kind of lazy memory reclamation scheme implemented in the kernel.

Will it function for types specifically only under a given size?

No, not at all. The necessary condition is that the page that has been MADV_FREEd needs to be written to (which causes the whole page to become zeroes first).

If the fn freeze(&mut self) was implemented as writing to that memory, wouldn't that meet this criterion? %1 = load u64, %p; %2 = freeze u64 %1; store u64 %2, %p is not a NOP, after all.

Sorry, I should amend my question to be more inline with the OP. What is actually being requested is

const N: usize = 8; // 16? 32? Platform dependent?

unsafe fn freeze_but_i_swear_i_touched_it<T>(v: MaybeUninit<T>) -> MaybeUninit<T>
where
    const { size_of<T>() < N };

If v doesn't cross a page boundary this new function should work without running into MADV_FREE, I'm wondering if we can get a (possibly platform dependent) guarantee for some (potentially very small) size_of<T>() (well and align_of<T>()).

By-value freeze works unconditionally (unless hardware uninit memory/registers are involved, or NaT happens). In-place freeze does not work b/c of MADV_FREE.

1 Like

To be clear: I am proposing by-value freeze. As @InfernoDeity says, this avoids MADV_FREE-related issues.

I don't see how. Because either *x = freeze(*x); still has MADV_FREE issues, or the &mut freeze could just do that.

It's the latter, i.e "&mut freeze could just do that".

I think it's worth distinguishing two possible semantics that one could choose for fn freeze(&mut MaybeUninit<T>):

  1. It could generate LLVM IR to read all bytes in the memory region, call LLVM's freeze on those bytes, and then write the frozen results back to the memory region. When LLVM compiles that to machine code, the freeze operations will become noops, but the memory reads and writes will remain. This avoids MADV_FREE issues, because the writes to memory will cause the pages to become physically backed.
  2. Alternatively, it could generate some kind of LLVM IR (unclear what that would be, but let's put that aside) which eventually compiles to completely noop machine code. This variant is prone to MADV_FREE issues, because there are no memory writes to cause the pages to become physically backed.

In my understanding, variant (2) is broken; variant (1) works. I suspect some of the back-and-forth on this thread has been disagreement in which semantics people think is being discussed. Past discussions of freeze such as this one have focused on variant (2), because the overheads of the memory reads and writes in variant (1) were considered to be too severe.

In my original proposal I gave the signature fn freeze(MaybeUninit<T>) -> MaybeUninit<T> instead of the &mut-based signature. This has some minor ergonomic advantages:

  • it makes the copying explicit, whereas the &mut version is ambiguous between (1) and (2);
  • it works well in a chaining context where you read and then freeze: let value = ptr::read(x).freeze();
1 Like

The existence of fn freeze(MaybeUninit<T>) -> MaybeUninit<T> implies that we could write:

fn freeze_mut(v: &mut MaybeUninit<T>) {
    *v = freeze(*v);
}

The semantics need to be correct either way.

1 Like

This is also why I'm asking: For what T can we guarantee that v: MaybeUninit<T>is entirely contained within a page? Because if we can make that guarantee, then writing to any byte of v should mean we're in the clear wrt MADV_FREE, enabling freeze_but_written_to_first and masked_load.