Rust needs a safe abstraction over uninitialized memory

Consider this simple function which creates an array containing the Fibonacci sequence:

const LENGTH: usize = 80;

fn fibonacci_array() -> [u64; LENGTH] {
	let mut output = [0; LENGTH];
	output[0] = 1;
	output[1] = 1;
	for i in 2..LENGTH {
		output[i] = output[i - 1] + output[i - 2];
	}
	output
}

As it turns out, the compiler is not smart enough to realize that none of the zeros make it into the final array, and it does unnecessary work zeroing the array out first (as seen on Godbolt). This function can therefore be optimized using MaybeUninit like so:

fn fibonacci_array_unsafe() -> [u64; LENGTH] {
	use std::mem::{MaybeUninit, transmute};
	let mut output = [MaybeUninit::<u64>::uninit(); LENGTH];
	output[0].write(1);
	output[1].write(1);
	for i in 2..LENGTH {
		output[i].write(unsafe { output[i - 1].assume_init() + output[i - 2].assume_init() });
	}
	unsafe { transmute(output) }
}

This runs about 20% faster on my machine. But is it actually better code? It's extremely ugly and uses copious amounts of unsafe. If I make a logic error, like forgetting to initialize one of the values, it triggers undefined behaviour. My proposed safe abstraction looks like this:

use std::mem::arbitrary;

fn fibonacci_array2() -> [u64; LENGTH] {
	let mut output: [u64; LENGTH] = arbitrary();
	// At this point, the contents of `output` are well-defined but unspecified.
	output[0] = 1;
	output[1] = 1;
	for i in 2..LENGTH {
		output[i] = output[i - 1] + output[i - 2];
	}
	output
}

arbitrary() tells the compiler: I don't care what the value is. The compiler can put anything it likes into output at this point. 0, 42, the current timestamp, whatever. But the hope is that the compiler will choose a value such that the resulting program is as fast as possible. In practice, it would make sense to grab some uninitialized memory, call LLVM's freeze on it (cf. 1, 2), and whatever happens to be there is a valid output of arbitrary().

If we accidentally forget to initialize output, the result is merely garbage data rather than undefined behaviour. Thus we get the full speed of the unsafe version without having to use any unsafe blocks.
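To make the failure mode concrete, here is what such a bug would look like under the proposal (this is a sketch using the hypothetical arbitrary(), so it doesn't compile today):

fn fibonacci_array_buggy() -> [u64; LENGTH] {
	let mut output: [u64; LENGTH] = arbitrary();
	output[0] = 1;
	output[1] = 1;
	// Oops, forgot the loop: elements 2.. keep whatever arbitrary()
	// chose. Garbage, but well-defined garbage, so no UB.
	output
}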

Note that arbitrary() can only be used for types that can hold arbitrary bit patterns (think bytemuck's Pod trait):

let a: bool = arbitrary(); // ERROR
let b: String = arbitrary(); // ERROR
let c: i128 = arbitrary(); // this is ok
let d: [f32; 50] = arbitrary(); // this is ok

This includes signed and unsigned integers, floats, and arrays or structs containing these. As always, producing a value whose bit pattern is invalid for its type is undefined behaviour.
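As a minimal sketch of what the signature could look like, assuming a hypothetical marker trait (none of this exists in std, and the zeroing body is only a sound placeholder; a real implementation would hand out frozen uninitialized memory instead):

use std::mem::MaybeUninit;

/// Hypothetical marker: types that are valid for every bit pattern
/// (bytemuck's Pod/AnyBitPattern play this role today).
pub unsafe trait ArbitraryBits: Copy {}
unsafe impl ArbitraryBits for u64 {}
unsafe impl ArbitraryBits for f32 {}
unsafe impl<T: ArbitraryBits, const N: usize> ArbitraryBits for [T; N] {}

pub fn arbitrary<T: ArbitraryBits>() -> T {
	// Placeholder: all-zero bits are one valid "arbitrary" value, and the
	// trait bound makes this sound. A real implementation would skip the
	// zeroing and freeze uninitialized memory instead.
	unsafe { MaybeUninit::zeroed().assume_init() }
}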

Anyway, I just wanted to say my piece, but maybe this can be implemented someday.


Garbage data is a safety issue. If it's not overwritten, it may contain whatever was there before, which may be dangerous to disclose. The Heartbleed vulnerability was merely sending garbage data, but that data could contain private keys.

In your example I suspect ArrayVec could be a zero-cost abstraction.


I tried benchmarking this function per your suggestion:

use arrayvec::ArrayVec;
fn fibonacci_array_arrayvec() -> [u64; LENGTH] {
	let mut output = ArrayVec::<u64, LENGTH>::new();
	output.push(1);
	output.push(1);
	for i in 2..LENGTH {
		output.push(output[i - 1] + output[i - 2]);
	}
	output.into_inner().unwrap()
}

It turned out to be almost twice as slow as the unsafe version, so quite far from a zero-cost abstraction.

fibonacci_array         time:   [42.947 ns 43.807 ns 45.015 ns]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

fibonacci_array_unsafe  time:   [34.922 ns 35.467 ns 36.152 ns]
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

fibonacci_array_arrayvec
                        time:   [63.128 ns 64.027 ns 65.202 ns]
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

I gotta ask: how exactly did you do your benchmark?

I benchmarked each function with the criterion crate.

Please post the benchmark code so that we can review it.


We do already have nondeterminism in the language, though maybe not in a form that the compiler would take advantage of:

fn arbitrary() -> bool {
	// The sign of a NaN produced by an arithmetic operation is
	// unspecified, so this may evaluate to either true or false.
	(1.0 * f32::NAN).is_sign_positive()
}

It would be useful to have something that could be optimized better.

Here's the benchmark code if anyone wants to try reproducing it:

use std::hint::black_box;
use criterion::{criterion_group, criterion_main, Criterion};

const LENGTH: usize = 80;

fn fibonacci_array() -> [u64; LENGTH] {
	let mut output = [0; LENGTH];
	output[0] = 1;
	output[1] = 1;
	for i in 2..LENGTH {
		output[i] = output[i - 1] + output[i - 2];
	}
	output
}

fn fibonacci_array_unsafe() -> [u64; LENGTH] {
	use std::mem::{MaybeUninit, transmute};
	let mut output = [MaybeUninit::<u64>::uninit(); LENGTH];
	output[0].write(1);
	output[1].write(1);
	for i in 2..LENGTH {
		output[i].write(unsafe { output[i - 1].assume_init() + output[i - 2].assume_init() });
	}
	unsafe { transmute(output) }
}

use arrayvec::ArrayVec;
fn fibonacci_array_arrayvec() -> [u64; LENGTH] {
	let mut output = ArrayVec::<u64, LENGTH>::new();
	output.push(1);
	output.push(1);
	for i in 2..LENGTH {
		output.push(output[i - 1] + output[i - 2]);
	}
	output.into_inner().unwrap()
}

fn criterion_benchmark(c: &mut Criterion) {
	c.bench_function("fibonacci_array", |b| b.iter(|| black_box(fibonacci_array())));
	c.bench_function("fibonacci_array_unsafe", |b| b.iter(|| black_box(fibonacci_array_unsafe())));
	c.bench_function("fibonacci_array_arrayvec", |b| b.iter(|| black_box(fibonacci_array_arrayvec())));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Try this code instead:

pub fn fibonacci_array_arrayvec() -> [u64; LENGTH] {
	let mut output = ArrayVec::<u64, LENGTH>::new();
	output.push(1);
	output.push(1);
	for i in 2..LENGTH {
		output.push(output[i - 1] + output[i - 2]);
	}
	output.as_slice().try_into().unwrap()
}

output.into_inner().unwrap() for some reason causes the compiler to insert a bunch of manual copies instead of relying on memcpy. It balloons the function size, which probably causes the microbenchmark regression. UPD: The ArrayVec code results in 4 additional instructions in the loop body compared to the MaybeUninit code, so I guess that's the main reason for the regression.


I wonder if it would help to not read from the array within the loop at all. Something like:

pub fn fibonacci_array_arrayvec() -> [u64; LENGTH] {
    let mut output = ArrayVec::<u64, LENGTH>::new();
    let mut a = 1_u64;
    let mut b = 1_u64;
    for _ in 0..LENGTH {
        output.push(b);
        let s = b + a;
        b = a;
        a = s;
    }
    output.as_slice().try_into().unwrap()
}

It certainly results in slightly nicer assembly, since a and b are kept in registers, but I am not sure it's a fair comparison.

BTW, I think the additional instructions were just from the compiler unrolling the loop to perform several additions per iteration.

Yes, this code is practically the same speed as the unsafe version. But as @newpavlov said, it's not really a fair comparison, since the compiler can optimize much more aggressively when it notices that the array is never read from. For some applications you might not be able to change the algorithm in this way.

Related: `freeze(MaybeUninit<T>) -> MaybeUninit<T>` for masked reads - #32 by RalfJung

With just the change to output.as_slice().try_into().unwrap() I get almost identical assembly for the arrayvec version:

-test::fibonacci_array_unsafe:
+test::fibonacci_array_arrayvec:
 Lfunc_begin0: # replaced all digits with 0 for diffing
    .cfi_startproc
    stp x0, x0, 0sp, #-0!
@@ -15,19 +15,23 @@
    sub sp, sp, #0
    mov x0, x0
    mov x0, #0
+   add x0, sp, #0
    mov w0, #0
    dup.0d v0, x0
-   str q0, 0sp0
-   mov x0, sp
+   stur q0, 0sp, #0
+   mov w0, #0
+   str w0, 0sp, #0
 LBB0_0:
+   add x0, x0, x0, lsl #0
+   ldr x0, 0x0, #0
    add x0, x0, x0
-   ldr x0, 0x0
-   add x0, x0, x0
    str x0, 0x0, #0
+   add w0, w0, #0
+   str w0, 0sp, #0
    add x0, x0, #0
    cmp x0, #0
    b.ne LBB0_0
-   mov x0, sp
+   add x0, x0, #0
    mov w0, #0
    bl _memcpy
    add sp, sp, #0

I guess my point is that a "safe abstraction over uninitialized memory" might well be better split into two abstractions: one for when you never need to read from memory the compiler can't prove you have already initialized, and one for when you do. And maybe an example of the latter would be helpful.


I'm disappointed to report that this doesn't work, even for small values of LENGTH:

pub const LENGTH: usize = /* tried: 4, 8, 16, 128 */;

struct FibGen(u64, u64);

impl FibGen {
    pub fn new() -> Self { FibGen(1, 1) }
}

impl Iterator for FibGen {
    type Item = u64;
    fn next(&mut self) -> Option<Self::Item> {
        let a = self.0;
        self.0 = self.1;
        self.1 += a;
        Some(a)
    }
}

pub fn fibonacci_array() -> [u64; LENGTH] {
    // error[E0277]: an array of type `[u64; LENGTH]` cannot be built
    // directly from an iterator
    FibGen::new().take(LENGTH).collect()
}

You can use std::array::from_fn, but I'm not sure about the performance. It's probably bad.

pub fn fibonacci_array() -> [u64; LENGTH] {
    let mut iter = FibGen::new();
    
    std::array::from_fn(|_| iter.next().unwrap())
}

Perhaps it's better if we don't go through the Iterator trait:

impl FibGen {
    fn next(&mut self) -> u64 {
        let a = self.0;
        self.0 = self.1;
        self.1 += a;
        a
    }
}
pub fn fibonacci_array() -> [u64; LENGTH] {
    let mut fibgen = FibGen::new();
    
    std::array::from_fn(|_| fibgen.next())
}

Unfortunately, your concept of relying on the compiler to prove that your program does not do any uninitialized reads is impossible in general, due to Rice's theorem. There will always be programs that the compiler is not smart enough to reason about, leading to a runtime cost when unnecessary work is done to initialize a variable. By the way:

let mut num: u64 = arbitrary();
num = num ^ num; // x ^ x == 0 for every x, even an arbitrary one
assert_eq!(num, 0);

Does this count as reading uninitialized memory? (Consider that XORing a register with itself is the standard way to zero it on x86.)

Your method of using take(LENGTH) won't work because the compiler cannot prove (in the general case) that the iterator will really yield LENGTH elements rather than prematurely returning None.
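To illustrate, nothing in the Iterator contract rules out an implementation like this one (a made-up example), which dries up after two elements no matter what take(LENGTH) asked for:

struct Stingy(u32);

impl Iterator for Stingy {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        self.0 += 1;
        // From the third call onwards this is always None.
        (self.0 <= 2).then_some(1)
    }
}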

If you want the best performance, you will need a solution that uses MaybeUninit, either directly or via a safe wrapper like ArrayVec, simply because a possible uninitialized read gives the compiler a better optimization handle than an arbitrary read does.

I think your proposal is still a bit weak on the use-case side, since the example you gave is, on the one hand, doable with a good abstraction, and on the other hand simple enough that a logic error leading to an uninitialized read would be obvious. Nevertheless, I think there are cases where a freeze/arbitrary read is the best solution.

What I find the more important issue is that, in all the solutions, the result is not constructed in the return-value position in the first place, but only copied there as the last step. I think we could lower the overall inlining threshold if this were handled better. Furthermore, LLVM's analysis is apparently not capable of removing the array entirely if you do a calculate_array().get(i).copied(), even when the calculation does not read the array but uses temporaries to store the previous elements.
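For concreteness, this is the kind of single-element lookup meant above (with fibonacci_array from earlier standing in for calculate_array); one would hope the whole array gets optimized away here, but apparently it does not:

pub fn nth_fib(i: usize) -> Option<u64> {
    fibonacci_array().get(i).copied()
}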

This is not really a good argument: all that theorem says is that for some programs, it is just as impossible for computers as it is for humans to prove this. If the correctness properties of such a program can't be proved, it's a bad program, and nobody should be relying on it.

But in general I agree it might be useful to have a "freeze" operation.

A good use of freeze would be a receive buffer for checksummed packets. That way you can just write an arriving packet into the buffer and then verify the checksum. Even if some byte is uninitialized, the checksum will cause the packet to be discarded (or accepted, if freeze happened to yield the right byte).
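A minimal sketch of that pattern, assuming a hypothetical freeze_bytes helper (the zero-writing body is only a sound stand-in; a real freeze would perform no writes at all):

use std::mem::MaybeUninit;

// Hypothetical stand-in for freeze: makes every byte defined. A real
// freeze would keep whatever bits happen to be there instead of writing.
fn freeze_bytes(buf: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    for b in buf.iter_mut() {
        b.write(0);
    }
    // SAFETY: every byte was initialized in the loop above.
    unsafe { &mut *(buf as *mut [MaybeUninit<u8>] as *mut [u8]) }
}

fn verify(buf: &mut [MaybeUninit<u8>; 1500], expected_sum: u8) -> Option<&[u8]> {
    let bytes = freeze_bytes(buf);
    // If the network hardware left some bytes unwritten, the frozen
    // garbage will almost certainly fail the checksum.
    let sum = bytes.iter().fold(0u8, |acc, &b| acc.wrapping_add(b));
    (sum == expected_sum).then_some(&*bytes)
}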