MaybeUninit: consider supporting extract the value safely

Currently we can safely wrap a value in MaybeUninit:

pub const fn new(val: T) -> MaybeUninit<T>

but there seems to be no safe API to get the value back. The recommended API seems to be:

pub const unsafe fn assume_init(self) -> T

It seems that the knowledge that it was fully initialized was lost. Is it possible for MaybeUninit to support extracting the value back via a safe API (when it is already initialized)?
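To make the question concrete, here is a minimal sketch of the round trip as it looks today. The helper name `round_trip` is made up for illustration; only `MaybeUninit::new` and `assume_init` come from the standard library.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: wraps a value and immediately unwraps it again.
fn round_trip(val: u32) -> u32 {
    // Wrapping is safe.
    let wrapped: MaybeUninit<u32> = MaybeUninit::new(val);
    // Unwrapping is not: the type no longer records that it was initialized.
    // SAFETY: `wrapped` was created by `MaybeUninit::new` just above.
    unsafe { wrapped.assume_init() }
}

fn main() {
    assert_eq!(round_trip(42), 42);
}
```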

Such a safe API would be very helpful when a library takes a MaybeUninit and the user already has a fully initialized buffer, since it would allow the user to keep using the library safely.

Currently such a library API becomes unsafe in practice, for example recv_from in socket2.

Yes, that's fundamental to MaybeUninit.

If you want a MaybeUninit that tracks whether it's initialized, that's called Option :upside_down_face:

Note also that MaybeUninit::write returns a mutable reference through which it can further be read in safe code, but fundamentally if all you have is a MaybeUninit, there's no way to know whether it's initialized.
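As a small illustration of the point about MaybeUninit::write (a sketch; the function name `write_then_read` is invented here), the returned `&mut T` can be used without any unsafe:

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper showing that `write` hands back a plain `&mut T`.
fn write_then_read() -> u32 {
    let mut slot = MaybeUninit::<u32>::uninit();
    // `write` initializes the slot and returns a reference to the new value.
    let value: &mut u32 = slot.write(7);
    // From here on, everything is ordinary safe code.
    *value += 1;
    *value
}

fn main() {
    assert_eq!(write_then_read(), 8);
}
```

Once the reference goes away, though, `slot` is again just a MaybeUninit<u32> with no record of having been written.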

If you're interested in partially-uninitialized buffers, then you're probably looking for something like BorrowedBuf in std::io anyway, not byte-by-byte initialization checking.


In particular, MaybeUninit<T> can’t have any extra flags of any sort, because it is guaranteed to have the same layout as T. For example, there's no room in Vec<MaybeUninit<u8>> to store another bit per byte.
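The layout point can be checked directly. This is only a sketch; `same_layout` is a name made up for the example:

```rust
use std::mem::{align_of, size_of, MaybeUninit};

/// Hypothetical check: does wrapping `T` in `MaybeUninit` change size or alignment?
fn same_layout<T>() -> bool {
    size_of::<MaybeUninit<T>>() == size_of::<T>()
        && align_of::<MaybeUninit<T>>() == align_of::<T>()
}

fn main() {
    assert!(same_layout::<u8>());
    assert!(same_layout::<[u64; 4]>());
    // A buffer of 1024 maybe-uninit bytes is exactly 1024 bytes: no spare bits.
    assert_eq!(size_of::<[MaybeUninit<u8>; 1024]>(), 1024);
    // Option<u8>, by contrast, needs extra room to remember whether it holds a value.
    assert!(size_of::<Option<u8>>() > size_of::<u8>());
}
```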


Thanks for explaining the details! Honestly, I feel it's often a bad trade-off (safety vs. a supposed performance gain in certain cases) when I encounter MaybeUninit used in crates, especially when its use broke their API's backward compatibility and safety. Maybe it's just me, but it really gets in the way (of using the crate) when you don't want to deal with it :slightly_frowning_face:

Can you give an example? Generally if it's used, it's for a very good reason.


Exposing only an unsafe API (which is inevitable when using MaybeUninit) is usually a bad tradeoff. It's a different matter if they've exposed both a MaybeUninit API, for users who care about the performance implications of initializing and then immediately overwriting memory, and a safe API (e.g. BorrowedCursor::append, which is safe, but has extra copies compared to using uninit_mut, MaybeUninit::slice_as_ptr and set_init to replace uninitialized bytes with the real data).

Can you give an example? Generally if it's used, it's for a very good reason.

I am not trying to pick on socket2, but here is an example I mentioned earlier:

recv_from in socket2.

And to compare, here is the same API in older version 0.3.19, which can be used safely: recv_from in v0.3

Wait, wait. I do think this request might have a useful interpretation, even though it can't work as-stated.

Yes, if you're holding a value of type "MaybeUninit<T> that is known to be initialized", that type can be expressed as just T. But I figure you would sometimes want to construct a MaybeUninit in a struct and remember that it was initialized – and what we lack are utilities for converting between a "struct with T member and struct with MaybeUninit<T> member" (in fact, even if you manually implement both structs, you can't transmute between them, as noted in the docs kpreid linked). So it could genuinely be useful to have a wrapper type that is generic over whether it is known-initialized or not, while guaranteeing that it gives the same layout to other types that have it as a field… and conveniences for doing type-state transformations on such a field without having to use unsafe code…

I don't see an easy way to bring this into the Rust type system, but it is a coherent thing to want, and might help libraries expose only the strictest API instead of one that leaves things maybe-uninitialized more than theoretically necessary.

I personally find this unlikely, because it's a transitive guarantee. It means that if anything in your type -- no matter how deep -- uses that type, it can no longer be layout-optimized or -Z randomize-layouted etc.

If you want to make two types, one with T and one with MaybeUninit<T>, where you can transmute between them, you can absolutely do that with repr(C).
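A minimal sketch of that repr(C) approach, with made-up struct and function names (`PacketUninit`, `PacketInit`, `finish`), assuming the fields are declared in the same order in both types:

```rust
use std::mem::{self, MaybeUninit};

// Hypothetical pair of structs that differ only in the MaybeUninit wrapper.
#[repr(C)]
struct PacketUninit {
    len: u32,
    payload: MaybeUninit<u64>,
}

#[repr(C)]
struct PacketInit {
    len: u32,
    payload: u64,
}

/// Writes the payload, then converts to the known-initialized type.
fn finish(mut p: PacketUninit, payload: u64) -> PacketInit {
    p.payload.write(payload);
    // SAFETY: both structs are repr(C) with the same field order, and
    // MaybeUninit<u64> has the same size and alignment as u64, so the two
    // layouts match. `payload` was initialized on the line above.
    unsafe { mem::transmute::<PacketUninit, PacketInit>(p) }
}

fn main() {
    let p = finish(PacketUninit { len: 3, payload: MaybeUninit::uninit() }, 99);
    assert_eq!((p.len, p.payload), (3, 99));
}
```

The unsafe block is still there; repr(C) only makes the transmute defensible, it does not make the conversion safe.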

It is instructive to look at the history of this change. It comes from the fact that socket2 is a low-level library and higher-level callers were already trying to pass uninitialized buffers to it (see tokio-rs/mio issue #1574, "Adding methods for accepting `&mut [MaybeUninit<u8>]`"), presumably to avoid the overhead of initializing data that would then immediately be overwritten.

This change also shows you why a MaybeUninit<T> cannot track whether it is initialized. It is used in contexts where the representation has to be exactly the same as T, a single byte in this case, so there is nowhere to store the initialization state. This allows for [MaybeUninit<T>] to [T] casts after initialization.
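The cast mentioned above might look like this on stable Rust. It is a sketch with an invented helper name (`fill_and_view`); the "initialization" here is just a loop standing in for whatever actually fills the buffer:

```rust
use std::mem::MaybeUninit;
use std::slice;

/// Hypothetical helper: fills every slot, then views the buffer as plain bytes.
fn fill_and_view(buf: &mut [MaybeUninit<u8>]) -> &[u8] {
    for (i, slot) in buf.iter_mut().enumerate() {
        slot.write(i as u8);
    }
    // SAFETY: every element was written in the loop above, and
    // MaybeUninit<u8> has the same layout as u8, so the pointer cast is valid.
    unsafe { slice::from_raw_parts(buf.as_ptr() as *const u8, buf.len()) }
}

fn main() {
    let mut buf = [MaybeUninit::<u8>::uninit(); 4];
    assert_eq!(fill_and_view(&mut buf), &[0u8, 1, 2, 3][..]);
}
```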

The overhead of zeroing a buffer is typically low, but it is definitely not zero if you're dealing with potentially large buffers [1], so there are valid use cases to provide APIs that allow avoiding initialization overhead.

The way to avoid MaybeUninit<T> without losing any performance is to design abstractions on top of it. Hence the experimental BorrowedCursor API in nightly Rust.

[1] Depending on how exactly the memory gets allocated, the zeroing might be "free". See e.g. the Stack Overflow question "Kernel zeroes memory?".


Why couldn't it be layout-optimized or -Z randomize-layouted in a way that's agnostic to whether the contained type is known-initialized or not? For randomization, you could seed the random numbers based on the nominal-identity and fields of your type, but where fields contribute the same seed-data when they are "guaranteed equivalent" in this way. (I realize this might create a lot of developer work refactoring the layout algorithms, I'm just talking about theoretically)


It is instructive to look at the history of this change.

It's all good, except the problem is that socket2 did not add a new API to support the MaybeUninit use case; instead it changed a (commonly used) existing API in a way that broke compatibility and safety. It's unfortunate.

As said in a previous comment: MaybeUninit: consider supporting extract the value safely - #6 by farnz


With this example in hand, I think that there's an actual improvement possible to MaybeUninit for this use case.

You want to be able to take a reference to a T or slice of T, and treat it as a reference to MaybeUninit<T> or slice of MaybeUninit<T> - this should be an entirely safe operation in your code, since the underlying owned T is known to be initialized by the rules of the language, and it does not require tracking whether a given MaybeUninit<T> is initialized or not.

I'd propose adding the following functions to MaybeUninit if this was something I was pushing forwards:

pub const fn from_ref(val: &T) -> &MaybeUninit<T>;
pub const fn from_mut_ref(val: &mut T) -> &mut MaybeUninit<T>;
pub const fn from_ref_slice(slice: &[T]) -> &[MaybeUninit<T>];
pub const fn from_mut_ref_slice(slice: &mut [T]) -> &mut [MaybeUninit<T>];

There's already a pub const fn new in MaybeUninit that takes ownership; these are the obvious extensions of that to cases that don't take ownership.

Then, for the socket2 API case, you'd write code like:

let mut buf = vec![0;16384];
let count = s.recv_from(MaybeUninit::from_mut_ref_slice(&mut buf));

It's a bit uglier than a nice API in socket2 would be, because you have to have the conversion function in there, but it avoids you needing unsafe.

The big blocker I can see to this is that it's possible to write MaybeUninit::uninit() through a mutable reference, making it UB to access the value, and to rely on the contract of your function to stop people accessing the uninitialized component (e.g. recv_from could swap an internal buffer that's only initialized up to the number of bytes it returns, instead of copying directly). I'd be inclined to document this as more requirements on MaybeUninit users (that you never change an initialized value to uninit), but that's an arguable position.

I'd be inclined to document this as more requirements on MaybeUninit users

It would allow UB in safe code:

let mut init = 5;
*MaybeUninit::from_mut_ref(&mut init) = MaybeUninit::uninit();
println!("{init}")

What would be required is an uninit type that cannot be "deinitialized". I think one needs a write-only reference wrapper for this.
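One possible reading of that idea, as a rough sketch: a wrapper that can be built safely from an initialized slice because it only ever lets you store real values. `WriteOnlySlice` and its methods are hypothetical names, not an existing API, and the `T: Copy` bound is an assumption made here to sidestep questions about dropping overwritten values.

```rust
use std::mem::MaybeUninit;
use std::slice;

/// Hypothetical write-only view: values can be stored but never read,
/// and there is no way to store `MaybeUninit::uninit()` through it.
pub struct WriteOnlySlice<'a, T> {
    slots: &'a mut [MaybeUninit<T>],
}

impl<'a, T: Copy> WriteOnlySlice<'a, T> {
    /// Safe to call on an initialized buffer: nothing can de-initialize it.
    pub fn from_init(buf: &'a mut [T]) -> Self {
        let len = buf.len();
        let ptr = buf.as_mut_ptr() as *mut MaybeUninit<T>;
        // SAFETY: MaybeUninit<T> has the same layout as T, and this wrapper
        // only ever writes initialized T values, so `buf` stays initialized.
        let slots = unsafe { slice::from_raw_parts_mut(ptr, len) };
        WriteOnlySlice { slots }
    }

    /// Also usable on a genuinely uninitialized buffer.
    pub fn from_uninit(slots: &'a mut [MaybeUninit<T>]) -> Self {
        WriteOnlySlice { slots }
    }

    pub fn len(&self) -> usize {
        self.slots.len()
    }

    /// Stores `val` at `index`; panics if `index` is out of bounds.
    pub fn write(&mut self, index: usize, val: T) {
        self.slots[index].write(val);
    }
}

fn main() {
    let mut buf = [1u8, 2, 3];
    let mut view = WriteOnlySlice::from_init(&mut buf);
    view.write(1, 9);
    assert_eq!(buf, [1, 9, 3]);
}
```

A library function taking such a wrapper could be called with either kind of buffer without unsafe on the caller's side, though it still has to report how much it wrote.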


You can track it via type state. The layouts of SometimesInitialized<true> and SometimesInitialized<false> are not guaranteed to be equal so you must not transmute between them or their references, but in non-debug builds without randomization the moves should probably optimize away.

Or you could slap a repr(C) on it and the structs containing it.

use core::mem::MaybeUninit;

struct SometimesInitialized<const IS_INIT: bool = false> {
    foo: MaybeUninit<u8>
}

impl SometimesInitialized<false> {
    fn new() -> Self {
        SometimesInitialized { foo: MaybeUninit::uninit() }
    }
}

impl<const IS_INIT: bool> SometimesInitialized<IS_INIT> {
    fn write(mut self, val: u8) -> SometimesInitialized<true> {
        self.foo.write(val);
        SometimesInitialized::<true> { foo: self.foo }
    }
}

impl SometimesInitialized<true> {
    fn read(&self) -> u8 {
        // SAFETY: this type can only be constructed by initializing the field
        unsafe { self.foo.assume_init_read() }
    }
}

fn main() {
    let foo = SometimesInitialized::new();
    let foo = foo.write(1);
    dbg!(foo.read());
}

This won't work for the fn read case though. Only part of the slice is written, you can't have a slice of mixed type.

The amount of written bytes is runtime information, so you need a type that tracks this at runtime anyway, it can't be done purely at compile time.

We've got BorrowedBuf for that, and it'd be nice if socket2 ported over to it.
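To show what tracking this at runtime means, here is a stripped-down sketch in stable Rust. It is not the real BorrowedBuf API; `TrackedBuf`, `append` and `filled` are names chosen for this example only.

```rust
use std::mem::MaybeUninit;
use std::slice;

/// Hypothetical buffer wrapper that remembers how many bytes were written.
pub struct TrackedBuf<'a> {
    storage: &'a mut [MaybeUninit<u8>],
    filled: usize,
}

impl<'a> TrackedBuf<'a> {
    pub fn new(storage: &'a mut [MaybeUninit<u8>]) -> Self {
        TrackedBuf { storage, filled: 0 }
    }

    /// Copies as much of `data` as fits and returns the number of bytes copied.
    pub fn append(&mut self, data: &[u8]) -> usize {
        let room = self.storage.len() - self.filled;
        let n = data.len().min(room);
        let dest = &mut self.storage[self.filled..self.filled + n];
        for (slot, byte) in dest.iter_mut().zip(data) {
            slot.write(*byte);
        }
        self.filled += n;
        n
    }

    /// Safe access to the initialized prefix only.
    pub fn filled(&self) -> &[u8] {
        // SAFETY: the first `filled` slots were written by `append`, and
        // MaybeUninit<u8> has the same layout as u8.
        unsafe { slice::from_raw_parts(self.storage.as_ptr() as *const u8, self.filled) }
    }
}

fn main() {
    let mut storage = [MaybeUninit::<u8>::uninit(); 8];
    let mut buf = TrackedBuf::new(&mut storage);
    assert_eq!(buf.append(b"hello"), 5);
    assert_eq!(buf.append(b"world"), 3); // only 3 bytes of room left
    assert_eq!(buf.filled(), b"hellowor");
}
```

The `filled` counter is exactly the runtime information that a bare slice of MaybeUninit<u8> cannot carry.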

I suspect that you could, at quite some effort, make the functions I described work, with a few nasty changes:

First, add an unsafe marker type AllBitPatternsValid or similar. Types where any bit pattern of the correct size is a valid instance of the type can implement AllBitPatternsValid, other types cannot. Then say that specifically for AllBitPatternsValid types (and those types only), MaybeUninit::uninit and friends create an arbitrary fixed bit pattern (which can change between calls), with semantics along the lines of LLVM's freeze operation as applied to undef (you cannot predict the bit pattern, but you know it won't change). Then, the functions I suggested could have T: AllBitPatternsValid as a bound, so they'd work for u8, but not std::cmp::Ordering, since the latter has bit patterns that are the right size, but are invalid.

This still leaves you in pain if your type doesn't implement AllBitPatternsValid. And it constrains the compiler implementation further, for a case that's better solved by BorrowedBuf.
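For what it's worth, the marker-trait half of that idea can be sketched in today's Rust; the freeze-like semantics for uninit cannot, so this example only shows how such a bound gates a safe function. `AllBitPatternsValid` is the hypothetical trait from above and `from_zero_bits` is an invented helper that uses all-zero bytes as one concrete bit pattern.

```rust
use std::mem::MaybeUninit;

/// Hypothetical marker trait.
///
/// # Safety
/// Implementors promise that every bit pattern of `size_of::<Self>()` bytes
/// is a valid value of the type.
pub unsafe trait AllBitPatternsValid: Copy {}

unsafe impl AllBitPatternsValid for u8 {}
unsafe impl AllBitPatternsValid for u32 {}
unsafe impl<T: AllBitPatternsValid, const N: usize> AllBitPatternsValid for [T; N] {}
// Deliberately no impl for std::cmp::Ordering or bool: some of their bit
// patterns are invalid.

/// Safe because the bound rules out types with invalid bit patterns.
pub fn from_zero_bits<T: AllBitPatternsValid>() -> T {
    // SAFETY: by the trait contract, the all-zero pattern is a valid T.
    unsafe { MaybeUninit::<T>::zeroed().assume_init() }
}

fn main() {
    assert_eq!(from_zero_bits::<u32>(), 0);
    assert_eq!(from_zero_bits::<[u8; 4]>(), [0u8; 4]);
    // from_zero_bits::<std::cmp::Ordering>() would not compile.
}
```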

That statement makes very little sense. The API isn't even stable yet. From the point of view of most of the Rust ecosystem, that means it might as well not exist. And it probably won't exist in a meaningful way for several years to come, given the usual pace of things.

Nightly is a playground for future ideas, and a place for the Rustc developers to shake out the bugs etc. You should even test your crate in CI with nightly to discover regressions early. And it has some useful tools like miri.

But no serious crate would ever depend on nightly. That makes your crate irrelevant to the majority of the Rust ecosystem. A few crates go to the extra effort of having optional functionality on nightly, but that means extra churn as that code breaks down the line.


Uninitialized is not simply some "random but fixed bit pattern". In LLVM it's undef, which infects downstream operations. Branching on uninitialized bits is UB. The bits are allowed to change between each read, which means uninit.read() == uninit.read() can end up being false but the optimizer assumes it's true, which is why it's UB. Real memory can in fact behave that way. A MADV_FREE'd page can represent some bit pattern one time, and then suddenly get zeroed at another time. Perhaps some uninitialized RAM on some embedded systems might also behave this way.

To stabilize the bits on a read we would need freeze, but this is currently not part of Rust's memory model, so we can't expose it.
