mem::uninitialized, `!` and trap representations


#1

Many Rust types have trap representations - bit-patterns that fit into the type’s size, but cause UB when interacted with. For example, an all-zeros bit pattern for a safe reference, or the trivial 0-sized bit-pattern for an empty type (I don’t see any good reason to distinguish the two).
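A minimal sketch of how such invalid bit patterns show up in practice: because an all-zeros pattern is never a valid reference (and zero is never a valid `NonZeroU8`), the compiler can reuse that pattern to encode `None`. The helper function names are made up for illustration.

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

/// `Option<&u8>` needs no extra tag: the null pattern, which is invalid
/// for `&u8`, is used to represent `None`.
fn option_ref_has_no_overhead() -> bool {
    size_of::<Option<&u8>>() == size_of::<&u8>()
}

/// Same idea for an integer type whose zero value is invalid.
fn option_nonzero_has_no_overhead() -> bool {
    size_of::<Option<NonZeroU8>>() == size_of::<NonZeroU8>()
}
```

This is also why producing such a pattern through `mem::uninitialized` or `mem::zeroed` is a problem: the optimizer is allowed to assume it never occurs.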

In our LLVM backend, this is mostly realized through attributes such as !range and !nonnull, which are emitted on callsites and LLVM load instructions - basically at the convenience of the compiler when values of that type are present (i.e. after a vexpr had created/loaded one).

This generally works just fine - garbage values are not used, and the metadata allows for good optimizations. However, there is one use-case where this leads to trouble.

That case is mem::uninitialized. The intrinsic is often assigned to a local variable, to create a memory buffer. The returned garbage is often wrapped inside a struct, as in

struct Data {
    type: u8,
    data: [&'static str; 16]
}

unsafe fn example() -> Data {
    let mut data = Data {
        type: 0,
        data: mem::uninitialized()
    };

    initialize(&mut data);
    data
}

Of course, this also occurs when the type is used generically.

We have to support this sort of pattern - it is very popular, and these buffers are useful for low-level code. On the other hand, we can’t just allow uninitialized values everywhere - that would create too much trouble, as well as ruin our optimizations.

I can’t figure out a clean solution for this. However, there’s a hacky solution that seems to work: allow poisoned and partially-poisoned values (where an enum with a poisoned field is itself a poison value, to allow for invalid-value optimizations), but prevent them from being passed to functions, and prohibit every operation other than constructors and explicit calls to the uninitialized intrinsic from creating them.

This means that moving uninitialized data from place to place, including returning it from a function, is UB. I don’t think this would be too much trouble, but I’m open to usecases.

However, this solution by itself does not restrict references to uninhabited types in any way - you can play with your &! as much as you want, as long as you don’t actually dereference it.


#2

Would the following implementation of initialize be valid?

fn initialize(data: &mut Data) {
    let mut count = 0;
    for i in 0..data.type as usize {
        count += if data.data[0] == data.data[i] { 1 } else { 0 };
    }
    // [do something with count to initialize all of data to not be garbage]
}

Assuming that there’s a written contract that data.type is never greater than the number of initialised values in data.data, I believe this looks superficially ok.

But if this is valid, is the compiler permitted to lift data.data[0] out of the loop as an optimisation?


#3

I don’t think it is true that in the long run this pattern has to be supported well. In particular, if Rust supported this pattern:

fn example() -> Data {
    let data = Data {
        type: 0,
        data: unsafe { initialize() },
    };
    data
}

where initialize() is an extern "C" function or similar, then there would be much less need to support that other pattern.


#4

Sometimes you will want to implement the scatter by yourself. Also, some C functions won’t have the right ABI for this.


#5

Not sure about this. However, you can fix this by using a raw pointer.


#6

The implication is that no matter how widely initialize(&mut data) (where data may contain trap representations) is used, it’s not a pattern that can be supported in general without defeating optimisations - you have to disallow passing references to poisoned values to functions as well.

(I may have misread your post and “prevent them from being passed to functions” is intended to include references as well)


#7

Would that mean that SmallVec<&T> would be UB to use? SmallVec::new currently uses mem::zeroed, but I assume that’s not exactly better in this case.


#8

A variant of SmallVec without an array (e.g. passing a mem::uninitialized::<(usize, &u32)>) is UB even today so I say yes.

However, after reading about this, it might be wiser to remove mem::uninitialized() altogether and force people to use unions with (), although that would not solve the partial struct initialization issue. OTOH, that would definitely require a Rust 2.0, so maybe we can live with SmallVec being UB.


#9

I think the real solution is to teach Rust more about uninitialized values, like with my stateful MIR. This example is one of the reasons why it is necessary to be able to reason about uninitializedness per-field (another is partial drop).


#10

This issue touches upon it

We just have to define semantics a bit more for unions, and they could be good to use for this? Maybe just for #[repr(C)] unions? As pointed out in the issue, a union with () still leaves rustc with the option to use layout optimizations.


#11

I don’t get why people bring up unions with these sorts of things. Unions are dynamic and a barrier to compiler and human reasoning alike. Here the compiler needs to know that while initializedness varies, it varies statically and is thus known and constant at each point of the CFG.


#12

Ok. I have a different usecase then, where I need a region where the uninitialized part is varying dynamically.


#13

That would solve the “alternate RVO ABI” problem. However, it would not solve the “gather^Wscatter” problem, or the “SmallVec” problem. Using a union with () nicely solves these problems, but is a breaking change and requires an unstable feature.


#14

See, for example, here. This is essentially a small vector, as others have mentioned. If returning uninitialized values were unsafe behavior, it would totally ruin modularity.

In general, if a field is not supposed to be used at all in unsafe code, I want to make that field uninitialized, not just for performance but for easier debugging. Valgrind will complain at uses of uninitialized values, but not at uses of “sane defaults” like 0. Additionally, it just makes code more clear: if I see that something is uninitialized, it is obvious that I shouldn’t use it, but if something is 0, I don’t know what will happen. Declaring it illegal to return uninitialized values makes it illegal to have a new method, which makes modularity worse and leads to more bug-ridden code: I have to separately verify every place where an object is constructed to make sure that the right fields are filled out, when a new method could do that automatically. Passing uninitialized values to functions should also be legal, especially behind pointers.

Essentially, the rules you suggest make it nearly impossible to use partially uninitialized structures in a modular way, and we ought to support those structures.


There are a few pieces of code that I definitely expect to be safe that are not under your rule:

unsafe fn my_uninit<T>() -> T {
    mem::uninitialized()
}
// This is like ptr::read, but takes a reference (so is somewhat safer),
// should compile to the same code locally (the compiler can optimize
// away writes of uninitialized values), and tells anyone reading this
// code that I won't use whatever is behind the reference anymore.
unsafe fn read_final<T>(ptr: &mut T) -> T {
    mem::replace(ptr, mem::uninitialized())
}
// This is literally the implementation of `ptr::read` in the standard
// library today. Note that it involves passing a reference to an
// uninitialized value to a function.
pub unsafe fn read<T>(src: *const T) -> T {
    let mut tmp: T = mem::uninitialized();
    copy_nonoverlapping(src, &mut tmp, 1);
    tmp
}
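For comparison, here is a sketch of the same `read` written against `std::mem::MaybeUninit`, which the standard library gained after this discussion. The name `read_sketch` is hypothetical, and this is not claimed to be how the standard library implements `ptr::read` today. The point is that the temporary is never typed as an initialized `T` until the copy has happened.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical re-implementation of `ptr::read` for illustration.
///
/// # Safety
/// `src` must be valid for reads, properly aligned, and point to an
/// initialized `T`.
pub unsafe fn read_sketch<T>(src: *const T) -> T {
    // The buffer is typed as "maybe uninitialized", so holding garbage in
    // it is not a validity violation.
    let mut tmp = MaybeUninit::<T>::uninit();
    unsafe {
        // Only a raw pointer to the buffer is passed on, not a `&mut T`.
        ptr::copy_nonoverlapping(src, tmp.as_mut_ptr(), 1);
        // The buffer is fully written now, so it may be treated as a `T`.
        tmp.assume_init()
    }
}
```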

I agree that this should be valid, and the compiler should be permitted to lift data.data[0] out of the loop as an optimization. This is actually a place where making the rules for when you can use uninitialized data stricter hurts optimization - unless it is legal to read uninitialized data, the compiler can’t do LICM properly here.


I think the rule for use of invalid data (which includes, but is not limited to, uninitialized data) can be very simple: everything but inspecting the data is valid. Primitive operations on invalid values return arbitrary, possibly invalid results, and branching based on invalid data (as in, e.g., match mem::uninitialized() or if mem::uninitialized()) is undefined behavior. This

  • matches intuition, since it treats uninitialized and invalid data just like anything else.
  • has a nice parametricity property: if a function (e.g. fn foo<T>(bar: T)) is fully generic on a type, then it is valid to pass uninitialized/invalid data to that function. This can be used to be fully confident that functions like ptr::read, ptr::write, mem::replace, mem::swap, Vec::push, etc. are all safe to execute on uninitialized/invalid data.
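A small sketch of the "generic code never inspects its argument" idea, written with `std::mem::MaybeUninit` (added to the standard library after this thread) so that the example itself stays well-defined. The function name `swap_in_uninit` is made up. `mem::swap` is fully generic, so it only moves bytes around and never looks at them.

```rust
use std::mem::{self, MaybeUninit};

fn swap_in_uninit() -> String {
    // One slot holds garbage, the other holds a real value.
    let mut slot: MaybeUninit<String> = MaybeUninit::uninit();
    let mut full: MaybeUninit<String> = MaybeUninit::new(String::from("hello"));

    // `mem::swap` copies bytes without inspecting them, so swapping
    // uninitialized data through it is fine.
    mem::swap(&mut slot, &mut full);

    // `slot` now holds the string; `full` holds the garbage and is never
    // read (a `MaybeUninit` does not drop its contents).
    unsafe { slot.assume_init() }
}
```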

#15

Yes, SmallVec is dynamic. Which is the “gather” problem?


Also, while it’s way, way complex, stateful MIR + type-level nats + existentials can deal with SmallVec.


#16

That’s actually the “scatter” problem. My mistake.

fn pi(index: u32) -> u32 { /* permutation */ }
fn gather() -> [u32; 1024] {
    let mut result: [u32; 1024] = unsafe { mem::uninitialized() };
    for i in 0..1024 {
        // array indices must be usize
        result[pi(i) as usize] = i;
    }
    result
}
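A hedged sketch of how this scatter can be written without ever holding an uninitialized `u32`, using `std::mem::MaybeUninit` (which did not exist when this was posted). The concrete permutation below is an assumption chosen only so the example runs: multiplying by 5 modulo 1024 is a bijection because 5 is coprime to 1024.

```rust
use std::mem::{self, MaybeUninit};

const N: usize = 1024;

/// Hypothetical permutation of 0..1024, used only for illustration.
fn pi(index: u32) -> u32 {
    (index * 5 + 3) % (N as u32)
}

fn scatter() -> [u32; N] {
    // Every element is individually "maybe uninitialized".
    let mut result = [MaybeUninit::<u32>::uninit(); N];
    for i in 0..N as u32 {
        result[pi(i) as usize].write(i);
    }
    // SAFETY: `pi` is a permutation, so every slot was written exactly
    // once, and `MaybeUninit<u32>` has the same layout as `u32`.
    unsafe { mem::transmute::<[MaybeUninit<u32>; N], [u32; N]>(result) }
}
```

The compiler still cannot check that `pi` really is a permutation; that obligation stays with the `unsafe` block.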

#17

Yeah that’s very hard to make safe. The best one could do with stateful MIR is

fn pi(index: u32) -> u32 { /* permutation */ }
fn gather() -> [u32; 1024] {
    let mut result : [Uninit<u32::Size>; 1024] = mem::uninitialized();
    'a {
        let refs = result.map_borrow_out(|r: &out T| r);
        // result: [Borrow<'a, Uninit<_>, u32>; 1024]
        for i in 0..1024 {
            let r = unsafe { copy_instead_of_move(refs[pi(i)]) };
            *r = i;
        }
        unsafe { mem::forget(refs) };
    }
    // result: [u32; 1024]
    result
}

which isn’t much


#18

I think we can solve the SmallVec problem by introducing a MaybeUninitialized library type.

union MaybeUninitialized<T> {
    value: T,
    // to solve https://github.com/rust-lang/rust/issues/36394,
    // can be implemented today with a #[lang="maybe_uninitialized"]
    buf: [u8; mem::size_of::<T>()]
}

impl<T> MaybeUninitialized<T> {
    fn new_uninitialized() -> Self {
        unsafe { mem::uninitialized() }
    }

    fn new(value: T) -> Self {
        MaybeUninitialized { value: value }
    }

    fn get_mut(&mut self) -> *mut T {
        // taking the address of a union field requires `unsafe`
        unsafe { &mut self.value }
    }

    fn get(&self) -> *const T {
        unsafe { &self.value }
    }
    // more APIs might be implemented
}
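To make the SmallVec connection concrete, here is a rough sketch of inline storage built on a maybe-uninitialized wrapper. It uses `std::mem::MaybeUninit`, the type the standard library later added along these lines. `TinyVec` and its capacity of 4 are invented for this example and are not SmallVec's actual implementation.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical fixed-capacity vector standing in for SmallVec's inline buffer.
struct TinyVec<T> {
    len: usize,
    buf: [MaybeUninit<T>; 4],
}

impl<T> TinyVec<T> {
    fn new() -> Self {
        // No `T` is ever created here, so this is fine even for `T = &U`.
        TinyVec { len: 0, buf: std::array::from_fn(|_| MaybeUninit::uninit()) }
    }

    /// Returns the value back if the buffer is full.
    fn push(&mut self, value: T) -> Result<(), T> {
        if self.len == self.buf.len() {
            return Err(value);
        }
        self.buf[self.len].write(value);
        self.len += 1;
        Ok(())
    }

    fn get(&self, i: usize) -> Option<&T> {
        if i < self.len {
            // SAFETY: every slot below `len` was written by `push`.
            Some(unsafe { &*self.buf[i].as_ptr() })
        } else {
            None
        }
    }
}

impl<T> Drop for TinyVec<T> {
    fn drop(&mut self) {
        for slot in &mut self.buf[..self.len] {
            // SAFETY: only initialized slots are dropped.
            unsafe { ptr::drop_in_place(slot.as_mut_ptr()) };
        }
    }
}
```

Which slots are initialized is tracked dynamically through `len`, which is exactly the part the type system cannot see.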

#19

I’m still not sure why enum layout optimization is a bad thing. If we make ELO better, and start to allow it inside of padding, then it’s a nice optimization that

Option<StructWithPadding>
== StructWithPadding
== Option<MaybeInitialized<StructWithPadding>> 
== MaybeInitialized<StructWithPadding>

which could be accomplished with

union MaybeInitialized<T> {
  valid: T,
  invalid: (),
}

and which could be implemented today.
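A sketch of that union in a form current compilers accept. The `ManuallyDrop` wrapper is an addition not present in the post: unions now require it for fields that may need dropping. The method names `uninit`, `write` and `take` are made up for the example.

```rust
use std::mem::ManuallyDrop;

union MaybeInitialized<T> {
    valid: ManuallyDrop<T>,
    invalid: (),
}

impl<T> MaybeInitialized<T> {
    /// Starts out in the `invalid` state; no `T` exists yet.
    fn uninit() -> Self {
        MaybeInitialized { invalid: () }
    }

    /// Overwrites the union with an initialized value. Any previous
    /// value is not dropped, because unions have no drop glue.
    fn write(&mut self, value: T) {
        *self = MaybeInitialized { valid: ManuallyDrop::new(value) };
    }

    /// # Safety
    /// The union must currently hold a `valid` value.
    unsafe fn take(mut self) -> T {
        unsafe { ManuallyDrop::take(&mut self.valid) }
    }
}
```

Whether the compiler may place an enclosing enum's discriminant in the unused bytes of such a union is the layout question being debated here; the sketch does not rely on any particular answer.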


#20

However, this solution by itself does not restrict references to uninhabited types in any way - you can play with your &! as much as you want, as long as you don’t actually dereference it.

I don’t get the point of distinguishing between ! and &! in this way - they’re both equally uninhabited, any representation of either is invalid. If you can have a &! in scope so long as you don’t inspect it then you should be able to have a ! in scope so long as you don’t inspect it.

I think the rule for use of invalid data (which includes, but is not limited to uninitialized data) can be very simple: everything but inspecting the data is valid.

Though if the compiler can no longer assume that data is valid - even when it’s not being inspected - then it can no longer assume that functions which return ! actually diverge. This would fix bugs around using mem::uninitialized::<!> but it would break everything else which uses !.

One solution for this would be to distinguish between safe and unsafe code and say a safe block or function always returns a valid value but an unsafe block or function can return any chunk of bits of the right size. This means that an unsafe function (like mem::uninitialized) can return a ! without diverging but it’s UB to ever read the !.

A much better solution would be to deprecate mem::uninitialized and introduce &uninit pointers and MaybeInitialized, but this would be a lot more backwards-incompatible.

Edit: The conversation around the 2017 roadmap had a strong vibe of “We won’t bother to implement/fix language features or specify semantics until we need to” - but that’s how you get into situations like this. I really think that any deep changes to the language which aren’t purely extensions, things like !, linear types or new pointer types, need to be figured out ASAP before we build any more of an ecosystem on top of this thing.