pre-RFC FromBits/IntoBits

Is it possible to mem::transmute a DST into a Sized type and vice versa? Is it possible to mem::transmute two different DSTs to each other?

No, because mem::transmute operates on values, not references, so both its source and destination types must be Sized.

Huge credit to @comex for figuring out a way to do this today! Here’s the idea:

trait AlignCheck {
    const BAD: u8;
}

// only compiles if align_of::<T>() >= align_of::<U>(), which is what a
// cast from &T to &U requires
impl<T, U> AlignCheck for (T, U) {
    // This is a division by 0 if align_of::<T>() < align_of::<U>(),
    // producing a constant evaluation error
    const BAD: u8 = 1u8 / ((std::mem::align_of::<T>() >= std::mem::align_of::<U>()) as u8);
}

pub unsafe fn unsafe_transmute_ref<T, U>(x: &T) -> &U
{
    let _ = <(T, U) as AlignCheck>::BAD;
    &*(x as *const T as *const U)
}

And it actually works!
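
For comparison, here is a minimal sketch of the same trick using assert! inside an associated constant, which fails with an assertion message rather than a division-by-zero error. The names LayoutCheck and transmute_ref_checked are made up for illustration, and the extra size comparison is an assumption about what one would want here, not something taken from the snippet above.

```rust
use std::mem::{align_of, size_of};

// Hypothetical trait; it only exists to carry the associated constant.
trait LayoutCheck {
    const OK: ();
}

impl<T, U> LayoutCheck for (T, U) {
    // Evaluated for every (T, U) pair that is actually used. A failed
    // assertion becomes a compile-time error for that pair.
    const OK: () = assert!(
        align_of::<T>() >= align_of::<U>() && size_of::<T>() >= size_of::<U>()
    );
}

/// Safety: the bytes behind `x` must form a valid `U`.
pub unsafe fn transmute_ref_checked<T, U>(x: &T) -> &U {
    // Mentioning the constant forces the check for this (T, U).
    let () = <(T, U) as LayoutCheck>::OK;
    unsafe { &*(x as *const T as *const U) }
}
```

Calling it with T = u32 and U = [u8; 4] compiles, while swapping the two types is rejected because [u8; 4] is less aligned than u32.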

Also, they pointed out that:

So that might be an approach we could take as well.

OK, here goes a first draft. A few things to note:

  • I went with FromBits<T> instead of Compatible<T> because I think it’s more descriptive. However, it behaves roughly as Compatible<T> has been proposed here, and there’s no IntoBits<T>.
  • I covered the case in which T is a DST and Self: Sized, but I haven’t yet figured out what to do when Self is a DST.
  • The file is pretty long, so here’s a summary of what’s offered if you just want to skim:
    • FromBits<T> - as described
    • FitsIn<T> - guarantees that T is no smaller than Self
    • AlignedTo<T> - guarantees that Self is as aligned as T
    • transmute - like mem::transmute, but T can be larger than U
    • coerce - like transmute, but safe
    • coerce_{ref,mut}_xxx - coercions from one reference type to another, including variations with both compile- and run-time-verified size and alignment.
    • LayoutVerified - An object whose existence proves that certain size and alignment checking has been performed, allowing for size and alignment checking to be elided in the future when doing coercions.
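
To make the summary a little more concrete, here is a rough sketch of what the marker trait and the safe by-value coerce could look like. This is not the code from the draft: the impls are examples chosen for illustration, and the size requirement is checked with a run-time assertion where the draft aims for a compile-time guarantee.

```rust
use std::mem::{size_of, ManuallyDrop};
use std::ptr;

/// Hypothetical marker trait: every valid `T`, viewed as raw bytes, is also
/// a valid `Self` (looking only at the first `size_of::<Self>()` bytes).
pub unsafe trait FromBits<T> {}

// Example impls. Each one is a promise made by whoever writes it.
unsafe impl FromBits<u32> for [u8; 4] {}
unsafe impl FromBits<[u8; 4]> for u32 {}
unsafe impl FromBits<u32> for [u8; 2] {} // destination smaller than source

/// Safe by-value conversion (assumption: size is checked at run time here).
pub fn coerce<T, U: FromBits<T>>(x: T) -> U {
    assert!(size_of::<U>() <= size_of::<T>());
    // Keep `x` from being dropped: its bytes now belong to the result.
    let x = ManuallyDrop::new(x);
    // SAFETY: `U: FromBits<T>` promises the bytes form a valid `U`, the
    // assertion keeps the read in bounds, and `read_unaligned` places no
    // alignment requirement on the source pointer.
    unsafe { ptr::read_unaligned(&*x as *const T as *const U) }
}
```

The third impl shows the case where the source is larger than the destination, which is the one thing mem::transmute refuses to do.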

I’d love any feedback you have! I’d also be interested to know whether you can think of any use cases for transmute. The only difference between it and mem::transmute is that T can be larger than U, and @cramertj feels that its presence is unjustified. If we can’t think of any use cases, then I agree.

Interesting question from @cramertj: Is it safe to have unsafe impl<T> FromBits<T> for [u8]? You might expect that the answer is obviously yes since any random set of bytes is a valid byte slice, however…

  • In other languages, it can be UB to read an uninitialized value. Some notable quotes:
    • “Reading an uninitialized CPU register on Itanium is the best example of a hardware-induced crash covered by this rule.”
    • “Reading uninitialized memory by an lvalue of type unsigned char does not trigger undefined behavior. The unsigned char type is defined to not have a trap representation, which allows for moving bytes without knowing if they are initialized.”
    • “However, on some architectures, such as the Intel Itanium, registers have a bit to indicate whether or not they have been initialized. The C Standard, 6.3.2.1, paragraph 2, allows such implementations to cause a trap for an object that never had its address taken and is stored in a register if such an object is referred to in any way.”
  • In C, it is always safe to read uninitialized memory as unsigned char *, but not as anything else. So maybe this is safe precisely because we’re implementing it for [u8]? I suspect (though can’t find a reference) that LLVM has a notion of character type, and so this question comes down to whether u8 is considered a character type by LLVM.

The criterion for deciding this question is Rust’s semantics. What LLVM, other languages, and CPUs do is only relevant in two respects:

  • What LLVM and CPUs do might prevent us from implementing some particular semantics efficiently. (NB: LLVM does not have a notion of “character type”.)
  • The reasons why other languages are aggressive about uninitialized memory (e.g., optimizations enabled by it) might also be relevant for Rust.

Aside: why does this question lead to considering uninitialized memory? It seems to me the problem is padding – which at the end of the day is probably physical memory that isn’t written to. However, when talking about language semantics, it’s perfectly possible and perhaps even advisable to distinguish padding bytes from non-padding bytes.

Regardless, the meaning (or lack thereof) of reads from uninitialized memory is a broader question whose answer is part of the unsafe code guidelines. Unfortunately it appears this particular question hasn’t been addressed yet. There are multiple threads touching on the subject here in this forum, but as far as I remember it’s never been in focus for the working group.

Well that depends.

I think that you can always implement this safely for types without padding bytes (e.g. repr(packed), or maybe even repr(C)?), and also for all types if the implementation does not touch the padding bytes in which case the size of the [u8] might be smaller than the size of T.

If you are talking about implementing FromBits<T> for [u8] by memcpying all the bytes for types with Rust layout including padding bytes then, as @hanna-kruppe says, that will depend on whether reading a padding byte is a read from uninitialized memory or not (cc @ubsan @RalfJung).

In any case, given that the layout of types with Rust layout is unspecified (e.g. the compiler can reorder fields at will), you will run into issues when serializing/deserializing with different Rust versions on top of the endianness issues that you get when serializing/deserializing on different architectures.

At that point you might as well restrict that blanket impl to repr(C) types and call it a day.

The question isn’t whether it’s safe for some T, but rather whether a blanket impl for all T - unsafe impl<T> FromBits<T> for [u8] {} - is safe.

I agree, although it sounds like LLVM might at the very least give us safety for “character types.” It’s obviously another question whether Rust formally guarantees that u8 corresponds to an LLVM “character type.” It doesn’t sound like such a guarantee is made right now, but it’s not clear to me whether this is one of those “Rust’s memory model is undefined anyway, and this is a pretty safe bet” things or one of those “you really shouldn’t be relying on it, as there’s a meaningful chance that might not be part of a Rust memory model that is formalized in the future” things.

That’s true, but the question is only about whether it will cause UB.

I’m not sure that’d be sufficient, as @dtolnay suggests in Does repr(C) define a trait I can use to check structs were declared with #repr(C)? that repr(C) types can contain non-repr(C) types. Plus, as that thread describes, there’s no trait for repr(C) types anyway.

The blanket impl still isn’t safe because it allows you to go from uninitialized -> &[u8] -> &T where T: FromBits<[u8]>.

cc @RalfJung – halp!

You can use union MaybeInit<T> { uninit: (), x: T } to get uninitialized memory and then get a &[u8] from that if FromBits<MaybeInit<Foo>> is implemented for [u8].
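
Spelled out, the union could look roughly like this. The Copy bound and #[repr(C)] are assumptions added so that the sketch compiles on stable Rust and has a predictable layout, and the constructor and accessor names are made up.

```rust
use std::mem::{align_of, size_of};

/// Either nothing at all or a `T`, sharing the same storage.
#[repr(C)]
pub union MaybeInit<T: Copy> {
    uninit: (),
    x: T,
}

impl<T: Copy> MaybeInit<T> {
    /// Safe to call, yet none of the bytes that would hold `x` are
    /// initialized afterwards.
    pub fn empty() -> Self {
        MaybeInit { uninit: () }
    }

    pub fn new(x: T) -> Self {
        MaybeInit { x }
    }

    /// Safety: the union must have been created with `new`.
    pub unsafe fn assume_init(self) -> T {
        unsafe { self.x }
    }
}

/// Size and alignment of the union for a given payload type.
pub fn layout_of<T: Copy>() -> (usize, usize) {
    (size_of::<MaybeInit<T>>(), align_of::<MaybeInit<T>>())
}
```

The point is that MaybeInit::<u16>::empty() is a perfectly valid value produced by safe code, has the same size and alignment as u16, and contains no initialized bytes, so a blanket impl would hand those bytes out as a [u8].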

Just summarizing the recent discussion with @hanna-kruppe and @mbrubeck on IRC as I understood it.

whether a blanket impl for all T - unsafe impl<T> FromBits<T> for [u8] {} - is safe.

Without a memory model, it’s impossible to say anything completely accurate here.

But chances are that this is safe for all types if you implement it properly, preferably by just using ptr::{read,write}. These functions might be special in the memory model if they can read uninitialized memory and copy it into a destination without making the memory in the destination initialized.

That is, the impl itself is safe; what would introduce undefined behavior is reading the padding bytes from the resulting [u8] without going through ptr::{read,write} again. So if you coerce from T to U using that impl, everything is OK as long as accessing U’s fields never reads any of T’s padding bytes.
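
A small sketch of that discipline, under a few assumptions: Padded is a made-up type, the destination is an array of MaybeUninit<u8> rather than plain u8 so that padding never has to count as initialized, and the copy goes through ptr::copy_nonoverlapping, whose documentation describes it as an untyped copy that preserves the initialization state of each byte.

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Padded {
    pub a: u8,
    // Padding bytes sit here wherever `u32` is more than byte-aligned.
    pub b: u32,
}

pub const N: usize = size_of::<Padded>();

/// Copies every byte of `p`, padding included, without asserting that any
/// of them is initialized.
pub fn raw_bytes(p: &Padded) -> [MaybeUninit<u8>; N] {
    let mut out = [MaybeUninit::<u8>::uninit(); N];
    // SAFETY: both pointers are valid for `N` bytes and do not overlap.
    unsafe {
        ptr::copy_nonoverlapping(
            p as *const Padded as *const MaybeUninit<u8>,
            out.as_mut_ptr(),
            N,
        );
    }
    out
}

/// Reads back only the byte that stores `a` (offset 0 under `repr(C)`).
pub fn field_a_from_bytes(bytes: &[MaybeUninit<u8>; N]) -> u8 {
    // SAFETY: byte 0 was copied from the initialized field `a`.
    unsafe { bytes[0].assume_init() }
}

/// Reads back only the bytes that store `b` (the last four bytes).
pub fn field_b_from_bytes(bytes: &[MaybeUninit<u8>; N]) -> u32 {
    let offset = N - size_of::<u32>();
    // SAFETY: these bytes were copied from the initialized field `b`.
    unsafe { ptr::read_unaligned(bytes.as_ptr().add(offset) as *const u32) }
}
```

Neither accessor ever looks at the bytes between a and b, which is exactly the condition described above.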

@cramertj

The blanket impl still isn’t safe because it allows you to go from uninitialized -> &[u8] -> &T where T: FromBits<[u8]>.

For that you would need an impl<T> FromBits<&[u8]> for &T where T: FromBits<[u8]>, but adding this impl would be incorrect since not all &[u8] bit patterns are valid &T patterns (e.g. null), right?

This conversion doesn’t require that impl because of the coerce_ref_size_checked function. This function allows going from &T to &U where U: FromBits<T> and T and U have the same alignment (coerce_ref_size_align_checked will dynamically check both size and alignment).

Is there perhaps room for discussion of conversions which are safe on values (e.g., using ptr::{read,write}) but not on references? In the current draft, I just assume that U: FromBits<T> implies that you can convert references safely (with some size and alignment conditions), but maybe we don’t want to make that assumption? I feel like the assumption is still valid if we rule out uninitialized memory, but uninitialized memory itself seems to be what’s tripping up that model.

@cramertj I think I don’t follow the reasoning here, could you go step by step?

The unsafe impl<T> FromBits<T> for [u8] {} allows you to go from a T to a sized slice of u8s: [u8]. One can then take a reference and get a &[u8], and then one can call an API that gives you a &U from the &[u8] somehow. If T has padding bytes, as long as U has padding bytes in the same locations, there is no way to read the padding bytes of T via the &U.

So how can that be unsafe? What am I missing?

What if U doesn’t have padding bytes in the same locations? It would still be the case that any valid T is a valid [u8], and that, separately, any valid (initialized) [u8] is a valid U, but it’s not the case that any only partially-initialized [u8] is a valid U. And that’s the crux of the problem - the current FromBits definition implies transitivity of references (if &T -> &U and &U -> &V then &T -> &V), but it seems that such transitivity is actually unsound given that [u8] can safely have uninitialized memory while other types can’t.

Note that this wouldn’t be a problem if we removed the exception for [u8], because &T -> &U would only be safe if none of U's data fields overlapped with T's padding, and so &U -> &V really would safely imply &T -> &V.

Sure! If we have unsafe impl<T> FromBits<T> for [u8] {}, then we get an impl FromBits<MaybeInit<u16>> for [u8] {}. Presumably we also have an impl FromBits<[u8]> for u16 {}. Then we can create &MaybeInit::empty() and pass it to coerce_ref_size_checked to get a &[u8]. Then we can pass the &[u8] to coerce_ref_size_align_checked, which sees that the size and alignment match and gives an &u16 out, which now points to the original uninitialized memory.
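
The shape of that chain can be sketched in a few lines. Everything here is a stand-in: as_bytes and from_bytes are simplified substitutes for coerce_ref_size_checked and coerce_ref_size_align_checked, not the functions from the draft, and the only impls provided are for u16, which has no padding, so nothing uninitialized is ever touched.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical marker trait, allowing unsized types on either side.
pub unsafe trait FromBits<T: ?Sized> {}

/// `&T -> &[u8]`, available whenever `[u8]: FromBits<T>`.
pub fn as_bytes<T>(t: &T) -> &[u8]
where
    [u8]: FromBits<T>,
{
    // SAFETY: relies on the impl's promise that every byte of `T` is a
    // valid, initialized `u8`.
    unsafe { std::slice::from_raw_parts(t as *const T as *const u8, size_of::<T>()) }
}

/// `&[u8] -> &U`, with size and alignment checked at run time.
pub fn from_bytes<U: FromBits<[u8]>>(bytes: &[u8]) -> Option<&U> {
    let aligned = (bytes.as_ptr() as usize) % align_of::<U>() == 0;
    if bytes.len() != size_of::<U>() || !aligned {
        return None;
    }
    // SAFETY: size and alignment were just checked, and `U: FromBits<[u8]>`
    // promises that any initialized bytes form a valid `U`.
    Some(unsafe { &*(bytes.as_ptr() as *const U) })
}

/// The two steps composed; safe code can call this for any such `T` and `U`.
pub fn round_trip<T, U>(t: &T) -> Option<&U>
where
    [u8]: FromBits<T>,
    U: FromBits<[u8]>,
{
    from_bytes(as_bytes(t))
}

// Harmless on their own: `u16` has no padding and no invalid bit patterns.
unsafe impl FromBits<u16> for [u8] {}
unsafe impl FromBits<[u8]> for u16 {}
```

With a blanket impl for [u8], round_trip would also accept a &MaybeInit<u16> as input and return a &u16 to the same uninitialized bytes, all without the caller writing unsafe.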

That is the crux of the problem indeed. It now makes sense to me that the blanket impl is unsafe.

Having thought about it some more, I’m not sure that there’s a solution other than to remove the blanket impl and pretend that reading uninitialized memory with a [u8] is unsafe. Concretely, my idea for having U: FromBits<T> not imply &U: FromBits<&T> runs into the exact same problems. Consider this straw man proposal…

  • U: FromBits<T> doesn’t imply &U: FromBits<&T>
  • If &U: FromBits<&T>, and either U or T is a DST, then do something reasonable. I’m not sure what that would be, but it’s not relevant to this strawman.
  • There’s a blanket impl, unsafe impl<T> FromBits<&T> for &[u8] {}.
  • It is never safe to have unsafe impl FromBits<&[u8]> for MyT {} because of the possibility of reading uninitialized memory as MyT, which is UB.

This isn’t very useful because interpreting a random byte slice as another type is one of the most important use cases of this stuff. Thus, consider another straw man in which we allow for reinterpreting byte slices:

  • Unlike before, there is no blanket impl for &[u8], and we require that unsafe impl FromBits<&MyT> for &[u8] {} is only safe if MyT doesn’t have any padding/isn’t an enum/etc.
  • Now it’s safe to have unsafe impl FromBits<&[u8]> for &MyT {} for some values of MyT.

However, this latter proposal is essentially exactly what we have already, only less ergonomic because U: FromBits<T> doesn’t imply &U: FromBits<&T>.

Thus, I propose the following:

  • U: FromBits<T> implies &U: FromBits<&T> as it does now.
  • [u8]: FromBits<MyT> is only valid if every valid value of MyT contains no uninitialized memory.
  • As a result, unsafe impl FromBits<[u8]> for MyT {} is sometimes safe.
  • For convenience, we add a function to coerce T into [u8] by copying bytes, which results in the [u8] being completely initialized, but we do not provide a function to coerce &T into &[u8].
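
One way the by-value function from the last bullet could end up with fully initialized bytes is to start from a zeroed buffer and copy each field into place. The sketch below does this by hand for a made-up Header type. A generic version would need per-type knowledge of where the fields live, which is not worked out here.

```rust
use std::mem::size_of;

#[repr(C)]
pub struct Header {
    pub tag: u8,
    // Padding is expected here wherever `u32` is more than byte-aligned.
    pub len: u32,
}

pub const HEADER_SIZE: usize = size_of::<Header>();

/// Copies `h` into a byte array in which every byte is initialized: field
/// bytes come from `h`, padding positions are zero.
pub fn to_init_bytes(h: &Header) -> [u8; HEADER_SIZE] {
    let mut out = [0u8; HEADER_SIZE];
    // Under `repr(C)` the first field sits at offset 0.
    out[0] = h.tag;
    // `len` is the last field, so it occupies the final four bytes.
    let len_offset = HEADER_SIZE - size_of::<u32>();
    out[len_offset..].copy_from_slice(&h.len.to_ne_bytes());
    out
}
```

Because the result is an owned array rather than a reference into the original value, there is no way to reach the original padding through it.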

Note that this isn’t all that bad with transitivity, since if FromBits<T> for [u8] can be safely implemented manually for a type T, that’s the only impl that a user must manually provide (all other impls follow from that one due to transitivity).

I’m not sure I’d say all other impls, but definitely a lot. That’s a good point.

Ah, right, thanks.

These functions are not special in this sense; any loads and stores should have this property. ptr::{read,write} are at most special in that they are the primitive way to express “write without dropping old contents first” (in contrast to *p = x; which drops) and “read without concern for move semantics” (in contrast to let x = *p; which only works for Copy types). But this isn’t really true either, at least of ptr::read (check out its implementation, you could write that in stable Rust today).
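
The difference in drop behaviour is easy to observe with a type that counts its own drops. Noisy and drops_observed are made-up names for this illustration.

```rust
use std::cell::Cell;
use std::ptr;
use std::rc::Rc;

/// Increments a shared counter every time a value is dropped.
pub struct Noisy(pub Rc<Cell<u32>>);

impl Drop for Noisy {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}

/// Returns the drop count after `*p = x`, after `ptr::write`, and after the
/// value taken out with `ptr::read` is finally dropped.
pub fn drops_observed() -> (u32, u32, u32) {
    let drops = Rc::new(Cell::new(0));
    let mut slot = Noisy(drops.clone());
    let p: *mut Noisy = &mut slot;

    // Plain assignment drops the old contents before storing the new value.
    unsafe { *p = Noisy(drops.clone()) };
    let after_assign = drops.get();

    // `ptr::read` copies the value out without regard for move semantics,
    // and `ptr::write` stores a new one without dropping what was there.
    let old = unsafe { ptr::read(p) };
    unsafe { ptr::write(p, Noisy(drops.clone())) };
    let after_write = drops.get();

    // The copy taken by `ptr::read` is now the only owner of the old value.
    drop(old);
    let after_drop_old = drops.get();

    (after_assign, after_write, after_drop_old)
}
```

The count goes up on the assignment, stays put across the read and write pair, and goes up again only when the old value is dropped explicitly.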


Aside from that, the IRC discussion @gnzlbg alludes to was about raw pointer loads and stores. When you add references in the mix, it gets a lot more complicated. While it’s probably fine to cast a *const T to a *const [u8; size_of::<T>()] and copy around any of those bytes even if they’re padding or uninitialized, a &[u8] to the same memory is a rather different story. References have quite strong invariants both about their address (+ metadata, if any) and about the contents of the memory they point at – the main open questions are about when exactly these invariants are asserted (e.g., “all the time”, “at function call boundaries”, “when you load or store through the reference”, etc.).

To say that a &[u8] to padding or uninitialized memory is fine amounts to saying there is no such thing as uninitialized or padding memory: every byte of memory is always one of 256 possible values and it’s safe to do anything with any of these bytes. I don’t want to rehash that debate here, but this option is very radical and, I believe, unacceptable for Rust because of how much it constrains the optimizer.

Thus, I don’t think a &T -> &[u8] conversion could be safe for all T even if there was then no way to reinterpret the &[u8] as any other type (e.g. imagine it was a trait object for trait Blob { fn get_byte(&self, idx: usize) -> u8; }). Or, put differently, there needs to be some way to correctly handle padding and uninitialized memory in unsafe code, but I don’t see any (desirable) way to give safe code such capabilities.