pre-RFC FromBits/IntoBits

It’s a bit hard for me to visualize it, could you show a code example?

Sure.

// NOTE: need to verify that size_of::<T>() == size_of::<U>().
// How to do that is an open question in the pre-RFC. (The
// [u8; mem::size_of::<T>()] arrays below also require const-generic
// expressions the language doesn't support yet; this is
// illustrative pseudocode.)
fn safe_transmute<T, U: ArbitraryBytesSafe>(t: T) -> U {
    unsafe {
        // First, convert t to its underlying bytes. This effectively
        // mem::forget's it. Could also drop first. Now we just have
        // a meaningless pile of bytes.
        let bytes = mem::transmute::<T, [u8; mem::size_of::<T>()]>(t);
        // Second, convert the pile of bytes to a U. We know this is
        // safe because U: ArbitraryBytesSafe. We could have done both
        // of these steps at once (transmuting T to U), but this is more
        // illustrative of why it's safe to do the conversion.
        mem::transmute::<[u8; mem::size_of::<T>()], U>(bytes)
    }
}

I see. So there are some pairs of types for which safe_transmute is not bidirectional. For example, m32x4 and m16x8 can be safely transmuted into i32x4, but i32x4 cannot be safely transmuted into either m32x4 or m16x8. I guess that would be handled by having i32x4 derive ArbitraryBytesSafe while leaving m32x4 and m16x8 without an implementation: since all three types have the same size, m32x4 and m16x8 can be safely transmuted into i32x4, but since the mask types don't implement ArbitraryBytesSafe, i32x4 cannot be safely transmuted into either of them. So far so good.

However, m32x4 can be safely transmuted into m16x8, while m16x8 cannot be safely transmuted into m32x4. I wonder how that could be handled by ArbitraryBytesSafe without also allowing safe transmutes from i32x4 into m32x4 or m16x8.
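To see the asymmetry concretely, here's a small sketch using plain integers as stand-ins for the mask types (the real m16x8/m32x4 are SIMD types; the bit-level argument is the same):

```rust
fn main() {
    // A valid m16x8-style lane pair: first 16-bit lane all-ones,
    // second 16-bit lane all-zeros.
    let lanes16: [u16; 2] = [0xFFFF, 0x0000];

    // Reinterpret the same bytes as a single 32-bit lane.
    let bytes: [u8; 4] = unsafe { std::mem::transmute(lanes16) };
    let lane32 = u32::from_ne_bytes(bytes);

    // The 32-bit lane is neither all-zeros nor all-ones (regardless of
    // endianness), so it would not be a valid m32x4 lane.
    assert!(lane32 != 0 && lane32 != u32::MAX);
}
```

The reverse direction always works: splitting an all-zeros or all-ones 32-bit lane in half yields two lanes that are each all-zeros or all-ones.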

So my thinking was that ArbitraryBytesSafe would be somewhat related to but also somewhat orthogonal to FromBits/IntoBits. In particular, you could imagine the following holding:

  • If something is ArbitraryBytesSafe, then it is FromBits<T> for arbitrary T
  • If something is ArbitraryBytesSafe, then immutable references to it are FromBits<&T> for arbitrary T
  • If two things, T and U, are both ArbitraryBytesSafe, then mutable references to T are FromBits<&mut U> and vice versa
  • If something is FromBits<[u8; size_of::<Self>()]>, then it is ArbitraryBytesSafe.

And I could also imagine other blanket/default (I'm bad with terminology) impls holding for other combinations. But the example you mentioned would be naturally handled by only writing blanket/default impls that are definitely sound. For example, it sounds like m32x4 shouldn't be ArbitraryBytesSafe; if it were, then it would be safe to transmute m16x8 into m32x4.
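As a sketch of how those relationships might look as marker traits (names from this thread; sizes and alignment are deliberately not checked here, which is exactly the open question):

```rust
// Marker-trait sketch only: this shows how the blanket impls could nest,
// not how the size-equality requirement would be enforced.
unsafe trait ArbitraryBytesSafe {}
unsafe trait FromBits<T> {}

// If U is ArbitraryBytesSafe, it is FromBits<T> for arbitrary T
// (of the same size, which this sketch does not check).
unsafe impl<T, U: ArbitraryBytesSafe> FromBits<T> for U {}

// Any 4 bytes are a valid u32.
unsafe impl ArbitraryBytesSafe for u32 {}

// Stand-in for m32x4: not every bit pattern is valid, so no
// ArbitraryBytesSafe impl, and hence no blanket FromBits impls.
#[allow(dead_code)]
struct M32x4([u32; 4]);

fn requires_from_bits<T, U: FromBits<T>>() {}

fn main() {
    requires_from_bits::<i32, u32>(); // compiles via the blanket impl
    // requires_from_bits::<i32, M32x4>(); // would not compile
}
```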

Note that there are also concerns around alignment that are trickier when you're trying to transmute references than when you're only transmuting values. My proposal focuses on references because the goal is to enable safe zero-copy deserializing, but I don't think you have to deal with those concerns here because you're doing everything by value. There are also concerns, when consuming by value, around whether you drop or forget the input.

Ah! I thought that ArbitraryBytesSafe should solve the whole problem!

I don't know how I feel about having 2 mechanisms to achieve almost the same thing, but not quite.

In particular, @jmst proposed a Compatible trait (read from here downwards: pre-RFC FromBits/IntoBits - #23 by gnzlbg) that appears to solve all problems better than FromBits/IntoBits and ArbitraryBytesSafe.

So I wonder: why can't ArbitraryBytesSafe just be expressed as Compatible<[u8; mem::size_of::<T>()]> or similar instead?

I don't think it can, because that is strictly less powerful than FromBits/IntoBits. In particular, FromBits/IntoBits can express the idea that "any valid instance of one type has a bit pattern which is also valid for another type" even when that other type is not ArbitraryBytesSafe. For my particular use cases I don't need anything that powerful, but it sounds like you do.
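A sketch of that extra expressiveness, using hand-rolled stand-ins for the mask types:

```rust
// Marker-trait sketch; the mask types here are illustrative stand-ins,
// not the real SIMD types.
unsafe trait FromBits<T> {}

#[allow(dead_code)]
struct M32x4([u32; 4]); // invariant: each 32-bit lane is all-0 or all-1
#[allow(dead_code)]
struct M16x8([u16; 8]); // invariant: each 16-bit lane is all-0 or all-1

// Every valid m32x4 bit pattern is also a valid m16x8 bit pattern, so
// this one-way impl is sound -- even though neither type accepts
// arbitrary bytes, so neither could be ArbitraryBytesSafe.
unsafe impl FromBits<M32x4> for M16x8 {}

fn check<T, U: FromBits<T>>() {}

fn main() {
    check::<M32x4, M16x8>(); // compiles
    // check::<M16x8, M32x4>(); // would not compile: no impl
}
```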

Have you thought about having a custom derive to cover the gaps? If I've written a type that I'd like to be Convert<T>, but it has private fields, then I'm forced to unsafe impl Convert<T> for MyType {}. The thing is, it's not actually memory unsafety that's the issue here, but invariants on the values of my fields, so having to invoke unsafe here feels wrong and dangerous since it might let you not only break your contract, but actually introduce memory unsafety. A custom derive would give us something like "dear custom derive, I know that I'm OK with converting from T for the purposes of my invariants, so could you verify for me that it would be memory safe?"

Also, it looks like there hasn't been any discussion about converting references. I'm interested in zero-copy deserializing, so being able to convert references would be huge. In particular, I can imagine the following:

  • pub fn safe_transmute_ref<T, U>(x: &T) -> &U where U: Compatible<T> { ... }
  • pub fn safe_transmute_mut<T, U>(x: &mut T) -> &mut U where U: Compatible<T>, T: Compatible<U> { ... }

As discussed in Pre-RFC: Trait for deserializing untrusted input, there are still issues with verifying alignment, but it'd be a very powerful feature to have in general.
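For reference, a minimal sketch of what safe_transmute_ref could look like, with Compatible as a hypothetical marker trait. Note that this is only sound when U's alignment requirement doesn't exceed T's, which this sketch does not check:

```rust
// Hypothetical marker trait from the thread: U: Compatible<T> promises
// that any valid T bit pattern is also a valid U bit pattern.
unsafe trait Compatible<T> {}

// CAVEAT: sound only if align_of::<U>() <= align_of::<T>(); this
// sketch does not verify that.
fn safe_transmute_ref<T, U: Compatible<T>>(x: &T) -> &U {
    unsafe { &*(x as *const T as *const U) }
}

// Any valid u32 bit pattern is a valid i32 bit pattern.
unsafe impl Compatible<u32> for i32 {}

fn main() {
    let x: u32 = 0xFFFF_FFFF;
    let y: &i32 = safe_transmute_ref(&x);
    assert_eq!(*y, -1);
}
```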

Not really, but why can't that be done with a custom derive? If it can, then it can just be done on a crate, without having to write any kind of RFC for it. The proposal ensures transitivity, so this derive doesn't really need any kind of compiler support, it is purely syntactic.

Also, it looks like there hasn’t been any discussion about converting references.

One might be able to solve this with some blanket impls for references and raw pointers:

impl<T, U> Compatible<&T> for &U where U: Compatible<T> {}
impl<T, U> Compatible<&mut T> for &U where U: Compatible<T> {}
impl<T, U> Compatible<&mut T> for &mut U where U: Compatible<T>, T: Compatible<U> {}
// ... and likewise for *const T / *mut T ...

there are still issues with verifying alignment

Which issues? For practical purposes transmute is just a memcpy, so if the source type is properly aligned and the destination type is properly aligned, which they must be, then there aren't any alignment issues AFAICT. The same applies to "endianness": since transmute is just a memcpy, the bytes will just be copied from the source to the destination. If you don't take endianness into account when reinterpreting the bytes on the destination, you will get different results on big endian and little endian systems, but that is just how memcpy works, so that's "working as intended".
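The memcpy point can be seen directly: transmuting a u32 to bytes copies them verbatim, and only the interpretation of those bytes is endian-dependent:

```rust
fn main() {
    let x: u32 = 0x1122_3344;

    // transmute is just a byte copy: identical to to_ne_bytes.
    let bytes: [u8; 4] = unsafe { std::mem::transmute(x) };
    assert_eq!(bytes, x.to_ne_bytes());

    // What differs across architectures is only the interpretation:
    if cfg!(target_endian = "little") {
        assert_eq!(bytes, [0x44, 0x33, 0x22, 0x11]);
    } else {
        assert_eq!(bytes, [0x11, 0x22, 0x33, 0x44]);
    }
}
```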

Yeah, and I know that there was a proposal to have v0 of this just require manual impls. If you can have a custom derive, then v0 could use the custom derive (so that folks don't have to unsafe impl manually, which introduces a risk of unsafety), and have v1 be the move from custom derive to compiler-supported auto trait.

I'm referring to alignment of references. impl<T, U> Compatible<&T> for &U where U: Compatible<T> {} isn't safe in general because U may have higher alignment requirements than T, so it's not actually guaranteed that any valid &T is also a valid &U. Pre-RFC: Trait for deserializing untrusted input discusses some options here, but if you don't have compiler support, then your options are either somewhat unergonomic (use a macro that uses static_assert! under the hood) or unsafe (since the caller needs to verify alignment manually).

This makes sense.

At this point doing this in the compiler looks like the most appealing solution to me, and is something that Compatible<T> could do for references, for example: if T is compatible with U and some alignment conditions hold, then &T is compatible with &U.

Maybe one day we will be able to do where mem::align_of::<T>() >= mem::align_of::<U>() in the language, but this won't be the case in the near future since we need more than just const generics for this.

Yeah, my stopgap idea was to use macros to do something like:

pub unsafe fn transmute_ref<T, U>(x: &T) -> &U where U: Compatible<T> { ... }

macro_rules! transmute_ref {
    // The arm expands to a block so that the assertion and the call
    // form a single expression.
    ($x:expr, $T:ty, $U:ty) => ({
        static_assert!(::std::mem::align_of::<$T>() >= ::std::mem::align_of::<$U>());
        unsafe { transmute_ref::<$T, $U>($x) }
    });
}
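For reference, on current Rust this stopgap can be written without a separate static_assert! macro, using an inline const item and the const-compatible assert! (stable since Rust 1.57). Compatible and transmute_ref_unchecked here are hypothetical stand-ins for the thread's proposal:

```rust
// Hypothetical marker trait, as in the thread.
unsafe trait Compatible<T> {}

unsafe fn transmute_ref_unchecked<T, U: Compatible<T>>(x: &T) -> &U {
    unsafe { &*(x as *const T as *const U) }
}

macro_rules! transmute_ref {
    ($x:expr, $T:ty, $U:ty) => {{
        // Compile-time alignment check: compilation fails if violated.
        const _: () = assert!(
            std::mem::align_of::<$T>() >= std::mem::align_of::<$U>()
        );
        unsafe { transmute_ref_unchecked::<$T, $U>($x) }
    }};
}

// Any 4 bytes are a valid [u8; 4], so this is sound.
unsafe impl Compatible<u32> for [u8; 4] {}

fn main() {
    let x: u32 = u32::from_ne_bytes([1, 2, 3, 4]);
    let bytes: &[u8; 4] = transmute_ref!(&x, u32, [u8; 4]);
    assert_eq!(*bytes, [1, 2, 3, 4]);
}
```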

Another question: What about DSTs? In particular:

  • If T is a DST and U: Sized, then what does U: Compatible<T> mean?
  • If T and U are both DSTs, then what does U: Compatible<T> mean?

Some ideas:

  • If T is a DST and U: Sized, then
    • size_of_val(t) == size_of::<U>() implies that t's bits correspond to a valid U.
    • size_of_val(t) > size_of::<U>() might imply that t's bits correspond to a valid U? Is this always safe?
  • If T and U are both DSTs, then
    • If you have an existing u: U, then any t: T of size size_of_val(u) is a valid U (in other words, the existence of u is proof that size_of_val(u) is a valid size for U)
    • Maybe it’s the case that any t: T whose size corresponds to a valid size for U is a valid U? What does "valid size for U" even mean? Can we query it at compile or run time?

There’s a caveat here, which is that since we’re defining what Compatible means, the questions of “is this safe?” are somewhat up to how we define Compatible. I have a vague intuition for how this would work with [T] and composite types ending in [T]. I have essentially no idea how this would work for trait objects. I’d love to hear some thoughts on all of this.
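For the [T] case specifically, here's a sketch of the one direction that seems uncontroversial: viewing any &[T] as a byte slice whose length is size_of_val of the input (ignoring, for this sketch, the padding and uninitialized-byte questions raised later in the thread; this sidesteps trait objects entirely):

```rust
use std::mem;

// Sketch for the `[T]` DST case: any &[T] can be viewed as a &[u8]
// covering exactly size_of_val(s) bytes. (For a T with padding bytes,
// whether reading those bytes is allowed is the open question below.)
fn as_bytes<T>(s: &[T]) -> &[u8] {
    unsafe { std::slice::from_raw_parts(s.as_ptr() as *const u8, mem::size_of_val(s)) }
}

fn main() {
    let lanes: [u16; 4] = [1, 2, 3, 4];
    let bytes = as_bytes(&lanes);
    // 4 lanes * 2 bytes each = 8 bytes.
    assert_eq!(bytes.len(), mem::size_of_val(&lanes));
    assert_eq!(mem::size_of_val(bytes), mem::size_of_val(&lanes[..]));
}
```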

Is it possible to mem::transmute a DST into a Sized type and vice-versa ? Is it possible to mem::transmute two different DSTs to each other?

No because mem::transmute operates on values, not references. So both its arguments must be Sized.

Huge credit to @comex for figuring out a way to do this today! Here's the idea:

trait AlignCheck {
    const BAD: u8;
}

// only compiles if align_of::<T>() >= align_of::<U>()
impl<T, U> AlignCheck for (T, U) {
    // This is a division by 0 if align_of::<T>() < align_of::<U>(),
    // producing a constant evaluation error
    const BAD: u8 =
        1u8 / ((std::mem::align_of::<T>() >= std::mem::align_of::<U>()) as u8);
}

pub unsafe fn unsafe_transmute_ref<T, U>(x: &T) -> &U {
    let _ = <(T, U) as AlignCheck>::BAD;
    &*(x as *const T as *const U)
}

And it actually works!



OK, here goes a first draft. A few things to note:

  • I went with FromBits<T> instead of Compatible<T> because I think it’s more descriptive. However, it behaves roughly as Compatible<T> has been proposed here, and there’s no IntoBits<T>.
  • I covered the case in which T is a DST and Self: Sized, but I haven’t yet figured out what to do when Self is a DST.
  • The file is pretty long, so here’s a summary of what’s offered if you just want to skim:
    • FromBits<T> - as described
    • FitsIn<T> - guarantees that T is no smaller than Self
    • AlignedTo<T> - guarantees that Self is as aligned as T
    • transmute - like mem::transmute, but T can be larger than U
    • coerce - like transmute, but safe
    • coerce_{ref,mut}_xxx - coercions from one reference type to another, including variations with both compile- and run-time-verified size and alignment.
    • LayoutVerified - An object whose existence proves that certain size and alignment checking has been performed, allowing for size and alignment checking to be elided in the future when doing coercions.

I’d love any feedback you have! I’d also be interested to know whether you can think of any use cases for transmute. The only difference between it and mem::transmute is that T can be larger than U, and @cramertj feels that its presence is unjustified. If we can’t think of any use cases, then I agree.

Interesting question from @cramertj: Is it safe to have unsafe impl<T> FromBits<T> for [u8]? You might expect that the answer is obviously yes, since any random set of bytes is a valid byte slice. However…

  • In other languages, it can be UB to read an uninitialized value. Some notable quotes:
    • “Reading an uninitialized CPU register on Itanium is the best example of a hardware-induced crash covered by this rule.”
    • “Reading uninitialized memory by an lvalue of type unsigned char does not trigger undefined behavior. The unsigned char type is defined to not have a trap representation, which allows for moving bytes without knowing if they are initialized.”
    • “However, on some architectures, such as the Intel Itanium, registers have a bit to indicate whether or not they have been initialized. The C Standard, 6.3.2.1, paragraph 2, allows such implementations to cause a trap for an object that never had its address taken and is stored in a register if such an object is referred to in any way.”
  • In C, it is always safe to read uninitialized memory as unsigned char *, but not as anything else. So maybe this is safe precisely because we’re implementing it for [u8]? I suspect (though can’t find a reference) that LLVM has a notion of character type, and so this question comes down to whether u8 is considered a character type by LLVM.

The criterion for deciding this question are Rust's semantics. What LLVM, other languages, and CPUs do is only relevant in two respects:

  • What LLVM and CPUs do might prevent us from implementing some particular semantics efficiently. (NB: LLVM does not have a notion of "character type".)
  • The reasons why other languages are aggressive about uninitialized memory (e.g., optimizations enabled by it) might also be relevant for Rust.

Aside: why does this question lead to considering uninitialized memory? It seems to me the problem is padding -- which at the end of the day is probably physical memory that isn't written to. However, when talking about language semantics, it's perfectly possible and perhaps even advisable to distinguish padding bytes from non-padding bytes.
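The padding bytes are easy to exhibit concretely: a repr(C) struct can be larger than the sum of its field sizes, and the extra bytes are never written by ordinary field assignment (this assumes a typical target where u32 is 4-byte aligned):

```rust
use std::mem::size_of;

#[repr(C)]
struct Padded {
    a: u8,  // 1 byte
    // 3 bytes of padding here so that `b` is 4-byte aligned
    b: u32, // 4 bytes
}

fn main() {
    // 8 bytes total, but only 5 are written by field assignment; the
    // other 3 are padding -- the possibly-uninitialized bytes at issue.
    assert_eq!(size_of::<Padded>(), 8);
    assert_eq!(size_of::<u8>() + size_of::<u32>(), 5);
}
```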

Regardless, the meaning (or lack thereof) of reads from uninitialized memory is a broader question whose answer is part of the unsafe code guidelines. Unfortunately, it appears this particular question hasn't been addressed yet. There are multiple threads touching on the subject here in this forum, but as far as I remember it's never been in focus for the working group.


Well that depends.

I think that you can always implement this safely for types without padding bytes (e.g. repr(packed), or maybe even repr(C)?), and also for all types if the implementation does not touch the padding bytes in which case the size of the [u8] might be smaller than the size of T.

If you are talking about implementing FromBits<T> for [u8] by memcpying all the bytes for types with Rust layout including padding bytes then, as @hanna-kruppe says, that will depend on whether reading a padding byte is a read from uninitialized memory or not (cc @ubsan @RalfJung).

In any case, given that the layout of types with Rust layout is unspecified (e.g. the compiler can reorder fields at will), you will run into issues when serializing/deserializing with different Rust versions on top of the endianness issues that you get when serializing/deserializing on different architectures.

At that point you might as well restrict that blanket impl to repr(C) types and call it a day.

The question isn't whether it's safe for some T, but rather whether a blanket impl for all T - unsafe impl<T> FromBits<T> for [u8] {} - is safe.

I agree, although it sounds like LLVM might at the very least give us safety for "character types." It's obviously another question whether Rust formally guarantees that u8 corresponds to an LLVM "character type." It doesn't sound like such a guarantee is made right now, but it's not clear to me whether this is one of those "Rust's memory model is undefined anyway, and this is a pretty safe bet" things or one of those "you really shouldn't be relying on it, as there's a meaningful chance that might not be part of a Rust memory model that is formalized in the future" things.

That's true, but the question is only about whether it will cause UB.

I'm not sure that'd be sufficient, as @dtolnay suggests in Does repr(C) define a trait I can use to check structs were declared with #repr(C)? that repr(C) types can contain non-repr(C) types. Plus, as that thread describes, there's no trait for repr(C) types anyway.

The blanket impl still isn't safe because it allows you to go from uninitialized -> &[u8] -> &T where T: FromBits<[u8]>.

cc @RalfJung -- halp!

You can use union MaybeInit<T> { uninit: (), x: T } to get uninitialized memory and then get a &[u8] from that if FromBits<MaybeInit<Foo>> is implemented for [u8].
