[Pre-RFC]: Safe Transmute

rylev · November 21, 2019, 9:08pm

ATTENTION: This has been superseded by version 2 of the proposal.

-------------------------------------------------------------------------------------------

I've been working with Josh Triplett on a design for safe transmute. Please let us know what you think.

Safe(r) Transmute

Transmuting a buffer of bytes to a type and vice versa in Rust is extremely dangerous, so much so that the docs for std::mem::transmute are essentially a long list of ways to avoid doing so. However, transmuting is sometimes necessary. For instance, in extremely performance-sensitive use cases, it may be necessary to transmute from bytes instead of explicitly deserializing and copying bytes from a buffer into a struct.

Causes of Unsafety and Undefined Behavior (UB)

At the core of understanding the safety properties of transmutation is understanding Rust's layout properties (i.e., how Rust represents types in memory). The best resource I've found for understanding this is Alexis Beingessner's blog post on the matter.

The following are the reasons that transmutation from a buffer of bytes is generally unsafe:

Wrong Size: A buffer of bytes might not contain the correct number of bytes to encode a given type. Referring to uninitialized fields of a struct is UB. Of course, this assumes that the size of a given type is known ahead of time which is not always the case.
Illegal Representations: Safe transmutation of a slice of bytes to a type T is only possible if every possible value of those bytes corresponds to a valid value of type T. For example, this property doesn't hold for bool or for most enums. While size_of::<bool>() == 1, a bool can only legally be either 0b1 or 0b0; transmuting 0b10 to bool is UB.
Non-Deterministic Layout: Certain types might not have a deterministic layout in memory. The Rust compiler is allowed to rearrange the layout of any type that does not have a well defined layout associated with it. Explicitly setting the layout of a type is done through #[repr(..)] attributes. To be deterministic, both the order of fields of a complex type as well as the exact value of their offsets from the beginning of the type must be well known. This is generally only possible by marking a complex type #[repr(C)] and recursively ensuring that all fields of the struct are composed of types with deterministic layout.
Alignment: Types must be "well-aligned", meaning that the memory address at which a value starts must be a multiple of the type's alignment (usually some power of 2). For example, the alignment of u32 is 4, meaning that a valid u32 must always start at a memory address evenly divisible by 4. Transmuting a slice of bytes to a type T when the slice does not have proper alignment for type T is UB.
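
A minimal sketch of what checking the size, alignment, and representation conditions above might look like for two concrete types. The helper names view_as_u32 and byte_to_bool are made up for this illustration and are not part of the proposal.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: views the start of `bytes` as a `u32`, but only if
/// the size and alignment requirements hold. Every bit pattern is a valid
/// `u32`, so no value check is needed.
fn view_as_u32(bytes: &[u8]) -> Option<&u32> {
    if bytes.len() < size_of::<u32>() {
        return None; // wrong size
    }
    if bytes.as_ptr() as usize % align_of::<u32>() != 0 {
        return None; // misaligned
    }
    // SAFETY: length and alignment were checked above, and `u32` accepts
    // every bit pattern.
    Some(unsafe { &*(bytes.as_ptr() as *const u32) })
}

/// Hypothetical helper: `bool` is one byte wide, but only 0 and 1 are legal.
fn byte_to_bool(byte: u8) -> Option<bool> {
    match byte {
        0 => Some(false),
        1 => Some(true),
        _ => None, // e.g. 0b10 is not a valid `bool`
    }
}
```

Note that the alignment check depends on where the buffer happens to live, so the same bytes may be accepted in one allocation and rejected in another.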

Transmuting from a type T to a slice of bytes can also be unsafe or cause UB:

Padding: Since padding bytes (i.e., bytes internally inserted to ensure all elements of a complex type have proper alignment) are not initialized, viewing them is UB.
Non-Deterministic Layout: The same issue that affects transmuting from bytes to type T applies when going in the other direction.
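
To make the padding problem concrete, here is a small sketch with a made-up struct named Padded. It assumes a common platform where u32 is 4-byte aligned; the exact numbers could differ elsewhere.

```rust
use std::mem::size_of;

/// Hypothetical example type. With `repr(C)` the fields are laid out in
/// declaration order.
#[repr(C)]
#[allow(dead_code)]
struct Padded {
    tag: u8, // offset 0
    // On targets where `u32` is 4-byte aligned, 3 padding bytes are
    // inserted here so that `value` starts at offset 4.
    value: u32,
}

/// Number of padding bytes in `Padded`: total size minus the field sizes.
fn padding_bytes() -> usize {
    size_of::<Padded>() - (size_of::<u8>() + size_of::<u32>())
}
```

Those padding bytes are never written when a Padded value is constructed, which is why exposing the whole struct as a byte slice would read uninitialized memory.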

Suggested Improvements

Introduce a marker trait for safely transmutable types.

We first introduce the trait Transmutable (name subject to bike-shedding) that represents any type where all properly aligned and sized byte patterns are legal (from here on referred to as "byte complete" types).

All core types that are byte complete implement Transmutable. This includes u8 and usize but does not include basic types like bool that need further validation before being safely transmuted. User-defined types can safely opt into Transmutable using #[derive(Transmutable)] as long as they are recursively composed only of Transmutable types, they have a deterministic layout (i.e., they are repr(C)), and they contain no padding bytes. The compiler will return an error when the type does not meet one of the necessary conditions for being Transmutable.

The following should be noted:

A struct that requires internal padding can become a struct that can derive(Transmutable) by explicitly including padding fields.
Manual impl Transmutable is not allowed.
The user must opt into a complex type being Transmutable because this has implications on the public API of the type. Adding a new non-Transmutable private field to a type and thus making it non-Transmutable itself is a breaking change.
While deriving Transmutable for [T; N] where T is itself Transmutable is theoretically possible, this is left to future work.
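
As a sketch of the explicit-padding idea from the first note, the made-up struct below replaces implicit padding with a named field. The derive itself does not exist, so a compile-time size assertion stands in for the check it might perform; that stand-in is an assumption, not the proposed mechanism.

```rust
use std::mem::size_of;

/// Hypothetical example type with its padding spelled out as a field.
#[repr(C)]
#[allow(dead_code)]
struct Explicit {
    tag: u8,
    _pad: [u8; 3], // explicit, always-initialized padding
    value: u32,
}

// Compile-time check: the size equals the sum of the field sizes, so the
// compiler inserted no hidden padding bytes.
const _: () = assert!(size_of::<Explicit>() == 1 + 3 + 4);

/// Returns the size of `Explicit` so the claim can also be checked at run time.
fn explicit_size() -> usize {
    size_of::<Explicit>()
}
```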

The following types should automatically be marked as Transmutable:

u8, u16, u32, u64, u128, usize, i8, i16, i32, i64, i128, isize, f32, f64, (), all SIMD types that are byte-complete, and [T; N] for all of those types (but not for arbitrary Transmutable types).

Introduce a trait for types that can be transformed to/from bytes

Next, we introduce a trait called ToFromBytes (name subject to bike-shedding).

This trait represents a type that can go to and from bytes in a way that may fail. All Transmutable types would implement this trait. (note: FromBytesError is explained in the following sections).

use std::mem::{align_of, size_of};

trait ToFromBytes {
    fn to_bytes(&self) -> &[u8];
    fn from_bytes(bytes: &[u8]) -> Result<&Self, FromBytesError>;
}

impl<T: Transmutable> ToFromBytes for T {
    fn from_bytes(bytes: &[u8]) -> Result<&Self, FromBytesError> {
        if bytes.len() < size_of::<Self>() {
            return Err(FromBytesError::InsufficientBytes);
        }
        if bytes.as_ptr().align_offset(align_of::<Self>()) != 0 {
            return Err(FromBytesError::InsufficientAlignment);
        }
        Ok(unsafe { std::mem::transmute::<*const u8, &Self>(bytes.as_ptr()) })
    }
    
    fn to_bytes(&self) -> &[u8] {
        let pointer = self as *const Self as *const u8;
        unsafe {
            std::slice::from_raw_parts(pointer, size_of::<Self>())
        }
    }
}

Users can implement ToFromBytes for their own types as well. The standard library will implement this for bool:


impl ToFromBytes for bool {
  fn from_bytes(bytes: &[u8]) -> Result<&Self, FromBytesError> {
    match bytes.get(0) {
      Some(&b) if b == 1 || b == 0 => Ok(unsafe { std::mem::transmute::<*const u8, &bool>(bytes.as_ptr()) }),
      Some(_) => Err(FromBytesError::InvalidValue),
      None => Err(FromBytesError::InsufficientBytes),
    }
  }
  
  fn to_bytes(&self) -> &[u8] {
     let pointer = self as *const Self as *const u8;
     unsafe {
        std::slice::from_raw_parts(pointer, size_of::<Self>())
     }
  }
}

The following should be noted:

While the above to_bytes implementation is applicable for all types with deterministic layout and no padding, there is no default implementation of to_bytes.
to_bytes returns a borrowed slice, so even a manual implementation of the trait cannot construct a slice of bytes that does not match the in-memory representation of the structure. In particular, this means a type with internal padding bytes cannot implement ToFromBytes. This would require a trait that either constructs an owned (or Cow) slice, or a trait that writes bytes to a mutable slice supplied as a parameter. This pre-RFC does not attempt to specify any such trait, leaving it to future work.
In the case where the slice contains more than the number of bytes required to represent the type, the extra bytes are simply ignored.
When implementing ToFromBytes, from_bytes should process size_of::<T>() bytes and return an error if supplied with fewer, and to_bytes should return a slice of exactly size_of::<T>() bytes in length. These APIs should also uphold that Value::from_bytes(value.to_bytes()) == value.
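
The round-trip property in the last note can be sketched for a user-defined type as follows. The trait and error type are redeclared locally (with only two error variants) so the example stands alone; the Header struct and the round_trips helper are hypothetical names for illustration.

```rust
use std::mem::{align_of, size_of};

#[derive(Debug, PartialEq)]
enum FromBytesError {
    InsufficientAlignment,
    InsufficientBytes,
}

trait ToFromBytes {
    fn to_bytes(&self) -> &[u8];
    fn from_bytes(bytes: &[u8]) -> Result<&Self, FromBytesError>;
}

/// Hypothetical user type: repr(C), no padding, every bit pattern valid.
#[repr(C)]
#[derive(Debug, PartialEq)]
struct Header {
    id: u16,
    len: u16,
}

impl ToFromBytes for Header {
    fn to_bytes(&self) -> &[u8] {
        let pointer = self as *const Self as *const u8;
        // SAFETY: `Header` has no padding, so all of its bytes are initialized.
        unsafe { std::slice::from_raw_parts(pointer, size_of::<Self>()) }
    }

    fn from_bytes(bytes: &[u8]) -> Result<&Self, FromBytesError> {
        if bytes.len() < size_of::<Self>() {
            return Err(FromBytesError::InsufficientBytes);
        }
        if bytes.as_ptr() as usize % align_of::<Self>() != 0 {
            return Err(FromBytesError::InsufficientAlignment);
        }
        // SAFETY: size and alignment were checked, and both fields accept
        // every bit pattern.
        Ok(unsafe { &*(bytes.as_ptr() as *const Self) })
    }
}

/// Checks the invariant `from_bytes(value.to_bytes()) == value`.
fn round_trips(value: &Header) -> bool {
    Header::from_bytes(value.to_bytes()) == Ok(value)
}
```

Because to_bytes borrows the value's own storage, the resulting slice is always suitably aligned, which is what makes the round trip succeed unconditionally here.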

Introduce a type representing errors when transmuting from bytes

Next, we introduce a FromBytesError (name subject to bike-shedding) which represents the types of errors that can occur when transmuting from bytes to a concrete type.

#[non_exhaustive]
#[derive(Debug)]
enum FromBytesError {
    InsufficientAlignment,
    InsufficientBytes,
    InvalidValue
}

impl Error for FromBytesError { ... }
impl Display for FromBytesError { ... }

Question: Should FromBytesError contain specific information on where the errors occurred? For instance, should FromBytesError::InsufficientBytes include the number of bytes required and the number given?
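
One possible shape for a more detailed error, purely as a sketch of what the question is asking about. The type name DetailedFromBytesError, its field names, and the message wording are all assumptions made for this example.

```rust
use std::fmt;

/// Hypothetical variant of `FromBytesError` that carries details.
#[derive(Debug, PartialEq)]
#[non_exhaustive]
enum DetailedFromBytesError {
    InsufficientAlignment { required: usize, address: usize },
    InsufficientBytes { required: usize, given: usize },
    InvalidValue { offset: usize },
}

impl fmt::Display for DetailedFromBytesError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InsufficientAlignment { required, address } => write!(
                f,
                "address {:#x} is not aligned to {} bytes",
                address, required
            ),
            Self::InsufficientBytes { required, given } => {
                write!(f, "need {} bytes, got {}", required, given)
            }
            Self::InvalidValue { offset } => {
                write!(f, "invalid value at byte offset {}", offset)
            }
        }
    }
}

impl std::error::Error for DetailedFromBytesError {}
```

The trade-off is a larger error type in exchange for messages that say exactly what went wrong.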

Safe Unions

Lastly, unions which are composed purely of Transmutable types will allow safe access to their fields since writing to and reading from the union is well defined no matter how one interprets it.

Question: Can we safely allow access to union fields if every field is Transmutable but the fields have different sizes? Is it possible, in safe code, to end up with a union that only has the bytes of a shorter field initialized and has uninitialized data in the remainder?

Acknowledgments

Shout out to the following crates for paving the way with many good ideas:
- safe-transmute
- zeroize

josh · November 21, 2019, 9:11pm

Thanks for all the detailed work, @rylev; I've really enjoyed working with you on this. I'm excited about safe unions and safe transmute!

djc · November 21, 2019, 9:17pm

This would be great to have for some of the things I've been working on.

Question on the mechanics: wouldn't it be possible to just use TryFrom instead of introducing a new trait?
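
A rough sketch of what the TryFrom route could look like for a single user type. Header is a hypothetical struct, and the error type is a cut-down local copy of the one in the proposal; nothing here is an existing standard library impl.

```rust
use std::convert::TryFrom;
use std::mem::{align_of, size_of};

#[derive(Debug, PartialEq)]
enum FromBytesError {
    InsufficientAlignment,
    InsufficientBytes,
}

/// Hypothetical user type: repr(C), no padding, every bit pattern valid.
#[repr(C)]
#[derive(Debug, PartialEq)]
struct Header {
    id: u16,
    len: u16,
}

impl<'a> TryFrom<&'a [u8]> for &'a Header {
    type Error = FromBytesError;

    fn try_from(bytes: &'a [u8]) -> Result<Self, Self::Error> {
        if bytes.len() < size_of::<Header>() {
            return Err(FromBytesError::InsufficientBytes);
        }
        if bytes.as_ptr() as usize % align_of::<Header>() != 0 {
            return Err(FromBytesError::InsufficientAlignment);
        }
        // SAFETY: size and alignment were checked, and both fields accept
        // every bit pattern.
        Ok(unsafe { &*(bytes.as_ptr() as *const Header) })
    }
}
```

This covers the bytes-to-type direction only; the reverse direction would need its own impl or a separate trait.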

197g · November 21, 2019, 9:21pm

I feel like this should acknowledge zerocopy, which has a more similar set of already existing features than the two other cited crates. It splits ToBytes and FromBytes to allow exposing a mutable slice, but does not have the power of std, so it is restricted to #[repr(C)] for the layout assertions and has no SIMD types as far as I am aware.

Centril · November 21, 2019, 10:01pm

I think it would fit the compiler internals better, and be strictly more flexible, if you moved the restriction to manual implementations of Transmutable and allowed them. Then the macro expansion of #[derive(Transmutable)] becomes:

#[automatically_derived]
#[allow(unused_qualifications)]
impl<...> ::core::marker::Transmutable for TheType {}

(This is what derive(Copy) does, with semantic restrictions arising from the implementation itself; see https://github.com/rust-lang/rust/blob/82cf3a4486bc882207a09bf0d9e2dea4632781aa/src/librustc_typeck/coherence/builtin.rs#L79-L144)

Mark_Simulacrum · November 21, 2019, 10:11pm

One interesting element is that this trait does not forbid implementations that do something other than transmuting; e.g., if I wanted to, I could define this for any C-like enum in a safe manner, or really any type with a set of possible values known at compile time. I don't think this is a bad thing, but it is intriguing that in theory one could write a derive that generates statics for each value and then pattern-matches on the byte slice to return the reference to that value, and creates the appropriate byte slice to make that possible.

One additional extension that I think would be super useful is to provide an operation on Vec<u8> which returns T (versus &[u8] -> &T); most of the time I've wanted to directly deserialize, I want to throw away the original buffer (since it's e.g. a buffer from fs::read). But that could be implemented by library code -- it seems to me that this RFC really only "adds" the Transmutable trait; the ToFromBytes trait is not something that we must add to make this useful; any library can implement that in a stable way. I do think we should add it -- or perhaps the TryFrom impls @djc suggests.
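
A minimal sketch of the Vec<u8> to T idea as library code. The unsafe marker trait ByteComplete is a hypothetical stand-in for the proposed Transmutable, and read_owned is a made-up function name. This version copies the value out of the buffer rather than reusing the allocation.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the proposed marker trait: implementors promise
/// that every bit pattern of the right size is a valid value.
unsafe trait ByteComplete: Copy {}

unsafe impl ByteComplete for u32 {}
unsafe impl ByteComplete for u64 {}

/// Consumes the buffer and returns the `T` stored in its leading bytes.
fn read_owned<T: ByteComplete>(buffer: Vec<u8>) -> Option<T> {
    if buffer.len() < size_of::<T>() {
        return None;
    }
    // SAFETY: the buffer holds at least `size_of::<T>()` initialized bytes,
    // `T` accepts every bit pattern, and `read_unaligned` has no alignment
    // requirement.
    Some(unsafe { std::ptr::read_unaligned(buffer.as_ptr() as *const T) })
}
```

Using an unaligned read sidesteps the alignment error case entirely, which is one advantage of returning an owned value instead of a reference.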

H2CO3 · November 21, 2019, 10:23pm

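I'm generally against any sort of "safe transmuting" because juggling bits like this is usually so low-level that attention needs to be paid to how the resulting value is used anyway. Yet, one thing that sticks out even more is this clause:

"In the case where the slice contains more than the number of bytes required to represent the type, the extra bytes are simply ignored."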

That's really confusing. I'd expect transmuting to work only if bytes.len() == size_of::<T>(), and I can't imagine any legitimate use case where allowing this would be a good idea. If the slice is big enough, it's possible to index it and chop off the excess at the end. But allowing this implicitly is bound to cause logical errors.

nikomatsakis · November 21, 2019, 10:26pm

This is not a real guarantee -- for example, one could return a static array of bytes.

nikomatsakis · November 21, 2019, 10:32pm

Currently, constructing unions is safe. For example, this program compiles (playground):

union Foo { x: u8, y: u16 }

fn main() {
    let u = Foo { x: 22 };
}

It therefore seems likely that accessing u.y would be UB, since the top byte is undefined, unless we had some kind of zeroing rule or some poisoning mechanism.

Moreover, the current draft of the UCG states that #[repr(Rust)] unions are not guaranteed to put their fields at the same offset. If fields are of varying sizes, then, it would seem to expose those choices to safe code (which I guess is ok, now that I think about it -- i.e., not really worse than exposing those choices to unsafe code -- but something to be aware of).

scottmcm · November 22, 2019, 12:05am

I feel like the vague name here is symptomatic of a vague semantic. +1 to the previous suggestions to split the trait:

One trait for types that are valid and safe to turn into bytes. That means they have no padding (or other undef), but can have things like bool or NonZeroU32.
One trait for types that can be safely populated from bytes. That means they accept all byte patterns (so no bool), but it's fine for things with padding. (This one can be done with auto trait today, I think.)
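
A sketch of how such a split could look as plain library code. The trait names AsBytes and FromAnyBytes and both helper functions are invented for this example; they are written as unsafe traits here because nothing in today's compiler verifies the promises.

```rust
use std::mem::size_of;

/// Hypothetical: types with no padding or other uninitialized bytes, so
/// viewing them as bytes is fine. `bool` qualifies.
unsafe trait AsBytes {}

/// Hypothetical: types for which every bit pattern is a valid value.
/// `bool` does not qualify.
unsafe trait FromAnyBytes: Copy {}

unsafe impl AsBytes for bool {}
unsafe impl AsBytes for u32 {}
unsafe impl FromAnyBytes for u32 {}

/// Views any `AsBytes` value as its underlying bytes.
fn as_bytes<T: AsBytes>(value: &T) -> &[u8] {
    // SAFETY: `AsBytes` promises that all bytes of `T` are initialized.
    unsafe { std::slice::from_raw_parts(value as *const T as *const u8, size_of::<T>()) }
}

/// Copies a `T` out of the leading bytes of the slice.
fn from_any_bytes<T: FromAnyBytes>(bytes: &[u8]) -> Option<T> {
    if bytes.len() < size_of::<T>() {
        return None;
    }
    // SAFETY: enough initialized bytes are present and every bit pattern is
    // a valid `T`; the read is explicitly unaligned.
    Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const T) })
}
```

With this split, as_bytes(&true) compiles while from_any_bytes::<bool> does not, which matches the asymmetry described in the two bullet points.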

I don't like Transmutable as a name because there are lots of things that are transmutable but wouldn't implement this trait, like newtypes.

comex · November 22, 2019, 12:36am

What about generic repr(C) structs where the presence of padding bytes depends on type parameters? I assume they can't derive Transmutable?

Can there be an unsafe impl Transmutable for cases like that?

josh · November 22, 2019, 1:32am

I think there was a typo along the way. The link to zeroize should have pointed to zerocopy.

We could very easily split ToBytes and FromBytes, and that seems like a reasonable step.

If we do that, we should split the marker traits too.

josh · November 22, 2019, 1:33am

That sounds like a great improvement, as long as impl Transmutable enforces the same requirements.

josh · November 22, 2019, 1:37am

That's intentional; you can absolutely implement to_bytes and from_bytes for an enum (if it has a fixed representation).

It would be unfortunate to only support this for unions where all the fields have the same size.

I would like to find a way to support this case.

I thought we had addressed that at the last all-hands. But in any case, this only works on #[repr(C)] unions.

SimonSapin · November 22, 2019, 1:43am

This kind of transmute can be useful and adding more systematic checking that it’s safe sounds great. But does it really need language support, or could it be done entirely in a library with a procedural macro? (Note that this is a separate question from: should that library be part of the standard library.)

I’ve done something very similar before, in a previous version of a TrueType font file parsing library:

There’s a trait named Pod (for “plain old data”) that provides a cast from (roughly) a slice of bytes. There are a few manual implementations (for arrays and for primitive integers), but structs can derive this trait with a derive proc macro.

The macro checks that it’s used on a struct that has #[repr(C)], whose fields all implement the trait, and that doesn’t have padding. The latter is checked by making another struct with the same fields and #[repr(packed)], then emitting an instantiation (not a call) of transmute in order to statically assert that they have the same size.
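
The padding check can be sketched by hand as below. Note the swap: instead of instantiating transmute as described above, this version uses a const assertion on the two sizes, which relies on a newer compiler (const panics) than was available when the post was written. The struct names are hypothetical and not taken from the library mentioned.

```rust
use std::mem::size_of;

/// Hypothetical struct that wants to be castable from bytes.
#[repr(C)]
#[allow(dead_code)]
struct Glyph {
    x: i16,
    y: i16,
    advance: u32,
}

/// Same fields, but packed: this layout can never contain padding.
#[repr(C, packed)]
#[allow(dead_code)]
struct GlyphPacked {
    x: i16,
    y: i16,
    advance: u32,
}

// Fails to compile if `Glyph` is larger than its packed twin, i.e. if the
// compiler had to insert padding somewhere.
const _: () = assert!(size_of::<Glyph>() == size_of::<GlyphPacked>());

/// Run-time form of the same comparison, usable for any pair of types.
fn same_size<A, B>() -> bool {
    size_of::<A>() == size_of::<B>()
}
```

A derive macro would generate the packed twin and the assertion automatically for each annotated struct.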

SimonSapin · November 22, 2019, 1:47am

Now that size_of is a const fn, instead of &[u8] this could use &[u8; size_of::<Self>()]. (I don’t know if that works in traits, though.) There’s a TryFrom impl to go from a slice to an array reference.
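
For a concrete type, the slice-to-array-reference conversion mentioned here already works in safe code. A small sketch, with leading_u32 as a made-up helper name:

```rust
use std::convert::TryFrom;

/// Reads a native-endian `u32` from the first four bytes of `bytes`,
/// using only safe code.
fn leading_u32(bytes: &[u8]) -> Option<u32> {
    // Take exactly four bytes, or give up if the slice is too short.
    let head = bytes.get(..4)?;
    // The standard library's `TryFrom<&[u8]>` impl for `&[u8; N]` succeeds
    // only when the slice length equals `N`.
    let array = <&[u8; 4]>::try_from(head).ok()?;
    Some(u32::from_ne_bytes(*array))
}
```

Since the array length is part of the type, the "too few bytes" error case disappears from any function that takes the array reference directly.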

josh · November 22, 2019, 1:53am

And here I thought that was a long way off! I would love to use that if it works in stable.

nikomatsakis · November 22, 2019, 1:55am

(Personally, I think guaranteeing offset 0 would be reasonable, particularly if it has a strong motivation)

josh · November 22, 2019, 1:56am

We hadn't thought about generic structs. That case needs some thought...

Accessing padding is still UB even in an unsafe trait. The point of these traits is to do this safely, rather than an unsafe transmute call.

GolDDranks · November 22, 2019, 2:16am

I think you should definitely discuss endianness here. Would using this trait produce any compatibility lints/warnings? I'm thinking that a story similar to the architecture cfg compatibility lints for SIMD could reduce the cases of accidental incompatibility caused by using this trait indiscriminately.
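
To illustrate the endianness concern with existing standard library functions: the same four bytes decode to different integers depending on the byte order assumed, and a transmute-style view always uses the native order. The wrapper function names are made up for the example.

```rust
/// Interprets the bytes as a little-endian integer on every platform.
fn decode_le(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}

/// Interprets the bytes as a big-endian integer on every platform.
fn decode_be(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}

/// Interprets the bytes in the platform's own byte order, which is what a
/// transmute-style reinterpretation effectively does.
fn decode_native(bytes: [u8; 4]) -> u32 {
    u32::from_ne_bytes(bytes)
}
```

For the input [1, 0, 0, 0], decode_le yields 1 and decode_be yields 0x0100_0000, so data written on one architecture and viewed in place on another can silently change meaning.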
