[Pre-RFC] Patterns allowing transparent wrapper types

I would like to gather some early feedback but this is in the full RFC format already. Rather lengthy motivation since the inability to do this safely came up multiple times already now in more than one crate in which I am involved.

Summary

Allow patterns to denote a structure annotated with #[repr(transparent)] in place of the structural pattern for the basis of the representation.

Motivation

Custom dynamically sized types (wrapping one of the native DSTs) are in an unfortunate place right now. It is almost entirely impossible to create them within a safe context and since they exist only (maybe soon mostly) behind references, #[repr(transparent)] is also mostly useless for this job. Custom newtype wrappers around usual types are also not optimal when interacting with code that instead targets the underlying type.

However, being able to have new invariants on (unsized) types without changing representation has many real advantages:

  • str can be regarded as such. While an internal type for the moment (and likely longer), a custom type in its likeness for other encodings can be useful.
  • Network programming and other forms of communication rarely deal with fixed size structures but have highly predictable content and internal invariants.
  • [impl Ord] that actually maintains order in the slice.
  • Avoids one level of indirection and a lifetime. Current code many times provides wrappers some with generic impl AsRef<[T]> instead of encapsulating the actual memory region.
  • Ascribing additional meaning to raw data:
    struct RGB([u8; 3]);
    

Calling a &self method on a type wrapping a reference to unsized data suddenly involves two lifetimes. One is that of the container struct, which is just struct _ { inner: C, }, and the other is the actually relevant one of the memory. This makes it unnecessarily unergonomic to store a borrowed result from such a type.
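A minimal sketch of the two-lifetime problem. The `Frame` type and its "first byte is a header" layout are assumptions made up for illustration, not taken from any crate mentioned here. With elided lifetimes the result borrows from the wrapper; only the explicit `'a` ties it to the memory.

```rust
/// Hypothetical wrapper holding a reference to unsized data.
struct Frame<'a> {
    inner: &'a [u8],
}

impl<'a> Frame<'a> {
    /// Elided lifetimes: the result is tied to the borrow of `self`,
    /// i.e. to the (often short-lived) wrapper value.
    fn payload_elided(&self) -> &[u8] {
        self.inner.get(1..).unwrap_or(&[])
    }

    /// Explicit lifetime: the result is tied to the underlying memory.
    fn payload(&self) -> &'a [u8] {
        // Copy the inner reference out so the result does not borrow `self`.
        let inner: &'a [u8] = self.inner;
        inner.get(1..).unwrap_or(&[])
    }
}

/// Returns the payload of a temporary wrapper.
fn payload_of(bytes: &[u8]) -> &[u8] {
    let frame = Frame { inner: bytes };
    // `frame.payload_elided()` would be rejected here, because its result
    // borrows from the local `frame`, which is dropped at the end of the function.
    frame.payload()
}
```

Every method of such a wrapper has to make this choice explicitly, which is the ergonomic cost described above.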

It also ties two separate concepts strongly together. The struct for encapsulating an inner invariant is often also in charge of owning that data. This is unnecessary, especially (but not only) if the data only needs to be accessed immutably, yet library authors then opt to introduce illusive bounds of C: AsRef<[T]> + AsMut<[T]> instead. This creates new problems if they actually do care about the memory location, since those two methods need not return the ‘same’ slices at all times. And note how the AsMut<[T]> bound always conditionally pops up despite the receiving method already declaring itself to be &mut self.
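To make the "need not return the same slices" point concrete, here is a small sketch. Both `TwoBuffers` and `Packet` are hypothetical names invented for this example; the container is deliberately contrived, but nothing in the trait contracts forbids it.

```rust
/// Hypothetical container whose `AsRef` and `AsMut` views disagree.
struct TwoBuffers {
    read: Vec<u8>,
    write: Vec<u8>,
}

impl AsRef<[u8]> for TwoBuffers {
    fn as_ref(&self) -> &[u8] {
        &self.read
    }
}

impl AsMut<[u8]> for TwoBuffers {
    fn as_mut(&mut self) -> &mut [u8] {
        &mut self.write
    }
}

/// Wrapper in the style criticized above: generic over its container.
struct Packet<C> {
    inner: C,
}

impl<C: AsRef<[u8]> + AsMut<[u8]>> Packet<C> {
    /// Writes through the `AsMut` view.
    fn set_first(&mut self, value: u8) {
        if let Some(byte) = self.inner.as_mut().first_mut() {
            *byte = value;
        }
    }

    /// Reads through the `AsRef` view.
    fn first(&self) -> Option<u8> {
        self.inner.as_ref().first().copied()
    }
}
```

With `TwoBuffers` as the container, a `set_first` followed by `first` does not observe the written value, so `Packet` cannot rely on any invariant it established through the mutable view.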

A much preferable solution would be to provide a native, standard internal type on which such guarantees can be made and which also produces the desired (co-)variance: &'_ T. Note that the standard library already follows this pattern! Vec<T> is the owning abstraction while [T] is the representational one. And there are plenty of other owners of [T], suggesting that these concepts should indeed be addressed at different levels of abstraction. The same holds for String and str, where we additionally can convert between [u8] and str due to internal, unsafe magic. (Pretty literally currently, since they are lang items. Some code casts pointers, some does union access; it’s a bit all over the place.)
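As a small illustration of that split in today's standard library (the function names here are made up for the example): one function written against [T] serves every owner, and `std::str::from_utf8` is the checked conversion from [u8] to str.

```rust
use std::rc::Rc;

/// Written once against the representational type `[u8]`.
fn checksum(data: &[u8]) -> u32 {
    data.iter().map(|&byte| u32::from(byte)).sum()
}

/// The same function serves several different owners of `[u8]`.
fn checksum_of_owners() -> [u32; 3] {
    let vec: Vec<u8> = vec![1, 2, 3];
    let boxed: Box<[u8]> = vec![1, 2, 3].into_boxed_slice();
    let shared: Rc<[u8]> = Rc::from(&[1u8, 2, 3][..]);
    [checksum(&vec), checksum(&boxed), checksum(&shared)]
}

/// Checked conversion that adds the UTF-8 invariant without copying.
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

User-defined types cannot currently offer the equivalent of `as_text` for their own invariants without unsafe code.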

The type of a value can change through coercion or an explicit constructor. The type itself can only be changed within an expression and can not be changed when matching a reference although matching by value may instead produce multiple values of other types. This presents a problem for dynamically sized types which are exclusively visible behind references (and pointers). Creating a reference to a custom DST thus always involves unsafe code for

  1. Converting the original reference to a pointer.
  2. Casting that pointer to a pointer to the custom DST
  3. unsafe: Dereferencing that pointer and reborrowing it

This is error prone since the lifetime information is lost, and the involved types may change without compilation errors, introducing misalignment or size mismatches. Transmuting the reference directly can have an even larger impact while still compiling if any of the involved types changes. Not requiring unsafe at this point would also encourage using it more sparingly, improve code quality, and allow more #[deny(unsafe_code)] usage. Note that the reverse, converting to a reference to a contained field, is easy since it is a basic field access. The layout of the value denoted by the inner field is after all already ensured through the usual means, and so is the creation of a reference to it (disregarding #[repr(packed)]).
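For reference, the three steps listed above look roughly like this in today's Rust. This is a sketch only: `Ascii`, `from_bytes` and `as_bytes` are hypothetical names chosen for the example, and the ASCII check stands in for whatever invariant the wrapper is meant to uphold.

```rust
/// Hypothetical ASCII-only byte string: a custom DST wrapping `[u8]`.
#[repr(transparent)]
struct Ascii([u8]);

impl Ascii {
    /// Checks the invariant, then converts the reference.
    fn from_bytes(bytes: &[u8]) -> Option<&Ascii> {
        if !bytes.is_ascii() {
            return None;
        }
        // Step 1: convert the original reference to a raw pointer.
        let raw: *const [u8] = bytes;
        // Step 2: cast to a pointer to the custom DST; the slice length
        // metadata is carried over unchanged.
        let cast = raw as *const Ascii;
        // Step 3: dereference and reborrow.
        // SAFETY: `Ascii` is `#[repr(transparent)]` over `[u8]`, so both
        // pointers describe the same memory with the same layout. The
        // lifetime is only restored by the function signature.
        Some(unsafe { &*cast })
    }

    /// The reverse direction is a plain field access and needs no `unsafe`.
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}
```

Note that nothing in step 2 would fail to compile if `Ascii` later wrapped a different slice type or lost its `#[repr(transparent)]` attribute.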

Guide-level explanation

Usually types can be constructed and destructed via their struct syntax:

struct Foo { bar: usize }

let foo = Foo { bar: 0 };
let Foo { bar, } = foo;

This proposes to allow a pattern to match as another type if that type is a transparent wrapper, and only then. This requires the member to be accessible and valid in that context, as a normal constructor would, including correctly enforcing privacy rules. This ensures that user-defined invariants can be upheld, in addition to the validity and correctness invariants already imposed by the underlying type.

#[repr(transparent)]
struct ascii([u8]);

let byte_slice: &[u8] = ...;
let (val as ascii(val)): &ascii = byte_slice;

Reference-level explanation

The layout of a wrapper type annotated as #[repr(transparent)] is guaranteed to be the one of the sole underlying type. Hence, it must contain exactly one non-zero-sized field.

Allow ref and ref mut identifier patterns that match the underlying type to carry an optional suffix as <construction>, denoting the construction of the wrapping type. The suffix can be combined with the existing @ pattern suffix for identifier patterns but must occur before it.

#[repr(transparent)]
struct ascii([u8]);

let byte_slice: &[u8] = ...;
let val as ascii(val) = byte_slice;
// Desugared:
let (&ref val as ascii(val)): &ascii = byte_slice;

#[repr(transparent)]
struct AsciiChar(u8);

impl AsciiChar {
    fn as_ascii_char(ch: &u8) -> Option<&AsciiChar> {
        match ch {
            ch as AsciiChar(ch) @ 0..=127 => Some(ch),
            _ => None,
        }
    }
}

Conceptually, the optional as <construction> suffix allows viewing the matched place as the wrapper type, optionally including multiple zero-sized slices at either end, without permitting reinterpretation: the memory containing the underlying value continues to be interpreted as that type, including its invariants.

The construction expression itself must consist only of:

  • exactly one struct expression
  • where exactly the wrapped field names the identifier
  • and all other fields name a zero sized type

#[repr(transparent)]
struct unparsed<T> {
    // Zero-sized marker first: only the last field of a struct may be unsized.
    to_be_type: PhantomData<T>,
    data: [u8],
}

// Ok
let data as unparsed {
    data,
    to_be_type: PhantomData::<String>,
} = raw_bytes;

fn produces() -> PhantomData<String> { .. }

// NOT Ok
let data as unparsed {
    data,
    to_be_type: produces(), // No full expressions.
} = raw_bytes;

// NOT Ok
let data as unparsed {
    data: &data[1..], // No expressions on the value.
    to_be_type: PhantomData::<String>,
} = raw_bytes;

// Ok
let val as unparsed {
    data: val, // Can give it a different name.
    to_be_type: PhantomData::<String>,
} = raw_bytes;
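For comparison, a sketch of what the accepted forms above roughly correspond to in today's Rust. The capitalised `Unparsed` and the `new`/`bytes` helpers are assumptions for this example. The marker field is placed first because only the last field of a struct may be unsized.

```rust
use std::marker::PhantomData;

/// Raw bytes that are intended to be parsed as a `T` later.
#[repr(transparent)]
struct Unparsed<T> {
    to_be_type: PhantomData<T>,
    data: [u8],
}

impl<T> Unparsed<T> {
    /// Today's stand-in for `let data as unparsed { data, .. } = raw_bytes;`.
    fn new(raw: &[u8]) -> &Unparsed<T> {
        // SAFETY: `Unparsed<T>` is `#[repr(transparent)]` with `[u8]` as its
        // only non-zero-sized field, so the pointer cast preserves layout
        // and slice length metadata.
        unsafe { &*(raw as *const [u8] as *const Unparsed<T>) }
    }

    /// Viewing the wrapped field is an ordinary, safe field access.
    fn bytes(&self) -> &[u8] {
        &self.data
    }
}
```

Unlike the proposed pattern form, this never mentions a value for `to_be_type`, which is why the cast cannot express the restrictions on the other fields.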

Drawbacks

This complicates pattern matching.

It introduces a value-like syntax to places where actually no values are involved.

Rationale and alternatives

Non-ref patterns are explicitly excluded so as not to further dilute the principle that patterns do not involve expression evaluation on their own.

An alternative is embracing unsized values. This would solve the problem of construction but does not address efficiency and usability concerns. In particular, it would not allow converting the type of values already behind a reference.

Prior art

C permits casting of types that have ‘compatible layout’ but the compiler undertakes no effort of validating it, silently permitting undefined behaviour to occur.

Unresolved questions

Should a slice of the wrapper type also be constructible from a slice of the wrapped type? This seems useful for avoiding duplicating every definition for a slice variant. But for unsized original types or other use cases this is not even applicable, so further syntactic or semantic complications would be suspect.

Future possibilities

Leverage the concept for the benefit of several

Other types and representations also permit transmutes or effective transmutes, e.g. to [u8], by having no alignment holes in their layout, such as integer types. For some applications this is not a problem since those types are usually Copy, but that does not hold in general. Support methods such as u16::to_{ne,be,le}_bytes further increase usability but do not establish a general pattern: their internal implementations are still a standard transmute and so do not naturally extend to slices.

let slice as <[u8]> = u16_slice;
let [upper, lower] = &mut u16_value;

But note that this is a related but different direction: this RFC is focussed on construction of a wrapper that enables additional invariants, while this other possibility would be destructuring.
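A short sketch of the limitation with today's helpers (the function names are made up for the example): the per-value method is convenient, but a slice has to be converted element by element into a new buffer rather than viewed in place.

```rust
/// Per-value conversion: supported directly by the standard library.
fn u16_to_be(value: u16) -> [u8; 2] {
    value.to_be_bytes()
}

/// Slice conversion: no in-place view, so every element is copied
/// into a freshly allocated buffer.
fn u16_slice_to_be(values: &[u16]) -> Vec<u8> {
    values
        .iter()
        .flat_map(|value| value.to_be_bytes())
        .collect()
}
```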


It would be nice to add type ascription in the examples to illustrate exactly what is happening (and to show how an as pattern and type ascription will work together syntactically). I assume the first example is something like:

#[repr(transparent)]
struct ascii([u8]);

let byte_slice: &[u8] = ...;
let (val as ascii(val)): &ascii = byte_slice;

Example of a somewhat surprising but useful generic wrapper that is currently not possible: Playground

use std::ops::Deref;

/// Wrapping makes one unable to call mutable methods.
///
/// Use case: A third-party api takes a `&mut _` but you want to ensure that
/// no code actually calls any mutable methods. Note that `T` may be a trait
/// object for even more confusing fun.
#[repr(transparent)]
struct ForbidMut<T: ?Sized>(T);

/// No `DerefMut` as design decision.
impl<T: ?Sized> Deref for ForbidMut<T> {
    type Target = T;

    fn deref(&self) -> &T {
        // Plain field access; the wrapper adds no data of its own.
        &self.0
    }
}
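What is missing from that snippet is a way to obtain a `&mut ForbidMut<T>` from a `&mut T`, which today needs unsafe. A self-contained sketch follows; it repeats the definition, and `ForbidMut::new` and `third_party_len` are hypothetical names added for this example.

```rust
use std::ops::Deref;

#[repr(transparent)]
struct ForbidMut<T: ?Sized>(T);

impl<T: ?Sized> ForbidMut<T> {
    /// Reinterprets a mutable reference as a reference to the wrapper.
    fn new(inner: &mut T) -> &mut ForbidMut<T> {
        // SAFETY: `ForbidMut<T>` is `#[repr(transparent)]` over `T`, so the
        // pointee layout and any pointer metadata are identical.
        unsafe { &mut *(inner as *mut T as *mut ForbidMut<T>) }
    }
}

impl<T: ?Sized> Deref for ForbidMut<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.0
    }
}

/// Stand-in for a third-party API that insists on `&mut _`.
fn third_party_len<C: Deref<Target = [u8]> + ?Sized>(container: &mut C) -> usize {
    // Only `Deref` is available, so mutation of the slice is impossible here.
    container.len()
}
```

Calling `third_party_len(ForbidMut::new(&mut bytes[..]))` hands out a `&mut` while still preventing writes to the slice, which is the use case described in the doc comment.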

This vaguely scares me as it means creating wrapper DSTs without going through a constructor that verifies correctness of the wrapped data (or asserts it with unsafe).

I think it should be safe since it requires access to the field (the same as sized wrapped values), but it still stands someone could’ve given field access with the assumption it couldn’t be abused.

I can’t speak to the grammar accessibility currently, I’d have to study it further. It initially seems ok, though, to just have ⟨pat⟩ [as ⟨pat⟩] [@ ⟨pat⟩].

Is there a reason transmute can't be used here?


Or even just some pointer casts if you don’t like transmute.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=60f28a470db777af0b95546a52ee59df


I had to read this several times before realizing that the grammar is not:

Pat =
  | ...
  | As:{ lhs:Pat "as" rhs:Type }
  ;

Can you please provide a diff to the grammar in https://github.com/rust-lang-nursery/wg-grammar/blob/master/grammar/pat.lyg ?

Also, it strikes me as odd to use as in a different form than $thing as $type. I would expect some symmetry between type ascription/casts in patterns (I hope we get those eventually) and in expressions.


I read it as

Pat |= Binding:{
    binding:Binding
    { "as" wrapper:{ Pat::TupleStruct | Pat::Struct } }?
    { "@" subpat:Pat }?
};

to be a bit loose with the lyg syntax for a smaller diff.

This elaboration makes me believe this specific syntax is indeed nonambiguous and available.


Yeah, it requires unsafe and #[forbid(unsafe_code)] is nice to have for some crates :slight_smile: Any use of unsafe whose justification does not depend on locally established preconditions reinforces the impression that unsafe is fine to use, and makes one less careful.

This suggests just:

Pat |= Binding:{ binding:Binding { "as" wrapper:Pat}? { "@" subpat:Pat }? };

I don't think I could do a better job than @CAD97 with this since I've never touched that part of Rust before. That said, his interpretation seems to correctly capture my intent. Trying to think of a better way to phrase this in prose but haven't come up with something specific. One usage of the full syntax with all the ordering was in there:

To address some part of the scariness of wrapping: The only value validity requirements are already verified by the existence and validity of the value with the underlying type. The user-defined invariants otherwise imposed through privacy rules should be correctly verified as if the construction was done via the constructor as specified at that point, but without the sizedness restrictions usually placed on values. With regards to the standard library and NonZero optimizations, this would of course require unsafe for those types where internal attributes make the constructor itself unsafe, but I don't think any of these are currently exposed.

If we’re going to reuse the keyword as, why not just extend the regular as operator to support this?

let foo: &[u8] = ...;
let bar: &ascii = foo as &ascii;

(The compiler would still enforce that this can only be done in contexts where the field is visible.)

I can see an argument that as is overloaded (and dangerous) enough as it is and shouldn’t be extended further. But in that case, reusing the keyword in a different location can still be confusing to the programmer, even if the parser sees it as a totally separate construct. And it still seems unnecessary to involve pattern syntax at all.


Because while the restriction of #[repr(transparent)] does not allow for other non-zero-sized fields, it allows other zero-sized ones. Thus, stuff like this is happily allowed:

#[repr(transparent)] // Gets repr of `T`
struct ArbitraryGeneric<'a, T: ?Sized, U>(T, PhantomData<&'a U>);

And if fields need to be involved for some types, then it becomes confusing why the type name alone is sufficient for other variations. For the compiler it may also be harder to choose the correct interpretation for foo as &ascii if ascii has more than the obvious single field. And when you begin actually having to specify the constructor, it quickly becomes tedious to avoid values at any cost. Patterns also depart quite far from attaching as to builtin implicit conversions; that’s true but kind of intentional.

But this is about promoting the idea of this operation, mostly, so I’m open to other syntax suggestions. If you strictly do not want to involve patterns maybe a placeholder syntax could instead signify the relevant field:

let foo: &[u8] = ...;
// _ signifies the single representational field, all other fields get values the standard way.
let bar: &ascii = foo as &ascii(_);
// and for the above: [T = [u8]]
let bar: &ArbitraryGeneric = foo as &ArbitraryGeneric(_, PhantomData);

But there can only be 1 unsized field in a type that is also stored in memory (not in a PhantomData or similar), so there should be 1 obvious choice for which field to use.

Since we already have unsizing coercions, we could reuse the logic from that.

struct Foo<T: ?Sized> {
    num: u32, bar: String,
    val: T
}

fn main() {
    let foo = Foo { num: 10, bar: String::new(), val: [0,1,2] };
    let foo : &Foo<[_]> = &foo;
}

The problem is not choosing the field, but filling the rest of the fields with values that satisfy the rest of the compiler. Afaik you can not elide those fields in other initializers for reasons of the type and borrow checker, although I’m not up-to-date nor fully positive. But you couldn’t write:

struct Foo<'a, T: ?Sized> {
    inner: u8,
    phantom: PhantomData<&'a T>,
}

let data = Foo {
     inner: 0,
};

From this point of view it seems both counter-intuitive and dangerous (in terms of implementation issues) to allow it for this type of initialization.


Ah, my bad


After thinking about this a little I wonder how bad it would be to just allow unsized types to exist by-value within the context of a DST constructor and require them to be placed back within a reference to carry their metadata, so for the initial example you would just write:

#[repr(transparent)]
struct ascii([u8]);

let byte_slice: &[u8] = ...;
let val: &ascii = &ascii(*byte_slice);

This is more special-cased than the proposal as I don’t see any way to extend it to the AsciiChar example, but at least to me it seems much more readable for the DST wrapper case.


I really wish it could work like that, but the operation is inherently different from an actual constructor. For example, should unsized values be allowed, this will sadly lead to inconsistent behaviour. Then *byte_slice would be allowed to create a value which moves from the original value (e.g. moving out of and destructing a Box<_>), and we would suddenly be stuck between moving by the usual semantics and the commitment of this syntax to only borrow. I’m not sure we can do both with that proposed syntax. A general concept for unsized values by move seems overall preferable, but having it in addition to this one, which inherently should not move, seems even more powerful.

I see. It makes sense, but still feels strange. For one thing… with your syntax, what if a zero-sized type implements Drop? Should it be called when casting? Since the zero-sized type logically remains part of the data, I suppose not, but then since the original type of the buffer being borrowed didn’t include the zero-sized type, there’s no reason to expect Drop will ever be called… On the other hand, I suppose that’s an edge case; mem::forget is already safe, and in theory the new syntax could forbid types with Drop impls.

On the other hand, for the regular-as-syntax alternative, the compiler could require zero-sized fields to have a Default impl (and call it when doing the cast, just for consistency). That would be slightly less flexible, since it wouldn’t work in cases where the ZST represents some invariant and you have to do something fancy to obtain a value of the type – but that seems like an edge case, and there’s always unsafe as an escape hatch…


Good observation. It seems that there is a slight disconnection between this and usual constructor semantics. (And your postfix expression as syntax has really grown on me, so I’ll use it instead for now. Hopefully it is also available).

struct SomeUnitStruct;
#[repr(transparent)] struct ascii([u8], SomeUnitStruct);

let bytes: &[u8] = ...;
let asc: &ascii = bytes as &ascii(_, SomeUnitStruct);
       //  passed by 'reference' ^^^ ************** <- but this by value

The identifier hides the details but the transparently wrapped value is conceptually passed by reference to entirely avoid depending on unsized value semantics. Could the other values also be passed by reference, thus avoiding the Drop discussion?

let tag: &SomeUnitStruct = _; // Bring your own tag.
let asc: &ascii = bytes as &ascii(_, tag);

This would also sidestep separately specifying how to handle the tag data having a smaller lifetime than the reference which you are casting, it would simply fall to an error of failing to infer an appropriate lifetime for tag itself.

However, the initial restriction to Copy for all zero-sized types seems useful enough already. So I don’t think resolving this would be necessary, it would be an unresolved question in the RFC.