Pre-RFC: Minimum viable FFI unions


#1

Motivation

I have recently become the Nth person to run into difficulties while writing an FFI binding because unions cannot be put in a #[repr(C)] struct. My use case calls for very little interaction with the unions, but I do still need to access struct fields after the union-typed members, which requires Rust to be able to determine the size and alignment of the union. There have been quite a few union proposals in the past, but all of them have died for lack of consensus. This proposal is optimized for its ability to achieve consensus: it does as little as possible while remaining extensible in many ways, so that when Rust has a much more ambitious union system, this proposal will appear as just a special case rather than a deprecated wart.

My intention in sending this is to build consensus behind some proposal, ideally to the point where anyone who gets the itch can implement it and expect it to be merged (subject to quality-of-implementation checks).

Detailed design

Syntax

We add #[repr(union)] as a new attribute. It may only be used on enums, and only in conjunction with #[repr(C)].

#[repr(union)]
#[repr(C)]
enum my_ffi_union {
    branch_a { ptr: usize },
    branch_b { bits: [u16; 3] },
}

(rationale: Because this is only allowed when #[repr(C)] is already specified, this can be considered a proposal to extend FFI rather than a proposal for unions.)

Semantics

#[repr(C)] unions follow the target platform’s C ABI rules for unions. In most cases this will require the union to be at least as large as the largest branch (with the size rounded up to a multiple of the alignment), be as aligned as the most aligned branch, and have all branches at an offset of 0.
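As a rough illustration of the rule above, the expected layout can be computed by hand from the branch layouts. This is only a sketch: BranchA and BranchB are hypothetical stand-in structs for the two branches of my_ffi_union, union_align and union_size are helper names invented here, and the "largest branch, rounded up to the alignment" rule is the common case rather than something every C ABI is guaranteed to follow.

```rust
use std::mem::{align_of, size_of};

// Hypothetical stand-ins for the two branches of `my_ffi_union`.
#[allow(dead_code)]
#[repr(C)]
struct BranchA {
    ptr: usize,
}

#[allow(dead_code)]
#[repr(C)]
struct BranchB {
    bits: [u16; 3],
}

// Alignment of the union: that of the most aligned branch.
const fn union_align() -> usize {
    let a = align_of::<BranchA>();
    let b = align_of::<BranchB>();
    if a > b { a } else { b }
}

// Size of the union: the largest branch, rounded up to a multiple of the alignment.
const fn union_size() -> usize {
    let a = size_of::<BranchA>();
    let b = size_of::<BranchB>();
    let largest = if a > b { a } else { b };
    let align = union_align();
    (largest + align - 1) / align * align
}
```

On a typical 64-bit target the pointer-sized branch dominates and the result is 8 bytes with 8-byte alignment. On a typical 32-bit target the largest branch is the 6-byte [u16; 3], and rounding up to the 4-byte alignment again gives 8 bytes, which is the kind of per-platform difference that is easy to get wrong by hand.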

Because #[repr(C)] unions have a layout constrained by the C specification and precisely specified by platform-specific ABI documentation, they can validly be used with raw pointer casts and mem::transmute_copy. In fact, this is necessary for many common tasks due to the minimalism of this proposal.

#[repr(C)] unions do not implement Drop, and their branches must be Copy. Adding a type with nontrivial drop semantics to a union results in an ill-formed type. (rationale: There are again several proposals here. Forbidding it at compile time maximizes forward compatibility.)

(rationale: The rule on requiring Copy on branches is quite strict, and it could possibly be avoided in some cases, especially if we were to add linear types. In a world with mem::forget and half a dozen ways to leak memory, linear types are primarily a lint. C++11’s unrestricted unions patch appears to specify that if a union contains a branch with a nontrivial destructor, the union becomes a linear type, and must be destructured to a specific branch before being deleted; I would be quite fine with that, except that we don’t have linear types now and making a type affine now but linear tomorrow would be a breaking change. So, for now I propose to forbid unions where any branch has a nontrivial destructor. Copy versus !Drop remains as a question, but seems low enough impact that it can be left to the implementor’s discretion.)
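The gap between the two candidate rules can be seen with existing library tools. The sketch below is an illustration, not part of the proposal: PlainData, require_copy and allowed_under_no_drop_rule are hypothetical names, and std::mem::needs_drop from present-day Rust is used as an approximation of "has a nontrivial destructor" (its documentation allows it to answer true conservatively).

```rust
use std::mem::needs_drop;

// Hypothetical payload type: it has no destructor, but it is also not `Copy`.
#[allow(dead_code)]
struct PlainData {
    value: u32,
}

// Compiles only when `T: Copy`; models the stricter "branches must be Copy" rule.
fn require_copy<T: Copy>() {}

// Models the looser "!Drop" rule: accept any type that has no drop glue.
fn allowed_under_no_drop_rule<T>() -> bool {
    !needs_drop::<T>()
}

#[allow(dead_code)]
fn check_branches() -> bool {
    // Both payload types from `my_ffi_union` satisfy the strict rule.
    require_copy::<usize>();
    require_copy::<[u16; 3]>();
    // require_copy::<PlainData>(); // would be rejected: PlainData is not Copy

    // PlainData is only acceptable under the looser rule.
    allowed_under_no_drop_rule::<PlainData>()
}
```

usize passes both rules and String fails both. PlainData is the interesting case: it would be accepted under !Drop but rejected under Copy.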

Union constructors may not be used for pattern matching under any circumstance. This includes derive-generated code, so most derivations are not applicable to unions. (rationale: There are viable proposals for pattern matching unions, but they add quite a bit of compiler complexity that is not needed for a first pass.)

Union constructors may be used as functions, and fill the remainder of the union with mem::uninitialized(). (rationale: mem::transmute_copy cannot be used to expand a value, so otherwise constructing a value of FFI union type would require an awkward dance with mem::uninitialized and raw pointer casts. Still, this can be omitted if it proves difficult.)

Extracting data from a union, if it is to be done, must use raw pointer casts or mem::transmute_copy as no other method is specified at this time.

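use std::mem;

fn mk_branch_a(ptr: usize) -> my_ffi_union {
    // enum variants are namespaced, so the constructor is named through the type
    my_ffi_union::branch_a { ptr: ptr }
}

fn mk_branch_a_manual(ptr: usize) -> my_ffi_union {
    // not using the constructor, to demonstrate that constructors are a severable part of this proposal
    unsafe {
        let mut tmp: my_ffi_union = mem::uninitialized();
        // a *mut pointer must come from a mutable borrow
        *(&mut tmp as *mut my_ffi_union as *mut usize) = ptr;
        tmp
    }
}

unsafe fn unmk_branch_a(un: my_ffi_union) -> usize {
    // transmute_copy takes a reference to the source value
    mem::transmute_copy(&un)
}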
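Since #[repr(union)] does not exist, the functions above cannot be compiled as written. The following is a sketch of the same raw-pointer dance against a hand-written placeholder in present-day Rust. MyFfiUnionStorage is a hypothetical stand-in that is deliberately oversized on 64-bit targets (so it is not a layout match for the real C union), and it starts from zeroed memory via MaybeUninit rather than mem::uninitialized() so that the returned value is fully initialized.

```rust
use std::mem::{self, MaybeUninit};

// Hypothetical placeholder for `my_ffi_union`. On 32- and 64-bit targets two
// machine words are at least as large and as aligned as either branch
// (`usize` or `[u16; 3]`), at the cost of possibly wasting space.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct MyFfiUnionStorage {
    raw: [usize; 2],
}

fn mk_branch_a_manual(ptr: usize) -> MyFfiUnionStorage {
    // Zeroed rather than uninitialized: every byte of the result is defined.
    let mut tmp = MaybeUninit::<MyFfiUnionStorage>::zeroed();
    unsafe {
        // Branch A lives at offset 0, so write the `usize` through a pointer cast.
        (tmp.as_mut_ptr() as *mut usize).write(ptr);
        tmp.assume_init()
    }
}

unsafe fn unmk_branch_a(un: MyFfiUnionStorage) -> usize {
    // Copies the first `size_of::<usize>()` bytes out of the storage.
    unsafe { mem::transmute_copy(&un) }
}
```

A round trip such as unmk_branch_a(mk_branch_a_manual(p)) should give back p. What the placeholder cannot do is pick its own size, which is the part this proposal asks the compiler to handle.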

Drawbacks

If this proposal is adopted and there is later a desire to add a featureful native Rust notion of unions, there will be strong pressure to have the future unions subsume the unions proposed herein. As such, this proposal somewhat constrains the design of future unions. Efforts have been taken to make this proposal as unopinionated as possible, to minimize the impact of said constraint.

Alternatives

Some way of creating a type with the shape of a C union is necessary.

A much more ambitious proposal could exist which defines both Rust and FFI unions. In fact, several of them already do.

rust-lang/rfcs#371 is an interesting example of a very minimal proposal, but it creates new syntax that has no credible evolution to a full union system, so it seems much more problematic to stabilize as a language feature.

If we had a featureful type-level constants system with const fn integration, a very crude approximation of this could be done as a library:

trait HasType { type TYPE; }
struct ForAlign<NN: i32>;
impl HasType for ForAlign<1> { type TYPE = u8; }
impl HasType for ForAlign<mem::align_of::<u16>() == 2 ? 2 : -1> { type TYPE = u16; }
impl HasType for ForAlign<mem::align_of::<u32>() == 4 ? 4 : -2> { type TYPE = u32; }
impl HasType for ForAlign<mem::align_of::<u64>() == 8 ? 8 : -3> { type TYPE = u64; }
const fn is_power_of_two(x: usize) -> bool { (x & (x - 1)) == 0 }
impl HasType for ForAlign<(mem::align_of::<usize>() > 8 || !is_power_of_two(mem::align_of::<usize>())) ? mem::align_of::<usize>() : -4> { type TYPE = usize; }

type SomethingOfAlign<NN: i32> = <ForAlign<NN> as HasType>::TYPE;

struct AlignedBuffer<ALIGN, MIN_BYTES> {
    _array: [ SomethingOfAlign<ALIGN>; (MIN_BYTES / ALIGN).ceil() ],
}

struct Union2<BRANCH_1, BRANCH_2> {
    _padding: AlignedBuffer<cmp::max(mem::align_of::<BRANCH_1>(),mem::align_of::<BRANCH_2>()),
        cmp::max(mem::size_of::<BRANCH_1>(),mem::size_of::<BRANCH_2>())>,
}

However, this would integrate poorly with compiler lints, and since some C ABIs have specific requirements for unions (IIRC), it would split C ABI knowledge between the compiler’s handling of #[repr(C)] structs and this external library. I’d much prefer to have the ABI knowledge in one place.
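For comparison, a small part of this can be approximated with const generics in present-day Rust. The sketch below is not the library described above: the alignment and element count are passed in by hand, because (as far as I know) computing them from the branch types with cmp::max inside a generic type still needs unstable features. The impls assume that u8, u16 and u32 have alignments 1, 2 and 4, which holds on common targets but is not guaranteed everywhere, and ExampleUnion and example_layout are names invented for this example.

```rust
use std::mem::{align_of, size_of};

// Maps an alignment to an integer type that is assumed to have that alignment.
trait HasType {
    type Type: Copy;
}

#[allow(dead_code)]
struct ForAlign<const N: usize>;

impl HasType for ForAlign<1> { type Type = u8; }
impl HasType for ForAlign<2> { type Type = u16; }
impl HasType for ForAlign<4> { type Type = u32; }

// Opaque storage made of `COUNT` elements of the type chosen for `ALIGN`.
#[allow(dead_code)]
#[repr(C)]
struct AlignedBuffer<const ALIGN: usize, const COUNT: usize>
where
    ForAlign<ALIGN>: HasType,
{
    _array: [<ForAlign<ALIGN> as HasType>::Type; COUNT],
}

// Hand-computed for a union of `u32` and `[u16; 3]`, assuming `u32` has
// alignment 4: two `u32` elements, i.e. 8 bytes.
type ExampleUnion = AlignedBuffer<4, 2>;

// Returns (size, alignment) of the example storage type.
fn example_layout() -> (usize, usize) {
    (size_of::<ExampleUnion>(), align_of::<ExampleUnion>())
}
```

The numbers 4 and 2 still had to be worked out by a human, so this shares the drawback described above: the ABI knowledge ends up outside the compiler.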

Unresolved questions

Copy or !Drop?

Should we allow multiple fields in branches? It would be somewhat more C-y to restrict branches to one field each, and force users to define structs if they want multiple fields. On the other hand, enforcing that seems user-hostile and complicates the compiler for no identifiable gain.


#2

It seems awkward to have native support for unions which doesn’t even allow getting the data out. That doesn’t seem much of an improvement over the current manual casting approach…


#3

Manual casting is fine. The problem with the status quo is, manual casting to what? The largest branch could be different on different architectures, and figuring out which often breaks encapsulation (which is bigger, a CRITICAL_SECTION or three function pointers?). I want the compiler to come up with a type that requires exactly the right amount of space, on all current and future supported platforms, with no guesswork. Everything else is gravy.
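To make the "manual casting to what?" problem concrete, here is a sketch of the kind of hand-written placeholder the status quo pushes people toward. MyFfiUnionGuess and guess_is_large_enough are hypothetical names, and the guess that the pointer-sized branch is always the largest is exactly the assumption under test.

```rust
use std::mem::size_of;

// Hand-written placeholder for a C union of a pointer-sized integer and
// `uint16_t[3]`, written by someone who assumed the pointer-sized branch is
// always the largest one.
#[allow(dead_code)]
#[repr(C)]
struct MyFfiUnionGuess {
    storage: usize,
}

// True when the placeholder can actually hold the `[u16; 3]` branch.
fn guess_is_large_enough() -> bool {
    size_of::<MyFfiUnionGuess>() >= size_of::<[u16; 3]>()
}
```

With 8-byte pointers the guess holds. With 4-byte pointers the placeholder is 4 bytes while the array branch needs 6, so every struct field after the union would land at the wrong offset.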

I want to avoid specifying the easy things, because if I include easy things people will bikeshed them and then we’ll have nothing.


#4

I wrote a related RFC for unsafe enums, which at least provided a way to get any given field out of a union via an unsafe let Variant(x) = foo;