Pre-RFC: unsafe enums. Now including a poll!

I’ve started work on revising the old RFC. If anyone has any suggestions, feel free to share them.

My qualm with providing this functionality under the guise of unsafe enum, “an enum without the discriminant”, is that this is taking out the very part that makes it enum-like.

From a familiarity perspective, Rust aims to make ADTs / sum types approachable to C-family programmers by presenting them as an enum, with the addition that each enumerator can have associated data. But if you have data without enumerators, then this analogy no longer makes sense. But that’s the smaller issue, as you can just name it something other than unsafe enum.

The other issue is that Rust’s enums have named variants because they are positional. If I have enum Bool { True, False }, then True is different from False; likewise, if I have enum Integer { Positive(Natural), Negative(Natural) }, then Positive and Negative are distinguishable, despite both containing a Natural. With unsafe enum this is no longer the case, and the only thing that matters is the contained type, so packaging them in variants just fundamentally doesn’t make sense, in my opinion.

If the provided-as-a-library-type UnsafeUnion<A, B> I proposed at some point is too minimalist to be ergonomic, then something like union Foo { usize, *mut Bar, MyType } with the ability to unsafely cast between Foo and any of the listed types using as would still make more sense, imho.

I’d like to make an alternative proposal for unsafe unions. Rather than using unsafe enum and tagged field references, I’d instead suggest using a struct-like construct and dotted field accessors. For instance:

#[repr(C,union)]
struct U {
    f1 : T1,
    f2 : T2,
}

fn f(u : U) {
    let x = unsafe { u.f1 };
    do_something_with(x);
}

fn g(x : T1) -> U {
    U { f1 : x }
}

The syntax #[repr(C,union)] struct should be read as union; however, this syntax avoids adding a new keyword for unions.

The dotted field syntax naturally chains: within an unsafe block, you can write s.u.f1.g3, where u has a union type. This seems like the most natural syntax for a union, and it provides maximum familiarity for users of C APIs via FFI.

Safety properties: Any reference to a union field, either as an access, assignment, or initializer, requires an “unsafe” block. Safe code can opaquely pass around a union or a type containing a union. No hard requirements on the field types, but the compiler should produce a warning on putting a type with “Drop” into a union.

Rust code (with appropriate unsafe blocks) may assign into one field and access from another. This has the same semantics as std::mem::transmute; the unsafe code has the responsibility to do so only on appropriate types.
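To make the "assign into one field, access from another" point concrete, here is a minimal sketch. It assumes the `union` item that current Rust provides rather than the `#[repr(C,union)] struct` spelling proposed above, and the names `FloatBits` and `float_to_bits` are made up for illustration.

```rust
// Assumption: written with the `union` item of current Rust, not the
// `#[repr(C,union)] struct` syntax proposed in this thread.
#[repr(C)]
union FloatBits {
    float: f32,
    bits: u32,
}

/// Initializes the `float` field, then reads the same bytes back through `bits`.
fn float_to_bits(x: f32) -> u32 {
    let u = FloatBits { float: x };
    // Reading a union field needs `unsafe`: the compiler cannot know which
    // field was written last. Here every bit pattern is a valid u32.
    unsafe { u.bits }
}
```

The read behaves like a transmute from f32 to u32, which is only acceptable because both types are the same size and u32 has no invalid bit patterns.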

The union has the size of its largest field type, and the maximum alignment of any of its field types.
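A small sketch of how that layout rule can be checked, again assuming the `union` item of current Rust with `#[repr(C)]`. `Sample` and `sample_layout` are hypothetical names. Note that, as in C, the size is rounded up to a multiple of the alignment, so it can be larger than the largest field.

```rust
use std::mem::{align_of, size_of};

// Hypothetical union: a 5-byte array next to a u32.
#[repr(C)]
#[allow(dead_code)]
union Sample {
    bytes: [u8; 5],
    word: u32,
}

/// Returns (size, alignment) of `Sample` as computed by the compiler.
fn sample_layout() -> (usize, usize) {
    (size_of::<Sample>(), align_of::<Sample>())
}
```

On a target where u32 is 4-byte aligned, this should report an alignment of 4 and a size of 8 (5 bytes rounded up to the next multiple of 4).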

Union fields may use pub.

Using #[repr(C,union)] on a tuple struct, or on anything other than a struct, produces an error.

Using #[repr(union)] without the C does not have a defined layout (though in practice it’ll likely not change anything). If we want to reserve that for some other behavior, we could have it produce an error instead. Or we could make union imply C.

Optionally, we could also introduce anonymous unions within structures in the same RFC, or introduce those in a later RFC.


Borrowing an idea from Josh, unsafe enum could too allow direct field access.

unsafe enum MyUnion { Foo(i32), Bar { x: i32, y: f32 } }
let thing = MyUnion::Foo(5);
let x = unsafe { thing.Foo.0 };

Isn’t it possible to do this with macros, perhaps with some slight additions to the macro system?

union!
{
f1: T1,
f2: T2
}

This would expand to a struct containing a u8 array of the maximum size among the fields, plus a set of unsafe get_f1, get_mut_f1, and into_f1 accessors.
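Roughly what such an expansion might look like if written out by hand. This is only a sketch: `T1` and `T2` are stand-in types chosen for illustration, the size is computed by hand, and only by-value getters are shown (a `get_mut_f1` returning a reference would additionally need the storage to be suitably aligned).

```rust
use std::ptr;

// Hypothetical stand-ins for the T1 and T2 of the macro invocation.
type T1 = u32;
type T2 = f64;

// Hand-computed maximum of the two field sizes (f64 is the larger one).
const SIZE: usize = 8;

pub struct MyUnion {
    bytes: [u8; SIZE],
}

impl MyUnion {
    pub fn from_f1(value: T1) -> Self {
        let mut u = MyUnion { bytes: [0; SIZE] };
        // The byte array is only 1-byte aligned, so use an unaligned write.
        unsafe { ptr::write_unaligned(u.bytes.as_mut_ptr() as *mut T1, value) };
        u
    }

    pub fn from_f2(value: T2) -> Self {
        let mut u = MyUnion { bytes: [0; SIZE] };
        unsafe { ptr::write_unaligned(u.bytes.as_mut_ptr() as *mut T2, value) };
        u
    }

    /// Unsafe to mirror the thread: the caller asserts the bytes form a valid T1.
    pub unsafe fn get_f1(&self) -> T1 {
        unsafe { ptr::read_unaligned(self.bytes.as_ptr() as *const T1) }
    }

    /// Unsafe for the same reason as `get_f1`.
    pub unsafe fn get_f2(&self) -> T2 {
        unsafe { ptr::read_unaligned(self.bytes.as_ptr() as *const T2) }
    }
}
```

Every access goes through a method call, which is the ergonomic cost compared with plain field syntax.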

One point that doesn't seem to be covered in the old RFC is whether it is legal to access a different variant than the one that was used for initialization. (No one can ever seem to figure out whether it's legal in C, so I think it should be spelled out here.)

I think the history is that it was originally either UB or implementation-defined. However, the standard made no provision for type punning (i.e. reinterpret_cast) at all, except via memcpy(), even though punning is a real necessity in practice. So when this started to become an issue (if you remember when gcc started enabling -fstrict-aliasing by default, there was much wailing and gnashing of teeth), the end result was that unions were promoted to being the only "standard" way of doing type punning and the compiler folks committed to supporting it. Language to that effect seems to have made it into the C99 standard.

Plus, there's long precedent for APIs that use unions this way - BSD sockets being the most notable example.

Fortunately Rust’s aliasing rules are simpler. You’re allowed to have immutable pointers of different types that alias each other and everything works out fine.

Given a variable of type Foo and a pointer of type Bar that aliases it, that’s okay as long as: both Foo and Bar are #[repr(C)]; no field in Bar overlaps padding in Foo (reading padding is UB); and every valid bit representation in Bar is also a valid bit representation in Foo (so something like usize and &Blah could cause problems, because 0 is valid for usize but not for &Blah). At least, that’s my understanding of Rust’s current rules. Feel free to correct me if I’m wrong.
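One observable consequence of the usize versus &Blah point: because the all-zero pattern is never a valid reference, the compiler can reuse it to represent None, whereas usize has no spare pattern. A tiny sketch, with hypothetical helper names:

```rust
use std::mem::size_of;

/// True if `Option<&u8>` needs no extra space: null stands in for `None`.
fn reference_has_null_niche() -> bool {
    size_of::<Option<&u8>>() == size_of::<&u8>()
}

/// True if `Option<usize>` needs extra space: every bit pattern of
/// `usize`, including 0, is already a valid value.
fn usize_needs_separate_tag() -> bool {
    size_of::<Option<usize>>() > size_of::<usize>()
}
```

This only illustrates the validity difference between the two types; it says nothing about the aliasing rules themselves.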

I think the current macro system could do it, but it would require more CTFE. We’d need compile-time size_of, max(a, b), and maybe align_of (are zero-sized arrays defined to affect the alignment?).
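For reference, a sketch of what this can look like in current Rust, where `size_of` and `align_of` are usable in constants and a `const fn max` can be written by hand. `T1`, `T2`, `max`, and `Storage` are hypothetical names, and the most-aligned field type is still picked by hand here. Zero-length arrays do carry the alignment of their element type, so they can be used to raise the alignment of the storage.

```rust
use std::mem::size_of;

/// Compile-time maximum of two sizes.
const fn max(a: usize, b: usize) -> usize {
    if a > b { a } else { b }
}

// Hypothetical field types for the union being emulated.
type T1 = u16;
type T2 = u64;

/// Size of the largest field, evaluated at compile time.
const SIZE: usize = max(size_of::<T1>(), size_of::<T2>());

/// Raw storage big enough and aligned enough for either field.
#[repr(C)]
#[allow(dead_code)]
pub struct Storage {
    // Zero-sized, but still gives the struct the alignment of T2.
    // Assumption: T2 was chosen by hand as the most-aligned field type.
    _align: [T2; 0],
    bytes: [u8; SIZE],
}
```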

What I would really like is: pluggable struct/enum implementations (pluggable #[repr], basically).

I haven’t yet dug into the relevant parts of the compiler to know if this would work, but the scheme I have in mind is just:

  • given some struct definition (a list of name/type pairs), the #repr implementation creates get(), set(), and optionally get_ref() methods (or equivalently, get_offset()) - and also spits out the size and alignment.

get_ref()/get_offset() wouldn’t exist for e.g. C bitfields.
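To illustrate why a bitfield-like representation can offer get() and set() but not get_ref(), here is a hand-written sketch of what such a pluggable repr might generate. Everything in it (the `Flags` type, the field names, the bit widths) is hypothetical.

```rust
/// Two logical fields packed into one byte:
/// `kind` in bits 0..3 and `level` in bits 3..8.
#[derive(Clone, Copy, Default)]
pub struct Flags(u8);

impl Flags {
    pub fn kind(self) -> u8 {
        self.0 & 0b111
    }

    pub fn level(self) -> u8 {
        self.0 >> 3
    }

    pub fn set_kind(&mut self, v: u8) {
        // Clear the low three bits, then store the new value there.
        self.0 = (self.0 & !0b111) | (v & 0b111);
    }

    pub fn set_level(&mut self, v: u8) {
        // Keep the low three bits, then store five bits above them.
        self.0 = (self.0 & 0b111) | ((v & 0b1_1111) << 3);
    }
}
```

Neither field occupies whole bytes of its own, so there is no address a `&u8` to `kind` or `level` could point at; only by-value accessors make sense.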

I’d really love something like this for a cap’n proto implementation.

@koverstreet

I already use a macro solution for unions in winapi. One of the downsides is that, because macros cannot create usable identifiers, if I want to have a variant named foo I have to pass both foo and foo_mut to the macro explicitly. I also still end up manually calculating the size and alignment of the union for every architecture.

The whole point of unsafe enums is so that I don’t have to deal with all that mess. I mean seriously, look at the kind of stuff I have to work with just to get something like unions:

Please, be nice to a bunny and vote for first class union support in Rust.


This could be done with macros, yes, but accessor functions make this much more painful than field access syntax would be. @retep998 has some examples of using macros for this, and they make the resulting code far more painful.

unsafe enum could do this, yes; however, enums don’t normally allow field notation, and allowing it just for unsafe enums seems far more magic than an attribute on struct.

Also, your MyUnion::Foo(5) needs to go in an unsafe block, too; in the general case, putting something in a union field is unsafe.

Why would putting something into a union be unsafe? At worst the thing has a Drop impl which doesn’t get called, but Rust makes it clear that leaking something is safe although not necessarily desirable.
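A sketch of the leak scenario, assuming the `union` item of current Rust, which requires non-Copy fields to be wrapped in `ManuallyDrop`. `Noisy`, `Slot`, `DROPS`, and the two functions are hypothetical names used only to count destructor calls.

```rust
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times `Noisy::drop` has run.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

#[allow(dead_code)]
union Slot {
    value: ManuallyDrop<Noisy>,
    raw: u8,
}

/// Number of destructor runs when a filled union simply goes out of scope.
fn drops_when_leaked() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    {
        let _slot = Slot { value: ManuallyDrop::new(Noisy) };
        // `_slot` goes out of scope here; the union does not know which
        // field is live, so nothing is dropped.
    }
    DROPS.load(Ordering::SeqCst) - before
}

/// Number of destructor runs when the field is dropped by hand.
fn drops_when_dropped_by_hand() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    let mut slot = Slot { value: ManuallyDrop::new(Noisy) };
    // Unsafe: the caller asserts that `value` is the live field.
    unsafe { ManuallyDrop::drop(&mut slot.value) };
    DROPS.load(Ordering::SeqCst) - before
}
```

Construction needs no `unsafe` here, and the worst outcome of forgetting the manual drop is a leak.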

@josh Nor do structs with multiple fields normally allow construction with only one of the fields, but you did that in your example. It seems like an equivalent level of weirdness.

MyUnion::Foo(5) isn’t unsafe - it’s construction. Just like the construction wasn’t unsafe in your example.

I have added a poll to the top of the thread. Please vote on it so that we have an idea of what the community would prefer.

Sorry, I misread the enum syntax. Yes, construction is fine. Both reading and writing a field are unsafe.

With repr-based unions and field syntax, construction looks almost exactly like a struct, except that you only initialize one field instead of several; that seems natural for a union, and the difference directly relates to the nature of a union. And field reading and writing syntax looks exactly like that of a struct. It’s the syntax that you’d naturally expect from looking at the declaration.

Adding field syntax to unsafe enums doesn’t seem directly related to the purpose of a union; it’s a convenient syntax, but not one normally associated with an enum at all. It makes reading and writing look fundamentally different than that of an enum despite using the name “enum”, but despite using the syntax of a struct for reading and writing, it uses the syntax of an enum for construction. That seems to me like an error-prone special case, where you have to remember an entirely different set of syntax rules for “unsafe enum” that don’t exactly match those of any other object type, that don’t match “enum”, and that thus don’t follow naturally from the declaration.

If we use “unsafe enum”, the syntax should match an enum. If we use #[repr(union)] struct, the syntax should match a struct. I’d take either one over not having unions, but I’d prefer the struct syntax.

Yeah, I think this is an important point. The #[repr(union)] struct seems like the easiest for someone who has never seen this before to guess the syntax for.

I'd also take any of the three syntaxes in the poll over not having unions, but I want to particularly call out some issues with the unsafe-pattern-matching syntax.

Unlike with the other two syntaxes, there is no direct way to wrap unsafe around just the field access. You have to either thread the value out:

let x = unsafe { let Union::Variant(x) = union; x };

Or else put all the rest of your code inside the unsafe block:

unsafe {
    let Union::Variant(x) = union;
    // ...lots of safe code...
}

[Warning: Rust newbie]

A C-style union also offers a chance to add moderately-unsafe user-tagged unions to Rust. They are useful to implement more compact enums, tagged pointers, or smarter enums.

For a system language it’s nice to offer an unsafe optional way to manage the tag manually.

If you have to store this:

enum M { A(u16), B((u8, u8)) }

but you only need the least significant 15 bits for your data, you can implement a tag trait (UnionTag below) for an unsafe enum; it asks for a tag() function that uses one bit of the value to perform the run-time dispatch between the A() and B() variants. This way your enum uses only 2 bytes instead of 4, and you can still pattern match nicely on it with match{}.

#[repr(unsafe_union)]
enum MyEnum { A(u16), B((u8, u8)) }

impl UnionTag for MyEnum {
    fn tag(&self) -> u32 {
        // Bit 15 is the tag; the low 15 bits of the u16 carry the data.
        if unsafe { self.A.0 & 0x8000 } == 0 { 0 } else { 1 }
    }
}
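The same idea can be approximated in current Rust without any new language feature, by packing by hand and unpacking into an ordinary enum for matching. This is only a sketch under assumptions: `PackedM`, `TAG_BIT`, and the choice of bit 15 as the tag are made up for illustration, and the B variant loses the top bit of its first byte to the tag.

```rust
/// Ordinary enum used as the safe, matchable view.
#[derive(Debug, PartialEq, Clone, Copy)]
enum M {
    A(u16),
    B(u8, u8),
}

/// Two-byte packed form: bit 15 is the tag, bits 0..15 are the payload.
#[derive(Clone, Copy)]
struct PackedM(u16);

const TAG_BIT: u16 = 0x8000;

impl PackedM {
    /// Packs an `A` value; fails if the value needs more than 15 bits.
    fn from_a(v: u16) -> Option<Self> {
        if v & TAG_BIT == 0 { Some(PackedM(v)) } else { None }
    }

    /// Packs a `B` value; fails if `hi` needs more than 7 bits.
    fn from_b(hi: u8, lo: u8) -> Option<Self> {
        if hi & 0x80 == 0 {
            Some(PackedM(TAG_BIT | ((hi as u16) << 8) | (lo as u16)))
        } else {
            None
        }
    }

    /// Reads the tag bit and rebuilds the matchable enum.
    fn unpack(self) -> M {
        if self.0 & TAG_BIT == 0 {
            M::A(self.0)
        } else {
            let payload = self.0 & !TAG_BIT;
            M::B((payload >> 8) as u8, payload as u8)
        }
    }
}
```

Callers would write `match packed.unpack() { M::A(x) => ..., M::B(hi, lo) => ... }`, so the storage stays at 2 bytes while matching stays safe.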

Another usage example: if from Rust you want to call a C function that returns a positive value (0…maxint) in case of success and a -1 in case of error, you can wrap the result of such C function in a safer but compact unsafe union like this:

#[repr(unsafe_union)]
enum MyOption { Error, Result(i32) }

impl UnionTag for MyOption {
    fn tag(&self) -> u32 {
        if unsafe { self.Result.0 } < 0 { 0 } else { 1 }
    }
}

MyOption still takes only 4 bytes, just as the result of the C function (unlike a generic Option) but it’s safer than the original C API, because you can pattern match on MyOption.
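A rough equivalent that works in current Rust is a newtype over the raw return value that converts to Option only at the point of use. `CStatus`, `from_raw`, and `get` are hypothetical names, and the "negative means error" convention is the one described above.

```rust
/// Wraps the raw i32 returned by a C function; negative means failure.
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct CStatus(i32);

impl CStatus {
    pub fn from_raw(raw: i32) -> Self {
        CStatus(raw)
    }

    /// Decodes the in-band error value into an ordinary `Option`.
    pub fn get(self) -> Option<i32> {
        if self.0 < 0 { None } else { Some(self.0) }
    }
}
```

The stored value stays at 4 bytes, and callers can write `match status.get() { Some(n) => ..., None => ... }` instead of comparing against -1.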

This usage pattern is offered by a feature of the D language standard library: a second variant of Nullable (similar to Rust’s Option) that lets you specify the “null” value: http://dlang.org/phobos/std_typecons.html#.Nullable.2


Related to the above idea of custom tags, @thepowersgang pointed out in IRC that with the unsafe-pattern-matching version of unsafe enum, we could support something like the common “tagged union” pattern in C:

#[repr(C)]
struct Foo {
    kind: FooKind,
    data: FooData,
}

#[repr(C)]
enum FooKind {
    Zero,
    One,
}

#[repr(C)]
unsafe enum FooData {
    Zero(*const u8),
    One(u32),
}

fn bar() {
    let f: Foo = get_foo();

    unsafe {
        match (f.kind, f.data) {
            (FooKind::Zero, FooData::Zero(ptr)) => ...,
            (FooKind::One, FooData::One(n)) => ...,
        }
    }
}

Since FooData::Zero(ptr) and FooData::One(n) would be unsafe but irrefutable patterns, the only thing the match would actually switch on is the f.kind value.
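For comparison, the same C-style tagged union written with the `union` item and field access of current Rust, rather than the `unsafe enum` patterns discussed here. The field names `zero` and `one` and the `describe` function are hypothetical; the match switches only on `kind`, and each arm reads the field that `kind` says is live.

```rust
#[repr(C)]
enum FooKind {
    Zero,
    One,
}

#[repr(C)]
union FooData {
    zero: *const u8,
    one: u32,
}

#[repr(C)]
struct Foo {
    kind: FooKind,
    data: FooData,
}

/// Dispatches on the explicit tag, then reads the matching union field.
fn describe(f: &Foo) -> String {
    match f.kind {
        FooKind::Zero => {
            // Sound only if `kind` truthfully describes `data`.
            let ptr = unsafe { f.data.zero };
            format!("pointer, null = {}", ptr.is_null())
        }
        FooKind::One => {
            let n = unsafe { f.data.one };
            format!("integer {}", n)
        }
    }
}
```

Each `unsafe` block covers a single field read, so the rest of each arm stays in safe code.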

Also, it’s worth pointing out that unsafe enum could support both unsafe pattern matching and unsafe field access by variant name at the same time, if that was desired.