Pre-RFC: unsafe enums. Now including a poll!

True. The same example using repr(union):

struct Foo {
    kind: FooKind,
    data: FooData,
}

enum FooKind {
    Ptr,
    Num,
}

#[repr(union)]
struct FooData {
    ptr: *const u8,
    num: u32,
}

fn bar() {
    let f: Foo = get_foo();

    match (f.kind) {
        FooKind::Ptr => ... unsafe { f.data.ptr } ...
        FooKind::Num => ... unsafe { f.data.num } ...
    }
}

That would look a bit cleaner with anonymous types (allowing f.ptr).

For that matter, you can pattern match on a struct, so you should be able to pattern match on a repr(union) struct, which would look like this:

    unsafe {
        match (f.kind, f.data) {
            (FooKind::Ptr, FooData { ptr }) => ... ptr ...
            (FooKind::Num, FooData { num }) => ... num ...
        }
    }

The match against f.data uses an irrefutable pattern, though if it said something like num: 42, that'd be refutable.

So, you can use either pattern matching or field access with repr(union) structs, too.
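For anyone who wants to run this, here is a minimal sketch of both styles. It is written with the `union` item that Rust later stabilized, since the `#[repr(union)] struct` spelling above is only a proposal and does not compile. The function names are made up for illustration, and in today's Rust a union pattern names exactly one field.

```rust
// Assumption: uses the stabilized `union` item, not the proposed
// `#[repr(union)] struct` syntax discussed in this thread.
enum FooKind {
    Ptr,
    Num,
}

union FooData {
    ptr: *const u8,
    num: u32,
}

struct Foo {
    kind: FooKind,
    data: FooData,
}

// Field access: every read of a union field needs `unsafe`.
fn describe_by_field(f: Foo) -> String {
    match f.kind {
        // SAFETY (assumed invariant): `kind` names the field that was last written.
        FooKind::Ptr => format!("ptr {:p}", unsafe { f.data.ptr }),
        FooKind::Num => format!("num {}", unsafe { f.data.num }),
    }
}

// Pattern matching: `FooData { num }` is irrefutable, `FooData { num: 42 }` is refutable.
fn describe_by_pattern(f: Foo) -> String {
    // SAFETY (assumed invariant): same as above.
    unsafe {
        match f {
            Foo { kind: FooKind::Ptr, data: FooData { ptr } } => format!("ptr {:p}", ptr),
            Foo { kind: FooKind::Num, data: FooData { num: 42 } } => "the answer".to_string(),
            Foo { kind: FooKind::Num, data: FooData { num } } => format!("num {}", num),
        }
    }
}
```

Both functions rely on the caller keeping `kind` in sync with the field that was written; nothing in the type enforces that.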

(All that said, outside of an example I don't see any obvious reason to create a DIY tagged union other than matching the layout of one for FFI. If you want a tagged union and you don't care about its exact layout, enum should work.)

User-managed tagged unions like the ones I've shown above can be useful in Rust for implementing custom or better forms of tagging.

These are unions, not enums, and should be named as such. An enum is, by definition, enumerated.

True; I meant, I didn't see a reason if you used full-sized tags. But yes, if you can use bits that overlap with the types, sure.

Per my comment 42 above (https://internals.rust-lang.org/t/pre-rfc-unsafe-enums-now-including-a-poll/2873/42), a #[repr(union)] struct could support pattern matching as well.

Based on the discussion in this thread, I've prepared an RFC for the #[repr(union)] struct syntax: https://github.com/joshtriplett/rfcs/blob/master/text/0000-repr-union.md

I'd appreciate feedback from the thread before submitting a pull request to the RFCs repository.

I like the RFC, but for me the point of unsafe{} in Rust is to wrap unsafety behind interfaces that are as safe as possible. The "repr unions" can't become totally safe, but wrapping the unsafety in a single function that gets reused many times helps avoid bugs across the whole program. Rust does this in many places in its standard library, and I like this design principle. I'd like "repr unions" to follow the same design.

This means that I'd like "repr unions" to have an optional standard method that computes the tag, to be called automatically by match{} and if let/while let. This way, if you have written such a tag method correctly, your code that uses match{}/if let/while let will be correct and safe.

I think this is quite important to implement more efficient data structures in Rust. I've given examples above.

intended primarily for use with FFI,

I don't agree with this. They are also meant for writing more user-specified layouts for regular (but more efficient or more low-level) Rust code, using the tagging functions described above.

This I agree with entirely. I would expect many users of unions to provide safe functions that use the unsafe wrappers to get at the underlying data; I would not expect the majority of users to use the unsafe operations directly at every call site.

In many cases, the union itself cannot implement any such tag function, because the tag exists outside the union. Quite commonly, for instance, you have a struct containing a tag, some common fields, and a union. Only the "tag bits" case or similar could identify the branch of the union based only on the union itself.

Furthermore, a single "tag" may not fully identify a union; accesses may legitimately occur to multiple fields of the union. For instance, you might determine some flags by checking a numeric version of a field, and then follow a pointer after masking its low bits.
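A rough sketch of that last case, assuming a hypothetical one-word `Word` type where low bit 1 means "pointer to a u32" and low bit 0 means "inline integer". It uses the `union` item from current Rust rather than the proposed syntax, and every name here is invented for illustration. Reading the pointer through the integer field is a runtime pointer-to-integer reinterpretation, which I believe is permitted but is exactly the kind of thing worth checking against the undefined-behavior list.

```rust
// Hypothetical tagged word; fields should stay private so that only the two
// constructors below can build one.
union Word {
    bits: usize,
    ptr: *const u32,
}

const TAG_PTR: usize = 1;

// Inline integers are shifted left so bit 0 stays clear.
fn inline(n: usize) -> Word {
    Word { bits: n << 1 }
}

// A `u32` is 4-byte aligned, so bit 0 of its address is free for the tag.
fn pointing_at(target: &'static u32) -> Word {
    let raw = target as *const u32 as *const u8;
    Word { ptr: raw.wrapping_add(TAG_PTR) as *const u32 }
}

fn value(w: &Word) -> usize {
    // First access: the numeric view, used only to inspect the flag bit.
    // SAFETY: both constructors initialize the whole word.
    let tag = unsafe { w.bits } & TAG_PTR;
    if tag == TAG_PTR {
        // Second access, different field: fetch the pointer and clear the tag.
        let tagged = unsafe { w.ptr };
        let p = tagged.cast::<u8>().wrapping_sub(TAG_PTR).cast::<u32>();
        // SAFETY: `pointing_at` only accepts `&'static u32`, so `p` is valid.
        let n = unsafe { *p };
        n as usize
    } else {
        (unsafe { w.bits }) >> 1
    }
}
```

Note that `value` touches two different fields of the same union on the pointer path, which is the reason a single tag accessor would not describe this type well.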

You could certainly have a safe function that used the unsafe direct access interfaces to extract data from the union, or from a containing structure, and then call that function and match on the result. For that matter, you could translate the entire union into a Rust-style safely tagged enum. However, it doesn't seem likely to me that functions doing so would have anything more in common than "safe, implementation contains unsafe blocks, accepts a type that at some level contains a union, returns something not containing a union". That seems far too general to build a standard interface on.
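To illustrate the "translate into a safely tagged enum" option, here is a minimal sketch under assumed names (`RawEvent`, `Payload`, `Event`, and the `KIND_*` constants are all hypothetical). It uses the `union` item from current Rust, with the tag stored next to the union rather than inside it.

```rust
// Hypothetical C-style record: tag outside the union.
#[repr(C)]
struct RawEvent {
    kind: u8,
    payload: Payload,
}

#[repr(C)]
union Payload {
    code: u32,
    ratio: f32,
}

// The safe, Rust-style view of the same data.
#[derive(Debug, PartialEq)]
enum Event {
    Code(u32),
    Ratio(f32),
}

const KIND_CODE: u8 = 0;
const KIND_RATIO: u8 = 1;

impl RawEvent {
    fn code(code: u32) -> Self {
        RawEvent { kind: KIND_CODE, payload: Payload { code } }
    }

    fn ratio(ratio: f32) -> Self {
        RawEvent { kind: KIND_RATIO, payload: Payload { ratio } }
    }

    // Safe wrapper: the only place that touches the union fields.
    fn decode(&self) -> Option<Event> {
        match self.kind {
            // SAFETY: both fields are 4 bytes of plain data, and every bit
            // pattern is a valid `u32` and a valid `f32`.
            KIND_CODE => Some(Event::Code(unsafe { self.payload.code })),
            KIND_RATIO => Some(Event::Ratio(unsafe { self.payload.ratio })),
            // Unknown tags are reported instead of guessed at.
            _ => None,
        }
    }
}
```

Callers then match on the returned `Event` with no `unsafe` at the call site. The shape of `decode` is specific to this record, which is roughly the point made above about such functions having little in common.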

More importantly, I don't want to make the perfect the enemy of the good.

I would propose standardizing unions themselves, seeing if common practices arise in the use of unions for tag handling, and subsequently defining a standard interface for that if those practices have anything significant in common.

I could also imagine a potential compiler optimization for "function that returns an enum, and the caller frequently pattern-matches the result": inline the function enough to dispatch directly from the code that determines the enum (and has the relevant temporaries in registers) to the branches of the match. Such an optimization would help in many cases, including for functions to return safely matchable enums given structures that contain unions.

I have no objection to rephrasing this comment in the RFC, and specifically mentioning the use case you've described as another application.

I've updated the draft RFC to take this feedback into account:

  • Mentioned space-efficient or cache-efficient data structures in the motivation.
  • Noted that code using unions would commonly want to provide safe wrappers around unsafe union field accesses.
  • Dropped mentions of FFI in the rationale for not assigning a keyword; mentioned expected frequency of use instead.

Then I suggest adding a realistic example of such function(s) to the RFC. An RFC shouldn't be too abstract.

Right.

From the RFC:

A pattern match may match multiple fields of a union at once. For rationale, consider a union using the low bits of an aligned pointer as a tag; a pattern match may match the tag using one field and a value identified by that tag using another field.

I suggest adding one or two realistic examples of this. A realistic example is useful to show how things will look in real code, to better judge the RFC, and to help whoever writes the future Rust documentation.

If you want some realistic examples, there are plenty of unions in the Windows API. A quick grep of the Windows 10 SDK reveals 1951 mentions of union, so there's plenty of examples to work with.

My own thoughts are still the same as w.r.t. the unsafe enum proposal: that since the components of a union are identified by their type -- if a #[repr(union)] struct has two "fields" with the same type but different names, they still refer to the same thing -- it fundamentally doesn't make sense to have names. In database language, the type is the primary key, we don't need another primary key.

That still leaves many different ways you could formulate it: using a library-based approach as I outlined here, using nominal types (like this RFC) as I sketched here, or perhaps using structural (built-in anonymous) types, similarly to tuples: you could have a built-in union(T1, T2, ...) type for each arity (again, just like tuples) and allow accessing its components simply by casting between a union and its component types with as: e.g. type IntOrChar = union(i64, char); let my_union = 'x' as IntOrChar; let x = my_union as char;. (Or union<...>, or UnsafeUnion<...>, any color you like. Again this is just one possibility; the point is that it's type-based.)

Maybe somebody wants to temporarily remove #[repr(union)] to check if some bug still happens? (assuming repr(union) is for efficiency, not for FFI).

if a #[repr(union)] struct has two "fields" with the same type but different names, they still refer to the same thing -- it fundamentally doesn't make sense to have names

I can certainly imagine cases where the type provides sufficient information, but in many of the interfaces I've worked with, names for fields make the code more clear. I'd rather identify a field of a union as "bytes_remaining" than as "u64". And if a second field "records_remaining" exists that happens to also have type "u64", I'd rather use the different names associated with different semantic meanings, even if the types happen to coincide.
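A tiny sketch of what that looks like, using the `union` item from current Rust and a hypothetical `Remaining` type. Both names refer to the same eight bytes, so the names carry meaning for the reader rather than for the compiler.

```rust
// Hypothetical union with two same-typed fields that differ only in meaning.
union Remaining {
    bytes_remaining: u64,
    records_remaining: u64,
}

fn bytes_left(r: &Remaining) -> u64 {
    // SAFETY: both fields are `u64`, so any initialized value is valid here.
    unsafe { r.bytes_remaining }
}

fn records_left(r: &Remaining) -> u64 {
    // SAFETY: as above.
    unsafe { r.records_remaining }
}
```

Which accessor is the meaningful one still depends on context outside the union, for example a mode flag in a containing struct.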

That said, just as both structs and tuple structs exist, perhaps it might make sense to also support a kind of union with unnamed fields accessed by type. But I don't think that should be the only form of union available.

Names are actually only one part of it, and I do agree that they can add useful information. I think it's really re-using structs or enums to serve as unions which bothers me.

I know there's a strong motivation to piggyback on existing language features for the sake of minimalism, and to imitate C, but C-style unions are not fundamentally similar to either structs or enums. An attribute on a type should not radically change the way a type behaves, but #[repr(union)] would. The repr attribute is for changing the way a type is represented in memory, not for changing its behavior. If different semantics is the whole point (which it is), then I think it would be preferable to use in-language syntax for it, instead of abusing attributes (which, apart from macro-attributes like derive, are supposed to be "metadata-like").

Field access is strongly connected on an intuitive level to the guarantee that the fields of a struct are disjoint: if I refer to mystruct.foo and mystruct.bar, I expect that they will access different memory, and overwriting one will not affect the other, etc. The borrow checker even allows taking simultaneous &mut borrows to separate fields of the same struct. Reusing the same syntax for a C-style union would violate the principle of least surprise: when I see mystruct.foo I would have to look up the type definition to know whether the access has struct-like or union-like behavior.
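To make the contrast concrete, here is a short sketch with hypothetical `Pair` and `Overlap` types, using the `union` item from current Rust. The struct half shows the disjoint-borrow behavior described above; the union half shows the same `.foo`/`.bar` syntax touching one shared location.

```rust
struct Pair {
    foo: u32,
    bar: u32,
}

union Overlap {
    foo: u32,
    bar: u32,
}

// Struct fields are disjoint, so two simultaneous `&mut` borrows are accepted.
fn bump_both(p: &mut Pair) {
    let a = &mut p.foo;
    let b = &mut p.bar;
    *a += 1;
    *b += 10;
}

// Union fields overlap: a write through `foo` is visible through `bar`.
// (Taking `&mut u.foo` and `&mut u.bar` at the same time is rejected by
// current Rust's borrow checker.)
fn write_foo_read_bar(u: &mut Overlap) -> u32 {
    u.foo = 7; // writing a `Copy` field does not need `unsafe`
    // SAFETY: both fields are `u32`, so the value just written is valid here.
    unsafe { u.bar }
}
```

At the use site the two functions look nearly identical, which is the least-surprise concern in a nutshell.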

I could write a similar paragraph for the enum approach, e.g. the expectation is that a match will match exactly one variant, i.e. that variants are disjoint.

A different syntax, such as as, would be more appropriate, making it clear that the same thing is being accessed in different ways, as opposed to different things being accessed.

(This is only my own, individual opinion - don't change the RFC just to appease me unless you actually agree or there's a consensus for it.)

FWIW, this isn't about "minimalism", but about avoiding a new keyword that could break existing code. I'd prefer to use union as a keyword, rather than #[repr(union)] struct, but that would break existing code. And picking a new, sufficiently obscure keyword to avoid conflicts seems worse.

If I carefully avoid thinking about C, I can somewhat see what you mean about the field access syntax. I could see how "as" syntax would make it more obvious that it's the same data interpreted different ways, rather than distinct fields.

However, leaving aside the use of types rather than names (potentially fixable), "as" does not work well in larger expressions because it requires spaces and typically parentheses. For instance, st.un.field.x seems preferable to something like (st.un as field).x (even though I can see how the latter would make the union more explicit). If we wanted to more distinctly identify union field accesses as separate from structs, I'd suggest using a non-keyword infix operator similar to the dot operator, such as .! or .> or .~. I personally would prefer not to do so, not least because I don't see an obvious intuitive operator for it, but I can see an argument for doing so.

I'll add a line in the "alternatives" section of the RFC mentioning the idea of a different field access operator.

Apart from that, your mention about borrowing fields of a struct reminds me that I should explicitly discuss borrows of union fields in the RFC.

I've updated the RFC to cover borrows and to add a new alternative based on your comments.

Based on feedback from the #rust IRC channel (thanks ubsan), I've updated the RFC to drop the mention of using a union to perform arbitrary std::mem::transmute operations, and to instead reference Rust's list of undefined behavior. In particular, this change allows the Rust compiler to continue making various aliasing assumptions about pointers.
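As a hedged illustration of where that line falls, here is a sketch using the `union` item from current Rust (not the syntax proposed in this thread); `FloatBits` and `bits_of` are hypothetical names. Reading the `u32` view of an `f32` is fine because every bit pattern is a valid `u32`. The same trick with a target type that has invalid bit patterns, such as `bool`, would land on the undefined-behavior list.

```rust
// Hypothetical two-view union over the same four bytes.
union FloatBits {
    float: f32,
    bits: u32,
}

fn bits_of(x: f32) -> u32 {
    let view = FloatBits { float: x };
    // SAFETY: all four bytes were initialized by the `float` write, and any
    // bit pattern is a valid `u32`.
    unsafe { view.bits }
}
```

For this particular conversion the standard library's `f32::to_bits` does the same job without `unsafe`.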

Based on further feedback from the #rust IRC channel, I've posted an alternate RFC that replaces #[repr(union)] struct with a new keyword untagged_union. By introducing a new construct that doesn't include struct or enum, this avoids any confusion with either. This still can't use the keyword union due to many conflicts in existing code, but untagged_union seems unlikely to break existing code. The syntax and semantics remain otherwise identical to the #[repr(union)] proposal.

Also, I've started making changes to the proposal as distinct commits, rather than squashing them; I'll wait to squash them until I submit this as a PR on the rfcs repository.

You could use C#'s trick of compound keywords. I don't believe union struct is valid anywhere in the grammar at present.

A large part of the motivation for this change came from a vehement complaint that this is not a struct. :) So, using struct as part of the compound keyword would defeat the purpose.

However, if a compound keyword approach could potentially allow the compiler to accept something like unsafe union { ... } anywhere struct { ... } can currently appear, without breaking code that uses union as an identifier elsewhere, that does seem preferable.
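For what it's worth, current Rust ended up somewhere close to this: `union` is a contextual keyword, so it introduces a union item while remaining usable as an ordinary identifier. The sketch below shows both uses side by side; `IntOrFloat` and `demo` are hypothetical names, and this is the behavior of today's compiler, not something either draft RFC in this thread specifies.

```rust
// `union` starts an item here...
union IntOrFloat {
    i: u32,
    f: f32,
}

// ...and is a plain function name here.
fn union(a: u32, b: u32) -> u32 {
    a | b
}

fn demo() -> u32 {
    let value = IntOrFloat { i: 0b0101 };
    // SAFETY: `i` is the field that was just written.
    let stored = unsafe { value.i };
    // `union` as a local variable name, initialized by calling `fn union`.
    let union = union(stored, 0b0010);
    union
}
```

Existing code that uses `union` as a method or variable name keeps compiling under this scheme.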

I added that as an alternative, in the hopes that a Rust parser/grammar expert could comment on feasibility.