Ideas around anonymous enum types

It's probably worth pointing out that in the old threads about enum impl Trait proposals, one of the biggest arguments in their favor was that it solves the "every new error case updates a zillion function signatures" pain point without introducing any of the type system or mental model problems we're discussing here, because the actual type is compiler-generated with no construction or matching syntax and only ever used via autogenerated trait implementations. This thread's probably dug into those problems in more depth than any of the old ones, but it looks like the conclusion is roughly the same (or at least, I'm personally finding myself agreeing once again with @CAD97's reasoning that the costs outweigh the benefits, which--full disclosure--is what I always felt/suspected).

I believe there was a claim earlier that autogenerated trait impls are fundamentally unsound, but IIRC the only example cited was a hypothetical Pod trait which seems like an awfully special case (if anything that seems like an argument against adding a Pod trait, not against enum impl Trait). Intended use cases like enum impl Debug or enum impl Error shouldn't cause any soundness problems, right?

4 Likes

If you squint, it looks like the requirements to make automatic impls for sum types sound would be similar to object safety. However, there's an unfortunate interaction in that the object safety rules were drafted by using sizedness as a proxy for whether a type was potentially an object type. Basically, if you want your trait to opt out of an auto-generated implementation for dynamic types, you make it inherit from Sized. If you want to exclude a particular method from consideration when determining if a trait is object safe, you constrain it with Self: Sized.

Having what are essentially Sized dyn types kind of messes this up.

So some good discussion has occurred since my last post. I proposed an idea late at night (which I really should have slept on) and I think that caused some obfuscation between sum types and union types.

I am personally in favor of a minimal sum type implementation, similar to the one in post 36. Where you match on top level types with index matching, but type matching for top level types is syntactic sugar.

let _: (A | (B | C)) = ...;
match _ {               // match _ {
   a : A => ...,        //    0(a) => ...,
   bc : (B | C) => ..., //    1(bc) => ...,
}                       // }

However, many people in the union type camp really don't like the ergonomics of this approach, as they would like the ability to match on the discrete types. Not only that, if duplicate types occur within a anon-enum, they would like to be able to match on all duplicates as well. The following is not valid sum type, but would be valid union type.

// Note the duplicates and the nesting
let _: (A | A | (B | C)) = ...;
match _ {             
   a : A => ...,     
   b : B => ...,
   c : C => ..., 
}                      

Sum types could not use type matching in the above case, you would have to resort to using index matching or coercion.

Coercion can achieve a similar effect, because sum types would support coercion into a more normalized form, meaning you could write the above using sum-types like the following

let not_normal: (A | A | (B | C)) = ...;
let normal: (A | B | C) = not_normal; // coerce into a normalized form
match normal {   // match normal {           
   a : A => ..., //    0(a) => ...,
   b : B => ..., //    1(b) => ...,
   c : C => ..., //    2(c) => ...,
}                // }

So as an appeal to those in the union type camp, I had the idea that maybe a type match could be a coercion site into a sum type described by branch arms. Essentially making the above coercion implicit. This was problematic as I discovered right after posting that this would force a move and matching on references wouldn't be allowed.


So coercion is off the table, but the logic built for coercion could still be useful in this situation. Sum type coercion is unidirectional, you can only make a sum type more normal. What this means is you create a many to one mapping from the non-normal rhs to a more normal lhs.

// LHS  = RHS
(A | B) = (A | A | (B | A))
// Creatins this mapping
(A | B) = (0, 0, (1, 0))

Now if we take our many to one mapping, and flip it to a one to many mapping, where each mapping is a multiple branch arm, we could potentially create the following match.

let _: (A | A | (B | A)) = ...,
// faux coerce _ into (A | B) to get the mapping
match _ {       // match _ {
   a: A => ..., //    0(a) | 1(a) | 2(1(a)) => ...,
   b: B => ..., //    2(0(b)) => ...,
}               // }

Before we said that type matching only works for sum types when unambiguous and the top level types are matched exactly. But now we could say that: essentially type matching works whenever the sum type is coercable into the match arms.

Now this still would still be based on the locally defined types.

let _ : (u8 | u8 | T) = ...,
match _ {       // match _ {
   u: u8 => ...,//     0(u) | 1(u) => ...,
   t: T => ..., //     2(t) => ...,
}               // }

I don't think this would break sum types, and this feature seems to be important to those who argue for union types. If this is valid, and I am not missing something, this could be an addition as an appeal to those in the union type camp. While still being consistent with sum types as type matching is sugar anyways.


Sorry I blurt out a million snippets, I'm an example oriented person.

2 Likes

But if match coerced to union types as in the above post, what is the advantage of using sum types rather than union types as the underlying structure? This combined approach seems to increase teachability/learnability issues rather than decrease them, even though the coerced union-type match has (for me) significant ergonomic advantages.

Hygiene. Cases don't get unified based on non-local information, which is what causes issues with union types.

2 Likes

Unlike union types, sum types would still use the local types. Meaning that the generic case is consistent

let _ : ( u8 | T) = ...,
match _ {    
   u: u8 => ...,
   t: T => ..., 
}              

union: T will match on T and u8 will match on u8 unless T is u8, then T will match on u8

sum T will match on T and u8 will match on u8.


Sum types also help resolve the lifetime case

let _ : ( &'a u8 | &'b u8) = ..., 

union: A union, wouldn't safely store this type because a &'a u8 is the same type as &'b u8. It views them both as &u8.

sum: A sum type doesn't view these as the same type, and so it can safely store both as it keeps them seperate. A sum type could use index matching to differentiate the two.


And could be sum types more consistent in the alias case consider the following two examples

type Error1 = anyhow::Error<String>;
type Error2 = anyhow::Error<String>;
fn fails() -> Error1 | Error2;
match fails() {
   err1: Error1 => ...,
   err2: Error2 => ...,
}

union: The user sees Error1 and Error2 as different types, but the compiler sees them as the same so this will always match on the Error1 case, and never on Error2

sum: Error1 matches on Error1 and Error2 matches on Error2. (Depending on implementation of sum types this could require an index match)


Now consider this case where a function returns 2 errors that both happen to be anon-enums

type Error1 = (A | B);
type Error2 = (B | C);
fn fails() -> Error1 | Error2;
match fails() {
   err1: Error1 => ...,
   err2: Error2 => ...,
}

union: If the code threw an Error1 it will match on Error1, if the code threw an Error2 it will also match on Error1 if the Error2's underlying type was B, Error2 will only match the Error2 branch when the type is C.

sum: Error1 matches Error1 branch and Error2 matches the Error2 branch.


Unions have a lot of these edge cases because they match on concrete types, since sum types maintain the locality of their variants and match on local types, it has, in my eyes, more consistent behavior.

2 Likes

It's not hypothetical, it is part of the bytemuck crate. And this is just the only example that I am aware of; there might be more.

Interestingly (and somewhat obviously), Pod inherits from Sized. Which means it opts out of being treated as object safe.

TL;DR: I believe that the root problem here is that the syntax chosen for "Anonymous Enums" looks too much like "Union/Junction Types" in most languages (that support such a thing). This makes the notion of Rust having both "Anonymous Enums" and "Union/Junction Types" prohibitive. If the syntax for "Anonymous Enums" were enum( A, B ) instead of ( A | B ), then it would be much clearer that it wasn't a "Union/Junction Type" and was an "Anonymous Enum". This, I believe, would make this proposal much less confusing and much more palatable.

I originally intended a longer comment, but I delayed responding and waited for more comments to develop and for me to read and understand them as best I could. After reading and digesting this proposal as well as the comments, I'm stuck on the fact that there are two similar, but different ideas here that both can't be in the Rust language simultaneously because it would result in something that no user of Rust (at least new users) could ever understand clearly and use effectively. Those two different proposals are:

  • Anonymous Enums: which are "Sum Types", just like Enums (tagged unions), but the variants aren't named and instead are known only by index. They are not "normalized" by type nor are the "collapsed" (just like Enums) semantically (though the underlying implementation may "collapse" in some way for efficiency, this cannot, and must not affect the semantics).
  • Union/Junction Types: which are a type that is the union/junction (in the set sense) of a set of types. This would normalize so that each type can only be in the set once and is collapsed such that sub-union/junction types are elevated to the root level and normalized.

I don't understand the arguments for why the first is the right way to proceed. It seems like the arguments for it boil down to:

  • The latter bullet point (actual Union/Junction types) has more corner cases that need to be worked out and requires greater burden on the compiler
  • It is better to have something that is more easily implemented/designed than wait on the design of something "more complicated"

I find these arguments to be entirely unsatisfactory. It seems that if Rust were to adopt the first bullet point, then it would be unable to ever have the second because having both options in the language would be a complete mess from an understanding and usability perspective. Also, the first proposal is unlike any notion of "Junction Types" in any other language (that I've used or know of), but still looks a lot like them in a way that just makes them completely difficult to understand in a meaningful way.

It seems like the second idea is much more in line with similar ideas in other languages and that the ability for the user of the language to understand them and understand how to use them is much more apparent. The fact that there are more complications to work out from a language/compiler perspective, should not be a reason to write them off UNLESS it is clearly and unambiguously shown that the cannot ever be made SOUND (not permit UB in safe code).

What, if any, evidence exists that true "Union/Junction Types" cannot ever be SOUND in Rust. If such does not exist, then that should be pursued and the idea of "Anonymous Enums" should be abandoned. If it does exist, then that should be 100% clear in an RFC for "Anonymous Enums" and the, and only then, can it be taken as a serious proposal to the language.

Most of the above is my opinion only, and obviously is not unequivocal.

EDIT #1: I think my objection to having "Anonymous Enums" in the language would be resolved if the syntax for them didn't look like the syntax in most languages for "Union/Junction Types". If instead of, ( A | B ) the syntax were instead enum( A, B ) I believe that the language could then support both "Anonymous Enums" AND true "Union/Junction Types" (which would use the ( A | B ) syntax without undue burden on the users of Rust.

4 Likes

I meant to address this earlier.

It would be potentially unsound to forward arbitrary unsafe traits, as they can have arbitrary requirements that the compiler has no idea about.

It cannot be unsound to forward arbitrary safe traits, as in the absolutely worst case scenario you have an incorrectly implemented safe trait, which is not allowed to have any implications on safety.

I'd have to audit use of safe unnamable supertraits to create faux sealed traits to see if it's possible to write something which could break if the trait is implemented for anonymous sum (or anonymous union, or enum impl Trait) types, but I believe it wouldn't. (If it does, I would argue that either the regular trait or the unnamable supertrait should've been unsafe in the first place, but we're a lot more lax about requiring unsafe within the privacy barrier, thus this being a potential hole in that privacy safety barrier.)

2 Likes

Ah, whoops. I was thinking of proposals to add a Pod trait into the core language where compiler magic would be used to auto-impl it for most types, not traits provided by external crates. That's what I meant by "hypothetical".

I think @CAD97's point about safe vs unsafe traits cleans up what I actually intended to say here: AFAIK enum impl AnySafeTrait should be harmless.

1 Like

For me the reason to reject union types in favor of sum types boils down to the hygeine. It's not a soundness hole in and of itself (I think), but it'd be a huge footgun.

Just consider the code of the rough shape

fn some_context<T>(make_t: impl Fn() -> T) -> (KnownType | T) {
    ...
    let thing: ( KnownType | T ) = if safety_precondition() {
        make_known_type_thats_known_safe()
    } else {
        make_t()
    };
    ...
    match &thing {
        ref thing: KnownType => unsafe { requires_safety_precondition(thing) },
        ref thing: T => fallback_implementation(thing),
    }
    ...
    thing
}

With union type semantics, the caller could provide a KnownType that violates the precondition of requires_safety_precondition. With sum type semantics, the user's would go to the fallback branch. You could "fix" this by switching the order of the branch, but now when the user provides a KnownType it alters the behavior for the case that would've used the unsafe codepath previously.

I know that this bug would slip past me in code review, at least until there's a permanent note on the code review checklist to check for this (the way there is for unwind safety). The bug would be even harder to spot if requires_safety_precondition wasn't marked unsafe but just relied on privacy to prevent unsound use.

Union types would also be the first feature that must be "templated" and would not fit the polymorhization view of generics. Such a step should be taken very cautiously, with all of the caveats and care given to specialization, as it carries much the same power for breaking referential transparency in unexpected and potentially dangerous ways.

3 Likes

And on the user, when they need to understand the code and what it is doing. The extra cases in question arise mostly in complex, potentially error-prone use cases for anonymous sum or union types, when you would want straightforward, easily understood behavior the most.

Personally, I think that an anonymous sum type, which behaves similar to enums, would be easier to teach than an anonymous union type, with its whole bag of new semantics. And Rust isn't in the habit of just copypasting features from other programming languages. Sure, it takes inspiration, but it also adapts them so that the resulting language still remains a solid whole rather than a collection of disparate parts.

1 Like

It's similar to C++ std::variant. Although I don't know if that is considered an "Junction Type" the very similar definitions of all of these confuse me.

I can't actually speak for the implementation of this in other languages. But I would imagine most of them are dynamic, garbage collected, lack lifetimes and use heap allocated objects almost exclusively. I imagine their implementation for union types would essentially be &Any and a match would just be a runtime downcast. Converting from a union with 5 types into a union with 100 is a no-op because it's just a reference to a heap allocated object. They probably 'chose' not to differentiate between order because the obvious implementation for those languages couldn't.

But that's not what we are working with in rust. I agree that deviating from other languages is a barrier, and should negatively weigh on solutions, although I disagree that deviation alone warrants dismal.

This is fair, and it has been brought up before. I am indifferent to the syntax. But have been writing the (A | B) syntax to describe both "Anonymous Enums" and "Union/Junction" because I do think it is useful to compare the behavior of the two when looking at an example. I will try to come up with a better syntax for future snippets. maybe enum(A, B) union(A, B) or enum|union(A, B) when I want to highlight one, or compare the two.

This could also be a solution, Anonymous Enums could be added first with a enum(A, B) syntax, as it seems to have the least barriers for implementation, and junctions types could later be added with (A | B) syntax when problems that concept faces are resolved.

But they both aim to solve a similar use case in Error handling. Both of them would handle that case nearly identically, especially if post 83 could be implemented for anonymous enums. Would the drive to get junctions types exist if the largest use case is nearly identically handled by the anonymous enum implementation.

If post 83 is possible, than junction types would realistically only give us

  1. The matching edge cases and ambiguities when using generics and type aliases
  2. The ability to make this assignment let _ : (A | A | B) = a as A;
  3. When you impl a trait on A | (B | C) its also implemented on C | B | A

Would there even be a drive to develop such a vastly more complicated feature to add those 3 use cases? The first use case I find undesirable, the second is one I doubt will come up much as why would you hand write a top level non-normalized union, and the third case could be a macro.

If we implement the 'enum' approach, as i think we should, it doesn't technically prevent the junction type, but it would probably kill it.

3 Likes

Just a little thought: if these are sum types, then you would think they would follow addition semantics, i.e. associative and commutative. So

(A | (B | A))

should be interchangeable with

(A | A | B)

etc...

You don't really care about order and grouping between types, you only care about how many times each type appears and which appearance you're talking about. So, if, rather than numbering each case, you actually numbered each type:

match x as (A | (B | A)) {
    _: A.0 => 0,
    _: B   => 1,
    _: A.1 => 2,
}

that might make it feel a little less like type ascription syntax is second class.

Those posts are definitively enlightening, and goes far beyond my current level of understanding of the issue. It's very interesting.

I was initially in favor of union types, mostly because I think the syntax would be cleaner. To get out of the corner cases, I would have even suggested to simply error for (T | u8) if the concrete type of T is u8.

But the more I read the various posts, the more I like the regularity of anonymous enums. I think it makes it easier to teach since there are less corner cases. However, the syntax must also be easy to understand, and I think it is one of the advantage of union types.

What would be the possible(s) syntax for anonymous enums for:

  • coercion (ie. transforming (A | A | (B | A)) into A | B). Should it be always explicit, or can it be made implicit in specific/all places (like in match)
  • creating a variant from a concrete type (like creating a (A | B) from a A). The prefix become as been suggested a lot. Should a postfix keyword (like as enum) be preferred? How does it interact with repeated types? Especially when lifetimes are involved.
    • A -> (A | B)
    • A -> (A | A)
    • A -> (&'a1 A | &'a1 A)
    • A -> (&'a1 A | &'a2 A)
    • A -> (A | (A | B))
    • A -> (&'a1 A | (&'a2 A | B)
    • A -> (B | (A | A))
    • A -> (B | (&'a1 A | &'a2 A)
    • A -> (A | (A | A)
    • A -> (&'a1 A | (&'a2 A | &'a3 A)
    • A -> (&'a1 A | (&'a1 A | &'a2 A)

The goal is of course not to bikeshed the perfect syntax for anonymous enums, but just to be sure that sure syntax exists and can be easily taught. I personally really don't like the index syntax, and I hope something better can be found (just like post 83 found a better syntax for matching).

Just to be clear, this is a post-monomorphization error, part of why C++ templates are hard to use, and generally considered a big no-no for Rust language design.

I'd prefer it to be explicit, but with multiple potential inference points.

as (A|B) is obviously a coercion point, as is assignment to a place of type (A|B) (such as a function argument or a previously typed local).

match is more complicated, but I could see a scheme of creating the set of types expected via the specified types of the match arms, and trying to coerce to a sum of those types.

I get some people like become for this purpose, but if it's done via coercion, using as and/or generalized type ascription seems the consistent way, as it would be possible to do that way anyway. (I'm also still in favor of keeping become available for explicit/guaranteed tail call optimization, whatever shape that would take.)

Note that you wouldn't be able to coerce a: A to &A anyway; you need to add a borrow first.

I think even just not supporting creating "ambiguous" anonymous sum types that have a "more canonical" form outside of generic code (which would construct them as the canonical form with the type variable as a root-level variant) is probably fine, even.

While I strongly feel that we should create and keep them in the noncanonical form when working on the generic version (as from the black-box generic pov, it is in canonical form), I also think that most cases once the concrete types are known people will want to quickly collapse it to the canonical form anyway.

The index-based syntax is only really a stopgap such that working with the noncanonical form is possible, and to serve as a simpler base syntax with obvious semantics (even in edge cases) for the type-based syntax to be specified in.

If the concern is on teachability, the index-based syntax is the exact analogue of tuple's index-based member names, just for enum rather than struct. I think that's as obvious of semantics as is possible.

Well, product types are’t commutative either. Taking a more fitting analogy from mathematics: a disjoint union in this sense is also neither commutative nor associative. (This page lists “disjoint union” as a synonym to “sum type” in the context of computer science.)

Well, actually the naming of things doesn’t matter too much anyways.

The problem that I see with commutation like this is that it’s hard to keep instances of the same type in the correct order. You wouldn’t want to inadvertently swap your first A case with the second A case all the time, would you? Perhaps you really don’t care about order often enough.. but then working with an actual type like (A | A) is pretty confusing.

Maybe we can somehow keep the order of multiple repetitions of the same type consistent; but I can’t really think of any good way in light of generics. In one moment you see some (A | B) and think that their order doesn’t matter, in the next moment you learn that A and B are both actually u8.

Also, what about something like this?

fn which_one_is<A, B>(x: (A | B)) -> &'static str {
    match x {
        a: A => "first variant",
        b: B => "second variant",
    }
}
let x: (u8, u16) = 1u8; // I don’t care about order
let y: (u16, u8) = 1u8; // I still don’t care about order
let r = which_one_is(x); // But suddently I observe an order?
let s = which_one_is(x); // Probably the same as r, since x and y have the same type and value

On a positive note, I don’t have any proof that a commutative and associative kind of “union type” without (A | A) == A is worse than the “proper union type”, I have no idea how they could work out nicely for either of them. Then either version could be the first one for which someone figures out a not-too-terrible way how they could actually function in detail.

Couldn't the same thing be said with struct fields order and a macro that print the order of the fields?

The only reason we may care about order is to be able to named them using indexing. If another non-ambiguous way to name them not based on the order of the variant was found, I really think that variant could become order-independent.

To my mind, that would be OK if the "Anonymous Enums" didn't use syntax that looked like "Junction Types" even if Rust will never have "Junction Types" as well as "Anonymous Enums".

EDIT #1: I think I'm getting a better feel for why "Anonymous Enums" might be a better fit for Rust over "Union/Junction Types". It seems like the syntax is the thing that was throwing me (and perhaps others) off. I strongly feel that if the proposal were changed to use syntax like enum( A, B, C ), better yet enum{ A, B, C }, instead of ( A | B | C ) it would be much more understandable and relatable. This would make it clear that it isn't a "Union/Junction Type" and instead has semantics similar to Rust enums. It seems like the only difference is that "Anonymous Enums", unlike regulars "Enums" have the variants referenced by index (like Tuples) instead of variant name/tag, like regular Enums. On top of this, there may be some "syntactic sugar" to allow matching/destructuring based on the types that de-sugars to matching/destructuring based on the type indexes.

2 Likes