Idea: `enum` "tag" type

My motivation: I want to test a lexer I wrote, but I only want to test that it spits out the right token kinds, regardless of their byte offsets or slices into the input. So, imagine,

enum Tok<'scx> {
  Int(&'scx str),
  Id(&'scx str),
  // ... more variants
}

It would be great if I could write down a static array of the enum variants I expect and then compare that with a .collect() of the lexer. Unfortunately, the Discriminant of an enum is opaque: it can't be constructed statically from a variant name, only extracted from an existing value.
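
To spell that out: the closest thing today is std::mem::discriminant, and using it means the "expected" side of the comparison has to invent throwaway payloads; roughly:

use std::mem::{discriminant, Discriminant};

// Works, but every expected entry needs a dummy payload just to name
// the variant. (Uses the Tok definition above.)
fn expected() -> Vec<Discriminant<Tok<'static>>> {
  vec![discriminant(&Tok::Int("")), discriminant(&Tok::Id(""))]
}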

Given an enum E, I propose introducing an enum tag!(E) (placeholder syntax), which encompasses all discriminant values of E, constructible either by explicit cast (e as tag!(_)) or by static access: tag!(E)::Variant. Thus, my test looks something like:

#[test]
fn test_lexer() {
  let lex = Lexer::new("let x = 0;");
  // Intentionally verbose; I imagine you would be able to
  // do some syntactically-valid form of `use tag!(Tok<'static>)::*;`
  // or whatever.
  assert_eq!(lex.collect::<Vec<_>>(), 
    &[tag!(Tok<'static>)::Let, 
      tag!(Tok<'static>)::Id,
      tag!(Tok<'static>)::Eq, ..])
}

FAQ:

  • For K not an enum, tag!(K) has a single unnameable variant.
  • tag!(tag!(E)) should maybe be tag!(E)? Not sure how to feel about idempotence.
  • discriminant(E::Foo(..)) == discriminant(tag!(E)::Foo)

Thoughts?

First reactions:

  • Why not just a custom derive on the enum? It could generate a variation of the enum it's applied to with all the variant fields dropped, and the repetitive Enum -> TagEnum conversion could be generated easily as well (a hand-written version is sketched below).
  • You say the macro syntax is a placeholder but I want to pre-emptively emphasize that macros are supposed to expand to tokens, so if an entirely new language capability is to be introduced, exposing it as a macro only makes sense if the thing it expands to can’t be designed/stabilized yet (this is the case for e.g. await!). Magic macros that can’t be expanded (e.g., asm! at the moment) are just entirely new syntax that confusingly overloads the PATH ! ( TT ) syntax.
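
For reference, here is roughly what the hand-written version might look like for the Tok example above (names are illustrative, not taken from any particular crate):

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TokTag {
  Int,
  Id,
  // ... one fieldless variant per Tok variant
}

impl<'scx> From<&Tok<'scx>> for TokTag {
  fn from(tok: &Tok<'scx>) -> Self {
    match tok {
      Tok::Int(_) => TokTag::Int,
      Tok::Id(_) => TokTag::Id,
      // ... and so on for the remaining variants
    }
  }
}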

That's true, but the onus is then on the crate owner to run the derive, even though this is a valid transformation for all types, even non-enums. tag!(T) is intended to be generic in T; you might imagine naming it std::mem::Tag<T> instead.

I like to pretend it's the most inoffensive placeholder syntax... clearly not! =P I'm aware that asm and global_asm aren't really macros. My use tag!(Tok<'static>)::*; is clearly nonsensical, too, since you can't expand macros in import position.

Sure, but that is a recurring problem with many things; what makes this one so special? Or in other words: given that a custom derive often suffices and that there's a less desirable but still workable solution [1] for types not under your control, does this problem really cross the threshold where it's worth adding a new language feature?

[1] writing out the enum the derive would have generated, or even letting a macro generate it (which would limit the repetition to the enum name and the variant names)


There are a couple of things that make it "special":

  • There is one and only one correct implementation of this type.
  • From an assembly point of view, the generated functions are essentially no-ops (just offsetting and narrowing a pointer). Compare this with non-trivial derives like Debug, PartialOrd, or the serde traits.
  • Generating a function fn(Tag<E>) -> Discriminant<E> is not possible in safe Rust, and for something this obvious, generating unsafe code is, in my opinion, gross overkill.

While it would be nice to be able to write std::mem::Tag<T> without any type annotations, I think that if you're writing a custom derive you'll implement a simple marker trait to be able to do things like Tag::to_tag(E::Foo(..)).
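
For concreteness, one shape such a trait could take (names are hypothetical, reusing the TokTag sketch from above):

trait HasTag {
  type Tag: Copy + Eq;
  fn to_tag(&self) -> Self::Tag;
}

// What the derive would emit for Tok:
impl<'scx> HasTag for Tok<'scx> {
  type Tag = TokTag;
  fn to_tag(&self) -> TokTag {
    TokTag::from(self)
  }
}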

An idea to solve your specific problem: I solved this by representing tokens using a struct instead of an enum. The only enum I used was the “bare” token kind, and the rest of the fields pointed into the source text and contained location information, like this:

struct Token<'a> {
    kind: TokenKind,
    text: &'a str,
    location: SrcLoc,
}

struct SrcLoc {
    line: usize,
    column: usize,
}

In this case, you can project the kind field of the returned tokens, then just construct a vector of TokenKinds directly.
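
Assuming TokenKind derives Debug and PartialEq, that comparison could look roughly like this (variant names are made up for illustration):

assert_eq!(
    tokens.map(|t| t.kind).collect::<Vec<_>>(),
    vec![TokenKind::Let, TokenKind::Ident, TokenKind::Eq, TokenKind::Int],
);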

However, I’m still curious why you would only want to test that the token kinds are correct, but not the contents of the lexemes themselves? That would seem a lot more robust to me (and it would also be possible to carry out without worrying about the discriminant, because you would have a list of the expected tokens themselves as the reference.)

Personal taste, I guess. I write a lot of lexing and parsing code at work and I find that working with tests that involve source-in-raw-strings, and having to check line numbers, is rather unpleasant, since formatting the source-code examples means you need to recompute all of the line numbers. It also usually suffices to test that spans are correct for simple token sequences, and then test only token types for complex sequences.

Another reason I want this is that it gives us an opportunity to implement something like Java's EnumMap, which is one of the very few Java features I often miss.
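
To make that concrete, here is a minimal sketch of what such a container could look like, assuming the tag type exposes a dense index and a variant count through some (hypothetical) trait that tag!(E) would implement automatically:

use std::marker::PhantomData;

// Hypothetical trait a tag type would implement: a dense index per
// variant plus the total variant count.
trait EnumIndex: Copy {
  const COUNT: usize;
  fn index(self) -> usize;
}

// Array-backed map keyed by enum variants, in the spirit of Java's EnumMap.
struct EnumMap<K: EnumIndex, V> {
  slots: Vec<Option<V>>,
  _key: PhantomData<K>,
}

impl<K: EnumIndex, V> EnumMap<K, V> {
  fn new() -> Self {
    EnumMap {
      slots: (0..K::COUNT).map(|_| None).collect(),
      _key: PhantomData,
    }
  }

  fn insert(&mut self, key: K, value: V) {
    self.slots[key.index()] = Some(value);
  }

  fn get(&self, key: K) -> Option<&V> {
    self.slots[key.index()].as_ref()
  }
}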

You don't have to test everything in every test though, especially things that depend on formatting.

That was kind of my original motivation… =P

Yes, we’re just looking at it from a different angle. To me it seems like projecting only certain fields of the struct for comparison purposes should work, like:

assert_eq!(
    tokens.map(|t| (t.kind, t.text)).collect::<Vec<_>>(),
    vec![
        (Kind::Number, "1337"),
        (Kind::Ident, "foo"),
    ]
)

Oh, I see what you mean. Fair enough. I proposed enum-tag-types mostly because I wondered if this was something other people wanted.


@mcy, have you looked at the enum-kinds crate? I think it’s pretty close to what you describe.
