Idea: escaping macro separators


#1

This idea came up a bit in the discussion in https://github.com/rust-lang/rust/issues/51934, but it has wider applications.

Idea: allow escaping tokens in macros so that they can be used as separators. For example, supposing that # is the escape character (please bikeshed), the following would use + as a separator, which is currently not accepted:

macro_rules! foo {
    ($(a)#++) => {}
}

This macro would accept:

foo!(a+a+a+a);

Of course, to use # as a separator, one would use ##.

I believe that this would allow using any token as a separator, including some useful tokens that are currently forbidden (e.g. +, *).
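For comparison, here is a minimal sketch of what it takes today to accept the same a+a+a+a input without any escaping. Because the + after $(...) is read as the "one or more" operator, the list has to be peeled apart one item at a time by recursion. The names count_plus_separated and count_example are made up for illustration.

```rust
// Hypothetical helper macro: counts the identifiers in a `+`-separated list.
// `$(...)+` would read `+` as "one or more", so the list is peeled by recursion.
macro_rules! count_plus_separated {
    // Base case: a single identifier.
    ($last:ident) => { 1usize };
    // Recursive case: one identifier, a literal `+`, then the remaining tokens.
    ($head:ident + $($rest:tt)+) => {
        1usize + count_plus_separated!($($rest)+)
    };
}

// Illustrative wrapper around the input from the proposal.
fn count_example() -> usize {
    count_plus_separated!(a + a + a + a)
}
```

This works, but every macro that wants a + separator needs its own recursive rule like this, which is the boilerplate an escaped separator would remove.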

Any thoughts?


#2

Any particular reason to choose # instead of something like \ (which is fairly typical as an escape-the-next-character indicator)?


#3

I like this idea. It’s basically what I had in mind. I concur with @gbutler though: using the conventional \ as an escape char sounds most sensible.


#4

I think \ in particular would be ambiguous or at best confusing. For example, what would $(a)\t+ mean? Separate by ‘t’ or by tab?


#5

Since this isn’t a string literal, I would personally expect \t to just produce t and probably a warning about an unnecessary backslash telling me to change it to either t or \\t.

But that aside, doesn’t the ambiguity objection apply to any potential escape metacharacter? Does \ have some pre-existing meaning in Rust macros that makes it worse than #?


#6

\ has no meaning, but # does (in particular, $($x:ident)#* is currently valid and matches a#b), so \ seems like a better choice. (Inasmuch as there’s any good choice here; every option looks ugly to me, but it’s a rare use case anyway.) However, we’d have to modify the definition of a token to be able to use \.
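A minimal sketch showing that # is already accepted as a separator. The macro name hash_separated and the helper function are made up for illustration. The call is written with spaces (a # b) on the assumption that the code may be compiled under the 2021 edition, which, as far as I know, reserves a#b-style prefixes in the lexer.

```rust
// Hypothetical macro for illustration: `#` is accepted as a separator today.
macro_rules! hash_separated {
    ($($x:ident)#*) => {
        // Collect the names of the matched identifiers into an array.
        [$(stringify!($x)),*]
    };
}

// Illustrative wrapper; spaces around `#` avoid the reserved-prefix lexing
// that (by assumption) applies to `a#b` in the 2021 edition.
fn hash_example() -> [&'static str; 2] {
    hash_separated!(a # b)
}
```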


#7

hmm… fair enough. Will using \ anger the lexer?


#8

Terrible thought: what if separators didn’t have to be a single terminal?

// Matches `he said that you said that I talk too much`
($($x:ident)(said that)* talk too much) => { ... }

And look! We get escaping for (cough) “free!”

// Matches `Add + Sub + Mul + Div`
($($x:ident)(+)*) => { ... }

(actually I really quite like how it looks. And it’s probably not an obscene implementation challenge as long as it is limited to fixed token sequences… but a design like this just begs to support matchers appearing in the separator)
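For what it’s worth, a fixed multi-token separator can be emulated today with a recursive macro, which gives a feel for what the shorthand would replace. This is only a sketch: the macro name speakers, the @acc accumulator rule, and the helper function are all made up for illustration.

```rust
// Hypothetical macro: collects the speakers from a `said that`-separated list
// terminated by `talk too much`, emulating a multi-token separator by recursion.
macro_rules! speakers {
    // Last item: an identifier followed by the fixed tail.
    (@acc [$($seen:ident)*] $x:ident talk too much) => {
        [$(stringify!($seen),)* stringify!($x)]
    };
    // One item followed by the two-token "separator"; recurse on the rest.
    (@acc [$($seen:ident)*] $x:ident said that $($rest:tt)+) => {
        speakers!(@acc [$($seen)* $x] $($rest)+)
    };
    // Entry point: start with an empty accumulator.
    ($($all:tt)+) => {
        speakers!(@acc [] $($all)+)
    };
}

// Illustrative wrapper around the sentence from the example above.
fn speakers_example() -> [&'static str; 3] {
    speakers!(he said that you said that I talk too much)
}
```

Note that malformed input falls through to the entry rule again and only stops at the recursion limit, so a real version would want an explicit error rule.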


#9

Pending someone poking holes in it, it’s a pretty deep insight.


#10

This sounds cool… but I think the lexer will want your blood. Though, maybe it would be enough for $(...)(x y)* to parse for two lexed terminals? Here’s my weak attempt at something pathological:

macro_rules! foo {
    ($($x:ident)(+-)+ $k:ident +) => {}
}
foo!(a +- b +);
//        ^ ^ once we get *here*, we EOF and see no - sign, then realize
//        |   that `$k` captures `b`, but we have no way to know
//        +-- that over *here*. I can imagine a worse scenario than this
//            that induces really nasty lookahead

Also, general question for the thread: what’s the tracking issue for Macros (by example) 2.0? I’m curious what the UX is like right now. As nice as this idea is, it feels like a bandaid for making macro_rules! less painful, and I think this kind of thing should be built into the design of macro, so that parsing isn’t nearly this exciting.


#11

The way that is written, it matches (unambiguously) something like

foo!(a +- b +- c d +)

To make it ambiguous one would need a pattern like (idunno)

macro_rules! foo {
    ($($x:ident)(+-)+ +- $k:ident +) => {}
}
foo!(a +- b +);

on which I’d expect it to throw the same error that it already throws for similar inputs that are achievable today:

   Compiling playground v0.0.1 (file:///playground)
error: local ambiguity: multiple parsing options: built-in NTs ident ('k') or ident ('x').
 --> src/main.rs:4:10
  |
4 | foo!(a , b  );
  |          ^

#12

Doing something like that in the lexer would be a major language change because you would have to teach the lexer about all sorts of new tokens. Doing it in the parser is conceivable but sounds rather hard because you would need to support arbitrary amounts of lookahead (there might be a more efficient way, but I don’t know it)…

In general, I think starting with single tokens is probably enough for now.


#13

What I wrote down is unambiguous, sure, but my point is that I think you either have to make the lexer hate you by re-lexing the contents of a macro call after the definition is parsed, or you need unbounded lookahead, neither of which is an exciting prospect.


#14

I’ve thought that the whole grammar here needs to change with macro macros. We should have a grammar in which these ambiguities just don’t occur. The parens around the separator seem good; they could be mandatory in macro, so that a repetition with no separator would be written $($x:ident)()* instead of $($x:ident)*, avoiding any potential ambiguity.


#15

Hmm, let’s see how they look nested in close quarters:

// nesting on the left
previously: $($($(a)* b)* c)*

       now: $($($(a)()* b)()* c)()*

// nesting on the right
previously: $(c $(b $(a)*)*)*

       now: $(c $(b $(a)()*)()*)()* // (ouch)
            $(c $[b ${a}{}*][]*)()* // if we could customize the delimiters...
                                    // (still ouch?)

and a “block-style” repetition, for people who like to format it that way:

($($Add:ident for $Type;)()*) => {
    $(
        impl $Add for $Type {
            ...
        }
    )()*
};

// or the "bunched-together egyptian style" sometimes used
($($Add:ident for $Type;)()*) => {$(
    impl $Add for $Type {
        ...
    }
)()*};

A piece of a terrible incremental muncher:

    (
        // Munch the options one at a time into the $opt list.
        ($b:expr) [$kind:tt ($($opt:tt)()*)]
        #opts# [$($opt_tok:tt)()+] $($rest:tt)()* // find a [] tt
    )
    => {arg_impl!{
        ($b) [$kind ($($opt)()* [$($opt_tok)()+] )] // append to end
        #opts# $($rest)()* // check for another
    }};

It doesn’t seem too bad except for the “nesting on the right” example. Though if it occurred, I think I might like to see a “token stream” matcher ($x:ts or maybe $x:tts) that matches like $($x:tt)()*.
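To make the comparison concrete, here is a small sketch of the $($x:tt)* idiom that such a “token stream” matcher would abbreviate: peel one tt off the front and capture everything else as the rest. The names count_tts and count_tts_example are made up for illustration.

```rust
// Hypothetical macro: counts token trees by peeling one `tt` off the front
// and capturing the remainder with `$($rest:tt)*`.
macro_rules! count_tts {
    // No tokens left.
    () => { 0usize };
    // One token tree, then "everything else" as a token stream.
    ($head:tt $($rest:tt)*) => { 1usize + count_tts!($($rest)*) };
}

// Illustrative wrapper.
fn count_tts_example() -> usize {
    // `a`, `+`, `b`, and the bracketed group `[c d]` are four token trees.
    count_tts!(a + b [c d])
}
```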


#16

The EYE OF SAURON is watching!!!