On macros future-proofing, FOLLOW sets, and related stuff

Hello Rustaceans!

Before I start, let me introduce myself for those who don’t know me. :slight_smile: I’m Leo, I’m currently an intern at Mozilla with @pnkfelix, and I’m supposed to fix the issue with the macro future-proofing analysis in the presence of macros with multiple arms. Unfortunately, it turned out that fixing it the way we wanted was not possible. This post is an attempt at explaining why, and what we can do about it.

Background

The current implementation of macros-by-example in Rust (macro_rules!) is not future-proof: it does not guarantee that code that is currently considered valid Rust code will continue to be so in the future if the syntax of the language is extended.

Rust macros expect some piece of Rust code as input and in turn generate some other piece of Rust code as output. Macro definitions specify the shape of the input they expect using a kind of pattern. Those patterns can include ‶matchers″ to say that the macro expects some specific sort of Rust syntax (called a “non-terminal”): an expression, a type, an identifier, etc. Because of that, macro expansion relies on the Rust parser for those syntactic forms, and thus the whole expansion process might stop working if the Rust syntax changes, because the parser won’t produce the same results anymore.

Let’s take an example:

macro_rules! foo(
    ($e:expr : $t:ty) => ( ... )
);

This defines a macro that expects its input to be any valid Rust expression followed by a colon token followed by any valid Rust type. The $e is used to bind the actual matched expression to a name so that the macro can include it in its output.

Now we could invoke this macro like:

foo!(x: i32)

and it would work, because x is an identifier and an identifier is a valid Rust expression. Likewise, i32 is also an identifier, and an identifier is again a valid Rust type.

But recently the syntax of Rust changed to include type ascription syntax: an expression explicitly annotated with a colon and a type is now itself a valid expression. Speaking in terms of grammar, we added the following production to the Rust syntax:

< expression > ::= ...
                 | < expression > : < type >

With this new syntactic form, when trying to match x: i32 against $e:expr : $t:ty, the parser will parse x: i32 as a whole expression, and then find the end of the macro invocation where it expects a colon. So previously valid Rust code is now rejected.

Current solution (and problem)

The solution to this problem is to guarantee that some part of the grammar won’t change (for example, by stating that a given token will always be unambiguous after an expression) and then to check that macro definitions don’t rely on syntax that has not been guaranteed. For example, the previous macro would have been rejected even before the addition of the type ascription syntax, because it would not have been guaranteed back then that a colon would always be unambiguous after an expression. But the two following macros:

macro_rules! foo(($e: expr ; $t:ty) => ())
macro_rules! foo(($i: ident : $t:ty) => ())

would both be accepted because it’s guaranteed that a semicolon will always be unambiguous after an expression, for the first macro, and that any token will always be unambiguous after an identifier, for the second, because an identifier is always a single token.

So what we did was basically to define, for each construction (expr, ty, etc.), a set of tokens that are always OK to appear after a matcher for that construction in a macro pattern. Those sets were named FOLLOW sets, which is a rather poor name, because they are not exactly the same thing as what ‶FOLLOW set″ usually refers to in the parsing literature.
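To make this concrete, here is a rough sketch, in Rust, of what such a table could look like. The fragment names and the token lists below are assumptions made for the sake of the example, not the exact sets that the compiler guarantees.

fn in_follow_set(fragment: &str, tok: &str) -> bool {
    match fragment {
        // An expression is only guaranteed to be followed by a small set of
        // tokens, so that new expression forms can be added later.
        "expr" => match tok {
            "=>" | "," | ";" => true,
            _ => false,
        },
        // An identifier is always a single token, so any token may follow it.
        "ident" | "tt" | "block" => true,
        // Other fragment specifiers are omitted from this sketch.
        _ => false,
    }
}

fn main() {
    assert!(in_follow_set("expr", ";"));  // $e:expr ; ... is allowed
    assert!(!in_follow_set("expr", ":")); // $e:expr : ... is not
    assert!(in_follow_set("ident", ":")); // anything may follow $i:ident
}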

We then implemented an analysis that checks that a matcher is always followed by something that belongs to its FOLLOW set. But that is not sufficient: it only works for macros with a single pattern, whereas Rust macros are allowed to define multiple arms with multiple patterns, the first one that matches being expanded:

macro_rules! foo(
    ($e:expr) => (...);
    ($i:ident : $t:ty) => (...)
);

In this case, the two arms, taken individually, are okay with regard to the FOLLOW set rules. But taken together they’re not. Consider the following macro invocation:

foo!(x: i32)

Previously, it would have failed to match the first arm, because x: i32 was not a valid expression back then, so the macro would have expanded using its second arm. But it is one now, so the macro will expand using its first arm. In fact this is much worse than the previous example: the code still compiles, without errors, but behaves differently without the programmer knowing it.

I took quite a lot of time to think about this specific problem and to find an algorithm that could, based on the FOLLOW sets, reject that macro. The original idea I had was to compare two macro arms and, when encountering a matcher in the first, try to make as much of the second arm as possible match it. This is different from the parser used during macro expansion: in this context we don’t want to know whether it will always match, but whether a match exists. For example, such a parser would say that $t:ty matches $e:expr, because both can be an identifier. Then, we would look at the token that comes just after and see whether it belongs to the FOLLOW set of the matcher we are trying to match from the first arm. This would have worked for the previous example: before type ascription, parsing would have stopped at the colon, which does not belong to the FOLLOW set of expr.

Unfortunately, that doesn’t work in all cases, and I finally realized that it is impossible to solve this problem using only FOLLOW sets as the formalism for what is or is not allowed to change. Take the following macro:

macro_rules! foo(
    ($t:ty) => ();
    ([ $i:ident ; $x:ident @ $y:ident ]) => ()
);

This is of course reminiscent of the syntax for fixed-size array types:

< type > ::= ...
           | [ < type > ; < expr > ]

Right now, a @ b (where a and b are arbitrary identifiers) is not a valid expression, so an input like foo!([ i32 ; a @ b ]) will match the second arm, since it does not describe a valid type. But we might want to be able to extend Rust in the future so that a @ b becomes a valid expression. That’s why @ is not included in the FOLLOW set of expr. But here, expr appears nowhere in the definition of the macro. The only matchers that appear are ty and ident. The FOLLOW set of ident accepts any token. The only reason we could find to reject this macro with an analysis based purely on the FOLLOW sets is that @ is not in the FOLLOW set of ty. But that’s absurd! That’s not the reason why we don’t want to allow this: we are modifying expr here, not ty. In fact we could write the same example with another token that is not currently allowed in an expression but is in the FOLLOW set of ty (say, =>), and we would have the same problem, except that we would not have any reason at all to disallow the macro, and we cannot find one by looking only at the FOLLOW sets. Finding a reason to disallow this macro requires knowing that the @ token appears here inside something that the first arm could possibly parse as an expression, and this in turn requires digging into the production rules of the Rust grammar.

Possible solutions

So knowing that, what can we do to solve this problem? There are basically three ways to go:

  • Do nothing. Do not check those macros, and accept that we might break user code when extending the syntax.
  • Continue to try to find an exact solution (or, at least, a reasonably close approximation) by thinking in terms of grammar productions instead of FOLLOW sets.
  • Try instead to find a very rough approximation that does not require digging into the grammar, maybe at the expense of generating many false positives, i.e. reporting perfectly valid macros as errors.

I will try to explain in detail the pros and cons of the last two approaches, assuming that the first one is definitely not what we want. Risking breaking user code is bad; risking making user code behave differently is worse. And we cannot give up the possibility of extending the syntax of the language.

Grammar-based solution

I haven’t thought in detail about what this solution would look like. One option would be to reason on an LR automaton. We could consider the grammar that describes the language recognized by a macro, defined by taking all the productions of the Rust language (or at least the parts that we need: expr, ty, etc. and their dependencies, which, even if not the whole grammar, is probably a very large part of it) and adding, as entry points, productions that match the arms of the macro. We would need to adjust their precedence more or less like Yacc-like parser generators do with %prec annotations, to ensure that multiple macro arms are allowed to match the same input (provided that they already do right now, not modulo some future change in the syntax). We could then convert this grammar to an LR automaton and run some kind of modified shift-reduce and reduce-reduce conflict checker that would not only check that there are no traditional conflicts, but also that there are no ‶future-proofing″ conflicts: that is, it would report a conflict if a given state has, say, a transition and a reduction on different tokens, and one of the tokens has not been added to some guaranteed set (which would not be a FOLLOW set here) of the non-terminal we are parsing. It would of course not take precedence into account for this kind of conflict. It’s still unclear if and how this could work, but it might be worthwhile. Of course there might be other possible approaches, based purely on the productions instead of an automaton.

Now for the pros and cons:

  • The only pro I can think of, compared to the other solutions, is that if this one works, it might give an algorithm that produces no or very few false positives. That is a significant one, though.
  • But it is still unclear whether it could work at all, and sorting that out would require a lot of work that may not be worthwhile if in the end we realize that it doesn’t work.
  • It would require having a canonical LR grammar for all (or almost all) of the Rust language. This is not yet the case and might not be the case for a long time.
  • Constructing an LR automaton each time we have to check a macro definition could be quite expensive.

Approximate solution

A very simple solution would be to just expect the user to manually disambiguate their macro arms by making them begin with different tokens, or with non-terminals whose FIRST sets (the set of tokens a non-terminal can begin with) are pairwise disjoint. The algorithm could be as simple as iterating over the matchers of two macro arms until finding either a trivial disambiguation (and thus accepting the macro) or a non-trivial case that we don’t know how to handle (and thus rejecting it). For checking two arms consisting of the symbols a1 a2 ... an and b1 b2 ... bn, this would give something like the following (a rough sketch of this check in code follows the list):

  • If ak and bk are the same token, continue on ak+1 ... an and bk+1 ... bn.
  • If ak and bk are different tokens, accept the macro.
  • If ak is a token and bk is a non-terminal, accept the macro if the token is not in the FIRST set of that non-terminal. If it is, it’s a tricky case. One could think that it would not be a problem, since the token has precedence by virtue of being in the first arm, so we could just continue to look at ak+1 ... an and bk+1 ... bn; but the problem is that a single non-terminal can match several tokens, so I think we cannot tell how many tokens we would need to skip in the first arm.
  • If ak is a non-terminal and bk is a token, accept the macro if the token is not in the FIRST set of that non-terminal, reject it if it is. We could emit an error message here suggesting to put the arm containing the token first, but as stated above, it is still unclear whether we could be sure that this would solve the problem.
  • If they are both non-terminals, accept the macro if their FIRST sets are disjoint, reject it if not.
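To make the shape of this check more concrete, here is a minimal sketch of it in Rust. The Nt variants, the token classes like ‶ident″, and the FIRST sets below are all assumptions made up for the example, not the sets the compiler would actually guarantee; the sketch implements only the conservative version of the rules above, rejecting as soon as it hits a non-trivial case.

#[derive(Clone, Copy, PartialEq, Eq)]
enum Nt { Expr, Ty, Ident }

enum Matcher {
    Token(&'static str), // a literal token such as ";" or "@"
    NonTerminal(Nt),     // a matcher such as $e:expr or $t:ty
}

// Hypothetical guaranteed FIRST sets: tokens a non-terminal may start with.
// "ident" and "literal" stand for whole token classes here.
fn first_set(nt: Nt) -> &'static [&'static str] {
    match nt {
        Nt::Expr => &["ident", "literal", "(", "[", "-", "&", "*"],
        Nt::Ty => &["ident", "(", "[", "&", "*"],
        Nt::Ident => &["ident"],
    }
}

fn first_contains(nt: Nt, tok: &str) -> bool {
    first_set(nt).iter().any(|&t| t == tok)
}

fn first_disjoint(a: Nt, b: Nt) -> bool {
    first_set(a).iter().all(|&t| !first_contains(b, t))
}

// Conservatively check a pair of arms (`a` appears before `b` in the macro).
// Returns true if the pair is trivially disambiguated, false otherwise.
fn check_arms(a: &[Matcher], b: &[Matcher]) -> bool {
    for (ma, mb) in a.iter().zip(b.iter()) {
        match (ma, mb) {
            // Same token on both sides: no disambiguation yet, keep looking.
            (Matcher::Token(ta), Matcher::Token(tb)) if ta == tb => continue,
            // Two different tokens: the arms can never match the same input.
            (Matcher::Token(_), Matcher::Token(_)) => return true,
            // A token against a non-terminal (in either order): fine if the
            // token cannot start that non-terminal; otherwise we don't know
            // how many tokens the non-terminal may consume, so we give up.
            (Matcher::Token(t), Matcher::NonTerminal(nt))
            | (Matcher::NonTerminal(nt), Matcher::Token(t)) => {
                return !first_contains(*nt, *t);
            }
            // Two non-terminals: fine only if their FIRST sets are disjoint.
            (Matcher::NonTerminal(na), Matcher::NonTerminal(nb)) => {
                return first_disjoint(*na, *nb);
            }
        }
    }
    // One arm is a prefix of the other (or they are identical): reject.
    false
}

fn main() {
    use Matcher::*;
    // ($e:expr) vs ($i:ident : $t:ty): rejected, because FIRST(expr) and
    // FIRST(ident) overlap (both can start with an identifier).
    let a = [NonTerminal(Nt::Expr)];
    let b = [NonTerminal(Nt::Ident), Token(":"), NonTerminal(Nt::Ty)];
    assert!(!check_arms(&a, &b));

    // (foo $e:expr) vs (bar $t:ty): accepted, the leading tokens differ.
    let c = [Token("foo"), NonTerminal(Nt::Expr)];
    let d = [Token("bar"), NonTerminal(Nt::Ty)];
    assert!(check_arms(&c, &d));
}

A real implementation would of course work on the compiler’s own token and matcher types, check every pair of arms, and handle repetitions ($(...)*), which this sketch ignores.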

Of course, the FIRST sets (or some subset of them) will have to be guaranteed, just like we did for the FOLLOW sets. Fortunately, it’s much less likely that we will modify them in the future than the FOLLOW sets (the only changes that would require it would likely be the addition of prefix operators or of new delimited forms, which is unlikely to happen often), so they could probably be larger.

The obvious con is that this is an extraordinarily strong and conservative approach, and it would probably break a lot of code. But there is a proposal to introduce a new macro! form and possibly to deprecate macro_rules!. A possibility would be to introduce this restriction only in the new macro! form, which would not break existing code. Also, this approach has the advantage of being extremely simple to implement, so we can implement it quickly and use crater to see how much code it breaks, and maybe gather statistics on the breakage (for example: is it mostly due to token-versus-non-terminal cases or to non-terminal-versus-non-terminal cases?) to get an idea of whether the broken code would be easy to fix or not.

It will obviously drastically decrease the expressivity of the macro system. It’s unclear whether that will be a real loss for macros-by-example. It will probably forbid macros that use recursion and rely on dispatch between the arms to handle termination. But as a trade-off, procedural syntax extensions are supposed to become more easily (and, hopefully, stably) usable. Macros-by-example were not meant to do arbitrarily complex computation in the first place anyway.

To sum up this post, the plan I propose is the following:

  • Implement the approximate solution and test it with crater to get a more precise idea of the breakage it would cause.
  • If it causes too much breakage that is too hard to fix, or decreases the expressivity of the macro system too much, investigate the grammar-based solution.
  • If the breakage and the change in expressivity seem reasonable, implement it in the new macro! form that will replace macro_rules!, arguing that the ability to write future-proof code is a good trade-off for expressivity, and implement the new, easy-to-use, stable syntax extensions alongside it to compensate for the loss of expressivity in the use cases where arbitrary computation at compile time is indeed required.

This sounds like a problem that should be positively trivial to detect by some lint-like program. Running such a program (similarly to crater) on all the crates on crates.io would give us hard data on the prevalence and severity of the problem.

Here’s a relatively simple macro (as these devices go) that is kind of in the tt-muncher category. It’s very useful and creates its own kind of syntax (if plain rust would have cut it, I would be more than happy to use plain rust).

slice argument macro s![]

Isn’t the hard part of this problem actually doing the analysis, something that is needed whether this is in the compiler or in an external lint/program? Could you expand on what you’re thinking?

What do you mean by ‶plain Rust code″? Do you mean not using macros and using Rust functions instead, or do you mean macros implemented in plain Rust code, like procedural syntax extensions? By the way, the @keyword pattern in this macro is more or less what I mean by ‶manually disambiguating″. However, I don’t feel like the whole macro can be validated easily. The ‶trivial″ algorithm could certainly be tweaked to accept the last arm, but the four @parse arms look really complicated to analyse with a simple solution, even though the macro looks future-proof to me.

Yes, I don’t see anything trivial in this problem. The whole point of this post is to try to explain how complicated it is. Even the solution that I present here as ‶trivial″ has a few tricky corner cases that need to be thought about a bit more carefully. Whether the implementation is a lint or lives in the compiler doesn’t change anything (in fact, it would even use the same API…).

I meant without using macros, just regular rust types & traits & functions.

I actually like the fact that these changes would make macro pattern matches more order-independent and more like match arms.

I think that relying on procedural macros to fill the gap in expressivity is fine. I intend to add quite a lot of pattern matching functionality to libmacro, so procedural macros ought to be able to customise pattern matching fairly easily and do their own thing.

More order-independent and more like match arms? But match arms are very order-dependent. I’m not sure I understand.

@LeoTestard I think you are correct that “FOLLOW” sets aren’t really what we need; I think what I was thinking of using is more like “FIRST” sets – that is, if you can be parsing a $t:expr in one arm and a token X in the other, then it would be an error if X is considered a legal “start” for an expression.

This confuses me a bit. Maybe I've missed some context in the conversation. For one thing, match arms are order dependent, and it seems like order dependence in macros is pretty crucial to being able to write parsers etc. (and pretty darn easy to implement). Maybe I am just confused by this phrase "more order independent" -- do you just mean because it would forbid careful hacks around what is or is not going to parse as an expr or ty etc.?

Yes, that’s more or less the idea of the last algorithm I proposed, but just considering it to be an error if X is a legal start for an expression is a very conservative approach. If we rely on order-dependence, we can make it much less conservative by accepting cases where the first arm is obviously more specific than the second, but in this case you have to ‶skip″ the non-terminal to continue the analysis further. The problem is that you don’t know how many tokens you have to skip in the first arm in that case. I don’t know if there is a solution to that problem; I haven’t thought much about it yet. But the conservative approach where you just reject the macro on the first erroneous non-trivial case you encounter (i.e. the first case involving a non-terminal) would certainly work.

Sorry, being silly about match arms. I guess there is something about the implicit-ness of macro pattern matching which makes me more uneasy with order dependence than match arms.

Sorry, I was being a bit imprecise. I did indeed assume that we would take the order into account, so specifically the error would be if the earlier arm matches (e.g.) a nonterminal nt but the token from the latter arm is in FIRST(nt). Under this model, there is no need to "skip" tokens. The goal is basically to show that, if the latter arm is to match, then the former arm must not have matched (specifically, the nonterminals within).

However, I can easily believe that this algorithm is too strong. (Obviously it will depend quite a bit on the FIRST sets etc, but I think we can expect them to be too strong.) It'd be great to start getting more specific, and collecting a catalog of examples (forgive me if I missed some from your original post).

There are two ways I can see going forward. One would be to add a lint, but avoid a hard error: something like, "this macro is fragile and may stop working or work differently as the definition of $expr changes" perhaps with suggestions like "consider reordering these macro arms" when it makes sense.

Another is to strengthen the analysis. This would make sense if we can find enough ways to do it. For example, one scenario that comes to mind is "postfix" disambiguators like:

 $e:expr ";"
 $t:ty ";"

The algorithm I was vaguely sketching above would fail here because the former arm wants an expr but the latter arm wants a ty, and FIRST(expr) overlaps with FIRST(ty). I could imagine some form of reasoning based on skipping tokens and the FOLLOW sets here that would accept this example (perhaps this is what you were alluding to above).

Thoughts?

Is this the case you were alluding to when you mentioned strengthening?

Strong +1 to this. It would be very helpful for gaining a good catalog of examples.

Yes, but I don’t think it’s feasible. Consider the following case:

macro_rules! foo(
    ($e:expr) => ...;
    ($x:ident @ $e:expr) => ...
);

This macro needs to be rejected, or at least warned about, because as the definition of expr changes, inputs that currently match the second arm will start matching the first. So far, it’s simple. Let’s see what happens when we swap the arms:

macro_rules! foo(
    ($x:ident @ $e:expr) => ...;
    ($e:expr) => ...
);

The reasoning is to accept this because the first arm is more specific, so no input that matches the second arm should start matching the first if it doesn’t already do so. In this specific case it looks easy. But in the general case we just can’t accept that macro, because there could be ambiguities in the rest of the matcher. Consider:

macro_rules! foo(
    ($x:ident @ $e:expr ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

The problem here is quite obvious. If we stop at the first ‶more specific token in the first arm versus more general NT in the second arm″ case, we can miss possible ambiguities later in the matcher. In this case it looks like the thing to do is to skip until the ‘;’ and then continue the analysis. But is this possible in the general case? There won’t always be a delimiter token to help us, and we can’t know how many tokens we need to skip in the first arm. Furthermore, there could be overlapping NTs, etc. The general case is just too hard for such a simple approach.

In principle, macros where a latter arm can match inputs that are also matched by a former arm can be OK. The problem is when input that is matched by the latter but not by the former could start to match it if the syntax is extended.

So let’s dig into this example a bit more. Maybe you need to spell things out a bit for me. As I look at this, I’m not 100% sure where the problem is arising:

macro_rules! foo(
    ($x:ident @ $e:expr ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

In particular, what is the input that will change and cause things to continue differently? Presumably the idea is that there is some input which today triggers the second arm, but when the definition of expr is expanded to include @ will trigger the first arm, right?

(I had a post with some other thoughts, but I realized I was having trouble making an actual example that caused a problem here, though I see the abstract problem you are concerned with.)

Yeah, right. And that’s why the first example (without the semicolons, with the expr arm first) should be disallowed. In fact it was taken from the running example that triggered this whole issue, which used type ascription. Now that type ascription is actually landing, that example is no longer about future-proofing, so I changed it to use a @ instead of a :, but otherwise it remains the same.

macro_rules! foo(
    ($e:expr) => ...;
    ($x:ident @ $e:expr) => ...
);

Let’s invoke the macro with foo!(x @ x). If expr ::= ident @ expr is ever added as a production to the language, we have a problem. A problem that can be solved by swapping the arms, because the problematic input then matches the first arm, both before and after the extension of the language.

Now for the third example, the one you are talking about:

macro_rules! foo(
    ($x:ident @ $e:expr ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

Mhhhh, I guess I got confused at some point, because I can’t find an example that triggers a problem anymore, and thinking about it I believe there is none: I shouldn’t have used the same ‶hypothetical future syntax″ before and after the semicolon. But I think I can find an example of a macro such that, whatever order you put the arms in, you have an error, because the ‶first half″ (before the semicolon) of the first arm is more specific than the ‶first half″ of the second arm, but it’s the opposite for the ‶second halves″. For example:

macro_rules! foo(
    ($x:ident ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

Here foo!(x ; x @ x) can change arms if we add the same production as above to the grammar. OK, it’s not exactly the same example, but from the point of view of the algorithm it’s still a ‶more specific token in the first arm versus more general NT in the second arm″ case. And to reject it, we need to continue the analysis after the semicolon, which I think is not possible in the general case, for the reasons I gave in my previous post.