Hello Rustaceans!
Before I start, let me introduce myself for those who don't know me. I'm Leo, currently an intern at Mozilla with @pnkfelix, and I'm supposed to fix the issue with the macro future-proofing analysis in the presence of macros with multiple arms. Unfortunately, it turned out that fixing it the way we wanted was not possible. This post is an attempt at explaining why, and what we can do about it.
## Background
The current implementation of macros-by-example in Rust (`macro_rules!`) is not future-proof: it does not guarantee that code that is currently considered valid Rust will continue to be so in the future if the syntax of the language is extended.
Rust macros expect some piece of Rust code as input and in turn generate some other piece of Rust code as output. Macro definitions can specify the shape of what they expect as input using a kind of pattern. Those patterns can include "matchers" to say that the macro expects some specific sort of Rust syntax (called "non-terminals"): an expression, a type, an identifier, etc. Because of that, macro expansion relies on the Rust parser for those syntactic forms, and thus the whole expansion process might stop working if the Rust syntax changes, because the parser won't produce the same results anymore.
Let’s take an example:
```rust
macro_rules! foo(
    ($e:expr : $t:ty) => ( ... )
)
```
This defines a macro that expects its input to be any valid Rust expression, followed by a colon token, followed by any valid Rust type. The `$e` is used to bind the actual matched expression to a name so that the macro can include it in its output.
Now we could invoke this macro like:
```rust
foo!(x: i32)
```
and it would work, because `x` is an identifier and an identifier is a valid Rust expression. Likewise, `i32` is also an identifier, and an identifier is again a valid Rust type.
But recently the syntax of Rust changed to include type ascription syntax: an expression explicitly annotated with a colon and a type is now itself a valid expression. Speaking in terms of grammar, we added the following production to the Rust syntax:
```
< expression > ::= ...
                 | < expression > : < type >
```
With this new syntactic form, when trying to match `x: i32` against `$e:expr : $t:ty`, the parser will parse `x: i32` as a whole expression, and then find the end of the macro invocation where it expects a colon. So previously valid Rust code is now rejected.
## Current solution (and problem)
The solution to this problem is to guarantee that some part of the grammar won’t change (for example by stating that a given token will always be unambiguous after an expression) and then check that macro definitions don’t rely on syntax that has not been guaranteed. For example, the previous macro would have been rejected before the addition of the type ascription syntax, because it would not have been guaranteed back then that a colon would always be unambiguous after an expression. But the two following macros:
```rust
macro_rules! foo(($e:expr ; $t:ty) => ())
macro_rules! foo(($i:ident : $t:ty) => ())
```
would both be accepted, because it's guaranteed that a semicolon will always be unambiguous after an expression (for the first macro), and that any token will always be unambiguous after an identifier (for the second), since an identifier is always a single token.
So what we did is basically define, for each construction (`expr`, `ty`, etc.), a set of tokens that are always okay to appear after a matcher for it in a macro pattern. Those sets were named FOLLOW sets, somewhat poorly, because they are not exactly the same thing as what "FOLLOW set" usually refers to in the parsing literature.
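To make this concrete, here is a simplified sketch of what such a table might look like. The function name is made up and the sets shown are only illustrative, reconstructed from the examples in this post; this is not rustc's actual implementation:

```rust
/// Returns true if `tok` is guaranteed to remain unambiguous after a
/// matcher for the given non-terminal. Illustrative sketch only.
fn in_follow_set(nonterminal: &str, tok: &str) -> bool {
    match nonterminal {
        // Only a few separator tokens are guaranteed after an
        // expression; note that `:` and `@` are deliberately absent.
        "expr" => matches!(tok, ";" | "," | "=>"),
        // Types guarantee a slightly different set, including `=>`.
        "ty" => matches!(tok, ";" | "," | "=>" | "=" | "|"),
        // An identifier is a single token, so any token may follow.
        "ident" => true,
        _ => false,
    }
}
```

A call like `in_follow_set("expr", ";")` returns `true` while `in_follow_set("expr", ":")` returns `false`, which is exactly why the semicolon macro above is accepted and the colon-based one from the beginning of the post is not.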
We then implemented an analysis that checks that a matcher is always followed by something that belongs to its FOLLOW set. But that is not sufficient: it only works for macros with a single pattern. Rust macros are allowed to define multiple arms with multiple patterns, the first arm that matches being the one expanded:
```rust
macro_rules! foo(
    ($e:expr) => (...);
    ($i:ident : $t:ty) => (...)
)
```
In this case, the two arms, taken individually, are okay with regard to the FOLLOW set rules. But taken together they're not. Consider the following macro invocation:
```rust
foo!(x: i32)
```
Previously it would have failed to match the first arm, because `x: i32` was not a valid expression back then, so the macro would have expanded to its second arm. But it is a valid expression now, so the macro will expand to its first arm. In fact this is much worse than the previous example: the code still compiles, without errors, but behaves differently without the programmer knowing it.
I spent quite a lot of time thinking about this specific problem, trying to find an algorithm that could, based on the FOLLOW sets, reject that macro. The original idea I had was to compare two macro arms and, when encountering a matcher in the first, try to match as much of the second arm against it as possible. This is different from what the parser does during macro expansion: in this context we don't want to know if the input will always match, but whether a match can exist. For example, such a parser would say that `$t:ty` matches `$e:expr`, because both can be an identifier. Then, we would look at the token that comes just after and see whether it belongs to the FOLLOW set of the matcher we are trying to match from the first arm. This would have worked for the previous example: before type ascription, parsing would have stopped at the colon, which does not belong to the FOLLOW set of `expr`.
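Here is a toy sketch of that idea on the example above. This is a hypothetical model with made-up names and a deliberately crude "does a match exist?" step, not the real checker: greedily consume the symbols of the second arm that could take part in a match of `$e:expr`, then test the next token against FOLLOW(`expr`):

```rust
#[derive(Clone, Copy)]
enum Sym {
    Tok(&'static str),
    Nt(&'static str),
}

// In this toy model, any non-terminal may take part in a match of an
// expression, since `ident`, `ty`, etc. can all be a lone identifier,
// and an identifier is a valid expression.
fn may_match_expr(s: Sym) -> bool {
    matches!(s, Sym::Nt(_))
}

// The (pre-type-ascription) FOLLOW set of `expr`.
fn follow_expr(tok: &str) -> bool {
    matches!(tok, ";" | "," | "=>")
}

// Check an arm against a leading `$e:expr` matcher in another arm:
// find some prefix that could match an expression, then make sure the
// token right after it is in FOLLOW(expr).
fn check_against_expr(arm: &[Sym]) -> bool {
    let mut i = 0;
    while i < arm.len() && may_match_expr(arm[i]) {
        i += 1;
    }
    match arm.get(i) {
        Some(&Sym::Tok(t)) => follow_expr(t),
        _ => true, // toy assumption: end of arm, nothing ambiguous follows
    }
}

fn main() {
    // Second arm of the earlier macro: $i:ident : $t:ty
    let arm = [Sym::Nt("ident"), Sym::Tok(":"), Sym::Nt("ty")];
    // `:` is not in FOLLOW(expr), so the macro is rejected.
    assert!(!check_against_expr(&arm));
}
```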
Unfortunately, that doesn't work in all cases, and I finally realized that it is impossible to solve this problem using only FOLLOW sets as a formalism for what is or is not allowed to change. Take the following macro:
```rust
macro_rules! foo(
    ($t:ty) => ();
    ([ $i:ident ; $x:ident @ $y:ident ]) => ()
)
```
This is of course reminiscent of the syntax of static array types:
```
< type > ::= ...
           | [ < type > ; < expr > ]
```
Right now, `a @ b` (where `a` and `b` are arbitrary identifiers) is not a valid expression, so an input like `foo!([ i32 ; a @ b ])` will match the second arm, since it does not describe a valid type. But we might want to be able to extend Rust in the future so that `a @ b` becomes a valid expression. That's why `@` is not included in the FOLLOW set of `expr`. But here, `expr` appears nowhere in the definition of the macro. The only matchers that appear are `ty` and `ident`. The FOLLOW set of `ident` accepts any token. The only reason we could find to reject this macro by doing an analysis based purely on the FOLLOW sets is that `@` is not in the FOLLOW set of `ty`. But that's absurd! That's not the reason why we don't want to allow this macro: we are modifying `expr` here, not `ty`. In fact we could write the same example with another token that is not currently allowed in an expression but is in the FOLLOW set of `ty` (say, `=>`), and we would have the same problem, except that we would not have any reason at all to disallow the macro. And we cannot find one by looking only at the FOLLOW sets. Finding a reason to disallow this macro requires knowing that the `@` token appears inside something that the first arm could possibly parse as an expression, and this in turn requires digging into the production rules of the Rust grammar.
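For concreteness, here is that variant: the same macro with `=>` in place of `@`. Every per-arm FOLLOW-set rule is satisfied, yet the future-proofing hazard is exactly the same:

```rust
macro_rules! foo(
    ($t:ty) => ();
    ([ $i:ident ; $x:ident => $y:ident ]) => ()
)
```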
## Possible solutions
So knowing that, what can we do to solve this problem? There are basically three ways to go:
- Do nothing. Do not check those macros, and accept that we might break user code when extending the syntax.
- Continue to try to find an exact solution (or at least a reasonably good approximation) by thinking in terms of grammar productions instead of FOLLOW sets.
- Try instead to find a very coarse approximation that does not require digging into the grammar, maybe at the expense of generating many false positives, i.e. reporting perfectly valid macros as errors.
I will try to explain in detail the pros and cons of the last two approaches, assuming that the first one is definitely not what we want. Risking breaking user code is bad; risking making user code silently behave differently is worse. And we cannot abandon the possibility of extending the syntax of the language.
## Grammar-based solution
I haven't thought in detail about what this solution would look like. One option would be to reason on an LR automaton. We could consider the grammar that describes the language recognized by a macro, defined by taking all the productions of the Rust language (or at least the parts we need: `expr`, `ty`, etc. and their dependencies, which, if not all of it, is probably a very large part of the whole grammar) and adding, as entry points, productions that match the arms of the macro. We would need to adjust their precedence more or less like Yacc-like parser generators do with `%prec` annotations, to ensure that multiple macro arms are allowed to match the same input (provided they already do right now, not modulo some future change in the syntax). We could then convert this grammar to an LR automaton and run some kind of modified shift-reduce and reduce-reduce conflict checker that would not only check that there are no traditional conflicts, but also that there are no "future-proofing" conflicts: it would report a conflict if, say, a given state has both a transition and a reduction on different tokens, where one of the tokens has not been added to some guaranteed set (which would not be a FOLLOW set here) of the non-terminal being parsed. It would of course not take precedence into account for this kind of conflict. It's still unclear if and how this could work, but it might be worthwhile. There might of course be other possible approaches, based purely on the productions instead of an automaton.
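As a rough illustration of that construction (hypothetical notation, glossing over the precedence adjustments mentioned above), for the earlier two-arm macro `foo` the constructed grammar would add entry points like:

```
< macro-input > ::= < arm-1 > | < arm-2 >
< arm-1 >       ::= < expression >
< arm-2 >       ::= < identifier > : < type >
```

together with all the Rust productions that `< expression >`, `< identifier >` and `< type >` depend on; the modified conflict checker would then run on the LR automaton built for `< macro-input >`.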
Now for the pros and cons:
- The only pro I can think of compared to the other solutions is that, if this one works, it might give an algorithm that produces no or very few false positives. It's a significant one though.
- But it is still unclear whether it could work at all, and sorting that out would require a lot of work that may not be worthwhile if, in the end, we realize it doesn't work.
- It would require having a canonical LR grammar for all (or almost all) of the Rust language. This is not yet the case and might not be for a long time.
- Constructing an LR automaton each time we have to check a macro definition could be quite expensive.
## Approximate solution
A very simple solution would be to just expect the user to manually disambiguate their macros by making them begin with different tokens, or with non-terminals whose FIRST sets (the sets of tokens a non-terminal can begin with) are pairwise disjoint. The algorithm could be as simple as iterating over the matchers of two macro arms until finding either a trivial disambiguation (and thus accepting the macro) or a non-trivial case that we don't know how to handle (and thus rejecting it). For checking two arms consisting of symbols `a1 a2 ... an` and `b1 b2 ... bn`, this would give something like the following (a sketch in code appears after the list):
- If `ak` and `bk` are both the same token, continue on `ak+1 ... an` and `bk+1 ... bn`.
- If `ak` and `bk` are different tokens, accept the macro.
- If `ak` is a token and `bk` is a non-terminal, accept the macro if the token is not in the FIRST set of this non-terminal. If it is, it's a tricky case. One could think it would not be a problem, since the token has precedence by being in the first arm, so we could just continue looking at `ak+1 ... an` and `bk+1 ... bn`; but the problem is that a single non-terminal can match several tokens, so I think we cannot tell how many tokens we would need to skip in the first arm.
- If `ak` is a non-terminal and `bk` is a token, accept the macro if the token is not in the FIRST set of this non-terminal, and reject it if it is. We could emit an error message here suggesting to put the arm containing the token first, but as stated above, it is still unclear whether we could be sure that would solve the problem.
- If both are non-terminals, accept the macro if their FIRST sets are disjoint, and reject it if not.
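Here is a minimal sketch of that loop in code, under the assumptions above. The symbol type and the FIRST sets are toy stand-ins, not a real implementation:

```rust
#[derive(Clone, Copy)]
enum Sym {
    Tok(&'static str),
    Nt(&'static str),
}

enum Verdict {
    Accept,
    Reject,
}

// Toy FIRST sets: the tokens a non-terminal can begin with.
fn first_set(nt: &str) -> &'static [&'static str] {
    match nt {
        "expr" => &["ident", "literal", "(", "[", "-", "!"],
        "ty" => &["ident", "(", "[", "&", "*"],
        "ident" => &["ident"],
        _ => &[],
    }
}

fn disjoint(a: &str, b: &str) -> bool {
    first_set(a).iter().all(|t| !first_set(b).contains(t))
}

// Walk two arms in lockstep until we find a trivial disambiguation
// (accept) or a case we don't know how to handle (reject).
fn check_arms(a: &[Sym], b: &[Sym]) -> Verdict {
    for (&x, &y) in a.iter().zip(b.iter()) {
        match (x, y) {
            // Same token on both sides: no decision yet, keep going.
            (Sym::Tok(s), Sym::Tok(t)) if s == t => continue,
            // Different tokens: trivially disambiguated.
            (Sym::Tok(_), Sym::Tok(_)) => return Verdict::Accept,
            // Token vs non-terminal (either way around): fine only if
            // the token cannot begin the non-terminal; otherwise we
            // don't know how many tokens to skip, so reject.
            (Sym::Tok(t), Sym::Nt(nt)) | (Sym::Nt(nt), Sym::Tok(t)) => {
                return if first_set(nt).contains(&t) {
                    Verdict::Reject
                } else {
                    Verdict::Accept
                };
            }
            // Two non-terminals: accept only on disjoint FIRST sets.
            (Sym::Nt(m), Sym::Nt(n)) => {
                return if disjoint(m, n) {
                    Verdict::Accept
                } else {
                    Verdict::Reject
                };
            }
        }
    }
    // The loop ran out: one arm is a prefix of the other (or they are
    // identical token sequences), which is ambiguous, so reject.
    Verdict::Reject
}
```

On the problematic macro from earlier, comparing `$e:expr` against `$i:ident : $t:ty` immediately hits the non-terminal versus non-terminal case, and since `expr` and `ident` share `ident` in their FIRST sets, the macro is rejected.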
Of course the FIRST sets (or some subset of them) would have to be guaranteed, just like we did for the FOLLOW sets. Fortunately, it's much less likely that we will modify them in the future than the FOLLOW sets (the only changes that would affect them would likely be the addition of prefix operators or new delimited forms, which should not happen often), so they could probably be larger.
The obvious con is that this is an extraordinarily strong and conservative approach, and it would probably break a lot of code. But there is a proposal to introduce a new `macro!` form and possibly to deprecate `macro_rules!`. A possibility would be to introduce this check only in the new `macro!` form. This would not break existing code. Also, this approach has the advantage of being extremely simple to implement, so we could implement it quickly and use crater to see how much code it breaks, and maybe gather statistics on the breakage (for example: is it mostly due to token-versus-non-terminal conflicts or to non-terminal-versus-non-terminal conflicts?) to get an idea of whether the broken code would be easy to fix or not.
It will obviously drastically decrease the expressivity of the macro system. It's unclear whether this will be a real loss for macros-by-example. It will probably forbid macros that use recursion and rely on the dispatch between the arms to handle termination, as in the example below. But as a trade-off, procedural syntax extensions are supposed to become more easily (and, hopefully, stably) usable. Macros-by-example were not meant to do arbitrarily complex computation in the first place anyway.
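For instance, a classic recursive macro in this style (a hypothetical illustration) would presumably be rejected, because both of its arms begin with an `expr` matcher and therefore share the same FIRST set:

```rust
macro_rules! sum(
    ($x:expr) => ($x);
    ($x:expr, $($rest:expr),+) => ($x + sum!($($rest),+))
)
```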
To sum up this post, the plan I propose is the following:
- Implement the approximate solution and test it with crater to get a more precise idea of the breakage it would cause.
- If it causes too much breakage that is too hard to fix, or decreases the expressivity of the macro system too much, investigate the grammar-based solution.
- If the breakage and the change in expressivity seem reasonable, implement it in the new `macro!` form that will replace `macro_rules!`, arguing that the ability to write future-proof code is a good trade-off for expressivity, and implement the new, easy-to-use, stable syntax extensions alongside it to compensate for the loss of expressivity in use cases where arbitrary compile-time computation is indeed required.