This is a follow-up to my previous post about future-proofing, in which I present the results of my work so far. I’m reaching a point where I don’t think I can improve the existing analysis much, so I started writing this pre-RFC about it, in order to get comments from the rest of the core team about the state of the work, the issues we could still address, and how we could land it.
It’s far from ready, I think; a lot of corner cases still need to be pinned down.
- Feature Name: first_sets_future_proofing
- Start Date: 2016-08-09
- RFC PR:
- Rust Issue:
Extend the restrictions on `macro_rules!` macros using the concept of FIRST sets, in order to be able to check more complex macros (that is, multi-arm macros and macros using sequence repetitions).
`macro_rules!` macros are not future-proof, which means there exist valid programs where macro invocations could change behaviour if new grammar productions were added to the language.
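As a hypothetical illustration (the macro and its arm bodies are mine, not taken from the analysis), here is a macro whose behaviour depends on what `expr` can match. If `@` were ever made a prefix operator, `@ 1` would become a valid expression and the first arm would start matching input that today reaches the second arm:

```rust
// Hypothetical example: the first arm matches any expression; the second
// matches `@` followed by an expression. Today `@` cannot start an
// expression, so the expr parser fails on it and matching falls through
// to the second arm.
macro_rules! which_arm {
    ($e:expr) => { "expr" };
    (@ $e:expr) => { "at-expr" };
}

fn main() {
    assert_eq!(which_arm!(1 + 1), "expr");
    // If a future Rust made `@` a prefix operator, this invocation would
    // silently switch to the first arm.
    assert_eq!(which_arm!(@ 1), "at-expr");
}
```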
An existing solution has been described and implemented. It aims at avoiding such macros by restricting, on one side, what macros can and can’t rely upon, and on the other side, what kinds of grammar productions we can add.
It relies on the concept of FOLLOW sets: it restricts macro matchers by specifying which tokens may occur after a non-terminal matcher in a macro pattern.
There are problems with this solution that make it insufficient: it fails to check macros that feature two “parallel” parsing possibilities, that is:
- Multiple arm macros
- Single-arm macros with sequence repetitions, where an input could match either the matchers inside the sequence or what comes after it. This is currently what blocks being able to “backtrack” when parsing `macro_rules!`.
This new solution is intended to be added as a complement to the current solution based on FOLLOW sets, not replace it.
It uses FIRST sets, that is, for each non-terminal, the set of tokens that are allowed to begin a sequence matching this non-terminal, now or possibly in the future (equivalently, by taking the complement with respect to the set of all Rust tokens, this specifies the set of tokens that will never be allowed to start a sequence matching the non-terminal).
Using this, we can look for “obvious disambiguations”, that is, places in the macro matchers where the languages accepted by each matcher (now or possibly in the future) are mutually exclusive: their intersection is empty.
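For instance (an illustrative sketch of mine, not output of the analysis), the two arms below are obviously disambiguated, assuming `fn` will never be allowed to start an expression: the sets of tokens that may begin each matcher are disjoint, so no future change to `expr` can move an invocation from one arm to the other:

```rust
// FIRST(arm 1) is the FIRST set of `expr` (literals, identifiers, `(`, ...);
// FIRST(arm 2) is just { `fn` }. The two sets are disjoint.
macro_rules! disambiguated {
    ($e:expr) => { "expr" };
    (fn $i:ident) => { "fn-arm" };
}

fn main() {
    assert_eq!(disambiguated!(1 + 1), "expr");
    assert_eq!(disambiguated!(fn foo), "fn-arm");
}
```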
[Insert a description of the algorithm here. For now, you can look here. The current version has more special cases builtin and a few bugs fixed but is still quite close.]
Since this will generate a lot of false positives, and since even legitimate
errors will probably be hard to understand and to fix, we will first land it as a
warning. We could then periodically use crater to gather breakage statistics,
see how fast people are adapting, and decide when to turn it into an error (if we
ever do; the other option being to wait for `macro_rules!` to be replaced by a
new macro system).
This analysis is potentially slow. It is quadratic in the number of arms of a multi-arm macro, since it must compare them pairwise. When encountering a sequence repetition, it becomes exponential in the number of tokens remaining in the other matcher (until the end of that matcher).
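A rough model of where the quadratic factor comes from (a simplified sketch of mine, not the actual implementation): every pair of arms must be checked for obvious disambiguation, here modelled by intersecting per-arm FIRST sets:

```rust
use std::collections::HashSet;

// Simplified model: each arm is represented only by the set of tokens
// that may begin it.
type FirstSet = HashSet<&'static str>;

// Two arms are "obviously disambiguated" when their FIRST sets are disjoint.
fn obviously_disambiguated(a: &FirstSet, b: &FirstSet) -> bool {
    a.is_disjoint(b)
}

// Pairwise comparison of all arms: this double loop is the quadratic part.
fn check_macro(arms: &[FirstSet]) -> Vec<(usize, usize)> {
    let mut ambiguous = Vec::new();
    for i in 0..arms.len() {
        for j in (i + 1)..arms.len() {
            if !obviously_disambiguated(&arms[i], &arms[j]) {
                ambiguous.push((i, j));
            }
        }
    }
    ambiguous
}

fn main() {
    let expr_first: FirstSet = ["ident", "literal", "("].into_iter().collect();
    let fn_first: FirstSet = ["fn"].into_iter().collect();
    let ident_first: FirstSet = ["ident"].into_iter().collect();
    // Only the `expr`/`ident` pair overlaps.
    let report = check_macro(&[expr_first, fn_first, ident_first]);
    assert_eq!(report, vec![(0, 2)]);
}
```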
It’s very conservative: perfectly valid macros will be rejected. Initial results suggest that approximately a third of the regressions are false positives.
Not do this, and either:
- Keep the status quo, that is, leave `macro_rules!` not future-proof.
- Search for another solution, based on giving the algorithm knowledge of the grammar productions (see thoughts here).
Only add this after investigating the FOLLOW set trick (see below).
Another interesting possibility for reducing the number of regressions is to look at the `$($tt:tt)*` pattern. Much like a wildcard, this pattern effectively matches all possible inputs. Therefore, when it’s used in the second arm, it works as a “catch-all”, and it seems to be used like that in most of the cases where we encounter it. For example, take the following macro (adapted from a real-world crate):

```rust
macro_rules! foo {
    (target: $e:expr, $($tt:tt)*) => { ... };
    ($($tt:tt)*) => { ... };
}
```
If `expr` ever starts matching new input, then macro invocations containing the new form will indeed switch from the second arm to the first. So this macro is legitimately not future-proof, and it is reported as such by the algorithm. But clearly, the intention here is that the second arm acts as a “catch-all” rule, and that if new forms are added to `expr`, they should indeed start matching the first arm.
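A runnable version of this pattern (the arm bodies are mine, for illustration) shows the current dispatch between the specific arm and the catch-all:

```rust
macro_rules! dispatch {
    // Specific arm: a `target:` prefix followed by an expression.
    (target: $e:expr, $($tt:tt)*) => { "targeted" };
    // Catch-all arm: matches any sequence of token trees.
    ($($tt:tt)*) => { "catch-all" };
}

fn main() {
    assert_eq!(dispatch!(target: 1 + 1, a b c), "targeted");
    // Input not starting with the literal `target` token falls through.
    assert_eq!(dispatch!(a b c), "catch-all");
}
```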
We could thus add built-in knowledge of this pattern to the algorithm, to specially recognize it and accept a macro such as the above. The main problem is that it moves the responsibility for ensuring that the macro is future-proof from the compiler to the user, who must check that this behaviour is indeed what they intend.
This in turn raises a new unresolved question: where do we recognize this pattern? Do we only recognize it when a whole arm solely consists of a catch-all matcher, as in the `foo` macro above? Or can we also recognize it in the middle of an arm, provided that what comes before is accepted, for example:

```rust
macro_rules! foo {
    (foo bar baz, target: $e:expr, $($tt:tt)*) => { ... };
    (foo bar baz, $($tt:tt)*) => { ... };
}
```
This looks fine but probably has a few corner cases. The same question arises for single-TT catch-alls; for example, can this:

```rust
macro_rules! foo {
    (foo bar ( a lot of complex matchers ) baz) => { ... };
    (foo bar $tt:tt baz) => { ... };
}
```

be accepted too, considering that there is a “local catch-all” for the parenthesized delimited-TT matcher of the first arm (and provided that what comes before and after is unambiguous)? This has a few corner cases as well, again when combined with sequence repetitions, but I think they are easy to address. It still needs a bit of thinking, but it can be added to the analysis afterwards.
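To make the single-TT case concrete, here is a runnable sketch (the matchers are simplified and the arm bodies are mine): a single `$t:tt` in the second arm locally covers the whole parenthesized group that the first arm matches structurally:

```rust
macro_rules! local {
    // First arm: a specific parenthesized shape.
    (foo ( $a:ident, $b:ident ) bar) => { "specific" };
    // Second arm: any single token tree in that position.
    (foo $t:tt bar) => { "catch-all" };
}

fn main() {
    assert_eq!(local!(foo (x, y) bar), "specific");
    // A `[`-delimited group is a single token tree, so it takes the catch-all.
    assert_eq!(local!(foo [1, 2] bar), "catch-all");
}
```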
Another option is to not specially recognize `tt`, but to introduce a new matcher explicitly designed for this purpose, such as `$all`. This is exactly the same, but arguably cleaner, and this way we can be sure that the behaviour we give it is intended in every use case; we can’t be sure of that with `tt`. The problem, though, is that while this gives a simple fix for existing use cases, it will still cause breakage and require user intervention.
Land it as an opt-in analysis pass that must be invoked explicitly through some compiler option. However, this greatly reduces the impact of the analysis, since most users probably won’t invoke it.
Land it as an error. This will cause a lot of breakage.
In order to turn this pre-RFC into a true RFC, we must address the question of what we do with the FIRST sets. Currently, the FIRST sets are defined to be the exact sets of tokens that can start a given non-terminal, as computed from the grammar. So any token that cannot currently start a non-terminal is considered to never be allowed to start it. This means we cannot restrict the FIRST sets, only relax them, by saying “we may want to reserve the possibility of expanding the language in the future, so this token may one day be allowed to start this non-terminal”. This will add more regressions, though few, I think: the T-vs-NT pattern doesn’t represent a huge share of the regressions.

Of course, we could also leave things as they are, but this means that we cannot, for example, add new prefix operators in the future. It’s not up to me to decide whether this restriction is worth considering or not.

That being said, it should be possible to restrict the FIRST sets a bit: currently, whenever a non-terminal can start with an identifier, it is also considered as being allowed to start with any keyword. Maybe we could state that while `expr` (for example) can start with any arbitrary identifier, it will never be allowed to start with certain reserved keywords. However, the current crater results show that only half a dozen crates would benefit from this.
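As an illustration of what this refinement would buy (my own example): the macro below would be flagged today, because `expr` may start with an identifier and its conservative FIRST set therefore also contains every keyword. If `struct` were guaranteed to never start an expression, the two arms would be provably disjoint:

```rust
macro_rules! m {
    // FIRST = { `struct` }: disjoint from FIRST(expr) only if we promise
    // that `struct` will never be allowed to start an expression.
    (struct $i:ident) => { "struct-arm" };
    ($e:expr) => { "expr-arm" };
}

fn main() {
    assert_eq!(m!(struct Foo), "struct-arm");
    assert_eq!(m!(x + 1), "expr-arm");
}
```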
Investigate the trick imagined by @nmatsakis here to be able to recover after a “T vs NT” or “NT vs T” pattern and, presumably, eliminate all remaining false positives. However, I still need to convince myself that there are no tricky cases (especially regarding sequence repetitions) that must be taken care of if we do not want this trick to introduce unsoundness into the algorithm.
Also, this trick relies on specific parser semantics that would have to be implemented. I do not think such changes in the parser implementation would be breaking, but then again, this needs a bit more investigation.
What to do in the future: I think @nrc wants to implement a new system of macros-by-example from scratch, to eventually replace (and deprecate) `macro_rules!`. It’s still unclear whether this new system will actually suffer from the same future-proofing problems as `macro_rules!`. It might just use `tt` or other, simpler matchers instead of complex NT matchers. But if it were to use a similar matching system, it would probably be better to embed this analysis directly into it right from the beginning, as an error, to avoid further breakage. Then again, @nrc is probably the right person to ask.