Hello everyone,
This is a follow-up to my previous post about future-proofing,
in which I present the results of my work so far. I’m reaching a point where I
don’t think I can improve the existing analysis much, so I started writing this
pre-RFC about it, in order to get comments from the other members of the core
team on the state of things, what we could still address, and how we
could land it.
It’s far from ready, I think; a lot of corner cases still need to be
pinned down…
- Feature Name: first_sets_future_proofing
- Start Date: 2016-08-09
- RFC PR:
- Rust Issue:
Summary
Extend the restrictions on macro_rules! macros, using the concept of FIRST sets,
to be able to check more complex macros (that is, multiple-arm macros or macros
using sequence repetitions).
Motivation
macro_rules! macros are not future-proof, which means there exist valid programs
in which macro invocations could change behaviour if some grammar productions were
added to the language.
An existing solution has been described and implemented.
It aims at avoiding such macros by restricting, on one side, what macros can and
cannot rely upon and, on the other side, what kinds of grammar productions we can
add.
It relies on the concept of FOLLOW sets: it restricts macro matchers by
specifying which tokens can occur after a non-terminal matcher in a macro pattern.
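For instance, here is the kind of matcher that the FOLLOW-set restriction rejects
today (a made-up minimal example): + is not in the FOLLOW set of expr, which only
contains =>, , and ;.

macro_rules! add(
    // Rejected by the FOLLOW-set check: `+` may not appear right after an
    // `$a:expr` matcher, so that future changes to the expression grammar
    // cannot silently change what this matcher accepts.
    ($a:expr + $b:expr) => ($a + $b);
);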
There are complex problems with this solution that make it
insufficient: it fails to check macros that feature two “parallel” parsing
possibilities, that is:
- Multiple-arm macros
- Single-arm macros with sequence repetitions, where an input could match either
the matchers inside the sequence or what comes after. This is currently what
blocks being able to “backtrack” when parsing macro-rules.
The problem with multiple-arm macros has been known for a long time.
An example of an issue with sequence repetitions can be found here.
Detailed design
This new solution is intended to be added as a complement to the current
solution based on FOLLOW sets, not to replace it.
It uses FIRST sets, that is, for each non-terminal, the set of tokens that are
allowed to begin a sequence matching this non-terminal, now or possibly in the
future (equivalently, by taking the complement with respect to the set of all
Rust tokens, this amounts to specifying the set of tokens that will never be
allowed to start a sequence matching the non-terminal).
Using this, we can look for “obvious disambiguations”, that is, places in the
macro matchers where the languages accepted by each of them (or that may be
accepted in the future) are mutually exclusive (their intersection is empty).
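To give a rough idea of the distinction (made-up examples, not taken from real
crates):

macro_rules! fine(
    // The two arms start with the distinct literal tokens `@` and `#`, so the
    // languages they accept stay disjoint whatever happens to the grammar:
    // the arms are obviously disambiguated.
    (@ $e:expr) => ();
    (# $t:ty) => ();
);

macro_rules! not_fine(
    // An identifier can start an expression, so the FIRST sets of the two
    // arms intersect and no position proves them apart: the analysis has to
    // report this macro as potentially not future-proof.
    ($e:expr) => ();
    ($i:ident and more) => ();
);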
[Insert a description of the algorithm here.
For now, you can look here.
The current version has more special cases built in and a few bugs fixed, but is
still quite close.]
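In the meantime, here is a heavily simplified sketch of the general shape of the
check, with hypothetical types and helper names; the real code also has to handle
sequence repetitions, delimited token trees and the special cases mentioned above:

use std::collections::HashSet;

// A single element of a macro matcher, in this simplified model.
#[derive(PartialEq)]
enum Matcher {
    Token(String),        // a literal token such as `foo`, `@` or `,`
    NonTerminal(String),  // a fragment matcher, identified by its specifier, e.g. "expr"
}

// FIRST set of a matcher: every token that may start it, now or possibly in
// the future. For a literal token this is the token itself; for a
// non-terminal it comes from a table derived from the grammar.
fn first_set(m: &Matcher) -> HashSet<String> {
    match *m {
        Matcher::Token(ref t) => std::iter::once(t.clone()).collect(),
        Matcher::NonTerminal(ref nt) => first_set_of_nt(nt),
    }
}

fn first_set_of_nt(_nt: &str) -> HashSet<String> {
    unimplemented!("looked up in a table derived from the grammar")
}

// Two arms are "obviously disambiguated" if, walking them in lock-step over a
// common prefix of identical matchers, we reach a position where no token
// (present or future) can start both sides.
fn obviously_disambiguated(a: &[Matcher], b: &[Matcher]) -> bool {
    for (ma, mb) in a.iter().zip(b.iter()) {
        if first_set(ma).is_disjoint(&first_set(mb)) {
            return true;
        }
        if ma != mb {
            // Ambiguous position with nothing proving the arms apart:
            // conservatively report the macro.
            return false;
        }
    }
    // One arm is a prefix of the other (or they are identical): ambiguous.
    false
}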
Since this will generate a lot of false positives, and since even legitimate
errors will probably be hard to understand and to fix, we will first land it as a
warning. We could then periodically use crater to gather breakage statistics, to
see how fast people are adapting, and decide when to turn it into an error (if we
ever do, the other option being to wait for macro_rules! to be replaced by
something else).
Drawbacks
- This analysis is potentially slow. It’s quadratic in the number of arms of a
multiple-arm macro, since it must compare them two-by-two. When encountering a
sequence repetition, it becomes exponential in the number of tokens in the
other matcher (up to the end of that matcher).
- It’s very conservative: perfectly valid macros will be rejected. The first
results suggest that approximately a third of the reported regressions are false
positives.
Alternatives
- Do not do this, and either:
  - Keep the status quo, that is, keep macro_rules! not future-proof.
  - Search for another solution, based on giving the algorithm knowledge of the
    grammar productions (see thoughts here).
- Only add this after investigating the FOLLOW set trick (see below).
- Another interesting possibility to reduce the number of regressions is to look
at the $($foo:tt)* pattern. Similar to the _ pattern in match and let,
this pattern effectively matches all possible inputs. Therefore, when it’s
used in the second arm, it works as a “catch-all”, and it seems to be used like
that in most of the cases where we encounter it. For example, take the following
macro (taken from the log crate):
macro_rules! foo(
    (target: $e:expr, $($tt:tt)*) => ...
    ($($tt:tt)*) => ...
);
If expr ever starts matching new input, then macro invocations containing
the new form will indeed switch from the second arm to the first. So this macro
is legitimately not future-proof and is reported as such by the algorithm. But
clearly, the intention here is that the second arm acts as a “catch-all” rule
and that, if new forms are added to expr, they should indeed start
matching the first arm.
We could thus add built-in knowledge of this pattern to the algorithm, to
specially recognize it and accept a macro such as the one above. The main problem
with that is that it moves the responsibility of ensuring that the macro is
future-proof from the compiler to the user, who must now check that such
behaviour is indeed what they intend.
This in turn brings a new unresolved question, which is: where do we recognize
this pattern? Do we only recognize the pattern when a whole arm solely
consists of a catch-all matcher, such as in the foo macro above? Or can we also
recognize it in the middle of an arm, provided that what comes before it is
accepted, for example:
macro_rules! foo(
    (foo bar baz, target: $e:expr, $($tt:tt)*) => ...
    (foo bar baz, $($tt:tt)*) => ...
);
This looks fine but probably has a few corner cases. The same question arises
regarding a single-TT catch-all: for example, can this:
macro_rules! foo(
    (foo bar ( a lot of complex matchers ) baz) => ...
    (foo bar $t:tt baz) => ...
);
be accepted too, considering that the single tt matcher acts as a “local
catch-all” for the parenthesized delimited-TT matcher in the first arm (and
provided that what comes before and after is unambiguous)? This has a few corner
cases too, especially when combined with sequence repetitions, but I think
they’re easy to address. This too needs a bit of thinking; it can still be added
to the analysis afterwards, though.
Another option is not to specially recognize tt but to introduce a new matcher
explicitly designed for this purpose, such as $all. This is exactly the same,
but maybe cleaner, and this way we can be sure that the behaviour we give it is
intended in every use case, which we cannot be sure of with a tt-based catch-all.
The problem, though, is that while it gives a simple, easy fix for existing use
cases, it will still cause breakage and require user intervention.
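For illustration, with such a matcher (writing it as $rest:all here; the exact
name and syntax would of course be up for discussion), the log macro above would
become:

macro_rules! foo(
    // `all` would explicitly declare “match anything, present or future”, so
    // the analysis could accept this pair of arms without special-casing tt
    // repetitions.
    (target: $e:expr, $rest:all) => ...
    ($rest:all) => ...
);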
- Land it as an opt-in analysis pass that must be invoked explicitly with some
compiler option. This, however, greatly reduces the impact of such an analysis,
since most users probably won’t invoke it.
- Land it as an error. This will cause a lot of breakage.
Unresolved questions
- In order to turn this pre-RFC into a true RFC, we must address the question of
what we do with the FIRST sets. Currently, the FIRST sets are defined to be the
exact sets of tokens that can start a given non-terminal, as computed from the
grammar. So any token that cannot currently start a non-terminal is considered
as never being allowed to start it. This means we cannot restrict the FIRST
sets, only relax them, by saying “maybe we will reserve the possibility of
expanding the language in the future by saying that this token may someday be
allowed to start this NT”. This will add more regressions, though few I think:
the “T vs NT” pattern doesn’t represent a huge share of the regressions. Of
course we could also leave things as they are, but this means that we cannot add
new prefix operators in the future, for example. It’s not up to me to decide
whether this is a restriction worth considering or not.
That being said, it should be possible to restrict the FIRST sets a bit:
currently, whenever an NT can start with an identifier, it’s also considered
as being allowed to start with any keyword. Maybe we could state that while
expr (for example) can start with any arbitrary ident, it will never be
allowed to start with some reserved keywords. However, the current crater test
results show that only half a dozen crates could benefit from this.
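As a made-up illustration of the trade-off:

macro_rules! foo(
    // `@` cannot start an expression today, so with FIRST sets computed
    // exactly from the current grammar these two arms are considered
    // unambiguous and the macro is accepted. If we instead reserved `@` as a
    // possible future prefix operator, it would enter the “future” FIRST set
    // of `expr` and this macro would start being flagged.
    (@ $i:ident) => ();
    ($e:expr) => ();
);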
- Investigate the trick imagined by @nmatsakis here to be able to
recover after a “T vs NT” or “NT vs T” pattern and, presumably, eliminate all
remaining false positives. However, I still need to convince myself that there
are no tricky cases (especially regarding sequence repetitions) that should be
taken care of if we do not want this trick to introduce unsoundness into the
algorithm.
Also, this trick relies on specific parser semantics that would have to be
implemented. I do not think such changes in the parser implementation would be
breaking, but then again, this needs a bit more investigation.
- What to do in the future. I think @nrc wants to implement a new system of
macros-by-example from scratch to eventually replace (and deprecate)
macro_rules!. It’s still unclear whether this new system will actually suffer
from the same problems as macro_rules! regarding future-proofing. It might just
use tt or other, simpler systems instead of complex NT-matchers. But if it
were to use a similar matching system, then it would probably be better to
embed this analysis directly in it right from the beginning, as an error, to
avoid more future breakage. But then again, @nrc is probably the right person
to ask.