On macros future-proofing, FOLLOW sets, and related stuff

Is this the case you were alluding to when you mentioned strengthening?

Strong +1 to this. It would be very helpful for gaining a good catalog of examples.

Yes, but I don’t think it’s feasible. Consider the following case:

macro_rules! foo(
    ($e:expr) => ...;
    ($x:ident @ $e:expr) => ...
);

This macro needs to be rejected, or at least warned about, because as the definition of expr changes, inputs that match the second arm will start matching the first. So far, it’s simple. Let’s see what happens when we swap the arms:

macro_rules! foo(
    ($x:ident @ $e:expr) => ...;
    ($e:expr) => ...
);

The reasoning is to accept this because the first arm is more specific, so no input that matches the second arm should start matching the first if it doesn’t already do so. In this specific case it looks easy. But in the general case we just can’t accept that macro, because there could be ambiguities in the rest of the matcher. Consider:

macro_rules! foo(
    ($x:ident @ $e:expr ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

The problem here is quite obvious. If we stop at the first “more specific token in the first arm versus more general NT in the second arm” case, we can miss possible ambiguities later in the matcher. In this case it looks like the thing to do is to skip until the ‘;’ and then continue the analysis. But is this possible in the general case? There won’t always be a delimiter token to help us, and we can’t know how many tokens we need to skip in the first arm. Furthermore, there could be overlapping NTs, etc. The general case is just too hard for such a simple approach.

In principle, macros where a latter arm can match inputs that are also matched by a former arm can be OK. The problem is when input matched by the latter but not by the former could start to match the former if the syntax is expanded.

So let’s dig into this example a bit more. Maybe you need to spell things out a bit for me. As I look at this, I’m not 100% sure where the problem is arising:

macro_rules! foo(
    ($x:ident @ $e:expr ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

In particular, what is the input that will change and cause things to continue differently? Presumably the idea is that there is some input which today triggers the second arm, but when the definition of expr is expanded to include @ will trigger the first arm, right?

(I had a post with some other thoughts, but I realized I was having trouble making an actual example that caused a problem here, though I see the abstract problem you are concerned with.)

Yeah, right. And that’s why the first example (without the semicolons, with the expr arm first) should be disallowed. In fact it was taken from the running example that triggered this whole issue, which used type ascription. Now that type ascription is actually landing, it’s no longer a future-proofing example, so I changed the example to use @ instead of :, but the example remains the same.

macro_rules! foo(
    ($e:expr) => ...;
    ($x:ident @ $e:expr) => ...
);

Let’s invoke the macro with foo!(x @ x). If expr ::= ident @ expr is ever added as a production to the language, we have a problem. That problem can be solved by swapping the arms, because the erroneous input now matches the first arm, both before and after the extension of the language.
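
For concreteness, here is a minimal sketch of the swapped-arm version, with invented bodies (the originals are elided as ...); the comments assume the hypothetical expr ::= ident @ expr production:

macro_rules! foo(
    ($x:ident @ $e:expr) => ( "first arm" );
    ($e:expr) => ( "second arm" )
);

fn main() {
    // `x @ x` matches the first arm today, and it would still match the
    // first arm if `expr ::= ident @ expr` were ever added, so the
    // invocation cannot silently change meaning.
    println!("{}", foo!(x @ x));
}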

Now for the third example, the one you are talking about:

macro_rules! foo(
    ($x:ident @ $e:expr ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

Mhhhh I guess I got confused at some point, because I can’t find an example that triggers a problem anymore, and thinking about it I believe there is none; I shouldn’t have used the same “hypothetical future syntax” before and after the semicolon. But I think I can find an example of a macro such that, whatever order you put the arms in, you have an error, because the “first half” (before the semicolon) of the first arm is more specific than the “first half” of the second arm, but it’s the opposite for the “second halves”. For example:

macro_rules! foo(
    ($x:ident ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

Here foo!(x ; x @ x) can change arms if we add the same production as above to the grammar. OK, it’s not exactly the same example, but from the point of view of the algorithm it’s still a “more specific token in the first arm versus more general NT in the second arm” case. And to reject it we need to continue the analysis after the semicolons, which I think is not possible in the general case, for the reasons I gave in my previous post.
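
To make the arm change concrete, here is a sketch with invented bodies; the comments again assume the hypothetical production:

macro_rules! foo(
    ($x:ident ; $e2:expr) => ( "first arm" );
    ($e:expr ; $x:ident @ $e2:expr) => ( "second arm" )
);

fn main() {
    // Today this prints "second arm": `x @ x` is not an expr, so the first
    // arm fails with `@ x` left over. If `expr ::= ident @ expr` were
    // added, `x @ x` would parse as a single expr and the first arm would
    // match instead, silently changing the expansion.
    println!("{}", foo!(x ; x @ x));
}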

Great! Yes, I suspected we could find a fix by slightly tweaking the example. So let’s look at this example a bit more:

macro_rules! foo(
    ($x:ident ; $e2:expr) => ...;
    ($e:expr ; $x:ident @ $e2:expr) => ...
);

For background, the original way I thought to do macro future-proofing – which we moved away from – was to make it part of the parser semantics. Basically, to parse a nonterminal like expr, we would first pull in all token trees up until we find something in the follow set. So e.g. if we were applying the macro above to x ; x @ x, when we come to the expression, we would pull in the tokens x @ x. We would then parse those tokens as an expression and error out if any of them remain unconsumed (in this case, that would be an error, because @ x is unconsumed). One problem with this strategy is that the errors occur at the macro use-sites, even though in a way the flaw is with the definition of the macro. (I’m also wondering if it is somehow… “incomplete”, in that there may still be inputs that would be accepted but would change interpretation?)
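
A minimal sketch of that use-site strategy, where TokenTree, Expr, is_in_follow_set, and parse_expr are hypothetical stand-ins for the parser internals:

// Greedily pull token trees up to the first FOLLOW(expr) token, then
// require the expression parser to consume all of them.
fn parse_nt_expr_greedy(input: &[TokenTree]) -> Result<Expr, String> {
    let pulled: Vec<TokenTree> = input
        .iter()
        .take_while(|&tt| !is_in_follow_set(tt))
        .cloned()
        .collect();

    // Parse the pulled tokens as a single expression...
    let (expr, consumed) = parse_expr(&pulled)?;

    // ...and error out at the use-site if any remain unconsumed
    // (e.g. `@ x` when the input was `x @ x`).
    if consumed < pulled.len() {
        return Err("tokens left over after parsing the expression".to_string());
    }
    Ok(expr)
}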

Anyway, I was hoping to apply the intuition I just described in the previous paragraph to the algorithm, but I admit I’m not 100% sure how to do that. I am doubly convinced we ought to measure the real-world impact of the more conservative strategy. =)

Do we have any concrete examples of recursive macros that will not be possible with the proposed macro!, other than s![]?

Also, I don’t like the reduced expressiveness.

I’m not sure what you mean by “find a fix”. What would you do to analyse and reject this example?

Also, pulling tokens until we find something in the FOLLOW set doesn’t sound sufficient: for expr, for example, ; is in the FOLLOW set but can also appear inside an expression. You could add a condition of balancing delimiters (that is, pulling token trees instead), but even this way I’m not sure it’s enough. For example, for ty, [ is in the FOLLOW set, and some types can begin with [, so you couldn’t parse them at all with this approach. But maybe I’m missing something here.
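
A hedged illustration of the ty point (takes_ty is an invented macro): the very first token tree of the input below opens with [, which is in FOLLOW(ty), so a strategy that stops pulling at a FOLLOW token would pull nothing at all.

macro_rules! takes_ty(
    ($t:ty) => ( stringify!($t) );
);

fn main() {
    // `[u8; 4]` is a perfectly good type, yet it begins with `[`.
    println!("{}", takes_ty!([u8; 4]));
}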

Sorry, I meant...find a problematic case :slight_smile:

I said pulling token trees, not tokens, for this reason.

(Still, I’m not yet sure how to turn that “greedy parse” behavior into some sort of algorithm, or if it would work well. I’m curious to know how far we could get with the simpler, conservative variant.)

I’ll work on it. But maybe we should start by specifying the FIRST sets? That is, determine for each NT the set of tokens it’s guaranteed to never begin with. Or should we take the opposite approach: take the existing set of tokens the NTs can begin with right now and extend it with the tokens we might want to reserve?

Yeah, definitely, coming up with some FIRST sets is a necessary first step. I think we determined the initial FOLLOW sets at least in part by instrumenting the parser to “observe” the kinds of tokens that could come after an expression (/type/whatever); we might consider doing something similar here.

I suggest creating a grammar that describes a superset of all potential future additions to the Rust grammar. With this grammar, every macro invocation can be validated. If a macro arm matches the input using the potential future Rust grammar, and only a subsequent arm matches with the current Rust grammar, emit an error.

The hard part of the implementation is writing a recognizer for meta-variables that will recognize all future Rust code.

For example, to handle type ascription, this should be included in the new grammar:

future_expr ::= expr | future_expr colon future_ty;

Let’s see whether this macro is future-proof:

macro_rules! foo(
    ($e:expr) => (...);
    ($i:ident : $t:ty) => (...)
)
foo!(x: i32)

Now, $e:future_expr matches against the input, but $e:expr does not, and $i:ident : $t:ty does. This will give an error.
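
As a sketch, the per-invocation rule could read like this, with a hypothetical Arm type and hypothetical matches_current_grammar / matches_future_grammar predicates (not an actual libsyntax API):

fn check_invocation(arms: &[Arm], input: &TokenStream) -> Result<(), String> {
    // The arm the invocation uses under today's grammar.
    let current = arms.iter().position(|arm| matches_current_grammar(arm, input));
    // The first arm that matches under the superset ("future") grammar.
    let future = arms.iter().position(|arm| matches_future_grammar(arm, input));

    match (current, future) {
        // An earlier arm could start matching after a grammar extension,
        // silently changing the invocation's meaning: emit an error.
        (Some(c), Some(f)) if f < c =>
            Err("invocation could change arms under a future grammar".to_string()),
        _ => Ok(()),
    }
}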

That’s really not possible… There are just too many extension possibilities; we cannot anticipate everything. It may even involve new tokens.

Hey!

After a while (working on other things), I’m now working on implementing the conservative analysis.

I started by instrumenting the Rust LALR grammar (in src/grammar) to determine the current FIRST sets. I chose the LALR grammar over the parser of libsyntax because it was easier to instrument, and though it’s not up to date, @pnkfelix and I agreed that, for the purpose of experimenting with this solution, it was for now much easier to fix the results by hand than to update it.

Here are the results:

FIRST(ty)

'[' EXTERN IDENT '*' FOR BOX '(' UNDERSCORE FN PROC SELF '&' '<' MOD_SEP
SHL ANDAND TYPEOF UNSAFE OROR '|'

FIRST(expr)

CONTINUE RETURN TRUE '[' '<' SHL LIT_CHAR BREAK MOD_SEP FALSE BOX
LOOP WHILE STATIC_LIFETIME '*' '(' '!' '|' OROR ANDAND LIT_BYTE '-' LIT_FLOAT
LIFETIME LIT_INTEGER MATCH FOR LIT_STR DOTDOT MOVE LIT_BYTE_STR IF
'&' PROC SELF LIT_BYTE_STR_RAW LIT_STR_RAW UNSAFE '{' IDENT

FIRST(stmt)

MOD_SEP BREAK SELF '#' SHL LIT_INTEGER LIT_BYTE_STR_RAW
STATIC_LIFETIME '(' LIT_FLOAT MOVE ';' WHILE LOOP STATIC LIT_BYTE USE
TYPE '*' FN MATCH '-' DOTDOT IDENT CONST TRAIT FALSE STRUCT '!' TRUE
LIT_CHAR LET PROC LIFETIME '[' PUB EXTERN ENUM CONTINUE BOX FOR '{'
LIT_STR_RAW '|' OROR RETURN OUTER_DOC_COMMENT '<' '&' IF UNSAFE
ANDAND LIT_BYTE_STR LIT_STR MOD IMPL

FIRST(block)

'{'

FIRST(item)

MOD_SEP USE CONST IMPL IDENT ENUM SELF UNSAFE TRAIT TYPE MOD
EXTERN STATIC STRUCT FN

FIRST(pat)

SELF '&' '<' LIT_BYTE_STR_RAW LIT_BYTE MOD_SEP LIT_STR LIT_CHAR
ANDAND BOX TRUE MUT IDENT REF LIT_BYTE_STR LIT_FLOAT '[' SHL '-'
LIT_STR_RAW LIT_INTEGER '(' UNDERSCORE FALSE

FIRST(path_expr)

SELF MOD_SEP IDENT

FIRST(meta_item)

IDENT

Unfortunately I cannot do much with only the current FIRST sets. Now we need to make some guarantees: that is, define the sets of tokens that each NT will never be allowed to start with, and expand the computed FIRST sets to include everything else.
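
As a rough sketch of that representation (the names and the concrete guarantees below are illustrative, not decided):

use std::collections::HashSet;

// Per NT, record the tokens it is *guaranteed* never to begin with;
// every other token is a possible current or future FIRST token.
struct NeverFirst(HashSet<&'static str>);

impl NeverFirst {
    fn may_begin(&self, token: &str) -> bool {
        // a token is in the future-proofed FIRST set unless it is
        // explicitly reserved never to start this NT
        !self.0.contains(token)
    }
}

fn main() {
    // Hypothetical guarantee: an expr will never begin with `;` or `=>`.
    let expr = NeverFirst([";", "=>"].iter().cloned().collect());
    assert!(expr.may_begin("@"));  // `@` might someday start an expr
    assert!(!expr.may_begin(";")); // `;` is reserved never to
}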

cc @nikomatsakis

The algorithm would look something like this:

check_macro {
    // conservatively check each arm against each following arm.
    // this is stronger than necessary, but conservative
    for (i, arm) in macro_arms.enumerate() {
        for arm_ in macro_arms[i + 1 ..] {
            test_arms(arm, arm_)
        }
    }
}

test_arms(arm_A, arm_B) {
    let A and B be the sequences of TTs of arm_A and arm_B, respectively
    need_disambiguation = false

    // analyse until one of the cases happen:
    // * we find an obvious disambiguation, that is a proof that all inputs that
    //   matches A will never match B or vice-versa
    // * we find a case that is too complex to handle and reject it
    // * we reach the end of the macro
    while A and B are both not empty {
        if A[0] = B[0] {
            // here, by `=`, I mean "match exactly the same input", so that,
            // for example, $foo:expr = $bar:expr, obviously
            loop with A = A[1..] and B = B[1..]
        }

        // compute the FIRST sets, that is, the sets of tokens the matcher is
        // allowed to start with, including possible future FIRST tokens.
        // that is, all token that the matcher is *not* guaranteed never to
        // start with.
        // if t is a token, then FIRST(t) = { t }; if t is a delimited TT,
        // FIRST(t) is its opening delimiter; if t is an NT matcher, its
        // FIRST set is pre-computed; and if t is a repeated sequence, it is
        // computed dynamically
        compute FIRST(A) and FIRST(B)
        if FIRST(A) ∩ FIRST(B) = ∅ {
            // at this point, all inputs that match or could match A will never
            // match B and vice-versa
            return Ok
        }

        // so this case *might* be a problem, but we don't know yet; we must continue
        // the analysis.

        if A or B could match several token trees {
            // i.e. A or B is either a repeated sequence or a NT matcher that is
            // not tt, ident, or block
            // we cannot know where we should continue the analysis
            reject the macro
        }

        // A and B always both match a single TT
        match A, B {
            Sequence, _ | _, Sequence =>
                // cannot happen since sequences are not always a single-TT
                bug!()

            NT(tt), _ | _, NT(tt) =>
                // this is okay for now: either A will always have priority,
                // or B will always be unreachable. but keep searching for
                // errors further on
                continue with A = A[1 ..] and B = B[1 ..]

            NT(_), NT(_) =>
                // this case cannot happen: the only NTs that always match a
                // single TT and that are not tt are ident and block, which do
                // not share any FIRST token.
                bug!()

            NT(ident), Token(Ident(_)) | Token(Ident(_)), NT(ident) =>
                // it's the same as with tt: either A is included, or B is
                // unreachable
                continue with A = A[1..] and B = B[1..]

            NT(_), Token(_) | Token(_), NT(_) =>
                // any other NT vs. token case cannot happen: the only token
                // in the FIRST set of block is {, and { is not seen as a
                // token but as the beginning of a Delim, so the only possible
                // NT here is ident, and the only possible token is then a
                // hardcoded identifier or keyword, which is exactly the case
                // above.
                bug!()

            NT(block), Delim | Delim, NT(block) if Delim.token == Brace
            | Delim, Delim =>
                // we cannot say much here. we cannot look inside. we
                // can just hope we will find an obvious disambiguation later
                need_disambiguation = true,
                continue with A = A[1..] and B = B[1..]

            // cannot happen. either they're the same token or their FIRST sets
            // are disjoint.
            Token, Token | Token, Delim | Delim, Token => bug!() 
        }
    }

    // now we are at the end of one arm:
    if need_disambiguation {
        // we couldn't find any. we cannot say anything about those arms.
        // reject conservatively.
        reject the macro
    } else {
        // either A is strictly included in B and the other inputs that match B
        // will never match A, or B is included in or equal to A, which means
        // it's unreachable. this is not our problem. accept.
    }
}

It could certainly be improved, at the very least by looking inside sequences and delims when we have sequence against sequence or delim against delim. It would require recursion, though. We could add that later.
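
A sketch of what that recursion could look like, with a hypothetical TokenTree shape; test_arms is the routine from the pseudocode above:

fn test_delims(a: &TokenTree, b: &TokenTree) -> Analysis {
    match (a, b) {
        // same delimiter on both sides: recurse into the delimited
        // contents and compare them like two small arms
        (TokenTree::Delimited(da, inner_a), TokenTree::Delimited(db, inner_b))
            if da == db =>
            test_arms(inner_a, inner_b),
        // otherwise keep the conservative behavior and hope for an
        // obvious disambiguation later in the matcher
        _ => Analysis::NeedDisambiguation,
    }
}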

I still think this alternative works. I didn’t mean we should literally write down all extension possibilities. Today’s FOLLOW sets can be generalized to “FOLLOW rules”, which come naturally from extended grammar rules.

Example

expr has a small FOLLOW set, FOLLOW(expr) = {FatArrow, Comma, Semicolon}. We add the following rules to the Rust grammar, with ngExpr as the superset of potential future exprs:

ngExpr ::= expr;
ngExpr ::= expr !followExpr AnyToken* => ERROR;
followExpr ::= FatArrow | Comma | Semicolon;

Or, equivalently, we can add these rules without the use of negation, by enumerating all tokens that currently do not belong in expr's FOLLOW set:

ngExpr ::= expr;
ngExpr ::= expr cannotFollowExpr AnyToken* => ERROR;
cannotFollowExpr ::= Colon | Delimiter | Lit | BinOp | At | Dot | DotDot | DotDotDot | ModSep | RArrow | LArrow | Pound | Dollar | Question | Ident | Underscore | /* perhaps more… */;

Every part of this rule can be improved incrementally. For example, let's narrow down ngExpr to reduce the number of false positives it catches. Consider type ascription syntax:

expr ::= expr Colon ty;

After adding this grammar rule, Colon can be removed from cannotFollowExpr. The same applies to every future addition to the Rust grammar.

I think FIRST sets can be generalized with similar rules.


Implementation

Of course, the $…:expr nonterminal should be wired to ngExpr, whereas the Rust parser will continue to use the same grammar rules (expr etc.) as it does today.

Future-proofing errors will show up on a per-invocation basis rather than as a result of static analysis of a macro declaration. (Unfortunately, reporting errors at invocations might be a serious disadvantage.)


Adding new tokens

I overlooked potential future tokens when talking about potential future additions to the grammar. We can ignore new tokens, can't we? Today's macro invocations can't use tokens that don't exist yet. In the future, new tokens and new future-proofing rules would be enabled at the same time.

This is interesting! But don’t forget that:

  • This still requires building an entire LR automaton and running a full ambiguity analysis at compile time for every macro invocation. This is incredibly costly.
  • The current solution based on the approximation of FIRST sets exists and is implemented. It “works”. It yields a lot of false positives, but I’m quite confident that it’s sound. Maybe we could improve it in the future to make it more precise, using grammar-level analysis, but for now I think the priority is to land this conservative analysis…
