On macros future-proofing, FOLLOW sets, and related stuff

Great! Yes, I suspected we could find a fix by slightly tweaking the example. So let’s look at this example a bit more:

macro_rules! foo(
    ($x:ident ; $e2:expr) => (...);
    ($e:expr ; $x:ident @ $e2:expr) => (...)
);

For background, the original way I thought to do macro future-proofing – which we moved away from – was to make it part of the parser semantics. Basically, to parse a nonterminal like expr, we would first pull in all token trees up until we find something in the follow set. So e.g. if we were applying the macro above to x ; x @ x, when we come to the expression, we would pull in the tokens x @ x. We would then parse those tokens as an expression and error out if any of them remain unconsumed (in this case, that’d be an error, because @ x is unconsumed). One problem with this strategy is that the errors occur at the macro use-sites, even though in a way the flaw lies with the definition of the macro. (I’m also wondering if it is somehow… “incomplete”, in that there may still be inputs that would be accepted but whose interpretation would change?)
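
To make the use-site error concrete, here is the kind of invocation I have in mind (purely illustrative):

foo!(x ; x @ x);

If I’m reading the matcher behaviour right, today this falls through to the second arm ($e = x, $x = x, $e2 = x). Under the greedy strategy, while trying the first arm we would pull in x @ x for $e2:expr, parse only x, and then report an error at the use-site because @ x is left over, even though a later arm would have matched.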

Anyway, I was hoping to apply the intuition I just described in the previous paragraph to the algorithm, but I admit I’m not 100% sure how to do that. I am doubly convinced we ought to measure the real-world impact of the more conservative strategy. =)

Do we have any concrete examples of recursive macros that will not be possible with the proposed macro!, other than s![]?

Also, I don’t like the reduced expressiveness.

I’m not sure what you mean by “find a fix”. What would you do to analyse and reject this example?

Also, pulling tokens until we find something in the FOLLOW set doesn’t sound like enough: for expr, for example, ; is in the FOLLOW set but can also appear inside an expression. You could add a condition of balancing delimiters (that is, pulling token trees instead of tokens), but even this way I’m not sure it’s enough. For example, for ty, [ is in the FOLLOW set, and some types can begin with [, so you couldn’t parse them at all with this approach. But maybe I’m missing something here.
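
Two concrete cases in ordinary Rust, just to illustrate those two objections (these are my own examples, not taken from the proposal):

let a = { let x = 1; x + 1 };  // `;` is in FOLLOW(expr) but also occurs inside this block expression
let b: [u8; 4] = [0, 1, 2, 3]; // `[` is in FOLLOW(ty), yet the array type itself begins with `[`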

Sorry, I meant...find a problematic case :slight_smile:

I said pulling token trees, not tokens, for this reason.

(Still, I’m not yet sure how to turn that “greedy parse” behavior into some sort of algorithm, or if it would work well. I’m curious to know how far we could get with the simpler, conservative variant.)

I’ll work on it. But maybe we should start by specifying the FIRST sets? That is, determine for each NT the set of tokens it is guaranteed to never begin with. Or should we take the opposite approach: take the existing set of tokens each NT can currently begin with and extend it with the tokens we might want to reserve?

Yeah, definitely, coming up with some FIRST sets is a necessary first step. I think we determined the initial FOLLOW sets at least in part by instrumenting the parser to “observe” the kinds of tokens that could come after an expression (/type/whatever); we might consider doing something similar here.
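
A minimal, self-contained sketch of what that instrumentation could look like (all the types and names below are hypothetical stand-ins, not the libsyntax API): wrap the entry point for the nonterminal of interest, record whichever token the parser sees right after it, and take the union over a bootstrap run plus the test suite.

use std::collections::HashSet;

// Hypothetical stand-ins for the real parser types.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum TokenKind { Semi, Comma, FatArrow, At, OpenBracket /* … */ }

struct Parser { tokens: Vec<TokenKind>, pos: usize }

impl Parser {
    fn parse_expr(&mut self) { /* consume one expression; elided */ }
    fn peek(&self) -> Option<TokenKind> { self.tokens.get(self.pos).cloned() }
}

// Instrumented wrapper: parse as usual, then log the token that follows.
fn parse_expr_observed(p: &mut Parser, observed: &mut HashSet<TokenKind>) {
    p.parse_expr();
    if let Some(tok) = p.peek() {
        observed.insert(tok);
    }
}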

I suggest creating a grammar that describes a superset of all potential future additions to the Rust grammar. With this grammar, every macro invocation can be validated. If a macro arm matches the input using the potential future Rust grammar, and only a subsequent arm matches with the current Rust grammar, emit an error.

The hard part of the implementation is writing a recognizer for meta-variables that will recognize all future Rust code.

For example, to handle type ascription, this should be included in the new grammar:

future_expr ::= expr | future_expr colon future_ty;

Let’s see whether this macro is future-proof:

macro_rules! foo(
    ($e:expr) => (...);
    ($i:ident : $t:ty) => (...)
);
foo!(x: i32)

Now, the first arm matches the input under the future grammar ($e:future_expr), but not under the current grammar ($e:expr), while the second arm ($i:ident : $t:ty) does match under the current grammar. This will give an error.
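
Here is a sketch of that per-invocation rule spelled out as code (Arm, TokenStream and Grammar are hypothetical types; this is just the decision procedure, not an actual implementation):

enum Grammar { Current, Future }

struct Arm;
struct TokenStream;

impl Arm {
    // Hypothetical: does this arm's pattern match `input` under `grammar`?
    fn matches(&self, _input: &TokenStream, _grammar: Grammar) -> bool {
        unimplemented!()
    }
}

// Reject an invocation when an earlier arm would match under the extended
// "future" grammar while only a later arm matches under today's grammar.
fn check_invocation(arms: &[Arm], input: &TokenStream) -> Result<(), String> {
    for (i, arm) in arms.iter().enumerate() {
        let future_only = arm.matches(input, Grammar::Future)
            && !arm.matches(input, Grammar::Current);
        if future_only
            && arms[i + 1..].iter().any(|later| later.matches(input, Grammar::Current))
        {
            return Err("this invocation is not future-proof".to_string());
        }
    }
    Ok(())
}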

That’s really not possible… There are just too many extension possibilities; we cannot anticipate everything. It may even involve new tokens.

Hey!

After a while (working on other things), I’m now working on implementing the conservative analysis.

I started by instrumenting the Rust LALR grammar (in src/grammar) to determine the current FIRST sets. I chose the LALR grammar over the libsyntax parser because it was easier to instrument; though it’s not up to date, @pnkfelix and I agreed that, for the purpose of experimenting with this solution, it was for now far easier to fix the results by hand than to update the grammar.

Here are the results:

FIRST(ty)

'[' EXTERN IDENT '*' FOR BOX '(' UNDERSCORE FN PROC SELF '&' '<' MOD_SEP
SHL ANDAND TYPEOF UNSAFE OROR '|'

FIRST(expr)

CONTINUE RETURN TRUE '[' '<' SHL LIT_CHAR BREAK MOD_SEP FALSE BOX
LOOP WHILE STATIC_LIFETIME '*' '(' '!' '|' OROR ANDAND LIT_BYTE '-' LIT_FLOAT
LIFETIME LIT_INTEGER MATCH FOR LIT_STR DOTDOT MOVE LIT_BYTE_STR IF
'&' PROC SELF LIT_BYTE_STR_RAW LIT_STR_RAW UNSAFE '{' IDENT

FIRST(stmt)

MOD_SEP BREAK SELF '#' SHL LIT_INTEGER LIT_BYTE_STR_RAW
STATIC_LIFETIME '(' LIT_FLOAT MOVE ';' WHILE LOOP STATIC LIT_BYTE USE
TYPE '*' FN MATCH '-' DOTDOT IDENT CONST TRAIT FALSE STRUCT '!' TRUE
LIT_CHAR LET PROC LIFETIME '[' PUB EXTERN ENUM CONTINUE BOX FOR '{'
LIT_STR_RAW '|' OROR RETURN OUTER_DOC_COMMENT '<' '&' IF UNSAFE
ANDAND LIT_BYTE_STR LIT_STR MOD IMPL

FIRST(block)

'{'

FIRST(item)

MOD_SEP USE CONST IMPL IDENT ENUM SELF UNSAFE TRAIT TYPE MOD
EXTERN STATIC STRUCT FN

FIRST(pat)

SELF '&' '<' LIT_BYTE_STR_RAW LIT_BYTE MOD_SEP LIT_STR LIT_CHAR
ANDAND BOX TRUE MUT IDENT REF LIT_BYTE_STR LIT_FLOAT '[' SHL '-'
LIT_STR_RAW LIT_INTEGER '(' UNDERSCORE FALSE

FIRST(path_expr)

SELF MOD_SEP IDENT

FIRST(meta_item)

IDENT

Unfortunately I cannot do much with only the current FIRST sets. Now we need to make some guarantees: that is, define the set of tokens that each NT will never be allowed to start with, and expand the computed FIRST sets to include everything else.
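
For illustration only (this does not imply any actual reservation decision): if we guaranteed, say, that ty will never be allowed to start with FAT_ARROW or ';', then the set used by the analysis would be the complement of that guarantee, i.e.

FIRST_conservative(ty) = AllTokens ∖ { FAT_ARROW, ';' }

which is every token except the explicitly reserved ones, rather than the observed FIRST(ty) listed above.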

cc @nikomatsakis

The algorithm would look something like this:

check_macro {
    // conservatively check each arm against each following arm.
    // this is stronger than necessary, but conservative
    for (i, arm) in macro_arms.enumerate() {
        for arm_ in macro_arms[i + 1 ..] {
            test_arms(arm, arm_)
        }
    }
}

test_arms(arm_A, arm_B) {
    let A and B be the sequences of TTs of the arms arm_A and arm_B, respectively.
    need_disambiguation = false

    // analyse until one of these cases happens:
    // * we find an obvious disambiguation, that is, a proof that any input
    //   that matches A will never match B, or vice-versa
    // * we find a case that is too complex to handle and reject it
    // * we reach the end of the macro
    while A and B are both not empty {
        if A[0] = B[0] {
            // here, by `=`, I mean « match exactly the same input », as, for
            // example, $foo:expr = $bar:expr obviously does.
            loop with A = A[1..] and B = B[1..]
        }

        // compute the FIRST sets, that is, the sets of tokens the matcher is
        // allowed to start with, including possible future FIRST tokens.
        // that is, all tokens that the matcher is *not* guaranteed never to
        // start with.
        // if t is a token, then FIRST(t) = { t }; if t is a delimited TT, then
        // FIRST(t) is its opening delimiter; if t is a NT matcher, its FIRST
        // set is pre-computed; and if t is a repeated sequence, it is computed
        // dynamically
        compute FIRST(A) and FIRST(B)
        if FIRST(A) ∩ FIRST(B) = ∅ {
            // at this point, any input that matches or could match A will
            // never match B, and vice-versa
            return Ok
        }

        // so this case *might* be a problem, but we don't know yet; we must
        // continue the analysis.

        if A or B could match several token trees {
            // i.e. A or B is either a repeated sequence or a NT matcher that is
            // not tt, ident, or block
            // we cannot know where we should continue the analysis
            reject the macro
        }

        // A and B always both match a single TT
        match A, B {
            Sequence, _ | _, Sequence =>
                // cannot happen, since sequences are not always a single TT
                bug!()

            NT(tt), _ | _, NT(tt) =>
                // this is okay for now: either A will always have priority,
                // or B will always be unreachable. but keep searching for
                // errors further on
                continue with A = A[1 ..] and B = B[1 ..]

            NT(_), NT(_) =>
                // this case cannot happen: the only NTs that are always a
                // single TT and that are not tt are ident and block, which do
                // not share any FIRST token.
                bug!()

            NT(ident), Token(Ident(_)) | Token(Ident(_)), NT(ident) =>
                // it's the same as with tt: either A is included, or B is
                // unreachable
                continue with A = A[1..] and B = B[1..]

            NT(_), Token(_) | Token(_), NT(_) =>
                // the only possible NT here is ident because the only token in
                // the FIRST set of block is {, and { is not seen as a token but
                // as the beginning of a Delim
                // the only possible token is thus a hardcoded identifier or
                // keyword. so the only possible case of NT vs. Token is the
                // case above.
                bug!()

            NT(block), Delim(Brace) | Delim(Brace), NT(block)
            | Delim, Delim =>
                // we cannot say much here. we cannot look inside. we
                // can just hope we will find an obvious disambiguation later
                need_disambiguation = true,
                continue with A = A[1..] and B = B[1..]

            // cannot happen. either they're the same token or their FIRST sets
            // are disjoint.
            Token, Token | Token, Delim | Delim, Token => bug!() 
        }
    }

    // now we are at the end of one arm:
    if need_disambiguation {
        // we couldn't find any. we cannot say anything about those arms.
        // reject conservatively.
        reject the macro
    } else {
        // either A is strictly included in B and the inputs that match B but
        // not A will never match A, or B is included in or equal to A, which
        // means B is unreachable. that is not our problem here. accept.
    }
}

It could certainly be improved, at the very least by looking inside sequences and delims when we have sequence against sequence or delim against delim. It would require recursion though. We could add that later.
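
For example, applied to the macro from the top of this thread (and assuming IDENT stays in the conservative FIRST set of expr), the analysis would reject it: FIRST($x:ident) and FIRST($e:expr) intersect on IDENT, and $e:expr can match several token trees, so we bail out right at the first matcher.

macro_rules! foo(
    ($x:ident ; $e2:expr) => (...);
    ($e:expr ; $x:ident @ $e2:expr) => (...)
);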

I still think this alternative works. I didn't mean to literally write down all extension possibilities. Today's FOLLOW sets can be generalized to "FOLLOW rules" which come naturally from extended grammar rules.

Example

expr has a small FOLLOW set, FOLLOW(expr) = {FatArrow, Comma, Semicolon}. We add the following rules to the Rust grammar, with ngExpr as the superset of potential future exprs:

ngExpr ::= expr;
ngExpr ::= expr !followExpr AnyToken* => ERROR;
followExpr ::= FatArrow | Comma | Semicolon;

Or, equivalently, we can add these rules without the use of negation, by enumerating all tokens that currently do not belong in expr's FOLLOW set:

ngExpr ::= expr;
ngExpr ::= expr cannotFollowExpr AnyToken* => ERROR;
cannotFollowExpr ::= Colon | Delimiter | Lit | BinOp | At | Dot | DotDot | DotDotDot | ModSep | RArrow | LArrow | Pound | Dollar | Question | Ident | Underscore | /* perhaps more… */;

Every part of this rule can be improved incrementally. For example, let's narrow down ngExpr to reduce the number of false positives it catches. Consider type ascription syntax:

expr ::= expr Colon ty;

After adding this grammar rule, Colon can be removed from cannotFollowExpr. The same applies to every future addition to the Rust grammar.

I think FIRST sets can be generalized with similar rules.

edit: clarification.

Implementation

Of course, the $…:expr nonterminal should be wired to ngExpr, whereas the Rust parser will continue to use the same grammar rules (expr etc.) as it does today.

Future-proofing errors will show up on a per-invocation basis rather than as a result of static analysis of a macro declaration. (Unfortunately, reporting errors at invocations might be a serious disadvantage.)


Adding new tokens

I overlooked potential future tokens when talking about potential future additions to the grammar. We can ignore new tokens, can't we? Today's macro invocations can't use tokens that don't exist yet. In the future, new tokens and new future-proofing rules would be enabled at the same time.

This is interesting! But don’t forget that:

  • This still requires building an entire LR automaton and running a full ambiguity analysis at compile time for every macro invocation. This is incredibly costly.
  • The current solution based on the approximation of FIRST sets exists and is implemented. It “works”. It yields a lot of false positives, but I’m quite confident that it’s sound. Maybe we could improve it in the future to make it more precise, using grammar-level analysis, but for now I think the priority is to land this conservative analysis…
