Pre-RFC: Macro input future-proofing

Available here: https://github.com/cmr/rfcs/blob/macro-future-proofing/text/0000-macro-future-proofing.md Snapshot as of posting:

  • Start Date: 2014-12-21
  • RFC PR: (leave this empty)
  • Rust Issue: (leave this empty)

NOTE: Draft, not finalized.

Key Terminology

  • macro: anything invokable as foo!(...) in source code.
  • MBE: macro-by-example, a macro defined by macro_rules.
  • matcher: the left-hand-side of a rule in a macro_rules invocation.
  • macro parser: the bit of code in the Rust parser that will parse the input using a grammar derived from all of the matchers.
  • NT: non-terminal, the various ā€œmeta-variablesā€ that can appear in a matcher.
  • fragment: The piece of Rust syntax that an NT can accept.
  • fragment specifier: The identifier in an NT that specifies which fragment the NT accepts.
  • language: a context-free language.

Example:

macro_rules! i_am_an_mbe {
    (start $foo:expr end) => ($foo)
}

(start $foo:expr end) is a matcher, $foo is an NT with expr as its fragment specifier.

Summary

Future-proof the allowed forms that input to an MBE can take by requiring certain delimiters following NTs in a matcher. In the future, it will be possible to lift these restrictions backwards compatibly if desired.

Motivation

In current Rust, the macro_rules parser is very liberal in what it accepts in a matcher. This can cause problems, because it is possible to write an MBE which corresponds to an ambiguous grammar. When an MBE is invoked, if the macro parser encounters an ambiguity while parsing, it will bail out with a "local ambiguity" error. As an example for this, take the following MBE:

macro_rules! foo {
    ($($foo:expr)* $bar:block) => (/*...*/)
};

Attempts to invoke this MBE will never succeed, because the macro parser will always emit an ambiguity error rather than make a choice when presented an ambiguity. In particular, it needs to decide when to stop accepting expressions for foo and look for a block for bar (noting that blocks are valid expressions). Situations like this are inherent to the macro system. On the other hand, itā€™s possible to write an unambiguous matcher that becomes ambiguous due to changes in the syntax for the various fragments. As a concrete example:

macro_rules! bar {
    ($in:ty ( $($arg:ident)*, ) -> $out:ty;) => (/*...*/)
};

When the type syntax was extended to include the unboxed closure traits, an input such as FnMut(i8, u8) -> i8; became ambiguous. The goal of this proposal is to prevent such scenarios in the future by requiring certain "delimiter tokens" after an NT. When extending Rustā€™s syntax in the future, ambiguity need only be considered when combined with these sets of delimiters, rather than any possible arbitrary matcher.

Detailed design

The algorithm for recognizing valid matchers M follows. Note that a matcher is merely a token tree. A ā€œsimple NTā€ is an NT without repetitions. That is, $foo:ty is a simple NT but $($foo:ty)+ is not. FOLLOW(NT) is the set of allowed tokens for the given NTā€™s fragment specifier, and is defined below. F is used for representing the separator in complex NTs. In $($foo:ty),+, F would be ,, and for $($foo:ty)+, F would be EOF.

input: a token tree M representing a matcher and a token F

output: whether M is valid

  1. If there are no tokens in M, accept.
  2. For each token T in M:
    1. If T is not an NT, continue.
    2. If T is a simple NT, look ahead to the next token T' in M. If T' is EOF, replace T' with F. If T' is in the set FOLLOW(NT), T' is EOF, T' is any NT, or T' is any identifier, continue. Else, reject.
    3. Else, T is a complex NT.
      1. If T has the form $(...)+ or $(...)*, run the algorithm on the contents with F set to EOF. If it accepts, continue, else, reject.
      2. If T has the form $(...)U+ or $(...)U* for some token U, run the algorithm on the contents with F set to U. If it accepts, continue, else, reject.

This algorithm should be run on every matcher in every macro_rules invocation, with F as EOF. If it rejects a matcher, an error should be emitted and compilation should not complete.

The current legal fragment specifiers are: item, block, stmt, pat, expr, ty, ident, path, meta, and tt.

  • FOLLOW(item) = {}
  • FOLLOW(block) = FOLLOW(expr)
  • FOLLOW(stmt) = FOLLOW(expr)
  • FOLLOW(pat) = {FatArrow, Comma, Pipe}
  • FOLLOW(expr) = {Comma, FatArrow, CloseBrace, CloseParen, Lit} (where Lit is any literal, string or numeric)
  • FOLLOW(ty) = {Comma, Eq, Gt, Lt, RArrow, FatArrow, OpenBrace, OpenParen, CloseBrace, CloseParen}
  • FOLLOW(ident) = any token
  • FOLLOW(path) = any token
  • FOLLOW(meta) = any token
  • FOLLOW(tt) = any token

Note: the FOLLOW sets as given are based on every MBE in the Rust distribution, but should probably be tuned before the RFC is accepted.

Drawbacks

It does restrict the input to a MBE, but the choice of delimiters provides reasonable freedom.

Alternatives

  1. Fix the syntax that a fragment can parse. This would create a situation where a future MBE might not be able to accept certain inputs because the input uses newer features than the fragment that was fixed at 1.0. For example, in the bar MBE above, if the ty fragment was fixed before the unboxed closure sugar was introduced, the MBE would not be able to accept such a type. While this approach is feasible, it would cause unnecessary confusion for future users of MBEs when they canā€™t put certain perfectly valid Rust code in the input to an MBE. Versioned fragments could avoid this problem but only for new code.
  2. Keep macro_rules unstable. Given the great syntactical abstraction that macro_rules provides, it would be a shame for it to be unusable in a release version of Rust. If ever macro_rules were to be stabilized, this same issue would come up.
  3. Do nothing. This is very dangerous, and has the potential to essentially freeze Rustā€™s syntax for fear of accidentally breaking a macro.

Unresolved questions

Are the given FOLLOW sets adequate?

1 Like

The concrete example compiles with small corrections:

macro_rules! bar {
    ($in:ty ( $($arg:ident),* ) -> $out:ty ; ) => ( {} )
};

On the other hand, it's possible to write an unambiguous matcher that becomes ambiguous due to changes in the syntax for the various fragments.

Technically, this matcher is not ambiguous. Said change to the Rust syntax simply has caused the deterministic parser to consume more, and thus reject macro input that used to be accepted.

But that wouldn't be an issue with much better parsers, I imagine. All parsing would have to be nondeterministic. The example would keep working not unlike matching T T with \A(T T?) (T T?)\Z in regexes.

If T' is EOF, replace T' with F.

Is T' set to EOF at the end of a complex NT? Surely T' is not replaced in the matcher?

One alternative approach involves stability for fragment specifiers. Some specifiers, like $_:ident, could be freely used, others, especially $_:ty could stay behind a feature gate.

NB. $example:path matches Fn() -> T and $example:ty matches ::Fn() -> T.

You're right, the CFG the matcher corresponds to, theoretically, is not ambigous.

You're right, technically the CFG that the matcher corresponds to is unambiguous. The parser we use is very dumb in many ways, but in return we get determinism and good parse times. I don't want to trade away the determinism: using macros is already a somewhat difficult process. Making it more difficult for unsuspecting users to understand how their macro works seems like a non-starter.

No, I don't think it is. It is at the end of a macro invocation, which I confused with this case. Instead, it should look for the close delimiter of the token tree.

Indeed, I'll add this alternative to the RFC. It might be a good idea to do that now anyway, to push the full macro NTs off beta or 1.x. I want more time to think about macro matching and the implications of allowing adjacent NTs anyway.

Ah, good catch!

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.