The algorithm would look like something like this:
check_macro {
// conservatively check each arm against each following arm.
// this is too strong but conservative
for (i, arm) in macro_arms.enumerate() {
for arm_ in macro_arms[i + 1 ..] {
test_arms(arm, arm_)
}
}
}
test_arms(arm_A, arm_B) {
let A and B bet the sequence of TTs of the arms arm_A and arm_B, respectively.
need_disambiguation = false
// analyse until one of the cases happen:
// * we find an obvious disambiguation, that is a proof that all inputs that
// matches A will never match B or vice-versa
// * we find a case that is too complex to handle and reject it
// * we reach the end of the macro
while A and B are both not empty {
if A[0] = B[0] {
// here, by `=`, I mean « match exactly the same input, so that,
// for example, $foo:expr = $bar:expr, obviously.
loop with A = A[1..] and B = B[1..]
}
// compute the FIRST sets, that is, the sets of tokens the matcher is
// allowed to start with, including possible future FIRST tokens.
// that is, all token that the matcher is *not* guaranteed never to
// start with.
// it t is a token, the FIRST(t) = { t }, if t is a delimited TT, then
// is the opening delimiter, if t is a NT matcher, its FIRST set is
// pre-computed, and if t is a repeated sequence, it will be computed
// dynamically
compute FIRST(A) and FIRST(B)
if FIRST(A) ∩ FIRST(B) = ∅ {
// at this point, all inputs that match or could amatch A will never
// match B and vice-versa
return Ok
}
// so this case *might* a problem, but we don't know yet, we must continue
// the analysis.
if A or B could match several token trees {
// i.e. A or B is either a repeated sequence or a NT matcher that is
// not tt, ident, or block
// we cannot know where we should continue the analysis
reject the macro
}
// A and B always both match a single TT
match A, B {
Sequence, _ | _, Sequence =>
// cannot happen since sequences are not always a single-TT
bug!()
NT(tt), _ | _, NT(tt) =>
// this is okay for now, either A will always have priority,
// either B will always be unreachable. but search for errors
// further
continue with A = A[1 ..] and B = B[1 ..]
NT(_), NT(_) =>
// this case cannot happen. the only NTs that are a single-TT
// and that are not tt are ident and block, that do not share any
// FIRST token.
bug!()
NT(ident), Token(Ident(_)) | Token(Ident(__), NT(ident) =>
// it's the same as with tt: either A is included, either B is
// unreachable
continue with A = A[1..] and B = B[1..]
NT(ident), Token(Ident(_)) | Token(Ident(__), NT(ident) =>
// the only possible NT here is ident because the only token in
// the FIRST set of block is {, and { is not seen as a token but
// as the beginning of a Delim
// the only possible token is thus an hardcoded identifier or
// keyword. so the only possible case of NT vs. Token is the
// the case above.
bug!()
NT(block), Delim | Delim, NT(block) if Delim.token == Brace
| Delim, Delim =>
// we cannot say much here. we cannot look inside. we
// can just hope we will find an obvious disambiguation later
need_disambiguation = true,
continue with A = A[1..] and B = B[1..]
// cannot happen. either they're the same token or their FIRST sets
// are disjoint.
Token, Token | Token, Delim | Delim, Token => bug!()
}
}
// now we are at the end of one arm:
if need_disambiguation {
// we couldn't find any. we cannot say anything about those arms.
// reject conservatively.
reject the macro
} else {
// either A is strictly included in B and the other inputs that match B
// will never match A, or B is included in or equal to A, which means
// it's unreachable. this is not our problem. accept.
}
}
It could certainly be improved, at the very least by looking inside sequences and delims when we have sequence against sequence or delim against delim. It would require recursion though. We could add that later.