Maybe “ad-hoc” isn’t the right phrase, but because only a subset of AST productions are counted when enumerating substitution targets, that subset has to be known to any tool processing mangled names. That seems like unnecessary complexity to me.
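To make that concern concrete, here is a minimal sketch of how the indices drift when two tools disagree about which productions are counted. The node kinds and the function `substitution_targets` are hypothetical, made up for illustration; they are not taken from the proposal or the reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # hypothetical production name
    children: list = field(default_factory=list)


def substitution_targets(root, counted_kinds):
    """Pre-order walk; only nodes whose kind is counted get an index."""
    targets = []

    def visit(node):
        if node.kind in counted_kinds:
            targets.append(node)
        for child in node.children:
            visit(child)

    visit(root)
    return targets


# A toy tree: path -> [ident, generic_args -> [type -> [ident]]]
tree = Node("path", [
    Node("ident"),
    Node("generic_args", [Node("type", [Node("ident")])]),
])

# Assumption: the encoder counts only these two productions.
encoder_kinds = {"path", "type"}
# A tool that wrongly believes identifiers are counted as well.
other_tool_kinds = {"path", "ident", "type"}

encoder_view = [n.kind for n in substitution_targets(tree, encoder_kinds)]
tool_view = [n.kind for n in substitution_targets(tree, other_tool_kinds)]
```

In this sketch, index 1 refers to the `type` node for the encoder but to an `ident` node for the other tool, so the same backreference decodes differently unless both sides share exactly the same list of counted productions.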
The reference implementation I’m working on uses an AST and defines compression in terms of the AST. So I think we should be covered here?
Sure, as long as it’s clearly specified.
The current implementation does not need more than 1 character of look-ahead.
Great! I hope you have a way to ensure that remains true as the specification evolves.
Can you describe this in more detail?
No, you’re probably right that referring to AST nodes is a better option so you don’t have to encode a length.
OTOH a downside of AST nodes is you need to have a clear processing model that precisely describes what kinds of nodes are allowed in what contexts … and those constraints need to be faithfully implemented. E.g. if you have an AST node <ident>, and a substitution references that node where you’re expecting an <expression>, depending on how your AST works you may or may not promote that <ident> to the right variant of your Expression AST node. Token or string backreferences would have the appealing property that that’s all handled by existing code in the parser.
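A rough sketch of the difference, using hypothetical `Ident` and `PathExpr` node types and a stand-in `parse_expression` (none of these names come from the reference implementation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ident:
    name: str


@dataclass(frozen=True)
class PathExpr:
    """Hypothetical Expression variant that wraps a bare identifier."""
    ident: Ident


def resolve_node_backref(table, index):
    """AST-node backreference used where an expression is expected.

    The promotion from Ident to PathExpr has to be written (and
    specified) explicitly; forgetting it yields a wrongly typed tree.
    """
    node = table[index]
    if isinstance(node, PathExpr):
        return node
    if isinstance(node, Ident):
        return PathExpr(node)
    raise TypeError(f"node {node!r} is not valid in expression context")


def parse_expression(text):
    """Stand-in for the existing parser's expression rule."""
    if not text.isidentifier():
        raise ValueError(f"not an expression: {text!r}")
    return PathExpr(Ident(text))


def resolve_text_backref(spans, index):
    """String backreference: re-run the normal parser on the saved text."""
    return parse_expression(spans[index])


from_node = resolve_node_backref([Ident("foo")], 0)
from_text = resolve_text_backref(["foo"], 0)
```

Both routes end up with the same expression node here, but only the node-based route needs its own context-specific promotion logic; the text-based route reuses whatever the parser already does.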