@nikomatsakis was telling me how we’ve had an RFC on having a canonical grammar distinct from rustc’s parser, and how there have been some efforts on that front, documented in this extensive issue.
So I’ve been trying to come up with a plan for how we could make progress, again based on @nikomatsakis’ suggestions:
Making sure we have a stable way of serializing the current Rust AST (@harpocrates was using rustc -Z ast-json-noexpand as a base for comparison). Whether we like it or not, the current grammar is what the compiler accepts today.
Try to get inspiration from all the previous attempts at this, in no particular order:
Build a test harness that verifies the generated AST against Rust’s AST (a rough sketch of such a harness follows this list).
Continue building up the grammar based on the confidence that we’re keeping compatibility with what exists today in rustc.
We call the grammar’s implementation complete when we can successfully do something similar to a crater run side by side with the current compiler and obtain comparable output.
As a stretch goal, make sure we can auto-generate railroad diagrams from this grammar to feed into the reference documentation we provide to our users.
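To make the harness idea concrete, here’s a rough sketch of a first version. Everything here is hypothetical: grammar::parse stands in for whatever parser the formal grammar ends up generating, and rustc’s exit status is just the crudest possible baseline (diffing the emitted JSON against our own AST would be the next step).

```rust
use std::path::Path;
use std::process::Command;

/// Whether each parser accepts `file` - a first, yes/no comparison.
struct Verdict {
    rustc_ok: bool,
    ours_ok: bool,
}

fn check(file: &Path) -> std::io::Result<Verdict> {
    // Ask rustc for its unexpanded AST; a zero exit status means the
    // file at least parsed.
    let out = Command::new("rustc")
        .args(&["-Z", "ast-json-noexpand"])
        .arg(file)
        .output()?;

    let src = std::fs::read_to_string(file)?;

    Ok(Verdict {
        rustc_ok: out.status.success(),
        // `grammar::parse` is a placeholder for the generated parser.
        ours_ok: grammar::parse(&src).is_ok(),
    })
}
```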
Putting this potential plan into action!
Let’s start a WG-Grammar!
This new group will hold regular meetings, following the same successful format as WG-NLL:
Focus on identifying and prioritizing tasks.
Ensure we’re keeping momentum - this can easily become a never-ending project, which can be frustrating
Offer mentorship for those who want to have fun with grammars and parsers
Everyone is welcome to join!
The expected initial outcome from this group is a clear vision of how we can deliver a formal grammar for Rust.
This is a pretty long and exciting project, with a lot of opportunities for cross-pollination between several sub-teams.
What team does this WG fall under? If it is more specification-oriented, I think T-lang would be appropriate here.
I would suggest using some form of extended BNF formalism, since BNF is universally understood among folks who have taken a PLT course or written parsers. Languages that do have specifications usually specify their grammar with BNF formalisms.
To make the grammar readable for humans, I think there are some important features to have in the formalism (sketched after this list):
optionals, e.g. foo ::= bar baz? ;
lists (both separated and terminated variants)
parameterization of grammars such as in the Habit report
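To illustrate, a sketch in made-up notation (in the style of the ::= example above):

```
(* optionals: `?` marks zero-or-one *)
ret_ty  ::= ("->" type)? ;

(* separated list: "," between elements; terminated list: each element ends in ";" *)
params  ::= param ("," param)* ;
stmts   ::= (stmt ";")* ;

(* parameterization: a rule that takes other rules as arguments *)
sep_by<T, S> ::= T (S T)* ;
```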
I'd definitely be interested in helping out with this, though (so long as I can resist the urge to write yet another parser generator for the project). It'd be interesting if the chosen grammar generator could support the libsyntax2 Syntax Tree format (ping @matklad), but I don't think that's necessary for a first formalism.
Note that error reporting doesn't have to be a first-pass focus; in fact, I'd argue that it shouldn't be.
I've heard it said that EBNF can be extended to support parameterization, which is particularly important if you want to significantly cut down on duplication in the grammar of a large programming language like Rust; but I'm not sure how you do that. (See https://github.com/Centril/rfcs/blob/rfc/or-patterns/text/0000-or-patterns.md#grammar for a recent example where I had to use parameterization to not make the grammar blow up in size).
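For instance, one way this is usually done (again, hypothetical notation) is to let rules take nonterminals as arguments, collapsing families of near-identical list rules into one template:

```
(* without parameterization, each comma-separated list is spelled out *)
field_list ::= field ("," field)* ","? ;
arg_list   ::= arg   ("," arg)*   ","? ;

(* with parameterization, one template covers them all *)
comma_list<T> ::= T ("," T)* ","? ;
field_list    ::= comma_list<field> ;
arg_list      ::= comma_list<arg> ;
```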
I suppose the main drawback of EBNF is that it lacks a way to name the various alternatives in a production, such as is possible with TNBF, LBNF, GLL, etc.
I suppose we could also use / tweak the lykenware/gll project's formalism here, which is pretty understandable if you are familiar with BNF; however, I'd very much like to see a textual format (as opposed to one based on macro_rules) which also supports parameterization, optionals, lists, and other niceties.
“Pure” EBNF (as refreshed by a Wikipedia check) uses [optional], {repetition}, and (grouping). However, something = foo ( bar ); is not valid EBNF, so foo(bar) is available for extension (this seems to be somewhat commonly used for parameterization). Similarly, it allows raw string = ? arbitrary explanation since it’s not context-free ? ;.
(To advertise the generator I’ve been using recently: pest is a PEG generator that uses an external .pest file, has built-in ways to ignore whitespace/comments, and we have plans to add parameterized rules soon™. PEG probably isn’t appropriate for this project, but I thought it potentially worth mentioning.)
If a textual format is desired, it should be fairly trivial to write a proc macro that just reads in that external file and expands to the macro-by-example invocation. Maybe that’s cheating, but it’s very possible.
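For reference, the external-file setup with pest looks roughly like this (the rust.pest file and its top-level file rule are made up for the example):

```rust
use pest::Parser;
use pest_derive::Parser;

// The derive reads the grammar file at compile time and generates a
// `Rule` enum plus the `Parser` impl from it.
#[derive(Parser)]
#[grammar = "rust.pest"] // resolved relative to the crate's src/ directory
struct RustParser;

fn parses(source: &str) -> bool {
    // Assumes the grammar defines a top-level `file` rule.
    RustParser::parse(Rule::file, source).is_ok()
}
```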
And I guess I should say that whatever tools the effort uses, it’s great if it supports inline tests, some sort of debugging, and an interactive mode for designing. (See also @matklad’s Modern Parser Generator blog post. I have very rough plans to eventually take my stab at writing one that meets most (or all) of these desires. That’s a future kind of thing though.)
I was thinking we could do something more programmatic, and "concretize" the parse tree you'd get from GLL into the libsyntax AST (since that would also be useful to actually run rustc on the test suite with the GLL-based parser, and eventually replace libsyntax's current one).
In the GLL framework, we auto-generate a typed parse tree API on top of a more "dynamically typed" parse tree, so adding something in between wouldn't be a problem (although I personally don't see its usefulness).
We support X? for optionals, and X*, X+, alongside their separated versions, X* % Y and X+ % Y (e.g. Y could be ",").
(the syntax using % is from Perl 6, IIRC, and was suggested by @eternaleye)
Having facilities for parsing this out of a file wouldn’t impact the format itself; we just haven’t bothered to bootstrap yet.
But if there are issues with the format, they could be raised - almost everything other than the GLL "virtual machine" is up for debate!
(this will become easier once we get around to documenting anything)
Parametrization is harder, and we'd prefer to avoid it initially (if it's not actually needed), since it makes mapping to a typed API, and to EBNF, much more annoying.
+1 on this point. There's an interesting observation from Roslyn that "you should report minimal errors in the parser and implement syntax tree validators", and in general that matches my experience as well.
I think it might also be desirable to use separate tools for grammar specification and actual in-compiler implementation, because they have somewhat different requirements.
For example, in the spec, you want to specify the various escape sequences of string literals in full detail, to forbid invalid escapes. In the implementation, you want to parse anything between quote characters as a string, and handle escapes at the semantic level.
Similarly, you should forbid &self arguments in top-level functions in the spec, but allow them in the impl to get better error recovery & error messages.
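Here is a sketch of the validator half of that split, with completely made-up AST types:

```rust
// Made-up AST types; the real ones would come from the parser.
struct Param {
    is_self: bool,
}

enum Item {
    Fn { name: String, params: Vec<Param> },
    Impl { items: Vec<Item> },
}

/// The permissive grammar accepts `self` in any fn signature; this
/// post-parse pass rejects it for free functions, with a precise message.
fn validate_free_fns(items: &[Item], errors: &mut Vec<String>) {
    for item in items {
        if let Item::Fn { name, params } = item {
            if params.first().map_or(false, |p| p.is_self) {
                errors.push(format!(
                    "`self` parameter is only allowed in associated functions (fn `{}`)",
                    name
                ));
            }
        }
    }
}
```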
To deal with correctness issues, fuzzing the spec against the impl should be enough, I think?
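For example, a differential fuzz target (using libfuzzer-sys; both parsers below are placeholders) could be as small as:

```rust
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    if let Ok(src) = std::str::from_utf8(data) {
        // Placeholders for the spec-derived parser and the hand-written one.
        let spec_ok = spec_grammar::parse(src).is_ok();
        let impl_ok = rustc_parse(src).is_ok();
        // The impl is allowed to accept more (for recovery), never less.
        assert!(!spec_ok || impl_ok, "spec accepts input the impl rejects");
    }
});
```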
@qmx
Do you plan to formalize the lexer as well, or will this grammar work start from the token level?
It would be good to use tokens in the proc macro sense as building blocks (i.e. no multi-character operators) to avoid issues like https://github.com/rust-lang/rust/issues/47856 from the start.
I would love to have a scannerless Rust grammar to start off with, but sadly, that’s hard and not even because of raw strings (which can be dealt with in custom disambiguation logic).
The problem is that almost all hand-written lexers do not fit into a formalism - the default “longest match” semantics of most practical regex engines is pretty much the standard when it comes to lexing, and it’s context-sensitive.
For example, ab is lexed as one identifier token, not two, but a formal lexer must produce an ambiguous result, treating both [Ident(ab)] and [Ident(a), Ident(b)] as valid parses.
Now, a scannerless grammar has something going for it: depending on how tokens can be arranged in the grammar, many of these ambiguities can be resolved (e.g. if no grammar rule accepts two identifier-like things one after another, or always mandates separating them with spaces).
Sadly, even ignoring macro invocations, Rust is already pretty complex here, supporting e.g. for(x)in(y) and for x in y.
See also https://github.com/lykenware/gll/issues/14 (currently we have a context-sensitive feature, to be able to do anything scannerless at all, because one of the papers suggested it as a “practical” extension for scannerless parsing, but we need to move away from that, and instead desugar simple forms of lexical context-sensitivity to (E)BNF).
I am super excited about this. I would caution us against trying too hard up front to pick the “perfect” format for everything. I think we can evolve as we go. I also think it’s worth separating out the “Vision” from the “MVP”.
In my view, the end vision is this:
A Rust grammar specification in some machine-readable format which we can map to a standard context-free grammar. This will encode both stable Rust and also nightly-only features (with annotations to indicate that).
It should operate over a scanner with tokens that map very closely to the procedural macro tokens.
This official Rust grammar should be convertible into the input for other tools, such as parser generators but also railroad diagram generators etc.
It is, however, a non-goal for it to be an LL(k) or LR(k) grammar, so these tools must support a generalized algorithm that accepts any CFG, such as GLL, GLR, or LL(*).
We should have a testing harness that tests the grammar (and the trees produced by it) in various ways:
Given a corpus of Rust code, indicate where parse errors occurred
Compare against the trees produced by other parsers (e.g., rustc, syn) - a sketch of such a comparison follows this outline
Compare against “reference trees”
We should use this standard grammar to spec additions to the language and to test for ambiguities
Since the grammar is not LR(1), we’ll need to use some other technique to detect ambiguities, but we have various ideas.
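As a sketch of the “compare against other parsers” bullet above, a first version might only check agreement on whether each file of a corpus parses at all (grammar::parse is a placeholder for our parser; comparing the trees themselves would come later):

```rust
use std::{fs, path::Path};

fn compare_corpus(dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "rs") {
            let src = fs::read_to_string(&path)?;
            let syn_ok = syn::parse_file(&src).is_ok();
            let ours_ok = grammar::parse(&src).is_ok(); // placeholder parser
            if syn_ok != ours_ok {
                println!("disagree on {}: syn={}, ours={}", path.display(), syn_ok, ours_ok);
            }
        }
    }
    Ok(())
}
```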
However, the MVP probably looks quite different. I think the main goal should be getting something executable and testable written in Rust. We can iterate from there, moving to other tools or other notations. Maybe something like this:
Some kind of scanner, hopefully using the procedural macro tokenizer
A testing harness that compares against rustc and just checks yes/no – did this file parse or not?
A LALRPOP-based parser to start, based on the harpocrates parser (link) but adjusted to remove the reliance on precedence
This can be ported either manually or automatically
Tests should pass against some reasonable corpus
(To be honest, I don’t actually care if we use LALRPOP or not, but I think it’s one of the more mature tools out there. It is limited to LR(1) currently, but the harpocrates parser is using an LALR parser generator, I think, so that should be ok. I expect us to migrate away from the LALRPOP format and add support for interfacing with other tools, as described in the “end vision”.)
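For a flavor of the wiring, this is roughly how a LALRPOP-generated parser would be pulled into a Rust harness (the rust_grammar.lalrpop file and its File nonterminal are hypothetical):

```rust
use lalrpop_util::lalrpop_mod;

// Includes the parser that LALRPOP generates at build time from
// `src/rust_grammar.lalrpop`.
lalrpop_mod!(pub rust_grammar);

fn parses(src: &str) -> bool {
    // LALRPOP generates one `<Name>Parser` type per `pub` nonterminal.
    rust_grammar::FileParser::new().parse(src).is_ok()
}
```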
Once the MVP is working, the immediate next steps would be:
Write different backends from the same starting grammar (e.g. lykenware/gll)
Extend harness to compare parse trees (Harpocrates parser does this)
I’d like to experiment with notations for precedence as well as expressing ways to suppress ambiguity, for example around if <expr> {, where the family of expressions that get parsed is a subset of the full family of expressions.
As I wrote in my post, I don't think we should limit ourselves to EBNF in particular. I think we should be able to generate standard EBNF, but we should adjust our formal definition as needed to make it readable, concise, and precise.
Some examples of things that "plain ol' EBNF" can't cover:
"Optional" features, like nightly-only things.
Precedence chains
Subsets of nonterminals, like expressions in an if context
I'd like us to be able to express those declaratively.
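For concreteness, here is one completely invented surface syntax for those three, just to show that each can be expressed declaratively:

```
(* 1. alternatives gated on a nightly feature *)
expr    ::= lit_expr | if_expr | @nightly(box_syntax) box_expr ;

(* 2. a precedence chain, loosest to tightest, instead of stratified rules *)
expr    ::= chain unary_expr with "||" < "&&" < ("+" | "-") < ("*" | "/") ;

(* 3. a subset of a nonterminal: expressions minus struct literals *)
if_expr ::= "if" (expr - struct_expr) block ;
```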
One thing worth noting is that "conjunctive grammars" exist as a parsimonious extension of CFGs with an intersection operator &. They retain all formal properties of general CFGs (while adding closure under intersection), including the O(|G|·n³) worst-case bound.
There are also the "Boolean grammars", which add negation (and similarly preserve everything, including performance). Alexander Okhotin has written a great deal on the subject of both.
Using such a formalism, the if <expr> { case would be quite simple: intersect the <expr> with the permitted forms (in the conjunctive case), or with the negation of the forbidden forms (in the boolean case).
While this slightly erodes the ability to generate a classic EBNF, it doesn't do so by all that much. In particular, an EBNF that simply strips these operators would produce a parse forest that is a strict superset of the one produced by a full conjunctive or boolean parser, aside from the additional parse trees from the right-hand side of the conjunction (which in this case, and I suspect in general for Rust, will be mere duplicates of the left-hand side).
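In such a formalism, the if case would read roughly like this (rule names invented):

```
(* conjunctive: the condition must derive from expr AND from the permitted forms *)
if_cond ::= expr & cond_expr ;

(* boolean: the condition must derive from expr AND NOT from struct_expr *)
if_cond ::= expr & ~struct_expr ;
```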
EDIT: Also relevant re: precedence are the Floyd languages, sometimes called the operator-precedence languages. This proper subset of CFG is classically parsed by either the Shunting Yard algorithm (LR-like) or Pratt parsing (LL-like). It makes use of an operator precedence matrix, which denotes a partially ordered set over the operators. See also, this dissertation.
Such a matrix can easily be defined by giving pairwise 'tighter', 'looser', and 'eqv' annotations (and a unary associativity annotation) on rule clauses of the form (NT opT NT). (Strictly speaking, this only defines the upper triangular matrix, and leaves the lower triangular matrix to be defined by inversion - however, making + tighter than * on the left, and looser on the right, is a bit of a... specialized need.)
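Spelled out, such annotations might look like this (hypothetical syntax); the looser-than annotation fills in the upper-triangular entry + ⋖ *, and * ⋗ + then follows by inversion:

```
expr ::= expr "+" expr   (assoc left) (looser-than "*") ;
expr ::= expr "*" expr   (assoc left) ;
```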
On the Crater side, the easiest thing is to implement a -Z flag in the compiler that compares the current parser output against the new one (like what -Z nll=migrate does for NLL). If you do that you can request Crater runs whenever you want (support is already baked in).