Proposal: Grammar Working Group


It would be cool to converge on some kind of textual format for AST early on, so that testing harness can be basically a glorified diff. That is, it would be useful to be able to compare some parser against current implementation without needing to build rustc, and using some test corpus instead.

JSON could be used for that, but I think that a home-grown text format could be much more readable. I kinda like the one use in IntelliJ/libsyntax2: just an indentation based tree where nodes are marked with textual offsets and kinds. Here’s an example:


One thing that may be useful is borrowing the taxonomy from the Grammarware project:

  • A Parse Forest is the set of all valid parse trees from an ambiguous parse of the input
  • A Parse Tree is is an unambiguous parse tree where the leaves can be concatenated to recover the input exactly
  • A Concrete Syntax Tree is a parse tree in which layout information (whitespace, etc) has been erased
  • An Abstract Syntax Tree is a tree which contains only abstract syntax information - instead of an Expr node containing an LHS, a ‘+’, and an RHS, it’s a BinaryPlus node with two children.

The GLL algorithm, like any general CFG parsing algorithm, produces a parse forest. Through disambiguation, one can select a canonical parse tree from the forest. Through erasing irrelevant formatting, one can arrive at a concrete syntax tree. Finally, through interpreting syntactic constructs according to their meaning, one can arrive at an abstract syntax tree.

These can be interleaved to a degree - various papers have found that disambiguating during the parsing process (thus reducing the size of the parse forest as it’s constructed) can have significant performance advantages.

Notably, an indented-tree serialization is only viable for a CST or AST in this taxonomy; it would not be sufficient for cases where the parse tree is needed, such as making sure that diagnostics report ranges correctly.


Hm, why is this so?i think you can serialize parse trees just fine? The example from libsyntax2 is a parse tree


Ah, I see what you’re getting at. I was thinking of a more direct encoding than you are using, given your emphasis on “more legible”. However, an encoding like that is, IMO, really not any more legible than JSON. Meanwhile, JSON has plenty of existing tooling, including high-performance parsers and serializers, and graphical viewers.

EDIT: To put it another way, I think it’d be a somewhat cruel irony if interacting with our new, more-formalism-based parser required hand-rolling a parser for a custom format.


The Rust Reference has been making progress on documenting the grammar. Just today the grammar for patterns was published (available here). It currently does not have any sort of automated verification, and the syntax is somewhat informal. I’d like to make sure that whatever automated grammar is written is kept in sync with the reference, and that the reference remains in some sort of syntax that is easily understandable (it doesn’t necessarily need to stay the way it is now). Would it be reasonable to keep the grammar definition within the reference?


We want the grammar for three pruposes:

  1. The compiler to actually compile.
  2. Human readable, so that we humans can learn it.
  3. A formal spec that we can do proofs with.

The WG is proposing working on the third one, while the reference mainly concerns itself with the second. Ideally, it’d be nice to have the second and third grammars aligned, but I suspect that it’ll be hard (due to things like the predicate of an if expression cannot be a struct expression).

That said, I’ll definitely delegate responsibility of the Reference grammar to this WG.


We are having a lot of activity at #wg-grammar on discord, join us!

So, just for the sake of clarity, there were some patterns that emerged from the initial conversation - and it became painstakingly obvious that we need to narrow down the scope of what we want to achieve.

Our main target is having a canonical grammar, as in spec/reference.

While being able to use this grammar to generate rustc's parser is highly desired from many people (myself included), that’s a non-goal, at least initially.

We’re planning to have our first meeting this Wednesday, October 3rd at 21:00 CEST (place and format TBD), with this tentative agenda:

  • refining the WG’s scope (aka defining what is the MVP for this group)
  • test harness execution plan
  • should we aim to define an AST/Syntax a la Roslyn?
  • quick review on @eddyb’s GLL


Any update on place and format?


Gaaah, my bad!

We’re going to meet via text, on the #wg-grammar discord channel.

See y’all there (and indeed, there will be better communication from now on)


We’ve had our first meeting last Wednesday on discord! Here are the main takeaways:

  • We now have a repository where the new grammar will live
  • Our main target is having a CFG textual representation of the grammar, that potentially could be consumed by both LALRPOP and GLL.
    • so initially targeting both parser frameworks, even if it means more work - the expected outcome is to get a deeper understanding of Rust’s grammar
  • Strive to make this grammar as modular as possible, allowing us to test parts of the grammar in isolation.
  • Really focus on getting something in place (like a hello-world parser) so we can iterate on it, and build a test harness.

Another point discussed is what is the target audience for this work, and we narrowed it down to Language designers, Reference (having an authoritative place to define “Rust”) and Empowering RFCs when it comes to changes to the Language/Grammar.

Lastly, our meetings are going to happen at #wg-grammar on discord, the next one is scheduled for October 17th, 21:00 Central European Time.