It would be cool to converge on some kind of textual format for AST early on, so that testing harness can be basically a glorified diff. That is, it would be useful to be able to compare some parser against current implementation without needing to build rustc, and using some test corpus instead.
JSON could be used for that, but I think that a home-grown text format could be much more readable. I kinda like the one use in IntelliJ/libsyntax2: just an indentation based tree where nodes are marked with textual offsets and kinds. Here’s an example:
One thing that may be useful is borrowing the taxonomy from the Grammarware project:
A Parse Forest is the set of all valid parse trees from an ambiguous parse of the input
A Parse Tree is is an unambiguous parse tree where the leaves can be concatenated to recover the input exactly
A Concrete Syntax Tree is a parse tree in which layout information (whitespace, etc) has been erased
An Abstract Syntax Tree is a tree which contains only abstract syntax information - instead of an Expr node containing an LHS, a ‘+’, and an RHS, it’s a BinaryPlus node with two children.
The GLL algorithm, like any general CFG parsing algorithm, produces a parse forest. Through disambiguation, one can select a canonical parse tree from the forest. Through erasing irrelevant formatting, one can arrive at a concrete syntax tree. Finally, through interpreting syntactic constructs according to their meaning, one can arrive at an abstract syntax tree.
These can be interleaved to a degree - various papers have found that disambiguating during the parsing process (thus reducing the size of the parse forest as it’s constructed) can have significant performance advantages.
Notably, an indented-tree serialization is only viable for a CST or AST in this taxonomy; it would not be sufficient for cases where the parse tree is needed, such as making sure that diagnostics report ranges correctly.
Ah, I see what you’re getting at. I was thinking of a more direct encoding than you are using, given your emphasis on “more legible”. However, an encoding like that is, IMO, really not any more legible than JSON. Meanwhile, JSON has plenty of existing tooling, including high-performance parsers and serializers, and graphical viewers.
EDIT: To put it another way, I think it’d be a somewhat cruel irony if interacting with our new, more-formalism-based parser required hand-rolling a parser for a custom format.
The Rust Reference has been making progress on documenting the grammar. Just today the grammar for patterns was published (available here). It currently does not have any sort of automated verification, and the syntax is somewhat informal. I’d like to make sure that whatever automated grammar is written is kept in sync with the reference, and that the reference remains in some sort of syntax that is easily understandable (it doesn’t necessarily need to stay the way it is now). Would it be reasonable to keep the grammar definition within the reference?
The WG is proposing working on the third one, while the reference mainly concerns itself with the second. Ideally, it’d be nice to have the second and third grammars aligned, but I suspect that it’ll be hard (due to things like the predicate of an if expression cannot be a struct expression).
That said, I’ll definitely delegate responsibility of the Reference grammar to this WG.
We are having a lot of activity at #wg-grammar on discord, join us!
So, just for the sake of clarity, there were some patterns that emerged from the initial conversation - and it became painstakingly obvious that we need to narrow down the scope of what we want to achieve.
Our main target is having a canonical grammar, as in spec/reference.
While being able to use this grammar to generate rustc's parser is highly desired from many people (myself included), that’s a non-goal, at least initially.
We’re planning to have our first meeting this Wednesday, October 3rd at 21:00 CEST (place and format TBD), with this tentative agenda:
refining the WG’s scope (aka defining what is the MVP for this group)
test harness execution plan
should we aim to define an AST/Syntax a la Roslyn?
Our main target is having a CFG textual representation of the grammar, that potentially could be consumed by both LALRPOP and GLL.
so initially targeting both parser frameworks, even if it means more work - the expected outcome is to get a deeper understanding of Rust’s grammar
Strive to make this grammar as modular as possible, allowing us to test parts of the grammar in isolation.
Really focus on getting something in place (like a hello-world parser) so we can iterate on it, and build a test harness.
Another point discussed is what is the target audience for this work, and we narrowed it down to Language designers, Reference (having an authoritative place to define “Rust”) and Empowering RFCs when it comes to changes to the Language/Grammar.
Lastly, our meetings are going to happen at #wg-grammar on discord, the next one is scheduled for October 17th, 21:00 Central European Time.
You’ll have to ask @eddyb why they picked GLL over GLR for their parser generator.
As for the WG, we picked it because it a) offers a correct CFG implementation, b) it allows us to check for ambiguities in the grammar, c) the notation used in the gll crate seemed nice.
To expand on what Centril said, the lykenware/gll crate has the ability to not only detect an ambiguous parse, but produce a parse forest instead of a parse tree, giving us more tooling flexibility at less initial grammar annotation cost.
Personally, I trust GLL more because it’s a lot like “backtracking recursive descent” or some parser combinator approach you might use in a pure functional language, but with sharing optimizations (like memoization) and more control over the nondetermistic execution (i.e. A | B can execute A before B, or A interleaved with B).
There’s a real possibility of writing a computer-checked formal proof that a GLL implementation’s optimizations can’t result in the wrong forest (I can’t remember if I’ve seen this done for other implementations, but it’s possible).
Meanwhile, (G)LR is like stack programming languages, and has associated complexity in the transformation of a grammar to that computational model.
I’ve also, sadly, not heard yet of a GLR implementation that produces a parse forest (tree-sitter doesn’t AFAIK).
But none of this should really matter, since we plan to produce a CFG, first and foremost, which means any CFG parsing algorithm (not just GLL and GLR) should work.
SGLR the algorithm (to distinguish from any implementations) can produce parse forests as well. That’s what the G in GLR means. (in case you’re curious, the S stands for scannerless i.e. there’s no separate lexing phase)
That raises a question: how does GLL deal with left recursion, which LL(k) languages (and PEGs) are incapable of parsing due to infinite recursion?
This is cool, and reminds me of an idea I had to implement the concurrent version as a separate infix operator (tentative symbol: || ie. “parallel or”). That gives the language implementer the option to pay the extra cost where necessary, while avoiding unnecessary perf hits.
Now we’re talking! This is very nice, and I’d like a link to the paper if you have one.
You’re probably referring to the time spent building the parse table? Doesn’t GLL use anything like that? I mean, naively I’d say it’d have to in order to be able to memoize anything at all, but hey…
For a GLR impl that produces parse forests, have a look at JSGLR (as the name implies it’s written in Java, and as a research heavy project it’s absolutely not production ready, but it exists).
Yes and no. In principle you’re right of course, but (and I expect everyone here knows this) in practice runtime efficiency in time and space matters.