Proposal: Grammar Working Group

Not that relevant for this WG, but, yeah, that's fair. We are using a high-level notation for things like lists (comma-separated and otherwise), which makes it possible for the implementation to choose between left and right recursion, for example.
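
For illustration, with a made-up surface syntax (not necessarily the notation's actual spelling), a list rule such as

Args ::= list(Expr, ',')

could be lowered either left-recursively, as Args ::= Args ',' Expr | Expr, or right-recursively, as Args ::= Expr ',' Args | Expr, without the grammar author committing to either.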

What GLL generates from the grammar is effectively a CPS functional program and memoization is done on functions, with continuations shared and executed only once.

There is no conversion from the grammar to stack actions, only call/ret.

I think GLL might not be as efficient for left recursion, but it doesn't get stuck in infinite recursion, since both the call stack (the GSS) and the resulting parse forest (the SPPF) handle cycles gracefully.

I expect no less! I just don't see it mentioned outside of academic research, despite GLR implementations being used "in the industry".

Thanks for the JSGLR example - we should definitely check it out. And if you want to, you can leave issues on GitHub (rust-lang/gll: GLL parsing framework).

Now I wish I had a link to that paper @eternaleye showed me about parsing RCGs by building a CFG (or was it LCFRS?) approximation, running GLL on it, and then disambiguating the resulting parse forest using the information from the RCG that was missing from the approximation.

I'm guessing it's because would-be grammar writers are scared away by the O(n^3) worst case for JSGLR (I can't speak to GLR in general, though I don't expect scannerlessness to make much of a difference in asymptotic performance) while they're used to O(n) for LL(*) and LR(*) - and that's assuming they even know GLR and GLL exist. Depending on the type of analysis (keeping in mind that evaluation can be seen as an analysis), that worst case can be significant. For Rust, the upper bound doesn't seem to matter as much, since the type/borrowck analysis and especially code generation are much more expensive than parsing (I'm not sure whether that holds asymptotically, but empirically the analysis and codegen steps seem to scale superlinearly).

If you ever find that link please post it, even if it's a couple of weeks (months?) from now. I'd love to read it.

First, regarding how GLL deals with left recursion - the answer is "quite nicely, thank you", but the actual mechanism ties back into the memoization. Essentially, before attempting to match a rule, a pre-memoization is recorded for the rule. If the rule then attempts to invoke itself at the same position, the pre-memoization is promoted to a full memoization of a cyclic result, which then appears in the SPPF. The same technique can be seen (modulo some excessively dense notation) in Pierre Boullier's paper (PS.GZ) that first introduced the range concatenation grammars.
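
To make the mechanism concrete, here is a toy Rust recognizer for the left-recursive grammar E ::= E '+' 'n' | 'n' - my own sketch, not the rust-lang/gll implementation, with no GSS or SPPF, just a memo table, and with made-up names throughout. The entry inserted before a rule body runs plays the role of the pre-memoization: a left-recursive self-call finds it, merely subscribes its continuation instead of recursing forever, and is later fed every end position the rule discovers. It also shows the CPS shape mentioned earlier, with each shared continuation executed only once per result:

use std::collections::HashMap;
use std::rc::Rc;

// A continuation receives the parser and an end position.
type Cont = Rc<dyn Fn(&mut Parser, usize)>;

struct Entry {
    ends: Vec<usize>, // end positions proven so far for this (rule, start)
    conts: Vec<Cont>, // continuations subscribed to this (rule, start)
}

struct Parser {
    input: Vec<char>,
    memo: HashMap<(&'static str, usize), Entry>,
}

impl Parser {
    // Invoke `rule` at `start`, sending every end position to `k`.
    fn call(&mut self, rule: &'static str, start: usize, k: Cont) {
        if let Some(e) = self.memo.get_mut(&(rule, start)) {
            // (Pre-)memoized already: subscribe and replay known results.
            // A left-recursive self-call lands here and returns at once.
            e.conts.push(k.clone());
            let ends = e.ends.clone();
            for end in ends {
                k(self, end);
            }
            return;
        }
        // The "pre-memoization": record the entry before running the body.
        self.memo
            .insert((rule, start), Entry { ends: vec![], conts: vec![k] });
        match rule {
            "E" => self.rule_e(start),
            _ => unreachable!(),
        }
    }

    // Record that `rule` starting at `start` can end at `end`, waking
    // every subscribed continuation exactly once per new result.
    fn succeed(&mut self, rule: &'static str, start: usize, end: usize) {
        let e = self.memo.get_mut(&(rule, start)).unwrap();
        if e.ends.contains(&end) {
            return;
        }
        e.ends.push(end);
        let conts = e.conts.clone();
        for k in conts {
            k(self, end);
        }
    }

    // E ::= E '+' 'n' | 'n'
    fn rule_e(&mut self, start: usize) {
        // Left-recursive alternative: E '+' 'n'.
        self.call(
            "E",
            start,
            Rc::new(move |p, mid| {
                if p.input.get(mid) == Some(&'+') && p.input.get(mid + 1) == Some(&'n') {
                    p.succeed("E", start, mid + 2);
                }
            }),
        );
        // Base alternative: 'n'.
        if self.input.get(start) == Some(&'n') {
            self.succeed("E", start, start + 1);
        }
    }
}

fn main() {
    let mut p = Parser {
        input: "n+n+n".chars().collect(),
        memo: HashMap::new(),
    };
    let len = p.input.len();
    p.call(
        "E",
        0,
        Rc::new(move |_, end| {
            if end == len {
                println!("recognized the whole input");
            }
        }),
    );
}

On n+n+n, the base alternative seeds the entry for E at position 0 with end position 1, and the left-recursive alternative then grows it to 3 and 5; where this sketch merely replays results, a full GLL implementation additionally records them as (possibly cyclic) SPPF nodes.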

The paper @eddyb referred to (PDF) is an instance of a general technique called "Guided Parsing", where one uses a simplified approximation of a grammar to guide parsing with the full grammar.

Some background - I presume you're familiar with the most-discussed levels of the Chomsky hierarchy, namely the regular languages, the context-free languages, and the context-sensitive languages.

In between context-free and context-sensitive, there are a number of intermediate classes of grammars - some fascinating ones include the conjunctive grammars, the boolean grammars, linear context-free rewriting systems, positive range concatenation grammars, and "full" range concatenation grammars.

To explain these, I'll first introduce an alternate syntax for CFG, and then explain the rest in terms of that.

In classic BNF, you have rule := subrule other_subrule. This can be seen as a form of logic program, but with the arguments left implicit. Making them explicit results in rule(segment1 segment2) := subrule(segment1), other_subrule(segment2) - this rule takes a single argument (which is pattern-matched into two variables covering disjoint ranges) and delegates to the subrules. Disjunction is represented, as in BNF, by the existence of multiple rules with the same name. The syntactic restrictions necessary to permit only CFGs are as follows (a concrete example appears after the list):

  1. Rules take only a single argument
  2. Every variable in the head of the rule must occur exactly once in the body of the rule
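
As a concrete example (my own illustration), the CFG for a^n b^n in classic BNF:

S ::= 'a' S 'b' | ε

and in the explicit-argument notation:

S('a' X 'b') := S(X)
S("") := true

Both restrictions hold: S takes a single argument, and the head variable X occurs exactly once in the body.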

The conjunctive grammars (usually described as "CFG with conjunction") may be defined as the relaxation of rule 2, allowing variables to be used multiple times. LCFRS may be defined as the relaxation of rule 1, allowing rules of higher arity, though that arity must be the same for all rules that share a name.
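
To make the two relaxations concrete, here are my own sketches, in the same notation, of the standard non-context-free language a^n b^n c^n, captured once with each relaxation (the predicate names are mine). As a conjunctive grammar, reusing X expresses intersection:

S(X) := AB_C(X), A_BC(X)
AB_C(X Y) := AB(X), Cs(Y)
AB('a' X 'b') := AB(X)
AB("") := true
Cs('c' X) := Cs(X)
Cs("") := true
A_BC(X Y) := As(X), BC(Y)
As('a' X) := As(X)
As("") := true
BC('b' X 'c') := BC(X)
BC("") := true

As an LCFRS, one arity-3 predicate keeps the three runs in lockstep:

S(X Y Z) := T(X, Y, Z)
T('a' X, 'b' Y, 'c' Z) := T(X, Y, Z)
T("", "", "") := true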

Positive range concatenation grammars (PRCGs), then, apply both of these relaxations, plus an additional extension: the ability to concatenate variables so long as the head pattern-matched them from adjacent segments of input.
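
As a sketch of the combined power (a standard example, transcribed by me into this notation): the language of a-runs whose length is a power of two, which is beyond even LCFRS since LCFRS languages are semilinear. Eq has arity 2 (relaxing rule 1), and X occurs twice in the second rule's body (relaxing rule 2):

S('a') := true
S(X Y) := S(X), Eq(X, Y)
Eq('a' X, 'a' Y) := Eq(X, Y)
Eq("", "") := true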

The addition of negation, then, affords one the "full" range concatenation grammars.

An alternate name for LCFRS, used mostly when they are expressed in this syntax (which is, in fact, RCG syntax) is "Simple RCG" or "SRCG". In addition, a k-RCG is an RCG whose predicates have at most arity k.

One can then define the other grammars as various trivial cases of RCG:

  • CFG is 1-SPRCG
  • conjunctive grammars are 1-PRCG
  • boolean grammars are 1-RCG
  • LCFRS are SPRCG
  • etc.

(As an interesting aside, CFG can be characterized as "regular languages plus least fixpoints" in a formal sense (PDF))

This use of a common syntax makes the primary contribution of the paper much more tractable - a sound and complete algorithm for generating a 1-PRCG approximation (which accepts a superset of the original language) from an arbitrary PRCG. From there, a 1-SPRCG approximation is relatively trivial (dropping all but one side of each conjunction), and existing techniques can even take you all the way down to the regular languages.
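
As a crude illustration of the superset property only (the paper's actual construction is considerably more refined), the power-of-two grammar above can be approximated as a 1-PRCG by splitting the arity-2 Eq into two unary predicates, forgetting the correlation between its arguments:

S('a') := true
S(X Y) := S(X), Eq1(X), Eq2(Y)
Eq1('a' X) := Eq1(X)
Eq1("") := true
Eq2('a' Y) := Eq2(Y)
Eq2("") := true

Eq1 and Eq2 now accept any run of a's, so the approximation recognizes every nonempty a^n - a strict superset - and guided parsing uses the original Eq to prune the spurious derivations from the resulting forest.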

Another paper (PDF) points out that the 1-PRCGs have worst-case performance O(|G|n^3), the same as the CFGs, affording a nice increase in power without a loss of performance. Work elsewhere by Alexander Okhotin (PDF) has shown that this result extends to the boolean grammars, which as noted above are equivalent to the 1-RCGs.

In addition, (P)RCG is a nice formalism in its own right - it characterizes the PTIME-recognizable languages.

EDIT: Did up a table of the relationships of the grammar formalisms:

            Degree  Positive  Simple
CFG         1       ✓         ✓
Conjunctive 1       ✓
???         1                 ✓
Boolean     1
LCFRS       k       ✓         ✓
PRCG        k       ✓
???         k                 ✓
RCG         k

If you want more info on the topic, I highly recommend Laura Kallmeyer's "Parsing Beyond Context-Free Grammars", ISBN 978-3-642-14845-3, DOI 10.1007/978-3-642-14846-0

Last-minute stinger: A 2-RCG for Rust raw strings:

RawBytes('b' remainder) := RawString(remainder)
RawString('r' lhs '"' body '"' rhs) := Hashes(lhs, rhs), !ContainsDelimiter('"' rhs, body)
Hashes('#' a, '#' b) := Hashes(a, b)
Hashes("", "") := true
ContainsDelimiter('"' needle, prefix '"' middle suffix) := Hashes(needle, middle)

Here Hashes requires its two arguments to be equal-length runs of '#', so the opening and closing delimiters agree, while ContainsDelimiter matches a '"' followed by an equally long run of hashes anywhere inside the body - exactly the closing delimiter, which the negation forbids.

This can be transformed into a PRCG, though that is left as an exercise for the reader :stuck_out_tongue:


Thank you for posting all this info! Looks like I have some reading to do :slight_smile:
