Pre Pre RFC: canonical lexer specification


#1

Hi! Based on the recent discussion on the Canonical Grammar RFC, I came to the conclusion that it maybe worthwhile to split off a Canonical Lexer RFC. Let’s discuss!

Problem description:

Creating an executable grammar specification is hard. By dividing it into the lexical and syntactical parts, it should be possible to simplify both the spec checking tools and the spec itself. This is especially important for Rust, because it’s lexical structure is pretty complicated. For example, raw string literals can not be described by a CFG (proof).

Proposed solution:

Create an executable lexer specification, make sure it is run during the build of rustc and fuzz existing lexer against it.

How this speck should look like?

Seems like there is no convenient formalism for describing lexers (each token by itself can be described by a regular expression, but the lexing algorithm often includes add hock rules like precedence), and nested comments and raw string literals are not regular. But I think that we can just carefully spell out the lexing algorithm, because it can be made simple enough:

  • Write a regular expression for each token (declarative part)
  • Loop through regexes, match them against the start of the input, select longest match, continue with the rest of the input (spelling the details part)
  • If input starts with r#*" or /* run a custom algorithm for matching raw literals and comments and hand wave their meaning (the messy part)

I’ve actually written a prototype of such spec as a literate program in Rust. It’s capable of lexing itself, but I have neither compared it with the actual lexer, nor I am sure that this simple algorithm and special rules will be able to handle the whole rust :slight_smile:

TLDR: a prototype lexer spec: https://github.com/matklad/rustlexspec/blob/master/src/lib.rs


#2

This is not quite true. In particular, the entire concept of a lexer fundamentally relies on that pretty much all interesting classes of grammar are closed with respect to composition with finite-state transducers.

That is to say that for a grammar category that is closed in such a manner, composing it with a finite state transducer yields a grammar in the same category. This is true of LL, LR, LALR, CFG, etc.

However, finite-state transducers can only express transformations up to the complexity of the regular languages. Once you go beyond the regular languages, it’s no longer an FST, and the closure properties no longer hold.

As a result, taking a CFG and composing it with a “lexer” that expresses things beyond regular languages may yield something that isn’t a CFG at all - it might be context-sensitive, or it may even be undecidable.

In other words: There is, in fact, a formalism for describing lexers - one which, in fact, inspired their use in parsers. That formalism is a “Finite-state transducer”, and if Rust’s “lexer” cannot be expressed as one, that’s potentially a real problem.

Nested comments are a very good example there: Finite-state transducers, like finite-state automata, cannot count. If Rust actually does handle nested comments by nesting them, rather than the first */ ending the comment, it may not be possible to express comments as part of the lexer without badly violating the definition of the term.

(It may be possible to end-run that specific issue by showing that the Rust grammar itself falls into a class that is closed under composition with “balanced parenthesis languages”, but that’s a whole different kettle of fish, and possibly a rotten one)


#3

Describing the lexer by presenting an FST is as convenient as describing the grammar by presenting a push down automaton :slight_smile: I was trying to say that while there is a language to precisely and human friendly describe CFGs, I am not aware of the one for lexers.

Hey, I have a constructive proof that “fundamentally relies” part is false: Rust as a language over the Unicode “symbols” is not context free (because raw string literals), but Rust as a language over the alphabet of tokens is. Hence, a rustc lexer does not compose with most (all?) interesting classes of grammars.


#4

Sure there is. Just as a pushdown automaton can be described in (E)BNF (subrules are, in fact, exactly stack operations), an FST can be described by a list of (token, regular expression) pairs.

First, there can be no such proof. The simple fact is that, given a lexer and a grammar, unless the closure property holds, parsing may be nonterminating - and in that case, Rice’s theorem holds, and you get the full set of problems that come with Turing-completeness.

Mm, and that’d be another case like what I said regarding comments. My point isn’t that “what rustc calls a lexer doesn’t exist”. My point is that it abuses the term “lexer” beyond its limit, exactly due to this:

In both cases, if they actually do count/balance, then at very least those particular portions belong to the class of semi-Dyck languages. As the semi-Dyck languages under composition with FSTs generate the context-free languages, there is in fact a negative proof: Composition of a context-free grammar with another context-free grammar is potentially not context-free, and possibly not even halting.

As a result, one cannot judge such a lexer independently of the parser, as depending on details of both, the lexer can break the parser.


#5

In both cases, if they actually do count/balance, then at very least those particular portions belong to the class of semi-Dyck languages.

Nested comments do, raw literals don’t, because they are not a CFG language (and semi-Dyck langauges are CFG).

Rust language’s syntax (if we take Unicode “symbols” as an alphabet) is not context free. There is no push down automaton that can recognize Rust syntax. (and this is a good tradeoff imo, because the raw literal syntax is sweet, and lexeing Rust in practice is simple)

This is one of the main reasons why I want to specify the lexer separately :slight_smile:

an FST can be described by a list of (token, regular expression) pairs.

Yes, but you would need to add some rules for conflict resolution (several regular expressions match, or the same regular expression is matched in different positions), and in practice lexers have to deal with nested comments and other non-regular things. That’s why I advocate for describing the meaning of the specification in the specification itself, as opposed to “hey, the spec is just an input file for Lex”.

My point is that it abuses the term “lexer” beyond its limit!

So do I! Is there a better term for a function that turns a unicode string into a token sequence?


#6

Well, for one, Rust’s doesn’t turn a unicode string into a token sequence. Note how macros-by-example work on token trees, rather than token lists.

Fundamentally, it’s a grammar. Composed with another grammar. If anything, what you’re calling a lexer is a full grammar in its own right, and then Rust’s grammar is a tree-adjoining grammar over token trees.


#7

This looks very implemented oriented. Which may be fine, but when speaking of a specification, I usually think of a general document that isn’t necessarily dependent on any implementation technology/details. Of course executable specifications are great, but I’d rather see a general CFG of Rust (+ disambiguation rules) first, and only an example lexer+parser using some dominant technology/algorithm second. If you define the implementation oriented thing first, you may get stuck in complacency, where the current spec feels “good enough” even though other have difficulty using it if they don’t want to or can’t use the tech/algorithm that you’re basing your spec on.


#8

I absolutely do not want to tie the spec to a particular lexer/parser technology. That’s why I propose to just describe a matching algorithm. I would prefer a less operation description, but I am not sure that it is possible.

However I feel that the specification should be checkable mechanically from the start: if you can’t derive an algorithm from the spec, it is not precise enough.

Of course there is a question if we need an absolutely precise speck, or would an approximate one do. By an approximate speck I mean the one that will be unambiguous for real world programs, but may have some leeway for obscure corner cases. And I’ve just found an obscure corner case, which was invalid in 1.4.0 and is valid in 1.11.0!

fn foo() {}

fn main() { self :: foo() }

I would prefer to nail down all the details exactly.


#9

Fair enough. I don’t know much about independent lexer specifications, and that also depends a lot on how complicated Rust is in this regard. But if you don’t want to be tied to a particular technology, I would suggest a CFG or maybe a DDG (Data-Dependent Grammar). Those have clear interpretations, and you can pick a syntax in which to write it that’s supported by a tool. If you go for a CFG make sure you grab a tool that covers full CFGs as your supporting tech, most tools are restricted to LL/LR classes which require you to warp your grammar too much to keep it a general spec.