Hi! Based on the recent discussion on the Canonical Grammar RFC, I came to the conclusion that it may be worthwhile to split off a Canonical Lexer RFC. Let’s discuss!
Problem description:
Creating an executable grammar specification is hard. Dividing it into lexical and syntactical parts should simplify both the spec-checking tools and the spec itself. This is especially important for Rust, because its lexical structure is pretty complicated. For example, raw string literals cannot be described by a CFG (proof).
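To make the non-CFG point concrete: the number of `#`s on each side of a raw string literal must match, so recognizing one requires counting, which a CFG cannot do for unbounded `#` counts. A tiny self-contained illustration:

```rust
fn main() {
    // With one '#', the closing delimiter is `"#`, so a bare `"` is fine inside:
    let one = r#"a " inside"#;
    // With two '#'s, even the sequence `"#` can appear in the body:
    let two = r##"a "# inside"##;
    assert!(one.contains('"'));
    assert!(two.contains("\"#"));
    println!("ok");
}
```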
Proposed solution:
Create an executable lexer specification, make sure it is run during the build of rustc, and fuzz the existing lexer against it.
What should this spec look like?
There seems to be no convenient formalism for describing lexers (each token by itself can be described by a regular expression, but the lexing algorithm often includes ad hoc rules like precedence), and nested comments and raw string literals are not regular. But I think we can just carefully spell out the lexing algorithm, because it can be made simple enough:
- Write a regular expression for each token (the declarative part)
- Loop through the regexes, match each against the start of the input, select the longest match, and continue with the rest of the input (the spell-out-the-details part)
- If the input starts with `r#*"` or `/*`, run a custom algorithm for matching raw literals and comments, and hand-wave their meaning (the messy part)
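The longest-match loop from the second bullet can be sketched in a few lines of Rust. This is only an illustration of the shape of the algorithm, not the actual spec: the matchers are hand-written closures standing in for regexes, and the token set and names (`Ws`, `Ident`, `Eq`, `EqEq`) are hypothetical.

```rust
// A matcher returns the length of a match at the start of the input, if any.
type Matcher = fn(&str) -> Option<usize>;

fn ws(s: &str) -> Option<usize> {
    let n = s.find(|c: char| !c.is_whitespace()).unwrap_or(s.len());
    if n > 0 { Some(n) } else { None }
}

fn ident(s: &str) -> Option<usize> {
    if !s.starts_with(|c: char| c.is_ascii_alphabetic() || c == '_') {
        return None;
    }
    Some(s.find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(s.len()))
}

fn eq(s: &str) -> Option<usize> {
    if s.starts_with('=') { Some(1) } else { None }
}

fn eq_eq(s: &str) -> Option<usize> {
    if s.starts_with("==") { Some(2) } else { None }
}

fn tokenize(mut input: &str) -> Vec<(&'static str, &str)> {
    let rules: [(&'static str, Matcher); 4] =
        [("Ws", ws), ("Ident", ident), ("Eq", eq), ("EqEq", eq_eq)];
    let mut tokens = Vec::new();
    while !input.is_empty() {
        // Try every rule at the current position; the longest match wins,
        // so `==` lexes as one EqEq rather than two Eqs (maximal munch).
        let (kind, len) = rules
            .iter()
            .filter_map(|&(k, m)| m(input).map(|n| (k, n)))
            .max_by_key(|&(_, n)| n)
            .expect("no rule matched");
        tokens.push((kind, &input[..len]));
        input = &input[len..];
    }
    tokens
}

fn main() {
    for (kind, text) in tokenize("a == b") {
        println!("{kind}: {text:?}");
    }
}
```

A real spec would also need a tie-breaking rule for equal-length matches (e.g. rule order), which is exactly the kind of ad hoc precedence mentioned above.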
I’ve actually written a prototype of such a spec as a literate program in Rust. It’s capable of lexing itself, but I have neither compared it with the actual lexer, nor am I sure that this simple algorithm plus the special rules can handle the whole of Rust.
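The “custom algorithm” for raw string literals from the last bullet is also small enough to sketch: count the `#`s after the `r`, then scan for a `"` followed by the same number of `#`s. The function name is hypothetical, and error handling (e.g. unterminated literals) is glossed over.

```rust
/// Returns the byte length of a raw string literal at the start of `s`,
/// or None if `s` does not start with one.
fn match_raw_string(s: &str) -> Option<usize> {
    let rest = s.strip_prefix('r')?;
    // Count the leading '#'s after `r`.
    let hashes = rest.len() - rest.trim_start_matches('#').len();
    let body = rest.get(hashes..)?.strip_prefix('"')?;
    // The literal ends at the first `"` followed by the same number of '#'s.
    let close = format!("\"{}", "#".repeat(hashes));
    let pos = body.find(&close)?;
    // r + hashes + opening quote + body + closing quote-and-hashes
    Some(1 + hashes + 1 + pos + close.len())
}

fn main() {
    assert_eq!(match_raw_string("r\"plain\" tail"), Some(8));
    assert_eq!(match_raw_string(r###"r#"a "quoted" bit"# tail"###), Some(19));
    assert_eq!(match_raw_string("not raw"), None);
    println!("ok");
}
```

Nested block comments yield to the same treatment: scan while keeping a depth counter that `/*` increments and `*/` decrements.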
TL;DR: a prototype lexer spec: https://github.com/matklad/rustlexspec/blob/master/src/lib.rs