Announcement: simple & production-ready Rust un-grammar

Last week I made another overhaul of rust-analyzer's syntax tree structure. Now the syntax tree itself (but not the parser, which is still hand-written) is generated from an ungrammar. An ungrammar (a made-up term) specifies the structure of a concrete syntax tree, without specifying the parser/language. It is a bit like ASDL.

Because an un-grammar doesn't deal with language ambiguities, the spec is simple and readable -- 618 lines, including blanks and comments. I think it's the first Rust grammar I've seen that fits into my head. Definitely much better than RustParser.bnf :slight_smile:

It is production ready in several senses:

  • it is the source of truth for rust-analyzer's syntax tree, so it is quite literally used in production (caveat -- rust-analyzer doesn't stress terminals yet, so there's definitely a missing comma here and there)
  • The names of the various productions were carefully chosen, mostly stolen from syn & the reference
  • I have a feeling that this spec will last for a while -- it's the fourth or fifth iteration, and this time it feels like there's nothing left to remove. For this reason, it will live in a repository separate from rust-analyzer, and each change will require publishing a new version of a Cargo package

Besides the spec itself, the linked repository also contains code to parse it into a data structure (this is how rust-analyzer's syntax tree is generated).
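To give a flavor of the notation, here is a sketch of what ungrammar rules look like (paraphrased from memory; the exact rules may differ from the real file):

```
// A rule maps each node kind to a regular expression
// over tokens ('quoted') and other nodes.
Path =
  (qualifier:Path '::')? segment:PathSegment

RetType =
  '->' Type
```

Labels like `qualifier:` name a child for code generation; the grammar says nothing about how a parser should produce these trees.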

41 Likes

That's awesome! It's quite impressive how simple the Rust grammar is.

This is fantastic, it's what I've been looking for since I started programming in Rust!

@matklad Would it be possible to rewrite it as a PEG instead? Would that solve the ambiguities? I don't know enough about parsing, so I'm honestly curious if that approach would work.

You can make this grammar specify the parser/language using various approaches. However, any such change would increase the complexity, as, fundamentally, the ungrammar in its current form just lacks the required information.

And this lack of information is an explicit design goal. This is the grammar for describing syntax tree structure, not for describing the mapping from sequences of tokens to trees.

So, the best way to nail down the language fully would be to write a separate, uglier grammar in some formalism.

3 Likes

Out of curiosity: what value does the former have without the capabilities of the latter?

We generate a whole bunch of code from that 600-line file. The nice thing is that the shape of the tree we generate corresponds exactly to the shape of the tree described in the grammar.

If you try to specify both a parser and a tree using a single grammar, you get into a situation where you need to massage the grammar in one direction to avoid conflicts, and in the opposite direction to have a reasonable syntax tree structure.

12 Likes

In computer science, this ungrammar would be called the "abstract syntax", and you can use it to formally talk about the language and define things like type checking and compilation. The mapping from "concrete syntax" to "abstract syntax" is a mostly orthogonal concern.

Nice work @matklad :slight_smile:

It’s not abstract syntax: it includes all the punctuation and parentheses, the things are ordered, etc.

If I were to specify ungrammar formally, I’d say that it describes the set of concrete syntax trees. It maps each concrete nonterminal to a regular language over terminals & nonterminals, such that:

  • for each valid syntax tree, each node’s terminal & non-terminal children match the corresponding regex
  • the regexes are minimal, in the sense that for each DFA state there exists a tree that hits that state
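As a concrete illustration of the first bullet, here is a hedged sketch (in Rust, with made-up names -- this is not rust-analyzer code) of checking a node's children against the regular language from a hypothetical rule `ParamList = '(' (Param (',' Param)*)? ')'`, written as a hand-rolled DFA:

```rust
// Child kinds a hypothetical ParamList node could contain.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    LParen,
    Param,
    Comma,
    RParen,
}

// DFA for the regex  '(' (Param (',' Param)*)? ')'  over child kinds.
fn param_list_children_ok(children: &[Kind]) -> bool {
    // States: 0 = expect '(', 1 = expect Param or ')',
    // 2 = expect ',' or ')', 3 = expect Param, 4 = done (accepting).
    let mut state = 0;
    for &k in children {
        state = match (state, k) {
            (0, Kind::LParen) => 1,
            (1, Kind::Param) => 2,
            (1, Kind::RParen) => 4,
            (2, Kind::Comma) => 3,
            (2, Kind::RParen) => 4,
            (3, Kind::Param) => 2,
            _ => return false,
        };
    }
    state == 4
}

fn main() {
    use Kind::*;
    // `(Param, Param)` matches; `(, )` does not.
    assert!(param_list_children_ok(&[LParen, Param, Comma, Param, RParen]));
    assert!(!param_list_children_ok(&[LParen, Comma, RParen]));
    println!("ok");
}
```

Each node kind would get such a regular language over its children; the minimality bullet then says every DFA state is actually reachable by some valid tree.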

Notably, an ungrammar does not guarantee that, if a concrete syntax tree matches all the rules, then there exists a string of terminals which parses into that concrete syntax tree.

Actually, I am not sure what “this grammar” refers to, so I might be arguing with a statement that hasn’t been made :slight_smile: I’ll write a post which introduces ungrammars properly one day: it’s a useful tool for practical language development, which I haven’t seen used before.

3 Likes

When you do, can you please post a link here? I'm extremely interested in how parsing works, but have no experience, so am constantly looking out for more information.

5 Likes

The abstract syntaxes (syntaxen? syntaxs?) that I've seen are usually ordered. But fair point about punctuation and parentheses. This sits somewhere between your typical abstract and concrete syntax.

"syntaxes" (by a native speaker with over 50 years experience as a part-time paid proofreader and editor of technical material)

7 Likes

Why don't you have that guarantee? Is it just possibly missing parentheses in expressions, or is there more to it?

In general, this looks like a "high-level" or "readable" context-free grammar for Rust. By high-level I mean (and I think this is common parlance in texts about grammars) without disambiguation info shoehorned in for a particular LL or LR subset, which would obscure the actual language structure. Maybe those keywords help.

There are grammar formalisms that do something like this and have separate disambiguation rules. If you want to play around with executing your ungrammar as a parser, you could try turning it into SDF3 and running it in Spoofax. Some students at our uni attempted to write a Rust grammar with Spoofax, but I think that's a private repo until they've gotten a grade, so I can't link it here.

1 Like

Why don't you have that guarantee? Is it just possibly missing parentheses in expressions, or is there more to it?

This gives a more useful CST in practice. It should be possible to write an ungrammar that enforces precedence rules, forbids struct literals in conditions, etc.

But, in terms of writing an IDE, a more useful model is one where there’s just a single “BinExpr” node type with an “operator” token, rather than a whole tower of mostly isomorphic types.
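For illustration, a minimal sketch of that single-node approach (the names are illustrative only, not rust-analyzer's actual generated API):

```rust
// One permissive BinExpr node with an operator token, instead of a
// tower of AddExpr / MulExpr / ... types, one per precedence level.
#[derive(Debug)]
enum Expr {
    Literal(i64),
    Bin(BinExpr),
}

#[derive(Debug)]
struct BinExpr {
    lhs: Box<Expr>,
    op: &'static str, // in a real CST this would be the operator token
    rhs: Box<Expr>,
}

// An IDE can treat every binary expression uniformly, regardless of operator.
fn op_count(e: &Expr) -> usize {
    match e {
        Expr::Literal(_) => 0,
        Expr::Bin(b) => 1 + op_count(&b.lhs) + op_count(&b.rhs),
    }
}

fn main() {
    // `1 + 2 * 3`: precedence is recorded via nesting, not via node types.
    let e = Expr::Bin(BinExpr {
        lhs: Box::new(Expr::Literal(1)),
        op: "+",
        rhs: Box::new(Expr::Bin(BinExpr {
            lhs: Box::new(Expr::Literal(2)),
            op: "*",
            rhs: Box::new(Expr::Literal(3)),
        })),
    });
    assert_eq!(op_count(&e), 2);
    println!("{:?}", e);
}
```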

I understand that the point of your formalism is to write nicely shaped CSTs and not to derive a parser from it. I was more interested in why printing the tree back to text might not give a text that parses into the same CST. Looks like it's not just precedence/priority, but that you've also written something a bit looser than the actual grammar. I suppose that makes the grammar simpler to understand, but isn't there a danger that you would start writing support for things in rust-analyzer based on the shape of your CST without those things being allowed in Rust? You might run into edge cases that are difficult to analyze but aren't allowed in Rust anyway...

If anyone's interested, I did a quick & dirty translation to SDF3 here. It's not pretty, just systematically translated, and it should give more or less the same shape of trees. Obviously SDF3 has a slightly different purpose (ASTs vs. CSTs, deriving a parser, the shape of the AST should fit the rest of the Spoofax ecosystem, etc.). I might refactor it into a grammar that's more idiomatic SDF3. Once I've done that I can add disambiguation rules; as translated, the ungrammar is not amenable to the high-level disambiguation rules that SDF3 has.

3 Likes

A concrete syntax tree does not have the role of controlling what is and is not valid syntax; that is the parser's job, and a concrete syntax tree is not a parser. A syntax tree is only designed to represent all valid syntax in a useful way, without much consideration for invalid syntax.

That means you can manually instantiate a syntax tree with invalid syntax. When you print to text and attempt to parse that using a real parser, your parser is not supposed to accept invalid syntax.

Here is an example of one invalid "program" that the concrete syntax tree derived from this un-grammar would be able to represent. (That's fine and expected! Un-grammar is a DSL for defining syntax trees.)

#[a :: :: 0]
#![<<<super>>>]
fn f(!);
Fn {
  attrs: [
    Attr {
      "#",
      "[",
      path: Path {
        qualifier: Path {
          segment: PathSegment {
            name_ref: "a",
          },
        },
        "::",
        segment: PathSegment {
          "::",
          name_ref: "0",
        },
      },
      "]",
    },
    Attr {
      "#",
      "!",
      "[",
      path: Path {
        segment: PathSegment {
          "<",
          path_type: Path {
            segment: PathSegment {
              "<",
              path_type: Path {
                segment: PathSegment {
                  "<",
                  path_type: Path {
                    segment: PathSegment {
                      name_ref: "super",
                    },
                  },
                  ">",
                },
              },
              ">",
            },
          },
          ">",
        },
      },
      "]",
    },
  ],
  "fn",
  name: "f",
  param_list: ParamList {
    "(",
    param: Param {
      type: NeverType {
        "!",
      },
    },
    ")",
  },
  ";",
}

A real parser would reject this for lots of reasons.

  • The colons in #[a :: :: 0] are screwy.

    error: expected identifier, found `::`
     --> src/main.rs:1:8
      |
    1 | #[a :: :: 0]
      |        ^^ expected identifier
    
  • Even if the colons weren't screwy, an integer is not allowed as a path segment.

    error: expected identifier, found `0`
     --> src/main.rs:1:11
      |
    1 | #[a ::    0]
      |           ^ expected identifier
    
  • Inner attributes after outer attributes is invalid syntax.

    error: an inner attribute is not permitted following an outer attribute
     --> src/main.rs:2:1
      |
    1 | #[a :: :: 0]
      | ------------ previous outer attribute
    2 | #![<<<super>>>]
      | ^^^^^^^^^^^^^^^ not permitted following an outer attribute
    
  • < is not valid at the start of an attribute path.

    error: expected identifier, found `<<`
     --> src/main.rs:2:4
      |
    2 | #![<<<super>>>]
      |    ^^ expected identifier
    
  • <super> is not a valid path without a rest part in <type>::rest syntax.

    error: expected `::`, found `>`
     --> src/main.rs:2:13
      |
    2 | #![<<<super>>>]
      |             ^ expected `::`
    
  • ! is not a valid name for a function parameter.

    error: expected parameter name, found `!`
     --> src/main.rs:3:6
      |
    3 | fn f(!);
      |      ^ expected parameter name
    
  • Pattern parameters without a type are only legal syntax in closures, not functions.

    error: expected one of `:`, `@`, or `|`, found `)`
     --> src/main.rs:3:7
      |
    3 | fn f(!);
      |       ^ expected one of `:`, `@`, or `|`
      |
      = note: anonymous parameters are removed in the 2018 edition (see RFC 1685)
    
9 Likes

Ok, I can see how that works. Interesting... This is quite a different viewpoint from my usual one on syntax trees and intermediate representations. Do you have experience working with CSTs like these? Do you find it easier to work with such a permissive structure? Is it common to accidentally construct CSTs that are not valid programs? Or do I misunderstand, and are these CSTs purely for consumption, produced by a parser?

Context: My experience with writing language tools is limited to the Spoofax toolchain and a few handwritten tools in Haskell and Rust. I typically want to do transformations on abstract syntax trees. My pipeline is basically: permissive grammar (permissiveness depends on how good the parser tooling / handwritten code is at error messages and error recovery), then some cleanup (remove parser hacks from tree, desugar, normalize), static semantics checking (names, types, etc.), then transform/interpret/whatever. With every step in the pipeline I ideally get a tree with a richer type, a stricter subset. That way the type system can help me: I can leverage static guarantees of valid programs in later steps of the pipeline. The idea of an intermediate representation that is more permissive than even the parser is a foreign concept to me :slight_smile:

I don't have experience with the rust-analyzer syntax tree, but I have a lot with Syn's syntax tree (https://docs.rs/syn/1.0.38/syn/struct.File.html), which is also a permissive design, though quite a bit less permissive than the un-grammar, I think.

For Syn, the concrete syntax tree is almost always only used for consumption, produced by a parser. For code generation, proc macros do not tend to produce syntax trees; they produce a less structured representation (https://doc.rust-lang.org/stable/proc_macro/struct.TokenStream.html). But even with less structure than a permissive concrete syntax tree, it's easier to programmatically produce a correct TokenStream than a correct concrete syntax tree, by using the https://github.com/dtolnay/quote crate.

1 Like

Additionally, it's worth noting how r-a handles its syntax tree(s).

At the root level, it has a maximally permissive, homogeneously typed tree (Rowan), which is an immutable, persistent tree.

There are (potentially many) views of the syntax tree with more type information, but they all are views of the same untyped tree.
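A minimal sketch of that layering (the names here are illustrative; the real implementation is the rowan crate's `SyntaxNode` plus generated `AstNode` wrappers):

```rust
// One homogeneous, untyped node type for the whole tree...
#[derive(Debug, PartialEq)]
enum SyntaxKind {
    Fn,
    Name,
    ParamList,
}

#[derive(Debug)]
struct SyntaxNode {
    kind: SyntaxKind,
    children: Vec<SyntaxNode>,
}

// ...and typed "views": no extra data, just a kind-checked wrapper
// (in the spirit of rowan's AstNode::cast).
struct FnDef(SyntaxNode);

impl FnDef {
    fn cast(node: SyntaxNode) -> Option<FnDef> {
        if node.kind == SyntaxKind::Fn {
            Some(FnDef(node))
        } else {
            None
        }
    }

    fn name(&self) -> Option<&SyntaxNode> {
        self.0.children.iter().find(|c| c.kind == SyntaxKind::Name)
    }
}

fn main() {
    let node = SyntaxNode {
        kind: SyntaxKind::Fn,
        children: vec![SyntaxNode { kind: SyntaxKind::Name, children: vec![] }],
    };
    // The view is a zero-cost lens over the same untyped node.
    let f = FnDef::cast(node).expect("kind was checked");
    println!("has name: {}", f.name().is_some());
}
```

Because every view just wraps the untyped node, many different typed views can coexist over the same underlying tree.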

Additionally, r-a is not a traditional "batch" compiler that transforms syntax -> AST -> IR -> IR -> Intelligence, but rather a query-driven, incremental system. As such, there isn't really a place for tree refinement the way you might see in a traditional batch compiler.

2 Likes