Announcement: simple & production ready Rust un-grammar

Apanatshka · August 8, 2020, 3:20pm

This is getting off-topic, but I fiddled with the SDF3 grammar today. Starting from a very permissive grammar makes it a bit hard to get an unambiguous grammar, but I can at least refer to this commit to see the typical associativity / priority rules in action.

Apanatshka · August 8, 2020, 3:35pm

using the quote crate.

Quasiquoting is awesome, yes But I would argue that that's more structured rather than less. Sure it turns into a TokenStream, but that's not how the user works with it.

Additionally, r-a is not a traditional "batch" compiler that transforms syntax -> AST -> IR -> IR -> Intelligence, but rather a query-driven, incremental system. As such, there isn't really a place for tree refinement the way you might see in a traditional batch compiler.

That's a good point. I have yet to write a background service language tool or query engine or something like that. But in incremental systems in general I believe there can be a place for tree refinement in some incremental systems. In particular, I've worked on a refactoring of a compiler into an incremental compiler. That did still involve transformation pipelines, just with some caching in between by fitting it into an incremental pipeline framework. It's sometimes possible, and can avoid re-implementing a huge project, but there's a trade-off between incrementality, memory usage and the amount of refactoring work.

dtolnay · August 8, 2020, 5:07pm

I'll warn that you will probably run into trouble defining precedence as a linear cascade of "context-free priorities" like in the link. Rust's expression parsing is not determined by associativity / priority rules.

To illustrate, there are cases where braces sometimes turn a path expression into a struct initialization and sometimes do not, in a way that is distinct from precedence. In the following code, the expression S {} is one expression. Presumably there is an empty struct struct S {} defined somewhere which it is instantiating.

let _ = *S {};

// parsed by rustc as: `*(S {})`

But in the following, S {} is not a struct init expression.

if *S {} {}

// parsed by rustc as:
//
//    if (*S) {
//        /* empty block */
//    }
//    {
//        /* another empty block */
//    }

The choices around which way this is resolved at various syntactic positions is fairly arbitrary. Really either parse behavior could work in most positions, and language designers just decide each case based on which is more likely to be what the programmer had in mind most of the time, or based on accident of the implementation.

if return S {} {}

// parsed by rustc as:
//
//    if (return (S {})) {
//    }
//
// but could equally well have been this other arbitrary choice:
//
//    if (return S) {
//    }
//    {}

As far as I have been able to tell, the whole behavior is distinct from precedence and is not captured by assigning a priority level to the braced struct init expr in relation to other operators. This can be illustrated by return 0..S {} vs match 0..S {}. The former parses as return (0..(S {})) implying tighter precedence for struct init than .., while the latter parses as match (0..S) {} implying a contradictory tighter precedence for .. than struct init.

dtolnay · August 8, 2020, 5:27pm

A token stream is less structured than a syntax tree because it takes a huge amount of parsing work to go from token stream to syntax tree (reproducing the structure) and zero parsing work to go from syntax tree to token stream (discarding the structure).

You had been asking about the permissiveness–structure axis in the context of: "Do you find it easier to work with such a permissive structure? Is it common to accidentally construct trees that are not valid programs?". There are infinitely more ways to create a token stream that is not a valid program (even with a quoter) vs a syntax tree that is not a valid program, reflecting the almost total lack of structure in a token stream.

Apanatshka · August 9, 2020, 8:14am

This is super interesting, thanks for sharing. Those ad-hoc decisions will indeed make it much harder to reproduce what the official parser does. The nice thing about SDF3 is that it supports ambiguous grammars, so worst-case scenario is you don't find a way to encode the grammar unambiguously and just leave some things like this ambiguous, to be disambiguated after parsing by some post-processing on the AST. As long as the ambiguity does not result in exponential blowup, and I don't think it does for the cases you mention, the performance penalty will be minimal.

Apanatshka · August 9, 2020, 8:25am

There are infinitely more ways to create a token stream that is not a valid program (even with a quoter) vs a syntax tree that is not a valid program, reflecting the almost total lack of structure in a token stream.

I can see that the underlying representation of the quoter will have an influence on creating valid programs, e.g. when trying to splice in parts of the program in the wrong place. But earlier you wrote:

But even with less structure than a permissive concrete syntax tree, it's easier to programmatically produce correct TokenStream than a correct concrete syntax tree, by using the quote crate.

How do those properties combine? If a token stream as underlying representation has more ways to generate the wrong program, how does it also make it easier to produce the correct one in particular?

dtolnay · August 9, 2020, 9:07am

If I am trying to produce the data structure struct Foo<A> { x: A } as output, which of the following am I going to have more confidence is correctly producing the data structure that I intended to produce?

In my experience, obviously the first one. Sure I could mistakenly write struct<A> Foo in the first way or something else invalid, which is not representable in the second way. But the first way is still easier.

let output1 = quote! {
    struct Foo<A> { x: A }
};

let output2 = Struct {
    attrs: Vec::new(),
    vis: Visibility::Inherited,
    struct_token: Token![struct](Span::call_site()),
    ident: Ident::new("Foo", Span::call_site()),
    generics: Generics {
        lt_token: Some(Token![<](Span::call_site())),
        params: vec![
            GenericParam::Type(TypeParam {
                attrs: Vec::new(),
                ident: Ident::new("A", Span::call_site()),
                colon_token: None,
                bounds: Vec::new(),
                eq_token: None,
                default: None,
            }),
        ],
        gt_token: Some(Token![>](Span::call_site())),
        where_clause: None,
    },
    fields: Fields::Named(FieldsNamed {
        brace_token: Brace,
        named: [
            Field {
                attrs: Vec::new(),
                vis: Visibility::Inherited,
                ident: Ident::new("x", Span::call_site()),
                colon_token: Token![:](Span::call_site()),
                ty: Type::Path(TypePath {
                    qself: None,
                    path: Path {
                        leading_colon: None,
                        segments: vec![
                            PathSegment {
                                ident: Ident::new("A", Span::call_site()),
                                arguments: None,
                            },
                        ],
                    },
                }),
            },
        ],
    }),
    semi_token: None,
};

Apanatshka · August 9, 2020, 9:30am

Sure, the first one is nicer, but that's a quoter vs manual construction comparison, which is a bit of an appels vs oranges situation.

Try comparing an AST struct instantiation with a direct TokenStream definition. Or compare quote with a hypothetical typed quasiquoter that builds ASTs instead of TokenStreams. That's the kind of comparison I was thinking about. Or is the hypothetical typed quoter not a realistic option?

dtolnay · August 9, 2020, 9:52am

I am not convinced that "fix it after parsing" is viable for Rust. It isn't a matter of reassociating some operators; a huge part of the expression syntax hinges on which way the braces are decided. For something like if A {} | B | B | B {}, if your parser guesses:

if ((((A {}) | B) | B) | B) {
}

then fixing it later basically requires reparsing it all with a "real" parser, so you might as well have parsed with the "real" parser to begin with.

if (A) {
}
(|B| (B | (B {})))  // surprise closure

Apanatshka · August 9, 2020, 10:43am

That's a good example. But I think you misunderstand the parsing tech I'm using here. There is no guessing in the parser, and no reparsing required. If you have an ambiguous grammar you just get a "parse forest" instead of a "parse tree". You get both options, and in post-processing you just throw away one of the two.

The only problem you need to deal with for ambiguous SDF3 grammar is combinatorial explosion. If the ambiguity you leave in is particularly nasty you can get worst-case n^3 possible parses.

carbotaniuman · August 13, 2020, 3:58am

A hypothetical typed quoter would need to do some level of parsing itself right? While a token-tree quote literally turns characters into tokens.

Apanatshka · August 13, 2020, 6:16am

Yes, which means special compiler support. Or pre-processing, which is apparently MetaOCaml what does.

phaylon · August 28, 2020, 1:16pm

Due to a recent discussion on Reddit I was wondering: Would this un-grammar be usable for a code formatter that only concerns itself with things like indentation and non-newline spacings? I would assume so, but I thought I'd check before I go on a dive

matklad · August 28, 2020, 1:32pm

Yes!

More specifically, rust-analyzer's parser & syntax tree would be a good fit for this task I think.

Topic		Replies	Views
Proposal: Grammar Working Group compiler	44	4711	March 25, 2019
Grammar specification for Rust language design	3	3592	February 8, 2021
Grammar of rust compiler	6	5673	March 25, 2019
Macros vs incremental parsing compiler	13	2390	May 7, 2019
Discussion: Adding grammar information to Procedural Macros for proper custom syntax support in the toolchain tools and infrastructure	35	1579	June 13, 2025

Announcement: simple & production ready Rust un-grammar

Related topics