Announcement: simple & production ready Rust un-grammar

This is getting off-topic, but I fiddled with the SDF3 grammar today. Starting from a very permissive grammar makes it a bit hard to get an unambiguous grammar, but I can at least refer to this commit to see the typical associativity / priority rules in action.

using the quote crate.

Quasiquoting is awesome, yes :slight_smile: But I would argue that that's more structured rather than less. Sure it turns into a TokenStream, but that's not how the user works with it.

Additionally, r-a is not a traditional "batch" compiler that transforms syntax -> AST -> IR -> IR -> Intelligence, but rather a query-driven, incremental system. As such, there isn't really a place for tree refinement the way you might see in a traditional batch compiler.

That's a good point. I have yet to write a background service language tool or query engine or something like that. But in incremental systems in general I believe there can be a place for tree refinement in some incremental systems. In particular, I've worked on a refactoring of a compiler into an incremental compiler. That did still involve transformation pipelines, just with some caching in between by fitting it into an incremental pipeline framework. It's sometimes possible, and can avoid re-implementing a huge project, but there's a trade-off between incrementality, memory usage and the amount of refactoring work.

I'll warn that you will probably run into trouble defining precedence as a linear cascade of "context-free priorities" like in the link. Rust's expression parsing is not determined by associativity / priority rules.

To illustrate, there are cases where braces sometimes turn a path expression into a struct initialization and sometimes do not, in a way that is distinct from precedence. In the following code, the expression S {} is one expression. Presumably there is an empty struct struct S {} defined somewhere which it is instantiating.

let _ = *S {};

// parsed by rustc as: `*(S {})`

But in the following, S {} is not a struct init expression.

if *S {} {}

// parsed by rustc as:
//
//    if (*S) {
//        /* empty block */
//    }
//    {
//        /* another empty block */
//    }

The choices around which way this is resolved at various syntactic positions is fairly arbitrary. Really either parse behavior could work in most positions, and language designers just decide each case based on which is more likely to be what the programmer had in mind most of the time, or based on accident of the implementation.

if return S {} {}

// parsed by rustc as:
//
//    if (return (S {})) {
//    }
//
// but could equally well have been this other arbitrary choice:
//
//    if (return S) {
//    }
//    {}

As far as I have been able to tell, the whole behavior is distinct from precedence and is not captured by assigning a priority level to the braced struct init expr in relation to other operators. This can be illustrated by return 0..S {} vs match 0..S {}. The former parses as return (0..(S {})) implying tighter precedence for struct init than .., while the latter parses as match (0..S) {} implying a contradictory tighter precedence for .. than struct init.

7 Likes

A token stream is less structured than a syntax tree because it takes a huge amount of parsing work to go from token stream to syntax tree (reproducing the structure) and zero parsing work to go from syntax tree to token stream (discarding the structure).

You had been asking about the permissiveness–structure axis in the context of: "Do you find it easier to work with such a permissive structure? Is it common to accidentally construct trees that are not valid programs?". There are infinitely more ways to create a token stream that is not a valid program (even with a quoter) vs a syntax tree that is not a valid program, reflecting the almost total lack of structure in a token stream.

This is super interesting, thanks for sharing. Those ad-hoc decisions will indeed make it much harder to reproduce what the official parser does. The nice thing about SDF3 is that it supports ambiguous grammars, so worst-case scenario is you don't find a way to encode the grammar unambiguously and just leave some things like this ambiguous, to be disambiguated after parsing by some post-processing on the AST. As long as the ambiguity does not result in exponential blowup, and I don't think it does for the cases you mention, the performance penalty will be minimal.

There are infinitely more ways to create a token stream that is not a valid program (even with a quoter) vs a syntax tree that is not a valid program, reflecting the almost total lack of structure in a token stream.

I can see that the underlying representation of the quoter will have an influence on creating valid programs, e.g. when trying to splice in parts of the program in the wrong place. But earlier you wrote:

But even with less structure than a permissive concrete syntax tree, it's easier to programmatically produce correct TokenStream than a correct concrete syntax tree, by using the quote crate.

How do those properties combine? If a token stream as underlying representation has more ways to generate the wrong program, how does it also make it easier to produce the correct one in particular?

If I am trying to produce the data structure struct Foo<A> { x: A } as output, which of the following am I going to have more confidence is correctly producing the data structure that I intended to produce?

In my experience, obviously the first one. Sure I could mistakenly write struct<A> Foo in the first way or something else invalid, which is not representable in the second way. But the first way is still easier.


let output1 = quote! {
    struct Foo<A> { x: A }
};

let output2 = Struct {
    attrs: Vec::new(),
    vis: Visibility::Inherited,
    struct_token: Token![struct](Span::call_site()),
    ident: Ident::new("Foo", Span::call_site()),
    generics: Generics {
        lt_token: Some(Token![<](Span::call_site())),
        params: vec![
            GenericParam::Type(TypeParam {
                attrs: Vec::new(),
                ident: Ident::new("A", Span::call_site()),
                colon_token: None,
                bounds: Vec::new(),
                eq_token: None,
                default: None,
            }),
        ],
        gt_token: Some(Token![>](Span::call_site())),
        where_clause: None,
    },
    fields: Fields::Named(FieldsNamed {
        brace_token: Brace,
        named: [
            Field {
                attrs: Vec::new(),
                vis: Visibility::Inherited,
                ident: Ident::new("x", Span::call_site()),
                colon_token: Token![:](Span::call_site()),
                ty: Type::Path(TypePath {
                    qself: None,
                    path: Path {
                        leading_colon: None,
                        segments: vec![
                            PathSegment {
                                ident: Ident::new("A", Span::call_site()),
                                arguments: None,
                            },
                        ],
                    },
                }),
            },
        ],
    }),
    semi_token: None,
};

Sure, the first one is nicer, but that's a quoter vs manual construction comparison, which is a bit of an appels vs oranges situation.

Try comparing an AST struct instantiation with a direct TokenStream definition. Or compare quote with a hypothetical typed quasiquoter that builds ASTs instead of TokenStreams. That's the kind of comparison I was thinking about. Or is the hypothetical typed quoter not a realistic option?

I am not convinced that "fix it after parsing" is viable for Rust. It isn't a matter of reassociating some operators; a huge part of the expression syntax hinges on which way the braces are decided. For something like if A {} | B | B | B {}, if your parser guesses:

if ((((A {}) | B) | B) | B) {
}

then fixing it later basically requires reparsing it all with a "real" parser, so you might as well have parsed with the "real" parser to begin with.

if (A) {
}
(|B| (B | (B {})))  // surprise closure
2 Likes

That's a good example. But I think you misunderstand the parsing tech I'm using here. There is no guessing in the parser, and no reparsing required. If you have an ambiguous grammar you just get a "parse forest" instead of a "parse tree". You get both options, and in post-processing you just throw away one of the two.

The only problem you need to deal with for ambiguous SDF3 grammar is combinatorial explosion. If the ambiguity you leave in is particularly nasty you can get worst-case n^3 possible parses.

A hypothetical typed quoter would need to do some level of parsing itself right? While a token-tree quote literally turns characters into tokens.

Yes, which means special compiler support. Or pre-processing, which is apparently MetaOCaml what does.

Due to a recent discussion on Reddit I was wondering: Would this un-grammar be usable for a code formatter that only concerns itself with things like indentation and non-newline spacings? I would assume so, but I thought I'd check before I go on a dive :slight_smile:

Yes!

More specifically, rust-analyzer's parser & syntax tree would be a good fit for this task I think.

See also

Ah, yes. I'm subscribed to that actually, but somehow skipped over that message in my inbox :slight_smile: I guess I'll have to have a look at the ra-fmt progress over the weekend to see where/if I can help out somehow.

For some reason I thought this grammar was separate from rust-analyzer, but it of course make sense that it's in the parser/syntax parts of the project.

Thanks!

The promised post:

4 Likes

@matklad so you're aware, code blocks on that blog are unusable with a narrow (e.g. mobile) window, where they word-wrap:

The solution I expect is that the line numbers space out to indicate real lines not caused by word-wrap, though I'm used to seeing sites just overflow code blocks wider than the text width so they don't have to word-wrap.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.