Macros vs incremental parsing

This is a reply to this GitHub comment, as well as a report on my current experiments with rust-analyzer. I fully agree that going from spans to opaque ids seems to be a good idea! This is what rust-analyzer is using for macros (well, will be using once #724 is merged).

Specifically, the setup is as follows:

TokenTrees are used strictly for interfacing with macros. The “AST” (which is actually a Concrete Syntax Tree) is built directly from text (to account for comments & whitespace).

To do macro expansion, there’s an explicit step that converts the CST of the macro argument to the enum TokenTree { Leaf(Token), Subtree(Vec<TokenTree>) } representation. Because macros care about the identity of tokens, tokens (currently only idents) are assigned incrementing ids in the process of transforming the CST into TokenTrees. That is, ids are local to each macro call.
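
A minimal sketch of what this id assignment might look like. This is not rust-analyzer's real code: the `Token` type, the `id` field, and the `tokens_to_tree` helper are all made up for illustration, and the ident check is deliberately naive.

```rust
#[derive(Debug, Clone, PartialEq)]
enum TokenTree {
    Leaf(Token),
    Subtree(Vec<TokenTree>),
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    /// Some(id) for idents, whose identity macros care about.
    id: Option<u32>,
}

/// Walk a flat list of source tokens, assigning incrementing ids to idents.
/// Ids are local to one macro call: they restart at 0 for every invocation.
fn tokens_to_tree(texts: &[&str]) -> TokenTree {
    let mut next_id = 0;
    let leaves = texts
        .iter()
        .map(|t| {
            // Naive ident check, just for the sketch.
            let is_ident = t.chars().all(|c| c.is_alphanumeric() || c == '_')
                && t.chars().next().map_or(false, |c| !c.is_numeric());
            let id = if is_ident {
                let id = next_id;
                next_id += 1;
                Some(id)
            } else {
                None
            };
            TokenTree::Leaf(Token { text: t.to_string(), id })
        })
        .collect();
    TokenTree::Subtree(leaves)
}
```

Note how prepending a token shifts every subsequent id, which is exactly the non-incrementality discussed below.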

A macro then transforms one TokenTree into another TokenTree, taking care to maintain token identities where necessary.

The TokenTree is then parsed back into a CST. This step is pretty messy at the moment, because it basically goes via to_string and does not preserve “structured” hygiene information (though we do remember which source text range maps to which target text range, for IDE purposes).

However, we don’t try to be incremental “within” the macro: adding a token to the start of a macro invocation just shifts all the ids. In fact, although the parser itself is incremental, we don’t try to match pre-edit and post-edit trees at all: trees are assumed to change completely after every edit. This works pretty well with salsa nonetheless: the trick is to place a mediating query between the syntax tree and further analysis.

For example, to do name resolution, we first extract a (position-independent) set of imports and items declared in a file, and then run resolution on top of that. The first phase, lowering syntax trees to a set of imports and items, is repeated after every change, but is fast, and happens for a single file only. The second phase is executed only when the results of the first phase change (i.e., if you actually add an import). This setup does not require precise tracking of changes: even if you, say, rewrite a file completely by switching branches, if the set of declared items stays the same, name resolution is reused.
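
A toy sketch of the mediating-query idea. This is not salsa's actual API: `Db`, `lower`, and `resolve` are invented for illustration, and the "lowering" just scrapes `fn <name>` tokens out of the text. The point is only the shape: the cheap phase runs on every edit, the expensive phase runs only when the cheap phase's output changes.

```rust
use std::collections::BTreeSet;

struct Db {
    last_items: Option<BTreeSet<String>>,
    resolution_runs: u32,
}

impl Db {
    fn new() -> Db {
        Db { last_items: None, resolution_runs: 0 }
    }

    /// Phase 1: cheap, position-independent, re-run after every edit.
    /// Hypothetical lowering: treat every `fn <name>` as a declared item.
    fn lower(text: &str) -> BTreeSet<String> {
        let mut items = BTreeSet::new();
        let mut words = text.split_whitespace();
        while let Some(w) = words.next() {
            if w == "fn" {
                if let Some(name) = words.next() {
                    items.insert(name.trim_end_matches("()").to_string());
                }
            }
        }
        items
    }

    /// Phase 2: expensive, guarded by the mediating query. It only re-runs
    /// when the item set actually changed.
    fn resolve(&mut self, text: &str) {
        let items = Db::lower(text);
        if self.last_items.as_ref() != Some(&items) {
            self.last_items = Some(items);
            self.resolution_runs += 1; // stand-in for real name resolution
        }
    }
}
```

With this shape, a whitespace-only edit (or a full rewrite with the same declarations) re-runs lowering but reuses the resolution result.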

Although the bread and butter of most IDEs is being lazy and skipping large chunks of the source code completely, this won’t work for Rust. The reason is that name resolution is a fixed-point iteration algorithm, intertwined with macro expansion, which you more or less have to run for a whole crate at a time. Luckily, salsa + mediating queries should handle this pretty well: if you don’t change the top-level structure, nothing is recomputed; if you add a new struct, name resolution is recomputed, but macro expansions are not affected, etc. Hopefully, even editing code inside a top-level macro, like lazy_static, will require only this macro to be re-executed, with a caveat.

Caveat:

A major requirement for an incremental parser is error tolerance. If the user added let x =, that shouldn’t break the parse tree before or after. That’s easy to achieve in a hand-written parser; it’s really not rocket science. However, I am not sure how this would work with procedural macros. Say you’ve implemented lazy_static as a proc macro. Now the user types let x = inside the lazy_static invocation. If you use something like syn, then it will (I think) just refuse to parse this invalid expression, and so, globally, you’ll transition from the state where there is a top-level item declared by the macro to the state without one. So you need to re-run name resolution, etc… Error recovery inside macros would be a fun problem to solve once we sort out the non-macro stuff!

BTW, if anyone finds this stuff fascinating, check this rust-analyzer issue and PR which start to tie together the unified “parsing/macro expansion/name resolution” frontend:


Regarding the case you mentioned where adding let x = to the lazy_static invocation causes a top-level item to be removed, I see two possible solutions:

  1. When the lazy_static macro invocation fails, report the error, but don’t remove the existing output that the macro was producing. i.e. the top-level item that was previously being produced by the macro remains as-is until the code is edited further to the point where it no longer produces an error.
  2. Make adding and removing top-level items more incremental so that it isn’t a performance problem. If instead of propagating “top-level items have changed”, it propagated “foo has been removed from the top level”, then any queries that looked up “foo” would be invalidated - but that probably wouldn’t be so wide-reaching. There’s probably lots of complexity here that I don’t appreciate, which perhaps makes this impractical.

I gather that Salsa, when an input to X changes, recomputes X from scratch. This definitely simplifies things, but I wonder which stages might be able to do better with something more incremental. Linking in particular could do much better if it were given a big collection of symbols and their definitions to link; then when a function gets edited, a new definition of that symbol is provided to the linker, which can then just fix up relocations in and to that function. Linking is on my mind since I’ve been trying to write an incremental ELF linker, but I suspect other stages might benefit from knowing which of their inputs have changed as well. E.g. we probably want to display a list of all errors / warnings. Going to each declared item in the system and asking it for its errors every time this changes seems expensive. Hmm, I think I’m getting off-topic again.

How do you report errors (e.g. duplicate name) if you don't have position information while doing name resolution?

Edit: actually, to answer my own question, I'm guessing errors would be produced by a separate function. But presumably that function would need access to spans, at least for nodes that produce errors.

Edit2: so I'm wondering, if spans or opaque IDs all shift when a minor edit is made, can you avoid redoing all error checking for that file?

I think that if error checking spat out “untranslated” errors, then you could have a translation step that resolved the abstract ids to concrete spans to render the error, and only that step would have to be redone so long as the edit was trivial.
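
As a sketch of that split (all types here are hypothetical, not rust-analyzer's): error checking produces errors keyed only by stable opaque ids, and a separate, cheap translation step resolves ids to concrete offsets. After a trivial edit, only the id-to-offset map changes, so the untranslated errors can be reused verbatim.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct UntranslatedError {
    node_id: u32, // stable opaque id, survives re-parses
    message: String,
}

/// The translation step: resolve opaque ids to text offsets for rendering.
/// Errors whose node no longer exists are silently dropped in this sketch.
fn translate(
    errors: &[UntranslatedError],
    offsets: &HashMap<u32, usize>,
) -> Vec<(usize, String)> {
    errors
        .iter()
        .filter_map(|e| offsets.get(&e.node_id).map(|&o| (o, e.message.clone())))
        .collect()
}
```

Only `translate` needs to re-run when spans shift; the (potentially expensive) computation that produced the `UntranslatedError`s does not.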

So the “untranslated” errors would contain only the opaque IDs, not the actual spans - but if the opaque IDs all change every time there’s a small edit, it’ll still invalidate even the untranslated errors right?

Yeah, "could we be more incremental" is the question that comes up often. My current view (which changed after implementing rust-analyzer) is that, at least for IDE stuff, salsa is exactly what we need.

Initially, I thought that some kind of push-based invalidation or "monoid cached trees" is necessary. I've created an issue for it (Push-based query invalidation · Issue #41 · salsa-rs/salsa · GitHub).

<aside>

What is the proper name for "monoid cached trees"? This is one description of the data structure: Segment Tree | Sum of given range - GeeksforGeeks. In Russian we unambiguously call it "segment tree" in competitive programming. However Wikipedia describes "segment tree" as a completely different data structure: Segment tree - Wikipedia, so I don't feel comfortable using this term in english. Monoid Tree seems like a perfect name, but the sole reference to the term is this tweet by pcwalton: https://twitter.com/pcwalton/status/798420969975058432. Raph Levien calls this "Map Reduce": Rope science, part 1 - MapReduce for text.
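
For reference, a minimal sum-monoid version of the structure being discussed (a generic sketch, not tied to any of the linked implementations): each internal node caches the monoid product (here, a sum) of its subtree, so a point update recomputes only O(log n) cached values and a range query is also O(log n).

```rust
struct MonoidTree {
    n: usize,
    tree: Vec<i64>, // implicit binary tree; leaves live at indices n..2n
}

impl MonoidTree {
    fn new(values: &[i64]) -> MonoidTree {
        let n = values.len();
        let mut tree = vec![0; 2 * n];
        tree[n..2 * n].copy_from_slice(values);
        for i in (1..n).rev() {
            tree[i] = tree[2 * i] + tree[2 * i + 1];
        }
        MonoidTree { n, tree }
    }

    /// Point update: recompute only the caches on the path to the root.
    fn update(&mut self, mut i: usize, value: i64) {
        i += self.n;
        self.tree[i] = value;
        while i > 1 {
            i /= 2;
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1];
        }
    }

    /// Sum over the half-open index range [l, r).
    fn query(&self, mut l: usize, mut r: usize) -> i64 {
        l += self.n;
        r += self.n;
        let mut sum = 0i64;
        while l < r {
            if l % 2 == 1 {
                sum += self.tree[l];
                l += 1;
            }
            if r % 2 == 1 {
                r -= 1;
                sum += self.tree[r];
            }
            l /= 2;
            r /= 2;
        }
        sum
    }
}
```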

</aside>

Now I don't think this is necessary, at least for the IDE use-case. The main reason is that modern top-class IDEs are not really incremental; they are mostly just lazy. This matches my experience with IntelliJ, this comment, and what I know about TypeScript.

In rust-analyzer, for example, we have a query which resolves all imports of all modules in a crate. We recompute it completely from scratch every time you add a new import or top-level item. I haven't measured how long it takes to compute the query, simply because completion just feels instantaneous anyway! I think there's a certain cognitive bias here: although from a Big-O point of view this seems pretty terrible, the actual amount of data one needs to process is tiny, for two reasons:

  • it's source code typed by humans, it's not "big data"
  • there are powerful ways to cut down on the amount of stuff you need to look at. Typically this is laziness, but for Rust it is less effective than for other languages. However, Rust has crates, and they can be used to limit the scope of analysis very efficiently.

When the lazy_static macro invocation fails, report the error, but don’t remove the existing output that the macro was producing

This is an interesting (and somewhat "obvious") idea, but I feel it won't work out too well. In the current model, there's always a "current" state of the world, and you answer queries about it. This is a nice "pure function" model: you put a bunch of source files in, you get a specific answer out. There may be clever laziness and incrementality hidden inside, but the API of the model is simple. Explicitly accounting for known-outdated results seems to add a whole new axis of complexity.

Make adding and removing top-level items more incremental

This is what I would call the "monoid tree" approach: if you compute a set of items declared in a file, it should be easy to incrementally update various unions of sets of items. Unfortunately, the semantics of Rust make this pretty complicated in practice. The reason is that the Rust name resolution algorithm is fundamentally a fixed-point iteration algorithm: imports can import from other imports, and glob imports create "links" between modules. Making these kinds of algorithms incremental requires something like differential dataflow, which is quite a bit of essential complexity. And in Rust, name resolution is intertwined with macro expansion, which uses a rather weird set of rules (iirc, macro_rules was explicitly a "stop-gap" solution), so there's a huge bit of accidental complexity as well.
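
To make the fixed-point nature concrete, here is a deliberately tiny sketch (module names, item names, and the `resolve` function are all made up; real resolution also handles globs, visibility, and macros): an import can only be applied once the name is visible in its source module, so imports-of-imports force repeated passes until nothing changes.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Module name -> set of item names visible in that module.
type Scopes = BTreeMap<&'static str, BTreeSet<&'static str>>;

/// Each import is (target module, source module, item name). Keep applying
/// imports until no module's scope grows: that's the fixed point.
fn resolve(
    mut scopes: Scopes,
    imports: &[(&'static str, &'static str, &'static str)],
) -> Scopes {
    loop {
        let mut changed = false;
        for &(target, source, item) in imports {
            let visible = scopes.get(source).map_or(false, |s| s.contains(item));
            if visible && scopes.entry(target).or_default().insert(item) {
                changed = true;
            }
        }
        if !changed {
            return scopes; // fixed point reached
        }
    }
}
```

A single pass is not enough: if module `c` imports from `b`, which in turn imports from `a`, the `c` import only succeeds on the second pass. This is the property that makes per-file incremental updates hard.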

Note that, for example, IntelliJ for Java does use something like a "monoid tree" in this case. In Java, each class has a fully qualified name (which you can learn from a single file, because each file contains a package declaration), so IntelliJ builds an index from fully qualified names to classes, and updates this index incrementally when you edit files.
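
A sketch of that index shape (hypothetical types, not IntelliJ's implementation): because a single file determines the fully qualified names it contributes, re-indexing after an edit only needs to remove that file's old entries and insert its new ones.

```rust
use std::collections::HashMap;

struct FqnIndex {
    /// fully qualified class name -> file it lives in
    by_fqn: HashMap<String, String>,
    /// file -> fqns it currently contributes (so stale entries can be removed)
    by_file: HashMap<String, Vec<String>>,
}

impl FqnIndex {
    fn new() -> FqnIndex {
        FqnIndex { by_fqn: HashMap::new(), by_file: HashMap::new() }
    }

    /// Re-index a single file: drop its old entries, insert the new ones.
    /// No other file's entries are touched.
    fn update_file(&mut self, file: &str, package: &str, classes: &[&str]) {
        for stale in self.by_file.remove(file).unwrap_or_default() {
            self.by_fqn.remove(&stale);
        }
        let mut fqns = Vec::new();
        for class in classes {
            let fqn = format!("{}.{}", package, class);
            self.by_fqn.insert(fqn.clone(), file.to_string());
            fqns.push(fqn);
        }
        self.by_file.insert(file.to_string(), fqns);
    }

    fn lookup(&self, fqn: &str) -> Option<&str> {
        self.by_fqn.get(fqn).map(|s| s.as_str())
    }
}
```

Rust lacks the "one file tells you the full path" property in general (inline modules, `#[path]`, macros), which is part of why this trick doesn't transfer directly.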


Right, yeah. Specifically, errors refer to opaque IDs which are stable across re-parses. We have (hand-waving) a separate function "compute errors for file" which binds these opaque IDs to offsets. Currently we don't memoize this function at all, because we only need to show offsets for opened files, and there are few of them. If we memoize this function, though, it will have to be recomputed only for changed files.

So you’re generating stable opaque IDs by interning the named path to the item + token number within the item? I.e., the opaque IDs should only change within the item that was edited, which seems reasonable. How does this handle unmatched braces etc.? E.g., if I have:

```
fn a() {...}
fn b() {...}
fn c() {...}
```

and I edit b as:

```
fn b() {
  if (...) {
    // <- User still typing here
}
```

Is it going to pull c (and everything after it) inside of b? Because presumably that would cause c to change to b::c which would then change all the opaque IDs within? I guess once you add the closing brace, the IDs should then change back, so perhaps it’s not too bad - although that might mean that completions for items later in the file would be incorrect until you added the closing brace. Or maybe you already have something to handle this?

With an incremental parser, I would imagine that since the change was entirely contained within b (and existing content after b was without parse error) that it would get to the end of b and report “Missing }”, leaving everything after b alone.

So, I generate IDs only for things inside the macro invocation. If inside a macro invocation you have unmatched braces, the macro invocation is malformed and you can't expand it at all. So unmatched braces are bad, but not for the "shifting ID" reason; rather, for the "macros can expand only valid code" reason. I hope we can fix this somehow as well, by being clever when transforming a partial syntax node into a token tree, but that is far-future work.

For non-macro code, the parser indeed has powerful error recovery, but unmatched {} are the worst case. The reason is that, in your example, there's no easy rule for the parser to know that } is missing in b. After all, it could have been

```
fn a() { ... }
fn b() {
fn c() { ... }
}
```

Very naive question: can it guess based on indentation? (and only use this guess for caching/warning purposes of course)

We certainly can add indentation-based heuristics, yes, though I’ve personally never tried that. I am also not sure that this is a significant problem worth solving. IntelliJ, which for me is a gold standard here, doesn’t do such heuristic-based recovery, and works fine without it.

If I'm editing code, starting from a valid program, and I make it invalid by typing half of something, I'd want my IDE to be as conservative as possible with respect to what gets invalidated. Where I'm going to eventually put the closing brace is indeed unknown, but if we assume it's going to go at the end of the file, that will cause a lot of churn and possibly surprising results. Say in this example:

```
struct Foo {}
mod m1 {
  fn f1() {
    Foo::n // <- User typing here

impl Foo {
  fn new() -> Foo {...}
}
```

The last time the program was valid, Foo::new() was defined. But if we assume that f1 and m1 don't get closed until the end of the file, then the impl Foo is invalid (no Foo in current scope) and Foo::new() no longer exists, so the autocomplete that the user was hoping for doesn't show up.

Here, the principle of "when in error, invalidate as little as possible" helps both to reduce computation time and to give results that are more likely to be what the user expected.

Making parser recovery smarter is one solution to this problem. I haven't actually tried implementing indentation-based heuristics. They might work in practice, but I am a bit skeptical because it's not grammar-based recovery: you really can put } almost anywhere to fix the code, so you can't guess the correct position from the abstract syntax alone.

Another solution to this problem (what IntelliJ is using) is to make the unbalanced-braces situation as transient as possible. The simplest approach is to add a closing brace every time an opening brace is typed (this is what IntelliJ Rust is currently using). A slightly more magical solution (employed by Kotlin & Java) is to suppress auto-pairing of } in certain contexts, but insert } automatically on Enter (suppression and on-enter for Kotlin). This allows for a smoother typing feeling while still maintaining balanced {, }.

The fact that temporarily unbalanced {} do not change resolution unless there's an inline mod also helps :)

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.