Compact "token tree" representation


#1

Disclaimer: this assumes no hygiene information needs to be attached to each identifier. @nrc could clarify whether that is the case for scope sets.

In light of the proposed macro changes, I’m suggesting a radical change to our syntax processing model:

  • No AST in memory
  • Macro expansion and all pre-resolution passes work on tokens
  • Lazy parsing is done on top of the stable token API, such that it can be a library on crates.io
  • The visitor pattern is largely unaffected
  • Folding becomes more complicated, as you have to replace a sequence of tokens with another one
  • The conversion to HIR is the only actual parsing to a tree, but the HIR is flatter than the AST

(This idea was previously vaguely described on IRC as the “token soup” model.)

To actually have this model without ruining performance, I don’t believe we can use the current token trees.

I’ve done some data gathering on the --pretty=expanded rustc crate (at 85b031 and stage0).
If there is any larger/more representative crate out there, let me know.

The code and results are at https://gist.github.com/eddyb/313cc92003c27514b513.

The inherent problem with the current representation of token trees is that just over 24% of all tokens are delimiters.
With 690009 tokens split into 83594 allocations, you have an average of 8 tokens per allocation.

That is a huge waste.

Byte tags and variable-length encodings

To be able to process as many tokens as possible at once - although SIMD-accelerated pattern-matching isn’t here yet, we would still want to be able to diff large lists of tokens really fast - and also for keeping memory consumption low, the tags of each token should be as small as possible.

We currently have 46 simple tokens, 54 keywords and 9 token types that require extra metadata (3 open/close delimiters, identifiers, lifetimes and literals). That’s a total of 109 token types, which fit nicely in a byte.

The metadata could be stored:

  • inline, wasting 4 bytes for all the tokens that do not have metadata
  • in a separate array, which has similar memory usage to inline, but would keep the array of token tags compact
  • in a variable-sized fashion: inline but only for the tokens with actual metadata, using the least memory possible and compromising only on random access

The metadata for open/close delimiters would be a distance to the paired delimiter, so that you can skip over everything inside, just like you can with token trees currently.

Token ropes

To avoid actually allocating all of the tokens produced by macro expansion, we can build “ropes” that point back into the original token stream.

Example:

macro_rules! add {
    ($x:expr, $y:expr) => ($x + $y)
}
fn four() -> i32 { add!(1, 3) }

Original tokens:

"macro_rules" ! "add" {
  ( $ "x" : "expr" , $ "y" : "expr" )
  => ( $ "x" + $ "y" )
}
"fn" "four" ( ) -> "i32" { "add" ! ( "1" , "3" ) }

Expanded tokens:

"fn" "four" ( ) -> "i32" { "1" + "3" }

Expanded token ropes:

24...30: "fn" "four" ( ) -> "i32" {
34...34: "1"
19...19: "+"
36...36: "3"
38...38: "}"

As you can see, the data that would need to created by expansion grows with the number of macro invocations and $var interpolations in macro_rules RHSes, not actual tokens being interpolated.

One problem with AST-based expansion is that you have to recurse into the parsed expansion of each macro to find more macro invocations.

With the token ropes, you can always have a top-level queue that you keep pushing ranges to and have no recursion at all.

This ends up supporting the “macro tail call” pattern, where each macro expands to another macro invocation, as in that case the queue size stays constant, instead of recursing and potentially overflowing the compiler’s stack (as demonstrated by @Manishearth and @DanielKeep with the brainfuck/hodor macros).

Back to the AST

I mentioned how token trees split allocations unnecessarily, but the AST is much worse: just the 4 most common kinds of nodes (Item, Expr, Ty and Pat) add up to ~244600 nodes using over 33MB by themselves.
There are also vectors and other kinds of nodes, so the whole AST is probably using over 50MB (but it’s harder to quantify atm).

All of that means you take 25 milliseconds just to traverse the entire AST of librustc, whereas the (artificial) flat token array takes 5 milliseconds with small token tags and only goes up to 7 for 88 byte Tokens (which is how big they are currently in libsyntax, for various reasons).

So flat arrays are faster to go through, but what about the actual work you need to perform for parsing the tokens into the AST each time?

(My hope is that parsing each time takes less than visiting the cache-unfriendly AST, but that might not be the case, and if it’s not, there are other options available)

I don’t know the answer yet, I am still working on modifying libsyntax so it parses without allocating the result into memory at all.

I will keep updating this post as I get more concrete data.


#2

I think we will need to keep hygiene info in the tokens. This is not due to the new hygiene algorithm but due to the tokens in, tokens out model for syntax extensions - we need to keep the hygiene info somewhere, and since we don’t have the AST any more, tokens are the only choice.

I would also like to keep span info with tokens rather than in the AST - it is useful for tools to have all the spans, and keeping them in tokens rather than the AST seems a good way to do so (we’d probably need to compress the spans to some extent to make this feasible).

I’m unclear what the motivation for this is - aiui lexing and parsing don’t take a meaningful amount of time and the memory use is not too much of a concern. What is the goal with the refactoring?