RFC - compiler refactoring - spans

(I mean RFC literally: I would like comments. I'm not claiming that this adheres to our RFC format or that it will end up in the RFC process.)

I would like to refactor the way that the compiler stores spans; in particular, I would like to move spans out of the AST and into a side table. I’m not sure that this is the best plan, but I do think we need to do something.

Motivation

Which nodes in the AST have spans is ad hoc - we basically have them where we need them for errors in the compiler. In the near future I hope more tools than just the compiler will use libsyntax and its AST and span info. Different tools will require different nodes to have spans and storing spans for everything is wasteful of memory in the common case of plain old compilation.

Spans pollute the AST. Syntax extensions and other tools which manipulate the AST have to deal with spans on AST nodes where they really shouldn’t have to.

Proposal

Remove all spans from the AST. Keep a hashtable mapping raw pointers to spans. Use the address of an AST node to look up its span in the hashtable. (This will involve a bunch of casting to raw pointers, because there is no Node type for nodes in the AST; I am using ‘node’ in the loosest sense here, not any particular Rust type. I.e., we want spans for idents, paths, expressions, items, and so forth.)

I think using AST pointers as indices into the hashtable is OK because, due to the ownership structure of our AST, an address will never be reused for another AST node. So we can use addresses for lookups even after the node itself has been deleted.
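A minimal sketch of what such a side table might look like. The `SpanTable` name, the `Span` fields, and the address-as-key encoding are all illustrative assumptions, not existing libsyntax types:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    lo: u32,
    hi: u32,
}

// Hypothetical side table keyed by the address of an AST node.
// The node type is erased into a usize so any node kind can be a key.
struct SpanTable {
    map: HashMap<usize, Span>,
}

impl SpanTable {
    fn new() -> SpanTable {
        SpanTable { map: HashMap::new() }
    }

    // Record a span for any node, using its address as the key.
    fn insert<T>(&mut self, node: &T, span: Span) {
        self.map.insert(node as *const T as usize, span);
    }

    // Look up a node's span; returns None if none was recorded.
    fn get<T>(&self, node: &T) -> Option<Span> {
        self.map.get(&(node as *const T as usize)).copied()
    }
}

fn main() {
    let mut table = SpanTable::new();
    // Stand-in for an AST node; any heap-allocated value has a stable address.
    let node = Box::new(String::from("foo"));
    table.insert(&*node, Span { lo: 0, hi: 3 });
    assert_eq!(table.get(&*node), Some(Span { lo: 0, hi: 3 }));
}
```

Note that this design only works under the ownership assumption above: if folding ever rebuilds nodes at new addresses, the keys silently go stale.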

Question: how can we make this work for idents and things which are stored by value in the AST?

Question: where is the best place to store a pointer to this hashtable in the compiler? In each of the various contexts? TLS?

Disadvantages

Finding a span for a node becomes a bit more code. I think this is a worthwhile trade-off since we don’t do it that often.

Macro expansion will have to be more careful to make sure all spans for expanded nodes are computed properly. Having the span in the node means the compiler enforces that expansion at least does something with the span. With the spans in a side table, they could just be forgotten.

Is there a better way to do this?


I don’t have a concrete opinion and haven’t looked at libsyntax well enough to have one, but I am curious if you have experience with how other compilers with reusable syntax libraries do this same thing? Like perhaps, how does clang deal with this concern?

If you haven’t already, the Clang AST documentation is worth reading.

@ahmedcharles, the current libsyntax design is largely similar to Clang’s. The primary difference is that Clang uses locations instead of spans. Clang stores locations directly in the AST and deals with the memory issue by careful encoding. (Even ignoring the span vs. location design below, our spans are larger than Clang’s.) Spans (called ranges in Clang) are constructed on demand from locations.

For example, a.x would store 6 offsets in a span based system, compared to 4 offsets in a location based system.

Field(
    Path(a, Span(0, 1)),
    Ident(x, Span(2, 3)),
    Span(0, 3))
Field(
    Path(a, Loc(0), Loc(1)),
    Ident(x, Loc(2), Loc(3)))

In a location based system, spans for fields are constructed on demand like the following:

impl Field {
    fn span(&self) -> Span {
        Span(self.path.start, self.ident.end)
    }
}
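As a runnable Rust sketch of the location-based layout above (the `Loc`, `Path`, `Ident`, and `Field` types here are illustrative stand-ins, not the real libsyntax types):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Loc(u32);

#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: Loc,
    end: Loc,
}

// Location-based layout: each child stores only its own start/end locations.
struct Path { start: Loc, end: Loc }
struct Ident { start: Loc, end: Loc }

// The field itself stores no location data at all.
struct Field {
    path: Path,
    ident: Ident,
}

impl Field {
    // The field's span is never stored; it is derived from its children.
    fn span(&self) -> Span {
        Span { start: self.path.start, end: self.ident.end }
    }
}

fn main() {
    // `a.x`: Path(a, Loc(0), Loc(1)), Ident(x, Loc(2), Loc(3))
    let field = Field {
        path: Path { start: Loc(0), end: Loc(1) },
        ident: Ident { start: Loc(2), end: Loc(3) },
    };
    assert_eq!(field.span(), Span { start: Loc(0), end: Loc(3) });
}
```

This makes the 6-offsets-vs-4-offsets comparison concrete: `Field` carries no offsets of its own, only its two children’s.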

Thanks for the info, though to clarify, my question was more along the lines of: What design guidelines do the clang devs use to determine when to add location support for something? If it’s expensive to keep this information around when it’s not used, then clang is unlikely to keep all of it around.

Ah, I see. Clang always keeps locations, period. It is not free, but they try very, very hard to make it not expensive. Since parsing is so fast (parsing the entire rustc crate takes less than 0.3 seconds), the cost is primarily memory. Whether we go for optional locations or not, we should probably copy some of Clang’s tricks for keeping locations cheap.

I’ll need to take a look at the span system in the near future too because we need to store span information to crate metadata in order to allow for proper debuginfo for cross-crate generics.

Are you sure that memory addresses are never re-used for AST nodes? I was under the impression that folding over the AST multiple times could cause addresses to be re-used, but maybe something has changed in libsyntax since I last took a closer look at it.

I’d be OK with putting spans in a sidetable but I definitely would want the key into this table to be something stable that can be serialized to metadata. The current ast::NodeId seems like a good candidate. But that would mean adding a NodeId to everything that can have a span.

While I agree that getting spans out of the AST seems like a good idea, and a sidetable seems reasonable, using raw pointers from the AST doesn’t seem like a very good idea – for one thing, the information will get lost during folding! The same is true for indices, unfortunately. This certainly requires some thought.

Do we have any data on how much memory spans actually take up? It seems to me that putting things into side tables always makes it easy for them to go out of sync. Having spans on everything, directly in the AST, might make handling them correctly much easier – and it might not be that wasteful when looking at the big picture.

Do we have any data on how much memory spans actually take up?

@huon did the exercise in July 2014 (#15594). Spans took up 12% of maximum resident memory over an entire compilation for a synthetic benchmark. This is probably worth doing again in a more rigorous manner.

Thanks for the link @sanxiyn. I’ve tried to collect some more data (which I posted there).

Not to be a broken record, but with a polymorphic IR/AST, plugins that ignore spans are guaranteed via parametricity not to mess with them by mistake. This seems much cleaner than a side table to me.

It seems from the comments that a side table is a bad idea. I see more and more reasons to want more spans though. So, it seems to me that the way forward is to have spans for each node in the AST and optimise the way we store spans. This could also be a good opportunity to improve the AST.

Proposal for the AST

Every node in the AST would be a

struct AstNode<T> {
    node: T,
    span: Span,
}

Perhaps we should add an id: NodeId field too, since it might be nice for every node to have an id. This is very close to the existing Spanned&lt;T&gt; struct. The only real difference with this proposal is that it is more uniform - every node is an AstNode (even leaf nodes such as idents, etc.) and there are no nodes which have a span anywhere except in an AstNode.

The formal type parameter would be Item or Expr, etc., the same nodes as we have today.

This has the advantage that we can convert the AST gradually, rather than in one enormous refactoring.

Optimising spans

In order to make this not use a tonne more memory, we need to optimise our spans. If we have spans for every node then I believe there are some interesting optimisations we can do.

Currently a span is 12 bytes, which with padding most likely becomes 16 bytes in most uses. (Three u32s - a start and end location (indices into the codemap) and an expansion id, which indexes into an expansion table storing information about macro expansion.)

I propose making a span a trait (or its implementation a trait) with implementations LeafSpan, InnerSpan, and ZeroSpan.

Leaf spans would store a start location (u32) and an offset (u16). Since these are leaf nodes, it is very unlikely the length of the span would overflow a u16 (we should of course check this at runtime). However, this is only useful if we can get away with a u16 expansion id (see discussion of this below), otherwise, this is still likely to get padded up to 16 bytes in most cases. If we can get away with u16 expansion ids, then we get 8 bytes with padding.
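A sketch of the compact leaf span layout described above, assuming u16 expansion ids work out. The struct name and field names are hypothetical:

```rust
// Hypothetical compact leaf span: 8 bytes total, assuming both the span
// length and the expansion id fit in a u16.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct LeafSpan {
    start: u32,   // byte offset into the codemap
    len: u16,     // leaf spans are assumed short; overflow is checked below
    expn_id: u16, // only viable if we can get away with u16 expansion ids
}

impl LeafSpan {
    fn new(start: u32, len: u32, expn_id: u16) -> LeafSpan {
        // The u16 length must be checked at runtime, as the proposal notes.
        assert!(len <= u16::MAX as u32, "leaf span too long for a LeafSpan");
        LeafSpan { start, len: len as u16, expn_id }
    }
}

fn main() {
    let span = LeafSpan::new(1024, 5, 0);
    assert_eq!(span.start, 1024);
    assert_eq!(span.len, 5);
    // The whole point: 4 + 2 + 2 = 8 bytes, no padding.
    assert_eq!(std::mem::size_of::<LeafSpan>(), 8);
}
```

With a u32 expansion id instead, the struct would pad back up toward 12-16 bytes, which is what makes the u16 expansion id question below important.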

InnerSpans are used for internal nodes of the AST. These are represented as two u16 offsets from the child nodes, plus the expansion id, so always 8 bytes. E.g., an inner node with three child leaf nodes (they could of course also be inner nodes in real life)

inner ::= 'foo' child1 ':' child2 child3 ';'

foo a: bar b;
    -  --- -  child spans
------------- inner spans

The child spans are represented as (4, 1), (7, 3), (11, 1). The inner span is stored as a negative offset from the start of the first child span and a positive offset from the end of the last child span, in this case -4 and 1.
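The inner-span arithmetic can be sketched as follows, using the example above. The `InnerSpan` field names and the `resolve` helper are illustrative assumptions:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    lo: u32,
    hi: u32,
}

// Hypothetical inner span: two small signed offsets relative to the first
// and last child spans, plus the expansion id; always 8 bytes.
#[derive(Clone, Copy)]
struct InnerSpan {
    lo_off: i16, // offset from the start of the first child span
    hi_off: i16, // offset from the end of the last child span
    expn_id: u32,
}

impl InnerSpan {
    // Reconstruct the full span on demand from the child spans.
    fn resolve(&self, first_child: Span, last_child: Span) -> Span {
        Span {
            lo: (first_child.lo as i64 + self.lo_off as i64) as u32,
            hi: (last_child.hi as i64 + self.hi_off as i64) as u32,
        }
    }
}

fn main() {
    // `foo a: bar b;` with child spans (4, 1), (7, 3), (11, 1):
    let first = Span { lo: 4, hi: 5 };  // `a`
    let last = Span { lo: 11, hi: 12 }; // `b`
    let inner = InnerSpan { lo_off: -4, hi_off: 1, expn_id: 0 };
    // -4 from the first child and +1 past the last child covers the
    // whole 13-byte production, including `foo` and the trailing `;`.
    assert_eq!(inner.resolve(first, last), Span { lo: 0, hi: 13 });
    assert_eq!(std::mem::size_of::<InnerSpan>(), 8);
}
```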

ZeroSpan is an optimisation of InnerSpan where both offsets are 0, this is common enough to be useful, for example in paths. These spans are just the size of the expansion id.

Finally, we might choose to support the current optimisation of eliding some spans by having a zero-sized NoSpan implementation. This would allow us to ensure that we never use unnecessary memory on spans, while making it much easier to add spans where they are needed (changing a generic parameter is much easier than changing the AST structure). (I would be against this, though; just mentioning that it is possible.)

This would require changing the AstNode struct to be generic in the kind of span. In order to avoid a bunch of type parameters all over the place, I propose having a system of macros and typedefs to create a type with no type parameters for each kind of AST node, e.g., type ItemNode = AstNode<Item, InnerSpan>;.

Expansion ids

These are currently a u32 in each span. I don’t know enough about how these are used to know exactly how they can be optimised. I assume that simply reducing the id size to u16 is not practical since it might overflow (note that far from every span in a program has a useful expansion id; any span which is a direct reference to a span in the source has expansion id 0). We could use one bit of the id to denote whether it is a direct id. This bit only gets set once we get close to overflow, at which point we fall back to hashing the span (including the expansion id, which, other than the ‘overflow’ bit, continues to be incremented, so that we can distinguish spans with the same start and end points) and looking up the real expansion id in a hashtable. Hopefully we hit > 2^15 expansion ids rarely enough that this does not cause a significant slowdown.
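A rough sketch of this overflow scheme, heavily simplified. The `ExpnIds` type, its key encoding, and the constant names are all hypothetical, and it glosses over collisions between ids that share low bits for the same span:

```rust
use std::collections::HashMap;

// High bit of the stored u16 marks an 'overflowed' id whose real value
// lives in a side hashtable, keyed by the span plus the stored id.
const OVERFLOW_BIT: u16 = 1 << 15;

struct ExpnIds {
    overflow: HashMap<(u32, u32, u16), u32>,
}

impl ExpnIds {
    fn new() -> ExpnIds {
        ExpnIds { overflow: HashMap::new() }
    }

    // Store a real (u32) expansion id for the span lo..hi, returning the
    // u16 that actually goes in the span.
    fn store(&mut self, lo: u32, hi: u32, real_id: u32) -> u16 {
        if real_id < OVERFLOW_BIT as u32 {
            real_id as u16 // common case: the id fits directly
        } else {
            // Keep the low bits so spans with the same endpoints but
            // different ids usually get distinct keys.
            let stored = OVERFLOW_BIT | (real_id & 0x7FFF) as u16;
            self.overflow.insert((lo, hi, stored), real_id);
            stored
        }
    }

    fn resolve(&self, lo: u32, hi: u32, stored: u16) -> u32 {
        if stored & OVERFLOW_BIT == 0 {
            stored as u32 // direct id
        } else {
            self.overflow[&(lo, hi, stored)]
        }
    }
}

fn main() {
    let mut ids = ExpnIds::new();
    let small = ids.store(0, 5, 100);
    assert_eq!(ids.resolve(0, 5, small), 100);
    let big = ids.store(0, 5, 40_000); // > 2^15, goes to the side table
    assert_eq!(ids.resolve(0, 5, big), 40_000);
}
```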

It might also be good to avoid expansion ids completely in spans that can never be the result of macro expansion. I’m not sure if that is a worthwhile optimisation though.

At some point in the past, we had more things be Spanned&lt;T&gt;, but we (partially) moved away from it, for some reason that I cannot remember.

I'm somewhat nervous about this... what's the strategy if the span does overflow? (e.g. tables::normalization::{canonical_table, compatibility_table} in libunicode are both more than 65kB of source, although I suppose that ast::Expr is probably considered internal, rather than a leaf?)

I don't think there are any such spans, since a macro can expand to mod foo { ... } (which can contain any AST nodes, AFAICT) and hence set the expansion id of anything it contains.

Yeah, expr is certainly an internal node. Leaf nodes would be idents and literals, basically (perhaps a few other such things). The only thing I can think of that might break this is string literals - it seems possible to have a string literal longer than 2^16 bytes. I’d be fairly happy just making that an error, though. If that is not possible, we could just make it a bad span of some kind - a span of that size is not really useful either for the compiler or most tools.

Good to know about the expansion ids. I wonder if there is a completely different scheme we could use which would be more memory efficient?

Quick thought: it seems to me that there will be far more “small” spans than large spans. Also, most source files will not be larger than a few megabytes, realistically. The largest single file in the Rust source code is ~717kB (libunicode/tables.rs); 5031 files are less than 64kB, with 41 larger.

So let’s say that by default, we use u16 for the offset and length, with u31 for the expansion ID. The LSB of the structure can be unset to indicate that the Span is actually a Box<BigSpan> (assuming at least even alignment), which contains full-size fields. That makes Span 64-bit in the AST nodes.
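A sketch of the inline half of that encoding, packing offset, length, and expansion id into a single u64 with the LSB as the inline/heap tag. The bit layout and function names are illustrative, and the heap `Box<BigSpan>` fallback is not modelled:

```rust
// Hypothetical packed span: offset:u16 | length:u16 | exp_id:u31 | tag:u1.
// A set LSB means an inline span; an unset LSB would mean the u64 is
// really a pointer to a heap-allocated BigSpan with full-size fields.
fn pack(offset: u16, len: u16, exp_id: u32) -> Option<u64> {
    if exp_id >= 1 << 31 {
        return None; // would have to fall back to Box<BigSpan>
    }
    Some(((offset as u64) << 48) | ((len as u64) << 32) | ((exp_id as u64) << 1) | 1)
}

fn unpack(packed: u64) -> (u16, u16, u32) {
    debug_assert!(packed & 1 == 1, "not an inline span");
    (
        (packed >> 48) as u16,                     // offset: bits 48..63
        (packed >> 32) as u16,                     // length: bits 32..47
        ((packed >> 1) & 0x7FFF_FFFF) as u32,      // exp_id: bits 1..31
    )
}

fn main() {
    let p = pack(1234, 56, 789).expect("fits inline");
    assert_eq!(unpack(p), (1234, 56, 789));
    // Values too large for their fields force the heap fallback.
    assert_eq!(pack(0, 0, 1 << 31), None);
}
```

Relying on the LSB this way assumes the heap allocation is at least 2-byte aligned, as the post notes.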

Of course, the exact division of bits should be derived from more comprehensive statistics. Ideally, there’d be an instrumented compiler dumping stats on the number of bits needed for each of those fields for all spans for, say, Servo and rustc, then divvy the storage up to minimise the number of spans that would need to be pushed out to the heap.

Another possible optimization: If the spans of two neighboring nodes join at one location — if we keep whitespace nodes, this could be the default — the end location would basically be duplicated. In this case, we could do away with the length altogether.

However, this requires that each node knows its next neighbor.

Yeah, I considered optimisations based on knowing the parent or siblings, but that means adding a pointer to each node, so it seems like that would use more memory than it saves. We also don’t keep whitespace nodes (or comments, etc.).

Bear in mind that the offsets are per crate, not per file (and items from other crates can be added into that codemap, and I think (not sure about this) can be used in spans), which could definitely put us over the 2^16 mark. Definitely worth doing some research to figure out the exact numbers to see if there is an opportunity here, though.

Some follow up stats: summing the byte sizes of the source files for the crates in rustc yields:

680         driver
5,756       libflate
12,180      grammar
15,200      librustc_bitflags
19,245      libarena
20,963      libfmt_macros
28,899      liblog
29,039      rustbook
30,810      libgraphviz
49,305      librbml
51,781      libgetopts
66,328      librustc_privacy
73,581      libtest
73,734      libterm
83,186      liballoc
103,189     compiletest
103,865     librustc_llvm
121,622     librustc_driver
122,824     libcoretest
129,016     librand
132,838     librustc_back
173,439     libserialize
229,755     librustc_borrowck
242,461     liblibc
270,359     librustc_resolve
369,664     librustdoc
670,832     libcore
771,639     libunicode
819,990     libcollections
834,098     librustc_typeck
1,295,922   librustc_trans
1,375,285   libsyntax
2,072,684   librustc
2,191,542   libstd

Comparing that to bitsize of offset/length in terms of how many crates would be “covered” without having to go to heap spans:

u16: 11 / 34 (32%)
u17: 20 / 34 (59%)
u18: 24 / 34 (71%)
u19: 26 / 34 (76%)
u20: 30 / 34 (88%)
u21: 33 / 34 (97%)
u22: 34 / 34 (100%)

So { offset: u21, length: u21, exp_id: u21, not_heap: u1 } would (at least for the first two) cover all but one crate in rustc. I don’t have servo, so I can’t check that.

Awesome, thanks for collecting that data! I think we need to build in a fair amount of leeway here, which suggests to me that offset does need to be u32 (or possibly u31/u30). I suspect we would get a far higher number of spans in the inline format by making the offset larger than the length. I like that your proposed scheme is simpler in that there is only one kind of span, although the amount of bit-twiddling that will be needed makes it a little more error prone. Whether it is more efficient probably depends on the number of ZeroSpan cases we hit under my scheme.

It would be interesting to know how high expansion ids get for each crate, too.