It seems from the comments that a side table is a bad idea. I see more and more reasons to want more spans though. So, it seems to me that the way forward is to have spans for each node in the AST and optimise the way we store spans. This could also be a good opportunity to improve the AST.
Proposal for the AST
Every node in the AST would be a

```rust
struct AstNode<T> {
    node: T,
    span: Span,
}
```
Perhaps we should add an `id: NodeId` field too, since it might be nice for every node to have an id. This is very close to the existing `Spanned<T>` struct. The only real difference with this proposal is that it is more uniform: every node is an `AstNode` (even leaf nodes like idents, etc.) and there are no nodes which have a span anywhere except in an `AstNode`.
The type parameter would be instantiated with `Item` or `Expr`, etc., the same nodes as we have today.
This has the advantage that we can convert the AST gradually, rather than in one enormous refactoring.
Optimising spans
In order to make this not use a tonne more memory, we need to optimise our spans. If we have spans for every node then I believe there are some interesting optimisations we can do.
Currently a span is 12 bytes, which with padding is likely to be 16 bytes in most uses (three u32s: a start and an end location, which index into the codemap, and an expansion id, which indexes into an expansion table storing information about macro expansion).
I propose making a span a trait (or its implementation a trait) with implementations `LeafSpan`, `InnerSpan`, and `ZeroSpan`.
Leaf spans would store a start location (u32) and an offset (u16). Since these are leaf nodes, it is very unlikely the length of the span would overflow a u16 (we should of course check this at runtime). However, this is only useful if we can get away with a u16 expansion id (see discussion of this below), otherwise, this is still likely to get padded up to 16 bytes in most cases. If we can get away with u16 expansion ids, then we get 8 bytes with padding.
`InnerSpan`s are used for internal nodes of the AST. These are represented as two u16 offsets from the child nodes, plus the expansion id, so always 8 bytes. E.g., an inner node with three child leaf nodes (they could of course also be inner nodes in real life):

```
inner ::= 'foo' child1 ':' child2 child3 ';'

foo a: bar b;
    -  --- -    child spans
-------------   inner span
```

The child spans are represented as (4, 1), (7, 3), and (11, 1). The inner span is stored as a negative offset from the start of the first child span and a positive offset from the end of the last child span, in this case -4 and 1.
`ZeroSpan` is an optimisation of `InnerSpan` where both offsets are 0; this is common enough to be useful, for example in paths. These spans are just the size of the expansion id.
Finally, we might choose to support the current optimisation of eliding some spans by having a zero-sized `NoSpan` implementation. This would allow us to ensure that we never use unnecessary memory on spans, but make it much easier to add spans where they are needed (changing a generic parameter is much easier than changing the AST structure). (I would be against this though, just mentioning that it is possible.)
This would require changing the `AstNode` struct to be generic in the kind of span. In order to avoid a bunch of type parameters all over the place, I propose having a system of macros and typedefs to create a type with no type parameters for each kind of AST node, e.g., `type ItemNode = AstNode<Item, InnerSpan>;`.
Expansion ids
These are currently a u32 in each span. I don't know enough about how these are used to know exactly how they can be optimised. I assume that simply reducing the id size to u16 is not practical since it might overflow (note that far from every span in a program has a useful expansion id; any span which is a direct reference to a span in the source has expansion id 0). We could use one bit of the id to denote whether it is a direct id. This bit only gets set once we get close to overflow, at which point we fall back to hashing the span (including the expansion id, which, other than the 'overflow' bit, continues to be incremented so that we can distinguish spans with the same start and end points) and looking up the expansion id from a hashtable. Hopefully we hit > 2^15 expansion ids rarely enough that this does not cause significant slowdown.
It might also be good to avoid expansion ids completely in spans that can never be the result of macro expansion. I’m not sure if that is a worthwhile optimisation though.