RFC - compiler refactoring - spans

Some follow up stats: summing the byte sizes of the source files for the crates in rustc yields:

680         driver
5,756       libflate
12,180      grammar
15,200      librustc_bitflags
19,245      libarena
20,963      libfmt_macros
28,899      liblog
29,039      rustbook
30,810      libgraphviz
49,305      librbml
51,781      libgetopts
66,328      librustc_privacy
73,581      libtest
73,734      libterm
83,186      liballoc
103,189     compiletest
103,865     librustc_llvm
121,622     librustc_driver
122,824     libcoretest
129,016     librand
132,838     librustc_back
173,439     libserialize
229,755     librustc_borrowck
242,461     liblibc
270,359     librustc_resolve
369,664     librustdoc
670,832     libcore
771,639     libunicode
819,990     libcollections
834,098     librustc_typeck
1,295,922   librustc_trans
1,375,285   libsyntax
2,072,684   librustc
2,191,542   libstd

Comparing that to bitsize of offset/length in terms of how many crates would be “covered” without having to go to heap spans:

u16: 11 / 34 (32%)
u17: 20 / 34 (59%)
u18: 24 / 34 (71%)
u19: 26 / 34 (76%)
u20: 30 / 34 (88%)
u21: 33 / 34 (97%)
u22: 34 / 34 (100%)

So { offset: u21, length: u21, exp_id: u21, not_heap: u1 } would (at least for the first two) cover all but one crate in rustc. I don’t have servo, so I can’t check that.

Awesome, thanks for collecting that data! I think we need to build in a fair amount of leeway here, which suggests to me that offset does need to be u32 (or possibly u31/u30). I suspect we would get a far higher number of spans in the inline format by making the offset larger than the length. I like that your proposed scheme is simpler in that there is only one kind of span, although the amount of bit-twiddling that will be needed makes it a little more error prone. Whether it is more efficient probably depends on the number of ZeroSpan cases we hit under my scheme.

It would be interesting to know how hight expansion ids get for each crate too.

No problem; it was a few minutes with bash, awk (which I'd never used before) and Python.

Yeah; this is really something where you want to instrument the compiler to just dump all the spans in the AST, then find an optimal distribution of bits. Why guess when you can measure? :stuck_out_tongue:

Also, my original thought was that because spans expand as you go up the tree, you want offset and length to be the same... but you're right, it probably makes more sense to bias toward offset in order to capture more total spans on the stack.

Ideally, you'd hide the bit-twiddling behind a safe API. Then you just need to get it right once and you're golden.

Dammit, I have a half-written blog post about span encodings and analyzed data from an instrumented compiler lying around for a few weeks now…

I have a suggestion for getting rid of expansion ids entirely, and a flexible encoding that would be able to store up to 70% of spans (in the rustc codebase) in 32 bits and the rest in a codemap-table like the one we have for expansion infos now. If you allow 64 bits per span, it should be easy to store 100% of spans within those, without even using a side table in the codemap.

EDIT: I’ll try to finish the blog post in the next few days.

OK, I won’t have time to write the blog post soon, so I’ll try to write up the short version of my findings here.

Getting rid of expansion ids

The idea is actually pretty simple: Instead of recording expansion information in the ExpansionInfo list in the CodeMap, we can just treat each expansion as a new (virtual) file in the codebase and allocate spans from there. So for every call to assert!() or any other macro we create a new FileMap in the CodeMap, which also stores the information currently found in ExpansionInfo. All spans from this particular expansion are then allocated within the range of this virtual file map. This way we don’t have to set aside a fixed number of bits for the expansion id.

Encoding Spans

I had the idea that we could encode spans like strings with small-string optimization: Let’s say we want a span to be X bits large. If we can represent the whole span in X-1 bits, we do so. Else, we set a tag bit that tells us that the other X-1 are to be interpreted as an index into a side-table in the codemap. This side table stores spans in a format that can represent all possibilities (like our current span format).

So, let’s say we want spans as they occur in the AST to take up only 32 bits. Then we could encode them as:

........ ........ ........ ........
TPPPPPPP PPPPPPPP PPPPPPPP LLLLLLLL 

That is, we have the tag bit, 23 bits for the position of the span within the codemap and 7 bits for the length. Everything that cannot be represented this way because its position is greater than 2^23 or its length is greater than 2^7 is stored in the side table.

Note that the side table is needed when actually interpreting the span (e.g. when reporting an error) but not when copying the span around. A potential Span struct could thus conveniently derive(Copy). Also note that this kind of side table does not have the negative effects of the one mentioned in the OP, as it doesn’t rely on node IDs or memory addresses.

More Flexible Span Encodings

We don’t have to restrict ourselves to just one tag bit. If we use two bits, we can use the value 0 as the “side table marker” and the values 1, 2, and 3 as identifiers for different encodings. That is, we could have an encoding that looks like this:

........ ........ ........ ........
00IIIIII IIIIIIII IIIIIIII IIIIIIII or
01PPPPPP PPPPPPPP PPPPPPPP LLLLLLLL or
10PPPPPP PPPPPPPP PPPPPPPP PPPPLLLL or
11PPPPPP PPPPPPPP PPPLLLLL LLLLLLLL 

An encoding like the above would catch a few more spans, because it can represent all short spans (length < 15) with positions up to 2^26.

Finding an Optimal Encoding

In order to find the best encoding (at least regarding to my sample codebases) I have …

(1) … modified rustc so it would dump all spans found in the ast-map into a simple binary file. Spans are dumped after macro expansion and importing other crates, so this should give a pretty accurate picture of what spans can occur, and …

(2) … written a program that analyzes the dumped span data with regards to different encodings.

The program can take an encoding as described above and test it against a code base (i.e. the span data dumped by the modified rustc). It will then tell you what percentage of the spans can be represented without a side table entry.

Given a number of bits to use in total and a number of bits to use for the encoding tag, it can also generate all possible encodings for those numbers. If the tag is one bit, we can exhaustively search the problem space to find the best encoding. For more than one bit, the problem space gets too large to search exhaustively, but the program uses a simple randomized tabu search that finds good results pretty quickly.

I once let my Core i7 run at full CPU utilization on all 8 hardware threads for 10 hours—which was just enough to try out all ~20 billion possible encodings for a 2 bit tag. The tabu search yielded the optimal result in 10 seconds :).

Results

My test used the complete rustc+std code base (everything that gets compiled by make rustc) as reference. The data also includes imported filemaps (see PR #22235) and encodings are tested with and without the above mentioned expansion-id optimization.

The following table shows the best results with different settings. Encodings are represented as LxPyEz which means x length-bits, y position-bits, and z expansion-id-bits. The number next to the encoding is the percentage of spans in the codebase that can be encoded without side table entry by the given encoding.

|               | 32 bit (1 bit tag) | 32 bit (2 bit tag)                |
| ------------- | ------------------ | --------------------------------- |
| w/ ExpnId     | (L7P24E0) (65%)    | (L4P17E9, L6P24E0, L9P21E0) (69%) |
| w/o ExpnId    | (L7P24) (94%)      | (L6P24, L7P23, L9P21) (96%)       |

As can be seen, using the expansion-id optimizations allows for more than 90% of all spans to be represented with only 32 bits, which sounds pretty good to me. The flexible encoding also brings a few percent, but not much. However, I suspect that the flexible encoding can better keep up with even larger code bases.

Conclusion

If I haven’t made a mistake in my analysis somewhere (which is entirely possible) we should be able to store spans in 32 bits. Also, with the virtual-filemap/expansion-id optimization it should be possible to store all spans within 64bit without needing a side-table.

EDIT: Meh, discourse can’t do html tables… EDIT2: And it swallows stuff …

7 Likes

Awesome work!

I hadn’t considered multiple tag bits. Also, shifting bits from offset to length makes a lot of sense since as spans gets longer, they have to start earlier in the file on average. That said, only going from 94% → 96% doesn’t make the extra complexity seem worthwhile.

Also, if you do indeed require the side table even for small spans, then there’s no reason to go to 64-bit at all; I went with that simply because I wanted to ensure you didn’t need to pass around extra out-of-band information for large spans to work.

This gets even more important since all source files of a crate and its dependencies share the same "address space" in the code map, that is, even short file can start at a high address. If you do it like now, storing start- and end-address for each span, we basically need the same amount of bits for the start- and the end-field. Span lengths, on the other hand, never get bigger than a single file.

Keep in mind that these numbers are based on just one codebase. The flexible encoding might show its strength more for larger codebases. I'm not saying that we should do it but the added complexity is negligible (it just affects the Span::new() and Span::decode() functions).

By the same token, we can easily move to flexible encodings later as well, no?

Sure. The time-consuming part is refactoring Span so it can only be decoded in conjunction with the code map. After that, switching to a different encoding is simple, especially if it’s pure runtime data with no backwards-compatibility restrictions.

Getting rid of expansion IDs the way I proposed is a different story though. That would need quite some work in the macro expansion code.

So, I redid the @michaelwoerister's experiment and got very similar results.

The setup was slightly different - I didn't collect span statistics from AST before or after expansion, I recorded every span created with Span::new instead (this requires https://github.com/rust-lang/rust/pull/43968).
I also reimplemented the program tabu-searching for optimal encodings independently, so bug compatibility with @michaelwoerister's version is unlikely.

First of all, what percentage of spans can we represent with a certain number of bits? Here's some chart.

This chart is too optimistic, because it assumes that the number of possible encodings can be used is unlimited.
We see however, that the center of the step is around 24 bits, so it's reasonable to look for encodings for 32 and 40 bit compressed spans (I looked for 24 bit as well, for comparison).

The search space looks like this: (base, offset, ctxt, is_negative_offset).
base is span.lo (position in the codemap), offset is abs(span.hi - span.lo), ctxt is index into interner for various hygiene- and macro-related data, is_negative_offset is is_negative(span.hi - span.lo) (yes, there are spans with span.lo > span.hi! I'm going to eliminate them soon though).

I looked for optimal encodings with 1-, 2- and 3-bit tag.
Simple tabu search likes to stuck in local maximums, so the results are slightly different when we start searching from (large, 0, 0, 0) and (equal, equal, equal, 1), I'm listing all of them below.

    # 24 bits
    [(20, 3, 0, 0)], representable: 28.0945636582%
    [(21, 1, 0, 0), (20, 2, 0, 0), (18, 4, 0, 0)], representable: 29.5463886348%
    [(20, 2, 0, 0), (18, 4, 0, 0), (15, 7, 0, 0)], representable: 28.2045555975%
    [(21, 0, 0, 0), (21, 0, 0, 0), (21, 0, 0, 0), (21, 0, 0, 0), (20, 1, 0, 0), (18, 3, 0, 0), (14, 1, 6, 0)], representable: 20.3942652242%
    [(20, 1, 0, 0), (19, 2, 0, 0), (17, 4, 0, 0), (14, 7, 0, 0), (6, 7, 7, 1), (6, 7, 7, 1), (0, 6, 14, 1)], representable: 22.8205040856%

    # 32 bits
    [(24, 7, 0, 0)], @michaelwoerister's result
    [(24, 7, 0, 0)], representable: 80.010087496%
    [(24, 6, 0, 0), (21, 9, 0, 0), (17, 4, 9, 0)], @michaelwoerister's result
    [(24, 6, 0, 0), (22, 8, 0, 0), (18, 1, 11, 0)], representable: 82.6795242211%
    [(23, 7, 0, 0), (18, 4, 8, 0), (18, 1, 11, 0)], representable: 79.8442320036%
    [(29, 0, 0, 0), (29, 0, 0, 0), (29, 0, 0, 0), (29, 0, 0, 0), (24, 5, 0, 0), (22, 7, 0, 0), (17, 1, 11, 0)], representable: 79.9041099684%
    [(24, 5, 0, 0), (22, 7, 0, 0), (20, 9, 0, 0), (18, 3, 8, 0), (17, 1, 11, 0), (16, 5, 8, 0), (5, 24, 0, 0)], representable: 81.6788671615%

    # 40 bits
    [(24, 15, 0, 0)], representable: 84.5305650395%
    [(38, 0, 0, 0), (24, 14, 0, 0), (21, 4, 13, 0)], representable: 92.5187443371%
    [(24, 14, 0, 0), (21, 4, 13, 0), (20, 6, 12, 0)], representable: 93.4870459248%
    [(37, 0, 0, 0), (37, 0, 0, 0), (37, 0, 0, 0), (37, 0, 0, 0), (37, 0, 0, 0), (24, 13, 0, 0), (20, 5, 12, 0)], representable: 91.8228933713%
    [(24, 13, 0, 0), (21, 3, 13, 0), (20, 5, 12, 0), (19, 7, 11, 0), (18, 11, 8, 0), (16, 10, 11, 0), (15, 8, 14, 0)], representable: 93.5323612314%

As expected, 24 bits is too small to keep anything.
32 bits and 40 bits give much better percents than 24 bits, but 40 bits is not significantly better than 32 bits.
3 bits of tag seems to be too much unless we use really large number of bits for the whole span.

Going from percents to absolute numbers, even if we assume that interned spans spend 3 times more memory due to interning overhead, 32 bit still seem to win over 40 bit:

    cost32 = 32, cost40 = 40, percent32 = 0.8268, percent40 = 0.9349
    cost32 * percent32 + 3 * cost32 * (1 - percent32) < cost40 * percent40 + 3 * cost40 * (1 - percent40)
    43.0848 < 45.2080

If we add extra padding that will be more likely generated by 40 bit spans, the difference will become even larger.


As a result, I'm going to implement 32-bit compressed spans with 2-bit tag and encodings (24, 6, 0, 0), (22, 8, 0, 0), (18, 1, 11, 0).
EDIT: And I'm also going to use [u8; 4] instead of u32 to avoid bumping alignment requirements for structs containing spans.

2 Likes

What I didn’t do yet:

  • I’m not sure it’s still reasonable to get rid of ctxt (previously expn_id) in spans and move its data to codemap, because now it contains hygiene information in addition to macro expansion data.
  • span.lo is very non-optimal, it’s a huge offset in a codemap including all files in the crate. If the file itself is tiny it still can get huge offset and avoid inlining. If something like file_local_offset + file_id was used instead, it would radically reduced number of bits required to represent base, which is currently the most space-demanding component of inlined spans. (I’d expect 24-bit spans to become usable or percentage of spans representable with 32 bits to jump into 90%s).

What's the logic here? Naively thinking, all code offsets are "equally probable", so splitting a code offset into a file-name and file-offset shouldn't help much. A crate can easily have a million tokens, and the base can be each one of these.

The thing I'm afraid of with 32-bit spans is that we'll have a "size cliff" when a crate contains more than 16M characters and all spans at the end of the crate need to be "big" (similarly with 8-bit file-ids, when a crate has more than 256 files).

Mini example: 16 files, each 16 bytes (256 total), each byte starts a span.
Average number of bits to encode a span start:

  • current scheme:
    Sum = bits(0) + bits(1) + ... + bits(16) + bits(17) + ... + bits(255) = 1 * 1 + 2 * 2 + ... + 8 * 128 = ∑n ∋ [1, 8] n * 2 ^ (n - 1) = (8 - 1) * 2 ^ 8 + 1 = 1793
    Average = 1793 / 256 ≈ 7.00
  • file_local offset and file id:
    Sum = (bits(0) + ... + bits(15) /*file ids*/) + 16 * (bits(0) + ... + bits(15) /*offsets*/) = 17 * ∑n ∋ [1, 4] n * 2 ^ (n - 1) = 17 * ((4 - 1) * 2 ^ 4 + 1) = 833
    Average = 833 / 256 ≈ 3.25

Even if all spans are indices into a separate table, shouldn’t that still be a win, since most code never needs to actually decode them (just copy them around)?

Sure. That would be a locality win, but a peak memory usage loss (because we'll still need an interner). Maybe we can drop the interner index after macro expansion to lower peak memory usage. But I was afraid that a crate with 16MB of code would use 4 bytes/span but a crate with 32MB would use 4 + (8 + index-space)/2 space, where IIRC "index-space" can be quite large (using an HashSet of indices, it would take (4+8)/α bytes - 4 for the index, 8 for the hash - although we could use a 4-byte hash with some modifications).

What encoding are you using? Encoding "my way" would take 8 bits (uniform distribution on 256 options) - you can encode 50% of spans with 7 bits + an escape flag, but why would you.

Unless you do some "compression" that knows that AST children are in the same file as their parents, but that's something we'll have to decode on every visit, and can't handle expansions.

It's "use the best possible encoding for each span start", not a specific encoding, so it's just an estimation for the lower bound of required number of bits, like the chart in RFC - compiler refactoring - spans - #28 by petrochenkov.

For span starts I don't know what would be the best encoding exactly, maybe 16 bits file number + 24 bits local offset, maybe some flexible encoding with additional field specifying the position of "demarcation line" between file number and local offset.
The only requirement is that separate file number and local offset should be quickly extractable, because 1) file numbers are often small and can often be used in inlined spans and 2) local offsets are often small and can often be used in inlined spans.
Current global offsets are rarely small (only in the first file of a codemap) and can rarely be used in inlined spans, and do not allow quickly extracting file number and local offset.

I'm not sure how is this related.

You can't cheat entropy. The reason we use a global codemap offset, is because we want to uniquely identify a character in the source-code, and a global codemap offset does that in the least number of bits. If a program has more than a million bytes of code, you can't expect a 20-bit index (+ 11-bit length, say) to do any good, even if you split it to a file-number and local offset.

1 Like

I think I’ll just make an experiment some time and report results.

I’ve been thinking about this. In terms of cache hit ratio, it seems like the best case behavior of file id/offset vs global offset are the same, if you assume a uniform distribution of spans across files and within files. Global offset’s best case and worst case behavior is the same, because the only way you get a cache miss is by having the total bytes of source being over the max offset.

Note, if you need to report spans as file/line/column, the cost of mapping global offset -> file/line/column may be similar to file/offset -> file/line/column.

I initially thought storing the file id/offset would be beneficial, but I couldn’t actually think of a scenario where I could justify a non-uniform distribution of spans. And any scenario where global offset results in a cache miss, it means that file id/offset will have the same or more misses in that scenario.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.