I’ve done several experiments to reduce the number of monomorphizations the compiler emits, with the goal of reducing compile times. The results have not been great, but here they are for posterity.
Function melding
First, this is a list of the most commonly instantiated generics in Servo. You can see that there are lots and lots of tiny abstractions emitted repeatedly. In theory, many of these are redundant.
The approach I’ve most recently tried is called ‘function melding’ (patch), which attempts to combine monomorphizations of a single function in cases where the type substitutions would produce identical machine code. For example, in a simple case like mem::uninitialized, which does very little, most instances of the function can be melded, that is, shared with other type substitutions. The basic approach of the patch is to find candidates with the correct structure, then hash various ABI-affecting properties of the type substitutions. Instances whose hashes match meld together.
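To illustrate the kind of redundancy melding targets, here is a minimal sketch (my own example, not code from the patch):

```rust
// Both instantiations below have identical ABI-relevant substitutions:
// u32 and i32 have the same size and alignment, and the function body
// only moves a pointer and a length around. A melding pass could emit
// one machine-code body and point both symbols at it.
fn first_elem<T>(v: &[T]) -> Option<&T> {
    v.first()
}

fn main() {
    let a: &[u32] = &[1, 2, 3];
    let b: &[i32] = &[-1, -2, -3];
    // first_elem::<u32> and first_elem::<i32> are separate
    // monomorphizations today.
    println!("{:?} {:?}", first_elem(a), first_elem(b));
}
```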
The existing analysis in the patch is not completely sound, but produces results that I believe are indicative of what is possible.
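As a toy model of that hashing step (illustrative only; these are not the compiler’s actual data structures or the patch’s real fingerprint):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash only the properties of a substituted type that can affect the
// generated machine code. Instantiations whose substitutions produce
// the same key are candidates to meld. A real analysis would consider
// more properties (calling convention, drop glue, etc.).
#[derive(Hash)]
struct AbiFingerprint {
    size: usize,
    align: usize,
    is_pointer: bool,
}

fn meld_key(substs: &[AbiFingerprint]) -> u64 {
    let mut h = DefaultHasher::new();
    substs.hash(&mut h);
    h.finish()
}

fn main() {
    // u32 and i32 fingerprint identically, so these substitutions meld.
    let u32_fp = AbiFingerprint { size: 4, align: 4, is_pointer: false };
    let i32_fp = AbiFingerprint { size: 4, align: 4, is_pointer: false };
    assert_eq!(meld_key(&[u32_fp]), meld_key(&[i32_fp]));
}
```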
The current patch does meld a significant proportion of monomorphizations, consistently around 10%. Unfortunately, I haven’t been able to translate this into compile time improvements.
In Servo, for example, there are presently 115,658 total instantiations in the entire build. The patch completely elides 15,981 of them, about 14%.
Compile times, though, hardly budge. In these comparisons, ‘before’ means compiling with the patch but with the RUSTC_NO_MELD env var set (which short-circuits everything), and ‘after’ means with melding on.
- Before (O0): Build completed in 433.89s
- After (O0): Build completed in 433.77s
- Before (O3): Build completed in 692.03s
- After (O3): Build completed in 701.44s
So, total compile time doesn’t budge in debug mode, and gets worse in release. Sigh. Hoping that it was just the analysis that was slow, I still expected to see LLVM time decrease, but no:
before release
time: 2.124; rss: 344MB translation
time: 0.327; rss: 267MB llvm function passes
time: 11.500; rss: 278MB llvm module passes
time: 1.521; rss: 283MB codegen passes
time: 0.001; rss: 278MB codegen passes
time: 13.416; rss: 278MB LLVM passes
time: 3.881; rss: 280MB running linker
time: 3.882; rss: 278MB linking
after release
time: 2.230; rss: 343MB translation
time: 0.322; rss: 266MB llvm function passes
time: 11.619; rss: 278MB llvm module passes
time: 1.517; rss: 281MB codegen passes
time: 0.001; rss: 281MB codegen passes
time: 13.526; rss: 281MB LLVM passes
time: 3.900; rss: 286MB running linker
time: 3.901; rss: 281MB linking
before debug
time: 2.105; rss: 336MB translation
time: 0.058; rss: 255MB llvm function passes
time: 0.040; rss: 258MB llvm module passes
time: 1.349; rss: 310MB codegen passes
time: 0.000; rss: 286MB codegen passes
time: 1.482; rss: 286MB LLVM passes
time: 4.034; rss: 291MB running linker
time: 4.036; rss: 286MB linking
after debug
time: 2.244; rss: 333MB translation
time: 0.056; rss: 253MB llvm function passes
time: 0.035; rss: 256MB llvm module passes
time: 1.310; rss: 306MB codegen passes
time: 0.000; rss: 284MB codegen passes
time: 1.434; rss: 284MB LLVM passes
time: 4.021; rss: 286MB running linker
time: 4.022; rss: 284MB linking
No wins in LLVM. This is almost hard to believe, but perhaps eliding 14% of the smallest functions really does have a negligible effect on LLVM.
So I’m feeling defeated by function melding. There are more (and larger) functions that could still be melded, but I’m not hopeful it will change anything.
Preinstantiating with WeakODR linkage
Earlier this year I did a different experiment. In this one I tried to identify monomorphizations that could be reused downstream and expose them as public symbols, so that downstream crates wouldn’t need to re-emit them. The hope was that this would reduce compile times in debug builds.
The details aren’t that important and I’ve forgotten some by now.
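Still, to make the shape of the idea concrete, here is a hedged sketch, with the two crates written as modules and all names hypothetical:

```rust
// Hypothetical illustration; imagine these modules as separate crates.
mod upstream {
    pub fn double<T: std::ops::Add<Output = T> + Copy>(x: T) -> T {
        x + x
    }

    // This call forces an instantiation of double::<u32> in `upstream`.
    // The experiment exported such instantiations as public symbols
    // (weak_odr in LLVM terms) instead of keeping them internal.
    pub fn double_u32(x: u32) -> u32 {
        double(x)
    }
}

mod downstream {
    // Today each downstream crate re-emits its own private copy of
    // double::<u32>; with preinstantiation it would instead link against
    // the symbol that `upstream` already exported.
    pub fn quadruple(x: u32) -> u32 {
        super::upstream::double(super::upstream::double(x))
    }
}

fn main() {
    assert_eq!(upstream::double_u32(2), 4);
    assert_eq!(downstream::quadruple(3), 12);
}
```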
While I did get some notable wins in some scenarios, I also saw big losses in others, and wasn’t confident it was worthwhile. Some notes about what I observed:
- Non-internal linkage is punished by optimization passes. In particular, it defeats a crucial inlining optimization: LLVM strongly prefers to inline private functions that have a single caller, since it can then delete the function entirely. This makes the approach worthless for optimized builds (see the sketch after this list).
- Similarly to the above, non-internal functions take longer to compile.
- My patch was hampered by having imperfect downstream knowledge. It was forced to ‘guess’ which functions would be useful downstream, and often decided to do the extra work of exporting them (with an expensive crate-independent hash) in vain.
- Because of the above issues it wasn’t a clear win overall.
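Here is the inlining interaction from the first bullet above in miniature (illustrative only):

```rust
// With internal linkage and a single caller, LLVM can inline this and
// then delete the out-of-line body entirely.
fn tripled_private(x: u32) -> u32 {
    x.wrapping_mul(3)
}

// With non-internal linkage (external, or weak_odr in the experiment),
// the body must survive in the object file even if every local call
// site is inlined, so the optimizer is more conservative and there is
// strictly more code to run passes over.
pub fn tripled_exported(x: u32) -> u32 {
    x.wrapping_mul(3)
}

pub fn caller(x: u32) -> u32 {
    tripled_private(x) + tripled_exported(x)
}

fn main() {
    assert_eq!(caller(2), 12);
}
```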
There are some changes to this approach that may still bear fruit:
- The crate-independent hash is only necessary if you care about coalescing functions at link time (using WeakODR linkage); a toy model appears after this list. In release builds coalescing monomorphizations has minimal effect on binary size (I can’t recall why offhand), but in debug builds the reduction in binary size is significant (though not very important!).
- Instead of using an inaccurate heuristic to decide which monomorphizations to export, profile-based preinstantiation would ensure that only functions that downstream will actually use get exported.
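And a toy model of the crate-independent hash mentioned in the first item above (hypothetical; the real thing would need a hash that is stable across compiler invocations and versions, which DefaultHasher is not guaranteed to be):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a symbol name from the generic item's path and its type
// substitutions, with no input from the crate doing the instantiating.
// Two crates that independently emit the same instantiation then
// produce the same weak_odr symbol, and the linker keeps one copy.
fn crate_independent_symbol(item_path: &str, substs: &[&str]) -> String {
    let mut h = DefaultHasher::new();
    item_path.hash(&mut h);
    substs.hash(&mut h);
    format!("{}::h{:016x}", item_path, h.finish())
}

fn main() {
    // "Both crates" compute the same name for the same instantiation.
    assert_eq!(
        crate_independent_symbol("mylib::double", &["u32"]),
        crate_independent_symbol("mylib::double", &["u32"]),
    );
}
```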
Sadly, once the compiler is refactored to separate type checking and code generation, cargo will be able to orchestrate the build such that every required monomorphization is emitted only once. Once that happens, the only place preinstantiation will have any effect is for types in the standard library.
Sorry if any of this was nonsensical. Brain dump.