Some notes on reducing monomorphizations

I’ve done several experiments to reduce the number of monomorphizations the compiler emits, with the goal of reducing compile times. The results have not been great, but here they are for posterity.

Function melding

First, this is a list of the most commonly instantiated generics in Servo. You can see that there are lots and lots of tiny abstractions emitted repeatedly. In theory, many of these are redundant.

The approach I’ve most recently tried is called ‘function melding’ (patch), which attempts to combine monomorphizations of a single function in cases where the type substitutions would produce identical machine code. For example in a simple case like mem::uninitialized, which does very little, most instances of the function can be melded - that is, shared with other type substitutions. The basic approach of the patch is to find candidates with the correct structure, then hash various ABI-affecting properties of the type substitutions. Those with the same hash meld together.
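To make the grouping step concrete, here is a minimal sketch of the idea, not the patch itself. The `AbiKey` struct and the `abi_key` and `meld_groups` helpers are hypothetical names, and the key assumes that size and alignment are the only ABI-affecting properties. A real analysis has to look at more than that: `u32` and `f32` share size and alignment, for example, but are passed in different registers on many targets.

```rust
use std::collections::HashMap;
use std::mem::{align_of, size_of};

// Hypothetical stand-in for the "ABI-affecting properties" that get hashed.
// Assumption: only size and alignment matter, which is not true in general.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct AbiKey {
    size: usize,
    align: usize,
}

// Compute the key for one type substitution.
fn abi_key<T>() -> AbiKey {
    AbiKey {
        size: size_of::<T>(),
        align: align_of::<T>(),
    }
}

// Group the type substitutions of a single generic function by key.
// Any group with more than one member is a candidate for sharing one body.
fn meld_groups(substs: &[(&'static str, AbiKey)]) -> HashMap<AbiKey, Vec<&'static str>> {
    let mut groups: HashMap<AbiKey, Vec<&'static str>> = HashMap::new();
    for (name, key) in substs {
        groups.entry(*key).or_default().push(*name);
    }
    groups
}

fn main() {
    let substs = [
        ("u32", abi_key::<u32>()),
        ("i32", abi_key::<i32>()),
        ("u8", abi_key::<u8>()),
    ];
    for (key, names) in &meld_groups(&substs) {
        println!("{:?} -> {:?}", key, names);
    }
}
```

With these inputs, `u32` and `i32` land in the same group and `u8` gets a group of its own, so two bodies would be emitted instead of three.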

The existing analysis in the patch is not completely sound, but produces results that I believe are indicative of what is possible.

The current patch does meld a significant proportion of monomorphizations - consistently around 10%. Unfortunately, I haven’t been able to translate this to compile time improvements.

In Servo, for example, there are presently 115658 total instantiations in the entire build. The patch completely elides 15981 of them, about 13.8%.

The compile times though hardly budge. In these comparisons ‘before’ means compile with this patch using the RUSTC_NO_MELD env var (which short circuits everything), and ‘after’ means with melding on.

  • Before (O0): Build completed in 433.89s
  • After (O0): Build completed in 433.77s
  • Before (O3): Build completed in 692.03s
  • After (O3): Build completed in 701.44s

So, total compile time doesn’t budge in debug mode, and gets worse in release. Sigh. Hoping that it was just the analysis that was slow, I still expected to see LLVM time decrease, but no:

before release

time: 2.124; rss: 344MB translation
time: 0.327; rss: 267MB       llvm function passes
time: 11.500; rss: 278MB      llvm module passes
time: 1.521; rss: 283MB       codegen passes
time: 0.001; rss: 278MB       codegen passes
time: 13.416; rss: 278MB        LLVM passes
time: 3.881; rss: 280MB       running linker
time: 3.882; rss: 278MB linking

after release

time: 2.230; rss: 343MB translation
time: 0.322; rss: 266MB       llvm function passes
time: 11.619; rss: 278MB      llvm module passes
time: 1.517; rss: 281MB       codegen passes
time: 0.001; rss: 281MB       codegen passes
time: 13.526; rss: 281MB        LLVM passes
time: 3.900; rss: 286MB       running linker
time: 3.901; rss: 281MB linking

before debug

time: 2.105; rss: 336MB translation
time: 0.058; rss: 255MB       llvm function passes
time: 0.040; rss: 258MB       llvm module passes
time: 1.349; rss: 310MB       codegen passes
time: 0.000; rss: 286MB       codegen passes
time: 1.482; rss: 286MB LLVM passes
time: 4.034; rss: 291MB       running linker
time: 4.036; rss: 286MB linking

after debug

time: 2.244; rss: 333MB translation
time: 0.056; rss: 253MB       llvm function passes
time: 0.035; rss: 256MB       llvm module passes
time: 1.310; rss: 306MB       codegen passes
time: 0.000; rss: 284MB       codegen passes
time: 1.434; rss: 284MB LLVM passes
time: 4.021; rss: 286MB       running linker
time: 4.022; rss: 284MB linking

No wins in LLVM. This is almost hard to believe, but perhaps eliding 13% of the smallest functions really has negligible effect on LLVM.

So I’m feeling defeated by function melding. There are more - larger - functions that can be melded still, but I’m not hopeful it will change anything.

Preinstantiating with WeakODR linkage

Earlier this year I did a different experiment. In this one I tried to identify monomorphizations that could be reused downstream and then expose them as public symbols so downstream wouldn’t need to re-emit them. The hope was that this would reduce compile times in debug builds.
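As a rough illustration of the goal, the effect can be approximated by hand at the source level. This is only an analogy, not what the patch did: the patch worked inside the compiler and chose the instantiations automatically. The `upstream` module stands in for a library crate, and `largest` and `largest_u32` are hypothetical names.

```rust
// `upstream` stands in for a library crate that downstream crates depend on.
mod upstream {
    // A generic function. Normally each downstream crate that uses
    // `largest::<u32>` emits its own copy of the monomorphized body.
    pub fn largest<T: PartialOrd + Copy>(items: &[T]) -> Option<T> {
        let mut iter = items.iter().copied();
        let first = iter.next()?;
        Some(iter.fold(first, |best, x| if x > best { x } else { best }))
    }

    // A non-generic, exported wrapper. Its body contains the `u32`
    // instantiation, so that code is generated once, in this crate.
    #[inline(never)]
    pub fn largest_u32(items: &[u32]) -> Option<u32> {
        largest::<u32>(items)
    }
}

fn main() {
    // Downstream code calls the exported symbol instead of
    // instantiating the generic again.
    let m = upstream::largest_u32(&[3, 9, 4]);
    println!("{:?}", m);
}
```

The hand-written version has the same weakness as the heuristic in the patch: the upstream author has to guess which instantiations downstream will want.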

The details aren’t that important and I’ve forgotten some by now.

While I did get some notable wins in some scenarios, I also saw some big losses in others, and wasn’t confident it was worthwhile. Some notes about what I observed:

  • Non-internal linkage is punished by optimization passes. In particular, it defeats a crucial inlining optimization where LLVM strongly prefers to inline private functions that have a single caller - so it can then delete the function. This makes the approach worthless for optimized builds.
  • Similarly to the above, non-internal functions take longer to compile.
  • My patch was hampered by having imperfect downstream knowledge. It was forced to ‘guess’ which functions would be useful downstream, and often decided to do the extra work of exporting them (with an expensive crate-independent hash) in vain.
  • Because of the above issues it wasn’t

There are some changes to this approach that may still bear fruit:

  1. The crate-independent hash is only necessary if you care about coalescing functions at link time (using WeakODR linkage). In release builds coalescing monomorphizations has minimal effect on binary size (can’t recall why offhand), but in debug builds the reduction in binary size is significant (but not very important!).
  2. Instead of using an inaccurate heuristic to decide which monomorphizations to export, profile-based preinstantiation would ensure that only functions that downstream will actually use get exported.

Sadly, once the compiler is refactored to separate type checking and code generation, cargo will be able to orchestrate the build such that every required monomorphization is only emitted once. Once that happens the only place preinstantiation will have any effect is for types in the standard library.

Sorry if any of this was nonsensical. Brain dump.


Oh, one note about the LLVM timings - that was on building just the servo bin, which does instantiate a number of generics, but not nearly all of them.

Thanks for investigating this and for the writeup!

Did you happen to capture differences in binary size, too? After compile time, the biggest negative associated with monomorphization is binary bloat (~30%, from estimates from MLton). Even if this work did not decrease compile times, if it takes even 5% off of our code size, that could have some significant impact on Servo’s performance, particularly on embedded hardware with relatively small i-cache sizes.

@larsberg I don’t have numbers offhand, but the only major reductions in binary sizes were in the unoptimized builds. With the preinstantiation patch the size wins were big in unoptimized builds (at the likely expense of even slower runtime). The preinstantiation patch in optimized builds also reduced Servo’s binary size a small amount, (< 1% I believe).

The function melding patch results in no size difference in rustc for optimized builds. I did not look into unoptimized builds or at Servo bin sizes with that patch.

I just compared optimized Servo binary sizes with the meld patch. First is before, second after:

-rwxrwxr-x  1 brian brian 148500624 Aug  4 17:33 servo
-rwxrwxr-x  1 brian brian 146639976 Aug  4 17:17 servo

That is a reduction of 1,860,648 bytes, or roughly 1.25%, so better than 1% at least.

Hrm, I’m not sure 1% is going to make a huge difference at runtime, though you just never know if the ~2MB less in binary size will end up being in a key place without running some perf tests. Would it help the status of this patch if we did look into it for some representative pages?

Thanks for the additional data!

Just some random thoughts on this. While I like the general idea, it also blocks off some things like using TBAA in such melded functions. And omitting TBAA information (once we generate it in the first place) would probably have to happen for any meldable function, regardless of whether melding actually happens.

If you do this, I think you might want to bitcast the arguments instead of the function. AFAICT, before inlining LLVM has to eliminate the bitcast on the call, so it would have to transform things to have the casts on the arguments instead. That might account for some of the compile time increase you saw.
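A source-level analogy of "cast the arguments, not the function" might look like the sketch below. It is not the IR the patch emits, and the names `same_address_erased`, `same_address`, and `Pair` are hypothetical. The shared body takes type-erased pointers, and each generic caller casts its own arguments before an ordinary direct call, so the callee is never cast at the call site.

```rust
// Shared, non-generic body: emitted once, works on type-erased pointers.
#[inline(never)]
fn same_address_erased(a: *const (), b: *const ()) -> bool {
    a == b
}

// Thin generic wrapper: the casts are applied to the arguments, and the
// call itself is a plain direct call to the shared body.
#[inline]
fn same_address<T, U>(a: *const T, b: *const U) -> bool {
    same_address_erased(a.cast::<()>(), b.cast::<()>())
}

// `repr(C)` guarantees that `first` sits at offset 0 of the struct.
#[repr(C)]
struct Pair {
    first: u32,
    second: u32,
}

fn main() {
    let p = Pair { first: 1, second: 2 };
    // The struct and its first field start at the same address.
    println!("{}", same_address(&p as *const Pair, &p.first as *const u32));
    // The two fields do not.
    println!("{}", same_address(&p.first as *const u32, &p.second as *const u32));
}
```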

Regarding the resulting file sizes, it would be interesting to see if you get comparable results when the MergeFunc pass is enabled. It’s currently disabled due to issues with debug info, but it has the advantage that it can reduce the binary size while still allowing for things like TBAA to be used, because it runs late in the optimization process and AA info is discarded only if a merge happens.

@dotdash interesting point about casting the arguments, thanks. @sunfish also suggested that might be nicer to LLVM but he wasn’t sure in either case what the impact would be.

How does this differ from the LLVM MergeFunctions pass?

Could the existence of this optimization pass explain why pre-melding does not seem to affect binary size?

@matthieum I’m not that familiar with the details of LLVM mergefunc, except that it has a reputation as not working, many people have tried and failed.

A Rust implementation benefits from having much better semantic information. Concretely, this implementation only tries to merge redundant monomorphizations of the same generic function - iow the search space is very small, and hopefully more likely to find a merge, compared to LLVM’s task of merging arbitrary functions.

I don’t know to what extent mergefunc affects our compilation, though I do recall being strongly recommended we not run it because it is so broke (I think we do run it…).

I just checked our source and LLVM mergefunc is commented out because ‘it causes crashes’.

@brson Those crashes are most likely At least I’m not aware of any other problems. Things being really broken is probably an old thing, I guess? We fixed a number of issues with that pass in LLVM before we enabled it in rustc.

If there is a consistent 1% size improvement, then it is worthwhile, especially for embedded use cases.
