Caller-side inline directives

For a simple example of problematic std function inlining with multiple CGUs, please clone the HadrienG2/grayscott repository on GitHub (a Rust version of the "Performance with stencil" course's examples), check out commit bd6b2fda82b85000a449e16d38705cf5f178299d, remove the codegen-units = 1 directive in Cargo.toml, then build and run the criterion benchmark for the autovectorized compute backend as follows:

$ cargo install cargo-criterion
$ cargo criterion -p compute_autovec -- --sample-size 10 'full.*2048x.*32'

Then switch back to codegen-units = 1 and observe how throughput improves. The speedup is about 16x on the Zen 3 CPU that I currently have my hands on, although admittedly this particular microarchitecture is crazy amazing and the improvement is less impressive on other CPUs I've tried (Zen 2, Comet Lake), where it is more like 3x. Still huge.

# Before
compute_autovec/full,2048x1024elems,32steps                                                                          
                        time:   [1.3362 s 1.3368 s 1.3375 s]
                        thrpt:  [50.176 Melem/s 50.200 Melem/s 50.222 Melem/s]

# After
compute_autovec/full,2048x1024elems,32steps                                                                           
                        time:   [75.417 ms 80.169 ms 84.470 ms]
                        thrpt:  [794.47 Melem/s 837.10 Melem/s 889.84 Melem/s]
                 change:
                        time:   [-94.396% -94.239% -94.043%] (p = 0.00 < 0.05)
                        thrpt:  [+1578.7% +1635.8% +1684.5%]
                        Performance has improved.

A perf profile reveals that one reason why switching to codegen-units = 1 is so beneficial is that std function inlining goes quite wrong in the presence of multiple CGUs. In particular, the FnMut::call_mut trait method is not inlined for a simple closure.

This is the code that calls the function. Don't mind the complicated iterator, it is just iterating over every element of identically shaped 2D arrays. I need to do the whole row/column dance because rustc does not generate good code when I use ndarray's flat iterators together with a fixed-size array's flattened iterator, which is annoying but fair enough:

[Image: screenshot of the calling code]
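
For readers who cannot see the screenshot, here is a minimal sketch of the general shape of such a "row/column dance". It is not the code from the grayscott repository: plain fixed-size arrays stand in for ndarray, and the `update` function, its parameters, and the closure are hypothetical names chosen for illustration. The point is only that the closure call inside the inner loop goes through `FnMut::call_mut`, so the loop can only be vectorized if that call gets inlined.

```rust
/// Hypothetical element-wise update over identically shaped 2D arrays.
/// Rows are walked in lockstep, then columns within each row, instead of
/// using a single flattened iterator.
fn update<const ROWS: usize, const COLS: usize>(
    out: &mut [[f32; COLS]; ROWS],
    u: &[[f32; COLS]; ROWS],
    v: &[[f32; COLS]; ROWS],
    mut op: impl FnMut(f32, f32) -> f32,
) {
    // Outer loop: matching rows of the three arrays.
    for ((out_row, u_row), v_row) in out.iter_mut().zip(u).zip(v) {
        // Inner loop: matching columns of the current rows.
        for ((o, &a), &b) in out_row.iter_mut().zip(u_row).zip(v_row) {
            // `op(a, b)` is sugar for a call to `FnMut::call_mut` on the
            // closure's anonymous type.
            *o = op(a, b);
        }
    }
}

fn main() {
    let u = [[1.0_f32, 2.0], [3.0, 4.0]];
    let v = [[10.0_f32, 20.0], [30.0, 40.0]];
    let mut out = [[0.0_f32; 2]; 2];
    update(&mut out, &u, &v, |a, b| a + b);
    println!("{out:?}");
}
```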

And this is rustc mistakenly outlining the FnMut::call_mut-mediated call to the folding closure, as spotted by perf:

Further, as previously alluded to, the whole inner Iterator::fold() call is not inlined either...

...which is sad, because inlining this little reduction based on a loop with 9 iterations is something rustc should do. Especially given that it knows the loop has 9 iterations, since all arrays and ndarray windows involved have dimensions known at compile time and the function that contains this info is inlined... Oh well, something for another day.
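
To illustrate what kind of reduction is meant here, the following is a rough sketch under stated assumptions: the 3x3 window and weight arrays, the function names, and the weight values are all hypothetical, and plain arrays replace the ndarray windows of the real code. The first function is a 9-iteration `Iterator::fold()` whose trip count is visible in the types; the second is the straight-line code one would hope to get once the fold and its closure are inlined and the loop is unrolled.

```rust
/// Weighted sum of a 3x3 window, written as a fold over 9 elements.
/// The iteration count follows from the array types alone.
fn stencil_fold(window: &[[f32; 3]; 3], weights: &[[f32; 3]; 3]) -> f32 {
    window
        .iter()
        .flatten()
        .zip(weights.iter().flatten())
        .fold(0.0, |acc, (&x, &w)| acc + w * x)
}

/// The same reduction, unrolled by hand in the same summation order.
fn stencil_unrolled(x: &[[f32; 3]; 3], w: &[[f32; 3]; 3]) -> f32 {
    let mut acc = 0.0;
    acc += w[0][0] * x[0][0];
    acc += w[0][1] * x[0][1];
    acc += w[0][2] * x[0][2];
    acc += w[1][0] * x[1][0];
    acc += w[1][1] * x[1][1];
    acc += w[1][2] * x[1][2];
    acc += w[2][0] * x[2][0];
    acc += w[2][1] * x[2][1];
    acc += w[2][2] * x[2][2];
    acc
}

fn main() {
    // Hypothetical input window and Laplacian-like weights.
    let window = [[1.0_f32, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]];
    let weights = [[0.25_f32, 0.5, 0.25], [0.5, -3.0, 0.5], [0.25, 0.5, 0.25]];
    let folded = stencil_fold(&window, &weights);
    let unrolled = stencil_unrolled(&window, &weights);
    println!("fold = {folded}, unrolled = {unrolled}");
}
```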

Anyway, just for completeness, I tried to see if the whole-program optimizations of LTO can fix this:

  • Thin LTO does nothing, as unfortunately happens too often when I try to use it.
  • Fat LTO, as usual, is more interesting: at the expense of a much slower build, it does resolve the two inlining issues discussed above, but it introduces a new, equally bad inlining issue, namely outlining of the FMA operation of the SIMD library that I use. Then again, I contributed that mul_add function myself and forgot to mark it inline (facepalm), so I will submit a patch once I'm done writing this post.
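
For context on that last point, here is a minimal sketch of what a callee-side fix of this kind looks like. The `F32x4` type below is a hypothetical stand-in, not the actual SIMD library's type or implementation; only the `mul_add` name comes from the discussion above. The `#[inline]` attribute makes the function body available to downstream crates as an inlining candidate, but it remains a hint that the compiler is free to ignore.

```rust
/// Hypothetical 4-lane vector type, used only for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

impl F32x4 {
    /// Lane-wise fused multiply-add: computes `self * a + b` per lane.
    ///
    /// `#[inline]` is the callee-side hint discussed above: it asks the
    /// compiler to consider inlining this function at its call sites,
    /// including call sites in other crates.
    #[inline]
    pub fn mul_add(self, a: Self, b: Self) -> Self {
        let mut out = [0.0_f32; 4];
        for i in 0..4 {
            // `f32::mul_add` is the scalar fused multiply-add from std.
            out[i] = self.0[i].mul_add(a.0[i], b.0[i]);
        }
        Self(out)
    }
}

fn main() {
    let x = F32x4([1.0, 2.0, 3.0, 4.0]);
    let a = F32x4([2.0; 4]);
    let b = F32x4([1.0; 4]);
    println!("{:?}", x.mul_add(a, b));
}
```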