10% performance hit by refactoring crate into modules? (solved)

kaspar030 · November 1, 2021, 10:29am

Hi everyone,

recently I wanted to clean up a messy main.rs, which contained way too much code, by splitting parts into seperate modules. Basically, moving parts of main.rs into model/foo.rs, and model/mod.rs pub uses the structs, which main.rs then uses from model. The whole application has ~3300 lines of code, and I moved ~700 lines from main.rs into model/*.rs.

Other than moving the type definitions & impls, which required some struct fields to be made public, I didn't change any code.

Anyhow, this lead to a reproducible 10% runtime performance regression. I didn't expect that.

I'm on Rust 1.56.

Here's the benchmark output:

❯ hyperfine -w 10 --prepare "touch /home/kaspar/src/riot.laze/laze-project.yml" "laze -C /home/kaspar/src/riot.laze b -g -G"
Benchmark 1: laze -C /home/kaspar/src/riot.laze b -g -G 
  Time (mean ± σ):      1.098 s ±  0.033 s    [User: 6.280 s, System: 0.285 s]                                                                     
  Range (min … max):    1.053 s …  1.144 s    10 runs                                                                                              
                                                                         
laze on  refactored took 22s                                                                                                                      
❯ hyperfine -w 10 --prepare "touch /home/kaspar/src/riot.laze/laze-project.yml" "laze -C /home/kaspar/src/riot.laze b -g -G"
Benchmark 1: laze -C /home/kaspar/src/riot.laze b -g -G                                                                                            
  Time (mean ± σ):      1.003 s ±  0.011 s    [User: 5.697 s, System: 0.281 s]
  Range (min … max):    0.982 s …  1.022 s    10 runs

Switching branches between those runs. Top run is from master branch, bottom one is from "refactored". These numbers are stable within 1%.

This is with "--release", lto="fat", compilation-units=1, incremental=false.

here is the application repository. This is the branch with the refactoring, and here is the used test data.

I obviously don't expect anyone to look at my code, I just don't know if I'm missing something obvious.

Are modules supposed to cause runtime overhead if used within a crate?

Thanks in advance, Kaspar

KillTheMule · November 1, 2021, 10:52am

As far as I can see though, this only applies to the refactored branch, and master has "only" lto=true.

kaspar030 · November 1, 2021, 11:19am

Sorry, I didn't push that commit to master (it is now). The numbers were made with the same settings ( lto="fat", compilation-units=1, incremental=false) on master and refactored. My first hunch was that maybe compilation-units=1, incremental=false was causing the difference, I did not set those before the refactor.

HerrJemine · November 1, 2021, 1:31pm

I am just guessing here, but maybe making the fields public inhibits some optimisations? You could try making those fields pub(crate).

mbartlett · November 1, 2021, 2:19pm

Are you sure it is a regression? As far as I can see, the time required decreases on the bottom refactored branch.

kaspar030 · November 1, 2021, 8:20pm

Ups. The slower top run is from the refactored branch. Sorry.

kaspar030 · November 1, 2021, 8:51pm

I tried that, it doesn't change the result.

mbrubeck · November 1, 2021, 9:24pm

I don't have any ideas about the cause, but I was able to reproduce this perf regression locally. I also compared the branches line-by-line and verified that there are no meaningful code changes accidentally introduced on the refactored branch.

mathstuf · November 1, 2021, 10:02pm

I would recommend using perf or callgrind to analyze the differences. There's obviously something about the machine code that is different, so being able to determine "this function is performing differently" would help narrow down where to investigate further.

Wild guesses:

instruction cache locality?
inlining aggressiveness (a diff of symbols in the binaries may help here)

michaelwoerister · November 2, 2021, 2:40pm

Machine code layout can have a surprisingly big impact on performance (see this talk for a fun example).

Did you try compiling your code with PGO? That might make the difference between the two smaller.

kaspar030 · November 2, 2021, 3:16pm

It does bring the two compiles closer together:

master with PGO:

❯ hyperfine -w 10 --prepare "touch /home/kaspar/src/riot.laze/laze-project.yml" "target/x86_64-unknown-linux-gnu/release/laze -C /home/kaspar/src/riot.laze b -g -G"
Benchmark 1: target/x86_64-unknown-linux-gnu/release/laze -C /home/kaspar/src/riot.laze b -g -G
  Time (mean ± σ):      1.002 s ±  0.022 s    [User: 5.609 s, System: 0.291 s]
  Range (min … max):    0.977 s …  1.036 s    10 runs

refactored with PGO:

❯ hyperfine -w 10 --prepare "touch /home/kaspar/src/riot.laze/laze-project.yml" "target/x86_64-unknown-linux-gnu/release/laze -C /home/kaspar/src/riot.laze b -g -G"
Benchmark 1: target/x86_64-unknown-linux-gnu/release/laze -C /home/kaspar/src/riot.laze b -g -G
  Time (mean ± σ):      1.083 s ±  0.043 s    [User: 6.187 s, System: 0.286 s]
  Range (min … max):    1.038 s …  1.179 s    10 runs

Interestingly, master doesn't profit from using PGO, refactored just a bit.

I'll try to get perf / callgrind data.

bluss · November 3, 2021, 6:02pm

Since variance is a topic in that talk, and it seems "variance measurement" is built into most benchmarking tools, it could be relevant to underline how little things are varying between those the runs those make. See the video about all the stuff that can vary and influence performance. And we vary almost none of that when the benchmark runner re-runs one's test 10 or 1000 times - we run the same exact executable in the same environment on the same computer.

Actual variance over 10 operating systems x 10 compiler versions x 10 random perturbations of the code or layout, now that would be interesting in the statistical sense at least

michaelwoerister · November 4, 2021, 10:17am

Yes, I expect that this would be a case of machine code layout and/or inlining decision variance, both of which are determined at compile time so that benchmarking tools can't do much to mitigate that.

Finding out where the regression comes from exactly would be a good next best step, I think. The toplev might give some interesting insights.

michaelwoerister · November 4, 2021, 10:24am

If you are benchmarking with perf or something based on perf, be sure to compile your code with -Cforce-frame-pointers=yes and run perf with --call-graph=fp for most accurate call-graph data.

PSeitz · November 6, 2021, 12:20pm

I can confirm from experience that inlining between crates is different and usually much worse (even with lto: true). My recommendation is to apply #[inline] to critical functions that have been moved to a module.

I don't get the comments about lto=true vs lto=fat, according to the documentation they are the same

https://doc.rust-lang.org/cargo/reference/profiles.html#lto

kaspar030 · November 7, 2021, 8:57pm

I'm getting much more detailed call graphs with --call-graph=dwarf?

bjorn3 · November 7, 2021, 9:01pm

--call-graph=dwarf will also give you inlined functions and works without frame pointers, but comes at the cost of slower perf report (need to unwind the stack at analysis time), much bigger perf files (need to capture a snapshot of the stack) and doesn't work as well for deep stacks (max 50kb is captured of the stack, default is much lower).

kaspar030 · November 7, 2021, 9:13pm

Uff, I got it, there actually was a significant change:

in model/module.rs, I used im::HashSet, whereas master used std::collections::HashSet in its main.rs. That was a refactoring mistake.

With this fixed, the refactored branch is as fast as master:

refactored:

❯ hyperfine -w 10 --prepare "touch /home/kaspar/src/riot.laze/laze-project.yml" "target/release/laze -C /home/kaspar/src/riot.laze b -g -G"
Benchmark 1: target/release/laze -C /home/kaspar/src/riot.laze b -g -G
  Time (mean ± σ):     993.8 ms ±  23.0 ms    [User: 5601.6 ms, System: 288.8 ms]
  Range (min … max):   964.6 ms … 1046.0 ms    10 runs

Toplev just told me the workload is frontend blocked. perf showed one function (bitmaps::Iter::next()) that suddenly popped up in the top five, but I didn't get the hint. But also im::HashSet::... was more visible, which made me think, where do I use that? ...

Ten points to Rust abstracting two completely different implementations that well.

Thousand points for the suggestions! Thanks a lot everyone!

And boy am I relieved that putting everyting in one file is not making things any faster.

system · February 5, 2022, 9:14pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Help us benchmark incremental compilation!	48	12525	March 25, 2019
Let's talk about parallel codegen compiler	49	10210	March 25, 2019
Help Needed: corpus for measuring runtime performance of generated code compiler	34	6002	March 25, 2019
Incremental Compilation Beta compiler	37	30603	March 25, 2019
Ideas on compiler performance improvement? compiler	22	2520	March 25, 2019

10% performance hit by refactoring crate into modules? (solved)

Related topics