Significant Increase in Fat LTO Time Post Update to rustc 1.75.0-nightly

Upon updating my Rust compiler from version rustc 1.74.0-nightly (b4e54c6e3 2023-09-11) to rustc 1.75.0-nightly (df871fbf0 2023-10-24), I've experienced a significant increase in the time taken for the final Fat LTO process in my project. While the Fat LTO time was ~3400 seconds on version 1.74.0-nightly, it rose to ~7100 seconds on the newer 1.75.0-nightly version.

To further investigate this, I utilized the -Z self-profile option along with summarize to generate and compare profiling data on both compiler versions. I am pasting the data below for reference and would greatly appreciate the community's insights on whether this is a recognized issue with the recent nightly build or if there are any recommended workarounds or optimizations to alleviate this substantial increase in compilation time.

1.74:

+-------------------------------------------------+-----------+-----------------+----------+------------+
| Item                                            | Self time | % of total time | Time     | Item count |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| finish_ongoing_codegen                          | 3402.49s  | 33.418          | 3402.84s | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_passes                                     | 3371.47s  | 33.114          | 3402.53s | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_lto_optimize                               | 2278.57s  | 22.380          | 2278.57s | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen_emit_obj                    | 1070.40s  | 10.513          | 1070.40s | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_fat_lto_link_module                        | 30.52s    | 0.300           | 30.52s   | 228        |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_optimize                            | 13.62s    | 0.134           | 13.62s   | 2          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen                             | 8.84s     | 0.087           | 1079.24s | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| run_linker                                      | 3.22s     | 0.032           | 3.22s    | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| join_worker_thread                              | 344.24ms  | 0.003           | 344.24ms | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_generate_lto_work                       | 343.74ms  | 0.003           | 31.06s   | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_module                                  | 323.71ms  | 0.003           | 408.84ms | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_crate                                   | 244.86ms  | 0.002           | 1.29s    | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_fat_lto_build_monolithic_module            | 198.89ms  | 0.002           | 30.72s   | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_fn_sig                    | 69.40ms   | 0.001           | 69.40ms  | 37772      |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_generics_of               | 68.21ms   | 0.001           | 68.21ms  | 81412      |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_optimized_mir             | 50.25ms   | 0.000           | 50.25ms  | 1626       |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_impl_trait_ref            | 50.18ms   | 0.000           | 50.18ms  | 82658      |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| specialization_graph_of                         | 40.79ms   | 0.000           | 109.35ms | 114        |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| monomorphization_collector_graph_walk           | 37.88ms   | 0.000           | 347.04ms | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_select_candidate                        | 37.52ms   | 0.000           | 62.29ms  | 2467       |
+-------------------------------------------------+-----------+-----------------+----------+------------+
......
Total cpu time: 10181.483221645s
+----------------------------+-----------------+
| Item                       | Artifact Size   |
+----------------------------+-----------------+
| cgu_instructions           | 31953984 bytes  |
+----------------------------+-----------------+
| codegen_unit_size_estimate | 328793 bytes    |
+----------------------------+-----------------+
| linked_artifact            | 165002200 bytes |
+----------------------------+-----------------+
| object_file                | 271495536 bytes |
+----------------------------+-----------------+

1.75:

+--------------------------------------------------------+-----------+-----------------+----------+------------+
| Item                                                   | Self time | % of total time | Time     | Item count |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| finish_ongoing_codegen                                 | 7099.43s  | 33.376          | 7099.43s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_passes                                            | 7067.94s  | 33.228          | 7099.44s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_lto_optimize                                      | 5829.20s  | 27.404          | 5829.20s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen_emit_obj                           | 1215.85s  | 5.716           | 1215.85s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_fat_lto_link_module                               | 30.96s    | 0.146           | 30.96s   | 228        |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_optimize                                   | 14.08s    | 0.066           | 14.08s   | 2          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen                                    | 8.78s     | 0.041           | 1224.63s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| run_linker                                             | 3.04s     | 0.014           | 3.04s    | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_generate_lto_work                              | 344.55ms  | 0.002           | 31.50s   | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_module                                         | 282.49ms  | 0.001           | 345.48ms | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_crate                                          | 210.41ms  | 0.001           | 1.11s    | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_fat_lto_build_monolithic_module                   | 196.08ms  | 0.001           | 31.16s   | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_generics_of                      | 62.07ms   | 0.000           | 62.07ms  | 68991      |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_fn_sig                           | 56.75ms   | 0.000           | 56.75ms  | 29549      |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_impl_trait_ref                   | 46.99ms   | 0.000           | 46.99ms  | 81794      |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_optimized_mir                    | 45.51ms   | 0.000           | 45.51ms  | 1656       |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| specialization_graph_of                                | 38.15ms   | 0.000           | 102.93ms | 116        |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| link_binary_remove_temps                               | 33.28ms   | 0.000           | 33.28ms  | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_module_perform_lto                             | 31.07ms   | 0.000           | 7053.86s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| monomorphization_collector_graph_walk                  | 30.76ms   | 0.000           | 301.59ms | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| free_global_ctxt                                       | 28.01ms   | 0.000           | 28.01ms  | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_select_candidate                               | 23.48ms   | 0.000           | 48.03ms  | 2456       |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
......
Total cpu time: 21271.099680148s
+----------------------------+-----------------+
| Item                       | Artifact Size   |
+----------------------------+-----------------+
| cgu_instructions           | 32231505 bytes  |
+----------------------------+-----------------+
| codegen_unit_size_estimate | 334659 bytes    |
+----------------------------+-----------------+
| linked_artifact            | 164362944 bytes |
+----------------------------+-----------------+
| object_file                | 267755416 bytes |
+----------------------------+-----------------+

Thank you in advance for any assistance or insights you may provide!

2 Likes

Could have been one of the LLVM updates. As workaround would thinLTO suffice for you? That should be significantly faster to compile.

Can you test with 1.74.0-beta.4? That currently has the same LLVM version as nightly, so we could isolate a lot of other Rust changes that way. You can set RUSTC_BOOTSTRAP=1 to let that beta use nightly features if needed.

I have conducted further testing on both compiler versions (rustc 1.74.0-nightly (b4e54c6e3 2023-09-11) and rustc 1.75.0-nightly (df871fbf0 2023-10-24)) by employing the -Zllvm-time-trace=y and -Ztime-llvm-passes=y flags to gather more detailed data concerning the LLVM passes and their time consumption during the compilation process.

The additional data collected should provide a more granular view of the time distribution among different LLVM passes, and might help pinpoint the source of the increased Fat LTO time in the rustc 1.75.0-nightly version.

1.74:

===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2315.5184 seconds (2315.4190 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  1119.6733 ( 49.0%)   1.4742 (  4.6%)  1121.1475 ( 48.4%)  1121.1496 ( 48.4%)  SLPVectorizerPass
  619.6725 ( 27.1%)   4.5877 ( 14.3%)  624.2603 ( 27.0%)  624.2682 ( 27.0%)  GlobalOptPass
  142.1246 (  6.2%)   3.7638 ( 11.8%)  145.8883 (  6.3%)  145.8513 (  6.3%)  InstCombinePass
  102.1435 (  4.5%)   2.6057 (  8.1%)  104.7491 (  4.5%)  104.7401 (  4.5%)  GVNPass
  44.0555 (  1.9%)   2.1526 (  6.7%)  46.2081 (  2.0%)  46.2002 (  2.0%)  InlinerPass
  34.3042 (  1.5%)   4.5471 ( 14.2%)  38.8513 (  1.7%)  38.8515 (  1.7%)  IPSCCPPass
  32.7502 (  1.4%)   0.3878 (  1.2%)  33.1380 (  1.4%)  33.1338 (  1.4%)  DSEPass
  22.3498 (  1.0%)   2.8540 (  8.9%)  25.2038 (  1.1%)  25.2042 (  1.1%)  CalledValuePropagationPass

1.75:

===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 5898.4705 seconds (5898.4054 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  4093.1816 ( 69.8%)  11.4068 ( 32.8%)  4104.5884 ( 69.6%)  4104.6231 ( 69.6%)  GlobalOptPass
  1205.8698 ( 20.6%)   1.0976 (  3.2%)  1206.9674 ( 20.5%)  1206.9685 ( 20.5%)  SLPVectorizerPass
  151.3682 (  2.6%)   3.4558 (  9.9%)  154.8240 (  2.6%)  154.7960 (  2.6%)  InstCombinePass
  105.3950 (  1.8%)   1.6248 (  4.7%)  107.0197 (  1.8%)  107.0099 (  1.8%)  GVNPass
  48.1172 (  0.8%)   2.3671 (  6.8%)  50.4843 (  0.9%)  50.4769 (  0.9%)  InlinerPass
  34.9884 (  0.6%)   3.0910 (  8.9%)  38.0794 (  0.6%)  38.0797 (  0.6%)  IPSCCPPass
  31.7734 (  0.5%)   0.3575 (  1.0%)  32.1308 (  0.5%)  32.1269 (  0.5%)  DSEPass
  22.6235 (  0.4%)   1.2870 (  3.7%)  23.9105 (  0.4%)  23.9106 (  0.4%)  CalledValuePropagationPass

I am hopeful that this data will shed light on the underlying issue and am looking forward to any further insights or recommendations from the community. Thank you!

Thank you for the suggestion. However, I need Fat LTO for maximal performance optimization in my project.

Thank you for the suggestion. I will test with 1.74.0-beta.4 and post the data later to provide further insights on this issue.

1.74.0-beta.4:

+--------------------------------------------------------+-----------+-----------------+----------+------------+
| Item                                                   | Self time | % of total time | Time     | Item count |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| finish_ongoing_codegen                                 | 3353.34s  | 33.335          | 3353.66s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_passes                                            | 3322.78s  | 33.031          | 3353.38s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_lto_optimize                                      | 2096.29s  | 20.839          | 2096.29s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen_emit_obj                           | 1202.91s  | 11.958          | 1202.91s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_fat_lto_link_module                               | 30.07s    | 0.299           | 30.07s   | 228        |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| llvm_dump_timing_file                                  | 24.97s    | 0.248           | 24.97s   | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_optimize                                   | 14.88s    | 0.148           | 14.88s   | 2          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| LLVM_module_codegen                                    | 8.67s     | 0.086           | 1211.58s | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| run_linker                                             | 3.37s     | 0.033           | 3.37s    | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_generate_lto_work                              | 338.09ms  | 0.003           | 30.60s   | 1          |
+--------------------------------------------------------+-----------+-----------------+----------+------------+
......
Total cpu time: 10059.508958809s
+----------------------------+-----------------+
| Item                       | Artifact Size   |
+----------------------------+-----------------+
| cgu_instructions           | 31949329 bytes  |
+----------------------------+-----------------+
| codegen_unit_size_estimate | 329117 bytes    |
+----------------------------+-----------------+
| linked_artifact            | 164968104 bytes |
+----------------------------+-----------------+
| object_file                | 271414088 bytes |
+----------------------------+-----------------+
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2069.3431 seconds (2069.2430 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  930.3309 ( 45.6%)   1.2122 (  4.3%)  931.5431 ( 45.0%)  931.5450 ( 45.0%)  SLPVectorizerPass
  575.9518 ( 28.2%)   4.2449 ( 15.0%)  580.1966 ( 28.0%)  580.2013 ( 28.0%)  GlobalOptPass
  139.8040 (  6.8%)   4.5362 ( 16.0%)  144.3402 (  7.0%)  144.3063 (  7.0%)  InstCombinePass
  101.3340 (  5.0%)   1.5802 (  5.6%)  102.9142 (  5.0%)  102.9058 (  5.0%)  GVNPass
  43.5819 (  2.1%)   2.7182 (  9.6%)  46.3001 (  2.2%)  46.2919 (  2.2%)  InlinerPass
  32.8438 (  1.6%)   3.0050 ( 10.6%)  35.8488 (  1.7%)  35.8489 (  1.7%)  IPSCCPPass
  32.4568 (  1.6%)   0.4038 (  1.4%)  32.8606 (  1.6%)  32.8564 (  1.6%)  DSEPass
  20.7165 (  1.0%)   1.1919 (  4.2%)  21.9084 (  1.1%)  21.9087 (  1.1%)  CalledValuePropagationPass

Following up on this, I have conducted a series of tests across different Rust nightly versions ranging from nightly-2023-10-02 to nightly-2023-10-15 to identify the first version where the issue arises. The results indicate that nightly-2023-10-02 (rustc 1.75.0-nightly e0d7ed1f4 2023-10-01) functions as expected with no abnormal Fat LTO times, while nightly-2023-10-03 (rustc 1.75.0-nightly 2e5a9dd6c 2023-10-02) exhibits the problem with significantly increased Fat LTO time.

I hope this information helps in pinpointing the root cause of the issue. I'm looking forward to any further insights or suggestions from the community. Thank you!

1 Like

I couldn't find any LLVM updates in that range. update jemalloc-sys to most recent release by lqd · Pull Request #116331 · rust-lang/rust · GitHub is another candidate which is exactly in your regression range. You may want to use cargo-bisect-rustc to find the exact responsible PR.

1 Like

The most likely culprit in that range is Rebased: Mark drop calls in landing pads cold instead of noinline by InnovativeInventor · Pull Request #102099 · rust-lang/rust · GitHub, which allows more inlining, which can expose new pathological optimization issues.

Based on your pass timings, GlobalOpt was already incredibly slow before (at >600s) and is even slower now (at >4000s). This is quite atypical, the pass is usually very fast.

It's not possible to say what the actual issue is without further information. Is the code for your project open-source, so the issue can be reproduced? If not, I would suggest capturing a profile using perf, which might at least indicate which part of GlobalOpt the time is spent in.

3 Likes

Thank you for the insight. I am currently utilizing cargo-bisect-rustc to pinpoint the exact PR responsible for this issue. I will update this thread with the results as soon as I have them.

I am in the process of utilizing cargo-bisect-rustc to pinpoint the exact PR that could be causing this issue. Upon reviewing the comments on PR #102099, I noticed your mention of a potential compile-time regression for certain types of excessively large machine-generated code. Indeed, our project encompasses a significant amount of machine-generated code, which aligns with the scenario you described, making this PR a likely culprit.

Unfortunately, our project is not open-sourced, but I will follow your recommendation to capture a profile using perf in hopes of identifying the root cause. I'll share any significant findings here to provide further insight into this matter. Thank you for your guidance!

1 Like

cargo-bisect-rustc result:

********************************************************************************
Regression in 2e5a9dd6c9eaa42f0684b4b760bd68fc27cbe51b
********************************************************************************

Attempting to search unrolled perf builds
ERROR: couldn't find perf build comment
==================================================================================
= Please file this regression report on the rust-lang/rust GitHub repository     =
=        New issue: https://github.com/rust-lang/rust/issues/new                 =
=     Known issues: https://github.com/rust-lang/rust/issues                     =
= Copy and paste the text below into the issue report thread.  Thanks!           =
==================================================================================

searched nightlies: from nightly-2023-10-02 to nightly-2023-10-03
regressed nightly: nightly-2023-10-03
searched commit range: https://github.com/rust-lang/rust/compare/e0d7ed1f453fb54578cc96dfea859b0e7be15016...2e5a9dd6c9eaa42f0684b4b760bd68fc27cbe51b
regressed commit: https://github.com/rust-lang/rust/commit/2e5a9dd6c9eaa42f0684b4b760bd68fc27cbe51b

<details>
<summary>bisected with <a href='https://github.com/rust-lang/cargo-bisect-rustc'>cargo-bisect-rustc</a> v0.6.7</summary>


Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --start=2023-10-02 --end=2023-10-03 -c rustfmt

</details>

perf result:

Okay, so most time is spent inside hasAddressTaken(). This function iterates overall uses of the function, so it can be expensive if there are many uses. To actually get this kind of pathological case it needs to be used in some quadratic fashion though.

There are multiple uses of hasAddressTaken(), but I think the most likely candidate is https://github.com/llvm/llvm-project/blob/0d0ca51ffe1002cec3b1b7a332e290176b650390/llvm/lib/Transforms/IPO/GlobalOpt.cpp#L1783C23-L1783C38. This effectively iterates over all calls in all functions, so we get an O(n^2) complexity in the total number of calls.

Are you on x86_64-unknown-linux-gnu? I could try to implement a speculative fix and produce a test build for it.

4 Likes

Yes I'm on x86_64-unknown-linux-gnu. I'm looking forward to testing the fix to see if it resolves the issue. Please keep me updated, and thanks for your help.

Can you please test the a5ce5f4157fcfd7ce1b007b4cd90ecea39ac0537 toolchain?

You should be able to install it using cargo install rustup-toolchain-install-master && rustup-toolchain-install-master a5ce5f4157fcfd7ce1b007b4cd90ecea39ac0537 and then run cargo +a5ce5f4157fcfd7ce1b007b4cd90ecea39ac0537.

5 Likes

Thank you for the instructions. I attempted to install the specified toolchain along with rustfmt as my project requires it for compilation. However, I encountered an error indicating that the rustfmt component is missing on the a5ce5f4157fcfd7ce1b007b4cd90ecea39ac0537 toolchain on the nightly channel for the x86_64-unknown-linux-gnu target. Here's the command I ran and the error message I received:

rustup-toolchain-install-master a5ce5f4157fcfd7ce1b007b4cd90ecea39ac0537 -c rustfmt
error: missing component `rustfmt` on toolchain `a5ce5f4157fcfd7ce1b007b4cd90ecea39ac0537` on channel `nightly` for target `x86_64-unknown-linux-gnu`

Is there an alternative way to get rustfmt installed for this specific toolchain?

Can you please try c23cf1f1ae4b4f214b1f56682785fe158af01c1c? It should include rustfmt.

I've tested the c23cf1f1ae4b4f214b1f56682785fe158af01c1c toolchain, which included rustfmt as you mentioned. The Fat LTO time is now around ~3300 seconds. I have also gathered detailed results for the LLVM passes:

===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 1829.8834 seconds (1829.7805 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  1197.1784 ( 66.6%)   1.3375 (  4.2%)  1198.5159 ( 65.5%)  1198.5194 ( 65.5%)  SLPVectorizerPass
  153.2118 (  8.5%)   4.0990 ( 12.7%)  157.3108 (  8.6%)  157.2806 (  8.6%)  InstCombinePass
  105.6791 (  5.9%)   1.5892 (  4.9%)  107.2684 (  5.9%)  107.2597 (  5.9%)  GVNPass
  48.7374 (  2.7%)   2.8769 (  8.9%)  51.6143 (  2.8%)  51.6062 (  2.8%)  InlinerPass
  36.6116 (  2.0%)   6.3579 ( 19.8%)  42.9694 (  2.3%)  42.9697 (  2.3%)  IPSCCPPass
  30.5134 (  1.7%)   2.6811 (  8.3%)  33.1945 (  1.8%)  33.1950 (  1.8%)  GlobalOptPass
  31.6985 (  1.8%)   0.3643 (  1.1%)  32.0629 (  1.8%)  32.0594 (  1.8%)  DSEPass
  23.5987 (  1.3%)   2.3870 (  7.4%)  25.9856 (  1.4%)  25.9862 (  1.4%)  CalledValuePropagationPass
1 Like