Add Propeller support to the Rustc compiler

Hi!

I suggest adding support for Propeller Post-Link Optimization (PLO) to the Rustc compiler, as a step beyond the existing Profile-Guided Optimization (PGO) infrastructure.

You can think of Propeller as an alternative to LLVM BOLT, which can already be used for Rust programs with cargo-pgo. Propeller is an interesting alternative since it does not require an additional external tool for the optimization process (LLVM BOLT has such a "limitation"). Also, it helps to avoid the disassemble-assemble process which can be tricky for really large binaries (I have already run into some issues when optimizing the ClickHouse binary with BOLT).

I have a bunch of hopefully helpful links about the topic:

After adding Propeller support to the Rustc compiler, I suggest adding an additional mode to cargo-pgo as it's already done for BOLT.

Mentioning here @Kobzol as a Rust PGO hero :slight_smile:

Hi :slight_smile: A couple of notes.

Are there any benchmarks that show whether, and by how much, Propeller is better than BOLT? I skimmed through the Propeller paper and it seemed to me that its motivation was focused on distributed builds and warehouse-scale applications, which is not necessarily something that might be useful for the Rust compiler itself. Really the paper looked to me like Google is basically trying to reproduce BOLT, but with lower memory usage when optimizing binaries. I haven't really seen significant performance improvements of Propeller over BOLT. BOLT seems to generate much larger binaries, that's true, but that's an implementation issue that will be hopefully eventually resolved, with the layout rewrite.

Regarding adding support for any Rust programs, Rustc doesn't support BOLT directly, and probably shouldn't even know about it. You just pass some additional opaque linker flags to Rustc if you want the resulting binary to be BOLT-compatible. It would be ideal to have the same workflow for Propeller - if the binary needs to be compiled in some special way in order to support Propeller, an ideal way would be to just pass some -Cllvm-args=<propeller flags> to LLVM, so that the Rust compiler doesn't need to know about it at all (same as with BOLT).
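To make the "opaque flags" workflow concrete, here is a rough dry-run sketch of the BOLT side. It only prints the commands rather than running them. The binary name `myapp` and the profile file `perf.fdata` are hypothetical placeholders, and it assumes a Linux target where Rustc links through a `cc`-style driver so that `-Wl,` is understood. A Propeller variant would swap in whatever `-Cllvm-args=<propeller flags>` turn out to be needed; those are left unspecified here.

```shell
#!/bin/sh
# Sketch only: rustc is not told anything about BOLT. It just forwards an
# opaque linker flag. --emit-relocs asks the linker to keep relocations in
# the output, which BOLT can use when it moves code around.
BOLT_LINK_FLAGS="-C link-args=-Wl,--emit-relocs"

# Hypothetical binary path, for illustration only.
APP="target/release/myapp"

# Dry run: print the two steps instead of executing them.
BUILD_CMD="RUSTFLAGS=\"$BOLT_LINK_FLAGS\" cargo build --release"
BOLT_CMD="llvm-bolt $APP -o $APP.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort"

echo "$BUILD_CMD"
echo "$BOLT_CMD"
```

The point of the design is that everything tool-specific lives in the flag strings, so supporting another post-link optimizer would mean changing those strings and not the compiler.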


Are there any benchmarks that show whether, and by how much, Propeller is better than BOLT?

I do not think so, since both projects are trying to do the same thing from different angles (BOLT with a disassemble/assemble approach, Propeller via basic block reordering at relink time).

it seemed to me that its motivation was focused on distributed builds and warehouse-scale applications, which is not necessarily something that might be useful for the Rust compiler itself.

For the Rustc compiler itself - probably not. For the large number of Rust applications across the industry - it could be the case.

Really the paper looked to me like Google is basically trying to reproduce BOLT, but with lower memory usage when optimizing binaries.

Yes, it's one of the reasons why I want to try using Propeller over BOLT. I have already run into serious issues with the huge amount of RAM required during the BOLT optimization phase. This problem has already been reported upstream, but the only viable workaround for now is using "lite" mode in BOLT.

It would be ideal to have the same workflow for Propeller - if the binary needs to be compiled in some special way in order to support Propeller, an ideal way would be to just pass some -Cllvm-args=<propeller flags> to LLVM, so that the Rust compiler doesn't need to know about it at all (same as with BOLT).

Agree. If I understand the Clang commit about Propeller correctly (D68049, Propeller: Clang options for basic block sections), it's only a thin wrapper over the LLVM flags. So right now we can probably try the -Cllvm-args=... approach (but I haven't tried it yet). Adding the corresponding flags to Rustc itself seems to be just a quality-of-life feature.

Hi, and thank you both for promoting PGO and making it more accessible! I'm all for tools that improve performance and efficiency, but I need to make a couple of clarifications to the claims here:

  1. External tools:

Propeller is an interesting alternative since it does not require an additional external tool for the optimization process

Even though most of the Propeller components are integrated into the build system or the toolchain, there's still one standalone tool for Whole Program Analysis.

  2. Benchmarks:

Are there any benchmarks that show whether, and by how much, Propeller is better than BOLT?

The Propeller paper (Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications, Google Research) made several claims in that respect:

[Screenshot of the claims from the paper; image not included]

We didn't evaluate these claims. At first glance it seems unintuitive that Propeller can deliver better perf given that function splitting and reordering algorithms are common for BOLT and Propeller, and that Propeller is more limited wrt branch simplifications.

  3. Memory usage:

Really the paper looked to me like Google is basically trying to reproduce BOLT, but with lower memory usage when optimizing binaries.

Be aware that the memory usage claim in the Propeller paper is about lower peak memory consumption on a single machine, and not about using less total memory across all machines participating in the relinking process. The paper does not include memory usage in a single-host relinking scenario, as that is explicitly a non-goal of Propeller.

And to note, Lightning BOLT made improvements to running time and peak memory usage by introducing lite mode (default) and leveraging parallelism.

  4. Output binary size:

BOLT seems to generate much larger binaries, that's true, but that's an implementation issue that will be hopefully eventually resolved, with the layout rewrite.

While this is true, allocatable section sizes are the same. It's understandable that artifact size is a big concern for mass distribution. BOLT has the -use-old-text option to reuse the text section and partially mitigate this issue.

  5. Disassembling:

Also, it helps to avoid the disassemble-assemble process which can be tricky for really large binaries (I have already run into some issues when optimizing the ClickHouse binary with BOLT).

While generally true, this part is robust enough for internal uses which include really large binaries, in large part thanks to mature LLVM components. Please report any issues through the LLVM GitHub issue tracker.


Thanks a lot for chiming in! So, if I understand it correctly, if I'm not bottlenecked by memory usage, and I don't require distributed builds, then Propeller is probably not needed, and it's not expected in general that it would provide much bigger perf. gains than BOLT?

Similar to ThinLTO, Google's Propeller will have more features. I would vote for BOLT.

From my understanding of both tools - yes, you are right.

Yes, that's my understanding too. But of course it's better to run the experiment for your particular use case.

