Profile-guided optimization: How well does it work for you?

Since Rust 1.37 the compiler supports profile-guided optimization (PGO), a feature that many power users have been asking for. However, so far the performance improvements enabled by PGO have been a bit underwhelming, even after fixing a bad interaction with Cargo. On the other hand, my sample of test applications is rather small.

So my question is: Did anybody try out PGO with their projects? Or would anyone here like to try it? You'd have to use at least Rust 1.39 (currently beta) in order to get a working Cargo version and then just follow the instructions in the official docs: https://doc.rust-lang.org/rustc/profile-guided-optimization.html#a-complete-cargo-workflow

In theory, LLVM's PGO can noticeably improve runtime performance (e.g. up to 10% for Firefox).


I don't like using RUSTFLAGS (seems too fragile), so I'm waiting for PGO to get some first-class support in Cargo before trying it.


One thing I am looking for would be PGO for libraries.

If, for example, we ran PGO on the collections in the standard library and on other common crates, checked in the generated profile data, and pointed to it in the toml file, then downstream packages could depend on them and get more optimized code without having to run profiling themselves (at least if they are compiling to one of the optimized architectures).

I gather that isn't simply going to work out of the box. Are the obstacles surmountable? I would be willing to help if someone can provide direction.

I think there's a bit of a conceptual conflict here, because PGO works by optimizing for a specific usage profile, while libraries are usually general purpose and have no knowledge of the usage profile yet.


I tried it in my application at work and the result was a slowdown. But I guess that was probably because of my lack of knowledge here.

Thanks for giving it a try, @vehls!

I gave it a shot, but I saw minimal benefit (beneath noise floor).

That said, my application was already incredibly tiny, and I got substantially larger savings (at least memory-wise) by constructing a smaller BufReader. The files I was reading were under 200 bytes long, averaging 10 bytes per line, and I was only using BufReader to get nice linewise semantics, so I didn't need 8K of heap to read that :wink:

Tools like Valgrind's DHAT can make this under-utilization of allocated memory more obvious; I'm not sure whether PGO can cut corners here.
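For anyone curious what the smaller buffer looks like, here is a minimal sketch. The helper name `read_lines_small_buffer` and the 256-byte capacity are just illustrative assumptions, not what my application actually uses; `BufReader::with_capacity` is the standard library call that replaces the default buffer size.

```rust
use std::io::{BufRead, BufReader, Read};

/// Reads every line from `source` through a `BufReader` whose internal
/// buffer holds `capacity` bytes instead of the default size.
/// (Hypothetical helper, named only for this example.)
fn read_lines_small_buffer<R: Read>(source: R, capacity: usize) -> std::io::Result<Vec<String>> {
    let reader = BufReader::with_capacity(capacity, source);
    // `lines()` strips the trailing newline and yields io::Result<String>,
    // so collecting into io::Result<Vec<String>> stops at the first error.
    reader.lines().collect()
}

fn main() -> std::io::Result<()> {
    // An in-memory byte slice stands in for a tiny file here.
    let data = "alpha\nbeta\ngamma\n";
    let lines = read_lines_small_buffer(data.as_bytes(), 256)?;
    for line in &lines {
        println!("{line}");
    }
    Ok(())
}
```

Line-wise reading still works when a line is longer than the buffer; the reader just refills the buffer more often.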


For a private app of mine LTO gave a few percent improvement. PGO didn't give any improvement over LTO.

While I didn't really try it much, my understanding of PGO these days is that one generally shouldn't expect much improvement from it.

The compiler needs to decide which branch of an if is more likely so that it can optimize that one more, possibly at the cost of the other. There are many heuristics by which the compiler can make an educated guess.

PGO only provides the real data so it doesn't have to guess.

But for this to have any effect on the end application, the original guess would have to be wrong, which would mean that either the application does something very unusual that confuses the heuristics or the compiler is bad at guessing. And compilers by now have several decades of research behind them on how to guess better and better.
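To make the "guess" concrete, here is a small sketch of the kind of branch in question. The function names are made up for illustration. `#[cold]` is a real Rust attribute that hints that a function is unlikely to be called, which is roughly the sort of information a profile would otherwise supply from measurements; how much the optimizer does with it is up to the compiler.

```rust
/// Error path. `#[cold]` hints that calls to this function are unlikely,
/// so the branch leading to it can be treated as the rare one.
/// (Hypothetical function, named only for this example.)
#[cold]
#[inline(never)]
fn invalid_digit(byte: u8) -> String {
    format!("unexpected byte 0x{byte:02x}")
}

/// Sums the ASCII digits in `input`, failing on the first non-digit byte.
fn sum_digits(input: &[u8]) -> Result<u32, String> {
    let mut total = 0u32;
    for &b in input {
        if b.is_ascii_digit() {
            // Expected to be the common case.
            total += u32::from(b - b'0');
        } else {
            // Expected to be the rare case.
            return Err(invalid_digit(b));
        }
    }
    Ok(total)
}

fn main() {
    println!("{:?}", sum_digits(b"1234"));
    println!("{:?}", sum_digits(b"12x"));
}
```

If the heuristics already treat the error branch as unlikely, neither the attribute nor a profile changes much here, which is the point above.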


That's not really accurate - I regularly see substantial wins from PGO in C++ codebases.


Note that the C++ build process differs significantly enough from that of Rust (in particular wrt compilation unit granularity, which has a strong effect on inlining even with LTO on) that the increased effectiveness of PGO in C++ could be caused by this difference. But that's just a possibility.

I agree, in Firefox the improvement for C/C++ code is 5-10%.

Huh, that is a really interesting theory! And one that can even be tested :) I'll give it a try for the regex benchmark suite when I find the time.


Cool! Feel free to ping me about the results if you do.

The background behind this theory is that I recently tried LTO on my C++ builds and was disappointed at how little cross-unit inlining the compiler (in this case GCC) would actually perform. At the time, I speculated that PGO metadata about hot call paths might nudge the compiler in the right direction, but I never got around to actually testing this hypothesis (it's on my to-do pile somewhere).


So I finally got around to testing this, and indeed PGO makes much more of a difference when compiling with a higher number of compilation units. In my tests PGO improved performance by 0.3% with 1 CGU and by 1.2% with one CGU per Rust module.

Maybe even more interesting: Tuning ThinLTO via the -import-instr-limit parameter made the effect even more pronounced. There I was able to get a 4% improvement over the best non-PGO configuration, which is well within expectations for a PGO build. It seems that the default "brute-force" ThinLTO settings actually interfere negatively with PGO.

So, I basically consider PGO as "working as expected" now.

