Since Rust 1.37, the compiler has supported profile-guided optimization (PGO), a feature that many power users have been asking for. However, so far the performance improvements enabled by PGO have been a bit underwhelming, even after fixing a bad interaction with Cargo. On the other hand, my sample of test applications is rather small.
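For context, the usual two-phase PGO workflow looks roughly like this (the flag names come from the rustc documentation; the binary name, paths, and workload are placeholders, and `llvm-profdata` must match the LLVM version rustc uses):

```shell
# 1. Build an instrumented binary that records execution counts.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# 2. Run it on a representative workload to collect profiles.
./target/release/myapp --typical-workload

# 3. Merge the raw profile files into one.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# 4. Rebuild, letting the compiler use the collected profile.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```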
One thing I would like to see is PGO for libraries.
If, for example, we ran PGO on the collections in the standard library and other common crates, checked in the generated profile data, and pointed to it from Cargo.toml, then downstream packages could depend on them and get more optimized code without having to run profiling themselves (at least when compiling for one of the profiled architectures).
I gather that isn't simply going to work out of the box. Are the obstacles surmountable? I would be willing to help if someone can provide direction.
I think there's a bit of a conceptual conflict here, because PGO works by optimizing for a specific usage profile, while libraries are usually general-purpose with no knowledge of the usage profile yet.
I gave it a shot, but I saw minimal benefit (beneath the noise floor).
That said, my application was already incredibly tiny, and I saved substantially more (at least memory-wise) by constructing a smaller BufReader: the files I was reading were under 200 bytes long, averaging 10 bytes per line, I was only using BufReader for its convenient line-by-line semantics, and I didn't need 8K of heap to read that.
Tools like Valgrind's DHAT can make this under-utilization of allocated memory more obvious; I'm not sure whether PGO can cut corners here.
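A minimal sketch of the smaller-buffer idea: `BufReader::with_capacity` replaces the default 8 KiB allocation with whatever you ask for (the 64-byte capacity and sample data here are illustrative, not a recommendation):

```rust
use std::io::{BufRead, BufReader, Cursor};

fn read_lines(input: &[u8]) -> Vec<String> {
    // BufReader::new would allocate an 8 KiB buffer; 64 bytes is plenty
    // for files averaging ~10 bytes per line.
    let reader = BufReader::with_capacity(64, Cursor::new(input));
    reader.lines().map(|line| line.unwrap()).collect()
}

fn main() {
    let lines = read_lines(b"alpha\nbeta\ngamma\n");
    assert_eq!(lines, ["alpha", "beta", "gamma"]);
    println!("{} lines", lines.len());
}
```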
While I haven't really tried it much, my understanding of PGO these days is that one generally shouldn't expect much improvement from it.
The compiler needs to decide which branch of an `if` is more likely in order to choose which one to optimize more, possibly at the cost of the other. There are many heuristics by which the compiler can make an educated guess.
PGO only provides the real data so it doesn't have to guess.
But for this to have any effect on the final application, the original guess would have to be wrong, which would mean either that the application does something unusual enough to confuse the heuristics or that the compiler is bad at guessing. And by now compilers embody several decades of research on guessing better and better.
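As an aside, Rust does let you hand the heuristics a static hint: marking a function `#[cold]` tells the optimizer that calls to it are unlikely, which is a manual, coarse version of the branch-frequency data PGO collects automatically (the functions here are made up for illustration):

```rust
// #[cold] tells the optimizer this function is rarely called, so branches
// leading to it are treated as unlikely and the hot path stays
// straight-line. PGO would derive the same information from real runs.
#[cold]
fn report_failure(code: i32) -> String {
    format!("unexpected error {code}")
}

fn process(value: i32) -> String {
    if value >= 0 {
        // Hot path: laid out as the fall-through case.
        format!("ok: {value}")
    } else {
        report_failure(value)
    }
}

fn main() {
    assert_eq!(process(7), "ok: 7");
    assert_eq!(process(-3), "unexpected error -3");
    println!("done");
}
```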
Note that the C++ build process differs significantly enough from that of Rust (in particular wrt compilation unit granularity, which has a strong effect on inlining even with LTO on) that the increased effectiveness of PGO in C++ could be caused by this difference. But that's just a possibility.
Cool! Feel free to ping me about the results if you do.
The background behind this theory is that I recently tried LTO on my C++ builds and was disappointed at how little cross-unit inlining the compiler (in this case GCC) would actually perform. At the time, I speculated that PGO metadata about hot call paths might hint the compiler in the right direction, but I never got around to actually testing this hypothesis (it's on my to-do pile somewhere).
So I finally got around to testing this, and indeed PGO makes much more of a difference when compiling with a higher number of compilation units. In my tests, PGO improved performance by 0.3% with 1 CGU and by 1.2% with one CGU per Rust module.
Maybe even more interesting: tuning ThinLTO via the -import-instr-limit parameter made the effect even more pronounced. There I was able to get a 4% improvement over the best non-PGO configuration, which is well within expectations for a PGO build. It seems that the default "brute-force" ThinLTO settings actually interfere negatively with PGO.
So, I basically consider PGO as "working as expected" now.
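For reference, the configuration space explored above can be sketched as a Cargo release profile plus an extra LLVM flag; the concrete values are just the ones from this experiment, not general recommendations:

```toml
# Cargo.toml: deliberately keep multiple codegen units so PGO has
# cross-unit inlining decisions to inform.
[profile.release]
codegen-units = 16
lto = "thin"

# Combined at build time with something like (flag forwarded to LLVM):
#   RUSTFLAGS="-Cprofile-use=merged.profdata \
#              -Cllvm-args=-import-instr-limit=10" cargo build --release
```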