The wg-self-profiler working group maintains the unstable -Z self-profile rustc flag and the associated profiler tooling in the measureme crate. We're looking for feedback on what kinds of questions people are looking to answer with the feature and what their ideal workflows are.
Specifically, we'd like to hear from you if you:
Have used (or tried to use) -Z self-profile in the past
What information were you trying to find or what problem were you trying to solve?
Did you find the information or solve the problem?
What was your workflow for using the tool?
What would make it easier to use the tooling?
Have profiled (or tried to profile) rustc in the past
What tools did you use?
What information were you looking for or problem were you trying to solve?
Would the existing self-profiler features have helped you?
I don't know how much it related to the self-profiling feature but I'll take this opportunity to share this recent talk on profiling by Emery Berger as it worth considering in the context of improving the performance of rustc: https://www.youtube.com/watch?v=r-TLSBdHe1A
Overall the question I would like a tool to answer is "what lines in my source file is the compiler spending its time on?" or as it is spending most of the time in llvm "what lines in my source file is generating the most llvm-ir?"
I just ran this for the first time on the cargo library crate and generated some cool tables, it was a smooth experience! I had to poke around a bit in the measureme repository to find the summarize executable and associated instructions. I then poked around a bit more, found crox, and loaded that up into the chrome tools and wow that was cool as well!
Most of the time when I've profiled rustc historically and/or looked for optimization opportunities I mostly look for architectural changes that can improve compile times. For example parallel codegen, defaulting to multiple codegen units, pipelined compilation, etc. I think the self-profile option is great for exploring that as well, for example a low-hanging fruit is that almost all compilations do not need compressed bitcode in rlibs because they're not doing LTO, so when building Cargo in debug mode that's an easy second saved for just the Cargo crate!
In terms of information that I'd love to have, I'd echo @Eh2406's sentiment where being able to "blame" lines of code would be amazing. I'd love to see, for example, that 10% of our compile time is all from format! or something like that. From the perspective of a crate author they typically don't have much control over what the compiler is doing, so knowing what they can do to make the compiler's job easier would be a godsend. I realize though that this is super difficult, and likely still a long way out (especially because LLVM is where a lot of the interest lies and I'm not sure LLVM has great support for this)
I'm also curious, does -Z self-profile lend itself well to multi-crate compilations too? For example could the data be aggregated into one profile to see everything? A lot of times I find myself profiling cross-crate scenarios as well or just profiling a couple of sequenced compilations.
Also as one final aside, I found that in the chrome dev tools Cargo had over a hundred threads (this was debug incremental mode) which made a huge waterfall of threads to follow. It'd be pretty cool though if the thread ids could get reused to show the actual parallelism on the CPU and also flatten out the graph a bit.
Thanks for that feedback! This is definitely something we've heard before from other people. As you've said, it's not an easy problem I believe @michaelwoerister has mentioned that getting this kind of information out of LLVM is possible, but I don't know the details. The larger question we have is "what to actually do with this data?". The incremental compilation support in rustc gives us very fine-grained information about where time is being spent but we don't know how to aggregate that data. One particularly tricky question is how to deal with the effects of the query cache. For example:
During compilation, we'll need to calculate information about the type of x in foo() so we'll issue a query like layout_of(Vec<u8>) -> TyLayout which will take some amount of time to complete. The results of this query are then cached in memory. Later when we process bar(), we'll issue another query to get information about y, layout_of(Vec<u8>) -> TyLayout but this time, we already know the result and return it from the cache making it basically "free".
So, if we take the simplest approach to blaming time, we'll report foo() as being costly to the user while bar() is "zero" time. What's even more confusing is that if the user removes foo(), now suddenly bar() takes additional time to compile.
We're very interested in building good tooling which can answer the "what parts of my code are expensive to compile?" question, but we're not really sure how to get there. For this particular thread, we'd like to focus on the "profiling rustc" use case. If you have thoughts or ideas(!!) about profiling user code's compilation time, we'd really like the feedback on this GitHub issue.
I think this would be relatively easy to support. Currently, we assume all of the events come from a single process but I don't think it would be difficult to change that.
Right now, we just call std::thread::current().id() which returns an identifier used internally by the std library, not the OS thread id. If at all possible, we should try to use the OS thread id which should fix that.
@alexcrichton it certainly seems possible to collapse disjoint threads like you want. I'm slightly concerned about hiding what's actually happening from the user in a profiling tool but as an opt-in option, that seems fine.
Out of curiosity, is it normal that Cargo would create that many threads?
Oops, sorry! More than happy to discuss that elsewhere.
I believe this is normal yeah, I was compiling Cargo in debug mode incrementally which probably has a massive number of CGUs because Cargo is so big. And yeah @bjorn3 is right in that the rustc codegen backend actually literally spawns a new thread for every codegen unit, we've never actually gotten around to having a pool of threads to reuse. That being said it may be a good trick to implement at some point!
If it's not easy to have that sort of reuse of threads it's not going too far out of your way I think. The main angle I'm getting at is that looking at these graphs sometimes you can see opportunities of missed parallelism, and having everything organized by "core" rather than by thread can help show that a bit more easily.
Like @obust mentioned this sounds like exactly the use case that is layed out in the liked talk. Here is the project and @alexcrichton has started some experiments over here i guess. The TLDR; is you run "experiments" like (is removing/speeding up foo() beneficial for the overall runtime? In your case, no because now bar() is eating it up the benefit – for obvious reasons). But now if i think about that, in the case of a cache that potentially have a miss or not, i don't think its really solving the problem because coz cannot know that removing foo() entirely forces bar() to get significant because of the cache miss. But i do think some of the techniques could help to find critical paths.
Yep, that feature seems to be in high demand. I don't know if we'll ever be able to blame "lines" of code, but blaming compile time on individual functions seems quite achievable. Still a ways off though.
Yes, the plan is for it be able to handle multiple processes. The raw format already makes sure that events can be assigned to their processes. The main complication is in the various post-processing tools. But I think some of that even is already implemented.
clang/llvm added support for generating Chrome chrome://tracing in https://reviews.llvm.org/rL357340, the same format that crox is generating so would be nice to be able to combine them but can be hard due to how the timestamps is generated in diffrent ways.
the events is tagged with module or function depending on pass type.
Here is one possible vision for a workflow: There is a tool called rsp (for Rust self-profiling) that everything goes through. This tool does one of four things:
Invoke cargo with RUSTFLAGS modified, so that all rustc invocations do self-profiling, then generate a report.
Invoke rustc directly, so that self-profiling is done, then generate a report.
Generate a report for an existing set of profiles, then show it.
Conveniently list and show already existing reports.
What makes this tool ergonomic is that:
it supports "presets", like "summarize", "incremental compilation details", "monomorphization bloat", "tell me everything". These presets define what event filter is used during self-profiling and what kind of report is generated at the end.
it uses reasonable defaults that we would not want to hard-code into rustc or cargo, like for choosing the storage location for profile data.
it has a curses-based UI that lets you select the preset or action you want to take, and displays a help message for the currently selected things. I think this is incredibly useful because it makes all the functionality discoverable. The UI mode would be the default but there would also be a batch mode and when running something through the UI, it would automatically print the corresponding commandline string so that subsequent invocations can easily done without the UI.
The tool would also support different report renderers:
terminal-compatible text
machine-readable JSON
HTML
something else?
and reports would be stored on disk by default, containing a timestamp and all the options rustc/cargo were invoked with, so that one does not easily lose or misinterpret measurement results.
Example invocations of rsp could look like:
# Opens the UI for selecting a preset, then runs cargo and generates an HTML report that is opened in the browser
rsp -- cargo build --release
# Runs cargo and generates a summarize HTML report that is opened in the browser
rsp --batch --preset=summarize -- cargo build --release
# Runs cargo with custom RUSTFLAGS
RUSTFLAGS="-Cfoo=bar" rsp --batch --preset=incremental-details -- cargo build
# Runs rustc and generates a summarize HTML report that is NOT opened in the browser
rsp --batch --preset=summarize --dont-show-report -- rustc main.rs -O
# Generate a report for existing profiling data
rsp --batch --generate-report=/path/to/directory/containing/profiling-data
# Generate a report for existing profiling data
rsp --batch --preset=perf.rlo --report-format=json -- cargo --release rustc --args
# Opens the UI, showing the main menu with defaults selected
# and "generate report", "run cargo build", "run cargo build --release", and
# "show last report" actions available
rsp
# Shows the last report again, without re-compiling or re-analyzing anything.
# This might open a browser or just print to the terminal
rsp -l
This tool (together with all "sub-tools") would probably be distributed as rustup component, so that it stays in sync with cargo and rustc.
That all sounds great, but I'm a little worried about the perfect becoming the enemy of the good here. In my experience, whatever other fancy features a profiling system has my workflow boils down to
Do the thing that gets me the flame chart, (or better yet a flame graph).
Look at the chart and think about how to optimize based on what the chart says.
Optimize according to results of step 2. Sometimes this requires an architectural rework.
Repeat steps 1-3 until it's fast enough or I am exhausted.
In my opinion something that can do that is a good MVP to aim towards at the start. That said, making profiling accessible to a wider audience, like Rust has with other things, is a worthwhile goal.