Let's talk about parallel codegen

tl;dr: I think we should turn up codegen-units to the number of cores by default to get 2-3x faster optimized builds with virtually no runtime impact.


It's been quite a while now since -C codegen-units started working again, and we've had very few reports related to it, so I think it's time to discuss turning it on by default. As a recap, the compiler currently builds up one giant LLVM module for each crate, optimizes it, and then generates an object file from that LLVM module. The codegen-units option indicates that this giant module should instead be split up into N units, allowing each unit to be optimized and codegen'd in parallel.
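As a rough sketch of the idea (all names and structure here are invented for illustration, not the actual rustc internals), the win comes from getting thread-level parallelism for the LLVM work:

```rust
use std::thread;

// A toy sketch of the idea, not the actual rustc internals: divide the
// crate's items into N buckets and run the expensive LLVM work for each
// bucket on its own thread. `Item` and `optimize_and_codegen` are invented.
struct Item(String);

fn optimize_and_codegen(unit: Vec<Item>) {
    // stand-in for running LLVM's optimization and codegen passes on one unit
    println!("unit with {} items done", unit.len());
}

fn main() {
    let items: Vec<Item> = (0..100).map(|i| Item(format!("fn{}", i))).collect();
    let n_units = 4;
    let per_unit = (items.len() + n_units - 1) / n_units;

    let mut units: Vec<Vec<Item>> = Vec::new();
    let mut iter = items.into_iter().peekable();
    while iter.peek().is_some() {
        units.push(iter.by_ref().take(per_unit).collect());
    }

    // one thread per codegen unit, all optimized and codegen'd in parallel
    let handles: Vec<_> = units
        .into_iter()
        .map(|unit| thread::spawn(move || optimize_and_codegen(unit)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```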

Builds can typically be 2-3x faster with -C codegen-units at an appropriate value. The speedup is not linear because the rest of the compiler still takes a good chunk of time and runs sequentially. Additionally, one of the longest stages of the compiler, translation, is still serial. Finally, creating lots of LLVM modules is pretty memory-heavy and, as I'll show below, does degrade compile time if there are too many.

To help prepare for this transition, I’ve collected some data to help show why I think that parallel codegen should be enabled by default.

Compile-time improvements

First, it should be mentioned that all numbers here are for optimized crates only. The win for parallel codegen isn't very large for debug builds (since codegen is only a small portion of a debug build).

This first table is collected from compiling the Cargo crate itself. The compiler used -C opt-level=3 and numbers were generated from the current stable compiler. All timings just used the Unix time utility and are in seconds on my machine. Each column is the number of codegen units used, and each row is a metric measured at that number of codegen units. Some terminology:

  • build (all) - time for the whole build to finish
  • build (llvm) - time taken by only the LLVM passes
  • build (trans) - time taken by just the translation pass of the compiler
  • foo avg - average time it took to do “foo” across all codegen units
  • foo std - standard deviation of the time it took each codegen unit to do “foo”
  • module - the “llvm module passes” optimization pass
  • function - the “llvm function passes” optimization pass
  • codegen - the “llvm codegen passes” to generate an object file
| codegen units |   1     |    2    |   4     |   8     |   16    |   32    |   64    |  128    |   256   |   512   |  1024   |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build (all)   |  86.120 |  72.180 |  40.814 |  40.310 |  44.646 | 53.524  | 73.050  | 108.710 | 188.920 | 359.600 | 751.00  |
| build (llvm)  |  71.681 |  56.437 |  23.563 |  20.529 |  19.820 | 20.809  | 24.246  |  28.933 |  44.858 |  83.161 | 204.102 |
| build (trans) |   7.519 |   8.717 |  10.316 |  12.810 |  17.825 | 25.659  | 41.613  |  72.358 | 136.275 | 267.906 | 536.293 |
| module avg    |  55.01  |  36.53  |  13.69  |  10.13  |    9.07 |  7.55   |  6.00   |   6.64  |  11.03  |  18.67  |  36.98  |
| module std    |   0.00  |   9.38  |   2.97  |   2.61  |    2.96 |  3.50   |  3.56   |   3.05  |   3.20  |   6.74  |  16.87  |
| codegen avg   |  14.03  |   7.80  |   3.45  |   2.14  |    1.32 |  1.20   |  0.79   |   0.66  |   0.45  |   0.37  |   0.33  |
| codegen std   |   0.00  |   0.47  |   0.59  |   0.43  |    0.42 |  0.40   |  0.52   |   0.92  |   0.45  |   0.54  |   0.52  |
| function avg  |   1.69  |   1.71  |   0.84  |   0.73  |    1.44 |  2.06   |  3.25   |   5.62  |  10.75  |  22.95  |  50.69  |
| function std  |   0.00  |   0.08  |   0.20  |   0.12  |    0.28 |  0.60   |  1.13   |   1.69  |   3.75  |   9.30  |  22.45  |
| max rss (MB)  |  490    |   593   |  640    |  753    |    951  |  1379   |  2133   |   3526  | 6459    | 12323   | 24070   |
| size (MB)     |  6.5    |   6.7   |  6.8    |  6.9    |    7.0  |  7.1    |  7.1    |   7.1   | 7.1     |  7.1    |  7.1    |

Given all this data, there are some interesting conclusions here!

  • We do not benefit from just increasing codegen units. My machine has 8 cores, and there's a sharp increase in build time when going far beyond that number.
  • We get quite a nice speedup going up to the number of cores on the machine, but the parallelization wins aren't quite a factor of N for N codegen units.
  • More codegen units are pretty memory-intensive.
  • Binary size is affected by codegen units, but not by a large amount.
  • Increasing the number of codegen units takes a pretty heavy toll on translation time.

One reason lots of codegen units may be pretty inefficient is that the compiler currently spawns one thread per codegen unit. I tweaked the compiler to have a -C codegen-threads option and recompiled Cargo with a constant 8 codegen threads and N codegen units to get the following data:

| codegen units |   1     |    2    |   4     |   8     |   16    |   32    |   64    |  128    |   256   |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build (all2)  |  74.450 |  58.418 |  33.258 |  30.998 |  29.962 | 37.528  | 47.703  |  70.310 | 117.590 |

Note that these timings are a bit lower in general than the previous set of numbers, but I'd just chalk that up to this compiler being the current nightly, which is presumably faster than the current stable. Overall, though, it looks like we do get some nice wins by limiting the number of threads; for example, 16 codegen units no longer spikes up, but there is still a general upward trend in compile time with too many codegen units.
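For the curious, here's a toy model of what the codegen-threads experiment changes about scheduling (again, all names invented; the real patch looks nothing like this): a fixed pool of worker threads drains a queue of codegen units, rather than spawning one thread per unit.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Toy model of capping parallelism: 8 worker threads drain a shared queue of
// 64 codegen units, instead of spawning 64 threads all at once.
fn main() {
    let queue: Arc<Mutex<Vec<u32>>> = Arc::new(Mutex::new((0..64).collect()));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let queue = queue.clone();
            thread::spawn(move || loop {
                // take one unit off the queue (the lock is released right away)
                let unit = queue.lock().unwrap().pop();
                match unit {
                    // stand-in for "optimize + emit an object file for `unit`"
                    Some(_unit) => {}
                    None => break,
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```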

Now another aspect of codegen-units is precisely how the compiler splits up a crate into separate LLVM modules. Currently the compiler will simply switch codegen units whenever it starts translating a new literal Rust module. This unfortunately means that the maximum number of codegen units a crate can benefit from is the number of Rust modules. In Cargo, for example, only 54 of the 1024 generated object files (for 1024 codegen units) had code in them; the remaining 970 were all blank!
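Here's a toy model of why that caps the useful number of units (numbers invented, not the real trans code):

```rust
// Toy model of the current strategy: the compiler rotates to the next codegen
// unit only when it enters a new Rust module, so a crate with M modules can
// never fill more than M units, however many you ask for.
fn items_per_unit(module_sizes: &[usize], n_units: usize) -> Vec<usize> {
    let mut units = vec![0; n_units];
    for (module_idx, &n_items) in module_sizes.iter().enumerate() {
        units[module_idx % n_units] += n_items; // one rotation per module
    }
    units
}

fn main() {
    // a crate with 3 modules and 16 requested units: 13 units stay empty
    println!("{:?}", items_per_unit(&[40, 12, 7], 16));
}
```

With only a handful of modules, requesting more units than modules just produces empty object files, which matches the 54-of-1024 observation above.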

I modified the compiler so that instead of only calling .rotate() when translating a new Rust module, it rotates on each call to trans_item. This loosely means that each function goes into a new codegen unit (in a round-robin fashion). This should give us a nice distribution across codegen units and make sure we fill up all object files. Let's take a look at these compile numbers with the same hard limit of 8 codegen threads as above:

| codegen units |   1     |    2    |   4     |   8     |   16    |   32    |   64    |  128    |   256   |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build (all3)  |  74.580 |  43.077 |  33.611 |  28.048 |  33.350 | 41.665  | 57.750  | 104.850 | 237.160 |

Here it looks like the benefit isn't that great for Cargo. The upward trend is steeper, and it's not clearly beneficial in the 8/16 case. Note that this may benefit smaller crates more (such as regex below, which has only 8 Rust modules), and it may also help keep cores warm over time as codegen units are more evenly sized. I suspect some more investigation can happen here in terms of the best way to split up a crate for LLVM.

That's all just the Cargo crate, however, so let's also take a look at compiling the regex crate:

| codegen units |   1     |    2    |   4     |   8     |   16    |   32    |   64    |  128    |   256   |   512   |  1024   |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| build regex   |  6.437  |  4.786  |  4.129  |  4.126  |  5.186  |  6.449  |  8.800  |  14.258 |  24.061 |  45.858 |  98.680 |

This data basically confirms what we found in the Cargo crate, so nothing too too exciting here!

Runtime impact

Alright, so having talked about compile time, let's talk about runtime performance. The main drawback of codegen-units is that by splitting up the LLVM module you're in theory missing inlining opportunities, which can in turn decrease performance. As inlining is the “mother of all optimizations” (especially in Rust), let's take a look at the numbers, but first let's talk about what each row means:

  • cargo foo - using the Cargo library generated above, a cargo binary was built and used to perform “cargo foo” in the Servo source tree doing a noop build. Quite a lot has to happen to determine that nothing needs to be built, so this is a relatively good benchmark of Cargo's performance.
  • regex foo1K - the regex crate's benchmarks were compiled and the foo1K benchmarks (med1K and hard1K) were run, with the ns/iter output shown here.
  • *3 - these timings were measured with the “more round robin” approach where each item being translated caused the compiler to rotate codegen units.
| codegen units |   1     |    2    |   4     |   8     |   16    |   32    |   64    |  128    |   256   |   512   |  1024   |
|---------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| cargo fetch   |  0.311  |  0.328  |  0.318  |  0.330  |  0.325  |  0.318  |  0.327  |  0.320  | 0.319   |  0.312  |  0.317  |
| cargo build   |  0.544  |  0.538  |  0.548  |  0.533  |  0.547  |  0.549  |  0.559  |  0.559  | 0.563   |  0.543  |  0.560  |
| cargo build3  |  0.512  |  0.518  |  0.552  |  0.546  |  0.546  |  0.530  |  0.534  |  0.537  | 0.551   |    --   |    --   |
| regex med1K   |  2107   |  2108   |  2117   |  2140   |  2146   |  2116   |  2117   |  2139   |  2145   |  2119   |  2119   |
| regex hard1K  |  32388  |  39505  |  53087  |  53168  |  52140  |  53326  |  53139  |  53150  |  53391  |  52151  |  53160  |
| regex3 med1K  |  2105   |  2165   |  2164   |  2175   |  2164   |  2169   |  2117   |   --    |   --    |    --   |   --    |
| regex3 hard1K |  32426  |  38240  |  51240  |  52150  |  52355  |  52139  |  52051  |   --    |   --    |    --   |   --    |

Some interesting results from this!

  • The performance of Cargo was entirely unaffected by codegen units; in all cases it performed the same no matter what.
  • Regexes were also pretty flat, with the one exception that the “hard1K” benchmark took a dive in performance between 1 and 4 codegen units and then flattened off after that.

This sort of confirms our initial suspicion about runtime impact. Larger applications like Cargo probably don't rely nearly as much on inlining as microbenchmarks like the regex crate's. One reason there isn't a total cliff in performance, I believe, is that even with codegen units, #[inline] functions are still inlined into each unit (I may be wrong on this though).
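If that guess is right, the effect would look roughly like this (illustrative only, not code from either benchmark):

```rust
// Illustrative only, based on my possibly-wrong guess above: a #[inline]
// function has its IR emitted into every codegen unit that calls it, so it
// stays inlinable across unit boundaries, whereas a plain function living in
// another unit is just an opaque call as far as LLVM is concerned.
#[inline]
pub fn mix(x: u64) -> u64 {
    x ^ (x >> 33)
}

pub fn checksum(data: &[u64]) -> u64 {
    // even if `mix` lands in a different codegen unit, #[inline] keeps this
    // call site inlinable
    data.iter().fold(0, |acc, &x| acc.wrapping_add(mix(x)))
}

fn main() {
    println!("{}", checksum(&[1, 2, 3]));
}
```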

Changing the defaults

So given all this data, my conclusion would be that if we set the number of codegen units to the number of cores when optimizing, it would not regress performance in most cases and would drastically improve compile times (2x faster in most cases). You'd of course always be able to tell the compiler to go back to 1 codegen unit to get the current behavior.

Concretely, I would propose changing the default value of codegen-units to the number of CPUs on the machine (in all builds, debug and optimized). At this time I wouldn't propose any other changes (such as the experimental modifications I described above). Some future possibilities would be:

  • Adding a -C codegen-threads flag. Above it is clear that you don't want an absurd number of codegen threads (you benefit from capping at the number of cores), but it wasn't clear that having more codegen units than threads was a clear win.
  • Tweaking how the LLVM modules are sharded. Above I experimented with a sort of "maximal even sharding", but there weren't clearly many compile-time benefits to it. In theory inlining may be primarily important within a module (e.g. like a C++ file), so the current module-based sharding continues to seem like a reasonable decision.
  • Figuring out why 1K codegen units is so very costly. It should in theory be a pretty low-impact operation to have so many codegen units, but it seems disproportionately expensive above. There may be obvious inefficiencies in managing codegen units which would fix this.
  • Adding "job server"-like functionality to the compiler and Cargo (a toy sketch of the idea follows this list). Currently Cargo spawns many compiler processes in parallel, and if each of these compilers spawns N codegen threads (one per core) then that's a bit too much contention. GNU make has a jobserver to manage this kind of parallelism, and the compiler/Cargo could do something similar to limit it. For now, though, most builds are waiting on one crate to compile, so this probably won't be too beneficial to start out.
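To illustrate the jobserver idea from that last bullet, here's a toy single-process model (GNU make's real jobserver shares tokens over a pipe between processes, and none of these names are real APIs):

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Mutex};
use std::thread;

// Toy model of a jobserver: 4 tokens shared through a channel, 16 jobs that
// must each hold a token while they "codegen".
fn main() {
    let (token_tx, token_rx) = channel();
    for _ in 0..4 {
        token_tx.send(()).unwrap(); // 4 tokens = at most 4 jobs in flight
    }
    let token_rx = Arc::new(Mutex::new(token_rx));

    let handles: Vec<_> = (0..16)
        .map(|job| {
            let tx = token_tx.clone();
            let rx = token_rx.clone();
            thread::spawn(move || {
                rx.lock().unwrap().recv().unwrap(); // acquire a token
                println!("job {} doing codegen", job); // stand-in for real work
                tx.send(()).unwrap(); // put the token back
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```

With something like this shared between rustc and Cargo, the total number of in-flight codegen jobs could be capped machine-wide.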

Conclusions

Bumping up the number of codegen units by default seems like a win-win situation to me, but what does everyone else think? Perhaps some more data should be gathered? Do you think these measurements aren’t representative? Have you seen other problems with codegen-units? Curious to hear your thoughts!


I'm a bit nervous about tanking key benchmarks. "You forgot to pass --release" now becomes "you forgot to pass the 17 flags that make code really fast". Now, this isn't exactly a foreign concept to veteran C(++) programmers, with their -ffast-math and whatnot, but is this the path we want to take with Rust?

Ostensibly, actual release builds aren't too concerned with compile times. It seems more reasonable to have a flag akin to "I'm still testing, but I want to test optimized stuff". --profile? This could opt into more codegen units with little concern, and possibly also do some of the things we've punted on in the past, e.g. include debug info (maybe even debug asserts?).


Well, if we wanted to be really scientific about it, we could create a graph of modules, where edges are weighted by the number of cross-module references, and then apply some clever graph partitioning algorithm.
/me runs ducking


I think a good rule of thumb is that the default debug build should optimize every choice for compilation speed, and the default release build should optimize every choice for code performance. With that rule of thumb, I’d say this change should be applied to the default debug build, but not release.


I think the compilation-time advantage only occurs in release builds, as debug builds are not optimized. The only part that would be parallelized in debug builds is the actual translation from LLVM IR to machine code. I'm not sure if there's much to be gained, but it's worth testing.

Like @Gankra, I'm not a huge fan of doing this for release builds, mainly because the programmer doesn't have much control. E.g. depending on the algorithm used to divide things up, adding some function in some random module (or adding a call to an external generic that adds a new monomorphisation) may cause two functions elsewhere to go from being placed in the same unit to being in different ones (or vice versa), resulting in "heisenbenchmarks". And hints of lower performance are not good at all (we've had enough problems with people forgetting --release).

That said, this would totally make sense to me for an opt-debug build (i.e. the opt-level = 1 thing that has been tossed around a bit).

Due to this, I feel like having multiple codegen units (CGUs) by default would encourage (even more...) overuse of #[inline] as a "quick hack" to get better performance, even though adding the #[inline] might end up being a deoptimisation with a single CGU, i.e. in order of increasing perf: no inline + several CGUs, inline + several CGUs, inline + one CGU, no inline + one CGU.

#[inline] can also be particularly bad for compile-time performance (of downstream crates especially, something a developer may not see when creating the lib), so implicitly encouraging it with something that is designed to reduce compile times is somewhat self-defeating. :stuck_out_tongue:

Yes to both. I don't think measuring just two crates is enough, especially since one got 60% slower in some modes. Also cargo and regex are almost certainly only a tiny slice of the space of possible Rust applications, e.g. neither will do much intense numeric processing. More data should look at a much broader cross-section of the Rust/systems world, e.g. gamedev stuff, low-level OS stuff, async IO, scientific/numeric programming etc.


I'm not sure if this is a good idea. Last week I was experimenting with a not-even-that-smallish crate (it still compiled in 2 seconds, but that may very well have been a dream), and any setting for codegen-units other than the default increased compile time.

If there’s a reliable metric (in terms of module size, whole crate size, or whatever) that will turn on multiple codegen threads when it sees fit (but only in debug mode), I could possibly be persuaded.

I also think that there are other (less regression-prone) enhancements to compile time that could be made (and I'm not even talking about incremental compilation): for example, FastISel is still not really used, and landing pads aren't really necessary in executables but are still generated. These two are even directly related, since not using invoke instructions will allow FastISel to work better. Of course, landing pads can already be turned off today, but it doesn't seem to be officially supported.
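To illustrate the landing-pads point, here's a made-up example (not from my crate): any call that can panic while a value with a destructor is live gets lowered as an LLVM invoke with a landing pad, and those unwind edges are exactly what FastISel copes with poorly.

```rust
// If `risky` can panic while `_guard` is live, the call is lowered as an LLVM
// `invoke` with a landing pad so unwinding can run Guard's destructor.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        println!("cleanup ran");
    }
}

fn risky(x: u32) -> u32 {
    assert!(x != 0, "x must be nonzero"); // potential panic = unwind edge
    100 / x
}

fn main() {
    let _guard = Guard;
    println!("{}", risky(4));
}
```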

There are also long-standing bugs regarding typechecking performance: for example, of the 2 seconds my crate took to compile, 0.5 seconds were spent in typeck (I'm now on a different system where typeck alone takes 2 seconds :dizzy_face: ), presumably due to #25916 (the code is really simple - almost no generics and traits, just functions that compute stuff). Or #20304, which is apparently another big chunk of typeck time.

I also managed to reduce link time by about 150ms by using ld.gold. Could it be autodetected (I just symlinked it to ld)?

(okay, this turned into an off-topic rant, sorry :confused:)

I think this is why translation times increase, as well, but of course that's just a guess.

I have to admit, I was somewhat surprised by your results here, modest though they may have been. I suspect the important part is not how many threads there are, but rather that we don't have any sort of "jobserver" to try and limit the number of active threads. As you mention later, it'd probably be good to do this globally, not just within the context of a single rustc process.

I am definitely encouraged by this data, but it still seems to me that some kind of "opt-debug" compilation mode offers a nice compromise, as others suggested. This seems like a straightforward change, but are you dubious of its merit?


Regarding comments about type-checker performance etc, I'd say it's off-topic to some extent. Codegen-units is here today, and we know that in many cases LLVM optimization can be a serious bottleneck. That is to say, we should do both.

Note that the “hard” regex benchmarks test the throughput of the matching engine without any optimizations (because the regex and input are constructed specifically to defeat all of them). In the core engine, there are some cases where inlining really matters, I think. Here's one: https://github.com/rust-lang-nursery/regex/blob/master/src/input.rs#L86-L92 And another: https://github.com/rust-lang-nursery/regex/blob/master/src/program.rs#L94-L96
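The shape of the pattern is roughly this (a made-up example, not the actual regex code; the links above are the real thing):

```rust
// A tiny accessor invoked for every input position, where failing to inline
// it means paying a function call per byte of input.
#[inline]
fn at(haystack: &[u8], i: usize) -> Option<u8> {
    haystack.get(i).cloned()
}

fn count(haystack: &[u8], needle: u8) -> usize {
    let mut n = 0;
    let mut i = 0;
    while let Some(b) = at(haystack, i) {
        if b == needle {
            n += 1;
        }
        i += 1;
    }
    n
}

fn main() {
    println!("{}", count(b"abracadabra", b'a'));
}
```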

Optimize for where time is spent. When debugging/developing, time is spent compiling over and over, and executing the resulting binary once or twice. When releasing, you compile once, and then execute the binary an unlimited number of times.

Thanks for the responses everyone! Certainly lots to cover here, so I'll try to break this all down.

I don't think it's black and white here; there's a whole spectrum to deal with. On one end we've got the debug "I want to build this super quickly" case, but that's not really being covered by this thread at all (codegen-units has very little effect in debug builds). What @jschievink mentioned would help quite a bit here, but I'd prefer to hold off discussing that until another time.

On the other end of the spectrum we've got "I want to run this as fast as possible and I am willing to wait forever". These are not the only two modes to work with in Rust, though. Servo, for example, often compiles release builds in development because the debug build is otherwise too slow, and Dropbox does the same. I have also heard that for games, debug builds are too slow in many circumstances, so the code needs to be optimized better.

What I'm getting at is that a binary "--release or not" choice is probably not going to cut it. It may be that for each project there is "some debug mode" and "some release mode" that's all anyone ever uses, but the definition of "debug mode" is likely to change between projects. This means that we really do need to cater to the use case of faster optimized builds.


With that, let's talk about runtime performance!

Don't worry, we both share the same desire to not pass many flags to the compiler to get "fast Rust". I would like cargo build --release to continue to be the go-to "let's get the fastest build possible". That being said, I want to be super concrete here when talking about performance. You mention "tanking key benchmarks", and every piece of data I showed above indicates that we did this in precisely one situation. I have not done any analysis of that case yet, and it's highly likely that one #[inline] tag will fix it. Along those lines I think it's unfair to say, as a blanket statement, that more codegen units will "tank key benchmarks".

We can also contrast the way we build code with C++. Normal C/C++ projects ostensibly already have many "codegen units", as there's basically one per C/C++ file, with the functions defined in headers playing the role of #[inline] functions. I don't think anyone's complaining about the performance of C++ any time soon, so I am very much against a blanket declaration that "Rust must always have one codegen unit to be the fastest ever".


I don't 100% agree with this, but I think the sentiment is in the right place. I think we may want some more knobs for projects to play around with, but my claim is that very few projects will see any noticeable difference by lowering codegen units to 1. Another example of this is that we do not enable LTO by default, which would make the world of Rust truly one codegen unit, and this is because it's incredibly slow and would be a pretty awful experience.

Along those lines I don't think there's a blanket "this is the fastest and best for everyone" setting that exists, but rather "this is the fastest configuration for this project" and that will probably vary among projects (with Cargo allowing various knobs here and there).


@huon

I totally agree that there are probably various heuristics that can be done to improve the way we split up codegen units, but I don't think this should hold us back from turning it on by default. My response to @Gankra above also touches on a number of the topics you talked about, but overall all the data I've seen (plus the concept behind codegen units) makes me think that the pitfalls of "heisenbenchmarks" (which is a great name btw) and such won't actually come up all that often. You'll of course always have the ability to reduce codegen units to 1 via configuration in any situation.

I also don't necessarily see #[inline] as a quick hack per se, but rather something akin to "I'd put this in a C++ header file". It feels to me like a pretty good heuristic for what should be inlined everywhere vs. not. Like I mentioned above, I don't think anyone's complaining about C++'s runtime performance, so if we stick to that kind of standard it seems hard to go that wrong.

All excellent points! Do you have some example projects in mind you'd like me to take a look at? I'm not sure of many easily available benchmarks in this area (in terms of runtime).


Very interesting! Could I take a look at this crate? I'd be curious to see what the slowdown was.

Otherwise I agree with @nikomatsakis that the other aspects of optimization you mentioned, while they should certainly all be investigated, are somewhat orthogonal to codegen-units. I'm currently just trying to push hard on this as it's been sitting relatively dormant in the compiler for quite some time.


I think I'd first like to try really hard to stick to "you always run cargo build, with --release or not" as much as possible. That message is very simple for everyone to understand: if you're developing, you always run cargo build in a project, and if you're releasing code, you always run cargo build --release. I think it'd be a failure mode if you had to remember different flags and configurations depending on which project you're working on.

Now Cargo solves this a good deal with profiles, so it may not be too too hard. For example, specific projects could set codegen-units back to 1 for release builds. While I do think that some sort of "opt-debug" compilation mode seems nice, from what I've seen so far it basically means that "debug" is next-to-useless (e.g. far too slow), so it'd probably just become the default debug build.

Regardless, though, I'm open to the idea but would prefer not to go that route!


Sorry for the long-winded response, but thanks again for all the thoughts!


Yeah that's basically the problem. I think far too many of us have identified the default debug build as useless for our work, but suffer from the burden of --release (which I agree should be balls-to-the-wall optimized). A third mode for "make some dang tradeoffs" without having to understand Cargo's profile system would be a big deal. Now that you mention it, I think it'd be great if it was the default, and today's default was relegated to some sort of --debug or --no-opt flag or something.

I'm literally about to start making some code multi-threaded just so the debug perf isn't trash (release perf is perfect without it).

That's understandable. There are definitely times you'd want faster compiles for optimized builds, but those feel more exceptional than "I want this to build the fastest possible code it can". So I would be bummed and confused if the default for --release did not mean the best optimized code.


I feel being able to define custom profiles would make the most sense, so that we don't have to guess beforehand what someone might need. Let someone add to their Cargo.toml:

[profile.devel]
opt-level = 1
codegen-units = 4
# etc

And then cargo build --devel would look for that profile defined in the Cargo.toml.

I hear you, but I also think that asking people to tweak profiles is a tall ask. That's a BIG drop in simplicity. I certainly find them intimidating. I guess what I mean is that I'd rather remember "debug", "opt-debug", and "release", or -O0, -O1, and -O2, or whatever we call it, than have to remember how to create profiles and what should go in them. Having common modes that are universal across projects with well-understood meaning feels easier to me.

Still, clearly two modes are better than three, so I guess it really depends, ultimately, on whether you think that codegen-units is costing us nothing. (*) I guess the jury is still out on this, but you make a persuasive case. It's certainly true that even with today's release, a well-chosen #[inline] (or removing an ill-chosen one!) can make a big difference to performance, along with other small changes (and even more so across crates). So perhaps codegen-units just mildly enhances this effect, as you suggest. Have to think on it.

(*) Really this question is broader than codegen-units, of course. It seems pretty likely that we will continue to encounter optimizations we could do which have a disproportionate effect on compilation time. That is, where the extra time doesn't seem to pay off in results for most users. I guess that these will also be things one can tweak in profiles.

You suggest that people will only ever use two of the three modes for any given project, but I'm not sure this is the case. It certainly hasn't been my experience on C++ projects. The ones I've worked on, at least, have always had debug, opt-debug, and opt as the three modes, and I've always found uses for all of them. Typically, I mostly used debug, but sometimes there would be bugs where debug was just too slow to track them down, and then I would use opt-debug, and only fall back to opt for benchmarking or particularly nasty bugs. In some projects, it might go the other way -- typically use opt-debug, but fall back to debug if I need the extra debuginfo. It's not clear to me why this should be so different for Rust.

Perhaps part of the difference between C++ and Rust is that I used debuggers a lot more in C++, and opt-debug compilations definitely are lacking in the debuginfo department (typically). In Rust, I almost never use gdb, partially because I'm just so accustomed to it not working at all. (And while it's made great strides, it's still a far cry from what you get with C++ anyhow).

This does seem like it would be useful, so that people can decide if two modes are enough for them, or they want a wider menu of choices.

Man. So, I'm thinking more and more on what you said and coming to agree that the default cargo build --release ought to contain the optimizations that we have selected as being the best "bang for your buck" in terms of compile time vs. runtime. And while we tend to view codegen-units as weakening optimization, you can look at it the other way: not using codegen-units is a kind of extra optimization, one that you pay for substantially. When you look at it that way, it's not clear that it's worth the extra effort, at least not for most projects and not by default. (But getting more data on this would definitely be good.)

That said, there is still something that bothers me about the idea that we should only have two modes. Here are some different quotes from your message:

These don't seem to all fit together for me? On the one hand, you make the case quite well that two modes don't fit, but on the other, you don't want to have more modes. I guess perhaps the point is that while there will be more modes, they are going to be unique to each project somehow? And, if each project tweaked its profiles for "debug" builds suitably, then I guess that this would make your final quote - "just run cargo build if developing" - accurate (unlike today, where the right build for debugging will depend on the project). OK, I'm coming around to this point of view. (If I've accurately summarized it.)

Sorry for the stream-of-consciousness message here. :smile: Time for me to get some sleep, I think.

The problem is that just compiling an entire project like Servo in one LLVM unit (a true and proper total optimization mode) would melt the developer’s computer. Otherwise, I would recommend that for --release mode.

It seems like there are three different modes that people want to compile in, most of the time:

  • I just want to test/debug and I don’t care if the code is slow, please compile as quickly as possible.
  • I want to run test/debug, but the un-optimized build is not fast enough for me, please compile as quickly as possible while still doing LLVM optimizations.
  • I want to use this binary in production, or release it, or whatever. If compilation takes 3 hours, it doesn’t really matter because my workflow is done.

Currently the last two cohabit in cargo build --release, but are in tension. I get the concern over flag proliferation, but I don’t think three flags is a great many more than two. If these two purposes are split into their own flags, each can be designed so that it is suited to its clear purpose. Maybe LTO could even be turned on for the third flag.

This is pretty minor, but in the past I've hit the memory limit of Travis (2GB, if I remember correctly) when using cargo build -j 2.

Chances are that raising codegen-units could hit that limit as well.

If it’s actually true that multiple codegen units rarely if ever have measurable performance impact, I could be persuaded to change --release and leave -C codegen-units=1 as an optional knob to tune if necessary. But, although the benchmarks in the OP are encouraging, they are a far cry from convincing. Coming up with a good set of benchmarks is very very hard, of course, but a good start would be a “call for benchmarks” on users.rust-lang.org and other venues: Performance-minded community members all have their own benchmarks. Build an installer with the change applied and let people run their benchmarks on it and report back. In the long run, we want performance tracking infrastructure anyway, but since it’s not ready yet…

Regarding @alexcrichton’s point that multiple codegen units is the default in C++ land: True, but my totally unscientific opinion is that idiomatic Rust relies on inlining to a greater degree than typical (not necessarily modern) C++ code. Additionally, while #[inline] is analogous to putting something in a header in C++, at least in C++ you can reliably predict what code lands in the same codegen unit (and thus, whether it has to be put in a header). So I see a greater likelihood for spamming #[inline] everywhere, sometimes justified and sometimes out of paranoia.

One alternative which has a smaller burden of proof would be a third mode between --debug and --release. This takes a bit of re-training people to not use --release while developing, but that’s minor compared to fiddling with profiles. I currently strongly prefer this, simply because it is less controversial and gives virtually all the benefits.

The crate (and commit) in question was this. In debug mode I consistently get worse compile times when I use more than 1 codegen unit:

jonas@archbox ~/d/sneeze> time cargo rustc
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
1.86user 0.16system 0:02.04elapsed 99%CPU (0avgtext+0avgdata 181280maxresident)k
0inputs+9784outputs (2major+69625minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc -Ccodegen-units=2
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
1.98user 0.53system 0:02.37elapsed 106%CPU (0avgtext+0avgdata 178160maxresident)k
0inputs+9800outputs (2major+73066minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc -Ccodegen-units=3
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
2.01user 0.72system 0:02.53elapsed 108%CPU (0avgtext+0avgdata 178744maxresident)k
0inputs+9992outputs (2major+74861minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc -Ccodegen-units=4
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
2.00user 0.70system 0:02.47elapsed 109%CPU (0avgtext+0avgdata 179368maxresident)k
0inputs+10144outputs (2major+75488minor)pagefaults 0swaps

In release mode, it does work much better:

jonas@archbox ~/d/sneeze> time cargo rustc --release
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
3.75user 0.40system 0:04.17elapsed 99%CPU (0avgtext+0avgdata 177032maxresident)k
0inputs+3272outputs (2major+69760minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc --release -Ccodegen-units=2
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
3.46user 0.35system 0:02.85elapsed 133%CPU (0avgtext+0avgdata 177200maxresident)k
0inputs+3296outputs (2major+76220minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc --release -Ccodegen-units=3
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
3.46user 0.39system 0:02.97elapsed 129%CPU (0avgtext+0avgdata 177664maxresident)k
0inputs+3320outputs (2major+76809minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc --release -Ccodegen-units=4
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
3.34user 0.38system 0:02.89elapsed 128%CPU (0avgtext+0avgdata 177088maxresident)k
0inputs+3344outputs (2major+78151minor)pagefaults 0swaps
jonas@archbox ~/d/sneeze> time cargo rustc --release -Ccodegen-units=5
   Compiling sneeze v0.1.0 (file:///home/jonas/dev/sneeze)
3.54user 0.38system 0:02.92elapsed 134%CPU (0avgtext+0avgdata 177324maxresident)k
0inputs+3568outputs (2major+78669minor)pagefaults 0swaps

The slowest passes in debug mode seem to be - in order - type checking (741ms), LLVM passes (376ms) and translation (267ms), so it’s not very surprising that codegen-units doesn’t make it faster. And this explains my frustration with typechecking performance :wink: