One thought I had last night is that I would feel better about codegen-units if they were more deterministic. On the other hand, inlining is always a matter of heuristics, so perhaps this is silly.
In general, I do think it's time to start building a preliminary Rust benchmarking suite. I would like something we can use to evaluate both compilation time and the performance of generated code. (These may not be the same test suites.)
Let's say I have two machines, one with two cores and one with eight cores. How could I make sure that cargo build --release creates the same binary on both machines? Would I need to set the same codegen-units for every dependency? (Sorry if I missed something, this is a bit of a drive-by comment.)
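For what it's worth, one way to get the same partitioning regardless of core count is to pin the value in the profile rather than letting it default to the machine's parallelism — a sketch, assuming a Cargo version that accepts `codegen-units` as a profile key:

```toml
# Cargo.toml — pin the unit count so both machines split crates identically
# (assumes Cargo supports `codegen-units` in profiles; whether it applies to
# dependencies as well may depend on the Cargo version)
[profile.release]
codegen-units = 1   # fully deterministic partitioning, slowest to compile
```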
Would https://github.com/cybergeek94/img_hash be a good benchmarking test case? It's very compute-heavy and relies on inlining and vectorization for good performance.
As a side note: @nikomatsakis and I talked a bit about incremental compilation last week, and one of the things needed there is some slightly more sophisticated support for splitting things into code-gen units. While this is still a ways off, it's likely that some general improvements in this area will fall out of this.
Your measurements regarding runtime performance, @alexcrichton, are very encouraging, by the way. In an incremental compilation environment we will likely want much smaller code-gen units than when recompiling everything. It's great that you took the time to analyze this!
Regarding the actual topic of this thread, I'd be in favor of starting to create a Rust benchmark suite so it's easier to gather data for questions like this in the future.
My gut feeling though is that it would make the most sense to have a third "debug-opt" build profile that uses as many code-gen units as there are processors and maybe disables some optimizations to keep debuginfo as usable as possible. That would also open up the possibility of enabling LTO for "release" builds.
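A rough sketch of what such a profile could look like in Cargo.toml — custom named profiles and the `inherits` key are assumptions about Cargo support, and the exact knob values would need tuning:

```toml
# Hypothetical "debug-opt" profile: parallel codegen plus mild optimization,
# keeping debuginfo usable. Assumes a Cargo that supports custom profiles.
[profile.debug-opt]
inherits = "release"
opt-level = 1        # back off full optimization for a better debug experience
debug = true         # keep full debuginfo
codegen-units = 16   # roughly one unit per logical processor
lto = false
```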
In C++, however, a codegen-unit is always (and deterministically) a single file, often corresponding to the implementation of a single class, or a bunch of related classes. So it's very likely that there will be some helper methods there that will be called mostly or only from other methods of the same class. A similar effect is achieved by using Rust modules for compilation units.
But with the new way of distributing the functions onto compilation units, this locality is lost. I would expect that to have a significant impact on how much inlining goes on. Certainly, you cannot compare the "one function per compilation unit" style with C++.
I think @nikomatsakis put it quite well in saying that one codegen unit should in some sense be considered an "extra optimization" rather than a standard thing we do, which allows us a lot more freedom in how we deal with codegen units, both in terms of parallelization and perhaps incremental compilation in the future. Taken to the extreme, we would enable LTO on every build by default (as that's truly one codegen unit), but there are good reasons not to do this!
I do, however, think that it is an excellent point that a strong point of C++ is that you have predictable codegen units if you need them, and I would see it as a shame to lose that. Along those lines I don't think my "round robin" approach is a great one, and if we're not scaling to hundreds of codegen units I would expect somewhere between N and 2N codegen units (where N is the number of cores), with each module guaranteed to be in the same codegen unit, to be sufficient. That should give us a level of predictability while still allowing us to break up crates nicely.
Hm, an interesting observation! Do you perhaps only have one core on that machine? I saw only a tiny speedup from multiple codegen units in debug mode but a significant (~40%) improvement in release mode on my local machine (8 cores).
I totally agree that it'd be nice to have a comprehensive benchmark suite, but as you mention it is indeed quite hard, so I wouldn't necessarily want that to prevent us from making improvements in areas like this.
Yeah, it just needs to be enabled with a feature flag so it can build and test on stable: cargo bench --features bench. Is there a #![cfg()] flag for benchmarks only that I could use instead?
Edit: there's some errant use of unsafe that I've been meaning to clean up. It's mostly premature optimization from before I learned that LLVM can elide the bounds checks in loops.
No, my machine has 4 cores and 8 threads, and codegen units do work fine in release mode. But I'm not too surprised that enabling codegen units can increase compile time in debug mode, given that LLVM just doesn't do as much as in a release build.
I agree with others that this doesn't seem to be a good idea. The regex hard1k benchmarks take a 60% dive in perf. IMO that's just not acceptable. As a C++ dev, I'd rather have my compile times take twice as long than take a 60% perf hit. I can't imagine not caring about that much of a perf difference and still choosing C++ or Rust over something like Java.
I can't see how the TL;DR of this post can be "2-3x faster builds with virtually no runtime impact" when the benchmarks don't show that. The benchmarks include only two codebases (cargo and regex), and one of them jumps off a perf cliff with parallel codegen.
@Valloric, I feel that those aren't necessarily the same conclusions that I would draw from this. Yes, one benchmark takes a dive, but this is one microbenchmark in one repository, and these should always be taken with a grain of salt.
There currently is, and always will be, a way to turn off codegen units (and perhaps this could even be the default for cargo bench), but your experience doesn't match what many others are seeing (they don't welcome longer compile times in exchange for what amounts to almost no perf difference in larger code bases).
I agree that more benchmarks would be nice to have, but I don't think the conclusion here should be that the entire idea should be scrapped because one microbenchmark regressed. We're already doing a pretty surprising thing by putting the entire crate in one codegen unit by default (e.g. C/C++ don't do this at all) for all builds we ever do. I think there's a lot to gain here from simply considering one codegen unit an extra optimization. If you really need it you can buy into it, but like LTO it's unlikely to make a difference in real-world situations. One can always construct a microbenchmark to show otherwise, but 2-3x faster builds is no small feat!
My biggest concern with this proposal is that it seems to promote spooky action at a distance. If someone makes a change to a program and performance is significantly affected, is it because of the change they made, or is it because by adding a function in the middle, they've changed the function/compilation-unit relationship and thus the inlining opportunities available to the rest of the program?
Is this a legitimate concern? Would it improve matters to generate many more units, e.g. one per module? That would probably be easier for new users to reason about at least.
@sorear yeah, I agree that level of non-determinism would probably not be so great, but the current codegen-unit boundary today is a module boundary, which corresponds pretty closely to what one would probably expect (e.g. a codegen unit per file like in C++), so as long as we maintain that level of determinism I don't think it'll be too much of a problem.
Yeah. I think my concrete proposal was for a -C codegen-units=inf (which would generate the same code regardless of hardware concurrency), but close enough.
Are you proposing that we will make precisely one codegen unit per module? I don't think that's what we do today, right? That would be deterministic, though. (And it'd be an interesting experiment to run, as well, I don't think you measured that scenario.)
This whole conversation is about defaults, but it seems that turning codegen-units up is a decision that would benefit the developers of large, complex projects (servo, Dropbox) at the cost of devs who work on smaller crates. This seems precisely the wrong way round to me: servo devs are exactly the ones who would find it easy to tweak their profiles to hit the perfect balance for an opt-debug build.
In the other corner, the smaller devs are precisely the ones who would be impacted by the increased complexity of multiple flags and by the rare but distinct possibility of perf hits. It is these smaller devs who will post a microbenchmark on reddit, with results like the one in the regex crate. And we'll have to tell them: "Aye, you passed in the first release flag, but what about the second release flag (-C codegen-units=1)?"
The niche use case is "my debug build is too slow at runtime and my release build is too slow at compile time", not "my release build is too slow at runtime".
Yeah, I think that's basically what I'm proposing, and I also think it's what happens today as well. Monomorphizations are a bit of a wrench, as I assume they just get thrown into whatever the current codegen unit is (as opposed to being placed in a more principled manner). The only point at which we rotate codegen units, however, is right here when we walk into a new module.
I feel that a module-level codegen unit is a nice balance between predictability and size. Most modules correspond to one file, and most files are of a reasonable size. It's at least fair, I think, to say "if you want your compiles to be faster, write smaller modules". All that, plus most projects benefiting the most from better compile times probably have > 8 modules, enough to see benefits on everyone's computers.
I do think it's a good point that smaller projects tend to configure less than larger ones, but I don't think that "big projects" are the only ones to benefit here. I suspect there are quite a few projects in the category of "debug runtime is too slow for iteration", like games, which will want faster optimized compiles, and this is a great means to get there.
As a data point, I wouldn't consider Cargo/rustfmt to be large projects, but they see significant speedups in compile time with multiple codegen units. Cargo drops from 3 minutes to 2, and rustfmt drops from 3:15 to 1:33.
I agree that we don't want to have multiple flags, but I have also proposed 0 new flags here. There have been some thoughts about a "debug opt" mode, but my claim is that we can avoid that by just saying that you get N codegen units by default. Having only one codegen unit should be considered just another extra optimization for those who want to try it, but the evidence suggests to me that the compile-time wins outweigh the minor loss in perf here and there. (e.g. this is the same reason we disable LTO by default; it's just far too slow to get any real benefit)
There are a lot of comments here and I am not sure which direction things are taking, but the things I have read that I agree with most are:
let's keep --release the target that generates the fastest programs no matter the compilation cost (release is by definition not what you want to use during development, and the people who do use it that way do so for lack of something else)
let's keep the number of flags small, but 3 isn't a whole lot more than 2
I don't like the addition of --optimized because it means the same thing as --release to me. We are talking about faster builds, so how about a --fast-build? (People frustrated with build times are going to look for something like that in the man page rather than for the number of codegen units.) It would try to pick the most sensible trade-offs for development: optimizations that make sense for build time, retaining what debug info we can without impacting build times too much (debug info is not an all-or-nothing story, but I don't know the costs associated with its different parts).
Or have the default target do that and add --debug for full debugging information, since the trade-offs for a fast-ish build and fast-ish runtime are a bit hard to describe in a word, and flags state a clear intention. But then expect waves of questions on reddit about why Rust doesn't work well with gdb anymore.
I really like gcc -g -Og, "offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience." For example, it doesn't mix/reorder statements as much, and is better about keeping variables around for inspection. I would definitely use an option like this for Rust.