Let's talk about parallel codegen

One thought I had last night is that I would feel better about codegen-units if they were more deterministic. On the other hand, inlining is always a matter of heuristics, so perhaps this is silly.

In general, I do think it's time to start building a preliminary Rust benchmarking suite. I would like something we can use to evaluate both compilation time and the performance of generated code. (These may not be the same test suites.)

Let's say I have two machines, one with two cores and one with eight cores. How could I make sure that cargo build --release creates the same binary on both machines? Would I need to set the same codegen-units for every dependency? (Sorry if I missed something, this is a bit of a drive-by comment.)
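For reference, one way to pin this down today is to fix the unit count in Cargo.toml rather than relying on a core-count default. This is a sketch, assuming the profile key is honored for your crate (dependencies may need the same treatment):

```toml
# Pin codegen-units so the build no longer depends on how many
# cores the building machine happens to have.
[profile.release]
codegen-units = 1
```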


Would https://github.com/cybergeek94/img_hash be a good benchmarking test case? It’s very compute-heavy and relies on inlining and vectorization for good performance.

As a side note: @nikomatsakis and I talked a bit about incremental compilation last week, and one of the things needed there is slightly more sophisticated support for splitting things into code-gen units. While this is still a ways off, it’s likely that some general improvements in this area will fall out of it.

Your measurements regarding runtime performance, @alexcrichton, are very encouraging by the way. In an incremental compilation environment we will likely want much smaller code-gen units than when recompiling everything. It’s great that you took the time to analyze this! :+1: :+1:

Regarding the actual topic of this thread, I’d be in favor of starting to create a Rust benchmark suite so it’s easier to gather data for questions like this in the future.

My gut feeling though is that it would make the most sense to have a third “--debug-opt” build profile that uses as many code-gen units as there are processors and maybe disables some optimizations to keep debuginfo as usable as possible. That would also open up the possibility of enabling LTO for “--release” builds.
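As a rough illustration, such a “--debug-opt” profile might pick trade-offs like the following, expressed with today’s profile knobs (the specific values here are hypothetical, not an implemented flag):

```toml
# Hypothetical "--debug-opt" trade-offs as dev-profile overrides.
[profile.dev]
opt-level = 1       # some optimization, but keep debuginfo usable
debug = true        # full debug info
codegen-units = 8   # roughly one unit per core for parallel codegen
```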


In C++, however, a codegen-unit is always (and deterministically) a single file, often corresponding to the implementation of a single class, or a bunch of related classes. So it's very likely that there will be some helper methods there that will be called mostly or only from other methods of the same class. A similar effect is achieved by using Rust modules for compilation units. But with the new way of distributing the functions onto compilation units, this locality is lost. I would expect that to have a significant impact on how much inlining goes on. Certainly, you cannot compare the "one function per compilation unit" style with C++.

I think @nikomatsakis put it quite well in saying that one codegen unit should in some sense be considered an “extra optimization” rather than a standard thing we do, that allows us a lot more freedom with how we deal with codegen units both in terms of parallelization as well as perhaps incremental compilation in the future. Taken to the extreme we would enable LTO on every build by default (as that’s truly one codegen unit), but there’s good reasons to not do this!

I do, however, think it’s an excellent point that a strength of C++ is that you have predictable codegen units if you need them, and I would see it as a shame to lose that. Along those lines I don’t think my “round robin” approach is a great one, and if we’re not scaling to hundreds of codegen units I would expect somewhere between N and 2N codegen units (where N is the number of cores), with each module guaranteed to be in the same codegen unit, to be sufficient. That should give us a level of predictability while still allowing us to break up crates nicely.


@jschievink

Hm, an interesting observation! Do you perhaps only have one core on that machine? I saw only a tiny speedup for multiple codegen units in debug mode but a significant (~40%) improvement in release mode on my local machine (8 cores).


@killercup

Yes presumably if you wanted the same binary you’d have to compile with the same number of codegen units.


@abonander

Do you have a benchmark that I’d be able to run?


@michaelwoerister, @hanna-kruppe

I totally agree that it’d be nice to have a comprehensive benchmark suite, but as you mention it is indeed quite hard, so I wouldn’t necessarily want that to prevent us from making improvements in areas like this.

Yeah, it just needs to be enabled with a feature flag so it can build and test on stable: cargo bench --features bench. Is there a #![cfg()] flag for benchmarks only that I could use instead?

Edit: there’s some errant use of unsafe that I’ve been meaning to clean up. It’s mostly premature optimization from before I learned that LLVM can elide the bounds checks in loops.

No, my machine has 4 cores and 8 threads, and codegen units do work fine in release mode. But I’m not too surprised that enabling codegen units can increase compile time in debug mode, given that LLVM just doesn’t do as much work as in a release build.

@abonander ok thanks! After running your benchmarks on 1-8 codegen units, I saw no appreciable difference in performance among any of them.

@jschievink huh, weird!

I agree with others that this doesn’t seem to be a good idea. The regex hard1k benchmarks take a 60% dive in perf. IMO that’s just not acceptable. As a C++ dev, I’d rather have my compile times take twice as long than take a 60% perf hit. I can’t imagine not caring about that much of a perf difference and still choosing C++ or Rust over something like Java.

I can’t see how the TL;DR of this post can be “2-3x faster builds with virtually no runtime impact” when the benchmarks don’t show that. The benchmarks include only two codebases (cargo and regex), and one of them jumps off a perf cliff with parallel codegen.

@Valloric, I feel that those aren’t necessarily the same conclusions that I would draw from this. Yes, one benchmark takes a dive, but this is one microbenchmark in one repository, and these should always be taken with a grain of salt.

There currently is, and always will be, a way to turn off codegen units (perhaps this could even be the default for cargo bench), but your experience doesn’t match that of many others, who aren’t willing to trade longer compile times for what amounts to almost no perf gain in larger code bases.

I agree that more benchmarks would be nice to have, but I don’t think the conclusion here should be that because one microbenchmark regressed that the entire idea should be scrapped. We’re already doing a pretty surprising thing by putting the entire crate in one codegen unit by default (e.g. C/C++ don’t do this at all) for all builds we ever do. I think there’s a lot to gain here from simply considering one codegen unit an extra optimization. If you really need it you can buy into it, but like LTO it’s unlikely to make a difference in real-world situations. One can always construct a microbenchmark to show differently, but 2-3x faster builds is no small feat!


My biggest concern with this proposal is that it seems to promote spooky action at a distance. If someone makes a change to a program and performance is significantly affected, is it because of the change they made or is it because by adding a function in the middle, they’ve changed the function / compilation unit relationship and thus the inlining opportunities available to the rest of the program?

Is this a legitimate concern? Would it improve matters to generate many more units, e.g. one per module? That would probably be easier for new users to reason about at least.

@sorear yeah I agree that level of non-determinism would probably not be so great, but the current codegen-unit boundary today is a module boundary, which corresponds pretty closely to what one would probably expect (e.g. a codegen unit per file like in C++), so as long as we maintain that level of determinism I don’t think it’ll be too much of a problem.

Piggybacking on the good point that LTO is also not enabled by default: should there maybe be 3 levels of optimization instead of 2? E.g.

--debug
--release 
--optimized

Yeah. I think my concrete proposal was for a -C codegen-units=inf (which would generate the same code regardless of hardware concurrency), but close enough.

Are you proposing that we will make precisely one codegen unit per module? I don't think that's what we do today, right? That would be deterministic, though. (And it'd be an interesting experiment to run, as well, I don't think you measured that scenario.)

This whole conversation is about defaults, but it seems that turning codegen-units up is a decision which would benefit the developers of large, complex projects (Servo, Dropbox) at the cost of devs who work on smaller crates. This seems precisely the wrong way round to me: Servo devs are exactly the ones who would find it easy to tweak their profiles to hit the perfect balance for an opt-debug build.

In the other corner, the smaller devs are precisely the ones who would be impacted by the increased complexity of multiple flags and by the rare but distinct possibility of perf hits. It is these smaller devs who will post a microbenchmark on reddit, with results like the one in the regex crate. And we’ll have to tell them: “Aye, you passed in the first release flag, but what about the second release flag (-C codegen-units=1)”.

The niche use case is the “my debug build is too slow at runtime and my release build is too slow at compile-time”, not the “my release build is too slow at runtime”.


@nikomatsakis

Yeah I think that’s basically what I’m proposing, and I also think it’s what happens today as well. Monomorphizations are a bit of a wrench as I assume they just get thrown in whatever the current codegen unit is (as opposed to in a more principled manner). The only point at which we rotate codegen units, however, is right here when we walk into a new module.

I feel that a module-level codegen unit is a nice balance between predictability and size. Most modules correspond to one file, and most files are of a reasonable size. It’s at least fair, I think, to say “if you want your compiles to be faster, write smaller modules”. On top of that, most projects that would benefit the most from better compile times probably have more than 8 modules, so the speedup applies on everyone’s machines.


@cristicbz

I do think it’s a good point that smaller projects tend to configure less than larger ones, but I don’t think that “big projects” are the only ones to benefit here. I suspect there are quite a few projects in the category of “debug runtime is too slow for iteration”, like games, which will want faster optimized compiles, and this is a great means to get there.

As a data point, I wouldn’t consider Cargo/rustfmt to be large projects, but they see significant speedups in compile time with multiple codegen units. Cargo drops from 3 minutes to 2, and rustfmt drops from 3:15 to 1:33.

I agree that we don’t want multiple flags, but I also have proposed 0 new flags here. There have been some thoughts about a “debug opt” mode, but my claim is that we can avoid that by just saying that you have N codegen units by default. Having only one codegen unit should be considered just another extra optimization for those who want to try it, but the evidence shows me that the compile time wins outweigh the minor loss in perf here and there. (e.g. this is the same reason we disable LTO by default, it’s just far too slow to get any real benefit)


There are a lot of comments here and I am not sure which direction things are taking, but the things I have read that I agree with most are:

  • let’s keep --release the target that generates the fastest programs no matter the compilation cost (release by definition is not what you want to use during development; people who do use it that way do so for lack of something better)
  • let’s keep the number of flags small, but 3 isn’t a whole lot more than 2

I don’t like the addition of --optimized because it means the same thing as --release to me. We are talking about faster builds, so how about a --fast-build flag? (People frustrated with build times are going to look for something like that in the man page rather than for the number of codegen units.) It would try to pick the most sensible trade-offs for development: optimizations that make sense for build time, retaining as much debug info as we can without impacting build times too much (debug info is not an all-or-nothing story, but I don’t know the costs associated with its different parts).

Or have the default target do that and add --debug for full debugging information, since the trade-offs of a fast-ish build with fast-ish runtime are a bit hard to describe in a word, and flags state a clear intention. But then expect waves of questions on reddit about why Rust doesn’t work well with gdb anymore :smile:

I really like gcc -g -Og, “offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.” For example, it doesn’t mix/reorder statements as much, and is better about keeping variables around for inspection. I would definitely use an option like this for Rust.