C/C++ header files are awful, but the one big advantage they have over Rust is that they allow code to be compiled before or in parallel with its dependencies. In practice, we’ve noticed that its easy to get long chains of dependent crates and “bottleneck” crates (i.e. crates which take a while to compile which lots of other crates depend on). The net effect is the overall build becomes significantly serialized.
I’ve been wondering how practical it might be to introduce a new crate type that has enough of the public interface of the crate to allow dependent crates to compile against it. The assumption here is that generating such an interface crate is much faster than full compilation (akin to a check build).
Ideally the work done by the interface build could be reused to build the real crate, so that the overall serial compilation time doesn’t increase much. But for me that’s a secondary concern - I’m more interested in unlocking more parallelism.
Is this worth pursuing? Is there any work underway which addresses this problem, perhaps in a different way? Incremental builds and rustc’s internal parallelism help a bit, but they’re still serial in Amdahl’s law terms - I’d like to move more onto the parallel side.
The ideal answer would be something like “oh yes, .rmeta files already have everything that’s needed”, but I suspect that’s not true because inlining and macros make internal implementation details blur across crate boundaries.
Other properties I’d like:
Produces identical executable/rlib output to when not using this mechanism (i.e., like incremental builds it’s a build-time optimization which doesn’t affect the generated artifacts).
Works in all compilation modes (ie, not just restricted to dev builds). Release build times matter too. (Though a dev-only intermediate state would be fine given the previous constraint.)
Ideally (definitely on the “nice to have” end of the scale") the interface crates would be invariant if implementation details change which don’t affect the public interface (including inlined things), so that there’s less rebuilding. Though really this is just a special case of incremental builds, and its hard to see how this would work without revamping the SVH mechanism. I guess we could have a notion of “SVH of the public interface” and use that to express crate dependencies.
A problem is what should happen when a generic function defined in the interface crate gets used in a consumer crate. There will be no code available to monomorphize when compiling the consumer crate.
For dependencies there were suggestions to have a global Cargo cache and where possible download precompiled dependencies. This could eliminate most of the pain caused by long chains of deps.
I’ve also seen suggestions to have MIR-only libraries. Compilation to MIR is faster than the full LLVM compilation.
In this case I’m talking about code built with Buck, which already has this kind of global artifact cache. It is populated by CI and it helps a lot for the portions of the dependency graph that you’re not modifying. But the cases I’m talking about is when you do have local modifications which affect the bottlenecks so they need to be rebuilt. In other words, I’m looking for the next improvement.
I was thinking that could either be deferred until final link time (when the full crate will be available), or the interface crate could have the MIR (or something) of the generic function to allow monomorphization - assuming that generating that representation is cheap compared to full compilation.
Of course its quite possible the entire crate is generic stuff, in which case it’s equivalent to a C++ header-only template library. But on the upside, that should mean building the crate itself is quick - it just makes all its dependents slow.
I’ve noticed these „bottleneck“ crates as you name them. I wonder about a different solution, one that maybe could be easier to implement.
I suspect the bottleneck happens on long paths from a leaf to the root of the dependency graph. So it would be beneficial to start working on the long paths sooner and leave the short ones to fill in the parallelism when there’s the bottleneck on the long one. I guess that cargo now picks first n crates which already have no yet uncompiled dependencies, but picking in a more clever way from these might help. Some kind of score (longest path to the root, number of reverse dependencies, …) might help with that. But maybe such heuristics already exists and I just don’t know about it.
That could help. But in this case we’re building on machines with quite a lot of cores, and the problem is not enough parallelism to keep them all busy rather than filling all the cores with the wrong jobs first.
We’re also using Buck rather than Cargo, which generally uses more aggressive parallelism when its available. (Though I don’t know whether it prioritizes targets on critical path chains.)
FWIW, I have been noodling on supporting -Zalways-encode-mir in Cargo to increase parallelism by using rmeta files as dependencies. I think rmeta files should be sufficient to do what you want, but I am a little skeptical at this point.
Unfortunately I haven’t gotten very far. Also, I doubt it would help with Buck since the current strategy is to stagger the outputs (so once the rmeta is ready, other jobs can be spawned, and compilation of the rlib continues).
Oh, that sounds really interesting. Does that mean the rmeta would have the same SVH as the rlib so that all the final linkage constraints still work?
Do you mean one invocation of rustc would generate both the .rmeta and also the .rlib? (I don't think you do, but I'm interested in thinking about moving beyond the batch model of compilation in general.)
In any case, Buck has tons of custom logic for each supported language, so it can do whatever Cargo can do to invoke rustc, and it already has support for similar compilation models for C++ and Java.
Compilation of crates and their dependencies could be overlapped in complicated ways. E.g. if crate B depends on crate A, then once you’ve parsed A enough to know which names are macros and the contents of those macros, you can start parsing B. Once you know the types of A, you can start typechecking B.
I think ultimately we need a Rust compiler that can compile an entire dependency tree of crates all at once.
I think that would hit scaling limits. I'm thinking more of moving the compiler away from a batch-oriented "invoke once to get an output" to a streaming model where results are available as they're generated (and early termination if later results are no longer needed). That is, not too different from what you're talking about, but still keeping the compilation of separate crates somewhat decoupled.
I think the RLS 2.0 work is heading in this direction.
I don't know the answer to that. I know very little about how the compiler side works.
Correct. Cargo would use --emit=dep-info,metadata,link, and rustc would emit a message midway once the rmeta has been written. The final link step would then block until all the rlibs are finished.