Parallel-friendliness in the Rust compiler

Continuing a conversation from another thread:

Out of curiosity, could you list some of those things? Do you think there are specific design choices in the Rust language that block parallel compilation?

(I was thinking about this recently in the context of const generics; it feels like having type-checking depend on the contents of function bodies in some cases would make it incredibly hard to separate compilation into passes. I don't know how much that impacts parallelization.)
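As a minimal example of that dependency (names made up): type-checking the signature of `first` already requires const-evaluating the body of `len`.

const fn len() -> usize {
    3
}

// Checking this signature means evaluating the *body* of `len`.
fn first(xs: [u32; len()]) -> u32 {
    xs[0]
}

fn main() {
    println!("{}", first([1, 2, 3]));
}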

One that comes to mind is this case that the R-A folks brought up: [Edition vNext] Consider deprecating weird nesting of items · Issue #65516 · rust-lang/rust · GitHub

Basically, I think there are two categories of things that cause problems:

  1. Anything that requires knowing the full set of stuff to run a check. For example, coherence overlap checking needs all the impls; see the sketch after this list. (But one benefit of the orphan coherence rules is that you don't need to worry about downstream impls, since they're rejected.)

  2. Anything that requires knowing the inside of something. For example, one could conceivably note that there is a trait implementation and add that to the coherence checking set as a separable thing from looking at the actual implementation of it (and checking those methods, writing down which methods have non-default implementations, etc). But things like 3245-refined-impls - The Rust RFC Book break that slightly, since now there are details from inside the impl … for … block that matter outside of it. And this category is also the one with the auto-trait leakage problem, as well as the "oh, there was an impl in a weird place" R-A issue I linked above.
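A minimal sketch of point 1 (`Greet` is a made-up trait): the overlap check can only accept or reject an impl once it has seen every other impl of the same trait, because any pair might collide. This program is rejected:

trait Greet {}

// A blanket impl covers every `Clone` type...
impl<T: Clone> Greet for T {}

// ...so this one overlaps it, because `String: Clone`.
impl Greet for String {} // error[E0119]: conflicting implementations of trait `Greet` for type `String`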


There are two styles of parallel compilation one can imagine.

One style is a pipeline of phases, some of which are embarrassingly parallel. A different style is a more general graph of queries, which can be explored in parallel. Rust is pretty hostile to the first one, but should work moderately ok with the second one.

A hypothetical ideally parallel Rust for the first case would look like this:

  • crates come with API/header files (either hand-written or generated), so that, when compiling a graph of crates, everything can be compiled in parallel, not necessarily in dependency order.
  • within a single crate, parsing & lowering of every file proceeds in parallel. That is, the module structure is inferred from the file system (so you don't need to parse the parent module to learn where the child is), and meta-programming works on a per-file basis (so all used macros are known before the start of compilation, and macros are name-resolved syntactically; basically, there's a flat namespace of macros per crate)
  • after lowering, there's a short, per-crate global phase of name binding, where imports and references in top-level declarations and function signatures are resolved, but we don't look into function bodies (in Rust, parsing and macro expansion have to happen here)
  • now that we know signatures, we can again compile every function completely independently (in Rust, impl Trait auto-trait leakage is problematic for this; see the sketch below)
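A minimal sketch of that auto-trait leakage (names made up): whether the opaque return type of `opaque` is `Send` can only be discovered by looking inside its body, so this is rejected.

use std::rc::Rc;

// Nothing in this signature says whether the returned closure is `Send`...
fn opaque() -> impl Fn() -> u32 {
    let rc = Rc::new(1u32); // ...but `Rc` is `!Send`, and that fact leaks out.
    move || *rc
}

fn assert_send<T: Send>(_: T) {}

fn main() {
    assert_send(opaque()); // error: `Rc<u32>` cannot be sent between threads safely
}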

I think this is roughly the compilation model Carbon is aiming at.


Bouncing off this post to think about what a more efficient Rust could do:

How much inter-crate parallelism could you get from header files? Since a crate's types depend on its parent crates' types (or even their function bodies, with const generics), it feels like most of a crate's compilation would still be blocked on compiling its dependencies.

On the other hand, that seems like something you wouldn't need if you had pre-compiled crate headers. The header would include the module hierarchy with file locations, modulo cfg annotations. (Basically the header would include a mod-only filter of the entire file hierarchy.)
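For instance, a hypothetical header (module names invented for illustration) could be as small as:

// Hypothetical pre-compiled crate header: the mod-only view of the source tree.
mod parser;    // -> src/parser.rs
mod codegen;   // -> src/codegen/mod.rs

#[cfg(windows)]
mod win32;     // -> src/win32.rs, only present under this cfg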

Right. The big advantage that comes to mind, if you have a model where a file can only access macros from upstream crates, plus a fixed set of files per crate, plus the current file, is that you can do a bunch of work (interning + tokenizing + parsing) knowing that the result depends only on the file's content. During incremental compilation you can then load it based on mtime alone, without even reading the file.
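A sketch of what that buys you (all names hypothetical): the cached parse is keyed on the path and mtime alone, so an unchanged file is never even opened.

use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

// Hypothetical cache entry: a parse result plus the mtime it was computed at.
struct CachedParse {
    mtime: SystemTime,
    ast: String, // stand-in for a real AST
}

fn parse_or_reuse(path: &Path, cache: &mut HashMap<PathBuf, CachedParse>) -> std::io::Result<()> {
    let mtime = fs::metadata(path)?.modified()?;
    if cache.get(path).map_or(true, |c| c.mtime != mtime) {
        // Cache miss or stale entry: read and "parse" the file.
        let ast = fs::read_to_string(path)?;
        cache.insert(path.to_path_buf(), CachedParse { mtime, ast });
    }
    // Otherwise: mtime unchanged, reuse the old result without touching the file.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut cache = HashMap::new();
    parse_or_reuse(Path::new("src/main.rs"), &mut cache)?;
    parse_or_reuse(Path::new("src/main.rs"), &mut cache)?; // no-op: file untouched
    Ok(())
}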

I thought there'd be a bunch of time-consuming steps between "name binding" and "function signatures are resolved", so it wouldn't quite be a short per-crate global phase?

In particular, even if most of the trait resolution occurs in function bodies, you still need to do some to figure out signatures, right?

(In general, trait resolution seems to be the part that's most problematic if you want a "check file timestamp, load cached analysis if file untouched" makefile-like model.)

It feels like you could use some sort of map-reduce algorithm here, where most of the work is done in parallel?

Like, first do trait resolution that doesn't know whether the impl Trait type implements Send + Sync, and complete the resolution later? (But that might be difficult, because type inference might be affected by whether or not a type implements auto traits, so I dunno)

(I've been thinking about this a lot since the GitHub post on stack graphs came out. It feels like there's a lot of untapped compile-speed potential in doing a lot of memoize-friendly work per-file before doing a per-crate analysis.)

Take a look at all the methods on Any: https://doc.rust-lang.org/std/any/trait.Any.html#implementations. It's technically a different method if called on dyn Any + 'static or dyn Any + Send + 'static or …
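For example, these two calls go through two distinct inherent impls in std, one on `dyn Any` and one on `dyn Any + Send`:

use std::any::Any;

fn main() {
    let plain: Box<dyn Any> = Box::new(1u32);
    let send: Box<dyn Any + Send> = Box::new(1u32);

    // Same-looking calls, but each resolves to a different `downcast_ref`,
    // defined in a different inherent impl block.
    assert_eq!(plain.downcast_ref::<u32>(), Some(&1));
    assert_eq!(send.downcast_ref::<u32>(), Some(&1));
}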

So it would be nice to treat auto traits like lifetimes and just check conformance with them later, but I suspect it's not actually possible. (Maybe over an edition something like that could be made to work, though? Probably would need a way for things to say they're generic over auto-trait-ness, though.)

I don't think that applies in the general case, though. Eg: this doesn't compile:

struct Foobar<T>(T);

impl<T> Foobar<T> {
    fn bar(&self) {
        println!("hello");
    }
}

impl<T: Send + Sync> Foobar<T> {
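    // error[E0592]: duplicate definitions with name `bar`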
    fn bar(&self) {
        println!("goodbye");
    }
}

fn main() {
    let foo = Foobar(42);
    foo.bar();
}

Probably would need a way for things to say they're generic over auto-trait-ness, though.

I'm not sure you need to go that far. As long as you can do type inference and symbol resolution before resolving auto traits, you should still have some parallelism and incrementality, right? As long as eg associated types can't differ based on whether a type implements auto traits, you should only get obligations that can be deferred.

(Negative impls would create more blockers for parallelism, though)

But they can, even on stable: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5b44e898ac988bcad90fe80bdf920abe

trait Something {}

trait Hello {
    type World;
}

impl Hello for dyn Something {
    type World = ();
}

impl Hello for dyn Something + Send {
    type World = i32;
}
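And the two projections really do resolve to different types:

fn main() {
    let _unit: <dyn Something as Hello>::World = ();
    let _int: <(dyn Something + Send) as Hello>::World = 0i32;
}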

(Specialization, of course, could make this even more prevalent.)

Swift and C++ are moving to a two-step model. Compiling the surface of a module and compiling the module can be two separate steps. Maybe Rust is different.

Might be interesting to steal JavaScript patterns here, like a pre-parse step that just gathers the visible top-level items so cross-file references can be resolved as quickly as possible (assuming it doesn't do this already!)

Honestly it seems like there's plenty of room to parallelise a build, but it's just a huge amount of work for not much gain in the common case of lots of parent crates that can be built mostly in parallel.

Being entirely single-threaded at the moment does imply there's some relatively low-hanging fruit, though: perhaps try to overlap the top-level phases by starting with leaves (functions, basic blocks, whatever) as you find them in the previous phases, and just put them to the back of the queue when you find you're missing data? This assumes that you can tell when you're missing data and can easily "taskify" the work, of course. Perhaps async helps?
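A minimal sketch of that taskify-and-requeue idea (everything here is made up for illustration): a task either completes, or reports that an input isn't ready yet and goes to the back of the queue.

use std::cell::Cell;
use std::collections::VecDeque;
use std::rc::Rc;

enum Outcome {
    Done,
    MissingData,
}

fn drain(mut queue: VecDeque<Box<dyn FnMut() -> Outcome>>) {
    let mut stalled = 0;
    while let Some(mut task) = queue.pop_front() {
        match task() {
            Outcome::Done => stalled = 0, // progress was made, reset the counter
            Outcome::MissingData => {
                queue.push_back(task); // retry after the other tasks have run
                stalled += 1;
                if stalled > queue.len() {
                    break; // every remaining task is blocked: bail out instead of spinning
                }
            }
        }
    }
}

fn main() {
    let ready = Rc::new(Cell::new(false));
    let (r1, r2) = (ready.clone(), ready.clone());
    let mut tasks: VecDeque<Box<dyn FnMut() -> Outcome>> = VecDeque::new();
    // The first task is blocked until the second one has produced its data.
    tasks.push_back(Box::new(move || {
        if r1.get() { Outcome::Done } else { Outcome::MissingData }
    }));
    tasks.push_back(Box::new(move || {
        r2.set(true);
        Outcome::Done
    }));
    drain(tasks);
}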


If you have an infinite number of cores (🙂):

  1. compile all surfaces in parallel
  2. compile all modules in parallel
  3. link the final artefact.

This might be possible with ISO C++ modules.

Cargo currently pipelines execution as soon as metadata is available, but I recall some conversation on Zulip that there are caveats (build scripts? LTO? proc macros? dylibs?) which can lead to the critical path being blocked on the full compilation of a dependency.

Maybe there's an opportunity to make this more fine-grained: cargo invokes rustc, which proceeds as far as it can on the metadata alone and then blocks and yields its jobserver tokens once it needs the fully compiled dependency. But that'd need a notification mechanism.

There are some cases, like build scripts that compile vendored C libraries, that could be parallelised more if Cargo had more knowledge of what the build script is doing (see metabuild). If the Rust code does not actually depend on the compiled C library and has a hardcoded extern {} block, or if the build script runs bindgen to generate the bindings before compiling the library, then there's no need to wait for the build script to finish; it should be possible to go on to compiling the Rust library and only block on the build script completing the sub-compile at the link stage.

The gcc ISO C++ module system has a module mapper.

It is basically an RPC mechanism between the compiler and the build system.

Cargo could be an RPC server, and rustc would send it RPCs.

I think dyn traits are a special case? As far as I'm aware, there's no way for the compiler to infer whether something is supposed to be dyn Something or dyn Something + Send unless you explicitly specify it. So parallel compilation would still work.

Min specialization wouldn't, since it wouldn't include associated types, right?

Full-on "we're not sure what the semantics would be" specialization would.

For fresh builds, yeah. But if you want snappy incremental builds (or, at the extreme, hot reloading) on low-end computers, then intra-crate parallelization becomes attractive again.

I was thinking about this because these days I'm coding on my laptop and VS Code takes 5+ seconds to give me error messages on save, for a moderately large crate. Not bad by any means (I remember hour-long build times and those were not fun), but juuuust long enough that I lose a bit of focus and end up thinking "man, if we could improve incremental compilation just a bit, my work would go way faster".


C++ modules have an inherent ordering to them, so compilation is no longer embarrassingly parallel. Even better, the ordering is content-dependent, so if you change file contents (namely import and export sets), a file's position in the compilation graph may have to change.

I was talking about a full build with an infinite number of cores. The surfaces are passed to the module compilations to resolve dependencies. It is embarrassingly parallel.

For a small number of cores, you will have some parallelism and have to follow the dependencies.

Incremental builds are a different challenge.

What makes you think that you don't have to follow dependencies when you have an infinite number of cores? You still have to compile at least the interface before you can compile dependent modules. You can probably start compiling dependent modules before codegen of the current module is finished. The same has been true for Rust since the introduction of build pipelining in rustc/cargo.

ISO C++ modules call them BMI files, or Binary Module Interface files. They describe the surface of a module, e.g. its exported items. If I have all the BMI files, then I can compile any module, passing it the necessary BMI files.

Module A depends on Module B.

Module A depends on the BMI file of Module B.

The same thing is true in rustc. Except rustc calls it .rmeta files rather than BMI files.
