Parallelizing rustc using Rayon
My plan for parallelizing rustc
basically is to allow use of Rayon anywhere in the compiler. If you’re unfamiliar with Rayon or need a refresher, I suggest you read http://smallcultfollowing.com/babysteps/blog/2015/12/18/rayon-data-parallelism-in-rust/.
There are a number of loops in rustc
which just iterates over all items in a crate and these should use Rayon iterators instead. Parallelizing the loops for type checking and borrow checking are particularly impactful. This proposal does not alter the structure of the compiler which means it’s reasonably easy to implement (unlike MIR or incremental compilation). You can find an implementation of most of this proposal at https://github.com/Zoxc/rust/tree/rayon-queries and https://github.com/Zoxc/rayon/tree/fiber for the Rayon changes.
Now just using Rayon sounds quite simple, but there are a number of issues which prevents that from working:
- Data structures used in these loops must be thread-safe
-
rustc
uses thread-local storage to store data for a compilation session (for example the string interner). This data cannot be accessed from Rayon thread pool -
rustc
uses thread-local storage to store the currentTyCtxt
- The query engine and incremental compilation stores active tasks as a global stack. This is fundamentally incompatible with parallel execution
- Rayon cannot schedule tasks with a directed acyclic graphs structure without the possibility of deadlock. It can only handle tree structured tasks.
rustc
queries may even be cyclic so this is a problem
Making data structures thread-safe
Fixing this isn’t terribly hard. The compiler helps a lot here and the changes are mostly mechanical where Cell
, Rc
and RefCell
are replaced by thread-safe variants. There is still a risk for race conditions and deadlocks here though, as the compiler only helps with preventing data races. In particular TyCtxt
and GlobalCtxt
with associated interners and allocators need to be thread-safe. Any type returned by queries must also be thread-safe as any thread may now use queries.
Using thread-safe types also adds fair amount of overhead. cargo check
on syntex_syntax
gives me 5% overhead on my AMD R7 1700, although that may vary between systems and crates. Due to that overhead, we’d want to maintain a separate compilation mode without parallelism, at least until using a parallel rustc
is a clear win.
Taming thread-local storage
The use of thread-local storage for compilation session data mostly occurs in libsyntax and is used for string interners, span data and hygiene data. I combined these thread-locals into a single struct Globals
and used a thread-local pointer which points to this struct. When we create the Rayon thread pool, we set this thread-local pointer to the Globals
struct for the current session on all the threads. Now ideally I’d prefer there to be explicit refereces to this Globals
struct instead of using implicit thread-local state, but libsyntax (and even libproc_macro) is designed with implicit state in mind.
One complication here is that InternedString
currently stores a &'static str
and also implements Deref<Target=str>
. This string is actually a pointer into the string interner which does not have a 'static
lifetime. This is fixable by storing a Symbol
instead and remove the Deref<Target=str>
which does have some impact on uses of InternedString
, see https://github.com/rust-lang/rust/pull/46972. This fix also make it safe for InternedString
to implement Send
.
Another way to solve this would be to create a new thread which has the same lifetime as the string interner. If InternedString
also does not implement Send
, this ensures that any InternedString
created on this new thread cannot escape, so there wouldn’t be a way to access the &'static str
from other threads. This would allow you to use an InternedString
which implements Deref<Target=str>
in the compiler, assuming you don’t need to send it across threads. However, we’d still need to add another InternedString
-like type which do implement Send
for the cases where sending things across thread is necessary.
rustc
stores a reference to the current TyCtxt
in thread-local storage, and this is used in various locations. When we use rayon::join
or some other Rayon primitive, we need this context to be preserved in the transition to the Rayon thread pool. To enable this to work we change Rayon to expose a pointer sized thread-local variable. The compiler will set this variable to point to a new struct ImplicitCtxt
which contains a TyCtxt
. Rayon will read the current value of this variable before transfering jobs across threads, then when a job is executed or resumed it will set thread-local variable on the new thread. This effectively makes rayon::join
preserve this thread-local variable.
Getting rid of the global stacks
The query engine and incremental compilation currently store a stack each in GlobalCtxt
. The last entry of those stacks represents the current task. This cannot represent the multiple concurrent tasks which are in flight with multiple threads.
Incremental compilation makes no use of any other entry than the last, so we can just store the current task in the ImplicitCtxt
struct we introduced. When starting a new task, we read the previous one into a local variable and restore it when the new task is done.
For the query engine, we actually need the information in the stack to find cycles in the dependencies between queries. We can instead make a query struct (QueryJob
) and also store a pointer to that it in ImplicitCtxt
. The query struct will include a pointer to the (parent) query which spawns it. Following these pointers recovers the stack for a query. If one query spawns multiple queries concurrently there will be multiple query structs pointing to the spawner’s query struct. The active queries now form a tree instead of a stack.
Executing directed acyclic graphs on Rayon
Rayon cannot currently schedule tasks with a directed acyclic graphs structure. Say we have a query A
which depends on C
, and also a query B
which depends on A
. Lets assume we have a Rayon thread with A
and B
on its queue of jobs. A
ran first and it needs to wait for C
. C
happened to be processed on another thread so Rayon helpfully tries to execute other available work before sleeping so it will start executing B
. B
in turn needs to wait for A
. This is a problem since all of this happens on the same stack. We need to wait for A
in order to resume processing B
. We cannot jump up the stack and start processing A
again, that would overwrite the stack which B
was using. Neither A
or B
can process and we have a deadlock.
To solve this problem we can give A
and B
their own stacks and execution states. So when B
needs to wait for A
in the example above, we can just switch to the stack of A
and continue executing. So we must make Rayon jobs use fibers (also known as stackfull coroutines). Luckily we only need to create fibers when jobs actually start executing. The way rayon::join
works is that it will run one of the closures passed to it on the current thread and it will put the other closures on the work list to be stolen. In the common case where the job does not get stolen it will be removed from the work list, in which case we do not need to create any fibers. Effectively we’ll only need to create a fiber when work moves across threads. My Rayon branch has this implemented. It is using the context crate or Windows fiber API so it is quite portable. These changes do not affect Rayon’s API either, so you can use it as you normally would.
Query states and waiting on queries
The query engine will store multiple states per query instead of just a possible result. A query may be completed, started, poisoned or it may have caused a cycle-error.
enum QueryResult<'tcx, T> {
Started(Lrc<QueryJob<'tcx>>),
Complete(T),
Cycle(CycleError<'tcx>),
Poisoned,
}
Queries are marked as poisoned if the query panicked. Any query waiting on a poisoned query it will itself panic silently (using std::panic::resume_unwind
). There may be multiple queries waiting on a single query and these all need to be unwound.
Queries which are already started can be waited upon. This works by asking Rayon to wait for the QueryJob
in question. QueryJob
implements a Waitable
trait introduced in Rayon and it will get called with the fiber of the waitee and also the thread-local variable we introduced in Rayon which points to the ImplicitCtxt
of the waitee. These are added to a list in QueryJob
. When a query is complete, we’ll ask Rayon to resume the fibers stored in the associated QueryJob
. We can recover the QueryJob
of the waitee from its ImplicitCtxt
, so we can recover a list of all QueryJob
s which are directly waiting for one QueryJob
.
Cycle detection
If we have a cycle in our queries, for example A
depends on B
which depends on A
, it will result in a deadlock in my proposed scheme. To address that we add deadlock detection to Rayon. A Rayon thread pool will keep track of how many external waiters there are on it. When the last thread in the thread pool goes to sleep, we check that there are no external waiters for that thread pool. If there are any external waiters, we call back into rustc
to inform it of the situation.
Once the query engine is informed about a deadlock, it will then walk over all active queries and look for cycles. The query engine contains a map of all active queries and each queries contains the list of queries which are waiting on it. We can then use a depth-first search using this information to find cycles.
Once a cycle is found we generate a cycle error for all queries in the cycle and store those errors as the result of the queries. We then resume all the waiters for all queries in the cycle. I’m not sure about the correctness of this recovery approach. None of the queries are aborted and they may later overwrite the cycle error in the query result with some other value.
This approach for cycle detection results in a higher latency for cycle errors since they only happen once the thread pool is idle. Cycle errors do not seem to be terribly common though and on the upside, this scheme basically detects them for free.
Error messages
If we run two queries in parallel we must now report errors from both of them, even if the errors are fatal (which means they panic). If we do not do this we might get a fatal error for query A
one run and a fatal error for query B
on another one. This means that using a parallel rustc could result in more error messages being visible.
The ordering of error messages would also become non-deterministic. This cannot be fixed by simply sorting them as the order becomes illogical. If a query A
uses the result of another query B
and then emits an error message, we expect to see the error message from A
after those from B
. I’m not sure if this is a big deal, currently cargo
prints error messages from concurrent rustc invocations in a non-deterministic order too.
Parallelizing passes
Let’s look at how we would parallelize borrow checking on MIR. First let’s look at the current code:
time(time_passes,
"MIR borrow checking",
|| for def_id in tcx.body_owners() { tcx.mir_borrowck(def_id); });
We see that it’s using body_owners
to loop over all the bodies in the crate. Let’s look at its definition:
pub fn body_owners(self) -> impl Iterator<Item = DefId> + 'a {
self.hir.krate()
.body_ids
.iter()
.map(move |&body_id| self.hir.body_owner_def_id(body_id))
}
We add a new parallel version of body_owners
called par_body_owners
. Here par_iter
is an utility function which gives us a parallel iterator for a collection.
pub fn par_body_owners<F: Fn(DefId) + Sync + Send>(self, f: F) {
par_iter(&self.hir.krate().body_ids).for_each(|&body_id| {
f(self.hir.body_owner_def_id(body_id))
});
}
Now we can utilize our new function:
time(time_passes,
"MIR borrow checking",
|| { tcx.par_body_owners(|def_id| { tcx.mir_borrowck(def_id); })});
Voilà! We have parallel borrow checking.
Evaluation
We can trivially parallelize a number of passes, which my branch does. My branch also parallelizes LLVM IR generation. It turns out this makes rustc’s performance less sensitive to the number of codegen units used, so you can use more codegen units which may be helpful for incremental compilation, as less code would need to be rebuilt.
Running cargo check
for syntex_syntax
(a relatively large crate) on my branch using an AMD R7 1700 takes 5.31 seconds using 8 threads, 6.25 seconds using 4 threads versus 12.24 seconds before. This is pretty much a best case scenario though. Smaller crates will see less of a benefit and actually generating code will diminish these gains since relatively more time will be spent in LLVM. Building syntex_syntax
in release mode on my branch takes 53.94 seconds with 8 cores versus 59.93 seconds before.
Compared to a more distributed approach, this plan avoids recomputing things by taking advantage of the low cost of coordination on the same machine. I expect this plan will outperform the distributed approach when used with a single machine or when used on smaller crates. It will probably also have reduced memory overhead, since the memory overhead in this plan is mostly due to the additional stacks used by the fibers. Rayon supports work stealing, which means that if one of the threads fall idle, it can steal work from other threads evenly distributing the load. This is not feasible with the distributed approach, which will likely result in uneven loads and reduced parallelism. The downside of this plan is that we have to make large part of the compiler thread-safe, and we have to explicitly parallelize code.
Future prospects
I expect that we may increase the parallelization of rustc by making it more query driven and less pass structured. Passes are bad because it limits the amount of work we can do in parallel to the work available in the current pass. If there’s only one thing left to do in the pass, the rest of our CPU cores will remain idle. We want to ensure there’s always work to do so that does not happen. More short-term we could possibly parallelize more of the compiler. Parsing, macro expansion, metadata creation and loading/storing the incremental compilation graph seem to be good targets.
It could also be possible to compile multiple crates in a single rustc process using a single Rayon thread pool. This would ensure that our systems do not get overloaded with multiple processes spawning multiple threads.
This approach could also be combined with a more distributed approach which would allow rustc to scale up using multiple machines. We could split a Rust crate into multiple parts and compile each part using a single process per machine where that process uses all the cores available to the machine.