Parallelizing rustc using Rayon

Zoxc · January 19, 2018, 4:45am

Parallelizing rustc using Rayon

My plan for parallelizing rustc basically is to allow use of Rayon anywhere in the compiler. If you’re unfamiliar with Rayon or need a refresher, I suggest you read http://smallcultfollowing.com/babysteps/blog/2015/12/18/rayon-data-parallelism-in-rust/.

There are a number of loops in rustc which just iterates over all items in a crate and these should use Rayon iterators instead. Parallelizing the loops for type checking and borrow checking are particularly impactful. This proposal does not alter the structure of the compiler which means it’s reasonably easy to implement (unlike MIR or incremental compilation). You can find an implementation of most of this proposal at https://github.com/Zoxc/rust/tree/rayon-queries and https://github.com/Zoxc/rayon/tree/fiber for the Rayon changes.

Now just using Rayon sounds quite simple, but there are a number of issues which prevents that from working:

Data structures used in these loops must be thread-safe
rustc uses thread-local storage to store data for a compilation session (for example the string interner). This data cannot be accessed from Rayon thread pool
rustc uses thread-local storage to store the current TyCtxt
The query engine and incremental compilation stores active tasks as a global stack. This is fundamentally incompatible with parallel execution
Rayon cannot schedule tasks with a directed acyclic graphs structure without the possibility of deadlock. It can only handle tree structured tasks. rustc queries may even be cyclic so this is a problem

Making data structures thread-safe

Fixing this isn’t terribly hard. The compiler helps a lot here and the changes are mostly mechanical where Cell, Rc and RefCell are replaced by thread-safe variants. There is still a risk for race conditions and deadlocks here though, as the compiler only helps with preventing data races. In particular TyCtxt and GlobalCtxt with associated interners and allocators need to be thread-safe. Any type returned by queries must also be thread-safe as any thread may now use queries.

Using thread-safe types also adds fair amount of overhead. cargo check on syntex_syntax gives me 5% overhead on my AMD R7 1700, although that may vary between systems and crates. Due to that overhead, we’d want to maintain a separate compilation mode without parallelism, at least until using a parallel rustc is a clear win.

Taming thread-local storage

The use of thread-local storage for compilation session data mostly occurs in libsyntax and is used for string interners, span data and hygiene data. I combined these thread-locals into a single struct Globals and used a thread-local pointer which points to this struct. When we create the Rayon thread pool, we set this thread-local pointer to the Globals struct for the current session on all the threads. Now ideally I’d prefer there to be explicit refereces to this Globals struct instead of using implicit thread-local state, but libsyntax (and even libproc_macro) is designed with implicit state in mind.

One complication here is that InternedString currently stores a &'static str and also implements Deref<Target=str>. This string is actually a pointer into the string interner which does not have a 'static lifetime. This is fixable by storing a Symbol instead and remove the Deref<Target=str> which does have some impact on uses of InternedString, see https://github.com/rust-lang/rust/pull/46972. This fix also make it safe for InternedString to implement Send.

Another way to solve this would be to create a new thread which has the same lifetime as the string interner. If InternedString also does not implement Send, this ensures that any InternedString created on this new thread cannot escape, so there wouldn’t be a way to access the &'static str from other threads. This would allow you to use an InternedString which implements Deref<Target=str> in the compiler, assuming you don’t need to send it across threads. However, we’d still need to add another InternedString-like type which do implement Send for the cases where sending things across thread is necessary.

rustc stores a reference to the current TyCtxt in thread-local storage, and this is used in various locations. When we use rayon::join or some other Rayon primitive, we need this context to be preserved in the transition to the Rayon thread pool. To enable this to work we change Rayon to expose a pointer sized thread-local variable. The compiler will set this variable to point to a new struct ImplicitCtxt which contains a TyCtxt. Rayon will read the current value of this variable before transfering jobs across threads, then when a job is executed or resumed it will set thread-local variable on the new thread. This effectively makes rayon::join preserve this thread-local variable.

Getting rid of the global stacks

The query engine and incremental compilation currently store a stack each in GlobalCtxt. The last entry of those stacks represents the current task. This cannot represent the multiple concurrent tasks which are in flight with multiple threads.

Incremental compilation makes no use of any other entry than the last, so we can just store the current task in the ImplicitCtxt struct we introduced. When starting a new task, we read the previous one into a local variable and restore it when the new task is done.

For the query engine, we actually need the information in the stack to find cycles in the dependencies between queries. We can instead make a query struct (QueryJob) and also store a pointer to that it in ImplicitCtxt. The query struct will include a pointer to the (parent) query which spawns it. Following these pointers recovers the stack for a query. If one query spawns multiple queries concurrently there will be multiple query structs pointing to the spawner’s query struct. The active queries now form a tree instead of a stack.

Executing directed acyclic graphs on Rayon

Rayon cannot currently schedule tasks with a directed acyclic graphs structure. Say we have a query A which depends on C, and also a query B which depends on A. Lets assume we have a Rayon thread with A and B on its queue of jobs. A ran first and it needs to wait for C. C happened to be processed on another thread so Rayon helpfully tries to execute other available work before sleeping so it will start executing B. B in turn needs to wait for A. This is a problem since all of this happens on the same stack. We need to wait for A in order to resume processing B. We cannot jump up the stack and start processing A again, that would overwrite the stack which B was using. Neither A or B can process and we have a deadlock.

To solve this problem we can give A and B their own stacks and execution states. So when B needs to wait for A in the example above, we can just switch to the stack of A and continue executing. So we must make Rayon jobs use fibers (also known as stackfull coroutines). Luckily we only need to create fibers when jobs actually start executing. The way rayon::join works is that it will run one of the closures passed to it on the current thread and it will put the other closures on the work list to be stolen. In the common case where the job does not get stolen it will be removed from the work list, in which case we do not need to create any fibers. Effectively we’ll only need to create a fiber when work moves across threads. My Rayon branch has this implemented. It is using the context crate or Windows fiber API so it is quite portable. These changes do not affect Rayon’s API either, so you can use it as you normally would.

Query states and waiting on queries

The query engine will store multiple states per query instead of just a possible result. A query may be completed, started, poisoned or it may have caused a cycle-error.

enum QueryResult<'tcx, T> {
    Started(Lrc<QueryJob<'tcx>>),
    Complete(T),
    Cycle(CycleError<'tcx>),
    Poisoned,
}

Queries are marked as poisoned if the query panicked. Any query waiting on a poisoned query it will itself panic silently (using std::panic::resume_unwind). There may be multiple queries waiting on a single query and these all need to be unwound.

Queries which are already started can be waited upon. This works by asking Rayon to wait for the QueryJob in question. QueryJob implements a Waitable trait introduced in Rayon and it will get called with the fiber of the waitee and also the thread-local variable we introduced in Rayon which points to the ImplicitCtxt of the waitee. These are added to a list in QueryJob. When a query is complete, we’ll ask Rayon to resume the fibers stored in the associated QueryJob. We can recover the QueryJob of the waitee from its ImplicitCtxt, so we can recover a list of all QueryJobs which are directly waiting for one QueryJob.

Cycle detection

If we have a cycle in our queries, for example A depends on B which depends on A, it will result in a deadlock in my proposed scheme. To address that we add deadlock detection to Rayon. A Rayon thread pool will keep track of how many external waiters there are on it. When the last thread in the thread pool goes to sleep, we check that there are no external waiters for that thread pool. If there are any external waiters, we call back into rustc to inform it of the situation.

Once the query engine is informed about a deadlock, it will then walk over all active queries and look for cycles. The query engine contains a map of all active queries and each queries contains the list of queries which are waiting on it. We can then use a depth-first search using this information to find cycles.

Once a cycle is found we generate a cycle error for all queries in the cycle and store those errors as the result of the queries. We then resume all the waiters for all queries in the cycle. I’m not sure about the correctness of this recovery approach. None of the queries are aborted and they may later overwrite the cycle error in the query result with some other value.

This approach for cycle detection results in a higher latency for cycle errors since they only happen once the thread pool is idle. Cycle errors do not seem to be terribly common though and on the upside, this scheme basically detects them for free.

Error messages

If we run two queries in parallel we must now report errors from both of them, even if the errors are fatal (which means they panic). If we do not do this we might get a fatal error for query A one run and a fatal error for query B on another one. This means that using a parallel rustc could result in more error messages being visible.

The ordering of error messages would also become non-deterministic. This cannot be fixed by simply sorting them as the order becomes illogical. If a query A uses the result of another query B and then emits an error message, we expect to see the error message from A after those from B. I’m not sure if this is a big deal, currently cargo prints error messages from concurrent rustc invocations in a non-deterministic order too.

Parallelizing passes

Let’s look at how we would parallelize borrow checking on MIR. First let’s look at the current code:

time(time_passes,
     "MIR borrow checking",
     || for def_id in tcx.body_owners() { tcx.mir_borrowck(def_id); });

We see that it’s using body_owners to loop over all the bodies in the crate. Let’s look at its definition:

pub fn body_owners(self) -> impl Iterator<Item = DefId> + 'a {
    self.hir.krate()
            .body_ids
            .iter()
            .map(move |&body_id| self.hir.body_owner_def_id(body_id))
}

We add a new parallel version of body_owners called par_body_owners. Here par_iter is an utility function which gives us a parallel iterator for a collection.

pub fn par_body_owners<F: Fn(DefId) + Sync + Send>(self, f: F) {
    par_iter(&self.hir.krate().body_ids).for_each(|&body_id| {
        f(self.hir.body_owner_def_id(body_id))
    });
}

Now we can utilize our new function:

time(time_passes,
     "MIR borrow checking",
     || { tcx.par_body_owners(|def_id| { tcx.mir_borrowck(def_id); })});

Voilà! We have parallel borrow checking.

Evaluation

We can trivially parallelize a number of passes, which my branch does. My branch also parallelizes LLVM IR generation. It turns out this makes rustc’s performance less sensitive to the number of codegen units used, so you can use more codegen units which may be helpful for incremental compilation, as less code would need to be rebuilt.

Running cargo check for syntex_syntax (a relatively large crate) on my branch using an AMD R7 1700 takes 5.31 seconds using 8 threads, 6.25 seconds using 4 threads versus 12.24 seconds before. This is pretty much a best case scenario though. Smaller crates will see less of a benefit and actually generating code will diminish these gains since relatively more time will be spent in LLVM. Building syntex_syntax in release mode on my branch takes 53.94 seconds with 8 cores versus 59.93 seconds before.

Compared to a more distributed approach, this plan avoids recomputing things by taking advantage of the low cost of coordination on the same machine. I expect this plan will outperform the distributed approach when used with a single machine or when used on smaller crates. It will probably also have reduced memory overhead, since the memory overhead in this plan is mostly due to the additional stacks used by the fibers. Rayon supports work stealing, which means that if one of the threads fall idle, it can steal work from other threads evenly distributing the load. This is not feasible with the distributed approach, which will likely result in uneven loads and reduced parallelism. The downside of this plan is that we have to make large part of the compiler thread-safe, and we have to explicitly parallelize code.

Future prospects

I expect that we may increase the parallelization of rustc by making it more query driven and less pass structured. Passes are bad because it limits the amount of work we can do in parallel to the work available in the current pass. If there’s only one thing left to do in the pass, the rest of our CPU cores will remain idle. We want to ensure there’s always work to do so that does not happen. More short-term we could possibly parallelize more of the compiler. Parsing, macro expansion, metadata creation and loading/storing the incremental compilation graph seem to be good targets.

It could also be possible to compile multiple crates in a single rustc process using a single Rayon thread pool. This would ensure that our systems do not get overloaded with multiple processes spawning multiple threads.

This approach could also be combined with a more distributed approach which would allow rustc to scale up using multiple machines. We could split a Rust crate into multiple parts and compile each part using a single process per machine where that process uses all the cores available to the machine.

djc · January 19, 2018, 9:02am

Since you have some implementation done, have you done any benchmarks with that?

I was thinking about this stuff due to the thread about compilation speed currently going on over at dev-platform (tl;dr: the Rust crate depgraph is now the long pole for compiling mozilla-central).

Zoxc · January 20, 2018, 5:02am

There's some benchmarks in the Evaluation section. I haven't done any extensive benchmarks though.

I don't expect this work to be very helpful there unless you are doing debug builds. For release builds LLVM tends to dominate the runtime, so parallelizing the rest of the compiler isn't too impactful. I expect the approaches suggested here to be more effective, as they would distribute LLVM work across multiple machines.

gnzlbg · January 28, 2018, 6:51pm

Are you planning to send a PR about this upstream?

Zoxc · January 29, 2018, 8:47am

I am planning to send more PRs, assuming that the compiler team approves this plan.

nikomatsakis · February 7, 2018, 4:00pm

Ok, as usual, I apologize for not giving faster feedback. =)

Certainly parallelizing the exeution of the query DAG has always been on the radar. Leaving aside the specifics, I am in favor of this idea. The results for libsyntax certainly give some idea as to its potential.

I think we can separate out a few questions. In this comment, let me just focus on the most basic one: do we want to try to do some intraprocess parallelism at all (at least at this time)? In particular, if the answer is yes – leaving aside any of the specifics – then it makes sense to land those refactorings you had to make various data structures Sync and so forth.

I guess the only real options here are:

Do nothing
- Leaves perf on the table, at a time when perf is top priority.
The “distributed approach”
- Nobody has tried it; still seems useful; but also orthogonal. I.e., we could do this and have intra process parallelism for compiling any given “slice”.
- Just generally seems like it will be less good if we are not actually distributing.
Something roughly like what you are describing
- I don’t know of any other ideas that don’t have this general shape.

To me, the answer is clear: we should be pursuing parallelization. Hence, I think we should land your Sync PR and other refactorings – but, as you say, we should be careful not to degrade sequential performance while we play with this.

However, I think that @eddyb had some concerns here! I’d like to hear them.

nikomatsakis · February 7, 2018, 4:06pm

On the more specific questions, there are two things that I was unclear about.

Should we base this work on rayon? Well not really Rayon, but rather the fork of Rayon that you made. I’m a bit unclear on this. I think the question is whether Rayon would want to adopt this basic approach.

Wearing my Rayon maintainer hat, I think the answer is unclear. I’ll have to look at it in more depth. It seems clear that Rayon does want to support DAGs of tasks like this, but I think I had hoped we could do them through futures-based approaches, rather than strands.

But, in any case, the underlying threadpool needs here are pretty simple. It seems like the compiler could easily start with a relatively simple thread-pool based on fibers and move over to Rayon later (or maintain our own fork).

I guess another interesting question, and I don’t know the answer to it, is how much we will want to be doing par_iter sort of things at a finer grained than the query! My assumption has been that we won’t need to do that, because there’ll be plenty of parallelism opportunities at the level of queries. If we do, though, that argues more for Rayon integration.

Lazy cycle detection. I wasn’t as keen on the “detect cycles by letting compilation get to quiescence”, but I have to put my fingers on why. Perhaps because it requires more work on the thread-pool. It seems like it will be pretty easy for us to walk the cycle stack – given that we have it! – and detect cycles eagerly, as we do now. I don’t believe this is a significant cost, and we know that the relevant portion of cycle stack is frozen for all the time in which our execution is active.

Detecting multiple concurrent attempts to execute the same query. One thing I didn’t see mentioned, or maybe I missed it – how are you detecting multiple attempts to execute the same query? (Presumably we want to avoid this.) I guess that there is some global hashmap? Seems like that would also help to make cycle detection very cheap: if we lookup a given query and we find either (a) a result or (b) no existing task, then can’t be a cycle, but if we find © executing, then we can quickly walk our stack and else block on it.

But really these questions are just details.

glaebhoerl · February 8, 2018, 8:47am

Of course these kind of decisions should be made on their technical merit, but rustc using Rayon sounds nifty from a marketing perspective. A nice rejoinder to the old days of “if Rust is so good at concurrency, then why doesn’t the Rust compiler have any parallelism?”.

matklad · February 8, 2018, 8:58am

Regarding libsyntax, I believe it’s reliance on state is also very detrimental to creating great stable IDE support. My RFC proposes to solve this in a pretty radical way, but any small refactoring to make libsyntax use only local state would also help!

Zoxc · February 10, 2018, 7:50am

I think we should use my Rayon fork irregardless of what upstream Rayon does. Given that rayon-core has a quite stable API, the cost of maintenance of a fork shouldn't be too high. I also don't see how something simpler than Rayon would give us decent performance.

I expect that most, if not all of parallelism will happen within queries, as the compiler becomes more on-demand driven.

This uses the existing hashmaps for query results, but now it contains the QueryResult enum instead of the resulting value. This enum stores the state of the query, and if its already started if will be waited upon.

Lazy cycle detection is free if Rayon work threads are not sleeping and they are no cycles. It adds a couple of non-atomic memory accesses if a worker falls asleep. This is really hard to beat. Note that just walking the query stack is insufficient to eagerly catch cycles. We need to operate on the query graph, which is concurrently modified by other threads. I expect that catching cycles eagerly will be complex and expensive.

michaelwoerister · February 12, 2018, 11:16am

Some general feedback here:

It’s impressive that you were able to put together a proof of concept implementation within the existing compiler codebase. Kudos, @Zoxc!
I also think that, unfortunately, there is no way around adding intra-process parallelism to the compiler. I say unfortunately because it will make the code base more complicated.
I think that adding parallelism at the query-level, as you suggest, is the way to go.
I think your concept of query states makes a lot of sense.

I also have a few concerns (listed in decreasing priority):

I am skeptical of just replacing RefCells with mutexes. You probably did a more careful analysis than that but I would want any shared state to be very carefully adapted to a concurrent environment and as much as possible make sure that we have an exhaustive list of shared state in the compiler.
I’m not quite sure how to best integrate this with incremental compilation. Probably by having something like query states for dependency graph marking / dep-nodes too. It’s not too hard to implement this correctly, I think, but it might be hard to also make it use the available parallelism efficiently instead of just slowing things down by locking.
I’m a bit skeptical about using fibers. But that might just be because I don’t have any experience with them. They are probably awesome :)
I’m not sure if your approach allows for using Rayon within queries, e.g. if there’s a really big array to sort, and I’m not sure if that could mess with general query execution somehow.

We should probably do more research about how to make our interners and ID assignments more amenable to multithreaded access, I guess. I support experimenting with this more! However, before doing specific changes to performance critical stuff (like how Symbol is represented) I’d like to have a longterm plan on how to handle these cases (which will probably fall into a handful of categories). Just to avoid unnecessary churn.

nikomatsakis · February 12, 2018, 9:41pm

So I wonder if we can come up with a kind of “action plan”.

For example, maybe we can start by landing changes a bit at a time – e.g., targeting just the interner, and making that thread safe somehow?

Talking a bit with @eddyb on IRC, he expressed some skepticism about locks, and discussed a desire to sketch out how a “parallel from scratch” design might look.

I’m a bit curious, when it comes to cells and refcells etc – I would expect us to modify the RefCell that store the query database to mutex, and a select few shared structures (like perhaps like the interner). If we are finding a lot more ref-cells, though, that does seem like a problem. I would want to factor those away by moving to queries, not by adding a mutex here or there.

(@Zoxc – can you summarize the places you ended up adding mutexes? If that’s a long enough list to be annoying to write out, that’s an interesting fact in and of itself.)

michaelwoerister · February 14, 2018, 12:40pm

That sounds like a good idea.

I think there are much better options than hashmaps within mutexes. See this comparison for example: New Concurrent Hash Maps for C++

What's best will depend on access patterns too, I guess.

arielb1 · February 14, 2018, 4:26pm

IIRC @Zoxc's initial implementation does not try particularly hard to avoid data-structure contention - it even uses a mutex for the main arena.

I don't see much reason for there to be real significant contention in type-checking and translation, so we could try to pick better-optimized data structures after we get something working.

nikomatsakis · February 14, 2018, 5:33pm

I don't disagree, but I think what I'm getting at is -- I would like to enumerate what "shared, mutable" state we have first. It seems to me like a pretty small tweak to change the details of how the query results are stored. I think starting with a Mutex<HashMap<..>> makes sense, then moving to a more performant map is an obvious bit of follow-on work.

nikomatsakis · February 14, 2018, 5:34pm

Yes -- basically this is what I wanted to say. =)

Zoxc · February 15, 2018, 11:04am

Some mutexes got careful analysis, other less so. We should carefully review each mutex addition for thread safety.

I'm not sure where we could have an exhaustive list of shared state in the compiler which does not get outdated.

I think I have incremental compilation working, just using mutexes. I haven't though about anon queries yet though.

It does work within queries.

Changing a single component probably makes sense for adding mutexes, which has to be carefully reviewed. Other things like changing Rc with Lrc should just be a single PR.

This makes me think you haven't looked at the compiler in a while. There is a fair amount of internal mutability around. Here is the commit which introduces most of the locks. I'm sure a number of these could be moved to queries, but there is quite a number of them which cannot. I expect that very few of these are contended though.

I don't think we'll come up with a design which is obviously better, at least when running on a single computer. I'm sure we can find ways to avoid using locks in places, but that isn't really incompatible with my plan.

Yeah, I've focused more on correctness than performance. Also it turns out that Lift requires us to use a single arena.

nikomatsakis · February 15, 2018, 4:01pm

I don't doubt it. =) My main point is this:

We should limit internal mutability to a set of known cases. For each case, there should be a clear description of the "external" interface. We should maintain this list.

We need this for more than parallelization. A big part of the move to incremental compilation and queries was precisely focused on removing such state and pushing it into queries. Basically, any RefCell that is not integrated with the incremental system is a bug. The same is true of parallelization -- just making it a Mutex might be fine, but better still would be converting to a query or otherwise fitting into another system.

So, I am asking: can we make a list of the "categories" where we need mutexes (or some kind of shared, mutable state)? If the answer is "that's too tedious, go read this PR", then I feel like we probably need some more refactoring PRs first. That PR ought to be fairly short, or at least fairly easy to summarize.

(If you think I'm being too idealistic, please let me know some examples where I am wrong.)

I'd be happy for such a list to live in the rustc-guide. It will get outdated, but it can be fixed when that happens. This problem is not unique to parallelization -- in general we have to move to a culture of documenting our compiler better.

arielb1 · February 15, 2018, 6:19pm

This is besides the point, but Lift doesn't require the arena chunks to be contended between all processes - if every arena chunk was only ever assigned to a single thread, then there would be no contention for the fast path.

nikomatsakis · February 18, 2018, 2:30pm

So I started going through the commit that @Zoxc referenced, trying to categorize the various bits of shared state. I’m not done yet, but I created a GH repo where to store the results of the analysis (and any other documents we might want to produce as far as making a plan). You can see the results in the interior-mutability-list.md file.

Based on what I’ve seen so far, I would say the major areas break down into:

The session
Various kinds of caching and lazy computation
Type interning

Most of these may want to use a lock, but there are also quite a few of these where I think using a lock is questionable:

layout_depth seems just plain wrong, for example; that should not I don’t think be a global property.
the MIR predecessor caching should probably be reworked, I think it can use &mut self and remove the interior mutability – MIR is only ever edited by one thread at a time anyway.
the session should be subsumed anyhow. We ought to create a tcx first thing and drive things through queries.
similarly, all the lazy computations (e.g., all_traits) should become queries.

Anyway, I’ll try to finish the sweep later on – might be hard to do over the weekend, have some family medical emergencies to tend to right now. =( But if anyone else wants to open a PR on the repo, would be welcome; shouldn’t be hard to tell where I left off.

Topic		Replies	Views
Parallelizing Query Execution - All Hands 2018 Discussion Summary compiler	6	1455	March 25, 2019
Help test parallel rustc!	27	11938	May 30, 2020
Could rustc compile dependencies in parallel? compiler	5	3096	June 5, 2019
Next steps for reducing overall compilation time compiler	24	8496	March 25, 2019
Rustc dependency policy compiler	23	3575	March 25, 2019

Parallelizing rustc using Rayon

Parallelizing rustc using Rayon

Making data structures thread-safe

Taming thread-local storage

Getting rid of the global stacks

Executing directed acyclic graphs on Rayon

Query states and waiting on queries

Cycle detection

Error messages

Parallelizing passes

Evaluation

Future prospects

Related topics