Parallelizing rustc using Rayon

I think a good way forward would be to identify which state really needs to be mutably shared and which doesn’t (as @nikomatsakis has started doing), identify cases where state is mutably shared unnecessarily, and then make a list of concrete refactorings to get rid of those cases.

I’ve been thinking a bit about parallelizing incr. comp. and I think at least the coloring part can be synchronized without locks, just with atomic-compare-exchange operations.
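To illustrate, here is a minimal sketch of what lock-free coloring could look like, assuming a simple three-state encoding in an `AtomicU8`; the names and representation are hypothetical, not the actual incr. comp. data structures:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical three-state encoding; not rustc's actual representation.
const UNKNOWN: u8 = 0;
const GREEN: u8 = 1;
const RED: u8 = 2;

struct DepNodeColor(AtomicU8);

impl DepNodeColor {
    fn new() -> Self {
        DepNodeColor(AtomicU8::new(UNKNOWN))
    }

    /// Try to claim the right to color this node. Exactly one thread
    /// wins the compare-exchange; every other thread observes the
    /// color that the winner wrote.
    fn try_color(&self, color: u8) -> u8 {
        match self
            .0
            .compare_exchange(UNKNOWN, color, Ordering::AcqRel, Ordering::Acquire)
        {
            Ok(_) => color,            // we colored the node
            Err(existing) => existing, // another thread got there first
        }
    }
}
```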

There’s also a pattern that, interestingly, seems to occur quite often in the tasks the compiler needs to do. Synchronization can often happen in a two-step fashion: first we use a, possibly lock-free, dictionary to synchronize allocating a shared data structure between threads (e.g. a query state, an interned type, maybe dep-nodes), and then, if that shared data structure is not conceptually immutable, it can contain something that is used to synchronize it. The important thing is that we don’t have to keep the synchronized dictionary locked while making progress.
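A sketch of that two-step pattern, using a plain `Mutex<HashMap<..>>` as the dictionary for brevity (a lock-free map would slot in the same way); `QueryState` here is a hypothetical stand-in:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical per-entry state; once allocated it carries its own
// synchronization (here a Mutex, but it could be a Condvar, atomics, ...).
struct QueryState {
    result: Mutex<Option<String>>,
}

struct QueryMap {
    // Step 1: the shared dictionary. Only held long enough to look up
    // or allocate an entry, never while computing.
    entries: Mutex<HashMap<&'static str, Arc<QueryState>>>,
}

impl QueryMap {
    fn get_or_alloc(&self, key: &'static str) -> Arc<QueryState> {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(key)
            .or_insert_with(|| Arc::new(QueryState { result: Mutex::new(None) }))
            .clone()
        // The dictionary lock is released here; step 2 (filling in the
        // result) synchronizes on the entry itself.
    }
}
```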

Another major area to add to @nikomatsakis’ list is probably the CrateStore infrastructure, which contains lots of weird things and is used before the tcx is created.

5 Likes

Just a few comments from a lurking multi-threading nerd:

I agree that it is most important to start by questioning the very existence of shared mutable state. In a threaded environment, synchronizing that state always comes at a cost in overhead, scalability, and program complexity, which must be motivated by a significant gain elsewhere (e.g. caching of an expensive computation).

Using hierarchical synchronization which separates container access from element access is a double-edged sword. It benefits overall scalability if accesses are well spread across the container, at the expense of requiring more synchronization transactions overall, which in turn increases overhead. For short-lived transactions and under light contention, it can be better to stick with synchronizing the whole container. This sort of trade-off is best evaluated through analysis of existing data structure usage patterns and comparative performance studies of proposed solutions.
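As a concrete example of the hierarchical approach, here is a sketch of a sharded map, where one container-level lock is replaced by one lock per shard; the shard count and hashing scheme are arbitrary choices for illustration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 16; // arbitrary; tune against real contention data

struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new() -> Self {
        ShardedMap {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn shard_for(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % SHARDS]
    }

    fn insert(&self, key: K, value: V) {
        // Two transactions instead of one: pick a shard, then lock only
        // that shard. This scales better when accesses spread across
        // shards, but every operation pays the extra hashing/indirection.
        self.shard_for(&key).lock().unwrap().insert(key, value);
    }
}
```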

Similarly, lock-free vs locking is a delicate trade-off, where I would advise caution. It is important to remember that an uncontended lock requires nothing more than an atomic exchange followed by in-place modification. Many popular lock-free algorithms, by contrast, rely on fallible compare-exchange or load-linked/store-conditional instructions, extra dynamic memory allocations, and either linked data structures (inefficient at read time) or copy-update patterns (inefficient at write time). For these lock-free algorithms, the extra overhead only pays off in read-mostly scenarios or in an intermediary contention regime where…

  • Contention is high enough for the scalability gain to offset the overhead cost.
  • But it is not high enough for the fallible nature of compare-exchange and load-linked/store-conditional to start causing too much wasted work.

Add to this that not all lock-free algorithms are created equal in terms of overhead and scalability characteristics, and here again, it seems important to keep the trade-offs in sight and to make an informed choice on a case-by-case basis.
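To make the "an uncontended lock is just an atomic exchange" point concrete, here is a minimal test-and-set spinlock; real lock implementations add fairness and thread parking, but the uncontended fast path is essentially this:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        // Uncontended case: a single atomic swap and we own the lock.
        while self.locked.swap(true, Ordering::Acquire) {
            // Contended case: spin (a real lock would park the thread).
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```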

11 Likes

Thank you, @HadrienG, I think this is very valuable input. Another thing to note is probably that many lock-free data structures work best with garbage collection available.

OK, I finished my sweep through @zoxc’s commit. Here is the full list of stuff I saw. I then went through and grouped it into four categories:

  • The first are the core structures that seem like they definitely require synchronization.
  • The second are things where I think we could/should refactor it away in a fairly obvious way – mostly this amounts to extending the query system.
  • The third are things where locks are wrong.
  • The final are cases where I’m unsure. In some cases, I linked into the commit.

core structures that will require synchronization

things that can be refactored away (and some notes on how)

  • hir-map (inlined bodies)
    • I think that @oli-obk’s work on miri makes this go away
  • session: (in general, this should go away and become the tcx)
    • buffered-lints
      • iirc, the only reason this exists is because the tcx isn’t around to produce the lint-level map query yet
    • recursion-limit
      • move to query
    • entry-fn, plugin crap, etc
      • move to query
    • crate disambiguator
      • move to query
    • features
      • move to query
    • etc
  • caching
    • MIR pred cache
      • this is linked to a MIR and cleared when it changes. Now that MIR moves linearly through the system, I suspect we could recode it to use &mut. Or maybe make the computation more explicit (e.g., the cache is populated via an explicit action and then queried; if it is not populated, we get a panic; see the sketch after this list). Synchronization here seems really wasteful in any case.
    • trait system
      • ties in to the WG-traits plans; I would like to completely revamp how trait caching works anyway.
      • but something with locks is prob ok for now
  • all_traits (and this too)
    • move to query
  • MIR stealing
  • set of active features
    • move to query
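Regarding the MIR pred cache above, a hypothetical sketch of the "explicit population" pattern, where &mut replaces synchronization and querying an unpopulated cache is treated as a bug:

```rust
// Sketch only; not the actual rustc types.
struct PredecessorCache {
    predecessors: Option<Vec<Vec<usize>>>, // indexed by basic block
}

impl PredecessorCache {
    fn invalidate(&mut self) {
        // Called whenever the MIR changes; requires &mut, so no locking.
        self.predecessors = None;
    }

    fn populate(&mut self, predecessors: Vec<Vec<usize>>) {
        self.predecessors = Some(predecessors);
    }

    fn get(&self) -> &[Vec<usize>] {
        self.predecessors
            .as_ref()
            .expect("predecessor cache queried before being populated")
    }
}
```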

locks seem wrong

  • layout-depth
    • I think this really wants to walk the stack and count the number of active layout frames, rather than being a counter. Or it could be carried along with the stack.

unclear, would like feedback

  • error emitter, crate metadata store, codemap, filemap, parser session
    • the main reason they use RefCell so extensively is that they are old, and mutability in Rust used to be far easier. But maybe it’d be nice to be able to parse source files in parallel, in which case some amount of synchronization is needed? I’d love to see more of this stuff pushed back into queries, though. Thoughts would be welcome here.
    • cc @petrochenkov, @eddyb, @estebank
  • session
    • lint-store etc
      • surely this doesn’t have to be as wacky as it is
      • cc @manishearth
    • next-node-id
      • no idea what this is all about
  • derive_macros
  • optimization fuel
    • this whole premise seems to require a single thread
    • I guess we should just lock to one thread when optimization fuel is given, to force a deterministic ordering
10 Likes

Some meta comments:

  • There’s not that much stuff in that list.
  • In my ideal world, we’d make up a list of the refactorings and plow through them:
    • Merging the session would be the biggest one, but it’d be a great win.
  • It seems ok in some cases to have some extra locks for a time too.
1 Like

Taking a quick look at the "core structures that will require synchronization", I noticed some synchronization performance low-hanging fruit. Wishing I had more time to dedicate to this...

perf-stats

  • these are just hacky little counters and things

As far as I can tell, this is just a bunch of accumulators, and rustc does not need to know their transient values during compilation, only the final result at the end. Whenever you find data which follows this general pattern, you can use a simple trick to get your synchronization overhead near zero:

  • Each thread accumulates the statistics in a local variable
  • All the local variables are merged into a global result at the end
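A minimal sketch of this pattern using rayon’s fold/reduce (the stats struct and counter are hypothetical):

```rust
use rayon::prelude::*;

#[derive(Default)]
struct PerfStats {
    symbols_hashed: u64, // hypothetical counter
}

impl PerfStats {
    fn merge(mut self, other: PerfStats) -> PerfStats {
        self.symbols_hashed += other.symbols_hashed;
        self
    }
}

fn process_items(items: &[u32]) -> PerfStats {
    items
        .par_iter()
        // Each rayon worker accumulates into its own local PerfStats...
        .fold(PerfStats::default, |mut stats, _item| {
            stats.symbols_hashed += 1;
            stats
        })
        // ...and the locals are merged once at the end, so the hot path
        // involves no shared-memory synchronization at all.
        .reduce(PerfStats::default, PerfStats::merge)
}
```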

tx_to_llvm_workers2

  • this is just a channel, seems fine

It seems like the fix is easy here: just clone the mpsc::Sender instead of keeping it in the shared state. If the locking solution is correct (i.e. threads do not depend on the order in which data is sent down the pipe), this solution will be correct too, and avoid double synchronization.
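A small self-contained sketch of the cloning approach:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    let handles: Vec<_> = (0..4)
        .map(|i| {
            // Clone the Sender per thread instead of sharing one behind
            // a Mutex; the channel is multi-producer by design.
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send(format!("message from worker {}", i)).unwrap();
            })
        })
        .collect();

    drop(tx); // drop the original so `rx` sees disconnection at the end

    for msg in rx {
        println!("{}", msg);
    }
    for handle in handles {
        handle.join().unwrap();
    }
}
```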

7 Likes

Heh, indeed! "MPsc"...

Thanks for compiling this list, @nikomatsakis!

  • Hopefully the crate metadata store can be immutable + queries. The main reason it keeps being a special case is that it is used before the tcx exists, I think.
  • perf-stats can also just be removed (and maybe be replaced with something more general later).
  • The codemap conceptually should be just another interner.
  • tx_to_llvm_workers might go away if we de-querify compile_codegen_unit, which we might do for other reasons anyway.
  • next-node-id is for generating NodeIds. It can be an atomic counter.
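For the last point, a sketch of what an atomic counter could look like (hypothetical names; the real NodeId is a rustc newtype, not a bare u32):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct NodeIdAllocator {
    next: AtomicU32,
}

impl NodeIdAllocator {
    fn next_node_id(&self) -> u32 {
        // fetch_add hands out unique, monotonically increasing ids
        // without a lock (unique, but not assigned in a deterministic
        // order across threads).
        self.next.fetch_add(1, Ordering::Relaxed)
    }
}
```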

I too think that merging session and tcx would be one of the first things we should do. That would give us the tools to solve most of the other problems.

Well, that may not be true if we push queries further back. I think that is a key question.

Yes, that’s what I meant with “merging session and tcx would give us the tools”.

Did we ever resolve the string interner? I remember that @Zoxc had https://github.com/rust-lang/rust/pull/46972, but it never landed, and it was a perf hit.

Agreed.

I spoke to @Zoxc a bit over IRC. They expressed the desire to land their branch and then work through the refactorings described above. I am definitely sympathetic to this – I’d hate for the branch to bitrot, and it clearly shows advantages now, so it’d be nice to be exploring this. This will also allow us to explore various things in parallel (e.g., optimizations to how the query structure is setup can be done at the same time as removing shared state).

Basically, so long as sequential performance and correctness do not regress, I think it makes sense to land code. It might be nice, though, to fix the layout counter case, which I believe is “actually wrong”.

3 Likes

I don't feel that comfortable suddenly parallelizing everything. This seems like a pathway to hard-to-debug bugs galore. I would much rather see a bunch of smaller chunks parallelized, with each chunk extensively fuzz tested for a while between chunks landing.

Note that it would not be enabled by default. I'm not sure if you were thinking that it would be.

No, but my point is that I think it will be a lot easier to parallelize bit by bit, rather than all at once.

I’m not sure I entirely agree with that – that is, what @Zoxc has done is basically to parallelize one core mechanism (queries). It would be hard to parallelize individual queries, as they all work through a uniform mechanism.

However, it occurs to me that inserting other uses of rayon can certainly be done. For example, NLL’s region inference, as well as the data flow mechanism, might well be readily parallelizable.

I see what you mean. Is it possible to serialize some types of queries without losing performance? For example, perhaps we can move backward through the "pipeline" (i.e. parallelize trans before MIR before HIR before the parser). That way we know that (say) the only parallelism is coming from trans, so any new bugs are likely there. Then once we are confident that that works, we can parallelize MIR queries, and so on.

All of this is a bit hand wavy in my mind though...

It might be fairly easy to add a debug mode that serializes everything (by going through a global mutex or something) and lets you opt into parallel execution on a query-by-query basis. Sounds like an idea to keep in mind.
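A rough sketch of what that debug mode could look like (all names hypothetical):

```rust
use std::sync::Mutex;

// Hypothetical global used only in the debug mode described above.
static SERIALIZE_QUERIES: Mutex<()> = Mutex::new(());

fn run_query<R>(parallel_allowed: bool, query: impl FnOnce() -> R) -> R {
    if parallel_allowed {
        // Queries opted into parallel execution run freely.
        query()
    } else {
        // Everything else is forced through a single global mutex,
        // serializing execution for easier debugging.
        let _guard = SERIALIZE_QUERIES.lock().unwrap();
        query()
    }
}
```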

3 Likes

I agree that we should not wait for all the parts of the compiler to be refactored into a perfect state before merging. We should do a careful review of the changes though. It would be nice if we could do it in a series of PRs, each of which is not too big. That would reduce the risk for each PR to bitrot and it would make it easier to review in a timely fashion.

4 Likes