Parallelizing Query Execution - All Hands 2018 Discussion Summary


#1

At the Rust 2018 All Hands a couple of weeks ago the compiler team discussed, among other things, the way forward for parallelizing query evaluation. The TL;DR is that

  • we basically want to continue following the course laid out by @Zoxc here, but for the initial version we prefer to use a more traditional thread pool instead of a fiber-enhanced version of Rayon, and

  • we will not pursue end-to-end querification (i.e. querifying parsing, macro expansion, and name resolution) in the immediate future, as not have too many things in flight at once.

We also discussed some of the particular instances of mutable state in the compiler and how they should be handled with respect to concurrent access. More details below.

State of Ongoing Work

At the beginning of our parallelization time slot, we tried to verify that the basic concept of parallelizing the compiler at the query level is a good strategy and concluded that we still think it’s a good idea. We also discussed the current implementation approach – that is, slowly merging changes from @Zoxc’s branch into master, reviewing as we go, making sure that single-threaded performance does not regress, and taking note of necessary future refactorings. This approach still seems like a sensible one.

Threads, Rayon, Fibers

While talking about potential problems with the planned architecture, the use of Rayon-with-fibers instead of regular OS threads came up as the biggest unknown and a potential risk:

  • No one in the compiler team (except for @Zoxc maybe) has any substantial experience using fibers.
  • We want parallelization to integrate well with the Make jobserver. Retrofitting Rayon to do so seems harder than implementing this in a custom-tailored thread pool. (Although @nmatsakis says that this might be a longterm goal for Rayon)
  • @nagisa noted that, in their experience, fibers well-suited for massively concurrent scenarios with blocking I/O, but less so for parallelizing workloads.

Overall, the team was more comfortable with using regular OS-threads for the initial version of parallel-queries. This does not preclude that we might want to look into a fiber-based version in the future.

End-to-end queries

A topic that was discussed in a different session of the compiler team was “end-to-end queries”, i.e. making the entire compilation process query-based instead of just the middle part. This is desirable because only queries profit from incremental and parallel evaluation. However, end-to-end queries also require some major refactorings and in the short term, the querification of the early compilation pipeline would be an ineffective one: parsing, macro expansion, and name resolution would each become one monolithic query, thus practically nullifying the potential benefits of incrementalization and parallelization. Since the compiler usually spends only around 5% of its time in these phases, we decided that it was not worth the additional complexity of trying to pursue end-to-end queries and parallelization at the same time. Instead we’d like to push on parallelization first and tackle end-to-end queries at a later point in time.

Measuring Performance

We would like to start measuring performance of the compiler built with #[cfg(parallel_queries)]. Longterm, we obviously want the compiler to efficiently use all available CPU cores, but a question that is of interest much sooner is whether we can always build the compiler with locks in place without noticeably degrading single-threaded execution. Getting an idea of the expected synchronization overhead would be an important first step.

Diagnostics Emission

Some ideas were thrown around on how to handle diagnostics being generated by the compiler concurrently. The only solid conclusion was that we’ll need some kind of buffering in order for messages not to be too indeterministic, but we definitely don’t want to buffer diagnostics until the compiler process ends. Limited-time buffering seems like a promising approach.

Miscellaneous Thread-Safety Issues

We also discussed particular instances where the compiler maintains mutable state that needs to be made thread-safe. However, I won’t go into this here. All discussed items are either listed in Parallelizing rustc using Rayon or in https://github.com/rust-lang/rust/issues/48685.

Overall, this was a very re-assuring session and everybody praised the work that @Zoxc has already put into making a parallel Rust compiler a reality.


#2

I’m not sure how you imagine using regular OS-threads here. I can’t think of any designs which come close to the performance of using Rayon + fibers without the associated complexity. If we temporarily use a slow design it wouldn’t be a good idea to use it to evaluate the performance of potential changes and we’d need to fall back to Rayon anyway.

My branch exposes some Rayon abstractions to rustc: par_iter, join and scope. This is then used to parallelize code in rustc. Would the thread-pool using regular OS-threads expose these abstraction too?

I’m not convinced that integration with the Make jobserver would be easier in a custom thread pool, unless said thread pool is significantly simpler (say lacking work-stealing) and thus has lower performance.

I’d like to hear the reason why @nagisa thinks fibers are less well-suited for parallelizing workloads.

I this approach could be improved by merging the changes faster :smirk:


#3

After the work week, I’ve been thinking that it might be a good idea to start out with regular Rayon as the thread pool, with threads just blocking while they wait for a query. IIRC, that’s what @nikomatsakis suggested.

Yes, you are probably right. Supporting the jobserver in Rayon might not be that hard.

Sigh, yeah, I hear you, but we only have limited resources allocated to this area unfortunately.


#4

Fibers (green threads, light threads) are well suited for concurrent workloads due to needing very little memory to operate and this matters when you have thousands upon thousands of these fibers. In return, you pay for fibers in increased complexity of implementation (outside of Windows, which happens to have built-in fibers), and a slight increase in scheduling cost.

In parallel workloads, which is what compilation is in most part, you usually end up with fewer threads than number of logical cores and saturate the threads you have, which mostly negates the benefits of fiber size and still incurs the cost of switching fibers[^1].

[^1]: though, to be fair, if you never yield there’s no overhead :slight_smile:.

Your point about DAGs is very valid and any sensible design to solve that problem will probably arrive at some sort of fiber implementation anyway *shrugs*.


My concerns then are as thus:

  1. If it is possible to dead-lock queries in a thread-pool setting without fibers, the same should be true for a single threaded compiler, no?
    • Do we want to keep the single-threaded-no-fibers-vintage-and-very-reproducible mode for debugging purposes?
  2. This makes porting rustc to platforms not supported by whatever library we use for fibers significantly more difficult.
    • context lists support for: x86, arm, powerpc, mips;
    • rustc is successfully used on e.g. sparc, s390x systems too AFAIK, not to mention all the operating systems that aren’t linux/osx/windows.

#5

I think blocking threads on queries would avoid deadlocks excluding query cycles. It won’t be ideal from a performance standpoint, but it requires less changes to Rayon. We could use a similar deadlock detection scheme as my Rayon fork, by just tracking how many threads are blocked on queries and waking up all thread block on any query when a query with a waiter completes. On a deadlock, it would trigger the same cycle recovery code as my current branch using fibers does.

I don’t agree with this characteristic of fibers. Their fundamental memory usage is the same as threads. Some operating systems just implement threads inefficiently. If you want memory efficiency you reach for async/await, not fibers. The trade-off being async/await have slower function calls and can’t deal well with recursion, both of which makes it unsuitable for rustc. Fibers are also easier to implement than threads, since you do not have to deal with preemption. Most OSes come with an implementation of threads though. The lack of preemption for fibers means that they can context switch with the same cost as virtual function calls. I’m not sure why you brought up scheduling cost, as that is an ortogonal concern. You can have a cheap or expensive scheduler with both threads and fibers.

I think we should keep it around for a number of reasons. It makes it easier to port rustc which you listed as a concern. It may be useful for debugging. It may be useful for running rustc on platforms with expensive atomic operations. It would also allow us to keep track of the overhead caused by atomic operations.


#6

I think this approach makes sense, even if it just a first step. We can always improve on it in the future.