I think a good way forward would be to identify which state really needs to be mutably shared and which doesn’t (as @nikomatsakis has started doing), identify cases where state is mutably shared unnecessarily, and then make a list of concrete refactorings to get rid of those cases.
I’ve been thinking a bit about parallelizing incr. comp. and I think at least the coloring part can be synchronized without locks, just with atomic-compare-exchange operations.
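To illustrate what I mean, here is a rough sketch (with made-up names and an assumed color encoding, not rustc’s actual data structures) of publishing a dep-node color with a single compare-exchange, so that only the first thread’s result is kept:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Assumed encoding for this sketch: 0 = not yet colored, 1 = red, 2 = green.
const UNCOLORED: u8 = 0;

// Hypothetical table with one color slot per dep-node.
struct Colors {
    slots: Vec<AtomicU8>,
}

impl Colors {
    /// Try to publish `color` for `node`. If another thread won the race,
    /// return the color it already stored.
    fn color_or_get(&self, node: usize, color: u8) -> u8 {
        match self.slots[node].compare_exchange(
            UNCOLORED,
            color,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => color,            // we colored the node first
            Err(existing) => existing, // another thread got there before us
        }
    }
}
```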
There’s also a pattern that, interestingly, seems to occur quite often in the tasks the compiler needs to do. Synchronization can often happen in a two-step fashion: first we use a (possibly lock-free) dictionary to synchronize allocating a shared data structure between threads (e.g. a query state, an interned type, maybe dep-nodes), and then, if that shared data structure is not conceptually immutable, it can contain something that is used to synchronize it. The important thing is that we don’t have to keep the synchronized dictionary locked while making progress.
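As a rough sketch of that two-step pattern (all names here are hypothetical, not rustc’s actual types): the dictionary is only locked long enough to find or allocate the entry, and all further synchronization happens on the entry itself:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical per-entry state; in rustc this could be a query's
// in-progress/completed state, an interned type, etc.
struct Shared {
    inner: Mutex<Option<String>>, // element-level synchronization
}

struct Registry {
    // Container-level synchronization: held only to find or allocate the
    // shared entry, never while the entry is being worked on.
    map: Mutex<HashMap<u32, Arc<Shared>>>,
}

impl Registry {
    fn get_or_alloc(&self, key: u32) -> Arc<Shared> {
        let mut map = self.map.lock().unwrap();
        map.entry(key)
            .or_insert_with(|| Arc::new(Shared { inner: Mutex::new(None) }))
            .clone()
        // The map lock is dropped here; callers then synchronize through
        // `Shared::inner` without blocking access to other keys.
    }
}
```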
Another major area to add to @nikomatsakis’ list is probably the CrateStore infrastructure, which contains lots of weird things and is used before the tcx is created.
Just a few comments from a lurking multi-threading nerd:
I agree that it is most important to start by questioning the very existence of shared mutable state. In a threaded environment, synchronizing that state always comes at a cost in overhead, scalability, and program complexity, which must be motivated by a significant gain elsewhere (e.g. caching of an expensive computation).
Using hierarchical synchronization which separates container access from element access is a double-edged sword. It benefits overall scalability if accesses are well spread across the container, at the expense of requiring more synchronization transactions overall, which in turn increases overhead. For short-lived transactions and under light contention, it can be better to stick with synchronizing the whole container. This sort of trade-off is best evaluated through analysis of existing data structure usage patterns and comparative performance studies of proposed solutions.
Similarly, lock-free vs locking is a delicate trade-off, where I would advise caution. It is important to remember that an uncontended lock requires nothing more than an atomic exchange followed by in-place modification, whereas many popular lock-free algorithms rely on fallible compare-exchange or load-linked/store-conditional instructions, extra dynamic memory allocations, and either linked data structures (inefficient at read time) or copy-update patterns (inefficient at write time). For these lock-free algorithms, the extra overhead only pays off in read-mostly scenarios or in an intermediate contention regime where…
Contention is high enough for the scalability gain to offset the overhead cost.
But it is not high enough for the fallible nature of compare-exchange and load-linked/store-conditional to start causing too much wasted work.
Add to this that not all lock-free algorithms are born equal in terms of overhead and scalability characteristics, and here again, it seems important to keep the trade-offs in sight, and to make an informed choice on a case-by-case basis.
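To make the “an uncontended lock is just an atomic exchange” point concrete, here is a toy spinlock (purely illustrative, not something I’m proposing for rustc):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    fn lock(&self) {
        // swap returns the previous value: `false` means the lock was free
        // and we now own it, i.e. the uncontended path is a single exchange.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop(); // contended path: spin and retry
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

After `lock()` returns, the protected data can be modified in place with plain loads and stores, with no extra allocation or copy-update step involved.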
Thank you, @HadrienG, I think this is very valuable input. Another thing to note is probably that many lock-free data structures work best with garbage collection available.
This is linked to a MIR and cleared when it changes. Now that MIR moves linearly through the system, I suspect we could recode it to use &mut, or maybe make the computation more explicit (e.g., the cache is populated via an explicit action and then queried, and if it is not populated we get a panic). Synchronization here seems really wasteful in any case.
trait system
This ties into the WG-traits plans; I would like to completely revamp how trait caching works anyway.
I think this really wants to walk the stack and count the number of active layout frames, rather than being a counter. Or it could be carried along with the stack.
The main reason they use RefCell so extensively is that they are old and mutability in Rust used to be far easier. But maybe it’d be nice to be able to parse source files in parallel, in which case some amount of synchronization is needed? I’d love to see more of this stuff pushed back into queries though. Thoughts would be welcome here.
Taking a quick look at the "core structures that will require synchronization", I noticed some low-hanging fruit for synchronization performance. Wishing I had more time to dedicate to this...
As far as I can tell, this is just a bunch of accumulators, and rustc does not need to know their transient values during compilation, only the final result at the end. Whenever you find data which follows this general pattern, you can use a simple trick (sketched after the list) to get your synchronization overhead near zero:
Each thread accumulates the statistics in a local variable
All the local variables are merged into a global result at the end
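A minimal sketch of that trick, with a made-up `Stats` type standing in for the real accumulators:

```rust
use std::thread;

// Hypothetical stand-in for rustc's statistics accumulators.
#[derive(Default)]
struct Stats {
    items_checked: u64,
}

impl Stats {
    fn merge(&mut self, other: &Stats) {
        self.items_checked += other.items_checked;
    }
}

fn count_in_parallel(chunks: &[Vec<u32>]) -> Stats {
    let mut total = Stats::default();
    thread::scope(|s| {
        // Step 1: each thread accumulates into its own local value,
        // with no synchronization on the hot path.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| {
                s.spawn(move || {
                    let mut local = Stats::default();
                    local.items_checked += chunk.len() as u64;
                    local
                })
            })
            .collect();
        // Step 2: merge the per-thread results once, at the end.
        for handle in handles {
            total.merge(&handle.join().unwrap());
        }
    });
    total
}
```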
It seems like the fix is easy here: just clone the mpsc::Sender instead of keeping it in the shared state. If the locking solution is correct (i.e. threads do not depend on the order in which data is sent down the pipe), this solution will be correct too, and avoid double synchronization.
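Roughly what I have in mind (a standalone sketch, not the actual rustc code):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    let workers: Vec<_> = (0..4u32)
        .map(|id| {
            // Each worker owns its own clone of the Sender, so sending
            // needs no shared state and no lock.
            let tx = tx.clone();
            thread::spawn(move || tx.send(id).unwrap())
        })
        .collect();

    drop(tx); // drop the original Sender so `rx` sees end-of-stream
    for worker in workers {
        worker.join().unwrap();
    }
    // The order of received values is unspecified; the receiver must not
    // depend on it, just like with the locking version.
    let received: Vec<u32> = rx.iter().collect();
    assert_eq!(received.len(), 4);
}
```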
Hopefully the crate metadata store can be immutable + queries. The main reason it keeps being a special case is that it is used before the tcx exists, I think.
perf-stats can also just be removed (and maybe replaced with something more general later).
The codemap conceptually should be just another interner.
tx_to_llvm_workers might go away if we de-querify compile_codegen_unit, which we might do for other reasons anyway.
next-node-id is for generating NodeIds. It can be an atomic counter.
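For example (a sketch with a hypothetical global, not the actual Session field):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical stand-in for the session's NodeId counter.
static NEXT_NODE_ID: AtomicU32 = AtomicU32::new(0);

fn next_node_id() -> u32 {
    // fetch_add returns the previous value, so every caller gets a
    // distinct id without taking any lock.
    NEXT_NODE_ID.fetch_add(1, Ordering::Relaxed)
}
```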
I too think that merging session and tcx would be one of the first things we should do. That would give us the tools to solve most of the other problems.
I spoke to @Zoxc a bit over IRC. They expressed the desire to land their branch and then work through the refactorings described above. I am definitely sympathetic to this – I’d hate for the branch to bitrot, and it clearly shows advantages now, so it’d be nice to be exploring this. This will also allow us to explore various things in parallel (e.g., optimizations to how the query structure is set up can be done at the same time as removing shared state).
Basically, so long as sequential performance and correctness do not regress, I think it makes sense to land code. It might be nice though to fix the layout counter case, which I believe is “actually wrong”.
I don't feel that comfortable suddenly parallelizing everything. This seems like a pathway to hard-to-debug bugs galore. I would much rather see a bunch of smaller chunks parallelized, with each chunk extensively fuzz tested for a while between chunks landing.
I’m not sure I entirely agree with that – that is, what @Zoxc has done is basically to parallelize one core mechanism (queries). It would be hard to parallelize individual queries, as they all work through a uniform mechanism.
However, it occurs to me that inserting other uses of rayon can certainly be done. For example, NLL’s region inference, as well as the data flow mechanism, might well be readily parallelizable.
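To sketch what such a localized use of rayon could look like (purely illustrative; this is not the actual NLL or dataflow code):

```rust
use rayon::prelude::*;

// Hypothetical: run an independent per-item analysis across rayon's
// thread pool and collect the results.
fn analyze_all(items: &[u64]) -> Vec<u64> {
    items
        .par_iter()            // rayon splits the slice across worker threads
        .map(|&item| item * 2) // stand-in for a per-item computation
        .collect()
}
```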
I see what you mean. Is it possible to serialize some types of queries without losing performance? For example, perhaps we can move backward through the "pipeline" (i.e. parallelize trans before MIR before HIR before the parser). That way we know that (say) the only parallelism is coming from trans, so any new bugs are likely there. Then once we are confident that that works, we can parallelize MIR queries, and so on.
All of this is a bit hand wavy in my mind though...
It might be fairly easy to add a debug mode that serializes everything (by going through a global mutex or something) and lets you opt into parallel execution on a query-by-query basis. Sounds like an idea to keep in mind.
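Something along these lines, perhaps (a rough sketch with hypothetical names):

```rust
use std::sync::Mutex;

// Global lock used only in the "serialize everything" debug mode.
static SERIALIZE_QUERIES: Mutex<()> = Mutex::new(());

// Queries that have not been opted into parallel execution take the global
// lock, so at most one of them runs at a time; opted-in queries skip it.
fn run_query<R>(parallel_ok: bool, run: impl FnOnce() -> R) -> R {
    let _guard = if parallel_ok {
        None
    } else {
        Some(SERIALIZE_QUERIES.lock().unwrap())
    };
    run()
}
```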
I agree that we should not wait for all the parts of the compiler to be refactored into a perfect state before merging. We should do a careful review of the changes, though. It would be nice if we could do it as a series of PRs, each of which is not too big. That would reduce the risk of each PR bitrotting and make it easier to review them in a timely fashion.