A successful PR landing cycle for the compiler now takes over 3 hours. This means that roughly 7.5 PRs can land per 24 hours, which often isn’t enough to keep up with current demand. Also, intermittent failures are common enough that this maximum of 7.5 often isn’t reached.
As a result, PR landing times have been very long and unpredictable lately, and roll-ups have been used more, which has risks. (Have you ever tracked a performance regression down to a roll-up with 20+ PRs in it? It’s frustrating.)
I have two suggestions on how to improve things.
Better priorities
First, to improve predictability, we should change how PRs are prioritized in the queue. I think we should retain the current two-tier system, but change the second tier. Specifically:
-
First, PRs are ordered by their priority level. All PRs default to level 0; people with appropriate permissions can increase the priority level for time-sensitive PRs. (This is the same as the current system.)
-
Second, PRs are ordered by the time (oldest first) that they received their first r+. (This is different to the current system, which orders PRs by PR number.)
The current approach means that older PRs automatically “cut the queue” ahead of newer PRs with the same priority level. PRs frequently sit in the queue for days and days, hovering around the same position, sometimes even moving backwards, as they are repeatedly bumped by lower-numbered PRs. This feels unfair, and also makes landing times highly variable and unpredictable.
It’s important to use the time of first r+, rather than time of latest r+, so that a PR can regain its position in the queue even if conflicts occur and it needs to be rebased.
This feels like a fairer and more predictable approach than the current one. It will reduce variance and worst-case behaviour. About the only downside I can see is that if you have to wait a long time for a review you don’t get any credit for that. But to me that feels like a minor thing; I personally spend much more time waiting on the queue than on reviews.
Parallelism
Second, to improve throughput and reduce the cost of intermittent failures, we should use speculative parallel testing. The “Not Rocket Science” rule says that you can’t test multiple PRs individually and then assume they’ll integrate nicely, so we have to be slightly more clever.
Here’s one simple possibility. If the first two PRs in the queue are A and B, we test the following in parallel: A+B, A, B. Here are the possible outcomes, and appropriate actions that could be taken.
# | A+B A B | Possibilities/actions
------------------------------------
1 | 1 1 1 | All succeeded.
| | -> Land A, land B
| |
2 | 1 1 0 | * B intermittently failed (likely); or
| | * B is broken but A fixes it (unlikely).
| | -> Optimistic: Land A, land B
| | -> Pessimistic: Land A, requeue B
| |
3 | 1 0 1 | * A intermittently failed (likely); or
| | * A is broken but B fixes it (unlikely).
| | -> Optimistic: Land A, land B
| | -> Pessimistic: Land B, requeue A
| |
4 | 1 0 0 | * A and B both intermittently failed (likely); or
| | * A and/or B are broken individually, but work together
| | (unlikely).
| | -> Optimistic: Land A, land B
| | -> Pessimistic: Requeue A, reject B
| | (don't requeue both, could cause infinite loops)
| |
5 | 0 1 1 | * A+B failed intermittently (likely); or
| | * A+B fail when combined (unlikely).
| | -> Land A, requeue B
| |
6 | 0 1 0 | * A+B and B both intermittently failed (unlikely); or
| | * B is broken (likely).
| | -> Land A, reject B
| |
7 | 0 0 1 | * A+B and A both intermittently failed (unlikely); or
| | * A is broken (likely).
| | -> Land B, reject A
| |
8 | 0 0 0 | * Any combination of intermittent and real failures is
| | possible.
| | -> Reject A, reject B
(Note: you may need to scroll within the table to see all the entries.)
In the optimistic actions for outcomes 2, 3, and 4, we would land regardless of what happens with A and B individually. As a result, it is unlikely but possible that we could end up with a broken revision in the history, but that revision won’t be the tip. This is similar to how roll-ups work, except they only test A+B. If that level of optimism is deemed unacceptable, then the pessimistic actions could be used instead.
The upside of this is that a lot of the time we will end up landing two PRs in a single landing slot. And when intermittent failures occur, we will often still land one PR.
The downside is that it roughly triples machine time. I’m working on the assumption that running sufficient machines in parallel is possible and not too expensive; let me know if that’s wrong.
I’ve shown the simplest case where we only combine two PRs. But we could do more: three at once (5 combinations: A+B+C, A+B, A, B, C), four at once (7 combinations: A+B+C+D, A+B+C, A+B, A, B, C, D), etc. The higher numbers would allow higher throughput, at the cost of more machine time. Happily, the machine time growth rate is linear; if we test N PRs in tandem, we need 2N-1 machines running in parallel.