Homu queue woes, and suggestions on how to fix them

A successful PR landing cycle for the compiler now takes over 3 hours. This means that roughly 7.5 PRs can land per 24 hours, which often isn’t enough to keep up with current demand. Also, intermittent failures are common enough that this maximum of 7.5 often isn’t reached.

As a result, PR landing times have been very long and unpredictable lately, and roll-ups have been used more, which has risks. (Have you ever tracked a performance regression down to a roll-up with 20+ PRs in it? It’s frustrating.)

I have two suggestions on how to improve things.

Better priorities

First, to improve predictability, we should change how PRs are prioritized in the queue. I think we should retain the current two-tier system, but change the second tier. Specifically:

  1. First, PRs are ordered by their priority level. All PRs default to level 0; people with appropriate permissions can increase the priority level for time-sensitive PRs. (This is the same as the current system.)

  2. Second, PRs are ordered by the time (oldest first) that they received their first r+. (This differs from the current system, which orders PRs by PR number.)
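
The proposed ordering is easy to sketch. The following is a minimal illustration (the type and field names are mine, not Homu's internals):

```python
from dataclasses import dataclass

@dataclass
class QueuedPr:
    number: int                      # PR number (only a final tie-breaker here)
    priority: int = 0                # level 0 by default; raised for urgent PRs
    first_approved_at: float = 0.0   # timestamp of the *first* r+

def queue_order(prs):
    """Highest priority level first; within a level, oldest first r+ first."""
    return sorted(prs, key=lambda pr: (-pr.priority, pr.first_approved_at, pr.number))

# PR 200 was approved before PR 100, so it lands first at equal priority,
# even though it has a higher PR number.
prs = [
    QueuedPr(number=100, first_approved_at=2000.0),
    QueuedPr(number=200, first_approved_at=1000.0),
    QueuedPr(number=300, priority=1, first_approved_at=3000.0),
]
print([pr.number for pr in queue_order(prs)])  # [300, 200, 100]
```

Note that keying on `first_approved_at` rather than the latest r+ is what lets a rebased PR keep its place.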

The current approach means that older PRs automatically “cut the queue” ahead of newer PRs with the same priority level. PRs frequently sit in the queue for days and days, hovering around the same position, sometimes even moving backwards, as they are repeatedly bumped by lower-numbered PRs. This feels unfair, and also makes landing times highly variable and unpredictable.

It’s important to use the time of first r+, rather than time of latest r+, so that a PR can regain its position in the queue even if conflicts occur and it needs to be rebased.

This feels like a fairer and more predictable approach than the current one. It will reduce variance and worst-case behaviour. About the only downside I can see is that if you have to wait a long time for a review you don’t get any credit for that. But to me that feels like a minor thing; I personally spend much more time waiting on the queue than on reviews.


Speculative parallel testing

Second, to improve throughput and reduce the cost of intermittent failures, we should use speculative parallel testing. The “Not Rocket Science” rule says that you can’t test multiple PRs individually and then assume they’ll integrate nicely, so we have to be slightly more clever.

Here’s one simple possibility. If the first two PRs in the queue are A and B, we test the following in parallel: A+B, A, B. Here are the possible outcomes, and appropriate actions that could be taken.

  # | A+B A   B | Possibilities/actions
  1 | 1   1   1 | All succeeded.
    |           | -> Land A, land B
    |           |
  2 | 1   1   0 | * B intermittently failed (likely); or
    |           | * B is broken but A fixes it (unlikely).
    |           | -> Optimistic: Land A, land B
    |           | -> Pessimistic: Land A, requeue B
    |           |
  3 | 1   0   1 | * A intermittently failed (likely); or
    |           | * A is broken but B fixes it (unlikely).
    |           | -> Optimistic: Land A, land B
    |           | -> Pessimistic: Land B, requeue A
    |           |
  4 | 1   0   0 | * A and B both intermittently failed (likely); or
    |           | * A and/or B are broken individually, but work together
    |           |   (unlikely).
    |           | -> Optimistic: Land A, land B
    |           | -> Pessimistic: Requeue A, reject B
    |           |    (don't requeue both, could cause infinite loops)
    |           |
  5 | 0   1   1 | * A+B failed intermittently (likely); or
    |           | * A+B fail when combined (unlikely).
    |           | -> Land A, requeue B
    |           |
  6 | 0   1   0 | * A+B and B both intermittently failed (unlikely); or
    |           | * B is broken (likely).
    |           | -> Land A, reject B
    |           |
  7 | 0   0   1 | * A+B and A both intermittently failed (unlikely); or
    |           | * A is broken (likely).
    |           | -> Land B, reject A
    |           |
  8 | 0   0   0 | * Any combination of intermittent and real failures is
    |           |   possible.
    |           | -> Reject A, reject B
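
The decision logic in the table can be expressed compactly. Here is a sketch (the function and names are mine, not Homu's); it follows the pessimistic column by default, and `optimistic=True` lands both PRs whenever A+B succeeds:

```python
def decide(ab_ok, a_ok, b_ok, optimistic=False):
    """Map the three build results to actions for PRs A and B.

    Returns {"A": action, "B": action}, where an action is "land",
    "requeue", or "reject".
    """
    if ab_ok:
        if optimistic or (a_ok and b_ok):       # outcomes 1-4 (optimistic), or 1
            return {"A": "land", "B": "land"}
        if a_ok:                                # outcome 2, pessimistic
            return {"A": "land", "B": "requeue"}
        if b_ok:                                # outcome 3, pessimistic
            return {"A": "requeue", "B": "land"}
        return {"A": "requeue", "B": "reject"}  # outcome 4: avoid requeue loops
    if a_ok and b_ok:                           # outcome 5
        return {"A": "land", "B": "requeue"}
    if a_ok:                                    # outcome 6
        return {"A": "land", "B": "reject"}
    if b_ok:                                    # outcome 7
        return {"A": "reject", "B": "land"}
    return {"A": "reject", "B": "reject"}       # outcome 8
```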


In the optimistic actions for outcomes 2, 3, and 4, we would land regardless of what happens with A and B individually. As a result, it is unlikely but possible that we could end up with a broken revision in the history, but that revision won’t be the tip. This is similar to how roll-ups work, except they only test A+B. If that level of optimism is deemed unacceptable, then the pessimistic actions could be used instead.

The upside of this is that a lot of the time we will end up landing two PRs in a single landing slot. And when intermittent failures occur, we will often still land one PR.

The downside is that it roughly triples machine time. I’m working on the assumption that running sufficient machines in parallel is possible and not too expensive; let me know if that’s wrong.

I’ve shown the simplest case where we only combine two PRs. But we could do more: three at once (5 combinations: A+B+C, A+B, A, B, C), four at once (7 combinations: A+B+C+D, A+B+C, A+B, A, B, C, D), etc. The higher numbers would allow higher throughput, at the cost of more machine time. Happily, the machine time growth rate is linear; if we test N PRs in tandem, we need 2N-1 machines running in parallel.
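
The combination counts can be checked mechanically. A small sketch (labels are illustrative): for N PRs we run every prefix combination of length ≥ 2 plus each PR alone, giving (N−1) + N = 2N−1 builds.

```python
def speculative_builds(prs):
    """For PRs [A, B, ...], return the builds to run in parallel:
    every prefix combination of length >= 2, plus each PR alone."""
    prefixes = ["+".join(prs[:k]) for k in range(len(prs), 1, -1)]
    return prefixes + list(prs)

print(speculative_builds(["A", "B"]))       # ['A+B', 'A', 'B']
print(speculative_builds(["A", "B", "C"]))  # ['A+B+C', 'A+B', 'A', 'B', 'C']
print(len(speculative_builds(["A", "B", "C", "D"])))  # 7, i.e. 2*4 - 1
```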


I think that the reason we've shied away from this before is literally cost, but @alexcrichton would know better.


I've been stewing on this for a while and I've come to agree with you.

There is no question in any case that, with the current rules, it is pretty common for me to see my PR hovering around position 3 or 4 for many days in a row, if not an entire week. Sometimes that's due to high priority rollups and so forth, but other times it's just things getting r+'d that happened to have a higher number.


This seems like a good thread to re-ask this question?

Why is Cargo.lock considered a binary for merging by homu? If the automatic text merge would result in a problem, that should be caught by building with --frozen, right?

Based on @nnethercote’s description, it feels like the person-hour cost here is pretty high, while adding the needed hardware power in theory wouldn’t cost that much. Is it possible for Mozilla (or some other organization) to sponsor some amount of money to get more hardware time from Travis? I’d have to guess that that would be an overall smaller cost than all of the human capital that is expended waiting for PRs and all of the issues that come with that.

Is the rustc repo currently treated like just any other open source repository?


Yeah, unfortunately we just don't have capacity for building much more than what we do today. We definitely don't have capacity to run two full parallel builds right now even if we wanted to :(. As @nikomatsakis mentioned, though, this is purely a matter of cost; if we had infinite money we could get all the resources we liked for CI!

FWIW in January the infra team plans on reevaluating our CI needs and usage to see if there's improvements that we can make, and CI time is always high up there :slight_smile:

Also FWIW this is a bug in our queue or a bug in our CI (in my opinion). We've historically shot for 2 hours per PR (3 hour max for fluctuations). With this headroom though it's very easy for regressions to sneak in slowly over time. As an example I spent maybe a half hour looking at the queue in the past week and shaved 20-30 minutes off builder time on Windows (#56360, #56391).

In any case if builders regularly take 3 hours that's something that needs fixing! We can typically do that with effort on our end which is often a bit easier to come across than a whole new set of builders!


Your strategy for intelligently and optimistically running multiple builds at the same time is what Zuul does. Zuul is the Homu equivalent for OpenStack, so it has been in production for a lot longer than Bors, and at much bigger scale. If we are rethinking our CI setup we should consider just using it.


Please no. This would mean that there is pressure for reviewers to act fast, because the longer a review takes, the further down the PR will be in the queue. I have several PRs right now that have been waiting for any reaction from the reviewer for more than a week, but that's okay because I know that when I eventually get my review, the PR will land fairly near the front of the queue. With your proposal, I would be extremely frustrated by the fact that I first have to wait two weeks for an r+, and then another week or so for the PR to land.

Also, speaking as a reviewer, it is nice to know that the PRs assigned to me will not be punished for me taking a bit longer to review. Rust is a hobby for me (most of the time, anyway), and there can be all sorts of reasons that a week goes by without me having the time to take a look at a PR. I would feel pressured into somehow making time for a review if that delay punishes PRs. That's not healthy.

It's not unfair at all! These other PRs have been waiting even longer than your PR (they are older, after all), they just have been waiting in a different state (waiting-on-review instead of waiting-on-bors).

I can only speak for myself, but most of my PRs spend very little time in waiting-on-author.

What about B+C, A+C?


I don't think this is a bad thing. Reviews should be a relatively high priority activity.

Not necessarily. Maybe the PR author posted something that was broken, or didn't gain r+. Maybe they then left it untouched for two weeks while they did something else. Doesn't matter, with the current system they still benefit.

Once or twice I have briefly considered filing placeholder PRs with some placeholder code that can then be overwritten with proper code later on. I never did it because it would be horribly rude, but I think it is revealing that the thought of gaming the system crossed my mind at all.


In that case I am afraid I will have to ask for myself to be removed from the reviewer list. I can help with reviews but I cannot guarantee low latency.

But I don't think that is reasonable. Many of us are volunteers. Reviews might be high priority in my set of "Rust-related things to do", but depending on how busy $DAYJOB and $LIFE are that can still take a while.

Let me game your new system then: When I get assigned a PR that I cannot review now because I am busy, I will "r+" and immediately "r-" it to avoid punishing the PR author.

I am not seriously proposing to do that, or suggesting I might. I just want to demonstrate that all systems can be gamed, so that fact is not an argument.

The current system is clearly not perfect, and your desire to game it originates there. That's fair. But that doesn't mean the fix is to prevent that particular kind of gaming.


No, the project pays (a lot) both to Travis and AppVeyor to get the CI resources we have right now.


Is "Travis run on push" in PRs included in the budget?
(The thing that "makes Travis green".)

It eats ~1.5 hours of CI time on every push, but I find it almost useless for myself because I test everything locally before pushing.

It should be, but even if that was removed I don't think we could have the capacity for two parallel builds anyway.

It's useful to have that early feedback without going through bors though, especially for rollups. I was able to cancel multiple rollups before bors started testing them because the PR build failed.

The “Better priorities” part is a red herring. It doesn’t matter how you sort the PRs in the queue if it takes more than a day to get through it, because it still forces somebody to wait for multiple days before their PR can land, and that’s awful. It also doesn’t matter how you sort the PRs in the queue if it takes less than a day, because the difference between being 2/5 or 4/5 is a lot less important than the difference between being 8/20 or 16/20.

Parallelism, on the other hand, would be awesome. Zuul sounds like a perfect fit for the Rust project at this point.


I disagree for the reasons I said above: better priorities reduce variance, reduce worst-case wait periods, increase predictability, and eliminate the sense of unfairness that comes from being cut in front of.

I think it's better and fairer if, say, everyone who gets in the queue on Friday waits 3 days for their PR to land, rather than if some of them take 4 hours and some of them take 7 days.

Three points:

  1. Please do not consider disabling an integrate-what-you-test gateway on the basis of this observation. There are at least two likely-good options (the following points) to pursue at length before considering giving it up. I know nobody’s suggested it yet but IME it always comes up in conversations like this, and it should be resisted with all your power.

  2. Profile the build and get aggressive about pruning costs in it. A casual sampling of the logs shows you’re spending at least 20 minutes per host rebuilding LLVM every time still (as has been the case since 2011 or so?) and 10 minutes on stage0 (?) and a variety of other things that give the impression of “low hanging fruit”. Similarly, as many of the long steps do seem to be rustc-building-rustc, maybe this is a good community signal to focus effort still further on making rustc faster. Pretty sure it’s still one of the top complaints about the compiler.

  3. Consider adopting the Bors-NG algorithm, in which the integrator attempts a maximal integration each time, bisecting down to either a successful multi-PR prefix (a “rollup”) or an isolatable/skippable failure. This is very much an improvement in the state of the art and (IMO) everything in this family of tools ought to be using that algorithm now.
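
As I understand the Bors-NG idea, the culprit-isolation shape looks roughly like this (a rough sketch, not Bors-NG's actual code): try the whole batch, and on failure split it to isolate failing PRs.

```python
def integrate(batch, passes):
    """Land the largest passing sub-batches; bisect failures to isolate culprits.

    `batch` is a list of PR ids; `passes(prs)` runs CI on the combined tree
    and returns True/False. Returns (landed, rejected).
    """
    if not batch:
        return [], []
    if passes(batch):
        return list(batch), []      # the whole batch is a successful "rollup"
    if len(batch) == 1:
        return [], list(batch)      # isolated a failing PR
    mid = len(batch) // 2
    landed1, rejected1 = integrate(batch[:mid], passes)
    landed2, rejected2 = integrate(batch[mid:], passes)
    return landed1 + landed2, rejected1 + rejected2

# Example: PR 3 is broken; everything else lands, and 3 is isolated.
landed, rejected = integrate([1, 2, 3, 4], lambda prs: 3 not in prs)
print(landed, rejected)  # [1, 2, 4] [3]
```

This sketch only shows the bisection; in practice the passing sub-batches would still need to be integrated against the final tree to respect the test-what-you-merge rule.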


If there is low hanging fruit the infra team would love to hear about it! There are occasional pushes motivated by timing-out builds where we discover things (alexcrichton's post up-thread being a great example), but it's certainly a little ad-hoc.

On your LLVM point in particular, I picked a random 'merge build' (Travis CI - Test and Deploy with Confidence) and had a look at IMAGE=x86_64-gnu - LLVM is from [00:18:55] to [00:23:51], which seems about right because we use sccache to download cached LLVM builds when we can. Without seeing the build you're referring to, it's tricky to say why you saw 20min.
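
For anyone eyeballing these logs, turning the bracketed [HH:MM:SS] markers into a duration is a one-liner's worth of work. A throwaway sketch:

```python
import re

def log_duration(start_marker, end_marker):
    """Compute the seconds elapsed between two [HH:MM:SS] log markers."""
    def to_seconds(marker):
        h, m, s = map(int, re.fullmatch(r"\[(\d+):(\d+):(\d+)\]", marker).groups())
        return h * 3600 + m * 60 + s
    return to_seconds(end_marker) - to_seconds(start_marker)

# The LLVM step mentioned above: just under 5 minutes, not 20.
print(log_duration("[00:18:55]", "[00:23:51]"))  # 296 seconds
```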

If you're interested in seeing breakdowns, I have a tool for doing so (though I do need to update it with more recent jobs, my last trawl of travis data is from 8 months ago) - https://aidanhs.com/logobble/. Selecting a job will show you a timing breakdown for all stages in the build, based on metrics added to the build logs.

The upshot is: we do care about build times and we do (try to) keep an eye on them! If anyone is interested in working on build times, or has questions about the build process (e.g. what stage0 is, how bors works), please visit #rust-infra on Discord!


Genuinely curious, are the amounts okay to share publicly? I'm just wondering how they ballpark with a rough estimate of person-hours. I realize this would be hard to quantify precisely since many people donate their time to Rust, but I still think it would be a useful "order-of-magnitude" comparison.

E.g. "Rust pays the rough equivalent of 20 average computer science person-hours of time per month on CI hardware"? Seems like using that to compare against a rough estimate of how much person-time is currently expended waiting would be useful.

I could be way off, but in my experience professionally person-hour salary costs tend to dominate these kinds of comparisons over hardware cost (especially lately). Again this is caveated that Rust is open source, etc..


yeah, we could always do a crowdfunding thing to pay for a couple of ec2 instances or something


I'm not actually sure, I'd prefer if Alex or Aidan reply to this.

Paying for custom hardware is not the problem, since some companies are willing to sponsor it. The problem is that using custom hardware means maintaining it, and most of the infra team is volunteer. We don't want to be in a situation where the CI is broken and nobody is available to fix it (whereas right now we can just send an email to Travis/AppVeyor and they'll fix it for us).
