Homu queue woes, and suggestions on how to fix them


Sure, or I could be misreading the docs entirely! It’s just worth considering that there are a few possible ways to implement the optimistic strategy, and they will affect both the bill and the net speedup in not-necessarily-linear ways.


I’m fairly certain this graph is accurate (but, if it seems wildly wrong, then perhaps it is – for reference, the code used to generate this is here: https://github.com/Mark-Simulacrum/landing-time; it does not include the gnuplot script and will likely not run on anyone else’s computer).

I’ve downloaded all rust-lang/rust PRs since, well, the repository existed, along with all associated comments, so if there’s other data that would be helpful here I can try to get it out. Currently I’m checking for approval via a simple “does the comment contain `bors r+` or `bors r=`” test, which seems to work fine (only one merged PR in the last 6000 was not approved this way).
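For concreteness, the approval check amounts to something like the sketch below. The function names are illustrative, not taken from the landing-time repo:

```python
# Sketch of the approval heuristic described above: a comment counts as an
# approval if it contains "bors r+" or "bors r=".
def is_approval(comment_body: str) -> bool:
    """Return True if a comment looks like a homu approval command."""
    return "bors r+" in comment_body or "bors r=" in comment_body

# A merged PR is treated as approved if any of its comments pass the check.
def pr_is_approved(comment_bodies) -> bool:
    return any(is_approval(body) for body in comment_bodies)
```

Note that this deliberately ignores `bors r-` and `bors retry`, since neither contains either approval substring.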

Based on the graph, it looks like most PRs land within ~1-2 days, with essentially all landing within ~5 days of the last approval. Note that if a PR bounces due to needing a rebase, we’ll count that as a “new” approval – I’ve not implemented any fancier detection there. Each point on this graph is a single PR; the axes are otherwise labeled.

Let me know if this is helpful – I’m not actually sure I’ve derived much meaning from it yet, so I’m not sure what my reaction is.


I think this is a crucial piece of information that is missing.


Okay – the new data is in. This time I’ve implemented “last approval” a little differently: it is now the last time the pull request was r+-ed after being in the “proposed” (i.e., just created) or “denied” (i.e., r--ed) state. Merge conflicts no longer affect the last approval. I’ve also changed the graph to show the 85th, 90th, 95th, and 98th percentiles. It becomes clear that there has been a spike to ~2 weeks of wait time in the last month or so, which is likely what brought this on. Generally, as we’ve gotten closer to the edition, merge times have gotten progressively worse. Note that the descent at the very end is because we’ve simply not merged many pull requests (and those that do merge, merge in rollups and thus land quickly).
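The revised “last approval” rule can be sketched as a tiny state machine. The event kinds and structure below are my own illustration of the rule, not code from the actual analysis:

```python
# A minimal sketch of the revised "last approval" rule: an r+ updates the
# last-approval time only when the PR was previously in the "proposed"
# (just created) or "denied" (r--ed) state; a merge-conflict bounce changes
# nothing.
def last_approval(events):
    """events: ordered list of (timestamp, kind) where kind is one of
    "approved", "denied", or "merge_conflict"; a PR starts as "proposed"."""
    state = "proposed"
    last = None
    for ts, kind in events:
        if kind == "approved":
            if state in ("proposed", "denied"):
                last = ts
            state = "approved"
        elif kind == "denied":
            state = "denied"
        # "merge_conflict" deliberately leaves both `state` and `last` alone
    return last
```

Under the earlier definition, a merge-conflict bounce followed by a fresh r+ would have reset the timestamp; under this one it does not.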


I’ve also put together CI duration over the past year (split into Travis and AppVeyor, plus an “overall” measurement). This is computed only for PRs that get merged themselves (i.e., are not rolled up). The overall measurement is from the time of the “Testing” comment posted by bors to the “Merging” comment posted by bors; the Travis and AppVeyor lines are drawn from data provided by their APIs.

The data does suggest that at least one reason queue length has been worse over the past month is that the cycle time for a single CI build has been higher – over 3 hours. I’ve not yet attempted to track down why CI builds take more than 3 hours; that should be impossible, as I believe both Travis and AppVeyor cap us at 3 hours… presumably not all builders start simultaneously.

The attached graph shows the 95th percentile with each week of data grouped together.


By the way, at the moment our AppVeyor timeout is 4 hours instead of 3.


The fact that this chart is measured in “days” and not in “hours” suggests we still need significant improvement here.


As an FYI, I’ve just cleaned up and published a basic DNS-over-HTTPS NSS module, which might be useful in the Travis DNS brownout scenario. I’m not using it on Travis, but the motivation was similar: have a resolver of last resort for when the regular DNS service is flaky but HTTPS connections still go through.


Thanks for linking this! Travis found the cause of the issue (something disables IP forwarding every day on their images, breaking all networking inside Docker) and they’re going to fix it on their end, so we shouldn’t need that. It also wouldn’t really help here, since all the networking goes away :(


Something that hasn’t been mentioned yet on this thread is a way to track intermittent errors; Servo currently does this with the intermittent tracker.

The way it works is that it pulls data from Servo’s GitHub issues labelled “I-Intermittent” to determine whether the failed test in the PR matches the filename reported on an issue, e.g. this Servo issue. The tracker does not automatically create new intermittent issues – a contributor/maintainer still has to create a new GitHub issue and label it appropriately.
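The lookup the tracker performs is, as I understand it, essentially a filename match against the open issues. The issue titles below are invented for illustration:

```python
# Rough sketch of the matching step: compare the failing test's filename from
# the PR's CI log against the open "I-Intermittent" issues.
def is_known_intermittent(failed_test: str, intermittent_issue_titles) -> bool:
    """True if any open I-Intermittent issue mentions this test file."""
    return any(failed_test in title for title in intermittent_issue_titles)
```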

With the current number of open issues in the rust repo, filing intermittent-error issues would balloon the issue tracker to a size that makes it hard to sift through the non-intermittent issues. If we go forward with implementing this on rust-lang, I would instead suggest keeping a top-level TOML file that lists all the intermittent tests, and requeuing the PR only when every failure on the PR matches an entry on the list.
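To make the suggestion concrete, here is one possible shape for such a file and the requeue rule it would drive. The filename, schema, and paths are all made up for illustration:

```python
# One possible shape for the suggested top-level TOML file:
#
#   # intermittent-tests.toml (hypothetical filename and schema)
#   intermittent = [
#       "made/up/path/first_test.rs",
#       "made/up/path/second_test.rs",
#   ]
#
# The requeue rule: retry only if every failure on the PR matches the list,
# so a genuine regression mixed in with known flakes still blocks the PR.
def should_requeue(failed_tests, intermittent_list) -> bool:
    known = set(intermittent_list)
    return bool(failed_tests) and all(t in known for t in failed_tests)
```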