Homu queue woes, and suggestions on how to fix them


#41

Sure, or I could simply be misreading the docs! It’s worth keeping in mind that there are a few possible ways to implement the optimistic strategy, and they will affect both the bill and the net speedup in ways that aren’t necessarily linear.


#42

I’m fairly certain this graph is accurate (though if it seems wildly wrong, then perhaps it isn’t). For reference, the code used to generate it is at https://github.com/Mark-Simulacrum/landing-time; it does not include the gnuplot script and will likely not run on anyone else’s computer.

I’ve downloaded every rust-lang/rust PR since the repository was created, along with all associated comments, so if there’s other data that would be helpful here I can try to extract it. Currently I’m detecting approval with a simple check for whether a comment contains “bors r+” or “bors r=”, which seems to work fine (only one of the last 6000 merged PRs was not detected as approved).
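For concreteness, the check amounts to something like the sketch below (the function name and test strings are made up; this is not the actual landing-time code):

```rust
// Minimal sketch of the approval heuristic described above; the function
// name is made up and the real landing-time code may well differ.
fn is_approval(comment_body: &str) -> bool {
    comment_body.contains("bors r+") || comment_body.contains("bors r=")
}

fn main() {
    assert!(is_approval("@bors r+ rollup"));
    assert!(is_approval("@bors r=nikomatsakis"));
    assert!(!is_approval("@bors retry"));
}
```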

Based on the graph, it looks like most PRs land within ~1-2 days, and essentially all land within ~5 days of the last approval. Note that if a PR bounces because it needs a rebase, the re-approval counts as a “new” approval; I’ve not implemented any fancier detection there. Each point on this graph is a single PR; the axes are otherwise labeled.

Let me know if this is helpful; I’m not sure I’ve derived much meaning from it yet, so I don’t have a strong reaction of my own.


#43

I think this is a crucial piece of information that is missing.


#44

Okay, the new data is in. This time I’ve implemented “last approval” a little differently: it is now the last time the pull request was r+-ed after being in the “proposed” (i.e., just created) or “denied” (i.e., r--ed) state. Merge conflicts no longer affect the last approval. I’ve also changed the graph to show the 85th, 90th, 95th, and 98th percentiles.

It becomes clear that there has been a spike to ~2 weeks of wait time in the last month or so, which is likely what brought this on. Generally, as we’ve gotten closer to the edition, merge times have gotten progressively worse. Note that the dip at the very end is there because we’ve simply not merged many pull requests recently (and those that do merge land in rollups, and therefore land quickly).
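Spelled out, the rule is roughly the little state machine below (the names, states, and timestamp handling are all made up for illustration; this is not the actual implementation):

```rust
// Hypothetical sketch of the "last approval" rule described above: a new
// last-approval time is recorded only when an r+ moves the PR out of the
// "proposed" or "denied" state; merge conflicts leave it untouched.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Proposed, // just created
    Approved, // r+-ed
    Denied,   // r--ed
}

struct Pr {
    state: State,
    last_approval: Option<u64>, // Unix timestamp of the last approval
}

impl Pr {
    fn on_approve(&mut self, at: u64) {
        if self.state == State::Proposed || self.state == State::Denied {
            self.last_approval = Some(at);
        }
        self.state = State::Approved;
    }

    fn on_deny(&mut self) {
        self.state = State::Denied;
    }

    fn on_merge_conflict(&mut self) {
        // Deliberately a no-op: bouncing on a rebase does not count as a
        // new approval under this definition.
    }
}

fn main() {
    let mut pr = Pr { state: State::Proposed, last_approval: None };
    pr.on_approve(1_000);   // first approval: recorded
    pr.on_merge_conflict(); // does not reset last_approval
    pr.on_approve(2_000);   // re-approval while already approved: ignored
    pr.on_deny();
    pr.on_approve(3_000);   // approval after an r-: recorded
    assert_eq!(pr.last_approval, Some(3_000));
}
```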


#45

I’ve also put together CI duration over the past year (split into Travis and AppVeyor, plus an “overall”). This is computed only for PRs that get merged themselves (i.e., are not rolled up). The overall measurement is from the time of the “Testing” comment posted by bors to the “Merging” comment posted by bors; Travis and AppVeyor lines are drawn from data provided by their APIs.
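Concretely, the “overall” number per PR is computed along these lines (the comment matching and data layout below are simplified assumptions, not the real code):

```rust
// Rough illustration of the "overall" measurement described above: the
// wall-clock gap between the bors "Testing" comment and the bors "Merging"
// comment on a PR that merged on its own.
struct Comment {
    author: String,
    body: String,
    created_at: u64, // Unix timestamp
}

fn overall_ci_duration_secs(comments: &[Comment]) -> Option<u64> {
    let testing = comments
        .iter()
        .find(|c| c.author == "bors" && c.body.contains("Testing"))?;
    let merging = comments
        .iter()
        .find(|c| c.author == "bors" && c.body.contains("Merging"))?;
    merging.created_at.checked_sub(testing.created_at)
}

fn main() {
    // Made-up comment bodies, just to exercise the function.
    let comments = vec![
        Comment {
            author: "bors".into(),
            body: "Testing commit abc with merge def...".into(),
            created_at: 0,
        },
        Comment {
            author: "bors".into(),
            body: "Merging def into master...".into(),
            created_at: 3 * 3600 + 600,
        },
    ];
    // Prints Some(11400), i.e. a bit over three hours.
    println!("{:?}", overall_ci_duration_secs(&comments));
}
```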

The data does suggest that at least one reason queue length has been worse over the past month is that the cycle time for a single CI build has been higher: over 3 hours. I’ve not yet attempted to track down why CI builds are taking more than 3 hours; that should be impossible, as I believe both Travis and AppVeyor cap us at 3 hours… presumably not all builders are starting simultaneously.

The attached graph shows the 95th percentile with each week of data grouped together.
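For reference, the grouping is essentially “bucket the samples by week, then take a nearest-rank 95th percentile per bucket”; a rough sketch (with a made-up data layout) looks like this:

```rust
use std::collections::BTreeMap;

// Sketch of the grouping behind the graph: bucket (timestamp, duration)
// samples into weeks and take a nearest-rank 95th percentile per bucket.
fn weekly_p95(samples: &[(u64, f64)]) -> BTreeMap<u64, f64> {
    const WEEK_SECS: u64 = 7 * 24 * 3600;
    let mut buckets: BTreeMap<u64, Vec<f64>> = BTreeMap::new();
    for &(timestamp, duration) in samples {
        buckets.entry(timestamp / WEEK_SECS).or_default().push(duration);
    }
    buckets
        .into_iter()
        .map(|(week, mut durations)| {
            durations.sort_by(|a, b| a.partial_cmp(b).unwrap());
            let rank = ((durations.len() as f64) * 0.95).ceil() as usize;
            (week, durations[rank.saturating_sub(1)])
        })
        .collect()
}

fn main() {
    let week: u64 = 7 * 24 * 3600;
    // Durations in hours: three samples in week 0, one in week 1.
    let samples = [(0, 1.5), (3_600, 2.0), (7_200, 3.2), (week + 10, 2.5)];
    for (w, p95) in weekly_p95(&samples) {
        println!("week {}: p95 = {:.1}h", w, p95);
    }
}
```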


#46

By the way, at the moment our AppVeyor timeout is 4 hours instead of 3.


#47

The fact that this chart is measured in “days” and not in “hours” suggests we still need significant improvement here.


#48

As an FYI, I’ve just cleaned up and published a basic DNS-over-HTTPS NSS module which might be useful in the Travis DNS brownout scenario. I’m not using it on Travis, but the motivation was similar: to have a resolver of last resort when the regular DNS service is flaky but HTTPS connections still go through.


#49

Thanks for linking this! Travis found the cause of the issue (something is disabling IP forwarding on their images every day, breaking all the networking inside Docker), and they’re going to fix it on their end, so we shouldn’t need that. It also wouldn’t really help here, since all the networking goes away :(


#50

Something that hasn’t been mentioned yet in this thread is a way to track intermittent errors; Servo currently does this with its intermittent tracker.

The way it works is that it pulls data from Servo’s GitHub issues labelled “I-Intermittent” and checks whether the failed test in the PR matches the filename reported on one of those issues, e.g. this Servo issue. The tracker does not automatically create new intermittent issues; a contributor/maintainer still has to file a new GitHub issue manually and label it appropriately.

With the current number of open issues in the rust repo, filing an issue for every intermittent error would balloon the issue tracker to a size that makes it hard to sift through the non-intermittent issues. If we do go forward with implementing this for rust-lang, I would instead suggest keeping a top-level TOML file that lists all the known intermittent tests, and requeuing the PR only when every failure on it matches an entry on that list.
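To sketch what I have in mind (the file name, format, and all identifiers below are hypothetical, and the matching is deliberately naive):

```rust
// Sketch of the suggestion above: a checked-in list of known intermittent
// tests, and a requeue decision that only fires when every failure on the
// PR matches an entry.
//
// A top-level file (say, intermittent-tests.toml) might contain entries like:
//
//   [[intermittent]]
//   test = "net/tcp_stress"
//   issue = 12345
struct KnownIntermittent {
    test: String, // substring identifying the flaky test
    issue: u64,   // tracking issue number, for reference
}

fn should_requeue(failures: &[String], known: &[KnownIntermittent]) -> bool {
    // Only requeue if there was at least one failure and every failure is
    // explained by a known intermittent.
    !failures.is_empty()
        && failures
            .iter()
            .all(|f| known.iter().any(|k| f.contains(&k.test)))
}

fn main() {
    let known = vec![KnownIntermittent {
        test: "net/tcp_stress".to_string(),
        issue: 12345,
    }];

    let flaky_only = vec!["test net/tcp_stress ... FAILED".to_string()];
    assert!(should_requeue(&flaky_only, &known));

    // An unlisted failure blocks the automatic requeue.
    let real_failure = vec!["test borrowck/regression ... FAILED".to_string()];
    assert!(!should_requeue(&real_failure, &known));
}
```

The key property is that any failure not on the list blocks the automatic requeue, so genuinely new breakage still surfaces.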