Putting bors on a PIP


#1

We have a problem: the average queue of ready-to-test PRs to the main Rust repo has been steadily growing for a year. And at the same time, the likelihood of merge conflicts is also growing, as we include more submodules and Cargo dependencies that require updates to Cargo.lock.

This problem could threaten our ability to produce Rust 2018 on time, because as the year progresses, there will be an increasing amount of contention for the PR queue. My goal in this post is to avoid this outcome, without reliving the Rust 1.0-era experience of Alex working full time to land massive rollups.

In particular, the Rust All Hands is coming up next week, and I think it’s a great opportunity to dive into these issues, so after chatting for a while with Alex I wanted to set out some ideas.


#2

So, I’ve actually had plans for fixing this for years; it’s just that I don’t have the time to do the work. Part of the issue is that bors isn’t testable yet, so making large changes is tough.


One thing to note is that rollups aren’t about prioritizing small PRs over large ones; they’re about prioritizing less risky PRs over risky ones. My personal heuristics are:

  • PRs marked “rollup” are not risky
  • PRs that touch the build system are risky
  • PRs that touch architecture-specific code are risky
  • PRs that are humongous are risky
  • PRs whose Travis runs haven’t finished yet are risky

Generally when doing rollups I’ll also mark particularly risky PRs as p=5 or so, so that if the rollup fails they get tested individually (the non-risky PRs will get in via a rollup eventually, but the risky ones would never drain out otherwise).

The kernel of the idea is that instead of the rollup flag meaning “please roll up this tiny PR”, we repurpose it to signal riskiness. We can have four levels: not risky at all (what “@bors rollup” is now), somewhat risky but okay to roll up, “probably do not roll up”, and “never roll up” (the last is for things like backports and perf-sensitive changes, for which we absolutely do not want a rollup). The queue then shows Travis statuses and lets you sort by risk level to create rollups.
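To make the proposal concrete, here is one way the four levels and the heuristics above might be encoded. This is only a sketch: the field names, the size threshold, and the ordering are all invented, not anything bors does today.

```python
from dataclasses import dataclass
from enum import IntEnum

class RollupMode(IntEnum):
    """Lower values mean less risk; sort ascending to build a rollup."""
    ALWAYS = 0    # not risky at all (today's "@bors rollup")
    MAYBE = 1     # somewhat risky but okay to roll up
    UNLIKELY = 2  # probably do not roll up
    NEVER = 3     # backports, perf-sensitive changes, etc.

@dataclass
class Pr:
    # Hypothetical PR attributes; a real bot would derive these
    # from the diff and the CI status.
    files_changed: int
    touches_build_system: bool = False
    touches_arch_code: bool = False
    travis_finished: bool = True
    marked_rollup: bool = False
    marked_never_rollup: bool = False

def classify(pr: Pr) -> RollupMode:
    """Map the heuristics from the post onto the four queue levels."""
    if pr.marked_never_rollup:
        return RollupMode.NEVER
    if pr.marked_rollup:
        return RollupMode.ALWAYS
    risky = (pr.touches_build_system or pr.touches_arch_code
             or pr.files_changed > 100 or not pr.travis_finished)
    return RollupMode.UNLIKELY if risky else RollupMode.MAYBE
```

Sorting the queue by `classify(pr)` then gives exactly the “least risky first” ordering the rollup builder wants.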

This itself isn’t hard to implement.

However, once we have this, the next logical step is to automate it: have bors automatically create rollups containing mostly not-risky PRs plus a limited number of “somewhat risky” ones, and test those. If there’s a failure, move on and test one of the “do not roll up” PRs, and hopefully by then we’ll have noticed what caused the failure and r-’d the offending PR. If we want to get smart about it, we can have bors bisect by attempting a smaller rollup with half the PRs, and continue from there. (There are a bunch of nuanced issues to handle here.)
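The bisection idea can be sketched as recursive halving over the batch, where each oracle call stands in for a full bors CI run on the merged candidate (so the point is to keep the number of calls small). Note the big assumption: a PR fails independently of its batch-mates, which semantic merge conflicts can violate.

```python
def bisect_rollup(prs, ci_passes):
    """Land the largest passing subset of a failed rollup by halving.

    `prs` is the ordered batch; `ci_passes` is a (very expensive)
    oracle standing in for a full CI run on the merged batch.
    """
    if not prs:
        return []
    if ci_passes(prs):
        return list(prs)          # whole batch is fine, land it
    if len(prs) == 1:
        return []                 # isolated failing PR: r- it and move on
    mid = len(prs) // 2
    landed = bisect_rollup(prs[:mid], ci_passes)
    landed += bisect_rollup(prs[mid:], ci_passes)
    return landed
```

With one bad PR in a batch of n, this costs O(log n) CI runs instead of retrying the queue one PR at a time.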

We can also have the “probably don’t roll up” category go through automated try runs in parallel – passing a try run promotes it to “somewhat risky and okay to roll up”.

Another tactic is to take the rollupable PRs and in parallel do bisecting try runs. This may be a lot more resource-intensive.


FWIW I did a ton of rollups last month, and the biggest papercut was that we have jobs on both Travis and AppVeyor that often run out of time.


#3

Most of my changes to the rust repo are in std or rustdoc or something similarly ancillary to rustc. It’s always seemed odd to me that I’m rebuilding the compiler and competing for queue time for changes that don’t need that. How feasible would it be to split up the rust repo so that there’s less contention? (I’m guessing it would not be easy.)


#4

I’d love to see a solution here that moves us away from rollups - I find them a pretty frustrating solution, both because of the bad prioritisation they cause, as noted in the blog post, and because PRs that are likely to have a higher impact on users (because they are large, complex, or high-risk) are de-prioritised.

I think it is worth planning the long term solution now - it’s a big job and not user-visible, so there is not much energy for it when it is non-urgent, but when it is urgent (like now) we don’t want to think about the long term because we need a short-term solution. My preferred long-term solution is to split the repo and have two-stage landing. This gets us the quick satisfaction of landing PRs without them blocking unrelated PRs, and it has the advantage that it scales.

In the short-ish term, I think we could do much better if we allowed PRs to run subsets of the tests. For example, most documentation changes don’t need to run any tests at all. Those that touch examples only need to run doc tests; they don’t need to rebuild the compiler and run the whole test suite. Changes to a tool only need to run that tool’s tests, and so on. (Although if we do split the repo, this becomes wasted work.)
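As a sketch of what path-based test selection could look like: the patterns and suite names below are invented, the real rules would live in the build system, and note that libstd changes still need a compiler rebuild since they can break rustc.

```python
import fnmatch

# Hypothetical path -> required-test-suite rules; first match wins.
RULES = [
    ("src/doc/*", {"doctest"}),
    ("src/libstd/*", {"std-tests", "rustc-build"}),  # libstd can break rustc
    ("src/tools/*", {"tool-tests"}),
    ("*", {"full"}),  # anything unrecognized gets the whole suite
]

def suites_for(changed_paths):
    """Union of the test suites needed for a PR's changed files."""
    suites = set()
    for path in changed_paths:
        for pattern, needed in RULES:
            if fnmatch.fnmatch(path, pattern):
                suites |= needed
                break
    return suites
```

The conservative catch-all rule matters: anything the rules don’t recognize falls back to the full build, so the scheme can only skip work it positively knows is unneeded.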


#5

IIRC it is possible to build just std with nightly, but you need to flip a bunch of cfgs to make it work. I did something like this whilst hacking on core::str::pattern (core is trickier). Having this as a dedicated build mode would be fantastic.


#6

Is it possible to approach Travis/AppVeyor about paying some amount of money for more powerful build servers dedicated to the rust repository? I’m not sure what the current limitations on open-source builds are, but I’d imagine builds could get more cores/memory/timeout, or more builds could run in parallel. For example, if rust builds went from 2 to 16 cores, maybe that dramatically helps this problem? Could Mozilla or some other entity sponsor this?

I bring it up because the overall cost could be dramatically less than the total person-hours (probably of core devs) needed to split up repos, or the cost of letting untested PRs accidentally slip into master.


#7

I was independently coming up with similar ideas for automatic rollups at:

But in the discussion of: two things seemed to become clearer.

  1. brson, and presumably others, would need convincing before accepting automatic rollups.
  2. There was interest in switching to bors-ng, which already has an implementation of this stuff.

#8

Question: are there times of the day/week when bors tends to be idle? For example, if most of the people with bors permissions are asleep or not working at some time, maybe that would be a good time to run rollups, since they are low-risk and (hopefully) should just work. If bors tends to be idle on weekends or at midnight in the USA, then perhaps rollup jobs should be deferred until then. That way, waking/working hours can be better utilized for harder PRs. The benefit is that it doesn’t really worsen average rollup merge time, because you would just be waiting maybe half a day extra.
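As a sketch, the idle-window policy could be as simple as a time check when dequeuing a rollup. The hours below are guesses for illustration; the real window should come from measured queue-idle statistics, not from this placeholder.

```python
def rollup_window_open(weekday, hour_utc):
    """True if a queued rollup may run now.

    weekday: Monday=0 .. Sunday=6; hour_utc: 0..23.
    The window (weekends, plus roughly midnight-8am US Eastern)
    is a placeholder, not measured data.
    """
    is_weekend = weekday >= 5
    us_night = 5 <= hour_utc < 13
    return is_weekend or us_night
```

A bot would check this only for PRs classified as rollup-able; priority and “never roll up” PRs would still run immediately.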


#9

Question: Is there any data available on what percentage of jobs in the queue (on average) are documentation, standard library, compiler, etc.?

That might help identify the optimizations to make - if we found that documentation changes were 20% of the queue (I’ve not got a clue how far off this guess is) then finding a way to optimize that by not running the compiler build and testing might show significant improvements; similarly with any other parts of the project that we can test independently.


#10

Measure, Don’t Guess

Also, do we have some statistics on why PRs fail bors?

IIRC failures can be divided into 4 kinds:

  1. spurious failures (including timeout disease) - these cause random capacity loss and latency increase, and have to be removed.
  2. PRs that don’t pass travis - if this still hasn’t been done, we might want to prevent them from getting into bors.
  3. failures on odd platforms.
  4. merge conflicts.

The latter two are the “interesting” cases, and they have different solutions. I don’t see an easy way of handling merge conflicts, but I suspect they are fairly rare. I would prefer to measure rather than guess, however.
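If each bors failure were logged with one of the four kinds above, the “measure, don’t guess” part becomes trivial to compute. A sketch (the kind labels are made up):

```python
from collections import Counter

def failure_breakdown(failures):
    """Tally logged failure kinds and return percentages per kind,
    most common first, so we can see which problem dominates."""
    counts = Counter(failures)
    total = sum(counts.values())
    return {kind: round(100 * n / total)
            for kind, n in counts.most_common()}
```

Even a few weeks of such data would settle questions like how rare merge conflicts actually are.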

As for handling failures on odd platforms - we should have a way of running tests on several platforms on a PR without blocking the bors queue.

We might be able to find a representative subset of builders that can be used to try large PRs to shake out trouble, so you could have a command like @bors test-large-pr that tests your PR on, say, Windows, Linux, and wasm, both 32-bit and 64-bit - basically we want a set of builders that covers, say, 90% of all non-spurious, non-travis failures.
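Choosing that representative subset is essentially a set-cover problem, for which the standard greedy approximation works well. A sketch over hypothetical failure analytics (the builder names and data shape are invented):

```python
def pick_builders(failure_history, target=0.9):
    """Greedy set cover: choose builders whose historical failures
    cover `target` of all PRs that failed anywhere.

    `failure_history` maps builder name -> set of PR numbers that
    failed on that builder (hypothetical data shape).
    """
    all_failures = set().union(*failure_history.values())
    covered, chosen = set(), []
    while len(covered) < target * len(all_failures):
        # Pick the builder that catches the most not-yet-covered failures.
        builder = max(failure_history,
                      key=lambda b: len(failure_history[b] - covered))
        if not (failure_history[builder] - covered):
            break  # no builder adds coverage; target unreachable
        chosen.append(builder)
        covered |= failure_history[builder]
    return chosen
```

Re-running this periodically over fresh analytics would keep the test-large-pr suite honest as the failure profile shifts.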

Maybe also have a similar suite for e.g. buildsystem changes, if they trigger a separate segment of tests.

If we could culturally not land large PRs without doing such an “enhanced regression run”, that would reduce the queue capacity loss due to “odd platform” failures.


#11

Network Access

On the purely technical side, IME bors accesses a lot of network sites outside of our control, and that causes a ton of outages. We could find a way to proxy these and only refresh our “caches”, say, once per day - so that we only access GitHub and our cache - while retaining the ability to roll back the network cache.

That would prevent this sort of outage. Maybe ask an OS packager (@cuviper ?) or a CI expert how they handle network dependencies in build scripts?


#12

We “handle” network dependencies by forbidding them. Fedora and RHEL builders have no network access at all once they enter the build chroot, and I expect other distros are similar.


#13

A vision for the build system

Alice wants to fix a bug or improve documentation. This change probably does not cause failures on weird platforms. She pushes her PR, waits until travis is green, and gets an r+. Eventually, her PR is picked up in an automatic roll-up and is merged.

If one of the rollups has a failure, the failure is nicely displayed on the rollup PR, and she or a maintainer marks her PR as failed. Meanwhile, bors might start processing a larger PR (is this a good idea?).

A regression has landed on nightly, and Bob wants to get a fix up quickly. He puts his PR up with priority, and after he gets green on Travis and the current PR succeeds, his PR is up next.

Charlie is writing a sweeping reform of the MIR optimization pipeline. This will obviously cause weird failures on all sorts of platforms, so he wants to see that there are none before merging. Also, sweeping reforms cause regressions in both performance and functionality, so he wants a tracking commit.

Charlie can ask bors to run the test suite on the platforms that account for the 90th percentile of genuine test failures - we keep analytics on which platforms cause test failures in order to compute that. After that, and after review, he can get his PR merged as a separate commit.

David wants to add a now function on a less common platform. Travis being green therefore won’t tell him much. He can give a bors command to make Travis run tests on his platform. If that run is green, his PR will also get rolled up.

Are there any user stories I missed?


#14

You apply patches to the build-systems so that they download network dependencies from local sources. It might be worth investigating whether we can do that change on our own CI.


#15

:+1: I think we have a tendency to use @bors try for this, which is not very efficient, because we tend to use try only for things that are not going to merge immediately. Why should they be serialized with PRs that actually want to merge?


#16

A common and fixable cause of bors failure is lack of rebasing: a patch doesn’t fail travis on its own, but the merged patch will. This could be fixed by adding an intermediate step between travis and full bors. Instead of the two-step:

Travis tests single target unmerged -> bors tests all targets merged

We could do a three-step:

Travis tests single target unmerged -> bors tests single target merged -> bors tests all targets merged.


#17

I think “failure due to lack of rebasing” is basically merge conflict. That is, two PRs don’t conflict physically, but they do conflict semantically.


#18

I just want to point out that changes to libstd can totally break rustc. rustdoc, perhaps not.


#19

Sometimes we do have to patch build systems, but not in rust’s case. We just use ./configure --enable-local-rust --enable-vendor and that’s enough.

What happens for local sources is that the packager manually uploads the source tarball to Fedora’s own infrastructure, and it’s fetched from there before the build. When first bootstrapping rustc, we also include the upstream binaries as extra “sources” for stage0 (manually unpacked for --enable-local-rust), then we use rpm dependencies for that in subsequent builds.


#20

Currently, spurious failures largely come down to timeouts; reducing absolute cycle time will help.

I’d like to get thoughts on this line - my belief is that this is a significant factor behind many of our current woes with CI. Timeouts are basically the worst case - CI has run for the maximum time possible, then it fails and the PR needs retrying and the only solution is “make the build faster”. We’ve been bumping into this for quite a while now, it’s just that as average time creeps up, so too does the frequency with which we hit the limit.

The scenario that really worries me is that we get to a state where 9/10 failures are caused by timeouts, and we become unable to merge anything.

So how do we solve this? First off, caching is something to tread carefully around - it would be easy to get into a state where builds stop working if the cache isn’t primed (we already have this to some extent with the Docker images and llvm, but they both have some mitigating factors), so my focus has generally been around the ‘worst case build’.

I also feel like 3 hours is pushing the limit of what’s acceptable for a ‘complete’ build - for people who want to do the full build and tests themselves, it’s not a great experience.

To be clear, are you suggesting (say) moving libstd out of the main tree? We may sacrifice some ‘correctness’ here if we don’t run libstd tests against every change to rustc. Of course, this is a spectrum where one end is doing a whole crater run on every commit and the other is just making sure rustc builds.

Perhaps this works if we promote crater to be a regular part of CI - @pietroalbini is doing some great work on automating crater that may help us walk this route. But it’s worth being explicit about what this means for the not rocket science rule and “automatically maintain a repository of code that always passes all the tests” - what does all mean?