Homu queue woes, and suggestions on how to fix them

Crowdfunding for Travis bills then, sorry

1 Like

I've set up a CI pipeline for our company using Jenkins on top of Kubernetes.

We've been in the same boat (on a smaller scale): long queues and slow builds, plus intermittent failures due to resource constraints on Travis.

We now have a highly parallel Jenkins set-up that automatically spins up extra resources during busy periods (using GKE, Google Kubernetes Engine). Scaling is based purely on requested versus available CPU resources, so if 30 people create a PR at once, new machines are booted within minutes and the tests have enough capacity. Similarly, once those tests are done and nothing is running anymore, machines are stopped after a couple of minutes of idle time. Since test jobs are relatively short-lived (compared to a web service), we actually use Google's "Preemptible VMs" for this, which makes them 80% cheaper than regular VMs, with almost no downsides.
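
To make that scaling rule concrete, here's a toy model of the arithmetic (the real thing is just GKE's cluster autoscaler reacting to pending jobs; the numbers and names below are made up for illustration):

    // Toy model of the scaling rule: the desired node count is the total CPU
    // requested by queued jobs divided by the CPU one node provides, clamped
    // to the configured pool bounds. All values here are hypothetical.
    fn nodes_needed(requested_cpus: f64, cpus_per_node: f64, min_nodes: u32, max_nodes: u32) -> u32 {
        let raw = (requested_cpus / cpus_per_node).ceil() as u32;
        raw.clamp(min_nodes, max_nodes)
    }

    fn main() {
        // 30 PRs arriving at once, each build requesting 4 CPUs, on 16-CPU nodes:
        println!("{}", nodes_needed(30.0 * 4.0, 16.0, 1, 20)); // prints 8
    }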

Tests start almost instantly. Each new project only has to add a Jenkinsfile with one or two lines of config; everything else is abstracted into a shared set of "pipeline scripts" (written in Groovy) used by all projects in our company. There's even a pretty UI these days that gives it a Travis-like look.

We manage the Jenkins installation using the official Jenkins Kubernetes chart and keep all our configuration in Git (meaning the Jenkins installation itself is stateless/ephemeral), so if anything does go bad (which happens maybe once every couple of months), we run script/deploy, the CI goes down for a few minutes while a new instance is started, and things are back to normal.

These days there's even Jenkins X (built specifically for Kubernetes) to do most of this for you. I haven't read too much into it yet, but I believe it's basically a curated collection of the plugins and best practices I linked to above.

Our set-up is not open source (mostly because I haven't put in the time to make it usable outside our own requirements), but I can walk you through it if you'd like, and we can share whatever code could be relevant. We maintain this set-up for a company of about 40 developers; it costs us at most a couple of hours a month to maintain (mostly keeping Jenkins and its plugins up to date), and since we run on top of GKE, we have no "metal/VM" related maintenance.

I'm willing to invest time with you all in trying out such an approach, and I can also put you in touch with people at Google Cloud to see if we can get some kind of collaboration going around GKE.

This would (for now) only work for the non-Windows builds, but I'm confident we can tackle that as well (for example, our set-up also has a hosted Mac Mini connected for our iOS app builds).


Back in 2016, when I set this up, I also looked at Drone CI. It might be able to provide the same capabilities as a more modern, less bloated Go-based CI tool, but it also has a smaller community, so more of the work is likely to fall on the Rust community if things don't work. There's been active work recently to make Drone run on Kubernetes as well, which would give the same hands-off experience in terms of maintaining the set-up.

2 Likes

I’m not sure whether this is a good idea or not, but here is one possibility:

  • Create a new “stability tier”: the daily build. The daily build is the current tip of the master branch.

  • Run bors and CI on every PR after building only stage 1, and merge to the tip of master only if it passes. Use aggressive caching of artifacts that are not touched by the PR in question. These builds should target (hypothetically) 30 minutes or less.

  • Run one full build on the tip of master through an automated process every night to find a working commit near the tip. This commit is promoted to the new nightly Rust (a rough sketch of this step follows below).
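
A rough sketch of that nightly promotion step, just to make the idea concrete (the type and helper names here are hypothetical, not an existing tool's API):

    struct Commit {
        sha: String,
    }

    // Placeholder for the expensive full multi-stage build and test run.
    fn full_build_passes(_commit: &Commit) -> bool {
        unimplemented!("kick off the complete CI pipeline for this commit")
    }

    // Walk back from the tip of master (newest first) until a commit survives
    // the full build; that commit becomes the new nightly.
    fn find_nightly_candidate(commits_newest_first: &[Commit], max_lookback: usize) -> Option<&Commit> {
        commits_newest_first
            .iter()
            .take(max_lookback)
            .find(|c| full_build_passes(c))
    }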

1 Like

Respectfully, your situation has little in common with Rust's. If you're in the SaaS business, then you have the option of just training everybody on your team to fix the Kubernetes cluster if it goes down, whereas Rust has a large volunteer force that wouldn't be allowed to fix CI problems even if they did know how. It also sounds like your test suite is pretty fast, while Rust is heavily limited by its unparallelizable build step.

Crowdfunding also isn't an option, because it doesn't provide the kind of reliable monthly income that a good sponsorship or a Mozilla paycheck does. It's great for one-off things, but not a good way to pay the bills.

No.

4 Likes

That sounds like it breaks the not-rocket-science rule.

Not sure why crowdfunding needs to be reliable over time. If the crowdfunded budget runs out, CI gets slower. Insert coin to receive faster builds.

2 Likes

Is there a good place to read up on bors/homu strategies that people have considered? It’s fun to think about and I’m super tempted to try to come up with some microoptimizations, but I’m sure a lot of this has been thought through before…

1 Like

I’m not quite sure I follow. How does it break this rule?

Thanks for that bit of information. However, a couple of things to note:

  1. The reason I mentioned GKE (though there are other providers) is specifically so that you aren't responsible for the correct operation of the Kubernetes cluster. It's the same as was said earlier: "If something breaks, Travis is an email away." The same applies here, except that you contact the company responsible for hosting your cluster (in this case, Google).

  2. We actually don't have any operations people, because we outsource the hosting of the cluster. Yes, we have two or three people who are familiar with Kubernetes, but not on an operations level, because it is not needed.

  3. We've had zero downtime of our (hosted) Kubernetes cluster in the last three years or so (other than the one or two Google Cloud-wide outages, unrelated to Kubernetes), so I'm very confident in its capabilities and stability.

That's interesting to know. So I'm assuming there is a team of several volunteers that have access to Travis configurations? This situation would be similar to having a team that has access to the Git repo that hosts the Jenkins configuration.

Indeed, it would require knowledge of Jenkins (this is the one thing that does require know-how, because you need to set it up yourself instead of having Travis do that for you, but that's also why I'm offering to help out), and it would require some knowledge of Kubernetes, not from an operations perspective but from a user's perspective (as in, "how to use these 10 CLI commands"). I don't think that's impossible to achieve, but maybe this is where we disagree?

Some of our services have fast suites, some don't. The longest takes about 45 minutes. Rust's is (I believe) an hour or so? But it really doesn't matter, since the speed of the suite has no impact on how many tests you can run in parallel with this set-up. You can configure it to (automatically) spin up 2 machines if your queue needs that many resources to run everything in parallel, or go to 10 if required. Your configuration determines the lower and upper bounds.

I still believe there's nothing in what I've read here, or in what I've seen in the configurations and the Travis runs, that makes this fundamentally impossible. While it is different from the current set-up, I don't think it's worse, and in my experience, done right, it can actually provide much more CI capacity at lower cost (both in hours and in hardware).

I'm not saying getting there is easy, but I did want to throw this curveball, and offer my insights/help in case any of you think it's worth considering.

I think there are two different things going on here.

First, there's the issue of pressure on reviewers. Looking at the homu queue, when I sort by PR number, it seems to me that there are many older PRs that haven't been r+'d (for whatever reason). I'd imagine that a tweak to the queue order won't pressure these reviewers considerably more.

Second, there's the question of fairness: whether it's fairer to order the queue based on time-of-request or time-of-review. Neither seems perfectly fair to me; sometimes a PR can be sent well before it is ready (say, it required multiple substantial rewrites), and sometimes one can get an r+ but then hit test failures requiring a few rounds of updates (this seems to be the usual case for me, for instance). On the whole, though, I lean toward Nick's proposal for the ordering. I'm swayed by his argument that it will reduce variance. I wouldn't even mind an ordering where the time of the last r+ was used, provided rebases arising from ordinary conflicts were excluded.
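
To make the two orderings concrete, here's a tiny illustration (the field names are made up, not homu's actual data model): choosing time-of-review over time-of-request is literally just a different sort key.

    // Hypothetical queue entry; homu's real data model differs.
    struct QueuedPr {
        opened_at: u64,        // time-of-request
        last_approved_at: u64, // time-of-review (most recent r+)
    }

    fn sort_queue(queue: &mut [QueuedPr], by_review_time: bool) {
        if by_review_time {
            // order by the most recent r+ (time-of-review)
            queue.sort_by_key(|pr| pr.last_approved_at);
        } else {
            // order by when the PR was opened (time-of-request)
            queue.sort_by_key(|pr| pr.opened_at);
        }
    }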

1 Like

No, there's not. That's the whole point I'm making; only the Infrastructure Team has the ability to deploy fixes and updates to Rust's servers, and, for infosec reasons, they can't just outsource it to volunteers. You can't just set up a Jenkins instance for them, since they aren't going to trust strangers on the Internet with access to the Rust release pipeline.

Well, alright, Jenkins. Not Kubernetes. Same point, though, that you're adding more load to the infra team.

Are there any hard numbers on how much time is lost to spurious failures from Travis or AppVeyor?

The infrastructure team consists exclusively of full-time Mozilla employees?

Right. That's even better, since you don't get access to the Kubernetes control plane, as it's fully managed by Google Cloud.

I'm not suggesting I would (although I'm willing to do that, if push comes to shove). I'm suggesting I could help others do that, or just in general help flesh out the project scope and the roadmap on how to get there.

At some point, yes, someone would have to do something to get this running.

Yes, true. If learning how Jenkins works (up to a certain level), or finding people to add to the team of volunteers that know how Jenkins works is out of the question, then my whole suggestion of using Jenkins is moot.

Thankfully we don't have this problem :slight_smile:

If you're interested you (and everyone else!) can find us in the #infra channel on Discord!

By the way, even inside the infra team only a few people have access to the more critical machines (like the one doing releases).

I don't think we have aggregated numbers, but @kennytm is doing awesome work tracking most of the spurious failures in a spreadsheet. The most common failure we have right now is travis-ci/travis-ci#9696, a network failure that has happened every day between 6 AM and 7:30 AM UTC since the start of June.

Nope, only a few of the members are Mozilla employees as far as I know.

No, but I'm pretty sure all of them work for a company or university that sponsors them (PingCAP, Integer32, Rails, and Mozilla are the ones I know of). The really important part, though, is that none of them are strictly anonymous; Mozilla knows their names and occupations, so if any of them pull something, Mozilla can ban them and guarantee that they don't just rejoin under a new name.

GKE only relieves the Infra Team of having to maintain the underlying container orchestrator. But because Google will not help with Jenkins problems, it's still on the Infra Team to diagnose any downtime and figure out whether it's a Jenkins problem that they need to fix or a Kubernetes problem that they just wait for Google to fix. That is a significant regression compared to the status quo, where the Infra Team only has to maintain homu/bors.

Yes, that's correct, if the current team structure/responsibilities stay the same.

At the same time, because the Jenkins set-up is immutable, restarting it is just a matter of stopping and starting the pod/instance that runs it, which is done via a CLI command rather than by accessing the underlying infrastructure. You could therefore widen access to that command if necessary and reduce the burden on the infra team.

Again, there are downsides; I'm just saying that there can also be significant upsides, and that the downsides can be mitigated with some investment.

From what I understand from @pietroalbini in the chat (and also from what started this thread), Travis isn't working so well right now, so there's more to this than just maintaining homu/bors. If a workable solution can be found that reduces the time needed from volunteers, that's obviously the best outcome, but it does sound like Travis will most likely not be part of that solution.

Either way, I'll let this discussion rest. I'm still offering a helping hand, even if some other solution turns out to be a better fit, and I'll reach out in chat to see where I can help :+1:

2 Likes

Yes, Zuul and the Bors-NG algorithm are similarly optimistic-rollup oriented. From the look of it (it's not completely clear from the docs), Zuul eagerly burns more speculative work by testing every sub-prefix it can immediately and merging the maximal passing prefix from that set, whereas Bors-NG starts maximal and bisects down to either a prefix it can commit or a failure it can kick out. That can take longer than the best case (it can burn log(prefix) worth of bisection time), but it might make more sense given the limited width of your cluster (i.e. you don't have enough money to do the Zuul approach).

(To make this more concrete: if you have 50 queued PRs and there's a bad one at the 15th position, then assuming I'm reading the docs right, Zuul will run 50 jobs in parallel, one per possible prefix, of which the first 14 will pass and the next 36 will fail, and it'll commit the 14-PR prefix. This is time-optimal but burns a lot of money: roughly a 15x speedup for 50x the cost. Bors-NG will run one failing 50-PR job, then a failing 25-PR job, then a passing 12-PR job that gets merged. That's slower than Zuul, since it took 3 cycles to merge 12 PRs, but it only cost you 3 jobs of CPU time to find that prefix, not 50: a 4x speedup for 3x cost amplification. In general the balance depends on the bugginess of your PR queue and the quality of the guesswork going into the existing rollups.)
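
If it helps, here's a toy model of those two strategies under the same simplifying assumptions (a queue of n PRs with exactly one bad PR at a known position, every CI job costing one unit); it mirrors the description above rather than either tool's exact implementation:

    // Returns (prs_merged, wall_clock_rounds, jobs_spent) for a queue of `n`
    // PRs where only the PR at 1-based position `bad` fails CI.

    // Zuul-style (as described above): launch one job per possible prefix in
    // parallel and commit the longest passing prefix.
    fn zuul(n: usize, bad: usize) -> (usize, usize, usize) {
        let merged = bad - 1; // every prefix containing the bad PR fails
        (merged, 1, n)
    }

    // Bors-NG-style (as described above): test the maximal prefix, halve it on
    // each failure, and commit the first prefix that passes.
    fn bors_ng(n: usize, bad: usize) -> (usize, usize, usize) {
        let mut prefix = n;
        let mut jobs = 0;
        while prefix >= bad {
            jobs += 1;   // this prefix still contains the bad PR, so it fails
            prefix /= 2; // bisect down
        }
        jobs += 1;       // this prefix no longer contains the bad PR; it passes
        (prefix, jobs, jobs)
    }

    fn main() {
        println!("{:?}", zuul(50, 15));    // (14, 1, 50)
        println!("{:?}", bors_ng(50, 15)); // (12, 3, 3)
    }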

For what it’s worth, Zuul was built for OpenStack, an open source cloud platform. As a result, there were a lot of cloud providers donating compute (HP, Rackspace, probably some others). Hopefully that explains the design decisions.