I’ve set up a CI pipeline for our company using Jenkins on top of Kubernetes.
We were in the same boat (on a smaller scale) on Travis: long queues, slow builds, and intermittent failures due to resource constraints.
We now have a highly parallel Jenkins set-up that automatically spins up extra resources during busy periods (using GKE, Google Kubernetes Engine). Scaling is based purely on available versus requested CPU resources, so if 30 people create a PR at once, new machines are booted within minutes and the tests have enough capacity. Similarly, once those tests are done and nothing is running anymore, the machines are stopped after a couple of minutes of idle time. Since tests are short-lived (relative to, say, a web service), we actually use Google’s “Preemptible VMs” for this, which means they are 80% cheaper than regular VMs, with almost no downsides.
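To make the “based on requested CPU” part concrete, here is a rough Python sketch of the arithmetic involved. It is only a simplified model (first-fit-decreasing packing of CPU requests onto identical machines), not the actual algorithm the Kubernetes cluster autoscaler uses, and the numbers (2000 millicores per build, 7910 allocatable millicores per machine) are made-up assumptions rather than our real settings.

```python
def nodes_needed(pod_cpu_millicores, node_allocatable_millicores):
    """Estimate how many identical nodes are needed to fit all CPU requests.

    Simplified model: first-fit-decreasing bin packing on CPU only.
    The real autoscaler also considers memory, affinity, and other constraints.
    """
    free = []  # remaining CPU (millicores) on each node allocated so far
    for request in sorted(pod_cpu_millicores, reverse=True):
        if request > node_allocatable_millicores:
            raise ValueError(f"request of {request}m does not fit on any node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request  # place on an existing node
                break
        else:
            # No existing node has room, so a new machine has to be booted.
            free.append(node_allocatable_millicores - request)
    return len(free)


# Hypothetical burst: 30 PRs, each build requesting 2 CPUs (2000m),
# on machines with 7910m allocatable CPU (an assumed figure).
print(nodes_needed([2000] * 30, 7910))  # 10
```

Once the builds finish, the requested CPU drops back to zero, the same calculation yields zero machines, and the extra nodes can be removed again.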
Tests run almost instantly. Each new project only has to add a Jenkinsfile with one or two lines of config; everything else is abstracted into a shared set of “pipeline scripts” (written in Groovy) used by all projects in our company. There’s even a pretty UI these days that mimics a Travis-like look.
We manage the Jenkins installation using the official Jenkins Kubernetes chart and keep all our configuration in Git, which means the Jenkins installation itself is stateless/ephemeral. If anything does go wrong (which happens maybe once every couple of months), we run script/deploy, the CI goes down for a few minutes while a new instance is started, and things are back to normal.
There’s even Jenkins X (specifically for Kubernetes) these days, which does most of this for you. I haven’t read much about it yet, but I believe it’s basically a collection of the plugins/best-practices I linked to above.
Our set-up is not open source (mostly because I haven’t put in the time to make it usable outside our own requirements), but I can walk you through it if you’d like, and we can share whatever code could be relevant. We maintain this set-up for a 40-developer company. It costs us a couple of hours a month at most (mostly making sure we keep Jenkins and the plugins up to date), and since we run on top of GKE, we have no “metal/VM”-related maintenance.
I’m willing to invest time with you all in trying out such an approach. I can also put you in touch with people at Google Cloud to see if we can get some kind of collaboration going using GKE.
This would (for now) only work for the non-Windows builds, but I’m confident we can tackle that as well (for example, our set-up also has a hosted Mac Mini connected for our iOS app builds).
Back in 2016, when I set this up, I also looked at Drone CI. It might be able to provide the same in a more modern, less bloated, Go-based CI tool, but it also has a smaller community, so it’s more likely that work would have to be picked up by the Rust community if things don’t work. There’s been active work recently to make Drone run on Kubernetes as well, which would give the same hands-off experience in terms of maintaining the set-up.