RFC: Build our own self hosted CI infrastructure



Motivation

CI speed has become a growing concern recently. We’re likely to hit the limits of hosted CI services because of our unusual requirements. Self-hosting has its own costs, but its flexibility can greatly help us improve the CI experience.

Improvements

First, we’re limited by timeouts and have split the workers across 40+ machines. This is because we only have 4 cores on Linux, and fewer on other platforms. As a result, we build the host rustc 40+ times even though we don’t have to. If we could build with e.g. 16 cores, build time would drop and we could use fewer machines, which also makes parallel testing more realistic.

Another thing is that with our own clusters it’s possible to over-commit. Parallelism is hard, but increasing the utilization isn’t. Currently on the Travis infra we are sometimes leaving the cores idle. On a self hosted infrastructure, however, over-committing will allow us to exploit the idle cores. Excessive concurrency is normally better than not utilizing all cores, but optionally we can code a global jobserver to control the exact parallelism.
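The "global jobserver" idea above can be sketched as a shared pool of tokens: every task must hold a token while it runs, so we can over-commit worker threads freely while the number of tasks actually executing never exceeds the core count. This is only an illustration under assumptions; the names GlobalJobserver and simulate are hypothetical, and a real version would have to coordinate separate processes (for example through a GNU make style jobserver pipe) rather than threads in one process.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GlobalJobserver:
    """Hypothetical sketch: hands out a fixed number of tokens.

    A task must hold one token for as long as it runs.
    """

    def __init__(self, tokens: int) -> None:
        self._tokens = threading.BoundedSemaphore(tokens)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of tasks seen running at once

    def run(self, task):
        with self._tokens:  # blocks until a token is free
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return task()
            finally:
                with self._lock:
                    self._running -= 1


def simulate(tokens: int, tasks: int, workers: int) -> int:
    """Over-commit `workers` threads onto `tokens` cores.

    Returns the peak number of tasks that ran concurrently.
    """
    server = GlobalJobserver(tokens)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(server.run, lambda: time.sleep(0.01))
            for _ in range(tasks)
        ]
        for future in futures:
            future.result()
    return server.peak


if __name__ == "__main__":
    # 16 worker threads compete for 4 tokens; peak concurrency stays <= 4.
    print(simulate(tokens=4, tasks=32, workers=16))
```

Without the token pool the same 16 threads would simply all run at once, which is the "excessive concurrency" case; the jobserver is the optional refinement that pins parallelism to an exact number.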

Adding to that, since our jobs finish at different times, when one job finishes we can give the resources it was using to the jobs that are still running.
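A minimal sketch of that reallocation, assuming a single machine with a fixed core count and an even split between whatever jobs are still running. The function name share_cores and the job names are hypothetical, and a real scheduler would probably weight jobs by how much work they have left.

```python
from __future__ import annotations


def share_cores(total_cores: int, running_jobs: list[str]) -> dict[str, int]:
    """Split cores evenly among running jobs.

    Any remainder goes to the jobs listed first, one extra core each.
    """
    if not running_jobs:
        return {}
    base, extra = divmod(total_cores, len(running_jobs))
    return {
        job: base + (1 if index < extra else 0)
        for index, job in enumerate(running_jobs)
    }


if __name__ == "__main__":
    jobs = ["job-a", "job-b", "job-c"]  # hypothetical job names
    print(share_cores(16, jobs))        # three jobs share 16 cores
    jobs.remove("job-c")                # one job finishes early
    print(share_cores(16, jobs))        # the survivors absorb its cores
```

With 16 cores and three jobs the split is 6/5/5; once the third job finishes, the remaining two get 8 cores each instead of leaving the freed cores idle.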

Finally, we get local caching on our own machines. This saves 2 minutes of repository fetching per job, and an additional 1 minute if we count the Docker image cache. (As another, weaker side point, we can expect better throughput to S3 if we use EC2 for workers.)
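To put those numbers in perspective, here is the arithmetic as a small sketch. It assumes every job pays both the fetch and the Docker image cost, and takes 40 jobs as a lower bound for the "40+" machines mentioned earlier; the function name cache_savings is hypothetical.

```python
FETCH_MINUTES = 2   # repository fetching, from the estimate above
DOCKER_MINUTES = 1  # Docker image cache, from the estimate above
JOBS = 40           # assumption: lower bound for the "40+" machines


def cache_savings(jobs: int, fetch_minutes: int, docker_minutes: int) -> tuple[int, int]:
    """Return (minutes saved per job, machine-minutes saved per build)."""
    per_job = fetch_minutes + docker_minutes
    return per_job, per_job * jobs


if __name__ == "__main__":
    per_job, per_build = cache_savings(JOBS, FETCH_MINUTES, DOCKER_MINUTES)
    print(per_job, per_build)
```

Because the jobs run in parallel, the wall-clock gain per build is the per-job figure (3 minutes), while the machine time recovered is the per-build figure (at least 120 machine-minutes under these assumptions).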

How we can do this

First, let’s briefly think about what machines we should use. Our workload is mostly constant, as the queue normally doesn’t go empty and we only run one job at a time. Given that, we can use EC2 reserved instances, or cheaper providers that specialize in computing (DigitalOcean, OVH).

Then, for the CI, I would recommend Concourse. It’s designed for large-scale CI, and it should work well with our Docker-oriented workflows, as well as with normal scripts on macOS and Windows.

For the gating system, we will need to add Concourse integrations for bors and cancelbot. (bors-ng, which supports commit statuses, would not need any code changes, but it has its own issues blocking its use with rustc.)

Drawbacks

  • We need to maintain our own infra.

    Though, this is probably not as bad as you think. Travis goes wrong almost once a month, and waiting for things to recover is annoying. We may be able to recover faster when maintaining our own infra, because we don’t have to wait for other projects’ backlogs to clear.

  • buildbot was replaced because it didn’t work well.

    I think the primary reason we migrated to Travis at that time was to have a good release/deployment flow. The majority of the Rust CI / release infrastructure changes weren’t bottlenecked on buildbot.

    Another issue buildbot had is that it needed configuration files outside the repository. Concourse pipelines are highly configurable, which matches the goal of having a CI system that can be modified by any contributor.


Moving to TaskCluster: The Plan