Building our own CI
This is a quick and dirty analysis/brainstorm of what it might take to build our own CI service just for the rust-lang/rust repo. I propose that all the other repos can stay with Travis CI for now.
This is just one point in the design space intended to start a discussion.
I propose that we leverage AWS Spot Instances (maybe GCP or Azure has something similar?) and incremental compilation to achieve very high cost-efficiency by allowing builds to be interruptible.
I don’t know anything about the budget or the bandwidth the infra team has, so I will just lay out my rough estimates here and let them consider the numbers and respond as they feel comfortable.
A central build scheduler would keep track of all scheduled and running builds. This could be bors or the GitHub issue tracker if we wanted. The point is that there is some central location that knows the state and progress of each build.
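To make the idea of a central scheduler more concrete, here is a minimal in-memory sketch of the state it might track. Everything in it is an assumption for illustration: the `Scheduler`, `Build`, and `BuildState` names, the set of states, and the policy of resuming interrupted builds before starting new ones. A real version would persist this state and sit behind bors or a similar service.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class BuildState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    INTERRUPTED = "interrupted"  # instance went away; build can be resumed
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Build:
    build_id: str
    commit: str
    state: BuildState = BuildState.QUEUED
    # Hypothetical: names of cached artifacts to restore when resuming.
    artifact_keys: List[str] = field(default_factory=list)


class Scheduler:
    """In-memory stand-in for the central build scheduler."""

    def __init__(self) -> None:
        # Dicts keep insertion order, so this doubles as a FIFO queue.
        self.builds: Dict[str, Build] = {}

    def enqueue(self, build_id: str, commit: str) -> Build:
        build = Build(build_id, commit)
        self.builds[build_id] = build
        return build

    def start(self, build_id: str) -> None:
        self.builds[build_id].state = BuildState.RUNNING

    def interrupted(self, build_id: str, artifact_keys: List[str]) -> None:
        build = self.builds[build_id]
        build.state = BuildState.INTERRUPTED
        build.artifact_keys = list(artifact_keys)

    def finish(self, build_id: str, ok: bool) -> None:
        state = BuildState.SUCCEEDED if ok else BuildState.FAILED
        self.builds[build_id].state = state

    def next_runnable(self) -> Optional[Build]:
        # Assumed policy: resume interrupted builds first, then oldest queued.
        for wanted in (BuildState.INTERRUPTED, BuildState.QUEUED):
            for build in self.builds.values():
                if build.state is wanted:
                    return build
        return None
```

A worker would call `next_runnable()` to pick up work, `start()` when the container launches, and either `interrupted()` or `finish()` when it stops.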
Each build runs in a container (e.g. via docker), as we currently do with craterbot, for security reasons.
On AWS, spot instances get a 2-minute warning when they are about to be interrupted. We could use this warning to cache build artifacts to AWS S3 and notify the build scheduler that the build should be restarted later, passing along the names of the artifacts in S3 so the restarted build can pick them up. We can clean up old artifacts in S3 either lazily or eagerly when the build terminates.
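As a rough sketch of the worker side of this, the code below reacts to one poll of the EC2 instance metadata path for spot interruption notices, which returns 404 until an interruption is scheduled and a small JSON document afterwards. The function names, the `upload` and `notify` callbacks, and the artifact naming are all hypothetical; actually fetching the URL (including any metadata-service token handling) and the S3 upload itself are left to the caller.

```python
import json
from typing import Callable, List, Optional

# EC2 instance-metadata path for spot interruption notices.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_spot_notice(status: int, body: str) -> Optional[dict]:
    """Return the interruption notice, or None if nothing is scheduled."""
    if status != 200:
        return None
    notice = json.loads(body)
    return notice if "action" in notice else None


def on_poll(
    status: int,
    body: str,
    build_id: str,
    artifact_paths: List[str],
    upload: Callable[[str, str], str],        # hypothetical: returns the stored name
    notify: Callable[[str, List[str]], None],  # hypothetical: tells the scheduler
) -> bool:
    """Handle one poll of SPOT_ACTION_URL; return True if we were interrupted."""
    if parse_spot_notice(status, body) is None:
        return False
    # Cache artifacts first, then tell the scheduler where they are.
    keys = [upload(build_id, path) for path in artifact_paths]
    notify(build_id, keys)
    return True
```

Passing the upload and notify steps in as callbacks keeps the sketch independent of any particular S3 client or scheduler API, and makes the 2-minute path easy to exercise without a real instance.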
There would also be integration with GitHub, bors, and rust-highfive to allow the workflow to remain roughly the same. Also, we would likely want some sort of dashboard to expose the progress and build logs for each build.
Normal CI builds (non-bors)
IIUC, these builds are less of a bottleneck and can be more variable time-wise (within reason). I propose that we choose the cheapest AWS spot instances that give reasonable CI time. Alternatively, we could overcommit VMs (as Travis does)… whatever makes financial sense and achieves reasonable performance.
Bors builds are traditionally the bottleneck. Currently, our process suffers from:
- Long build times
- Variable build performance
- Timeouts, time travel, CI service flakes
The hope is that by controlling the CI infra ourselves, we can eliminate the latter two, which are a significant source of frustration. We may also be able to improve build times by adjusting the instance types used for builds.
Currently, we pay Travis CI “a lot”. Looking at their public pricing info, it looks like the lowest tier is $130/month and their highest tier is $500/month. The $500/month plan allows 10 concurrent builds.
For comparison, I will use AWS prices from Jan 18 at about 11am CT. The numbers here are all pretty approximate, so take them with a cup of salt.
My laptop has 8 cores and 8GB RAM, and a full clean build (minus LLVM) takes ~20 minutes. IIRC, the full test suite takes ~1 hour, and an LLVM build takes ~30 minutes, but I haven’t run these in a while. I will take this as the performance baseline.
An AWS c5.2xlarge instance has 8 3GHz cores and 16GB RAM, so I expect it would get similar performance, or maybe a bit better.
The spot instance price for a c5.2xlarge (currently) is ~$0.10 per hour or ~$75/month/instance if left running continuously. According to the AWS page, these instances get interrupted about 10-15% of the time. I think with incremental compilation, we can easily tolerate this without too much frustration. One open question: how often are the spot instances available? Would we spend more time waiting for builds to run?
Using on-demand instances would be too pricey, I think: $0.34/hour => $250/month/instance (running continuously). That adds up quickly if we want to run 40 builds for each bors r+, as we currently do. One alternative would be to use smaller instances, which might slow down build times: t3.large (2 cores, 8GB RAM) = $0.08/hour; t3.xlarge (4 cores, 16GB RAM) = $0.16/hour.
Supposing that we wanted to run spot instances for everything, ~42 c5.2xlarge build instances running continuously would cost around $3150/month (42 × ~$75), which is not cheap. If we limit everything to 10 concurrent builds, this comes down to about $750/month, which might be competitive with Travis while (hopefully) avoiding some of the problems we’ve experienced.
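For anyone who wants to redo these estimates with different prices or instance counts, here is a small back-of-the-envelope helper. The hourly prices are the ones quoted above; the 730 hours/month figure and the function name are my own assumptions, and spot prices fluctuate, so treat the output as rough.

```python
HOURS_PER_MONTH = 730  # assumption: 365 * 24 / 12

# Hourly prices quoted in the text (USD).
SPOT_C5_2XLARGE = 0.10
ON_DEMAND_C5_2XLARGE = 0.34
ON_DEMAND_T3_LARGE = 0.08
ON_DEMAND_T3_XLARGE = 0.16


def monthly_cost(hourly_price: float, instances: int = 1) -> float:
    """Cost of running `instances` machines continuously for a month."""
    return hourly_price * HOURS_PER_MONTH * instances


scenarios = [
    ("1 spot c5.2xlarge", SPOT_C5_2XLARGE, 1),
    ("1 on-demand c5.2xlarge", ON_DEMAND_C5_2XLARGE, 1),
    ("10 spot c5.2xlarge", SPOT_C5_2XLARGE, 10),
    ("42 spot c5.2xlarge", SPOT_C5_2XLARGE, 42),
]

for label, price, count in scenarios:
    print(f"{label}: ${monthly_cost(price, count):,.0f}/month")
```

This assumes instances run around the clock; if builds only occupy instances part of the time, the real bill would be lower.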
Obviously, the human cost here is that we would need to maintain and run the CI service. Hopefully, it would be simpler software because it doesn’t have to be as general-purpose as Travis CI, but it’s still more software than we have to maintain now.
Overall, I would estimate that the initial implementation would take a couple of person-months total, and hopefully the maintenance/ops workload would be pretty low after that. Perhaps others with more experience could give better estimates?
Finally, since this would be a big experiment, there is the “unknown factor” – the problems we don’t know about yet that come up along the way.
On the other hand, the human cost of dealing with spurious CI failures would hopefully be relieved, which would be good.
The main technical cost, I think, would be the need to implement and maintain a CI service. Also, integration with existing services, such as GitHub, bors, and rust-highfive, might be less straightforward than it is now. However, because the service doesn’t need to be general-purpose, I think it will not be too complex an undertaking.
In exchange, we get all of the following technical improvements:
- Easier debugging (hopefully), since we own the code.
- More consistent performance => hopefully, fewer timeouts and flakes.
- More control over performance:
  - Can control VM overcommitment and scheduling.
  - Can control instance types.
  - Can use different instance types for CI and bors.
- More control over caching.
- With spot instances, can take advantage of incremental compilation to do interruptible builds.
There has been some previous discussion of our CI processes. A quick search brings up the following: