Building our own CI
This is a quick and dirty analysis/brainstorm of what it might take to build our
own CI service just for the rust-lang/rust repo. I propose that all the other
repos can stay with Travis CI for now.
This is just one point in the design space intended to start a discussion.
I propose that we can leverage AWS Spot Instances (maybe GCP or Azure has
something similar?) and incremental compilation to achieve very high
cost-efficiency by allowing builds to be interruptible.
I don’t know anything about the budget or personal bandwidth the infra team
has, so I will just lay out my rough estimates here and let them consider the
numbers and respond as they feel comfortable.
Design Sketch
A central build scheduler would keep track of all scheduled and running builds.
This could be bors or the GitHub issue tracker if we wanted. The point is that
there is some central location that knows the state and progress of each build.
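To make the "central location that knows the state of each build" idea concrete, here is a minimal in-memory sketch. Everything in it (the `BuildState` values, the `Scheduler` class, the `artifact_keys` field) is a hypothetical illustration, not an existing bors or infra API; a real scheduler would persist this state somewhere durable.

```python
from dataclasses import dataclass, field
from enum import Enum


class BuildState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    INTERRUPTED = "interrupted"  # instance was reclaimed; build can resume
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed state transitions (an assumption for this sketch).
TRANSITIONS = {
    BuildState.QUEUED: {BuildState.RUNNING},
    BuildState.RUNNING: {
        BuildState.INTERRUPTED,
        BuildState.SUCCEEDED,
        BuildState.FAILED,
    },
    BuildState.INTERRUPTED: {BuildState.RUNNING},
    BuildState.SUCCEEDED: set(),
    BuildState.FAILED: set(),
}


@dataclass
class Build:
    build_id: str
    commit: str
    state: BuildState = BuildState.QUEUED
    # Names of cached artifacts to restore from when the build resumes.
    artifact_keys: list = field(default_factory=list)


class Scheduler:
    """Tracks every scheduled and running build in one place."""

    def __init__(self):
        self.builds = {}

    def submit(self, build_id, commit):
        build = Build(build_id, commit)
        self.builds[build_id] = build
        return build

    def transition(self, build_id, new_state, artifact_keys=None):
        build = self.builds[build_id]
        if new_state not in TRANSITIONS[build.state]:
            raise ValueError(
                f"{build.state.name} -> {new_state.name} is not allowed"
            )
        build.state = new_state
        if artifact_keys is not None:
            build.artifact_keys = list(artifact_keys)
        return build
```

The `INTERRUPTED -> RUNNING` edge is the part that matters for interruptible builds: an interrupted build goes back into rotation with its cached artifact names attached instead of starting from scratch.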
Each build runs in a container (e.g. via docker), as we currently do with
craterbot, for security reasons.
On AWS, spot instances get a 2-minute warning before they are interrupted. We
could use this window to upload build artifacts to AWS S3 and notify the build
scheduler that the build should be restarted later, passing along the names of
the S3 artifacts to resume from. We can clean up old artifacts in S3 lazily, or
eagerly when the build terminates.
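A rough sketch of what the interruption handling could look like on a build worker. The metadata URL is the documented EC2 endpoint for spot interruption notices (it returns 404 until an interruption is scheduled); this sketch assumes plain IMDSv1 access, since IMDSv2 would additionally need a session token. The `upload` and `notify` callbacks and the `handle_interruption` function are hypothetical names standing in for the S3 upload and the scheduler notification.

```python
import json
import urllib.error
import urllib.request

# EC2 instance-metadata path for spot interruption notices.
SPOT_ACTION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/instance-action"
)


def fetch_spot_action(url=SPOT_ACTION_URL, timeout=1.0):
    """Return the interruption notice as a dict, or None if none is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def handle_interruption(build_id, artifact_paths, fetch, upload, notify):
    """Check once for an interruption notice and react to it.

    fetch()  -> notice dict or None           (e.g. fetch_spot_action)
    upload(build_id, path) -> artifact name   (hypothetical S3 upload)
    notify(build_id, names, notice)           (hypothetical scheduler call)

    Returns True if the build was checkpointed, False otherwise.
    """
    notice = fetch()
    if notice is None:
        return False
    names = [upload(build_id, path) for path in artifact_paths]
    notify(build_id, names, notice)
    return True
```

A worker would call `handle_interruption(...)` every few seconds while the build runs. Passing the three operations in as callbacks keeps the logic testable without AWS access; whether the artifacts can actually be uploaded within two minutes is an open question that depends on their size.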
There would also be integration with GitHub, bors, and rust-highfive to allow
the workflow to remain roughly the same. Also, we would likely want some sort
of dashboard to expose the progress and build logs for each build.
Normal CI builds (non-bors)
IIUC, these builds are less of a bottleneck and can be more variable time-wise
(within reason). I propose that we choose the cheapest AWS spot instances that
give reasonable CI time. Alternatively, we could overcommit VMs (as Travis
does)… whatever makes financial sense and achieves reasonable performance.
bors builds
These are traditionally the bottleneck. Currently, our process suffers from:
- Long build times
- Variable build performance
- Timeouts, time travel, CI service flakes
The hope is that by controlling the CI infra ourselves, we can eliminate the
latter two, which are a significant source of frustration. We may also be able
to improve build times by adjusting the instance types used for builds.
Cost/benefit
Financial
Currently, we pay Travis CI “a lot”. Looking at their public pricing info, it
looks like the lowest tier is $130/month and their highest tier is $500/month.
The $500/month plan allows 10 concurrent builds.
For comparison, I will use AWS prices from Jan 18 at about 11am CT. The numbers
here are all pretty approximate, so take them with a cup of salt.
My laptop has 8 cores and 8GB RAM, and a full clean build (minus LLVM) takes
~20 minutes. IIRC, the full test suite takes ~1 hour, and an LLVM build takes
~30 minutes, but I haven’t run these in a while. I will take this as the
performance baseline.
An AWS c5.2xlarge instance has 8 3GHz cores and 16GB RAM, so I expect it would
get similar performance, or maybe a bit better.
The spot instance price for a c5.2xlarge (currently) is ~$0.10 per hour or
~$75/month/instance if left running continuously. According to the AWS page,
these instances get interrupted about 10-15% of the time. I think with
incremental compilation, we can easily tolerate this without too much
frustration. One open question: how often are spot instances actually
available, and would we end up spending more time waiting for builds to start?
Using on-demand instances would be too pricy, I think: $0.34/hour =>
$250/month/instance (running continuously). This might be problematic if we
want to do 40 builds for each bors r+, as we currently do. Some potential
alternatives would be to use smaller instances, which might slow down build
times: t3.large (2 cores, 8GB RAM) = $0.08/hour; t3.xlarge (4 cores, 16GB RAM)
= $0.16/hour.
Supposing that we wanted to run spot instances for everything, ~42 c5.2xlarge
build instances continuously running would cost around $3,100/month (42 x
~$0.10/hour x ~730 hours), which is not cheap. If we limit everything to 10
concurrent builds, this comes down to about $730/month, which might be
competitive with Travis while (hopefully) avoiding some of the problems we’ve
experienced.
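For anyone who wants to play with the numbers, the arithmetic behind these estimates can be sketched as below. The hourly rates are the ones quoted above; the 730 hours/month average and the `monthly_cost` helper are my own assumptions for illustration. It ignores S3 storage, data transfer, and the extra time lost to re-running interrupted work.

```python
HOURS_PER_MONTH = 730  # average month: 365 * 24 / 12

# Hourly prices quoted in the text for a c5.2xlarge (approximate).
SPOT_RATE = 0.10
ON_DEMAND_RATE = 0.34


def monthly_cost(hourly_rate, instances=1, hours=HOURS_PER_MONTH):
    """Cost in dollars of running `instances` machines for `hours` each."""
    return hourly_rate * hours * instances


if __name__ == "__main__":
    scenarios = [
        ("spot, 1 instance", SPOT_RATE, 1),
        ("spot, 10 instances", SPOT_RATE, 10),
        ("spot, 42 instances", SPOT_RATE, 42),
        ("on-demand, 1 instance", ON_DEMAND_RATE, 1),
    ]
    for label, rate, count in scenarios:
        print(f"{label}: ${monthly_cost(rate, count):,.0f}/month")
```

Since builds are bursty rather than continuous, the `hours` parameter is probably the most useful knob: instances that only run while there is work queued would cost proportionally less than these always-on figures.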
Human
Obviously, the human cost here is that we would need to maintain and run the CI
service. Hopefully, it would be simpler software because it doesn’t have to be
as general-purpose as Travis CI, but it’s still more software than we have to
maintain now.
Overall, I would estimate that the initial implementation would take a couple
of person-months total, and hopefully the maintenance/ops workload would be
pretty low after that. Perhaps others with more experience could give better
estimates?
Finally, since this would be a big experiment, there is the “unknown factor” –
the problems we don’t know about yet that come up along the way.
On the other hand, the human cost of dealing with spurious CI failures would
hopefully be relieved, which would be good.
Technical
The main technical cost, I think, would be the need to implement and maintain a
CI service. Also, integration with existing services, such as GitHub, bors, and
rust-highfive, might be less straightforward than it is now. However,
because the service doesn’t need to be general-purpose, I think it will not be
too complex an undertaking.
In exchange, we get all of the following technical improvements:
- Easier debugging (hopefully), since we own the code.
- More consistent performance => hopefully, fewer timeouts and flakes.
- More control over performance:
  - Can control VM overcommitment and scheduling.
  - Can control instance types.
  - Can use different instance types for CI and bors.
- More control over caching.
- With spot instances, can take advantage of incremental compilation to do
  interruptible builds.
Related discussions
There have been some previous discussions of our CI processes. A quick search brings up the following: