The Rust Infrastructure team is evaluating whether it’s worth migrating away from Travis CI in the near future. This discussion started in the autumn of 2018, after a summer full of serious Travis CI issues and outages (more on that below), and the rest of the year was still bad enough that we’re considering migrating to another CI platform.
We’re going to discuss this at the Rust All Hands, taking place next week in Berlin. We’re researching alternative CI platforms on our own, but we’ll likely miss some of them, so please suggest the ones you know or use here in the thread (see our requirements below)! We’re looking forward to reading through your suggestions, and we will consider them when making a decision!
Please note that, even though we’re an open source project, we pay a lot of money to Travis CI and we would like to make sure that we’re getting the best value for money. If an alternative CI platform requires us to pay, that’s fine.
– The Rust Infrastructure team.
What problems did we have with Travis CI
Still present
- Sometimes scripts included by Travis in the build fail to execute, causing an otherwise-good build to spuriously fail. We don’t have any control over those scripts, so we can’t prevent those failures. There is no real tracking issue for this; it just happens sometimes.
Resolved problems since May 2018
- 2018-05-24 → 2018-12-18: Broken networking on Docker at ~6:30 UTC
- Caused a spurious failure mostly every day: basically a cronjob was running on the images that disabled IPv4 forwarding everyday at around 6:30 UTC.
- It took Travis CI 208 days to investigate and resolve this issue, despite repeated pings both in the issue tracker and on the support email.
- 2018-06-08: Travis CI outage that prevented any build from starting
- 2018-07-25 → 2018-07-26: macOS builders skipped due to a bug in the configuration parser
- Travis updated their configuration parser that day, but a bug in it ignored jobs in the build matrix if they used a specific `if:` syntax. Since we were using that syntax for our macOS jobs, they were ignored.
- This caused PRs to land without actually being tested on macOS, a Tier 1 platform. We noticed it since users reported a missing beta on macOS.
- 2018-07-26: Travis “lost” our macOS images
- According to Travis Support, along with the previous incident they “lost” our macOS image. We use a custom image called `xcode9.3-moar` for our builds, which gives us more cores.
- 2018-07-27 → 2018-08-24: Spurious shutdowns of Travis CI builders in the middle of a build
- Basically a bug in their software marked our VMs as `TERMINATE` instead of `MIGRATE` when a hypervisor needed maintenance.
- It took Travis CI 28 days to deploy a fix for this issue, causing multiple failures a day.
- 2018-09-12 → 2018-09-13: Travis CI reduced our timeout from 3 hours to 50 minutes
- A refactoring of their software removed the piece of code that increased the job timeout for allowed repositories, including rustc. They then deployed a fix.
- It took Travis CI a day to deploy the fix, blocking the whole queue.
- 2018-10-04: Travis CI failed to generate build scripts
- A few builds spuriously failed due to “some network slowness inside our systems” (Travis Support). The rate of spurious failures decreased after we reported it, and I don’t think we tracked it after that.
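The durations quoted in the list above can be double-checked from the dates themselves. This is only a minimal sketch: the `incidents` mapping and its labels are illustrative names made up for this example, not anything used by Travis CI or by us.

```python
from datetime import date

# Start and end dates as quoted in the incident list above.
# The keys are illustrative labels chosen for this sketch.
incidents = {
    "docker-networking": (date(2018, 5, 24), date(2018, 12, 18)),
    "spurious-shutdowns": (date(2018, 7, 27), date(2018, 8, 24)),
    "reduced-timeout": (date(2018, 9, 12), date(2018, 9, 13)),
}

# Number of days between the start and the end of each incident.
durations = {
    name: (end - start).days for name, (start, end) in incidents.items()
}

for name, days in durations.items():
    print(f"{name}: {days} day(s)")
```

This gives 208 days for the Docker networking issue, 28 days for the spurious shutdowns and 1 day for the reduced timeout, matching the figures in the list.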
The migration to travis-ci.com
Due to the GitHub Services sunset happening on January 31st, Travis CI was forced to migrate existing repositories away from GitHub Services, and they initially decided to do that at the same time as the migration of public repositories from travis-ci.org to their unified platform on travis-ci.com. (Note: that decision was reverted, and repos on .org are migrated to webhooks.)
The way they handled the migration was far from ideal: one month before the cutoff date there was still no notification about the migration, and the migration changed the way build results are reported to GitHub, so it required manual action from everyone with custom infrastructure based on Travis (like ours). We learned about the migration only because one infrastructure team member happened to notice the GitHub Services sunset and we asked Travis Support ourselves.
Adding to that, Travis Support gave us wrong information when we contacted them: they said encrypted secret variables were not migrated (while they were migrated perfectly fine), and they said there was no way for us to keep using travis-ci.org or commit statuses after January 31st, even though that’s now the plan for everyone who didn’t migrate.
Also, the migration process was rough (it was marked as beta, so that’s sort of expected, even though we were one month away from the cutoff…). Cronjobs were not migrated, and they were broken after we migrated them manually with the API (it turns out we hit a bug in their API). Branch protection had to be updated on every repository, and even today build badges are not working, since there is no redirect in place.
What requirements we have for a replacement
These are the requirements we have for a Travis CI replacement. We aren’t looking for an AppVeyor alternative at the moment.
Hard requirements
- The service must be operated by a company we can contact directly for support. Building and maintaining a CI system in a reliable way for a project as big as ours takes a lot of time, especially to test on macOS, and most of the Rust infrastructure team is not paid to work on infrastructure. If the service requires us to use our own servers, the maintenance work we have to do should be minimal.
- Support should be direct, prompt (as appropriate), and helpful.
- The service must provide both Linux and macOS machines. Windows support could be nice, but switching away from AppVeyor is not a priority for us.
- The service must allow us to increase timeouts and the available resources in the VMs.
- Anecdotally we need at least 4-core machines.
- 14 Windows + 5 Mac + 38 Linux = 57 current builders per PR
- The service must be able to build and execute Docker containers (or a comparable system for enabling a level of reproducible builds).
- The service should be somewhat established, in the sense that we won’t have to go looking for a new solution in a year.
Nice to have
- Evidence of usage on a project of a similar size to Rust (number of parallel builders, build length)
- The ability to hook custom hardware up alongside hosted builders. This would allow us to easily add CI for targets that might want to become Tier 1 in the future.
- The ability to log into the builders remotely to investigate spurious failures.
- The ability to easily run and debug builds locally (this is already mostly possible for the Docker-based Linux builds, but support for other platforms would be nice as well).
- The ability to share the same plan between multiple orgs (`rust-lang` and `rust-lang-nursery`)
- The ability to prioritize builds from a repo (like rustc) over builds from the other ones.
- Built-in caching support: rustc doesn’t use it, but if we migrate other repositories away from Travis CI we’re going to need it.
- Pay-for-what-you-use rather than a subscription model. We’re somewhat permanently under-utilizing capacity on Travis by design as we don’t want to get clogged, but it means we’re paying a bit more than we otherwise would.