Which CI platform should Rust use?

pietroalbini · February 2, 2019, 8:21am

The Rust Infrastructure team is evaluating whether it’s worth migrating away from Travis CI in the near future. This discussion started in the autumn of 2018, after a summer full of bad Travis CI issues and outages (more on that below), and the rest of the year was still bad enough that we’re considering migrating to another CI platform.

We’re going to discuss this at the Rust All Hands, taking place next week in Berlin. We’re researching alternative CI platforms on our own, but we’ll likely miss some of them so please suggest what you know or use here in the thread (look at the requirements we have below)! We’re looking forward to reading through your suggestions, and we will consider them when making a decision!

Please note that, even though we’re an open source project, we pay a lot of money to Travis CI and we would like to make sure that we’re getting the best value for money. If an alternative CI platform requires us to pay, that’s fine.

– The Rust Infrastructure team.

What problems did we have with Travis CI

Still present

Sometimes scripts included by Travis in the build fail to execute, causing an otherwise-good build to spuriously fail. We don’t have any control over those scripts, so we can’t prevent those failures. There is not really a tracking issue for this; it just happens sometimes.

Resolved problems since May 2018

2018-05-24 → 2018-12-18: Broken networking on Docker at ~6:30 UTC
- Caused a spurious failure almost every day: a cronjob running on the images disabled IPv4 forwarding every day at around 6:30 UTC.
- It took Travis CI 208 days to investigate and resolve this issue, despite repeated pings both in the issue tracker and on the support email.
2018-06-08: Travis CI outage that prevented any build from starting
2018-07-25 → 2018-07-26: macOS builders skipped due to a bug in the configuration parser
- Travis updated their configuration parser that day, but a bug in it ignored jobs in the build matrix if they used a specific if: syntax. Since we were using that syntax for our macOS jobs they were ignored.
- This caused PRs to land without actually being tested on macOS, a Tier 1 platform. We noticed it since users reported a missing beta on macOS.
2018-07-26: Travis “lost” our macOS images
- According to Travis Support, along with the previous incident they “lost” our macOS image. We use a custom image called xcode9.3-moar for our builds, which gives us more cores.
2018-07-27 → 2018-08-24: Travis CI builders spurious shutdowns in the middle of a build
- A bug in their software marked our VMs as TERMINATE instead of MIGRATE when a hypervisor needed maintenance.
- It took Travis CI 28 days to deploy a fix for this issue, causing multiple failures a day.
2018-09-12 → 2018-09-13: Travis CI reduced our timeout from 3 hours to 50 minutes
- A refactoring of their software removed the piece of code that increased the job timeout for allowed repositories, including rustc. They then deployed a fix.
- It took Travis CI a day to deploy the fix, blocking the whole queue.
2018-10-04: Travis CI failed to generate build scripts
- A few builds spuriously failed due to “some network slowness inside our systems” (Travis Support). The rate of spurious failures decreased after we reported it, and I don’t think we tracked it after that.
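
The durations quoted in the list above can be re-derived from the listed dates. A small sketch, using only the dates given in this post (the dictionary keys are just labels chosen here for illustration):

```python
from datetime import date

# Start and end dates copied from the incident list above.
# The keys are illustrative labels, not official incident names.
incidents = {
    "docker_networking": (date(2018, 5, 24), date(2018, 12, 18)),
    "spurious_shutdowns": (date(2018, 7, 27), date(2018, 8, 24)),
    "timeout_reduction": (date(2018, 9, 12), date(2018, 9, 13)),
}


def duration_days(start, end):
    """Number of days between the start date and the end date."""
    return (end - start).days


durations = {name: duration_days(s, e) for name, (s, e) in incidents.items()}

for name, days in durations.items():
    print(f"{name}: {days} days")
```

The first two values should match the 208 and 28 days mentioned in the list.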

The migration to `travis-ci.com`

Due to the GitHub Services sunset happening on January 31st, Travis CI was forced to migrate existing repositories away from GitHub Services, and they initially decided to do that at the same time as the migration of public repositories from travis-ci.org to their unified platform on travis-ci.com. (Note: that decision was reverted, and repos on .org are migrated to webhooks instead.)

The way they handled the migration was far from ideal: one month before the cutoff date there had been no notification about the migration, even though it changed the way build results are reported to GitHub and so required manual action from everyone with custom infrastructure based on Travis (like us). We learned about the migration when one infrastructure team member randomly noticed the GitHub Services sunset and we asked Travis Support ourselves.

Adding to that, Travis Support reported wrong information in the communications with them: they said encrypted secret variables were not migrated (while they were migrated perfectly fine), and they said there was no way for us to keep using travis-ci.org or commit statuses after January 31st, even though that’s now the plan for everyone who didn’t migrate.

Also, the migration process was rough (it was marked as beta, so it’s sort of expected, even though we were one month away from the migration…). Cronjobs were not migrated and were broken after manually migrating them with the API (turns out we hit a bug in their API), branch protection had to be updated on every repository and even today build badges are not working, since there is no redirect in place.

What requirements we have for a replacement

These are the requirements we have for a Travis CI replacement. We aren’t looking for an AppVeyor alternative at the moment.

Hard requirements

The service must be operated by a company we can contact directly for support. Building and maintaining a CI system in a reliable way for a project as big as ours takes a lot of time, especially to test on macOS, and most of the Rust infrastructure team is not paid to work on infrastructure. If the service requires us to use our own servers, the maintenance work we have to do should be minimal.
- Support should be direct, prompt (as appropriate), and helpful.
The service must provide both Linux and macOS machines. Windows support could be nice, but switching away from AppVeyor is not a priority for us.
The service must allow us to increase timeouts and the available resources in the VMs.
- Anecdotally we need at least 4-core machines.
- 14 Windows + 5 Mac + 38 Linux = 57 current builders per PR
The service must be able to build and execute Docker containers (or a comparable system for enabling a level of reproducible builds).
The service should be somewhat established, in the sense that we won’t have to go looking for a new solution in a year.

Nice to have

Evidence of usage on a project of a similar size to Rust (number of parallel builders, build length)
The ability to hook custom hardware up alongside hosted builders. This would allow us to easily add CI for targets that might want to become Tier 1 in the future.
The ability to log into the builders remotely to investigate spurious failures.
The ability to easily run and debug builds locally (this is already mostly possible for the docker-based Linux builds, but support for other platforms would be nice as well).
The ability to share the same plan between multiple orgs (rust-lang and rust-lang-nursery)
The ability to prioritize builds from a repo (like rustc) over builds from the other ones.
Built-in caching support: rustc doesn’t use it, but if we migrate other repositories away from Travis CI we’re going to need it.
Pay-for-what-you-use rather than a subscription model. We’re somewhat permanently under-utilizing capacity on Travis by design as we don’t want to get clogged, but it means we’re paying a bit more than we otherwise would.

ishitatsuyuki · February 2, 2019, 9:40am

TeamCity (hybrid solution)

Disclaimer: this is not a fully hosted solution, but still worth discussing.

TeamCity is an enterprise CI solution developed by JetBrains. As a paid product, it’s pretty feature-complete (the requirements listed should be well supported), and it is much more straightforward to configure than Jenkins. JetBrains offers their hosted instance to open-source projects (for free IIRC). As a case study, IntelliJ Rust is currently relying on the hosted TC instance. As a bonus, I have had a very positive experience with JetBrains support (prompt and professional).

As the CI server is managed, the main server will always be taken care of by a third party (JetBrains). The build agents they offer are very limited (one Ubuntu machine), and we will need to provision our own agent pool, which is why I call this hybrid.

The primary focus of the discussion would probably be on the work needed to manage the agents. The ideal situation is:

Linux and Windows agents are provisioned on cloud (supported by TC). We set up an image and things are supposed to work stably.
macOS agents are one-time provisioned and hopefully they don’t crash into an unrecoverable state. If something goes wrong, we can try recovering from a snapshot.

If this is the case, then TC would be a pretty feasible solution. The exact situation can vary quite a lot, and I’ll leave this up to the discussion at all hands.

crepererum · February 2, 2019, 11:24am

GitLab CI with GitLab Runner might be worth a shot, although it might require the infrastructure team to deploy the runners (esp. for macOS) themselves or find a company that does that (does anyone know someone doing that?). While this is quite simple for Linux, it’s slightly more effort for Mac, but as far as I can see this should be doable.

The neat thing is that if you control the runners, you get full control over scaling (but also the burden). Windows is also supported, as are interactive shells.

martinwoodward · February 2, 2019, 11:27am

Azure Pipelines

I work on the Azure DevOps team and our Azure Pipelines service could be a good home. We’d definitely love to have you, and while our regular open source offer is for 10 parallel build jobs across our hosted Mac, Linux and Windows pools, for a project as important to the community as rust-lang and rust-lang-nursery I’m very confident I’d be able to get Microsoft to cover the 60 parallel build jobs needed for free.

You configure Azure Pipelines with yet another YAML file (azurepipelines.yaml). Here are a few example projects if you want to take a look at how it is configured:

Rust Example: Juniper
C – multiplatform: LibGit2
Python: cpython
C#: Roslyn - the .NET compiler

The official help docs are here: https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started-yaml?view=vsts and we have some guidance on the conversion of complex travis builds here: https://docs.microsoft.com/azure/devops/pipelines/migrate/from-travis?view=azure-devops

I’d be able to get some folks on the team to help out with getting the projects set up if you wanted to give it a try. You could run in parallel for a bit and see if it works for you.

Let me know if folks have any questions or want any help.

domenkozar · February 2, 2019, 1:28pm

We're building Hercules CI, which is a CI for the Nix ecosystem. We're just opening up preview access and we would love to help with open source projects like Rust.

Hercules CI builds a Nix expression in your repository, so you have to provide one. The community's carnix tool helps with that.

Going into details about your requirements:

Hard requirements

The service must be operated by a company we can contact directly for support. Building and maintaining a CI system in a reliable way for a project as big as ours takes a lot of time, especially to test on macOS, and most of the Rust infrastructure team is not paid to work on infrastructure. If the service requires us to use our own servers, the maintenance work we have to do should be minimal.

The CI (scheduling and UI) is hosted. The build agents would be in your hands, easy to manage and deploy because they are disposable and we provide deployment configuration files.

Support should be direct, prompt (as appropriate), and helpful.

Since we're just starting out and don't have a lot of users, we'll mostly focus on helping everyone stabilize their CI. But we'd ask for a transition period to make sure everything works smoothly.

The service must provide both Linux and macOS machines. Windows support could be nice, but switching away from AppVeyor is not a priority for us.

Nix supports many platforms including Linux and macOS, but has no native Windows support. Hercules will support what Nix supports.

The service must allow us to increase timeouts and the available resources in the VMs.

Anecdotally we need at least 4-core machines.

14 Windows + 5 Mac + 38 Linux = 57 current builders per PR

Since agents are self-hosted, you can choose whatever configuration you want.

The service must be able to build and execute Docker containers (or a comparable system for enabling a level of reproducible builds).

We'd recommend using Nix to achieve reproducibility and avoiding containers, but you can build Docker containers with Nix even without using Docker.

The service should be somewhat established, in the sense that we won’t have to go looking for a new solution in a year.

Nix is a 15-year-old technology, but the CI will be fresh. It's written in Haskell and Elm, so we have high confidence.

Nice to have

Evidence of usage on a project of a similar size to Rust (number of parallel builders, build length)

Sadly, not yet.

The ability to hook custom hardware up alongside hosted builders. This would allow us to easily add CI for targets that might want to become Tier 1 in the future.

That's the default. We might add hosted builds later if there's a need.

The ability to log into the builders remotely to investigate spurious failures.

They are under your control. Moreover, Nix allows you to build everything locally in the same sandbox so there should be little to no difference between what runs on CI and on developer builds.

The ability to easily run and debug builds locally (this is already mostly possible for the docker-based Linux builds, but support for other platforms would be nice as well).

As said above

The ability to share the same plan between multiple orgs ( rust-lang and rust-lang-nursery )

Pricing is per seat, but we're leaning towards sharing the plan between orgs.

The ability to prioritize builds from a repo (like rustc) over builds from the other ones.

This is not yet built but is planned (not in the short term).

Built-in caching support: rustc doesn’t use it, but if we migrate other repositories away from Travis CI we’re going to need it.

That is already doable with Cachix and you can share all binaries with the world.

Pay-for-what-you-use rather than a subscription model. We’re somewhat permanently under-utilizing capacity on Travis by design as we don’t want to get clogged, but it means we’re paying a bit more than we otherwise would.

You could configure the agent machines to trigger autoscaling and thus minimize your build costs.

joshk · February 2, 2019, 1:41pm

Hi Pietro

Thank you for this write up. First off, I am very sorry for the outages and issues you and your team have run into. I feel your frustration, and would like to take this opportunity to reach out to see what we can do to rebuild trust in Travis CI.

My name is Josh, I’m the Head of Product at Travis CI, and one of the original co-founders. I remember working with the original Rust team that started using Travis CI to add larger VM support, custom timeouts, as well as making sure custom Mac images were available (as well as acceptable queue times). To hear of these issues you have experienced is, well, humbling.

Would you or anyone on your team be open to a phone call to go over some of these problems, and what we can do to improve your team’s experience? I would be happy to also share our upcoming roadmap, and see if there is anything we can modify to improve your team’s experience.

Regarding the migration to travis-ci.com, I think there has definitely been some miscommunication which I will need to investigate on our side, because there is a lot that doesn’t quite match up (I worked on the Service Hook -> Webhook migration).

You are more than welcome to reach me at josh@travis-ci.com

Have a great weekend

Josh

fkorotkov · February 2, 2019, 1:55pm

Cirrus CI

Disclaimer: I’m CTO of Cirrus Labs which develops and supports Cirrus CI.

Cirrus CI is a hosted CI which already supports Linux, Windows, macOS and FreeBSD (rust-lang/libc already uses Cirrus CI for FreeBSD).

With Cirrus CI you have an option to either bring your own infrastructure or rely on hosted infrastructure with per-second billing and no concurrency limits. With per-second billing you only pay for CPU and memory used during your tasks, with an option to customize them up to 30 CPUs and 90 GB of memory. SonarSource, for example, runs hundreds of integration tests in parallel for each change.

Cirrus CI meets all your hard requirements and most "nice to have"s. A few missing features are already on our roadmap. Would love to prioritize them for you.

We already have a great case of the Flutter project migrating fully to Cirrus CI to run CI for Linux, Windows and macOS. I can give you contacts of folks at Google if you want to ask them about their experience with migrating to Cirrus CI.

Let me know if you have any other questions.

EDIT: here is a PR with a proof of concept of Cirrus CI working for rust-lang/rust.

pietroalbini · February 2, 2019, 3:18pm

Thanks everyone for the great suggestions! We’re going to evaluate all options next week to figure out what’s the best for the Rust project moving forward!

@joshk thanks for your message – Travis has come a long way with us, and staying with it is certainly an option in our evaluations. Once we’ve discussed it all as a team, we’ll have a better understanding of the next steps in our “CI platform evaluation” process and will surely be back in touch.

rye · February 2, 2019, 3:33pm

CircleCI?

Disclaimer: I’ve had my own grievances with Circle, but overall their platform is well worth considering because I believe it at least meets the Hard Requirements. Being a college student and only having used their free plan, I also don’t have that much experience working at a Rust-sized scale, but I do believe they could meet these demands.

For what it’s worth, my team had many similar issues with Travis CI’s spotty maintenance and incident response, and while I give them a pass for being one of the largest supporters of Open Source by offering free builds, I think there are definitely some competitive options out there, especially for paying customers.

We don’t use anywhere near the number of workers that Rust would use, but CircleCI reduced our build times by over 50% coming from Travis, because Travis has much higher overhead on each job. In my time on CircleCI, we’ve had very few outages coming from CircleCI themselves—all of the service disruptions I’ve been affected by have been on the side of upstream providers such as GitHub API issues. Anecdotally, a job that previously took 1 minute on Travis to run 5 seconds of tests took 7 seconds on CircleCI. The speed of Circle’s job runners also means that any backlogs take less time to disappear after an outage.

For what it’s worth, CircleCI:

Is a Docker-first platform, and is almost entirely based on Docker. Workers are Docker containers (except on macOS) and start almost instantly as a result, and overhead is quite low. Paid plans would almost certainly support intermediate layer caching.
Definitely meets the requirement for having direct/premium support, and is hosted by them, but also has a self-hosted option.
- They have a sales team who would hopefully be able to answer your questions and provide concrete quotes.
Has macOS and Linux workers available, and payment scales to capacity as set on the account.
- I believe the default resource constraints are 2CPUs x 4 GB, but the resource_class setting (available by support request) allows for scaling up to 8CPUs x 16 GB.
- Capacity is determined on the organization level, i.e. you set up a certain number of workers for the organization and those workers are always available right when the build starts.

As such, I think Circle at least covers the hard requirements. To the nice-to-haves, it’s worth asking sales about some of the things, but there are a few requirements that I can say are pretty much satisfied:

The ability to log into the builders remotely to investigate spurious failures is built-in. One can re-run jobs with SSH and use any SSH keys associated with their GitHub account to log in.
The ability to easily run and debug builds locally is also built-in—there is a CLI which can be used for config validation.
Built-in caching support is also available. There is support for both shorter-term “caching” and longer-term “artifact” storage. (The former primarily for sharing between builds, the latter more for downloadable artifacts.)
Pay-for-what-you-use might be possible. Scaling resource constraints is certainly easier on Circle than it is on Travis, from my experience.

There are a large number of things that my team have found really nice, such as each commit getting multiple commit statuses on GitHub, so you can explicitly click on the “Details” and get right to the build, or mark certain job passes as “required” for branches to be mergeable. I think these things are configurable, if those are not wanted.

There are some cons, though.

Circle’s web UI isn’t the most intuitive, nor is their configuration documentation bulletproof. They’re improving both of these things, but in their current forms they aren’t perfect.
I’ve encountered some performance issues with loading logs from builds with lots of output, in that it just takes time to load the logs.

nelsonjchen · February 2, 2019, 5:14pm

Google Cloud Build for just Linux containers? It might haul very much.

May I suggest taking a look at Google Cloud Build for the Linux jobs as some sort of side CI? They support 32 vCPU machines and do per-second billing. Unfortunately, they seem to have a limit of 10 jobs running concurrently, but it seems to imply that more can be requested. Imagine each build of a commit spinning up 38*32=1216 vCPUs for those Linux jobs.

With the remote-builder pattern, if you targeted the largest instances, you could probably target 96 vCPU instances out of the box or about 960 vCPU with the out of the box limit. If they approve beyond 10, 38*96=3648 vCPU per build.
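
For reference, the vCPU figures above are simple products of the job counts and machine sizes mentioned in this thread. A quick sketch (the constant and function names are only for illustration):

```python
# Figures taken from the posts above; names are illustrative only.
LINUX_JOBS = 38           # Linux builders per PR
DEFAULT_CONCURRENCY = 10  # out-of-the-box limit on concurrent jobs
STANDARD_VCPUS = 32       # 32 vCPU machines mentioned for Cloud Build
REMOTE_VCPUS = 96         # 96 vCPU instances via the remote-builder pattern


def total_vcpus(jobs, vcpus_per_job):
    """Total vCPUs in use when `jobs` run at once on same-sized machines."""
    return jobs * vcpus_per_job


all_linux_standard = total_vcpus(LINUX_JOBS, STANDARD_VCPUS)
default_limit_remote = total_vcpus(DEFAULT_CONCURRENCY, REMOTE_VCPUS)
all_linux_remote = total_vcpus(LINUX_JOBS, REMOTE_VCPUS)

print(all_linux_standard, default_limit_remote, all_linux_remote)
```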

I wouldn’t use them as a main platform though. At work, I have additionally been using them for thousands of generated and embarrassingly parallel unit tests which I’ve moved off the main platform, and I’ve been pretty happy. But for the more traditional tests I’ve been running them on stuff with more friendly UIs.

It’d be nice to at least take a look, either to rule it out or to see if it can be useful in some other capacity, like a faster build-and-test canary for some other CI system like Azure, which is limited to 2 vCPU, or other CI systems topping out at 8 or 30 vCPU.

alexcrichton · February 2, 2019, 6:39pm

Thanks for posting this and gathering all the data here @pietroalbini, it’s very much appreciated!

I wanted to expand a bit on the hard requirement about the size of CI. The builds for rust-lang/rust are especially intensive and are pretty abnormal compared to many other open source projects. Our build workflow means that we have a queue of PRs to merge, each of which is tested one at a time, serially. The entire test suite is run against each candidate to merge: if it’s green we merge the PR, and if anything fails we move on to the next PR in the queue.

A full test suite runs 57 concurrent jobs. The jobs in aggregate test a wide variety of things like:

Platform support, being spread across Windows/Mac/Linux
Literal unit tests, across a good number of configurations
Distribution builds. Each PR produces artifacts as if it were a full Rust nightly release, so we do tons of cross-compiling and production of compilers

Our hard time limit for each of these 57 jobs is three hours. We shoot for and prefer all jobs to be under 2 hours, although we rarely get there and typically sit at 2h15m or 2h30m. This means that 24/7 we’re attempting to merge PRs (serially), and each PR runs 57 different 2-3 hour builds in parallel. These builds don’t all finish at the same time: the quickest tend to finish in 1h15m ish and the slowest extend to 2h30m. We currently use the free time on Travis to basically allow all our other rust-lang and rust-lang-nursery repositories to have ambitious CI configurations as well (aka libc, stdsimd, clippy, etc all have quite an expansive matrix too). Additionally, each PR to the rust-lang/rust repository runs one of the 57 builds as a smoke check, and it typically takes 1h30m ish.
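
To put rough numbers on that, here is a back-of-the-envelope sketch built only from the figures above. It assumes an idealized queue with no gaps between merge attempts and no scheduling overhead, so the results are upper or lower bounds rather than measurements:

```python
JOBS_PER_ATTEMPT = 57     # concurrent jobs per merge attempt
FASTEST_JOB_HOURS = 1.25  # "1h15m ish"
SLOWEST_JOB_HOURS = 2.5   # "2h30m"


def max_attempts_per_day(slowest_job_hours):
    """Upper bound on serial merge attempts per day.

    PRs are tested one at a time, so an attempt lasts as long as its
    slowest job. Assumes the next attempt starts immediately.
    """
    return 24 / slowest_job_hours


def builder_hours_bounds(jobs, fastest_hours, slowest_hours):
    """Lower and upper bound on builder-hours consumed by one attempt.

    Every job is assumed to take between the fastest and slowest duration.
    """
    return jobs * fastest_hours, jobs * slowest_hours


attempts = max_attempts_per_day(SLOWEST_JOB_HOURS)
low, high = builder_hours_bounds(
    JOBS_PER_ATTEMPT, FASTEST_JOB_HOURS, SLOWEST_JOB_HOURS
)

print(f"at most {attempts:.1f} merge attempts per day")
print(f"between {low} and {high} builder-hours per attempt")
```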

Our CI is clearly quite slow, and we’re always looking for opportunities to make it faster. The easiest way to make it faster is to throw more CPU at it, either more cores or faster ones. The difficulty cliff is very sharp after that, as we’re constantly looking for things to cache or improve to make CI faster. All-in-all, other than CPU we really aren’t going to get that much faster than we are today (definitely not an order of magnitude at least).

To that end conventional tips/tricks to make builds faster on CI are typically not that relevant to us. At our scale (multi-hour builds instead of a few minutes/seconds) most tricks can shave off a minute or so and don’t impact queue time in general.

In any case this is largely food for thought! We’d ideally evaluate builds and build times against a CI service to find a good fit for machine size and such. But that’ll all come much later, depending on what direction we take!

SimonSapin · February 2, 2019, 8:03pm

Taskcluster

Taskcluster fails the first hard requirement, but might deserve some discussion for completeness.

Taskcluster is Mozilla’s response to Firefox’s testing needs outgrowing Buildbot. It can be thought of as a loose collection of hosted services with APIs (Queue, Index, Secrets, …) and pieces of software (API client libraries in various languages, worker agents, …) that you can put together to build any CI system. It is designed to be very generic and “self-service” in order to enable unanticipated use cases, but the flip side of that is that anything non-trivial takes some work to put in place.

I’ve spent a few months migrating Servo from Buildbot to Taskcluster. (A few jobs are still on Buildbot, but all the infrastructure-level pieces are there to support them.) For more see the tracking issue as well as the scripts and READMEs starting here. A part of that time was me learning about and experimenting with TC (and ops things in general), but I think another was because Servo was possibly the first project of this scale other than Firefox to use TC. I think a lot of that can be reused. For example the generic decisionlib.py is separated from the Servo-specific decision_task.py script that uses it.

Some aspects of Servo + TC that might be unusual in off-the-shelf CI systems:

Testing each PR starts with a “decision task” that runs a script from the repository (so it can be modified in the same commit that is being tested). The script uses the API to schedule other tasks with an arbitrary dependency graph.
Docker images are built on demand from Dockerfiles in the repository (again so that they can be modified in the same commit that uses the modifications) and cached.
A single building task can produce an executable that is then used by many testing tasks in parallel.
Automatic provisioning and scaling (based on queue) of AWS EC2 instances of any type, with any system image. (I think support for other cloud providers is being added.)
Bring your own hardware: any machine with appropriate credentials can pick tasks from the queue. Servo does this with non-virtualized machines from Packet (in order to run Android emulators with KVM for CPU acceleration) and macOS workers from MacStadium.
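
To illustrate the “decision task” idea from the list above (one build feeding many parallel test tasks through a dependency graph), here is a minimal sketch. It does not use the Taskcluster API or Servo's decisionlib.py; the task names are hypothetical, and it only shows how a dependency graph determines which tasks can run in parallel, using Python's standard graphlib module (Python 3.9+):

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
task_graph = {
    "build-docker-image": set(),
    "build-executable": {"build-docker-image"},
    "unit-tests": {"build-executable"},
    "integration-tests-1": {"build-executable"},
    "integration-tests-2": {"build-executable"},
}


def schedule_in_waves(graph):
    """Group tasks into waves; all tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # also raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule_in_waves(task_graph)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In a real system tasks would start as soon as their own dependencies finish rather than in lockstep waves; the waves are just a simple way to make the parallelism visible.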

TC is also moving from being hosted (there is a single Queue service in the world, operated by Mozilla) to “shipped” software (anyone can deploy their independent instance). Of course, everything is open-source.

I think that TC can meet every requirement and nice-to-haves, except the first one: there is no company you can contract for Taskcluster support that I know of. People on Mozilla’s Taskcluster team are generally helpful when asked for help on IRC, but that’s not enough at this scale. For a TC-based CI system to be viable for Rust, there needs to be someone whose paid job (or at least part of it) is to build and maintain this system.

Then a question perhaps more tricky than money for their salary is what company can provide the legal structure to have that person as an employee.

djc · February 2, 2019, 9:45pm

Note that there has been some extensive discussion of downsides of Azure DevOps recently:

In particular, note that people have had trouble building working pipelines with in-repository configuration, and that they instead had to resort to the GUI editor. My teammate who worked on this over the past few weeks got very frustrated trying to get a pipeline going on DevOps -- he did get it working after a few weeks, using the GUI only -- which seems not ideal in terms of having good history and review of pipeline changes.

(I also didn't like the non-standard authentication options Azure comes with: no standard TOTP 2FA -- you have to download MS's custom 2FA app -- and no Ed25519 SSH keys.)

thedataking · February 2, 2019, 10:16pm

Just to offer a counterpoint: I’m currently setting up Azure Pipelines to test our C2Rust translator (repo: https://github.com/immunant/c2rust, branch feature/azure-pipelines) using in-repository configuration and I’ve been pleasantly surprised by the short lag between the git push and the build kicking off. Martin and others have also reached out to offer help, so while I realize this is just one data point and that others may have problems, so far our experience has been positive.

skade · February 2, 2019, 11:46pm

@joshk @pietroalbini @alexcrichton Given that all-hands is next week, I wanted to point out again that the TravisCI Berlin office is roughly a 15 minutes walk away.

Maybe it would be a good time to spend some time dissecting a couple of the issues.

TBH: many of the issues, while certainly grave, read like stuff I’ve experienced at many providers, so a switch might not improve things.

@SimonSapin: I don’t see how the legal structure for a person doing CI for Rust is a problem. We have multiple companies around the project that would probably help out.

SimonSapin · February 3, 2019, 12:09am

@skade Maybe it’s not actually a problem and one of those companies would be willing to have a full-time employee who works on Rust rather than that company’s own projects directly, and keep them long-term. In that case, great!

Anyway, that’s not the direction the Infra team seems to be interested in at the moment. So this is rather hypothetical.

LilianMoraru · February 3, 2019, 12:09am

Most of the complaints there seem to be about the UI (limited configurability) and not about the CI service in general (I did not see any CI service/stability-related issues, actually).
Somebody did mention not being able to use submodules which, if true, would be a serious issue, especially for rust-lang/rust, since it uses git submodules.

plasticine · February 3, 2019, 2:26am

Buildkite

I’m a product engineer at Buildkite currently working on making our open source support awesome — I think Buildkite could be a great fit for Rust’s CI requirements, and we’d love to have you!

Buildkite has been very successfully providing CI for many large software teams since ~2014. Beyond commercial stuff, we also offer free and unlimited accounts for open source, academia and non-profits.

Some notable open source teams using us publicly are Bazel, D Lang, and Angular, as well as many others using us privately (we’ve just very recently released support for publicly viewable pipelines).

We put a lot of effort into the level of support we offer to all our users, and pride ourselves on our documentation.

Buildkite has a bit of a different operational model to most other CI/CD/automation solutions which is worth me quickly touching on; we don’t do opaque, managed compute to run workloads. Instead we provide a lightweight agent binary, the Buildkite Agent which can run pretty much anywhere (The Linuxes, OSX, Windows, Docker, etc). This means that you have complete control over how, where, and on what infra your workloads run.

In practice we find that for most teams the operational overhead of this approach is worth the increased flexibility and control it gives. We’ve also put a lot of effort into providing tooling that lightens any infra burden, like our AWS Elastic CI Stack.

We’re actively ramping up our tooling around supporting OSS projects running on Buildkite. We shipped public pipeline support only just last week, but that was just the first step, and we’ve got a lot more coming. We’re very keen to engage with projects big and small in the OSS community and try and support them however we can.

If you have any questions please feel free to reach out — justin@buildkite.com

Cheers — have a good one!

Justin

parched · February 4, 2019, 9:06am

Codefresh

I started experimenting with Codefresh at the end of last year, specifically for their Arm support, in the hope it might be useful for getting some of the Arm targets to tier-1. It’s a Docker-based solution where you can build and run Docker containers, and it is very nice to use in that regard. It will fail the macOS requirement though, so something else would still be needed for that. They provide hosted hardware, but I believe it can also run on your own hardware.

You can see the pipeline I set up for Rust to run x.py test --stage 2 for aarch64-unknown-linux-gnu and x.py build --stage 1 for armv7-unknown-linux-gnueabihf here.

It has some quite nice caching abilities. A volume is persisted between builds, which I set up to keep the git repo, the ccache cache and build/cache. It also uses the Docker engine caching, so images aren’t rebuilt unless they change.

I have found their support to be the best of any software I have used, always there to help and fix any of the problems I was having. I can’t really comment on the reliability though, as I haven’t used it long enough.

They have also said to me “We would love to support the Rust project!”

gnzlbg · February 5, 2019, 2:54pm

If we had more CI throughput, instead of trying to merge PRs serially, we could attempt to merge N PRs at a time, such that, if N-1 PRs fail to merge, but 1 succeeds, we still reduce the queue. That would require Nx the number of concurrent jobs that we have today.

Also, currently, many of the merges we attempt are rollups that aggregate multiple PRs submitted by collaborators. This allows us to advance the queue by multiple PRs at a time instead of one PR at a time, delivering big wins. However, if a single PR in a group of M PRs fails, no PR is merged. If we had (M+1)x as many concurrent jobs, we could test the full group and each of its M subsets of M-1 PRs in parallel, so that if only 1 PR in the group fails, the other M-1 PRs are still merged.
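To make the M+1 idea concrete, here is a minimal sketch in Python. It is a toy model and not how bors or Homu actually work: the function names `passes` and `best_mergeable` are hypothetical, and it assumes a candidate set passes CI exactly when it contains no failing PR (real failures can also come from interactions between PRs). It builds the full rollup plus every leave-one-out subset, which is M+1 candidates that a real system would test concurrently, and picks the largest one that passes.

```python
from typing import List, Sequence, Set


def passes(candidate: Sequence[int], failing: Set[int]) -> bool:
    """Toy CI check (assumption): a candidate passes iff it has no failing PR."""
    return not any(pr in failing for pr in candidate)


def best_mergeable(rollup: Sequence[int], failing: Set[int]) -> List[int]:
    """Return the largest passing candidate among the M+1 tested sets.

    The candidates are the full rollup and each subset that leaves out
    exactly one PR. A real system would run these M+1 jobs in parallel;
    here they are simply evaluated one after another.
    """
    candidates = [list(rollup)]
    for skip in range(len(rollup)):
        candidates.append(list(rollup[:skip]) + list(rollup[skip + 1:]))
    passing = [c for c in candidates if passes(c, failing)]
    # If two or more PRs fail, no candidate passes and nothing is merged.
    return max(passing, key=len, default=[])


# Rollup of M = 4 PRs where only PR 103 is broken: the other 3 still merge.
print(best_mergeable([101, 102, 103, 104], {103}))  # [101, 102, 104]
```

Note that this scheme only rescues a rollup with exactly one bad PR. With two or more failures, every leave-one-out subset still contains a failing PR, so the whole group is rejected just as it is today.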

As you mention, the best way to reduce the 3h latency that we have on modifying master is probably to use faster CPUs. However, we could still significantly improve how much we reduce the queue per 3h cycle if we had a higher CI throughput with a much higher number of concurrent jobs (10-100x more concurrent jobs).

