Which CI platform should Rust use?

Cirrus CI

Disclaimer: I’m CTO of Cirrus Labs which develops and supports Cirrus CI.

Cirrus CI is a hosted CI which already supports Linux, Windows, macOS and FreeBSD (rust-lang/libc already uses Cirrus CI for FreeBSD).

With Cirrus CI you have an option to either bring your own infrastructure or rely on hosted infrastructure with per-second billing and no concurrency limits. With per-second billing you only pay for the CPU and memory used during your tasks, with an option to customize them up to 30 CPUs and 90 GB of memory. SonarSource, for example, runs hundreds of integration tests in parallel for each change.

Cirrus CI meets all your hard requirements and most "nice to have"s. A few missing features are already on our roadmap. Would love to prioritize them for you.

We already have a great case study in the Flutter project migrating fully to Cirrus CI to run CI for Linux, Windows and macOS. I can give you contacts of folks at Google if you want to ask them about their experience with migrating to Cirrus CI.

Let me know if you have any other questions.

EDIT: here is a PR with a proof of concept of Cirrus CI working for rust-lang/rust.

15 Likes

Thanks everyone for the great suggestions! We’re going to evaluate all the options next week to figure out what’s best for the Rust project moving forward!

@joshk thanks for your message – Travis has come a long way with us, and staying with it is certainly an option in our evaluations. Once we’ve discussed it all as a team, we’ll have a better understanding of the next steps in our “CI platform evaluation” process and will surely be back in touch.

6 Likes

CircleCI?

Disclaimer: I’ve had my own grievances with Circle, but overall their platform is well worth considering because I believe it at least meets the Hard Requirements. Being a college student and only having used their free plan, I also don’t have that much experience working at a Rust-sized scale, but I do believe they could meet these demands.

For what it’s worth, my team had many similar issues with Travis CI’s spotty maintenance and incident response, and while I give them a pass for being one of the largest supporters of Open Source by offering free builds, I think there are definitely some competitive options out there, especially for paying customers.

We don’t use anywhere near the number of workers that Rust would use, but CircleCI reduced our build times by over 50% coming from Travis, because Travis has much higher overhead on each job. In my time on CircleCI, we’ve had very few outages coming from CircleCI themselves—all of the service disruptions I’ve been affected by have been on the side of upstream providers such as GitHub API issues. Anecdotally, a job that previously took 1 minute on Travis to run 5 seconds of tests took 7 seconds on CircleCI. The speed of Circle’s job runners also means that any backlogs take less time to disappear after an outage.

For what it’s worth, CircleCI:

  • Is a Docker-first platform: workers are Docker containers (except on macOS), they start almost instantly as a result, and per-job overhead is quite low. Paid plans would almost certainly support intermediate layer caching.
  • Definitely meets the requirement for having direct/premium support, and is hosted by them, but also has a self-hosted option.
    • They have a sales team who would hopefully be able to answer your questions and provide concrete quotes.
  • Has macOS and Linux workers available, and payment scales to capacity as set on the account.
    • I believe the default resource constraints are 2 CPUs × 4 GB, but the resource_class setting (available by support request) allows for scaling up to 8 CPUs × 16 GB.
    • Capacity is determined on the organization level, i.e. you set up a certain number of workers for the organization and those workers are always available right when the build starts.

As such, I think Circle at least covers the hard requirements. As for the nice-to-haves, it’s worth asking sales about some of them, but there are a few that I can say are pretty much satisfied:

  • The ability to log into the builders remotely to investigate spurious failures is built-in. One can re-run jobs with SSH and use any SSH keys associated with their GitHub account to log in.
  • The ability to easily run and debug builds locally is also built-in: there is a CLI which can be used to validate configuration and run jobs locally.
  • Built-in caching support is also available. There is support for both shorter-term “caching” and longer-term “artifact” storage. (The former primarily for sharing between builds, the latter more for downloadable artifacts.)
  • Pay-for-what-you-use might be possible. Scaling resource constraints is certainly easier on Circle than it is on Travis, from my experience.

There are a large number of things that my team has found really nice, such as each commit getting multiple commit statuses on GitHub, so you can click on “Details” and get right to the relevant build, or mark certain jobs as “required” for branches to be mergeable. I think these things are configurable if they are not wanted.

There are some cons, though.

  • Circle’s web UI isn’t the most intuitive, nor is their configuration documentation bulletproof. They’re improving both of these things, but in their current forms they aren’t perfect.
  • I’ve encountered some performance issues with loading logs from builds with lots of output, in that it just takes time to load the logs.

7 Likes

Google Cloud Build for just the Linux containers? It might really haul.

May I suggest taking a look at Google Cloud Build for the Linux jobs, as some sort of side CI? They support 32 vCPU machines and do per-second billing. Unfortunately, they seem to have a limit of 10 concurrently running jobs, though it sounds like more can be requested. Imagine each build of a commit spinning up 38*32 = 1216 vCPUs across those Linux jobs. :crazy_face:

With the remote-builder pattern, if you targeted the largest instances, you could probably get 96 vCPU instances out of the box, or about 960 vCPUs total with the default limit of 10 concurrent jobs. If they approve going beyond 10, that’s 38*96 = 3648 vCPUs per build. :laughing:

I wouldn’t use them as a main platform though. At work I have additionally been using them for thousands of generated, embarrassingly parallel unit tests that I’ve moved off the main platform, and I’ve been pretty happy. But for the more traditional tests I’ve been running on services with friendlier UIs.

It would be nice to at least take a look at it to rule it in or out, or to see if it could be useful in some other capacity, like as a faster build-and-test canary for another CI system such as Azure (limited to 2 vCPUs) or the other CI systems topping out at 8 or 30 vCPUs.

1 Like

Thanks for posting this and gathering all the data here @pietroalbini, it’s very much appreciated!

One thing I wanted to expand a bit on is the hard requirement around the size of our CI. The builds for rust-lang/rust are especially intensive and pretty abnormal compared to many other open source projects. Our build workflow means that we have a queue of PRs to merge, each of which is tested serially, one at a time. The full test suite is run against each candidate; if it’s green we merge the PR, and if anything fails we move on to the next PR in the queue.

A full test suite run consists of 57 concurrent jobs. In aggregate, the jobs test a wide variety of things, like:

  • Platform support, being spread across Windows/Mac/Linux
  • Literal unit tests, across a good number of configurations
  • Distribution builds. Each PR produces artifacts as if it were a full Rust nightly release, so we do tons of cross-compiling and production of compilers

Our hard time limit for each of these 57 jobs is three hours. We shoot for all jobs to be under 2 hours, although we rarely get there and typically sit at 2h15m or 2h30m. This means that 24/7 we’re attempting to merge PRs (serially), and each PR runs 60 different 2-3 hour builds in parallel. These builds don’t all finish at the same time; the quickest tend to finish in about 1h15m and the slowest extend to 2h30m. We currently use the free time on Travis to allow all our other rust-lang and rust-lang-nursery repositories to have ambitious CI configurations as well (libc, stdsimd, clippy, etc. all have quite an expansive matrix too). Additionally, each PR to the rust-lang/rust repository runs one of the 57 builds as a smoke check, which typically takes about 1h30m.
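
To put those numbers in perspective, here is a rough back-of-the-envelope sketch (the per-job average below is an assumption, not a measured figure):

```python
# Rough scale of a single merge attempt on rust-lang/rust.
# Assumed figures: 57 parallel jobs, ~2.5 hours per job on average,
# and attempts running one at a time around the clock.
jobs_per_attempt = 57
avg_hours_per_job = 2.5        # jobs range from roughly 1h15m to 2h30m
wall_clock_per_attempt = 2.5   # the attempt is gated by its slowest job

machine_hours = jobs_per_attempt * avg_hours_per_job
attempts_per_day = 24 / wall_clock_per_attempt

print(f"~{machine_hours:.0f} machine-hours of compute per merge attempt")
print(f"at most ~{attempts_per_day:.1f} merge attempts per day")
```

In other words, a single attempt burns on the order of 140 machine-hours, and the serial queue can land at most nine or ten attempts per day even when everything is green.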

Our CI is clearly quite slow, and we’re always looking for opportunities to make it faster. The easiest way to make it faster is to throw more CPU at it, either more cores or faster ones. The difficulty cliff is very sharp after that, as we’re constantly looking for things to cache or improve to make CI faster. All in all, other than CPU we really aren’t going to get that much faster than we are today (definitely not an order of magnitude faster, at least).

To that end conventional tips/tricks to make builds faster on CI are typically not that relevant to us. At our scale (multi-hour builds instead of a few minutes/seconds) most tricks can shave off a minute or so and don’t impact queue time in general.

In any case this is largely food for thought! We’d ideally evaluate builds and build times against a CI service to find a good fit for machine size and such. But that’ll all come much later, depending on what direction we take!

15 Likes

Taskcluster

Taskcluster fails the first hard requirement, but might deserve some discussion for completeness.

Taskcluster is Mozilla’s response to Firefox’s testing needs outgrowing Buildbot. It can be thought of as a loose collection of hosted services with APIs (Queue, Index, Secrets, …) and pieces of software (API client libraries in various languages, worker agents, …) that you can put together to build any CI system. It is designed to be very generic and “self-service” in order to enable unanticipated use cases, but the trade-off is that anything non-trivial takes some work to put in place.

I’ve spent a few months migrating Servo from Buildbot to Taskcluster. (A few jobs are still on Buildbot, but all the infrastructure-level pieces are there to support them.) For more, see the tracking issue as well as the scripts and READMEs starting here. Part of that time was me learning about and experimenting with TC (and ops things in general), but I think another part was that Servo was possibly the first project of this scale other than Firefox to use TC. I think a lot of that work can be reused. For example, the generic decisionlib.py is separated from the Servo-specific decision_task.py script that uses it.

Some aspects of Servo + TC that might be unusual in off-the-shelf CI systems:

  • Testing each PR starts with a “decision task” that runs a script from the repository (so it can be modified in the same commit that is being tested); that script uses the API to schedule other tasks with an arbitrary dependency graph (see the sketch after this list).
  • Docker images are built on demand from Dockerfiles in the repository (again, so they can be modified in the same commit that uses the modifications) and cached.
  • A single building task can produce an executable that is then used by many testing tasks in parallel.
  • Automatic provisioning and scaling (based on queue) of AWS EC2 instances of any type, with any system image. (I think support for other cloud providers is being added.)
  • Bring your own hardware: any machine with appropriate credentials can pick tasks from the queue. Servo does this with non-virtualized machines from Packet (in order to run Android emulators with KVM for CPU acceleration) and macOS workers from MacStadium.
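
To make the “decision task” model above a bit more concrete, here is a minimal, hypothetical Python sketch of the shape such a script can take. The image names, commands, and the trimmed-down set of task fields are invented for illustration only; a real decision task (like Servo’s decision_task.py built on decisionlib.py) submits complete task definitions to the Queue API:

```python
# Hypothetical sketch of a "decision task" script: it lives in the repository,
# so the task graph can be changed in the same commit that is being tested.
import json
from uuid import uuid4

def new_task(name, command, image, depends_on=()):
    """Build a simplified, Taskcluster-style task definition as a plain dict."""
    return {
        "taskId": str(uuid4()),
        "metadata": {"name": name},
        "payload": {"image": image, "command": command},
        # Arbitrary dependency graph: this task only starts once these finish.
        "dependencies": [t["taskId"] for t in depends_on],
    }

# One build task produces an executable...
build = new_task("linux-release-build", ["./build-release.sh"],
                 image="ci-build-image:latest")

# ...which many test tasks then consume in parallel.
tests = [
    new_task(f"test-chunk-{i}", ["./run-tests.sh", f"--chunk={i}"],
             image="ci-test-image:latest", depends_on=[build])
    for i in range(1, 5)
]

# A real decision task would submit each definition to the Taskcluster Queue;
# here we just print the resulting graph.
for task in [build] + tests:
    print(json.dumps(task, indent=2))
```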

TC is also moving from being hosted (there is a single Queue service in the world, operated by Mozilla) to “shipped” software (anyone can deploy their independent instance). Of course, everything is open-source.

I think that TC can meet every requirement and nice-to-have, except the first one: there is no company you can contract for Taskcluster support, as far as I know. People on Mozilla’s Taskcluster team are generally helpful when asked for help on IRC, but that’s not enough at this scale. For a TC-based CI system to be viable for Rust, there would need to be someone whose paid job (or at least part of it) is to build and maintain that system.

Then a question perhaps trickier than finding the money for their salary is which company could provide the legal structure to have that person as an employee.

9 Likes

Note that there has been some extensive discussion of downsides of Azure DevOps recently:

In particular, note that people have had trouble building working pipelines with in-repository configuration, and instead had to resort to the GUI editor. My teammate who worked on this over the past few weeks got very frustrated trying to get a pipeline going on DevOps -- he did get it working after a few weeks, using the GUI only -- which seems not ideal in terms of having good history and review of pipeline changes.

(I also didn't like the non-standard authentication options Azure comes with: no standard TOTP 2FA -- you have to download MS's custom 2FA app -- and no Ed25519 SSH keys.)

6 Likes

Just to offer a counterpoint: I’m currently setting up Azure Pipelines to test our C2Rust translator (repo: https://github.com/immunant/c2rust, branch feature/azure-pipelines) using in-repository configuration, and I’ve been pleasantly surprised by the short lag between the git push and the build kicking off. Martin and others have also reached out to offer help. I realize this is just one data point and that others may have problems, but so far our experience has been positive.

1 Like

@joshk @pietroalbini @alexcrichton Given that all-hands is next week, I wanted to point out again that the TravisCI Berlin office is roughly a 15-minute walk away.

Maybe it would be a good time to spend some time dissecting a couple of the issues.

TBH: many of the issues, while certainly grave, read like stuff I’ve experienced at many providers, so a switch might not improve things.

@SimonSapin: I don’t see how the legal structure for a person doing CI for Rust is a problem. We have multiple companies around the project that would probably help out.

1 Like

@skade Maybe it’s not actually a problem and one of those companies would be willing to have a full-time employee who works on Rust rather than that company’s own projects directly, and keep them long-term. In that case, great!

Anyway, that’s not the direction the Infra team seems to be interested in at the moment. So this is rather hypothetical.

Most of the complaints there seem to be about the UI (limited configurability) and not about the CI service in general (I did not see any service- or stability-related issues, actually).
Somebody did mention not being able to use submodules, which, if true, would be a serious issue, especially for rust-lang/rust, since it uses git submodules.

1 Like

Buildkite

:wave:t3: I’m a product engineer at Buildkite currently working on making our open source support awesome — I think Buildkite could be a great fit for Rust’s CI requirements, and we’d love to have you!

Buildkite has been very successfully providing CI for many large software teams since ~2014. Beyond commercial stuff, we also offer free and unlimited accounts for open source, academia and non-profits.

Some notable open source teams using us publicly are Bazel, D Lang, and Angular, as well as many others using us privately (we’ve just very recently released support for publicly viewable pipelines).

We put a lot of effort into the level of support we offer to all our users, and pride ourselves on our documentation.

Buildkite has a bit of a different operational model from most other CI/CD/automation solutions, which is worth quickly touching on: we don’t run opaque, managed compute for your workloads. Instead we provide a lightweight agent binary, the Buildkite Agent, which can run pretty much anywhere (Linux, macOS, Windows, Docker, etc.). This means that you have complete control over how, where, and on what infrastructure your workloads run.

In practice we find that for most teams the operational overhead of this approach is worth the increased flexibility and control it gives. We’ve also put a lot of effort into providing tooling that lightens any infra burden, like our AWS Elastic CI Stack.

We’re actively ramping up our tooling around supporting OSS projects running on Buildkite. We shipped public pipeline support only last week, but that was just the first step, and we’ve got a lot more coming. We’re very keen to engage with projects big and small in the OSS community and try to support them however we can.

If you have any questions please feel free to reach out — justin@buildkite.com

Cheers — have a good one!

Justin

8 Likes

Codefresh

I started experimenting with Codefresh at the end of last year, specifically for their Arm support, in the hope it might be useful for getting some of the Arm targets to tier 1. It’s a Docker-based solution where you can build and run Docker containers, and it is very nice to use in that regard. It will fail the macOS requirement though, so something else would still be needed for that. They provide hosted hardware, but I believe it can also work on your own hardware.

You can see the pipeline I set up for Rust to run x.py test --stage 2 for aarch64-unknown-linux-gnu and x.py build --stage 1 for armv7-unknown-linux-gnueabihf here.

It has some quite nice caching abilities. A volume is persisted between builds, which I set up to keep the git repo, the ccache cache and build/cache. It also uses the Docker engine’s caching, so images aren’t rebuilt unless they change.

I have found their support to be the best of any software I have used, always there to help and fix any of the problems I was having. I can’t really comment on the reliability though, as I haven’t used it long enough.

They have also said to me “We would love to support the Rust project!”

2 Likes

If we had more CI throughput, instead of trying to merge PRs serially we could attempt to merge N PRs at a time, such that if N-1 PRs fail to merge but 1 succeeds, we still reduce the queue. That would require N times the number of concurrent jobs that we have today.

Also, many of the PRs that we currently attempt to merge aggregate multiple PRs submitted by collaborators; this allows us to advance the queue multiple PRs at a time instead of one PR at a time, delivering big wins. However, if a single PR in a group of M PRs fails, no PR is merged. If we had (M+1) times more concurrent jobs, we could test subsets of the M-PR group in parallel, so that if only 1 PR in the group fails, the other M-1 PRs are still merged.

As you mention, the best way to reduce the 3h latency for modifying master is probably to use faster CPUs. However, we could still significantly improve how much we reduce the queue per 3h cycle if we had higher CI throughput, with a much higher number of concurrent jobs (10-100x more than today).
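
As a toy illustration of that last point, assume (purely for the sake of argument) that each merge attempt fails independently with the same probability; extra concurrency then raises the chance that at least one candidate lands in a given ~3h cycle:

```python
# Toy model (illustrative only): p_fail is an assumed, made-up failure rate
# for a single merge attempt, and attempts are treated as independent.
def p_at_least_one_lands(p_fail: float, n_parallel: int) -> float:
    """Probability that at least one of n parallel merge attempts succeeds."""
    return 1.0 - p_fail ** n_parallel

p_fail = 0.3
for n in (1, 2, 4, 8):
    print(f"{n} parallel attempt(s) -> "
          f"{p_at_least_one_lands(p_fail, n):.0%} chance of landing something")
# 1 -> 70%, 2 -> 91%, 4 -> 99%, 8 -> ~100%
```

The real gains depend on how correlated failures are across candidates, but the direction is the same: more concurrent jobs mean fewer CI cycles wasted without shrinking the queue.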

2 Likes

Folks may not realise it, but Martin is being kinda modest here. He’s the founder of the .NET Foundation, one of the key folks behind the open-source culture change, and a high-ranking individual within the Microsoft organisation chart, i.e. he can open doors and make shit happen for Rust that others cannot. He’s also a standup guy. In case you’re wondering, I do not work for Microsoft.

https://www.linkedin.com/in/martinwoodward/

16 Likes

I’ll mention Google Cloud Build, though noting that it doesn’t provide native macOS support, which is probably a showstopper (it does provide Windows support).

Namely, Cloud Build is “Docker native”, and there is presently no first-class solution for running macOS inside of Docker (at least that I’m aware of). Solve that problem (here’s a science experiment which attempts to do that) and it would meet the requirements.

Other than that it would seem great given the other requirements:

  • I’ve been pretty pleased with GCP support and am in frequent contact with my account executives/reps (via email and telephone). Technical / tier 2 support requires purchase of a support package but is also available via phone or support tickets.
  • Highly scalable (up to 32-core machines)
  • Configurable timeouts
  • Docker native: each step in a build can be a different Docker container, and they share a common filesystem mount throughout a build. This allows builds to be composed in terms of different Docker containers that each do one particular job.
  • Google isn’t going to disappear any time soon (but I can’t say the same about any given Google product, they don’t have the greatest track record there)

The main con (other than the lack of native macOS support, and Google being known to capriciously kill products) has in my experience been the UI/UX: it feels a bit half-baked. But that’s kind of par for the course with much of GCP, and Cloud Build is a relatively new product, only about 1½ years old. I’ve brought it up with my account reps and been told it’s a common complaint. I’ve seen UI/UX improvements arrive in other parts of the platform and can only hope similar improvements come soon, but until then it feels pretty rough around the edges compared to other CI/build systems.

That said, the core functionality is very solid, and most things can be done through (highly flexible and fully-featured) configuration files and API calls.

Concourse CI

https://concourse-ci.org

Concourse is the internal deployment tool of Pivotal, which is part of the Cloud Foundry initiative/group; it is available as an open source project and is part of their Cloud Foundry offering.

It is essentially a thing-doer: a meta system where resources can be defined and used as triggers for jobs - collections of serial, parallel, failable, actionable tasks that are executed on workers. Each worker can be Windows, Linux or macOS. The whole concept is built around containers: each and every task gets a container-based environment, which as of today is mostly Docker-based but can be extended to whatever is needed. Concourse has directory-level cache support, but it is also possible to integrate external caching services with it, such as sccache/Redis.

I’ve been using Concourse, from version 2 through 4, for open source as well as work environments, and while I’m not a devops person at heart, it was fairly easy to get into.

A few distinctive key features are:

  • being able to beam yourself into the build container to poke around and see what went wrong in there, which reduces the turnaround time for fixing a broken test environment
  • a command line utility, fly, to update the pipeline independently of commits/git
  • credential management via Vault or injectable JSON, so the descriptive YAML can be versioned with git
  • graphical pipeline representation and dashboard (see e.g. https://ci.spearow.io)
  • monitoring via InfluxDB/Riemann/Prometheus is possible

I’ve left out a few things, but I am confident that this is something that could potentially fit the needs.

The components are loosely coupled, i.e. the container runtime can be changed via an argument to the worker. The whole thing is written in Go as of today.

Downside

  • I have no experience using a hosted version; so far I have always spun up my own network/node set.

I’ve posted a new thread here to talk a little bit about our investigation process after all of these great suggestions.

2 Likes

News flash: after acquiring Travis CI about 30 days ago, Idera just sent a massive wave of termination letters to Travis CI employees.

There is no mention of shutting down Travis CI, or of shutting it down for open source projects, but the loss of experienced employees is unlikely to help stabilize the platform.

8 Likes

I have a bold idea: how about building a CI platform in Rust ourselves? Maybe the Rust infra team could start a company and make money from it (to showcase the strength of Rust, the platform would have to be excellent). I don’t have experience building CI platforms, so it’s just an idea.