Which CI platform should Rust use?

Taskcluster

Taskcluster fails the first hard requirement, but might deserve some discussion for completeness.

Taskcluster is Mozilla’s response to Firefox’s testing needs outgrowing Buildbot. It can be thought of as a loose collection of hosted services with APIs (Queue, Index, Secrets, …) and pieces of software (API client libraries in various languages, worker agents, …) that you can put together to build any CI system. It is designed to be very generic and “self-service” in order to enable unanticipated use cases, but the flip side is that anything non-trivial takes some work to put in place.

I’ve spent a few months migrating Servo from Buildbot to Taskcluster. (A few jobs are still on Buildbot, but all the infrastructure-level pieces are there to support them.) For more see the tracking issue as well as the scripts and READMEs starting here. Part of that time was me learning about and experimenting with TC (and ops things in general), but I think another part was because Servo was possibly the first project of this scale other than Firefox to use TC. I think a lot of that work can be reused. For example the generic decisionlib.py is separated from the Servo-specific decision_task.py script that uses it.

Some aspects of Servo + TC that might be unusual in off-the-shelf CI systems:

  • Testing each PR starts with a “decision task” that runs a script from the repository (so it can be modified in the same commit that is being tested). That script uses the API to schedule other tasks with an arbitrary dependency graph.
  • Docker images are built on demand from Dockerfiles in the repository (again so that they can be modified in the same commit that uses the modifications) and cached.
  • A single building task can produce an executable that is then used by many testing tasks in parallel.
  • Automatic provisioning and scaling (based on queue) of AWS EC2 instances of any type, with any system image. (I think support for other cloud providers is being added.)
  • Bring your own hardware: any machine with appropriate credentials can pick tasks from the queue. Servo does this with non-virtualized machines from Packet (in order to run Android emulators with KVM for CPU acceleration) and macOS workers from MacStadium.
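
As a rough illustration of the decision-task idea in the list above, the sketch below groups a task graph into “waves” of tasks that could run in parallel once their dependencies are done. This is not Servo’s decisionlib.py and it does not call the Taskcluster API: the task names and the `schedule_waves` helper are hypothetical, and it relies only on Python’s standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def schedule_waves(dependencies):
    """Group tasks into waves; all tasks in one wave can run in parallel.

    `dependencies` maps each task name to the list of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave finished successfully
    return waves


# Hypothetical task graph: one Docker image, one build, several test tasks
# that all reuse the executable produced by the single build task.
TASKS = {
    "docker-image": [],
    "build-linux": ["docker-image"],
    "test-unit": ["build-linux"],
    "test-wpt-1": ["build-linux"],
    "test-wpt-2": ["build-linux"],
}

waves = schedule_waves(TASKS)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In a real decision task, each wave would presumably not be computed up front; the script would submit every task together with its dependency list and let the queue decide when each one becomes runnable.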

TC is also moving from being hosted (there is a single Queue service in the world, operated by Mozilla) to “shipped” software (anyone can deploy their independent instance). Of course, everything is open-source.

I think that TC can meet every requirement and nice-to-have, except the first one: there is no company you can contract for Taskcluster support that I know of. People on Mozilla’s Taskcluster team are generally helpful when asked for help on IRC, but that’s not enough at this scale. For a TC-based CI system to be viable for Rust, there needs to be someone whose paid job (or at least part of it) is to build and maintain this system.

Then a question perhaps more tricky than money for their salary is what company can provide the legal structure to have that person as an employee.

Note that there has been some extensive discussion of downsides of Azure DevOps recently:

https://news.ycombinator.com/item?id=18983586

Especially, note that people have had trouble building working pipelines with in-repository configuration, and that they instead had to resort to the GUI editor. My teammate who worked on this over the past few weeks got very frustrated trying to get a pipeline going on DevOps. He did get it working after a few weeks, using the GUI only, which seems not ideal in terms of having good history and review of pipeline changes.

(I also didn’t like the non-standard authentication options Azure comes with: no standard TOTP 2FA – you have to download MS’s custom 2FA app – and no Ed25519 SSH keys.)

Just to offer a counterpoint: I’m currently setting up Azure pipelines to test our C2Rust translator (repo: https://github.com/immunant/c2rust, branch feature/azure-pipelines) using in-repository configuration, and I’ve been pleasantly surprised by the short lag between the git push and the build kicking off. Martin and others have also reached out to offer help. I realize this is just one data point and that others may have problems, but so far our experience has been positive.

@joshk @pietroalbini @alexcrichton Given that all-hands is next week, I wanted to point out again that the TravisCI Berlin office is roughly a 15-minute walk away.

Maybe it would be a good time to spend some time dissecting a couple of the issues.

TBH: many of the issues, while certainly grave, read like stuff I’ve experienced at many providers, so a switch might not improve things.

@SimonSapin: I don’t see how the legal structure for a person doing CI for Rust is a problem. We have multiple companies around the project that would probably help out.

@skade Maybe it’s not actually a problem and one of those companies would be willing to have a full-time employee who works on Rust rather than that company’s own projects directly, and keep them long-term. In that case, great!

Anyway, that’s not the direction the Infra team seems to be interested in at the moment. So this is rather hypothetical.

Most of the complaints there seem to be with the UI (limited configurability) and not with the CI service in general (I did not actually see any issues related to the CI service itself or its stability).
Somebody did mention not being able to use submodules which, if true, would be a serious issue, especially for rust-lang/rust, since it uses git submodules.

Buildkite

I’m a product engineer at Buildkite currently working on making our open source support awesome. I think Buildkite could be a great fit for Rust’s CI requirements, and we’d love to have you!

Buildkite has been very successfully providing CI for many large software teams since ~2014. Beyond commercial stuff, we also offer free and unlimited accounts for open source, academia and non-profits.

Some notable open source teams using us publicly are Bazel, D Lang, and Angular, as well as many others using us privately (we’ve just very recently released support for publicly viewable pipelines).

We put a lot of effort into the level of support we offer to all our users, and pride ourselves on our documentation.

Buildkite has a bit of a different operational model to most other CI/CD/automation solutions, which is worth me quickly touching on: we don’t do opaque, managed compute to run workloads. Instead we provide a lightweight agent binary, the Buildkite Agent, which can run pretty much anywhere (the Linuxes, OSX, Windows, Docker, etc). This means that you have complete control over how, where, and on what infra your workloads run.

In practice we find that for most teams the operational overhead of this approach is worth the increased flexibility and control it gives. We’ve also put a lot of effort into providing tooling that lightens any infra burden, like our AWS Elastic CI Stack.

We’re actively ramping up our tooling around supporting OSS projects running on Buildkite. We shipped public pipeline support only just last week, but that was just the first step, and we’ve got a lot more coming. We’re very keen to engage with projects big and small in the OSS community and try and support them however we can.

If you have any questions please feel free to reach out — justin@buildkite.com

Cheers — have a good one!

Justin

Codefresh

I started experimenting with Codefresh at the end of last year, specifically for their Arm support in the hope it might be useful for getting some of the Arm targets to tier-1. It’s a Docker based solution where you can build and run Docker containers and is very nice to use in that regard. It will fail the macOS requirement though, so something else would still be needed for that. They provide hosted hardware but can also work on your own hardware I believe.

You can see the pipeline I set up for Rust to run x.py test --stage 2 for aarch64-unknown-linux-gnu and x.py build --stage 1 for armv7-unknown-linux-gnueabihf here.

It has some quite nice caching abilities. A volume is persisted between builds, which I set up to keep the git repo, ccache cache and build/cache. It also uses the Docker engine caching, so images aren’t rebuilt unless they change.

I have found their support to be the best of any software I have used, always there to help and fix any of the problems I was having. I can’t really comment on the reliability though, as I haven’t used it long enough.

They have also said to me “We would love to support the Rust project!”

If we had more CI throughput, instead of trying to merge PRs serially, we could attempt to merge N PRs at a time, such that, if N-1 PRs fail to merge, but 1 succeeds, we still reduce the queue. That would require Nx the number of concurrent jobs that we have today.

Also, currently, many PRs that we attempt to merge aggregate multiple PRs submitted by collaborators - this allows us to advance the queue multiple PRs at a time instead of one PR at a time, delivering big wins. However, if a single PR in a group of M PRs fails, no PR is merged. If we had M+1x more concurrent jobs, we could test subsets of the M PR group in parallel, so that if only 1 PR in the group fails, M-1 PRs are still merged.

As you mention, the best way to reduce the 3h latency that we have on modifying master is probably to use faster CPUs. However, we could still significantly improve how much we reduce the queue per 3h cycle if we had a higher CI throughput with a much higher number of concurrent jobs (10-100x more concurrent jobs).
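
To make the rollup idea above concrete, here is a minimal sketch of the “M + 1 runs” scheme: test the full group of M PRs plus every subset that leaves one PR out, then merge the largest subset that passes. Everything here is an assumption for illustration, not how the Rust merge queue actually works: the `largest_mergeable_subset` helper, the PR numbers and the `passes` callback are hypothetical, and the runs are evaluated one after another even though the point of the proposal is to run them concurrently.

```python
from itertools import combinations


def largest_mergeable_subset(prs, passes):
    """Return the largest candidate subset of `prs` that passes CI.

    Candidates are the full group plus every leave-one-out subset,
    i.e. M + 1 CI runs for a group of M PRs.
    """
    candidates = [tuple(prs)] + list(combinations(prs, len(prs) - 1))
    for candidate in candidates:  # full group first, then leave-one-out
        if passes(candidate):
            return list(candidate)
    return []  # more than one PR is broken; nothing can be merged this cycle


# Hypothetical group of M = 4 PRs in which exactly one is broken.
GROUP = [101, 102, 103, 104]
BROKEN = {103}


def fake_ci(subset):
    """Stand-in for a real CI run: passes if no broken PR is included."""
    return not BROKEN.intersection(subset)


merged = largest_mergeable_subset(GROUP, fake_ci)
print(merged)
```

With a single broken PR in the group, the other M-1 PRs still land in the same cycle, at the cost of M + 1 concurrent runs instead of one.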

Folks may not realise but Martin is being kinda modest here. He’s the founder of the .NET Foundation, one of the key folks behind the open-source culture change and a high-ranking individual within the Microsoft organisation chart, i.e. he can open doors and make shit happen for Rust that others cannot. He’s also a standup guy. In case you’re wondering, I do not work for Microsoft.

https://www.linkedin.com/in/martinwoodward/

I’ll mention Google Cloud Build, though note that it doesn’t provide native macOS support, which is probably a showstopper (it does provide Windows support).

Namely, Cloud Build is “Docker native”, and there is presently no first-class solution for running macOS inside of Docker (at least that I’m aware of). Solve that problem (here’s a science experiment which attempts to do that) and it would meet the requirements.

Other than that it would seem great given the other requirements:

  • I’ve been pretty pleased with GCP support and am in frequent contact with my account executives/reps (via email and telephone). Technical / tier 2 support requires purchase of a support package but is also available via phone or support tickets.
  • Highly scalable (up to 32-core machines)
  • Configurable timeouts
  • Docker native: each step in a build can be a different Docker container, and they share a common filesystem mount throughout a build. This allows builds to be composed in terms of different Docker containers that each do one particular job.
  • Google isn’t going to disappear any time soon (but I can’t say the same about any given Google product, they don’t have the greatest track record there)

The main con (other than lack of native macOS support, and Google being known to capriciously kill products) in my experience has been the UI/UX: it feels a bit half-baked, but that’s kind of par for the course with much of GCP, and Cloud Build is also a relatively new product, about 1½ years old. I’ve brought it up with my account reps and been told it’s a common complaint. I’ve seen UI/UX improvements arrive in other parts of the platform, and can only hope similar improvements can come soon, but until then it feels pretty rough around the edges compared to other CI/build systems.

That said, the core functionality is very solid, and most things can be done through (highly flexible and fully-featured) configuration files and API calls.

Concourse CI

https://concourse-ci.org

Concourse is the internal deployment tool of Pivotal, which is part of the Cloud Foundry initiative/group. It is available as an open source project and is part of their Cloud Foundry offering.

It is essentially a thing-doer, a meta system where resources can be defined and used as triggers on which jobs (collections of serial, parallel, failable, actionable tasks) are executed on workers. Each worker can be Windows, Linux or macOS. The whole concept is built around containers: each and every task has a container-based environment, which as of today is mostly Docker-based but can be extended to whatever is needed. Concourse has directory-level cache support, but it is also possible to integrate external services with it, such as sccache/redis.

I’ve been using Concourse versions 2 through 4 for open source as well as work environments, and while I’m not a devops person at heart, it was fairly easy to get into.

A few distinctive key features are:

  • being able to beam yourself into the build container and see what went wrong in there, poke around - reduces turn around time for fixing a broken test environment
  • command line utility fly to update the pipeline, independently of commits/git
  • credential management via vault or injectable json, thus the descriptive yaml can be versioned with git
  • graphical pipeline representation and dashboard (see e.g. https://ci.spearow.io )
  • monitoring via influxdb/riemann/prometheus possible …

I missed a few things, but I am confident that this is something that could potentially fit the needs.

The components are loosely coupled, i.e. the container runtime can be changed via an argument to the worker. The whole thing is written in Go as of today.

downside

  • I have no experience using a hosted version, so far I always spun up my own network/nodeset

I’ve posted a new thread here to talk a little bit about our investigation process after all of these great suggestions.

News Flash: about 30 days after acquiring Travis CI, Idera just sent a massive wave of termination letters to Travis CI employees.

There is no mention of shutting down Travis CI, or of shutting it down for open source projects; however, the loss of experienced employees is unlikely to help stabilize the platform.

I have a bold idea: how about building a CI platform in Rust on our own? Maybe the Rust infra team will start a company and make money from it (given the strength of Rust, the platform must be excellent). I don’t have experience in building CI platforms, so it’s just an idea.

@VitalyR This is actually less bold than you might expect. See Discussion (continued): Building our own CI

@mark-i-m Cool! Looking forward to more work.

I find it quite an assumption that we could quickly form a company and build a consumer-usable platform around our project. Rust has some very specific needs that don’t map to what the general audience needs.

Also, our tooling doesn’t need to be excellent, it needs to work. A lot of our tooling is surprisingly unsophisticated on close inspection. That’s fine in an org that knows the limits, but it would hit the ceiling quickly if you tried to generalise its use.

I posted an update with the next steps of our investigation!
