Moving to TaskCluster: The Plan

We had some discussion yesterday about moving to TaskCluster (abbreviated as TC below).

In short, TC lets us migrate to a more powerful infrastructure with fewer worries about infrastructure maintenance. There are already people maintaining it, and it lets us move our CI system over gradually.

To make sure we can improve the infrastructure smoothly, I’d like to use this thread for discussion and to give an overview of how we can do it.

  1. Enable TC and try a few builders out.

We will be verifying that the infrastructure works for our model, and as the first step we will create a minimal configuration for the Docker builders. Docker was chosen here because it’s CI-agnostic and there’s already a working example of our model. (A rough sketch of what such a task could look like follows this list.)

  2. Start gating on TC.

Once the basic build setup has stabilized, we will work on integrating it with our homu workflow. We will make sure that tools like cancelbot get integrated, and also that deployments work.

  3. Move Linux builders from Travis to TC.

This step may be done gradually to ensure we don’t have issues with capacity. Things like S3 bucket locality should also be considered.

  4. Work on macOS/Windows builders.

Due to the lack of containerization, we will need to investigate how to create a fresh environment for each job instead of polluting the entire VM. Also, given that macOS is not hosted on a cloud infrastructure, we’ll have to reserve capacity in some way.
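
To give an idea of what step 1 involves, here is a rough sketch of a single docker-worker task definition, written as the Python dict that would be submitted to the Taskcluster queue. The field names follow the queue/docker-worker schema, but the provisioner, worker type, image name, command, and owner address are placeholders rather than the values we would actually use:

```python
# Hypothetical sketch of a minimal Docker builder task for step 1.
# Everything marked "placeholder" below is illustrative, not a real value.
from datetime import datetime, timedelta

def linux_build_task(revision):
    now = datetime.utcnow()
    return {
        "provisionerId": "aws-provisioner-v1",        # placeholder: EC2-backed pool
        "workerType": "rust-linux-builder",           # placeholder worker type
        "created": now.isoformat() + "Z",
        "deadline": (now + timedelta(hours=4)).isoformat() + "Z",
        "metadata": {
            "name": "rust-linux-x86_64",
            "description": "Trial Docker builder",
            "owner": "infra@rust-lang.org",           # placeholder
            "source": "https://github.com/rust-lang/rust",
        },
        "payload": {
            "image": "rust-ci/linux-x86_64",          # placeholder Docker image
            "command": ["/bin/bash", "-c", "src/ci/run.sh"],  # placeholder command
            "env": {"REVISION": revision},
            "maxRunTime": 2 * 60 * 60,                # timeout in seconds
        },
    }
```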


> We will be verifying that the infrastructure works for our model, and as the first step we will create a minimal configuration for the Docker builders. Docker was chosen here because it’s CI-agnostic and there’s already a working example of our model.

FYI, the rendered version of those docs is here: https://firefox-source-docs.mozilla.org/taskcluster/taskcluster/docker-images.html

A lot of the nice things described here are specific to the task graph implementation that lives in mozilla-central and drives Firefox CI. Although everything runs in Taskcluster, the way things work is that Taskcluster creates a "decision task" for each CI run, and then that task is responsible for creating all the other tasks it wants to run. For Firefox we run mach taskgraph decision (most of the interesting bits are here) which reads a bunch of YAML files under taskcluster/ci and applies various transforms to turn them into a task graph. You can see it in action by loading treeherder.mozilla.org and locating a "Gecko Decision Task opt" row and clicking the D there. Here's a recent example from mozilla-inbound.

Firefox CI obviously has a lot of complexity: we build multiple build variants (opt/debug/pgo/asan/static analysis) for multiple platforms (linux32, linux64, win32, win64, macos, android) and then run potentially dozens of test suites each containing up to tens of thousands of tests on each build. Taskcluster was built to be flexible enough to handle this, and it's proven to be a great set of tools. I don't know that Rust CI is quite as complicated but it certainly seems to be pushing the limits of what you can usefully accomplish with commercial CI services.

For Firefox we've used this to build some really powerful CI. You pointed out the docs on Docker images--we have Dockerfiles that live in the source tree that define the environment in which tasks that run on Linux are run. Changing any of the inputs to the Dockerfile will cause the image to be rebuilt and then used for task execution, which makes experimentation easy. (We have a special repository called try where developers can push work in progress changes and get CI builds/tests.) In the past year or so we've hooked up a number of other similar things to the task graph, like "toolchain" tasks that build tools that we use during the Firefox build such as our gcc or clang toolchains. This means that nowadays developers can similarly change the compiler used to build Firefox by just editing a file in the source to point to a different revision, and when those changes are pushed the toolchain tasks will run as needed and their outputs will be used in the Firefox build.

There's certainly a bit of a gulf between the base functionality that Taskcluster provides and the abstractions we have in Firefox, and we don't have a great story there yet. If we go farther down the road of having Rust CI in Taskcluster it would probably be worthwhile to look into sharing Firefox's taskgraph generation code. There are many other ways to accomplish this, certainly. NSS uses Taskcluster for their CI now (see NSS CI status on Treeherder) and their CI is driven by a decision task launched from .taskcluster.yml that runs some node.js code to generate the task graph. I've used Taskcluster for some smaller projects such as this one that just have simple JSON task templates in the repository and then a Python script that runs in a decision task to fill in the templates and create the tasks.
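
For the simplest variant of that pattern, the decision task can be little more than a script like the following. This is only a sketch: the template directory, environment variable names, and substitution scheme are made up for illustration, and it glosses over how the decision task gets credentials/scopes to call createTask (normally from its environment or the worker's proxy). slugId and Queue.createTask come from the taskcluster Python client.

```python
# Hypothetical decision-task sketch: fill in simple JSON task templates from
# the repository and submit them to the Taskcluster queue.
import json
import os
from pathlib import Path

import taskcluster  # the official Python client (pip install taskcluster)

def main():
    revision = os.environ["REVISION"]  # assumed to be provided to the decision task
    # Assumes a recent client/deployment; credentials are taken from the
    # decision task's environment and are not shown here.
    queue = taskcluster.Queue({"rootUrl": os.environ["TASKCLUSTER_ROOT_URL"]})

    for template_path in sorted(Path("ci/tasks").glob("*.json")):  # made-up layout
        # Naive substitution; a real implementation would use proper templating
        # and validate the result against the task schema.
        task = json.loads(template_path.read_text().replace("{{revision}}", revision))
        task_id = taskcluster.slugId()
        queue.createTask(task_id, task)
        print("created %s as %s" % (task["metadata"]["name"], task_id))

if __name__ == "__main__":
    main()
```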

That was...a longer post than I expected to write here. I don't work on Taskcluster, but I've been heavily involved in Firefox CI and I really like Taskcluster. I'm happy to help out any way I can here.


I’m not on the infra team, but I’m wary of moving onto a Mozilla-specific technology. This isn’t because it’s bad, but it’s just not as well known as other options, and it reinforces the perception that Mozilla controls Rust.


FYI, Windows Server containers do exist, though they support only a few host and base images.

On the topic of using CI battle-tested by handling web browsers: does anyone remember why we moved away from Buildbot? Buildbot runs WebKit’s CI, so it is demonstrably capable.

It didn’t support in-tree configuration files, so modifying the CI structure was hard, as we couldn’t test changes alongside the commit. The other issue is that we didn’t have enough people to maintain the infrastructure, and it was unstable. TaskCluster has its own maintenance team, so it’s better than having one or two Rust team members manage the entire Buildbot infrastructure.

Along with that, the migration to SaaS brought some side benefits, like building nightlies from CI artifacts instead of a cron build. Those are described in Rust CI / release infrastructure changes.

@luser Thanks, the links really help. I’m working on the initial configuration file now. It seems the main configuration file is read from the master branch, and we call the scheduler to create tasks based on the tree of the commit being tested, right? We will probably start with an initial skeleton so that we can then test any scheduler additions through the gating system.

Also, I found that the cache is cleared quite crudely: when the threshold is reached, it evicts all caches not currently occupied by a running worker at once (an all-or-nothing approach). How has this performed for Gecko in practice? We rely heavily on the VCS checkout cache, and if cache evictions happen frequently, we may have to consider improving the caching system.
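
For reference, this is roughly how a docker-worker task opts into a named cache today: the payload maps a cache name to a mount point inside the container, and the task needs a matching scope. The cache name and mount path below are placeholders for whatever we would use for the VCS checkout:

```python
# Hypothetical fragment of a task definition using a docker-worker cache.
# The cache name and mount point are placeholders.
checkout_cache_fragment = {
    "scopes": [
        "docker-worker:cache:rust-vcs-checkout",            # scope granting use of the cache
    ],
    "payload": {
        "cache": {
            "rust-vcs-checkout": "/home/builder/checkout",  # mount point inside the image
        },
        # image, command, maxRunTime, etc. as in a normal task
    },
}
```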

It may be worth looking at GitLab CI, especially with their recently announced integration with GitHub.

GitLab CI is quite a feature-packed CI system, but there are two major issues with it:

  • We will need to maintain the CI runner (using their shared runner isn’t realistic).
  • We don’t get dynamic provisioning for Windows.

Aside from that, TaskCluster fits better with our model, which is quite tied to AWS, and it also has a powerful decision system for scheduling the tasks we need.

I had quite a bad experience working with the GitLab API last time, as their documentation wasn’t tested and the examples even contained incorrect information.

Thanks for writing up how we could achieve this - alternative CI solutions are certainly one piece of the puzzle facing us.

I'd like to dig in some more to the motivating reasons for the move and what problems you see being solved by moving. Travis is already maintained by 'other people' (which I like!) and it's not clear what exactly you're referring to by "more powerful infrastructure". Your thread at RFC: Build our own self hosted CI infrastructure is more explicit, but it wasn't written with Taskcluster in mind so it's unclear how much is applicable.

Let's take a hypothetical motivation of fixing spurious timeout failures by eliminating time limits (I'm not saying this is actually an intention, it's just a strawman example). Assuming we can set up unbounded time limits on taskcluster - great! As soon as we've moved everything over to taskcluster, we've eliminated those spurious failures. However, it looks like >50% of our timeout failures were on Windows, which is item 4 on your list as a more speculative 'to investigate' item. Let's put that aside for a minute though and assume we do have a solution. The next question to me is whether we should even try to support unlimited run time - we have 11 PRs in the last 24 hours, meaning we should be targeting a 2 hour merge time.

Just generally, as @nrc points out in Putting bors on a PIP - #4 by nrc, we should look to the future and work towards it - it may well be that taskcluster is part of that, which makes this plan valuable! But we should make sure we're pointing in the right direction before running and I anticipate us discussing this at the Rust all hands next week.

Hi, Taskcluster developer here. A few notes about Taskcluster…

First, for a long time TC has been “Mozilla-specific” in that it was deployed as a SaaS and only used by Mozilla projects. That’s no longer the case - Taskcluster can now be deployed independently just like Discourse or OpenStack. We made that change in large part because the Rust community had good reasons to not run on a Mozilla service, as @steveklabnik mentioned earlier in this thread. We are also actively seeking other users for the application, including outside Mozilla. While Mozilla is unlikely to ever make Taskcluster a full “product” (first off, Marketing would never let us keep the name!), it is likely to see good uptake in the areas of CI that are not well-served by Jenkins, Travis, Buildbot, Appveyor, Buddybuild, and so on.

Taskcluster is not especially AWS-specific. It can run workers anywhere. Firefox has workers running on Mozilla datacenter hardware and in packet.net. Deepspeech has workers running on custom hardware. It’s even running on some Raspberry Pis in Germany. The TC provisioner currently supports spinning up instances on-demand in AWS (EC2 spot instances) and we will soon be adding support for packet.net and GCP. But that’s only provisioning – if you bring your own hardware, or always-on cloud instances, there’s no need for provisioning. Firefox stores artifacts – task outputs – in AWS’s S3, but we also support Azure blobs and the architecture is open enough to add other backends as well (which Firefox will probably want for packet.net).

Windows is not hard to run in EC2. Our generic-worker implementation has support for user-based isolation, meaning that different tasks run as different users on the same machine (sequentially or concurrently). It supports both Windows and OS X. That goes a long way toward isolating jobs from contaminating one another, although it’s not a great security boundary - but neither is Docker. For security boundaries, we recommend to use different hosts/instances (for example, Firefox try jobs are scheduled on a different pool of machines than production builds).

Taskcluster does dependency trees very well – the Firefox release process is tens of layers deep with lots of complex relationships between layers, and Taskcluster models that quite well. The resulting task graph has, at last check, 8000+ tasks. The only scaling issue we have is in displaying it in the browser!

To the strawman about timeouts – yes, you can set whatever timeout you would like in Taskcluster. Unbounded isn’t great because you’ll end up paying for hung machines in perpetuity, but if doubling the timeout fixes the issue, that’s easy. It’s also possible to define some flexible retry logic to automatically retry after known-intermittent issues (it’s not clear if intermittency is the problem here, or just too-short timeouts…)
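
To make those knobs concrete, here is roughly where they live in a task definition (field names per the queue and docker-worker schemas; the IDs and numbers below are placeholders):

```python
# Hypothetical fragment showing dependencies, timeout, and retry settings.
test_task_fragment = {
    "dependencies": ["<taskId of the build task>"],  # placeholder taskId
    "requires": "all-completed",       # only schedule once the build task succeeds
    "retries": 3,                      # automatic retries for infrastructure-level failures
    "payload": {
        "maxRunTime": 3 * 60 * 60,     # per-task timeout in seconds; pick whatever fits
        # docker-worker can also retry on specific exit codes via onExitStatus
    },
}
```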

P.S. I’m also a former Buildbot maintainer. Webkit does use Buildbot, but a pretty ancient version (0.8.6 - https://build.webkit.org/). Buildbot doesn’t have very good support for self-serve / in-tree configuration, and its support for dynamically scaling its worker pool (e.g., with $CLOUDVENDOR spot instances) is pretty minimal. And as @ishitatsuyuki mentioned in RFC: Build our own self hosted CI infrastructure, it doesn’t do well with the deep dependency trees that a release process typically requires.

