Hi, Taskcluster developer here. A few notes about Taskcluster…
First, for a long time TC has been “Mozilla-specific” in that it was deployed as a SaaS and only used by Mozilla projects. That’s no longer the case - Taskcluster can now be deployed independently just like Discourse or OpenStack. We made that change in large part because the Rust community had good reasons to not run on a Mozilla service, as @steveklabnik mentioned earlier in this thread. We are also actively seeking other users for the application, including outside Mozilla. While Mozilla is unlikely to ever make Taskcluster a full “product” (first off, Marketing would never let us keep the name!), it is likely to see good uptake in the areas of CI that are not well-served by Jenkins, Travis, Buildbot, Appveyor, Buddybuild, and so on.
Taskcluster is not especially AWS-specific. It can run workers anywhere. Firefox has workers running on Mozilla datacenter hardware and in packet.net. DeepSpeech has workers running on custom hardware. It’s even running on some Raspberry Pis in Germany. The TC provisioner currently supports spinning up instances on-demand in AWS (EC2 spot instances) and we will soon be adding support for packet.net and GCP. But that’s only provisioning – if you bring your own hardware, or always-on cloud instances, there’s no need for provisioning. Firefox stores artifacts – task outputs – in AWS’s S3, but we also support Azure blobs and the architecture is open enough to add other backends as well (which Firefox will probably want for packet.net).
Windows is not hard to run in EC2. Our generic-worker implementation has support for user-based isolation, meaning that different tasks run as different users on the same machine (sequentially or concurrently). It supports both Windows and OS X. That goes a long way toward keeping jobs from contaminating one another, although it’s not a great security boundary - but neither is Docker. For security boundaries, we recommend using different hosts/instances (for example, Firefox try jobs are scheduled on a different pool of machines than production builds).
Taskcluster does dependency trees very well – the Firefox release process is tens of layers deep with lots of complex relationships between layers, and Taskcluster models that quite well. The resulting task graph has, at last check, 8000+ tasks. The only scaling issue we have is in displaying it in the browser!
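To make "layers" concrete, here is a minimal sketch of a dependency graph resolved into layers. It is not Taskcluster code or its API: the task names and the `schedule_in_layers` helper are made up for illustration, and it only uses Python's standard `graphlib` module (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
graph = {
    "build-linux": set(),
    "build-windows": set(),
    "test-linux": {"build-linux"},
    "test-windows": {"build-windows"},
    "sign": {"test-linux", "test-windows"},
    "publish": {"sign"},
}


def schedule_in_layers(deps):
    """Group tasks into layers; each layer depends only on earlier layers."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    layers = []
    while sorter.is_active():
        # Everything whose dependencies are all done can run in parallel.
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


layers = schedule_in_layers(graph)
for depth, tasks in enumerate(layers):
    print(depth, tasks)
```

Tasks in the same layer have no dependencies on each other, so a scheduler is free to run them concurrently on whatever workers are available.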
To the strawman about timeouts – yes, you can set whatever timeout you would like in Taskcluster. Unbounded isn’t great because you’ll end up paying for hung machines in perpetuity, but if doubling the timeout fixes the issue, that’s easy. It’s also possible to define some flexible retry logic to automatically retry after known-intermittent issues (it’s not clear if intermittency is the problem here, or just too-short timeouts…)
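As a rough illustration of what "a bounded timeout plus retries for known-intermittent failures" means, here is a generic sketch. It is not how Taskcluster implements this or how you would configure it there: the `run_with_retries` helper, the choice to retry on timeouts, and exit code 75 as the "intermittent" marker are all assumptions for the example.

```python
import subprocess
import sys


def run_with_retries(cmd, timeout_s, max_attempts=3, retry_exit_codes=(75,)):
    """Run cmd with a bounded timeout, retrying known-intermittent failures.

    Returns (exit_code, attempts_used). exit_code is None if the last
    attempt timed out. Exit code 75 is an arbitrary, assumed marker.
    """
    exit_code = None
    for attempt in range(1, max_attempts + 1):
        try:
            exit_code = subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout, so nothing is left hung.
            exit_code = None
        else:
            if exit_code not in retry_exit_codes:
                # Success or a real failure: report it without retrying.
                return exit_code, attempt
    return exit_code, max_attempts


# A command that fails "intermittently" every time uses up all attempts.
flaky = [sys.executable, "-c", "import sys; sys.exit(75)"]
print(run_with_retries(flaky, timeout_s=30))
```

The point of the bound is that a hung job costs at most `timeout_s * max_attempts` of machine time, instead of running (and billing) forever.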
P.S. I’m also a former Buildbot maintainer. WebKit does use Buildbot, but a pretty ancient version (0.8.6 - https://build.webkit.org/). Buildbot doesn’t have very good support for self-serve / in-tree configuration, and its support for dynamically scaling its worker pool (e.g., with $CLOUDVENDOR spot instances) is pretty minimal. And as @ishitatsuyuki mentioned in “RFC: Build our own self hosted CI infrastructure”, it doesn’t do well with the deep dependency trees that a release process typically requires.