Some thoughts on improving CI infrastructure

The huge delays that Rust’s merge gating system introduces can be fixed without abandoning merge gating entirely. One option is a system that batches together the entire backlog of waiting PRs and bisects when the batch breaks. Another is building in parallel, as OpenStack does.
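
To make the batch-and-bisect idea concrete, here is a minimal sketch in Python. It is not how any existing autolander works; the function name `find_mergeable` and the `passes` callback (which stands in for "build and test this set of PRs on top of master") are hypothetical, and it assumes each failure can be pinned on individual PRs rather than on interactions between them.

```python
from typing import Callable, List, Sequence


def find_mergeable(
    prs: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the PRs that can land, testing in batches and bisecting on failure.

    `passes` is a hypothetical stand-in for a full CI run: it receives the
    PRs already accepted plus the batch under test, and reports success.
    """
    merged: List[str] = []

    def land(batch: List[str]) -> None:
        if not batch:
            return
        # Test the whole batch on top of everything accepted so far.
        if passes(merged + batch):
            merged.extend(batch)
            return
        if len(batch) == 1:
            # A single PR that fails on its own is rejected.
            return
        # The batch broke: split it in half and try each half in turn.
        mid = len(batch) // 2
        land(batch[:mid])
        land(batch[mid:])

    land(list(prs))
    return merged


# Example: four queued PRs, one of which ("c") breaks the build.
bad = {"c"}
result = find_mergeable(["a", "b", "c", "d"], lambda s: not (bad & set(s)))
print(result)  # ['a', 'b', 'd']
```

When the whole backlog is green, this costs one build instead of one build per PR; the extra bisection builds are only paid when something in the batch is broken.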

And, yeah, fix the mac linker segfault. That’s a good idea for reasons that have nothing to do with the autolander.

Should we be reporting a bug against Apple?

After some research, I’ve found that a self-hosted CI solution could be a better alternative to a cloud-based service. I’d like to propose using Jenkins, the most widely used self-hosted solution.

(By the way, Node and KDE are using it. Mozilla seems to have used it before, but no longer does, for reasons I don’t know.)

  • Jenkins has a flexible plugin architecture, which should satisfy our merge gating scheme. It has many users and an enterprise-grade interface.
  • Jenkins also has a configurable Jenkinsfile, which opens CI changes to contributors. .travis.yml does not scale well, since it has no pipeline scheme.
  • Additional machines are required, but you can host any OS with Jenkins, potentially bringing CI coverage to platforms like BSD.

Some opinions against Travis/AppVeyor:

  • Capacity is limited: build boxes with 2 CPUs, the possibility of running out of disk space, and intermittent network failures. A self-hosted solution shouldn’t have any of these problems.
  • Local repository and Docker caching: a self-hosted box has nothing to pull from remote. Cloning rust-lang/rust (shallow) takes 4 secs and rust-lang/llvm (full) takes >1m, and these remote fetches are a huge cause of network failures.
  • Anyway, Mozilla is paying for both Travis and AppVeyor, so money is probably not an obstacle to making a switch.

Mozilla stopped using Jenkins and switched to their own solution (TaskCluster) because Jenkins doesn’t scale. There’s a single master for any Jenkins cluster, just like Buildbot. Worse, the Jenkins master usually runs a thread per job to babysit the slave, eating up gobs of memory.

That probably isn’t a good reason to pick Travis over Jenkins, but it explains TaskCluster. Firefox CI runs a lot of stuff.

I last used Jenkins in the 1.X series, and the combination of intermittent breakage, difficulty debugging config changes without XML config diffs (since the UI does not have a 1-1 mapping), and little support for 'niche' use-cases put me off plugins. My feeling was that regularly recreating your Jenkins instance from scratch with a script and limited XML is very important for peace of mind.

Things have probably changed for the better since then (and I know Jenkins itself tends to be ok); I just want to express caution about plugin enthusiasm ("just one more!"), as plugins may add a number of moving parts to an already tricky build process.

This isn't specific to AppVeyor/Travis - network failures happen, and any decent CI solution will have some way of caching a bunch of files, so it's just a matter of implementation (whether on a self-hosted or external solution).

Docker caching already exists on Travis, and repo caching will soon exist on Travis and AppVeyor. It's not easy to get repo caching right; I understand that repo caching in the old buildbot got itself into some pretty sorry states.

A self-hosted CI caches things in a local workspace, so it has far less exposure to network failures. Only a few of the cloud-based CI services do that (CircleCI did it by default, and Semaphore is bare-metal based, I think). The spinup is really fast, although it doesn't contribute much toward the 1h build time.
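
As a rough illustration of local repo caching, the sketch below builds the git commands a build box could run: keep a bare mirror per repository in a cache directory, refresh it, and clone the workspace with `--reference` so objects are borrowed from the mirror instead of being downloaded again. The function name `clone_commands` and the directory layout are assumptions for the example, not something any of the CI services above are known to do, and it does not address the stale-cache problems mentioned earlier in the thread.

```python
from pathlib import Path
from typing import List


def clone_commands(url: str, cache_dir: Path, workspace: Path) -> List[List[str]]:
    """Return the git commands for a cached clone of `url` into `workspace`.

    Hypothetical layout: one bare mirror per repository under `cache_dir`.
    """
    name = url.rstrip("/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    mirror = cache_dir / (name + ".git")

    if mirror.exists():
        # Mirror already on disk: only fetch what changed upstream.
        refresh = ["git", "--git-dir", str(mirror), "remote", "update", "--prune"]
    else:
        # First build on this box: pay for the full download once.
        refresh = ["git", "clone", "--mirror", url, str(mirror)]

    # Borrow objects from the local mirror rather than the network.
    checkout = ["git", "clone", "--reference", str(mirror), url, str(workspace)]
    return [refresh, checkout]


for cmd in clone_commands(
    "https://github.com/rust-lang/llvm", Path("/var/cache/ci"), Path("build/llvm")
):
    print(" ".join(cmd))
```

The commands are only printed here; a real runner would execute them (for example with `subprocess.run(cmd, check=True)`) and would need some way to discard a mirror that has become corrupted.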
