Some thoughts on improving CI infrastructure


#1

As everyone knows, the current CI infrastructure is really slow. The merge queue sometimes fills up with 50+ accepted PRs, and it’s pretty annoying for contributors to find out the build errored after a few days.

I’d like to propose some points of improvement, from a range of perspectives.

  • Stop building LLVM and just link to the system one. The fork is only needed for Emscripten; for everyone else, cloning it just generates more network errors and takes more time.
  • Try Semaphore CI. Disclaimer: they are a small business (wonderful support, though), and we should probably get in touch with them before migrating such a large-scale project. I know it could be hard to migrate from Travis… but anyway, let me put some points on this:
    • 80% faster than Travis, as they claim
    • 8GB of total storage (RAM+build ramdisk)
    • Kind of full Docker support
  • Stop focusing on a perfect merge. Forcing an up-to-date merge prevents any master breakage from happening, but in most cases a merge conflict only occurs for a breaking change anyway. On the other hand, this invalidates all builds on every retry and wastes time resolving spurious errors. It also makes it impossible to test two changes in parallel.

I know many of these are controversial, but please leave your opinions as comments anyway.


#2

FYI, the last bullet point you mentioned is actually due to this, or, to put it another way, to the very existence of bors. I don’t think the gating strategy is open for changes.

And as for using the system LLVM: almost everyone can tolerate the additional time and space requirements of an LLVM clone, IMO. It’s a one-time cost, and you can always skip building it and link to the system LLVM instead. IIRC the Travis PR tester is already set up like that. Actually, most of the cycle time is due to slow Android testing and full bootstraps, and that isn’t getting better anytime soon. We already shaved off one stage for the cross-compilation targets, and I don’t think stage times have significant room for optimization, as incremental compilation is of no use across stages.
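For reference, the bootstrap already exposes this as a config knob; a minimal config.toml sketch (the target triple and llvm-config path here are just examples for a typical Linux box, adjust for your machine):

```toml
# Link against a preinstalled LLVM instead of building the in-tree fork.
# The triple and the llvm-config path below are illustrative.
[target.x86_64-unknown-linux-gnu]
llvm-config = "/usr/bin/llvm-config"
```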

The CI provider change may be interesting, though; maybe someone just needs to step up and report the results.


#3

This is not the case; the build uses a prebuilt image of the Emscripten toolchain.

https://travis-ci.org/rust-lang/rust/jobs/205641961

Semaphore also does not offer OS X builds, which are used on Travis.

Travis has “full” docker support? What are you missing?

Are they comparing their internal offering to the Travis free offering when benchmarking?

https://docs.travis-ci.com/user/docker/

Also, Semaphore only gives you two boxes for free; the rest are paid. I don’t see how that would improve things much.

The issue is that we have a high number of targets!

Sorry, but master always being buildable is, IMHO, important for project velocity. Especially with the build times rustc has, no one wants to download a broken master and then discover they need to rebuild just for a tiny fix. Also, a broken master always needs someone to fix it.

The problem with parallel testing is that you can’t parallelize merge testing. Pre-testing of PRs is already done in parallel.


#4

The main reason we have such a large PR back-up is because an experimental version of sccache we used had serious reliability issues (https://github.com/rust-lang/rust/pull/40084) and failed 80% of our builds. That is fixed now.


#5

No, the LLVM repo is painful enough to clone. Please also note that the bootstrap skips cloning LLVM when linking to the system one.

macOS builds are comparatively short, and they are not the bottleneck for the moment.

Their Docker support is pretty minimal; on the other hand, Semaphore has painless container caching.

I don’t think a small merge can cause real errors; the drawback of queue-based merging is that the workflow is way too slow.


#6

It’s funny you say that, because the slowest two builds right now are macOS.


#7

I wonder if rollups could be more automated.

It is true that if a set of PRs can be merged cleanly and each of these PRs passes the test suite when applied to master, then they will very likely pass the tests after being merged together. This is the whole point behind rollups as they are done now; in fact, rollups are the single most important reason why the queue doesn’t grow indefinitely.

Maybe rollups should be run automatically and regularly, e.g. nightly?

All approved and mergeable PRs that aren’t failing CI could be merged together (unless explicitly marked as non-rollup), then tested and merged into master if testing succeeds. If testing fails and the PR introducing the failure is detectable, it could be marked as non-rollup and the next testing cycle started; otherwise the rollup is canceled.

The main problem is the reliability of individual PRs. People are not always responsible enough to run the tests, or even tidy, before submitting PRs. People may not have enough computing resources to run the full test suite in reasonable time. People normally run local tests on only a single platform. And there are not enough Travis and Appveyor resources to fully test all PRs either.

Maybe Travis/Appveyor testing of incoming PRs can be tweaked somehow to find the optimal point between high reliability and minimal spent resources?


#8

People may have local environmental quirks that mean the full test suite fails for them when it wouldn’t in CI. For instance, my dev box has no IPv6 loopback, which breaks libstd tests (#39798).


#9

Yeah, this as well. This is why the ability to ignore test groups, which was partially lost in rustbuild, was so important.


#10

Yeah I definitely agree that the @bors queue can be frustrating, especially when your PR fails for a spurious reason and then takes days to reach the head of the queue again. In general spurious failures are extremely annoying for everyone involved!

In general though we need to approach improvements with a principled eye. Before proposing a solution or a change I’d recommend learning about the current system (e.g. why things are the way they are) to help predict the impact of a proposed change. I’m always up for answering questions about our CI and/or build system!

The statement here is not true, nor is this the cause of all that much slowdown. We use a custom LLVM so that we have a place to backport fixes, which we do on a regular basis. Put another way, it’s practically guaranteed that every system LLVM version is buggy in one way or another, so it’s not acceptable to just blanket-use the system LLVM. Note that we do also have a builder which uses the system LLVM, and this is what runs on Travis (one of our fastest configurations).

Also note that building LLVM is not a time hog. We leverage sccache on all builders to cache builds of LLVM. If you look at the logs and check the timings, you’ll notice that LLVM typically takes about five minutes to build from a warm cache. Out of a multi-hour build, that’s a drop in the bucket.

So in summary, switching to the system LLVM would (a) mean we can’t fix critical bugs and (b) not actually help build times all that much.

Switching to different CI providers should always be an option, so it’s worth considering. So long as a CI provider integrates with GitHub it’ll be able to interact with the @bors queue correctly.

That being said, I highly doubt that switching will make our builds 80% faster. We’re a CPU bound project (we’re a compiler), so unless they’ve got 80% faster CPUs we’re not really going to see that much improvement. I’d recommend fully investigating such an alternative proposal before just curtly stating that we should switch.

Furthermore, the lack of “docker support on Travis” isn’t really a problem. We cache docker images across PRs today so that doesn’t impact build times.

So along a similar vein, I’ve never actually used a fire extinguisher in my life! I know what it is, and I’ve learned what it’s used for, but I’ve never personally needed it! It sure does take up a lot of space under my kitchen sink, and it’s getting to be inconvenient now that I have more stuff I want to put under my sink. That means I should throw out my fire extinguisher, right?

In a less joking fashion, this isn’t up for change at this time. I do realize it’s tough to envision a world without @bors as we’ve had it so long (for the “lifetime” of many Rust contributors!). Those that remember the dark ages before @bors, however, will swear that this is never worth it.

A system like @bors is not without its downsides, of course, but I’d rather waste some space under my sink for a fire extinguisher than burn my whole house down when I need it.


Yeah spurious failures are exceptionally annoying for basically everyone except @bors who is eternally hungry for more PRs. We have a number of other active issues for spurious failures where I believe the most notorious is spurious segfaults on OSX linkers.

I’d love to encourage people to tackle these issues, as they are incredibly high-profile bugs to fix, and you get the benefit of making basically every rust-lang/rust developer’s life nicer. I can’t remember the last time I saw 24 hours of yellow suns in my inbox from @bors, and I’d love to see it again! In the meantime I unfortunately end up sinking a lot of time into reading every failure log @bors generates to make sure it’s not spurious :frowning:


It would be useful to quantify the pain here rather than just state that such a clone is painful. I’ve typically seen an LLVM clone take ~5 minutes, which is a drop in the bucket for our builds. I’ve investigated submodule depth 1 cloning in the past but never got it to work out, but it’d be great to speed this up regardless!

As @eddyb points out, I encourage you to link to factual evidence for claims like this. The OSX builds are about to become the slowest overall builds.

Can you elaborate on what you’d like to see from a hypothetical “docker support with Travis”? I’m under the impression that we wouldn’t benefit much at this point (other than having someone else maintain the support), but I may be missing something!


I completely agree and I believe that this is one of the lowest hanging fruit for motivated contributors to help out with our CI. The Homu project has long languished from a lack of a solid maintainer, and we have a laundry list of issues and feature improvements that we could add to Homu. I unfortunately don’t personally have time to work on this much, but we can very easily update homu whenever we need! Some issues off the top of my head would be:

  • Automatic rollups. There’s a boatload of heuristics we can throw into this, and there’s been a novel’s worth of previous conversation on this topic as well. I totally agree with @petrochenkov that I believe this would help tremendously.
  • Homu could comment on a PR when Travis-on-the-PR fails. The PR run is just a subset of the main test suite, so a failure on Travis on the PR guarantees a failure when the PR reaches the head of the queue.
  • Homu could link to failing logs rather than just the build itself, making it easier to see what failure happened.
  • Homu could have different prioritization logic. We’ve long wanted to favor new contributors in the queue to help improve the “first patch experience”.
  • Homu could work with Unicode in PR titles/descriptions. Right now, if a PR with such a description gets to the head of the queue, the whole world grinds to a halt.

And much more! The aspirations for Homu go well beyond the Rust project itself to the Rust community as a whole. A one-click integration with a solid CI bot would be massive for everyone!
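As an illustration of the prioritization bullet, the queue ordering could be as small as a sort key; a sketch (the field names here are made up, not Homu’s actual state):

```python
def queue_order(entries):
    """Order merge-queue entries: explicit priority first, then
    first-time contributors ahead of veterans, then oldest approval."""
    return sorted(
        entries,
        key=lambda e: (-e["priority"],               # higher priority first
                       not e["first_time_contributor"],  # newcomers first
                       e["approved_at"]),            # then oldest approval
    )
```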


#11

I followed the link, thinking I might make a drive-by contribution. It seems there are a lot of open PRs on that repo; I’d find it more inviting if PRs were getting merged. I guess that is what you mean by the lack of a solid maintainer.


#12

So, this would be a small thing, but I do wonder – can we make bors parse the logs and highlight (or, better yet, link directly to) lines that indicate the problem? I feel like it takes a frustratingly large number of clicks to extract this information.
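To make the idea concrete, the first cut of such a log scan could be very simple; a sketch (these patterns are illustrative, not the real set bors would need):

```python
import re

# Lines that usually pinpoint the actual failure in a build log.
FAILURE_PATTERNS = [
    re.compile(r"^error(\[E\d+\])?:"),   # rustc errors
    re.compile(r"^thread .* panicked"),  # test panics
    re.compile(r"FAILED"),               # test-suite summaries
]

def failing_lines(log_text):
    """Return (line_number, line) pairs that look like the failure,
    so a comment can link straight to them instead of the whole build."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(p.search(line) for p in FAILURE_PATTERNS):
            hits.append((number, line))
    return hits
```

With the line numbers in hand, the bot could emit a deep link of the form `<log-url>#L<number>` rather than just the build page.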


#13

OK reading to the end of @alexcrichton’s message, I think this falls under “homu needs work”.


#14

I think this is totally plausible! The barrier to entry is “someone needs to change Homu” or “someone needs to write a new bot”. Both of those barriers seem empirically too high today, as they’ve deterred everyone so far.


#15

If I am reading https://travis-ci.org/rust-lang/rust/builds correctly, many builds are timing out after running for >24h. Why so many timeouts? Why such a long cutoff? Maybe 3h instead?


#16

No, I was reading it wrong. They were canceled, not timed out.


#17

As you know, nightlies are broken again (woohoo!) reference

Merge gating doesn’t completely prevent breakage, and it has already taken two days to fix this one.

Merge gating seems ineffective to me. Most of the time, merging from a (slightly) outdated branch causes no problems, and attempting to merge from a very outdated branch will simply produce a merge conflict. Testing in parallel, on the other hand, would let us fix breakage ASAP.


#18

Semantic merge conflicts do occur sometimes, especially when you do refactoring.

Also, the important property for me is that even when we don’t have a nightly, I can pull from Rust master and build on it without fearing random breakage. In the three years I have been working with Rust, I have never had to work around nightly regressions. (What was annoying is that we don’t have good error testing on nightly, so ICEs on compilation errors can slip into beta, but that’s orthogonal.)

Also, I’m not sure parallel merges will help. We have 20-ish builders that check that our PRs work on all sorts of odd architectures, and CI is slow because A) we have several intermittent failures on these odd architectures and nobody with enough knowledge and time to fix them; B) our build accesses the network, and AFAICT Travis has unreliable networking and no way to share a single network access across all 20-ish builders; and C) sccache is unreliable and nobody has the time to fix it.


#19

With gate-independent testing, it’s possible to retry only one build instead of all of them, so it should be faster to get feedback and merge.

Nightly is broken, and building from fresh source is also broken (I can’t test my PR locally right now). Taking three days to fix that is more annoying than semantic merge conflicts.


#20

How is building from fresh source broken? git submodule update --init.

If you want to improve CI times, find some way to fix the Mac linker segfault. Maybe run the linker three times, and if it succeeds once, call it a pass.
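That retry idea could be sketched as a small wrapper around an arbitrary command (this is hypothetical glue, not existing tooling; the command to retry is up to the caller):

```python
import subprocess

def run_with_retries(cmd, attempts=3):
    """Rerun a flaky command (e.g. a link step that spuriously segfaults)
    up to `attempts` times; succeed as soon as any run exits cleanly."""
    for _ in range(attempts):
        if subprocess.call(cmd) == 0:
            return True
    return False
```

The obvious caveat is that blanket retries can also mask real, deterministic failures, so something like this should only wrap the specific step known to fail spuriously.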