The 1.15.1 release was quite troublesome. Aside from the embarrassing bugs we had to actually fix, we made several mistakes during the release that, while ultimately not creating much end-user disruption, were stressful to deal with and created a lot of risk.
I’ve taken notes here about what happened to try to capture it while it’s fresh, but they are not well organized.
Most of the release tasks were done by me or @alexcrichton, with input from the core team.
Problems
- Retagged 1.15.1 three times. Yes, four tags total.
- Accidentally uploaded rebuilt 1.15.0 mac rustc binaries to the archives
- Accidentally overwrote the released 1.15.0 source tarball and its .asc and .sha256, changing signatures
- Likely accidentally overwrote one of the two sets of 1.15.0 rustc binaries for all Apple platforms. Not the set that rustup cares about. The set that most people don’t touch.
The last three are all a single incident.
Links
- Release process - How we make releases. We would follow this to the letter, except that the release process changes often, and lately at least there have been lots of bugs as we rewrite the build system, the CI, and the release infrastructure. So we debug the release process itself as we go, updating it for next time.
- as_mut_slice /r/rust thread
- as_mut_slice bug
- as_mut_slice master PR
- as_mut_slice stable port
- -fPIC bugzilla - The report that Firefox doesn’t build with 1.15.0
- -fPIC master fix - The fix for the -fPIC bug on master
- -fPIC stable port - A smaller patch than for master, which pulls in unrelated updates via a crate upgrade
- -fPIC beta port - The same patch as stable, running tests while stable is building for release.
- 1.15.0 source overwrite homebrew report
Events
- 2017-02-02 - (Thursday) Released 1.15
- 2017-02-02 - as_mut_slice bug reported and fixed on master
- 2017-02-03 - (Friday) Scheduled 1.15.1 for Tuesday 2/7
- 2017-02-03 - Backported minimal fix to beta and stable
- 2017-02-03 - Added 1.15.1 release notes to stable, containing 2/7 date
- 2017-02-03 - Built and tested successfully on beta
- 2017-02-03 - Build started on stable
- 2017-02-06 - (Monday) Stable build failed because of mac 1.15.0 src tarball collisions (more below)
- 2017-02-06 - Believing the problem is a stale cache on the build master, we delete the files on the build master
- 2017-02-06 - Stable build restarted from scratch. Attempting to continue the in-progress deployment carried too many unknowns
- 2017-02-07 - (Tuesday) Release day?
- 2017-02-07 - Stable build failed. Same problem.
- 2017-02-07 - The problem is not on the build master but on the build slaves. Build slaves are cleaned.
- 2017-02-07 - Rescheduled 1.15.1 for Wednesday 2/8
- 2017-02-07 - Updated release notes on stable for 2/8
- 2017-02-07 - Started build
- 2017-02-07 - Updated www and blog for 1.15.1
- 2017-02-07 - 1.15.1 is tagged, not signed, for updating thanks.rlo
- 2017-02-08 - (Wednesday) Release day?
- 2017-02-08 - Build is good. Ready to deploy.
- 2017-02-08 - Release artifacts contain an error. The same bogus 1.15.0 mac artifacts that were causing the build to fail before were still produced but did not cause the build to fail. Bogus duplicate 1.15.0 source tarballs and mac rustc bins are uploaded to s3. We think they are only uploaded to the archives, which sucks, but has precedent (we occasionally end up with non-canonical release artifacts in the archive and don’t worry much about it). We’re still set to deploy.
- 2017-02-08 - We’re directed to a bugzilla bug where Firefox doesn’t build because of the -fPIC regression, and asked if that can be fixed for a point release
- 2017-02-08 - We initially decide to proceed with 1.15.1, but the bug looks kinda bad. With the prospect of an immediate 1.15.2 we delay the release to take a break and investigate.
- 2017-02-08 - Regular core team mtg. We decide to postpone the release another day and backport the fix from master.
- 2017-02-08 - acrichto and brson plan the steps for the revised release
- 2017-02-08 - Patch is applied to beta and run through bors for testing
- 2017-02-08 - Patch is applied to stable and release build begun
- 2017-02-08 - 1.15.1 is retagged, but not signed, for the new commit, in order to close the window in which the tag points to the wrong commit. The new tag is incorrect.
- 2017-02-08 - Wait, the bogus 1.15.0 mac/src deployment wasn’t just the archives, the ones in dist/ were overwritten too, shas and sigs changed. There was never any particular reason to believe otherwise. Tables are flipped.
- 2017-02-09 - (Thursday) Release day
- 2017-02-09 - -fPIC fix is validated on yet-undeployed stable build
- 2017-02-09 - Updated www and blog for 1.15.1 release date and the -fPIC fix
- 2017-02-09 - Retagged 1.15.1 yet again to fix last night’s error and point it at the real correct commit. Still not signed.
- 2017-02-09 - Released build
- 2017-02-09 - Signed tag
- 2017-02-09 - Sent email to packagers about tagging error
Analysis
There were only two technical errors as far as I can tell:
- The Mac build slaves, which unlike all others do not run in clean environments, exposed a bug in rustbuild where the ‘dist’ directory was not cleaned between runs, thus uploading 1.15.0 binaries during a 1.15.1 build
- Our tagging scheme for 1.15, in order to generate thanks.rlo data ahead of the release, involved pushing an unsigned tag ahead of the release, then signing it after the release. This scheme is incorrect, and completely fell apart when we decided to delay the release to add another patch.
The first is fixed. The second will be fixed before the next release. Unfortunately, this has been the nature of most of the build/release failures we’ve seen: one-off bugs that are caught and fixed.
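To make the first error concrete, here is a minimal sketch of the failure mode and the kind of cleanup that avoids it. It is not the actual rustbuild fix; the helper name `fresh_dist_dir` and the file names are illustrative assumptions. It simulates a persistent builder that still has a 1.15.0 tarball in its output directory, then starts the next build from an empty directory.

```python
import shutil
import tempfile
from pathlib import Path


def fresh_dist_dir(dist: Path) -> Path:
    """Remove any leftover dist directory and recreate it empty.

    Hypothetical helper for illustration, not part of rustbuild.
    """
    if dist.exists():
        shutil.rmtree(dist)
    dist.mkdir(parents=True)
    return dist


# Simulate a builder that is not reset between runs: a previous
# build left a 1.15.0 tarball behind in dist/.
workdir = Path(tempfile.mkdtemp())
dist = workdir / "dist"
dist.mkdir()
(dist / "rustc-1.15.0-src.tar.gz").write_text("stale")

# Start the next build from an empty directory, then write the new artifact.
fresh_dist_dir(dist)
(dist / "rustc-1.15.1-src.tar.gz").write_text("new")

# Only what this build produced is left to upload.
leftover = sorted(p.name for p in dist.iterdir())
print(leftover)
```

Without the call to `fresh_dist_dir`, both tarballs would be sitting in the directory at upload time, which is roughly what happened on the Mac builders.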
There may be systemic changes we can make to guard against these types of bugs, but I’m not sure. Obviously having containers or some other way of isolating the build environment on macs would be better, but options are poor.
Possibly the best thing we could do is to create an identical staging environment where we can dry-run the entire release process, and always do a release dry run before doing the release. Today’s release infrastructure has such an environment but it is severely degraded. The new release infrastructure, scheduled to release the 1.17 build, has a simple option for doing an isolated staging deployment before touching the live s3 bucket.
Likewise, the new release infrastructure has much better "phasing". For legacy reasons, today’s release infra publishes the release artifacts in multiple stages. This doesn’t cause too many problems in practice because it’s only upon publication of the final set of artifacts that Rust is in practice considered released, but it’s messy, and directly contributed to the confusion with the overwritten 1.15.0 bins.
Another concern might be the decision making process. We had to make some tough choices fast, and it was stressful. Just the question of whether any given patch should be backported is tough - we don’t have any established guidelines yet.
Then there are decisions about handling problems during the release. Often it’s tempting to make little adjustments, but at the same time every extra decision introduces the possibility for human error, so I lean toward mechanically following the process when in doubt. Dealing with the bad tag caused a lot of anguish this time. Force pushing tags is clearly a bad practice. For the sake of not screwing up further, we felt it was best to just press forward with the retagging and reevaluate the process after we had safely finished the release.
During this release we had to update the release notes several times, on all three channels, because the release notes contain the release date. Every time the release gets pushed back the notes have to change, or else they will be confusing.
Here are some concrete tasks I am committed to:
- Check in a test for the -fPIC issue
- Adjust the release process to not retag
Other things we might do:
- Adjust the stable release process to always do a dev-static deployment before doing the real deployment. This is relatively simple with the new deployment process around rust-central-station.
- Remove the release date from the release notes
- Investigate environment isolation for the macs
- Add an independent artifact verification step before the final deploy. Today we pretty much deploy whatever the builders produced. We could pretty easily have a master artifact list saying exactly what we expect the output set to be.
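The artifact verification idea in the last item could look something like the sketch below. Everything here is an assumption for illustration: the function `check_artifacts`, the `ManifestDiff` type, and the short artifact list are hypothetical, and a real master list would cover every target and component. The point is only that the deploy step compares what the builders produced against what we expected, and refuses to continue on any difference.

```python
from typing import Iterable, List, NamedTuple


class ManifestDiff(NamedTuple):
    missing: List[str]      # expected but not produced
    unexpected: List[str]   # produced but not expected


def check_artifacts(expected: Iterable[str], produced: Iterable[str]) -> ManifestDiff:
    """Compare produced artifact names against a master list (hypothetical helper)."""
    expected_set, produced_set = set(expected), set(produced)
    return ManifestDiff(
        missing=sorted(expected_set - produced_set),
        unexpected=sorted(produced_set - expected_set),
    )


# Illustrative master list for a 1.15.1 release (heavily abbreviated).
EXPECTED = [
    "rustc-1.15.1-src.tar.gz",
    "rustc-1.15.1-x86_64-apple-darwin.tar.gz",
]

# What a builder with a dirty dist directory might hand us.
produced = [
    "rustc-1.15.1-src.tar.gz",
    "rustc-1.15.1-x86_64-apple-darwin.tar.gz",
    "rustc-1.15.0-src.tar.gz",  # stale artifact from the previous release
]

diff = check_artifacts(EXPECTED, produced)
deploy_ok = not diff.missing and not diff.unexpected
print(diff, deploy_ok)
```

A check like this would have flagged the stale 1.15.0 source tarball as unexpected before anything reached s3. Comparing names is the cheapest version; comparing checksums against the previously published release would be a natural extension.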