Rust 1.15.1 release postmortem

The 1.15.1 release was quite troublesome. Aside from the embarrassing bugs we had to actually fix, we made several mistakes during the release that, while ultimately not creating much end-user disruption, were stressful to deal with and created a lot of risk.

I’ve taken notes here about what happened to try to capture it while it’s fresh. But they are not so organized.

Most of the release tasks were done by me or @alexcrichton, with input from the core team.

Problems

  • Retagged 1.15.1 three times. Yes, four tags total.
  • Accidentally uploaded rebuilt 1.15.0 mac rustc binaries to the archives
  • Accidentally overwrote the released 1.15.0 source tarball and its .asc and .sha256, changing signatures
  • Likely accidentally overwrote one of the two sets of 1.15.0 rustc binaries for all Apple platforms. Not the set that rustup cares about. The set that most people don’t touch.

The last three are all a single incident.

Events

  • 2017-02-02 - (Thursday) Released 1.15
  • 2017-02-02 - as_mut_slice bug reported and fixed on master
  • 2017-02-03 - (Friday) Scheduled 1.15.1 for Tuesday 2/7
  • 2017-02-03 - Backported minimal fix to beta and stable
  • 2017-02-03 - Added 1.15.1 release notes to stable, containing 2/7 date
  • 2017-02-03 - Built and tested successfully on beta
  • 2017-02-03 - Build started on stable
  • 2017-02-06 - (Monday) Stable build failed because of mac 1.15.0 src tarball collisions (more below)
  • 2017-02-06 - Believing the problem is a stale cache on the build master, files are deleted on build master
  • 2017-02-06 - Stable build restarted from scratch. Attempting to continue the in-progress deployment was too much of an unknown
  • 2017-02-07 - (Tuesday) Release day?
  • 2017-02-07 - Stable build failed. Same problem.
  • 2017-02-07 - The problem is not on the build master but on the build slaves. Build slaves are cleaned.
  • 2017-02-07 - Rescheduled 1.15.1 for Wednesday 2/8
  • 2017-02-07 - Updated release notes on stable for 2/8
  • 2017-02-07 - Started build
  • 2017-02-07 - Updated www and blog for 1.15.1
  • 2017-02-07 - 1.15.1 is tagged, not signed, for updating thanks.rlo
  • 2017-02-08 - (Wednesday) Release day?
  • 2017-02-08 - Build is good. Ready to deploy.
  • 2017-02-08 - Release artifacts contain an error. The same bogus 1.15.0 mac artifacts that were causing the build to fail before were still produced but did not cause the build to fail. Bogus duplicate 1.15.0 source tarballs and mac rustc bins are uploaded to s3. We think they are only uploaded to the archives, which sucks, but has precedent (we occasionally end up with non-canonical release artifacts in the archive and don’t worry much about it). We’re still set to deploy.
  • 2017-02-08 - We’re directed to a bugzilla bug where Firefox doesn’t build because of the -fPIC regression, and asked if that can be fixed for a point release
  • 2017-02-08 - We initially decide to proceed with 1.15.1, but the bug looks kinda bad. With the prospect of an immediate 1.15.2 we delay the release to take a break and investigate.
  • 2017-02-08 - Regular core team mtg. We decide to postpone the release another day and backport the fix from master.
  • 2017-02-08 - acrichto and brson plan the steps for the revised release
  • 2017-02-08 - Patch is applied to beta and run through bors for testing
  • 2017-02-08 - Patch is applied to stable and release build begun
  • 2017-02-08 - 1.15.1 is retagged, but not signed, for the new commit, in order to close the window in which the tag points to the wrong commit. The new tag is incorrect.
  • 2017-02-08 - Wait, the bogus 1.15.0 mac/src deployment wasn’t just the archives, the ones in dist/ were overwritten too, shas and sigs changed. There was never any particular reason to believe otherwise. Tables are flipped.
  • 2017-02-09 - (Thursday) Release day
  • 2017-02-09 - -fPIC fix is validated on yet-undeployed stable build
  • 2017-02-09 - Updated www and blog for 1.15.1 release date and the -fPIC fix
  • 2017-02-09 - Retagged 1.15.1 yet again to fix last night’s error and point it at the real correct commit. Still not signed.
  • 2017-02-09 - Released build
  • 2017-02-09 - Signed tag
  • 2017-02-09 - Sent email to packagers about tagging error

Analysis

There were only two technical errors as far as I can tell:

  • The Mac build slaves, which unlike all others do not run in clean environments, exposed a bug in rustbuild where the ‘dist’ directory was not cleaned between runs, thus uploading 1.15.0 binaries during a 1.15.1 build
  • Our tagging scheme for 1.15, in order to generate thanks.rlo data ahead of the release, involved pushing an unsigned tag ahead of the release, then signing it after the release. This scheme is incorrect, and completely fell apart when we decided to delay the release to add another patch.

The first is fixed. The second will be fixed before the next release. Unfortunately, this has been the nature of most of the build/release failures we’ve seen - one-off bugs that are caught and fixed.
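
To make the first failure mode concrete, here is a minimal sketch of the kind of guard that prevents it: wipe the output directory before a build so a later upload step can only see files from the current run. This is not rustbuild's actual fix. The function names `reset_dist_dir` and `dist_contents` are hypothetical and exist only for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove everything under `dist` and recreate it empty, so artifacts left
/// over from a previous run (say, a 1.15.0 tarball) cannot be picked up by
/// the upload step of a later build.
/// Hypothetical helper, not part of rustbuild.
pub fn reset_dist_dir(dist: &Path) -> io::Result<()> {
    match fs::remove_dir_all(dist) {
        Ok(()) => {}
        // A directory that does not exist is already clean.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    fs::create_dir_all(dist)
}

/// List the file names currently in `dist`, sorted, so a caller can check
/// what an upload step would see.
pub fn dist_contents(dist: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dist)? {
        names.push(entry?.file_name().to_string_lossy().into_owned());
    }
    names.sort();
    Ok(names)
}
```

Calling `reset_dist_dir` at the start of every run costs a little rebuild time, but it makes a non-isolated builder behave more like the clean environments the other platforms already get.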

There may be systemic changes we can make to guard against these types of bugs, but I’m not sure. Obviously having containers or some other way of isolating the build environment on macs would be better, but options are poor.

Possibly the best thing we could do is to create an identical staging environment where we can dry-run the entire release process, and always do a release dry run before doing the release. Today’s release infrastructure has such an environment but it is severely degraded. The new release infrastructure, scheduled to release the 1.17 build, has a simple option for doing an isolated staging deployment before touching the live s3 bucket.

Likewise, the new release infrastructure has much better "phasing". For legacy reasons, today’s release infra publishes the release artifacts in multiple stages. This doesn’t cause too many problems in practice because it’s only upon publication of the final set of artifacts that Rust is in practice considered released, but it’s messy, and directly contributed to the confusion with the overwritten 1.15.0 bins.

Another concern might be the decision-making process. We had to make some tough choices fast, and it was stressful. Just the question of whether any given patch should be backported is tough - we don’t have any established guidelines yet.

Then there are decisions about handling problems during the release. Often it’s tempting to make little adjustments, but at the same time every extra decision introduces the possibility for human error, so I lean toward mechanically following the process when in doubt. Dealing with the bad tag caused a lot of anguish this time. Force pushing tags is clearly a bad practice. For the sake of not screwing up further, we felt it was best to just press forward with the retagging and reevaluate the process after we had safely finished the release.

During this release we had to update the release notes several times, on all three channels, because the release notes contain the release date. Every time the release gets pushed back the notes have to change, or else they will be confusing.

Here are some concrete tasks I am committed to:

  • Check in a test for the -fPIC issue
  • Adjust the release process to not retag

Other things we might do:

  • Adjust the stable release process to always do a dev-static deployment before doing the real deployment. This is relatively simple with the new deployment process around rust-central-station.
  • Remove the release date from the release notes
  • Investigate environment isolation for the macs
  • Add an independent artifact verification step before the final deploy. Today we pretty much deploy whatever the builders produced. We could pretty easily have a master artifact list saying exactly what we expect the output set to be.
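
The artifact verification idea in the last item could look roughly like the sketch below: compare the set of files the builders produced against a master list and refuse to deploy on any difference. The type `ManifestDiff`, the function `verify_artifacts`, and the file names in the usage note are all hypothetical. Nothing like this exists in the release tooling as far as this post says.

```rust
use std::collections::BTreeSet;

/// Difference between the expected artifact list and what was produced.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct ManifestDiff {
    /// Expected artifacts that the build did not produce.
    pub missing: Vec<String>,
    /// Produced artifacts that are not on the expected list.
    pub unexpected: Vec<String>,
}

impl ManifestDiff {
    /// True only when the produced set matches the expected set exactly.
    pub fn is_clean(&self) -> bool {
        self.missing.is_empty() && self.unexpected.is_empty()
    }
}

/// Compare the two file-name lists as sets. The results come back sorted
/// because `BTreeSet` iterates in order.
pub fn verify_artifacts(expected: &[&str], produced: &[&str]) -> ManifestDiff {
    let expected: BTreeSet<&str> = expected.iter().copied().collect();
    let produced: BTreeSet<&str> = produced.iter().copied().collect();
    ManifestDiff {
        missing: expected
            .difference(&produced)
            .map(|name| name.to_string())
            .collect(),
        unexpected: produced
            .difference(&expected)
            .map(|name| name.to_string())
            .collect(),
    }
}
```

With an expected list containing only 1.15.1 files, a stray `rustc-1.15.0-src.tar.gz` (an illustrative name) in the produced set would show up in `unexpected`, and the deploy step could stop when `is_clean()` returns false. A fuller check would probably compare checksums too, since matching names alone would not catch a rebuilt file with the same name.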

Another systemic improvement would be to publish to an append-only data store, or to otherwise put in place measures to prevent overwriting any artifact.
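
As a rough illustration of the "never overwrite" idea, here is a sketch that uses the local filesystem as a stand-in for the real artifact store. That substitution is an assumption: an S3 bucket would need its own mechanism. The function name `publish_once` is hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest`, but only if nothing exists there yet.
/// If a file is already present the call fails with
/// `io::ErrorKind::AlreadyExists` and the existing file is left untouched.
/// Hypothetical helper: the local filesystem stands in for the real store.
pub fn publish_once(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // create only if absent, never truncate
        .open(dest)?;
    file.write_all(bytes)
}
```

With a publish step shaped like this, a second attempt to write a released 1.15.0 file would return an error rather than silently changing its contents and signatures.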

I hope for your sake that this is true more often than not. Thanks for your transparency, @brson, and for release-wrangling in general - it's a harder job than it seems!

Were both of these bugs in beta as well? Especially regarding the -fPIC bug, shouldn’t the Firefox build have caught this beforehand using beta?

Thanks @cuviper!

@jethrogb Both bugs were on beta. Firefox did not try to do the 1.15 upgrade until after the 1.15 release. I believe they recognize that having Firefox test Rust betas would improve coverage, but I don’t have insight into their process. And fwiw I don’t think it’s practical for us at this time to make Firefox itself part of Rust’s test suite.

This is a wonderful and in-depth postmortem on the release process! I also look forward to seeing the discussion around @llogiq's question.

How did that get into a stable release and what can we do to improve our quality assurance to avoid such things happening in the future?

but, that may be a different topic. Thank you to the team for being so open!

FYI we have filed a bug to track running Firefox CI against Beta/Nightly Rust toolchains. We’ve hit these same sorts of problems with our other toolchains in the past (I think you still can’t build Firefox with GCC 6), but given that Rust is developed in-house we really should be dogfooding here.

Nice thanks @luser!

FWIW, current clippy master has a mut_from_ref lint that catches the as_mut_slice error class.

If there is interest, I may pull it into rustc. Otherwise, I’d like to add clippy as an optional test to rustbuild.
