The 1.15.1 release was quite troublesome. Aside from the embarrassing
bugs we had to actually fix, we made several mistakes during the release
that, while ultimately not creating much end-user disruption, were
stressful to deal with and created a lot of risk.
I’ve taken notes here about what happened to try to capture it while
it’s fresh, but they are not well organized.
Most of the release tasks were done by me or @alexcrichton, with input
from the core team.
Problems
- Retagged 1.15.1 three times. Yes, four tags total.
- Accidentally uploaded rebuilt 1.15.0 mac rustc binaries
to the archives
- Accidentally overwrote the released 1.15.0 source tarball and its
.asc and .sha256, changing signatures
- Likely accidentally overwrote one of the two sets of 1.15.0 rustc
binaries for all Apple platforms. Not the set that rustup cares
about. The set that most people don’t touch.
The last three are all a single incident.
Links
Events
- 2017-02-02 - (Thursday) Released 1.15
- 2017-02-02 - as_mut_slice bug reported and fixed on master
- 2017-02-03 - (Friday) Scheduled 1.15.1 for Tuesday 2/7
- 2017-02-03 - Backported minimal fix to beta and stable
- 2017-02-03 - Added 1.15.1 release notes to stable, containing 2/7 date
- 2017-02-03 - Built and tested successfully on beta
- 2017-02-03 - Build started on stable
- 2017-02-06 - (Monday) Stable build failed because of mac 1.15.0 src
tarball collisions (more below)
- 2017-02-06 - Believing the problem is a stale cache on the build master,
files are deleted on build master
- 2017-02-06 - Stable build restarted from scratch. Attempting to
continue the in-progress deployment was too much of an unknown
- 2017-02-07 - (Tuesday) Release day?
- 2017-02-07 - Stable build failed. Same problem.
- 2017-02-07 - The problem is not on the build master but on
the build slaves. Build slaves are cleaned.
- 2017-02-07 - Rescheduled 1.15.1 for Wednesday 2/8
- 2017-02-07 - Updated release notes on stable for 2/8
- 2017-02-07 - Started build
- 2017-02-07 - Updated www and blog for 1.15.1
- 2017-02-07 - 1.15.1 is tagged, not signed, for updating thanks.rlo
- 2017-02-08 - (Wednesday) Release day?
- 2017-02-08 - Build is good. Ready to deploy.
- 2017-02-08 - Release artifacts contain an error. The same bogus
1.15.0 mac artifacts that were causing the build to fail before
were still produced but did not cause the build to fail. Bogus
duplicate 1.15.0 source tarballs and mac rustc bins are uploaded to
s3. We think they are only uploaded to the archives, which sucks,
but has precedent (we occasionally end up with non-canonical release
artifacts in the archive and don’t worry much about it). We’re still
set to deploy.
- 2017-02-08 - We’re directed to a bugzilla bug where Firefox
doesn’t build because of the -fPIC regression, and asked if
that can be fixed for a point release
- 2017-02-08 - We initially decide to proceed with 1.15.1, but the bug
looks kinda bad. With the prospect of an immediate 1.15.2 we delay
the release to take a break and investigate.
- 2017-02-08 - Regular core team mtg. We decide to postpone the release
another day and backport the fix from master.
- 2017-02-08 - acrichto and brson plan the steps for the revised
release
- 2017-02-08 - Patch is applied to beta and run through bors for testing
- 2017-02-08 - Patch is applied to stable and release build begun
- 2017-02-08 - 1.15.1 is retagged, but not signed, for the new commit,
in order to close the window in which the tag points to the wrong
commit. The new tag is incorrect.
- 2017-02-08 - Wait, the bogus 1.15.0 mac/src deployment wasn’t just the
archives, the ones in dist/ were overwritten too, shas and sigs
changed. There was never any particular reason to believe otherwise.
Tables are flipped.
- 2017-02-09 - (Thursday) Release day
- 2017-02-09 - -fPIC fix is validated on yet-undeployed stable build
- 2017-02-09 - Updated www and blog for 1.15.1 release date and the
-fPIC fix
- 2017-02-09 - Retagged 1.15.1 yet again to fix last night’s error
and point it at the real correct commit. Still not signed.
- 2017-02-09 - Released build
- 2017-02-09 - Signed tag
- 2017-02-09 - Sent email to packagers about tagging error
Analysis
There were only two technical errors as far as I can tell:
- The Mac build slaves, which unlike all others do not run in clean environments,
exposed a bug in rustbuild where the ‘dist’ directory was not cleaned between
runs, thus uploading 1.15.0 binaries during a 1.15.1 build
- Our tagging scheme for 1.15, in order to generate thanks.rlo data ahead of
the release, involved pushing an unsigned tag ahead of the release, then
signing it after the release. This scheme is incorrect, and completely
fell apart when we decided to delay the release to add another patch.
The first is fixed. The second will be fixed before the next
release. Unfortunately, this has been the nature of most of the
build/release failures we’ve seen - one-off bugs that are caught and
fixed.
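
As an illustration of the first error, here is a minimal sketch of a check that could run before upload and flag files in 'dist' that carry a version other than the one being built. It is not what rustbuild does; the function name, the version regex, and the file names are assumptions made for the example.

```python
import re

# Hypothetical pattern: a three-part version such as "1.15.0" in a file name.
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def find_stale_artifacts(filenames, expected_version):
    """Return the names that mention a version other than expected_version.

    Names with no version in them are ignored.
    """
    stale = []
    for name in filenames:
        versions = VERSION_RE.findall(name)
        if any(v != expected_version for v in versions):
            stale.append(name)
    return sorted(stale)


# Illustrative contents of a 'dist' directory that was not cleaned
# between a 1.15.0 build and a 1.15.1 build.
dist_files = [
    "rustc-1.15.1-src.tar.gz",
    "rustc-1.15.0-src.tar.gz",
    "rustc-1.15.0-x86_64-apple-darwin.tar.gz",
    "rustc-1.15.1-x86_64-apple-darwin.tar.gz",
]
stale = find_stale_artifacts(dist_files, "1.15.1")
```

A builder could fail the build, or refuse to upload, whenever the returned list is not empty.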
There may be systemic changes we can make to guard against these types of
bugs, but I’m not sure. Obviously having containers or some other way of
isolating the build environment on macs would be better, but options are
poor.
Possibly the best thing we could do is to create an identical staging
environment where we can dry-run the entire release process, and always
do a release dry run before doing the release. Today’s release infrastructure
has such an environment but it is severely degraded. The new release infrastructure,
scheduled to release the 1.17 build, has a simple option for doing
an isolated staging deployment before touching the live s3 bucket.
Likewise, the new release infrastructure has much better
"phasing". For legacy reasons, today’s release infra publishes the
release artifacts in multiple stages. This doesn’t cause too many
problems in practice because it’s only upon publication of the final
set of artifacts that Rust is in practice considered released, but it’s
messy, and directly contributed to the confusion with the overwritten
1.15.0 bins.
Another concern might be the decision making process. We had to make
some tough choices fast, and it was stressful. Just the question of
whether any given patch should be backported is tough - we don’t have
any established guidelines yet.
Then there are decisions about handling problems during the release.
Often it’s tempting to make little adjustments, but at the same time
every extra decision introduces the possibility for human error, so I
lean toward mechanically following the process when in doubt. Dealing
with the bad tag caused a lot of anguish this time. Force pushing tags
is clearly a bad practice. For the sake of not screwing up further, we
felt it was best to just press forward with the retagging and
reevaluate the process after we had safely finished the release.
During this release we had to update the release notes several times,
on all three channels, because the release notes contain the release
date. Every time the release gets pushed back the notes have to
change, or else they will be confusing.
Here are some concrete tasks I am committed to:
- Check in a test for the -fPIC issue
- Adjust the release process to not retag
Other things we might do:
- Adjust the stable release process to always do a dev-static deployment
before doing the real deployment. This is relatively simple with
the new deployment process around rust-central-station.
- Remove the release date from the release notes
- Investigate environment isolation for the macs
- Add an independent artifact verification step before the final deploy.
Today we pretty much deploy whatever the builders produced. We could
pretty easily have a master artifact list saying exactly what we expect
the output set to be.
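
The artifact verification idea in the last item could look roughly like the sketch below. It assumes the master artifact list is just a list of expected file names; the function name and the example names are hypothetical, and a real check would probably compare hashes as well.

```python
def verify_artifacts(expected, produced):
    """Compare what the builders produced against the master artifact list.

    Returns (missing, unexpected), both sorted lists of file names.
    """
    expected_set = set(expected)
    produced_set = set(produced)
    missing = sorted(expected_set - produced_set)
    unexpected = sorted(produced_set - expected_set)
    return missing, unexpected


# Illustrative master list for a 1.15.1 release (names are assumptions).
master_list = [
    "rustc-1.15.1-src.tar.gz",
    "rustc-1.15.1-x86_64-apple-darwin.tar.gz",
]
# Illustrative builder output that still contains a leftover 1.15.0 tarball.
builder_output = [
    "rustc-1.15.1-src.tar.gz",
    "rustc-1.15.1-x86_64-apple-darwin.tar.gz",
    "rustc-1.15.0-src.tar.gz",
]
missing, unexpected = verify_artifacts(master_list, builder_output)
```

The deploy step would proceed only when both lists are empty, so a leftover 1.15.0 file would stop the release before anything reached the live bucket.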