Crates.io expiry postmortem (2016-11-07)


#1

At 11/07/2016 10:55 PM UTC, the SSL certificate for crates.io expired, and the site was down for approximately one hour and 26 minutes. We’re very sorry about the disruption here.

What happened

We host crates.io with Heroku, and use the ExpeditedSSL addon to manage the certificate. @alexcrichton got an email for renewal of the certificate on October 19, and followed the steps to renew it. We checked several times, most recently last week, to see whether the certificate had been updated. It hadn’t, and we then forgot to check again. This wasn’t a hack and all of the data is safe; this was purely an operational error.

This led to a general outage of crates.io, which affected people’s cargo builds when fetching new crates. It also left our CI system unable to fetch crates, which failed all of the outstanding builds.

Timeline of events

We noticed at 10:57PM, two minutes after it expired.

We filed an expedited support ticket, and Heroku responded within ten minutes. Unfortunately, since this was a problem with an addon, they forwarded us to ExpeditedSSL’s support, and couldn’t guarantee when they’d get back to us.

At 11:20PM, after not hearing back, @aturon gave the go-ahead to just buy another certificate, in the hopes that this would let us get back up faster. Nine minutes later, @alexcrichton completed the process with DigiCert, and got the message:

Your CSR has been submitted. We will update the order as soon as possible and contact you if there is anything else we need to issue the certificate.

@alexcrichton immediately got on the phone with support, but there was a problem: In order to get our new certificate, we had to have an affiliation with a company. Now, Mozilla is of course a company, but they already have an account with DigiCert, and since we aren’t the people affiliated with the account, they would not let us use Mozilla.

The call was elevated to a supervisor, and @brson tried to get in touch with ExpeditedSSL.

In the meantime, @alexcrichton decided to re-issue the certificate, but for him personally, in the hopes (again) that this would be resolved quickly. But at 11:49PM, another error:

You do not have permission to manage sni endpoints on crates-io.

After looking at things and talking to support…

We received your renew cert but we cannot perform the install as it seems that you have a bad SSL Endpoint. Please remove the SSL Endpoint, log out of Heroku and wait about 5 minutes then go back in. Re-add the SSL Endpoint and let us know so that we can retry the install. This may also require you to update your DNS settings as well.

In the background, @brson was also trying to get in touch with Mozilla people to possibly get a certificate through their account.

At 12:21 AM, @alexcrichton managed to get the new certificate installed and updated the DNS. Everything was then working again, modulo the time it can take DNS changes to propagate.

We’ll be continuing to monitor what’s going on; please let us know in this thread if you are having more problems.

Steps in the future

We will be figuring out how to make this better in the future, but it’s not yet clear what should be done. We had made this an automated process precisely so that these kinds of issues would not crop up, but it obviously failed in this case. It’s possible that switching providers would help here.


#2

Have you considered using Let’s Encrypt? They’re top notch, supported by Mozilla, and it doesn’t appear crates.io needs features they lack.

I’m not familiar with Heroku so I don’t know how hard that is to set up.


#3

From IRC:

03:06 < nagisa> so why isn’t crates.io using letsencrypt?
03:06 < nagisa> which is like literally the only sane way to get certs nowadays
03:25 <@brson> nagisa: crates.io predates letsencrypt and retrofitting it isn't trivial. i was told today though that our heroku 
               account will have access to lets encrypt in some systemic way soon.


#4

I’ve been using letsencrypt-rs on all projects that require an SSL cert. Highly recommended :smiley:


#5

I’m confused: was this ultimately a problem with the SSL certificate provider or the Heroku addon provider? Or was ExpeditedSSL both in this case? Which provider did you mean when you said that switching might help?


#6

I use a nagios plugin at work to warn me when certs or domains expire. It could be a good idea to have a final warning in the last few days, in case you get the first warning weeks in advance, mark it as acknowledged and then forget about it.
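A minimal sketch of that kind of escalating reminder, in Python. The thresholds and the name `expiry_alert_level` are assumptions for illustration; this is not how any particular Nagios plugin works. The point is only that the level is recomputed from the time remaining, so acknowledging an early notice does not silence the later, louder ones.

```python
from datetime import datetime, timezone

# Hypothetical thresholds in days before expiry, most urgent first.
THRESHOLDS = [(1, "critical"), (7, "warning"), (30, "notice")]


def expiry_alert_level(expires_at: datetime, now: datetime) -> str:
    """Return an alert level based on how much time is left before expiry."""
    seconds_left = (expires_at - now).total_seconds()
    if seconds_left <= 0:
        return "expired"
    days_left = seconds_left / 86400
    # Walk from the most urgent threshold to the least urgent one.
    for limit_days, level in THRESHOLDS:
        if days_left <= limit_days:
            return level
    return "ok"


if __name__ == "__main__":
    # Expiry time as reported in the postmortem above.
    expiry = datetime(2016, 11, 7, 22, 55, tzinfo=timezone.utc)
    for when in (
        datetime(2016, 10, 19, tzinfo=timezone.utc),
        datetime(2016, 11, 3, tzinfo=timezone.utc),
        datetime(2016, 11, 7, tzinfo=timezone.utc),
    ):
        print(when.date(), expiry_alert_level(expiry, when))
```

Run daily from cron or a monitoring system, something like this would keep raising the level as the date approaches instead of relying on a single email.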


#7

it sounds like “find a way to increase the nag factor” would help here. (this seems in line with @bbigras’s thoughts)


#8

They were both, yes.

From ExpeditedSSL to something else that makes more noise if things aren’t set up to renew, or is more reliable, or something. As I said, not totally clear :smile:


#9

Just wanted to say oof, sounds like a nasty failure cascade, and certs suck.

It seems like the best thing that could be improved here is monitoring as @bbigras notes. There are services that will monitor your site and notify you when certs are about to expire. Here’s a free one (I’m not sure how good it is, though):

https://certificatemonitor.org/
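For anyone who would rather script this than rely on a third-party service, here is a rough sketch of the check itself using only Python's standard library. The function names and the 14-day cutoff are my own assumptions, and this is not what the service above does internally.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def days_left(expiry: datetime, now: datetime) -> float:
    """Number of days (possibly fractional or negative) until `expiry`."""
    return (expiry - now).total_seconds() / 86400


def fetch_expiry(host: str, port: int = 443, timeout: float = 10.0) -> datetime:
    """Connect to `host` over TLS and return its certificate's expiry time.

    With the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError; treat that as an alert too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return parse_not_after(cert["notAfter"])


if __name__ == "__main__":
    expiry = fetch_expiry("crates.io")
    remaining = days_left(expiry, datetime.now(timezone.utc))
    print(f"crates.io certificate expires {expiry:%Y-%m-%d %H:%M} UTC "
          f"({remaining:.1f} days left)")
    # Hypothetical cutoff: exit non-zero so cron or CI can raise an alert.
    if remaining < 14:
        raise SystemExit(1)
```

The advantage of checking the live endpoint is that it catches the exact failure in this postmortem: a renewal that was requested but never actually installed.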


#10

And what if … the crates would be distributed via p2p, based on their cryptographic hash?
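To make the idea concrete: if a crate is identified by the hash of its contents, a client can accept the bytes from any peer and verify them locally. A minimal sketch, assuming SHA-256 as the hash and plain dicts standing in for untrusted peers; `content_id` and `fetch_verified` are hypothetical names, not part of cargo or crates.io.

```python
import hashlib
from typing import Dict, Iterable


def content_id(data: bytes) -> str:
    """Identify a blob by the hex SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(peers: Iterable[Dict[str, bytes]], wanted_id: str) -> bytes:
    """Return the first blob from any peer whose hash matches `wanted_id`.

    Peers are untrusted: a blob is accepted only if re-hashing it locally
    reproduces the identifier that was asked for.
    """
    for peer in peers:
        data = peer.get(wanted_id)
        if data is not None and content_id(data) == wanted_id:
            return data
    raise LookupError(f"no peer served valid content for {wanted_id}")


if __name__ == "__main__":
    crate = b"pretend this is a .crate tarball"
    wanted = content_id(crate)
    bad_peer = {wanted: b"tampered bytes"}   # serves the wrong content
    good_peer = {wanted: crate}
    assert fetch_verified([bad_peer, good_peer], wanted) == crate
    print("fetched and verified", wanted[:16])
```

This only protects the download itself. The client still has to learn the expected hash from somewhere it trusts, so an index would need to stay reachable (or be distributed and authenticated some other way).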