Downtime 4/10/2019 Post Mortem

At ReadMe we automatically generate SSL certificates for our customers using Let’s Encrypt (you can read more about this here). The certificates we generate are stored in a Redis instance hosted by an external database provider. Yesterday (04/10/2019) that provider performed an internal server migration, which caused intermittent connectivity issues. Health checks on our SSL servers started to fail because connections to the database timed out. This resulted in downtime for almost all ReadMe hubs.
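To illustrate the failure mode, here is a minimal sketch of a health check that depends on reaching a backing datastore within a timeout. It is not ReadMe's actual check; the function name `datastore_reachable` and the one-second default timeout are assumptions made for the example, and it only tests that a TCP connection opens, not that Redis answers commands.

```python
import socket


def datastore_reachable(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s.

    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts alike.
        return False
```

A check like this reports a server as unhealthy whenever the datastore is slow or unreachable, so an intermittent upstream problem can take every server behind the load balancer out of rotation at once.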

After a few hours of waiting to see if the upstream issue would be fixed in a timely manner, we made the tough decision to migrate to a different database provider. After taking a recent backup of the SSL certificates and importing them into the new database provider, we redeployed the servers and everything started working again.

We have 4 instances of our SSL server running behind an EC2 load balancer for redundancy, but we had only one Redis instance, which made it a single point of failure. We’ve since added replication to this Redis instance on the new database provider to greatly reduce the impact should there be downtime again.
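As a rough sketch of why a replica reduces the impact, the snippet below tries a list of read functions in order and returns the first successful result. Everything here is hypothetical: `read_with_failover` and the reader callables stand in for whatever client the SSL servers use, and this is not a description of how replication is configured at the provider.

```python
from typing import Callable, Optional, Sequence


def read_with_failover(
    key: str,
    readers: Sequence[Callable[[str], Optional[bytes]]],
) -> Optional[bytes]:
    """Try each reader in order and return the first successful result.

    Each reader is a hypothetical function that looks up `key` on one
    endpoint (for example the primary, then a replica) and raises
    ConnectionError if that endpoint cannot be reached.
    """
    last_error: Optional[Exception] = None
    for read in readers:
        try:
            return read(key)
        except ConnectionError as exc:
            last_error = exc  # this endpoint is down, try the next one
    raise ConnectionError("all endpoints failed") from last_error
```

With a single reader, one unreachable endpoint fails every lookup. With a primary and a replica, lookups keep working as long as either one is reachable, although a replica may briefly serve older data than the primary.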

Using a Redis Cluster would provide even more resilience, but this is not yet supported in OpenResty, which is what our SSL server is written in. We will monitor for changes on this and upgrade to a Cluster when we can. If you have any more questions, please contact support@readme.io.

Timeline

13:10 PDT - first health checks started failing

13:16 PDT - issue intermittently being reported across all ReadMe hubs

14:40 PDT - we decide to wait it out to see if it gets fixed upstream

15:38 PDT - got confirmation from old database provider that it’s their problem

16:53 PDT - the issue isn’t going to resolve itself in a timely manner, so we start migrating

17:58 PDT - all servers redeployed and issue appears to be resolved

Posted Apr 11, 2019 - 13:35 PDT

Resolved

Redeployed to all servers

Posted Apr 10, 2019 - 18:10 PDT

Update

Deployed the new database to one of our servers. Checking for full functionality then will deploy to the others.

Posted Apr 10, 2019 - 18:01 PDT

Update

Migrating certificates to another provider

Posted Apr 10, 2019 - 17:50 PDT

Monitoring

Working on migrating to a different database provider.

Posted Apr 10, 2019 - 17:35 PDT

Identified

We use a database instance on a third-party provider to host our SSL certificates. They're currently having problems, which is causing us to serve invalid certificates.

Posted Apr 10, 2019 - 14:45 PDT

Investigating

We’re having issues with our SSL certificate generation and are investigating now.

Posted Apr 10, 2019 - 14:23 PDT

This incident affected: ReadMe Hubs.