At ReadMe we automatically generate SSL certificates for our customers using Let’s Encrypt (you can read more about this here). The certificates are stored in a Redis instance hosted by an external database provider. Yesterday (04/10/2019), that provider performed an internal server migration that caused intermittent connectivity issues. Health checks on our SSL servers started to fail because connections to the database timed out, which resulted in downtime for almost all ReadMe hubs.
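For illustration only, a health check of the kind described above can be sketched in a few lines of Python. This is not our production check (our SSL server runs on OpenResty); the function name `redis_is_healthy` and the one-second default timeout are assumptions made for the example. It opens a TCP connection, sends a Redis `PING`, and treats a timeout or connection error as unhealthy.

```python
import socket


def redis_is_healthy(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a Redis-compatible server answers PING within `timeout` seconds.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    try:
        # The timeout applies to the connect and to the send/recv calls below.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # Redis accepts inline commands ending in CRLF
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
    return reply.startswith(b"+PONG")
```

A server that requires authentication would answer `PING` with an error instead of `+PONG`, so this sketch would report it as unhealthy; a real check would need to authenticate first.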
After a few hours of waiting to see if the upstream issue would be fixed in a timely manner, we made the tough decision to migrate to a different database provider. After taking a recent backup of the SSL certificates and importing them into the new database provider, we redeployed the servers and everything started working again.
We run 4 instances of our SSL server behind an EC2 load balancer for redundancy, but we had only one Redis instance, which made it a single point of failure. We’ve since added replication to this Redis instance on the new database provider to greatly reduce the scale of impact should there be downtime again.
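To show why a replica reduces the impact of an outage, here is a minimal Python sketch of a certificate lookup that falls back to the next store when one is unreachable. It is an illustration under assumptions, not our actual implementation: `StoreUnavailable`, `CertStore`, and `fetch_certificate` are hypothetical names, and each store is modelled as a plain callable rather than a real Redis client.

```python
from typing import Callable, Optional, Sequence


class StoreUnavailable(Exception):
    """Raised when a certificate store cannot be reached (hypothetical)."""


# A store maps a domain to its certificate text, or None if it has no entry.
CertStore = Callable[[str], Optional[str]]


def fetch_certificate(domain: str, stores: Sequence[CertStore]) -> Optional[str]:
    """Try each store in order and return the first answer that arrives.

    A store that raises StoreUnavailable is skipped; if every store is
    unreachable, StoreUnavailable is raised to the caller.
    """
    last_error = None
    for store in stores:
        try:
            return store(domain)
        except StoreUnavailable as exc:
            last_error = exc  # remember the failure and try the next store
    raise StoreUnavailable(f"no certificate store reachable for {domain}") from last_error
```

With a single store in the list, one failure takes the lookup down, which is the situation we were in. With a primary and a replica, the lookup still succeeds as long as either one is reachable.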
Using a Redis Cluster would provide even more resilience, but this is not yet supported in OpenResty, which is what our SSL server is built on. We will monitor for changes on this and upgrade to a Cluster when we can. If you have any more questions, please contact support@readme.io.
13:10 PDT - first health checks started failing
13:16 PDT - issue was being reported intermittently across all ReadMe hubs
14:40 PDT - we decided to wait to see if it would be fixed upstream
15:38 PDT - got confirmation from the old database provider that the problem was on their side
16:53 PDT - the issue wasn’t going to resolve itself in a timely manner, so we started migrating
17:58 PDT - all servers redeployed and the issue appeared to be resolved