Widespread outage
Incident Report for ReadMe
Postmortem

What Happened

ReadMe experienced a significant outage on Tuesday, March 26 beginning at 16:06 UTC (9:06am Pacific). This outage affected all ReadMe services including our management dashboard and the developer hubs we host for our customers.

We recovered the majority of our service by 16:42 UTC (9:42am Pacific), including most access to the Dash and the Hubs. The rest of the service fully recovered at 17:34 UTC (10:34am Pacific).

Although the outage began with one of ReadMe’s service providers, we take full responsibility and we’re truly sorry for the inconvenience to our customers. We’re working through ways to prevent the same issue from happening again and to reduce the impact from similar events in the future.

Root Cause

ReadMe uses a number of third-party service providers to host our Internet-facing services, including our customer-facing dashboard (dash.readme.com) and developer documentation hubs. One of our primary service providers is Render, a web application hosting platform. This outage began when Render experienced a broad range of outages. We’re still learning more about what happened and we will update this document when those details are available.

We have redundant systems running at Render and can handle a partial Render service outage. Further, in a partial outage it’s usually quick to replace the affected services on Render. But our infrastructure is not resilient to a full outage of the entire Render service, which is what happened on the 26th.

Update (April 1, 2024): Render has confirmed that the issue began with an unintended restart of all customer and system workloads on their platform, which was caused by a faulty code change. Render has provided a Root Cause Analysis for their underlying incident. Although the incident was triggered by our service provider, we’re ultimately responsible for our own uptime and we are working on remediations to reduce the scope and severity of this class of incidents.

Resolution

We host many services on Render including our Node.js web application and our Redis data stores. Redis is an in-memory data store that we use for caches and queues. We don’t use Redis for long-term (persistent) data storage, but many other companies do. Because of the unique challenges of restoring persistent data stores, Render’s managed Redis services took significantly longer to recover.

We implemented two temporary workarounds to restore ReadMe service: we removed Redis from the critical path in areas of our service where this was possible, and we launched temporary replacement Redis services until our managed Redis instances were recovered. After the managed Redis service was available and stable, we resumed normal operations on our managed Redis instances.
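
As an illustration of the first workaround, the sketch below shows one way a Node.js service can keep Redis out of its critical path: cache reads are best-effort with a short timeout, and the primary data store is used whenever the cache is unreachable. This is a minimal, hypothetical example using the ioredis client; the helper names, timeouts, and fallback behavior are assumptions for illustration, not ReadMe’s actual code.

    import Redis from "ioredis";

    // Hypothetical cache-aside helper: Redis is treated as an optimization,
    // not a hard dependency. If the cache is unreachable or slow, we fall
    // through to the primary data source instead of failing the request.
    const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379", {
      maxRetriesPerRequest: 1,   // fail fast instead of retrying indefinitely
      enableOfflineQueue: false, // reject commands immediately when disconnected
    });

    // Reject a slow cache call so it can't block the request path.
    function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
      return Promise.race([
        p,
        new Promise<T>((_, reject) =>
          setTimeout(() => reject(new Error("cache timeout")), ms)
        ),
      ]);
    }

    // loadFromDatabase is a stand-in for the authoritative data source.
    async function getPage(
      id: string,
      loadFromDatabase: (id: string) => Promise<string>
    ): Promise<string> {
      try {
        const cached = await withTimeout(redis.get(`page:${id}`), 100);
        if (cached !== null) return cached;
      } catch {
        // Cache unavailable or slow: degrade gracefully and hit the database.
      }
      const fresh = await loadFromDatabase(id);
      // Best-effort write-back; ignore failures so Redis stays off the critical path.
      redis.set(`page:${id}`, fresh, "EX", 300).catch(() => {});
      return fresh;
    }

Features that depend on Redis itself, such as queue-backed page view tracking, can’t be rescued this way, which is why a subset of functionality remained offline until the temporary replacement Redis servers were in place.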

Timeline

  • 2024-03-26 16:06 UTC: All traffic to ReadMe’s web services begins to fail with HTTP 503 server errors. The ReadMe team begins mobilizing at 16:08 and automated alerts fire at 16:10.
  • 2024-03-26 16:12 UTC: Render confirms that they are experiencing a major outage. We begin troubleshooting and looking for paths forward. The ReadMe Status site is updated at 16:13.
  • 2024-03-26 16:35 UTC: Although Render reports that many services have already recovered, ReadMe’s applications are still unavailable. We consult with our service provider and determine that Redis caches and queues will take longer to recover. We immediately begin working around the Redis services that have not yet recovered.
  • 2024-03-26 16:42 UTC: We deploy a change to remove Redis from the critical path of many application flows. This restores most ReadMe functionality; from this point forward 88% of requests to the Dash and the Hubs are successful. Some functionality that requires Redis is unavailable, like page view tracking and page quality voting. Further, our Developer Dashboard and its API are still offline. We continue attempting to restore remaining service by deploying alternate Redis servers outside the managed Redis infrastructure.
  • 2024-03-26 17:34 UTC: With the temporary Redis servers in place, all remaining issues with our application are resolved, including the Developer Dashboard and its API. Error rates and response times immediately return to nominal levels. We note the full recovery on our status site at 17:53.

Path Forward

ReadMe is committed to maintaining a high level of service availability; we sincerely apologize for letting our customers down. We will be holding an internal retrospective later this week to learn from this incident and improve our response to future incidents.

This incident identified a number of tightly-coupled services in our infrastructure — failures in some internal services caused unforeseen problems in other related services. Among other improvements, we’ll look into ways to decouple those services.
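
One generic pattern for that kind of decoupling is a circuit breaker: when a dependency starts failing, callers stop waiting on it and fall back to degraded behavior until it recovers. The sketch below is a minimal, hypothetical illustration of the idea; the class, thresholds, and usage are assumptions, not a description of ReadMe’s planned implementation.

    // Minimal circuit-breaker sketch: stop calling a failing dependency for a
    // cooldown period so its outages don't cascade into callers.
    class CircuitBreaker<T> {
      private failures = 0;
      private openUntil = 0;

      constructor(
        private call: () => Promise<T>,
        private fallback: () => T,
        private maxFailures = 5,
        private cooldownMs = 30_000
      ) {}

      async exec(): Promise<T> {
        // While the circuit is open, skip the dependency entirely.
        if (Date.now() < this.openUntil) return this.fallback();
        try {
          const result = await this.call();
          this.failures = 0; // success closes the circuit
          return result;
        } catch {
          if (++this.failures >= this.maxFailures) {
            this.openUntil = Date.now() + this.cooldownMs; // trip the breaker
          }
          return this.fallback();
        }
      }
    }

    // Hypothetical usage: page-view tracking degrades to a no-op instead of
    // failing the page render when the metrics backend is down.
    // const trackView = new CircuitBreaker(() => metricsClient.record(view), () => undefined);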

This incident alone isn’t enough to reevaluate our relationship with Render, but we continually monitor our partners’ performance relative to our service targets. If we are unable to meet our service targets with our current providers, we will engage additional providers for redundancy, or look for replacements depending on the situation.

Finally, our close relationship with Render allowed us to get accurate technical details during the incident. This information allowed us to move quickly and take corrective action.

Final Note

ReadMe takes our mission to make the API developer experience magical very seriously. We deeply regret this service outage and are using it as an opportunity to strengthen our processes, provide transparency, and improve our level of service going forward. We apologize for this disruption and thank you for being a valued customer.

Posted Mar 27, 2024 - 14:53 PDT

Resolved
This incident has been resolved.
Posted Mar 26, 2024 - 12:48 PDT
Monitoring
All metrics-related products have returned to operational status. We have also fixed the underlying issue blocking new edits from being saved. We will continue to monitor to ensure recovery is maintained.
Posted Mar 26, 2024 - 10:53 PDT
Update
While the administrative dashboard is reachable, users are presently unable to save edits to existing documents. We are working to recover this and the metrics-related products.
Posted Mar 26, 2024 - 10:04 PDT
Update
Documentation hubs have recovered and we will continue to monitor their recovery. The Developer Metrics API, dashboard metrics, and "Recent Requests" in documentation API reference pages are still unavailable. We will continue to update as we have more information.
Posted Mar 26, 2024 - 09:51 PDT
Identified
We are beginning to see recovery on administrative dashboards, while our documentation hubs continue to experience very high latency. We are working with our partners to remedy this situation. We will provide an update as soon as we have more information.
Posted Mar 26, 2024 - 09:39 PDT
Investigating
We are aware of a widespread issue that has impacted all of ReadMe's products. We are working with our partners to identify and resolve this issue.
Posted Mar 26, 2024 - 09:13 PDT
This incident affected: ReadMe Hubs, Admin Dashboard, Developer Metrics, Owlbot AI, and ReadMe Micro.