Gentlent experienced an outage on Jan 9th, 2023 caused by a major power outage in Seattle, WA.
On January 9th, 2023, Gentlent experienced an outage that caused disruptions to our services.
The outage was first detected at 17:30 Central European Time (CET), when we received reports of failing HTTP requests from both our team and external uptime providers. Upon further investigation, we determined that the issue was caused by a consensus mismatch between our core database servers, which caused local instances of our application code to crash.
At 17:41, we determined that the outage was occurring at the data center level and raised an emergency incident. Our team worked on restoring the services, which included manually reconfiguring impacted servers and patching the affected code.
At 17:52, we received a notification from our data center provider that a major power outage was occurring in the Seattle area. Despite this, we continued to work on restoring the services.
At 18:07, we began rolling out the first fix, but at 18:16, we encountered another issue caused by the outage. A second fix was deployed shortly thereafter. To bring Gentlent back online more quickly, we temporarily removed some non-critical servers from the network.
Gentlent was partially restored at 18:19, but some parts of the infrastructure were still failing in certain regions. Our team continued to investigate the underlying issues and worked on failovers and service restoration.
At 18:22, we began rolling out the second fix globally. The fix started to take effect at 18:31. By 18:43, services had been re-routed to certain core regions, fixes had been deployed, and the majority of the infrastructure was back online. We also worked on an incident report and long-term fixes for failovers.
Finally, at 19:12, we received reports that the power supply had been restored. We then re-enabled certain regions and services without issue, after which the entire infrastructure was restored.
As a result of this incident, we are committing to several action items to improve the availability of our services in the future. These include moving our legacy status page to a third-party provider, providing emergency notifications to customers, and improving our infrastructure to ensure availability even in the event of key component failures. We will also run tests and simulations to ensure the continued availability of our services during outages.
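To give a rough idea of what such an outage simulation could look like, here is a minimal Python sketch. It is not our production code: the region names, the `Region` type, and the `pick_region` and `simulate_outage` helpers are all hypothetical and exist only for illustration. The sketch marks a region as failed and checks that traffic would still be routed to the next preferred healthy region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool
    priority: int  # lower value = preferred


def pick_region(regions):
    """Return the preferred healthy region, or None if every region is down."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.priority)


def simulate_outage(regions, failed_names):
    """Return a copy of the region list with the named regions marked unhealthy."""
    return [
        Region(r.name, r.healthy and r.name not in failed_names, r.priority)
        for r in regions
    ]


# Hypothetical region names, for illustration only.
REGIONS = [
    Region("us-west", True, 0),
    Region("eu-central", True, 1),
    Region("ap-south", True, 2),
]

if __name__ == "__main__":
    # Simulate losing the preferred region and see where traffic would go.
    degraded = simulate_outage(REGIONS, {"us-west"})
    print(pick_region(degraded))
```

A test built on this idea would fail whenever `pick_region` returns `None` for a scenario in which at least one region should still be reachable, which is the kind of gap a simulated outage is meant to surface before a real one does.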
We apologize for the inconvenience caused to our customers during this outage. We understand that outages can be frustrating and disruptive, and we are committed to improving the reliability and availability of our services. We will take the necessary steps to prevent outages from occurring in the future and to ensure that our systems are able to recover quickly if an incident does occur. We value our customers and appreciate their patience and understanding during this time.