Incident report for 2017-11-09
Network Outage Incident
9 November 2017
A network outage at our colocation provider, OVH, caused a complete loss of service to Outlyer customers from around 7:00 to 10:00 GMT this morning, and partial availability until just after 13:00 GMT. The outage was due to a firmware bug in the optical interface cards that provide connectivity into the OVH data center in Roubaix, France.
7:01 GMT - Our data center provider in Europe, OVH, experiences the simultaneous failure of all 44 redundant connections into their Roubaix data center. All Outlyer services are instantly knocked offline.
9:34 GMT - Connections to the data center are restored by OVH. A few minutes later, we are able to log into our servers. We discover that connections between rooms were also severed, so our clusters need to be reintegrated.
10:03 GMT - Agents begin reconnecting to Outlyer and uploading metrics, while we continue to reintegrate the cluster and process a backlog of metrics. We disable alert processing temporarily to prevent false alerts going out.
13:21 GMT - All services are back online. Alert processing is restored.
According to OVH, the network outage was caused by a firmware bug that erased the configuration for the optical interface cards. They are working with the vendor to investigate.
We take our mission of enabling the world’s DevOps teams to monitor their infrastructure very seriously. Clearly, relying on a single data center operated by a single provider is insufficient.
We are currently working on the next iteration of our product. Besides a much improved UI and many new features, it will be hosted in AWS. This will allow us to take advantage of Amazon’s redundant data center infrastructure worldwide.
If you have any comments or concerns about this incident, please reach out to our Director of Customer Success, Todd Radel, at email@example.com, or in our support Slack channel at https://outlyersupport.slack.com.