ONLINE STATUS

Incident report: Network outage 30/03/2015

Summary

Incident date: 30/03/2015
- IP connectivity was disrupted between 13h52 and 14h14 CEST for customers hosted at our site in Windhof.
- VPS customers on our cluster1 in Windhof were affected at random. Services were progressively restored, with full restoration at 16h39.
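
For reference, the two outage windows in the summary can be worked out as below. This is a minimal sketch that uses only the times stated above. It assumes the VPS impact began at the same time as the network disruption (13h52), which the report does not state explicitly. The variable and function names are illustrative.

```python
from datetime import datetime

# Times taken from the summary above (CEST, 30/03/2015).
FMT = "%d/%m/%Y %Hh%M"
start = datetime.strptime("30/03/2015 13h52", FMT)
ip_restored = datetime.strptime("30/03/2015 14h14", FMT)
vps_restored = datetime.strptime("30/03/2015 16h39", FMT)

# Assumption: VPS impact started together with the network disruption.
ip_outage = ip_restored - start
vps_outage = vps_restored - start


def minutes(delta):
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


print(minutes(ip_outage))   # 22 minutes of IP connectivity disruption
print(minutes(vps_outage))  # 167 minutes (2h47) until full VPS restoration
```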

Incident Details

The report below is a breakdown of the incident that was experienced across our network between 13h52 and 14h14 CEST on Monday 30th March 2015.
The fault was traced to one of the core cluster switches in one of our POPs in Windhof.

Breakdown of the events:
At 13h52 we became aware of a loss of service for customers, which appeared to affect both IP connectivity and VPS services across our site in Windhof. Our systems and network teams were immediately engaged to help with diagnosis and resolution.
The cause of the issue was traced to a split-brain scenario between two core switches, which left both devices in an active/active state and interrupted network traffic.
The switch clustering processes were restarted. IP network services were fully restored at 14h14.
As a side effect, our virtual private server (VPS) infrastructure also experienced some downtime, because it lost connectivity to its storage.
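
To illustrate what a split-brain leading to an active/active state means, here is a generic sketch of a two-node cluster. It is not a description of our vendor's clustering mechanism: the node names "A" and "B", the functions `decide_role` and `cluster_state`, and the optional "witness" tie-breaker are all hypothetical and exist only for this illustration.

```python
def decide_role(node, peer_visible, witness_grants=None):
    """Role one node picks on its own, based only on what it can observe."""
    if peer_visible:
        # Normal operation: the peers agree that A is active and B is standby.
        return "active" if node == "A" else "standby"
    if witness_grants is not None:
        # Hypothetical tie-breaker: only the node chosen by the witness goes active.
        return "active" if witness_grants == node else "standby"
    # No peer and no tie-breaker: the node assumes its peer is dead and goes active.
    return "active"


def cluster_state(peer_link_up, witness=None):
    """Roles of both nodes for a given state of the link between them."""
    return {
        node: decide_role(node, peer_link_up, witness)
        for node in ("A", "B")
    }


print(cluster_state(True))                # one active node, one standby
print(cluster_state(False))               # split-brain: both nodes active
print(cluster_state(False, witness="B"))  # tie-breaker keeps a single active node
```

In the second case both nodes act as the active member at the same time, which is the kind of active/active state that interrupts traffic.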

Incident Response Actions

The immediate action taken was to open an incident report with our hardware vendor to identify and correct any potential problem in the switch clustering mechanism.


Post Incident Actions

Our hardware vendor recommends upgrading our switch firmware to a newer version that improves the stability of the clustering technology. All equipment will be upgraded within the next few days, without any impact on your services.
In the coming weeks, we will also separate the core network services from the network storage services, to improve the redundancy and reliability of our services.