EWR1 Degraded Network
Incident Report for Equinix Metal
Postmortem

Sequence of Events

On Wednesday, May 15th at approximately 18:08 UTC, a top-of-rack switch pair (switch ID: d8601b6b) in our EWR1 datacenter crashed. The crash caused all interfaces on the switch to flap (go down, then come back up within a few seconds).

All interfaces to servers came back up quickly; however, the interfaces that connect these switches to our upstream devices stayed down. Manual intervention by our network engineers was required to bring the affected uplinks back up. Full connectivity was restored to the rack at 18:46 UTC.

Root Cause

Two issues contributed to this outage: the initial crash of the switch, and the uplinks remaining down after they flapped.

The root cause of the initial crash is under investigation with our switch vendor. The crash did not match any known issue and has been escalated to our vendor's advanced engineering team. While the crash in itself was the root cause of the issue, it would have been only a minor blip if the second issue had not occurred.

The root cause of the uplinks staying down is a known issue with a specific type of optical component running on a specific software version. We have begun to roll out updated software, which our testing shows has fixed this problem.

Customer Impact

All servers hosted in this rack had a total loss of connectivity for approximately 40 minutes (from 18:08 to 18:46 UTC).

Timeline

All times are in UTC.

18:08: Packet forwarding engine on switch d8601b6b crashes and generates a core dump. All interfaces on the switch flap; uplinks to spine switches stay down

18:11: Alerts start to appear in our monitoring indicating a problem

18:20: Engineers begin to investigate and troubleshoot the issue

18:32: Internal incident created

18:46: Connectivity is restored to the rack as a result of manual intervention from our network engineers.

21:30: Diagnostics gathered and case logged with switch vendor
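
The intervals between the timeline entries above can be worked out with a few lines of Python. This is a minimal sketch for illustration only: the dictionary keys and the helper name `minutes_between` are hypothetical labels rather than terms from our tooling, and the timestamps are assumed to fall on the same UTC day.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, assumed same day).
# The key names are illustrative labels, not official incident fields.
TIMELINE = {
    "crash": "18:08",
    "first_alert": "18:11",
    "investigation_started": "18:20",
    "incident_created": "18:32",
    "connectivity_restored": "18:46",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Minutes from the crash to the first monitoring alert.
time_to_alert = minutes_between(TIMELINE["crash"], TIMELINE["first_alert"])
# Minutes from the crash to engineers starting the investigation.
time_to_investigate = minutes_between(TIMELINE["crash"], TIMELINE["investigation_started"])
# Minutes from the crash to connectivity being restored.
time_to_restore = minutes_between(TIMELINE["crash"], TIMELINE["connectivity_restored"])

print(f"crash to first alert: {time_to_alert} min")
print(f"crash to investigation: {time_to_investigate} min")
print(f"crash to restoration: {time_to_restore} min")
```

Because the crash time is approximate and the entries are given to the minute, the computed intervals should be read as approximate as well.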

Posted May 17, 2019 - 21:11 UTC

Resolved
This incident has been resolved.
Posted May 12, 2019 - 03:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 12, 2019 - 03:49 UTC
Identified
Our network engineers are seeing a degraded network on one of our TOR switches in the EWR1 facility, and a fix is being implemented. Please reach out to support@packet.com should you experience issues.
Posted May 12, 2019 - 03:22 UTC
Investigating
Our engineers are currently investigating a degraded network on one of the TOR switches in the EWR1 facility. Please reach out to support@packet.com should you experience issues.
Posted May 12, 2019 - 02:44 UTC
This incident affected: Equinix Metal Network.