On Wednesday, May 15th, at approximately 18:08 UTC, a top-of-rack switch pair (switch ID: d8601b6b) in our EWR1 datacenter crashed. The crash caused all interfaces on the switch to flap (go down and then come back up within a few seconds).
All interfaces to servers came back up quickly; however, the interfaces that connect these switches to our upstream devices stayed down. Manual intervention by our network engineers was required to bring the affected uplinks back up. Full connectivity was restored to the rack at 18:46 UTC.
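As an illustration of the condition our engineers had to resolve by hand, the following is a minimal sketch of how such "stuck" uplinks can be detected: interfaces that are administratively enabled but operationally down after a flap. It assumes SNMP read access and the pysnmp library; the hostname, community string, and interface indexes are placeholders, not details from this incident.

```python
# Flag uplinks that are admin-up (ifAdminStatus=1) but oper-down
# (ifOperStatus=2) -- i.e. enabled interfaces that never came back after a
# flap. All identifiers below are hypothetical examples.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

SWITCH = "tor-d8601b6b.mgmt.example.net"  # hypothetical management address
COMMUNITY = "public"                      # placeholder read-only community
UPLINK_IFINDEXES = [49, 50]               # hypothetical spine-facing ports


def stuck_down_uplinks(host, community, ifindexes):
    """Return the ifIndexes that are administratively up but operationally down."""
    stuck = []
    for ifindex in ifindexes:
        error_indication, error_status, _, var_binds = next(getCmd(
            SnmpEngine(),
            CommunityData(community),
            UdpTransportTarget((host, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity("IF-MIB", "ifAdminStatus", ifindex)),
            ObjectType(ObjectIdentity("IF-MIB", "ifOperStatus", ifindex)),
        ))
        if error_indication or error_status:
            continue  # treat an SNMP failure as "unknown", not as "stuck"
        admin, oper = (int(var_bind[1]) for var_bind in var_binds)
        if admin == 1 and oper == 2:
            stuck.append(ifindex)
    return stuck


if __name__ == "__main__":
    for ifindex in stuck_down_uplinks(SWITCH, COMMUNITY, UPLINK_IFINDEXES):
        print(f"uplink ifIndex {ifindex} is admin-up but oper-down")
```

A check like this, run against spine-facing ports specifically, would surface the second failure mode directly rather than relying on downstream reachability alerts.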
Two issues contributed to this outage: the initial crash of the switch, and the uplinks that remained down after they flapped.
The root cause of the initial crash is under investigation with our switch vendor. The crash did not match any known issue and has been escalated to our vendor's advanced engineering team. While the crash was the initiating event, it would have been only a minor blip had the second issue not occurred.
The root cause of the uplinks staying down is a known issue affecting a specific type of optical component running a specific software version. We have begun to roll out updated software, which our testing shows fixes this problem.
All servers hosted in this rack had a total loss of connectivity for 38 minutes (18:08 to 18:46 UTC).
All times are in UTC.
18:08: The packet forwarding engine on switch d8601b6b crashed and generated a core dump. All interfaces on the switch flapped; the uplinks to the spine switches stayed down.
18:11: Alerts began to appear in our monitoring, indicating a problem.
18:20: Engineers began to investigate and troubleshoot the issue.
18:32: An internal incident was created.
18:46: Connectivity was restored to the rack through manual intervention by our network engineers.
21:30: Diagnostics were gathered and a case was logged with the switch vendor.
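For context on the response intervals above, here is a small sketch that derives them from the timeline's own timestamps. The metric labels (time to alert, time to restore) are our shorthand, not formal definitions from this report.

```python
# Compute the intervals implied by the timeline above (all times UTC).
from datetime import datetime

EVENTS = {
    "crash": "18:08",
    "alerts": "18:11",
    "investigation": "18:20",
    "restored": "18:46",
}


def minutes_between(start, end):
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return delta.total_seconds() / 60


print(f"time to alert:       {minutes_between('crash', 'alerts'):.0f} min")         # 3
print(f"time to investigate: {minutes_between('crash', 'investigation'):.0f} min")  # 12
print(f"time to restore:     {minutes_between('crash', 'restored'):.0f} min")       # 38
```

The 38-minute crash-to-restore interval is the figure cited in the impact summary above.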