AMS1 Networking Issue
Incident Report for Equinix Metal
Postmortem

Reason for Outage

Start Date: October 27th, 2018 @ 3:00 PM EDT

End Date: October 27th, 2018 @ 4:15 PM EDT

Internal Ticket: #265

Location: AMS1

Description: AMS1 Rack Level Outage

Outage Details

On Saturday, October 27th, 2018, Packet experienced a rack-level outage affecting a single rack of servers in our AMS1 facility. The outage started at 3:00 PM EDT and was resolved at 4:15 PM EDT.

During the outage, the affected servers lost all network access, but they did not lose power and were not rebooted.

After troubleshooting, our networking team concluded that the outage was caused by a bug in the operating system of the TOR switch pair. The bug triggered a split-brain condition in which the primary switch lost its state and the backup switch never promoted itself to primary routing member.
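
The failure mode described above can be illustrated with a minimal sketch. It is not taken from the switch OS or from Packet's tooling; the Role values and the classify_pair helper are hypothetical names chosen for illustration, and the sketch assumes each switch in the pair reports exactly one role.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical role a TOR switch reports for itself."""
    PRIMARY = "primary"
    BACKUP = "backup"
    UNKNOWN = "unknown"  # the switch has lost its state


def classify_pair(role_a: Role, role_b: Role) -> str:
    """Classify the health of a redundant switch pair from its two roles."""
    roles = (role_a, role_b)
    primaries = roles.count(Role.PRIMARY)
    if primaries == 2:
        # Both members believe they are primary.
        return "dual-primary"
    if primaries == 0:
        # Nobody is routing: the case described in this report.
        return "no-primary"
    # Exactly one primary; healthy only if the peer is a proper backup.
    return "healthy" if Role.BACKUP in roles else "degraded"


# The primary lost its state and the backup never took over.
print(classify_pair(Role.UNKNOWN, Role.BACKUP))
```

In this model the incident corresponds to the "no-primary" result: the pair has no member acting as the primary routing member, so the rack has no working path to the network even though both switches are powered on.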

Full network connectivity was restored after our networking team reloaded both affected TOR switches and applied a temporary patch.

Packet will schedule a maintenance window in the coming days to upgrade the affected switches to a new OS version, and will notify all affected customers in advance.

Timeline

All times are in EDT.

Saturday, October 27th, 2018

  • 3:00 PM - Packet started receiving various internal alerts related to network issues in AMS1.
  • 3:40 PM - Initial troubleshooting indicated that the issue was limited to a single rack.
  • 4:05 PM - Root cause identified as a split-brain between the two TOR switches.
  • 4:15 PM - Connectivity restored to the rack after fully reloading both TOR switches and applying a patch.
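
The timeline is given in EDT, while the status updates further down are stamped in UTC. As a small aid for lining the two up, the following sketch converts the timeline entries to UTC. It assumes a fixed UTC-4 offset for EDT on this date; the helper name edt_to_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# EDT is modeled as a fixed UTC-4 offset (an assumption valid for this date).
EDT = timezone(timedelta(hours=-4), "EDT")

# Clock times copied from the timeline above.
TIMELINE_EDT = ["3:00 PM", "3:40 PM", "4:05 PM", "4:15 PM"]


def edt_to_utc(clock: str) -> datetime:
    """Convert a timeline clock time on October 27th, 2018 from EDT to UTC."""
    local = datetime.strptime(f"2018-10-27 {clock}", "%Y-%m-%d %I:%M %p")
    return local.replace(tzinfo=EDT).astimezone(timezone.utc)


for clock in TIMELINE_EDT:
    print(clock, "EDT ->", edt_to_utc(clock).strftime("%H:%M UTC"))
```

Under this assumption, connectivity was restored at 20:15 UTC, which lines up with the Monitoring update posted at 20:20 UTC.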

Impact Notes

  1. Affected servers lost all network access during the outage. There was no power loss and no servers were rebooted.
  2. A maintenance window is currently being scheduled, during which the affected infrastructure will be upgraded to a new OS version.
Posted Oct 31, 2018 - 19:47 UTC

Resolved
This incident has been resolved.
Posted Oct 27, 2018 - 20:49 UTC
Monitoring
We have restored connectivity and are monitoring the health. All affected customers should be seeing normal traffic at this point.
Posted Oct 27, 2018 - 20:20 UTC
Update
We have identified the issue and are working to bring the affected equipment back online.
Posted Oct 27, 2018 - 20:04 UTC
Identified
We have identified a Top of Rack networking issue affecting a limited subset of customers in our AMS1 availability zone. We are working to repair the issue now.
Posted Oct 27, 2018 - 19:50 UTC
This incident affected: Equinix Metal Network.