EWR1 Degraded Network
Incident Report for Equinix Metal
Postmortem

Reason for Outage

Start Date: 10:26 UTC 5/28/2019

End Date: 14:47 UTC 5/28/2019

Problem Location: EWR1

Problem Description: Customer servers intermittently inaccessible

Outage Details

Sequence of Events

On Tuesday, 05/28/2019, at approximately 10:26 UTC, customers with servers in a single rack (a04.ewr1) in our EWR1 location experienced intermittent loss of connectivity to their servers. Engineers identified the cause as a combination of failed hardware and a software bug. Once a software patch was applied and the failed hardware was replaced, all customer services were restored.

Root Cause

Customers in a single rack, connected to a specific highly available switch (esr1.a04.ewr1), were impacted by what was determined to be two issues. The first to be discovered was a flapping main uplink caused by a faulty optic. The optic began taking errors and flapping, which resulted in unstable routing. This failure then exposed a bug in the switch operating system. Engineers applied a software patch to correct the bug and physically replaced the faulty optic. Once this was completed, service was restored to normal.

Customer Impact

All servers hosted in this rack had intermittent connectivity for 1 hour and 20 minutes.

Timeline

All times are in UTC.

10:26: Incident was opened to track the case as reported by internal monitoring and customer feedback. Initial troubleshooting was begun by the Packet Cloud Ops team.

11:08: Situation was escalated to the Packet Network Engineering team due to severity and customer impact.

11:23: Network engineers determined the cause of the issue to be a faulty optic in the switch and a software bug on the switch.

11:47: Software patch applied to the switch operating system.

12:01: Faulty optic removed from available routing paths; customer connectivity restored and normalized.

14:47: Faulty optic replaced, along with the remote side, restoring full redundancy.
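
The elapsed time at each step of the timeline above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the event labels and the function name minutes_since_start are assumptions made for the example and are not part of any Packet tooling.

```python
from datetime import datetime

# Timestamps (UTC, 05/28/2019) copied from the timeline above.
# The short labels are paraphrases added for this example.
TIMELINE = [
    ("10:26", "incident opened"),
    ("11:08", "escalated to Network Engineering"),
    ("11:23", "cause determined"),
    ("11:47", "software patch applied"),
    ("12:01", "customer connectivity restored"),
    ("14:47", "faulty optic replaced, full redundancy restored"),
]


def minutes_since_start(timeline):
    """Return (label, minutes elapsed since the first entry) for each entry.

    Assumes all entries fall on the same calendar day, as they do here.
    """
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for clock, label in timeline:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


for label, minutes in minutes_since_start(TIMELINE):
    print(f"+{minutes // 60}h{minutes % 60:02d}m  {label}")
```

With these timestamps, the software patch lands 81 minutes after the incident was opened, customer connectivity is restored at 95 minutes, and full redundancy returns at 261 minutes.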

Posted Jun 04, 2019 - 01:32 UTC

Resolved
This incident has been resolved.
Posted May 28, 2019 - 20:15 UTC
Update
We continue to monitor for any further issues. If you encounter an issue, please email support@packet.com.
Posted May 28, 2019 - 15:57 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 28, 2019 - 12:10 UTC
Investigating
We are investigating the cause of a partial network outage experienced in our EWR1 facility isolated to one of our switches. Please reach out to support@packet.com with any questions.
Posted May 28, 2019 - 10:32 UTC
This incident affected: Equinix Metal Network.