Start Date: 10:26 UTC 5/28/2019
End Date: 14:47 UTC 5/28/2019
Problem Location: EWR1
Problem Description: Customer servers intermittently inaccessible
On Tuesday, 05/28/2019, at approximately 10:26 UTC, customers with servers in a single rack (a04.ewr1) in our EWR1 location began experiencing intermittent loss of connectivity to their servers. Engineers identified the issue as a combination of failed hardware and a software bug. Once a software patch was applied and the failed hardware was replaced, all customer services were restored.
Root Cause
Customers in a single rack, served by a specific highly available switch (esr1.a04.ewr1), were impacted by what was determined to be two issues. The first to be discovered was a flapping main uplink caused by a faulty optic. The optic in question began accumulating errors and flapping, which resulted in unstable routing. This failure then exposed a bug in the switch operating system. Engineers applied a software patch to correct the bug and physically replaced the faulty optic. Once this was completed, service returned to normal.
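For context, this failure mode follows a common pattern: an optic degrades, input errors climb, and the link flaps until it is drained. The sketch below is not Packet's monitoring code; it is a minimal, hypothetical illustration of the kind of error-counter check that surfaces a failing optic, with a stand-in poll_counters() source and an assumed alert threshold (real monitoring would pull these values via SNMP, gNMI, or a switch API).

```python
import time

ERROR_RATE_THRESHOLD = 10   # input errors per interval before alerting (assumed value)
POLL_INTERVAL_SECONDS = 60  # polling interval (assumed value)

def poll_counters():
    """Stand-in for a real collector; returns cumulative input-error counts per uplink."""
    return {
        "esr1.a04.ewr1:uplink-0": 14302,  # illustrative numbers only
        "esr1.a04.ewr1:uplink-1": 12,
    }

def watch_uplinks():
    """Compare each sample against the previous one and flag fast-rising error counters."""
    previous = poll_counters()
    while True:
        time.sleep(POLL_INTERVAL_SECONDS)
        current = poll_counters()
        for interface, errors in current.items():
            delta = errors - previous.get(interface, errors)
            if delta > ERROR_RATE_THRESHOLD:
                print(f"ALERT: {interface} logged {delta} input errors in "
                      f"{POLL_INTERVAL_SECONDS}s -- possible failing optic")
        previous = current

if __name__ == "__main__":
    watch_uplinks()
```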
Customer Impact
All servers hosted in this rack had intermittent connectivity for 1 hour and 20 minutes.
Timeline
All times are in UTC.
10:26: Incident was opened to track the case, as reported by internal monitoring and customer feedback; initial troubleshooting was begun by the Packet Cloud Ops team.
11:08: Situation was escalated to the Packet Network Engineering team due to severity and customer impact.
11:23: Network Engineers determined the cause of the issue to be a faulty optic in the switch and a software bug on the switch.
11:47: Software patch was applied to the switch operating system.
12:01: Faulty optic was removed from available routing paths; customer connectivity was restored and normalized.
14:47: Faulty optic was replaced, along with the remote side, restoring full redundancy.