EWR1 Block Storage Outage [February 2nd-3rd, 2019]
Start Date: 19:20 (Saturday, February 2nd)
End Date: 23:14 (Sunday, February 3rd)
Description: EWR1 Block Storage Outage
On Saturday, February 2nd, 2019 at 19:20 UTC, Packet Support noticed that our EWR1 Block Storage cluster was in a failed state, putting volumes in read-only mode. We immediately escalated the issue to our storage fabric vendor and began working to identify the cause.
In the first three (3) hours, we discovered that cluster nodes were being forced into a reboot mode by a high-volume “garbage” initiator: repeated attach and detach requests at a DDoS-level rate resulted in a series of kernel crashes.
When the offending hosts were neutralized, the cluster stabilized and we began repairs, fixing volumes and bringing everything back online.
Due to unforeseen problems, our restoration efforts took an unacceptably long time. We deeply apologize. We are working on a root cause analysis with our vendor to prevent outages of this severity in the future.
During this time, users were unable to access existing volumes or to create, delete, or attach volumes.
No data was lost or at risk during this outage.
We do not take this extended outage lightly and are looking at all aspects of our storage deployment, maintenance, monitoring, and incident response processes.
In the short term, to make our global fleet of block storage clusters more resilient, we will be performing a stopgap update this week. This will mitigate the kernel-crashing bug that triggered this event. We will also accelerate our scheduled upgrade to a new and more robust version of the underlying storage cluster fabric.
We plan to roll out this full upgrade across all sites this month.
All times are in UTC.
19:20, February 2nd: Issue was discovered by our Customer Experience team.
22:35, February 2nd: Issue was identified as a high-volume “garbage” initiator.
02:42, February 3rd: An initial fix was attempted. Unfortunately, we encountered a slight setback that kept the nodes in a recovering state.
18:41, February 3rd: The original fix was revised and reimplemented.
21:55, February 3rd: Issue was resolved and the team began to monitor.
23:14, February 3rd: Issue was closed and the team confirmed that block storage in EWR1 was fully functional.
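
As a rough cross-check of the timeline above, the time between entries and the total duration can be computed with a short script. This is only a minimal sketch and not part of any Packet tooling: the event labels are paraphrased from the timeline, and the function names (parse_utc, format_duration, elapsed_between_events, total_duration) are illustrative assumptions.

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (all UTC, year 2019).
# The short labels are paraphrases, not official event names.
TIMELINE = [
    ("2019-02-02 19:20", "Issue discovered"),
    ("2019-02-02 22:35", "Garbage initiator identified"),
    ("2019-02-03 02:42", "Initial fix attempted"),
    ("2019-02-03 18:41", "Revised fix implemented"),
    ("2019-02-03 21:55", "Issue resolved, monitoring began"),
    ("2019-02-03 23:14", "Issue closed"),
]


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def format_duration(delta):
    """Render a timedelta as 'Hh MMm', with hours allowed to exceed 24."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def elapsed_between_events(timeline):
    """Return (label, time since previous entry) for each entry after the first."""
    times = [parse_utc(stamp) for stamp, _ in timeline]
    return [
        (timeline[i + 1][1], format_duration(times[i + 1] - times[i]))
        for i in range(len(times) - 1)
    ]


def total_duration(timeline):
    """Return the time from the first timeline entry to the last."""
    return format_duration(parse_utc(timeline[-1][0]) - parse_utc(timeline[0][0]))


if __name__ == "__main__":
    for label, gap in elapsed_between_events(TIMELINE):
        print(f"{gap:>8}  {label}")
    print(f"Total: {total_duration(TIMELINE)}")  # Total: 27h 54m
```

By this calculation, the longest gap (15h 59m) falls between the initial fix attempt and the revised fix, and the full window from discovery to closure is 27h 54m.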