EWR1 Block Storage Issues

Incident Report for Equinix Metal

Postmortem

Reason for Outage

EWR1 Block Storage Outage [February 2nd-3rd, 2019]

Incident: https://status.packet.com/incidents/wd043mndj0pm

Start Date: 19:20 UTC (Saturday, February 2nd)

End Date: 23:14 UTC (Sunday, February 3rd)

Location: EWR1

Description: EWR1 Block Storage Outage

Outage Details

On Saturday, February 2nd, 2019 at 19:20 UTC, Packet Support noticed that our EWR1 Block Storage cluster was in a failed state, putting volumes in read-only mode. We immediately escalated the issue to our storage fabric vendor and began working to identify the issue.

Within the first three (3) hours, we discovered that cluster nodes were being forced to reboot by a high-volume “garbage” initiator: repeated attach and detach requests, arriving at a DDoS-level rate, caused a series of kernel crashes.

Once the offending hosts were neutralized, the cluster stabilized and we began repairs: fixing volumes and bringing everything back online.
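As an illustration only, the kind of check that could flag a high-volume initiator like the one described above can be sketched with a sliding-window counter. This is a minimal sketch under stated assumptions, not the detection used by Packet or its storage vendor: the class name `InitiatorRateMonitor`, the initiator identifiers, and the threshold values are all hypothetical.

```python
from collections import defaultdict, deque


class InitiatorRateMonitor:
    """Flags initiators whose attach/detach request rate exceeds a threshold.

    Hypothetical helper: the threshold and window are illustrative values,
    not figures taken from the incident.
    """

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps an initiator id to the timestamps of its recent requests.
        self._events = defaultdict(deque)

    def record(self, initiator, timestamp):
        """Record one request; return True if the initiator is over the limit."""
        events = self._events[initiator]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_requests


# Example: allow at most 3 requests per initiator in any 10-second window.
monitor = InitiatorRateMonitor(max_requests=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0, 3.0):
    flagged = monitor.record("initiator-a", t)
print("initiator-a flagged:", flagged)
```

A sliding window is used here rather than a fixed per-minute counter so that a burst straddling a window boundary is still counted as one burst.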

Due to unforeseen problems, our restoration efforts took an unacceptably long time. We deeply apologize. We are working on a root cause analysis with our vendor to prevent outages of this severity in the future.

During this time, users were unable to access existing volumes or to create, delete or attach new volumes.

No data was lost or at risk during this outage.

Follow Up Actions

We do not take this extended outage lightly and are looking at all aspects of our storage deployment, maintenance, monitoring, and incident response processes.

In the short term, to make our global fleet of block storage clusters more resilient, we will be performing a stopgap update this week. This will mitigate the kernel crash bug that triggered this event. We will also accelerate our scheduled upgrade to a new and more robust version of the underlying storage cluster fabric.

We plan to roll out this full upgrade across all sites this month.

Timeline

All times are in UTC.

19:20, February 2nd: Issue was discovered by our Customer Experience team

22:35, February 2nd: Issue was identified as a high volume “garbage” initiator

02:42, February 3rd: An initial fix was attempted. Unfortunately, a setback kept the nodes in a recovering state.

18:41, February 3rd: The original fix was revised and reimplemented.

21:55, February 3rd: Issue was resolved and team began to monitor.

23:14, February 3rd: Issue was closed and team confirmed that block storage in EWR1 was fully functional.
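For readers who want the elapsed time at each milestone, the arithmetic can be sketched as below. The timestamps are copied from the timeline above; the event labels and helper names (`utc`, `elapsed_since_start`, `format_hm`) are illustrative choices, not part of the original report.

```python
from datetime import datetime, timezone


def utc(month, day, hour, minute):
    """Build a UTC timestamp in 2019, the year of the incident."""
    return datetime(2019, month, day, hour, minute, tzinfo=timezone.utc)


# Timeline entries as listed in the report (all times UTC).
TIMELINE = [
    ("discovered", utc(2, 2, 19, 20)),
    ("identified", utc(2, 2, 22, 35)),
    ("initial fix attempted", utc(2, 3, 2, 42)),
    ("revised fix applied", utc(2, 3, 18, 41)),
    ("resolved, monitoring", utc(2, 3, 21, 55)),
    ("closed", utc(2, 3, 23, 14)),
]


def elapsed_since_start(timeline):
    """Return (label, timedelta since the first event) for every event."""
    start = timeline[0][1]
    return [(label, ts - start) for label, ts in timeline]


def format_hm(delta):
    """Format a timedelta as hours and minutes, e.g. '3h15m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


for label, delta in elapsed_since_start(TIMELINE):
    print(f"{format_hm(delta):>7}  {label}")
```

By this calculation, the cause was identified 3 hours 15 minutes after discovery, and the incident was closed 27 hours 54 minutes after discovery.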

Posted Feb 05, 2019 - 19:11 UTC

Resolved

This issue has been resolved. Our block storage service is now fully operational in EWR1.

Posted Feb 03, 2019 - 23:14 UTC

Update

Attention: due to the long duration of this issue, some users may encounter issues detaching and re-attaching block storage volumes. If you encounter this problem, or other block storage issues, we recommend rebooting your server(s).

Posted Feb 03, 2019 - 21:59 UTC

Monitoring

We believe the issue has been resolved. We will continue to monitor the situation for further issues, but at this point full block storage functionality should be restored.

Posted Feb 03, 2019 - 21:55 UTC

Update

Our storage vendor believes they have discovered the missing step in their previous resolution measures. They are now attempting to apply the updated fix.

Posted Feb 03, 2019 - 18:41 UTC

Update

Our storage vendor is adding some steps to their previously proposed fix to make it safe and effective before they attempt to reapply it.

Posted Feb 03, 2019 - 18:04 UTC

Update

Our storage vendor is reviewing the setbacks encountered during their previous attempt to resolve the issue.

Unfortunately, this has caused some delay in our resolution time, as we work to ensure the safety of our customers' data while we resolve this issue.

Posted Feb 03, 2019 - 16:53 UTC

Update

Our storage vendor has attempted to roll out a fix for our block storage issue. Unfortunately, they encountered a slight setback that is keeping the nodes in a recovery state.

Rest assured that our team is on top of the issue and is working with our storage vendor toward a resolution. We apologize for the inconvenience and will post more updates as soon as we have further information.

Posted Feb 03, 2019 - 08:54 UTC

Update

We have identified the issue and are currently testing and implementing a fix. We will update this notice as soon as we have more information.

Posted Feb 03, 2019 - 02:46 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 02, 2019 - 23:42 UTC

Update

Our engineers are still troubleshooting connectivity issues on various block storage volumes.

Posted Feb 02, 2019 - 21:49 UTC

Investigating

We are currently investigating issues with our EWR1 Block Storage volumes. Users might be experiencing volumes going into read-only mode or showing errors.

Posted Feb 02, 2019 - 20:21 UTC