Start Date: Sunday, June 2nd 2019, 8:49 UTC
End Date: Sunday, June 2nd 2019, 15:22 UTC
Problem Location: Global Cloud Locations
Problem Description: Intermittent loss of connectivity and degraded routing for Packet Public Cloud locations.
On Sunday, 06/02/2019 at approximately 08:49 UTC, Packet’s internal network monitoring identified significant reachability and performance issues. Initial investigation pointed to routing loops and blackholed traffic to specific internal and Internet destinations. The root cause was found to be a major outage impacting Packet’s upstream network provider, Zayo. The network engineering team then worked to migrate traffic to an alternative network, removing Zayo from the network mix and applying appropriate route filters to return service to normal. Subsequently, Packet network engineers optimized specific routing policies to improve latency across the backbone, given the new traffic mix. Customers were briefly impacted by upstream connectivity issues in specific Packet Cloud regions, and were then subject to subpar routing (high latency) and some specific unreachable paths for the duration of the outage. No server interface connectivity was impacted.
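The traffic migration described above can be sketched in simplified form. This is a hedged illustration only: the provider names, local-preference values, and selection logic below are hypothetical stand-ins for BGP best-path selection, not Packet's actual routing configuration.

```python
# Hypothetical sketch: filtering a failing provider's routes shifts
# best-path selection to an alternative path. All names/values are
# illustrative, not Packet's actual configuration.

def best_path(routes):
    """Pick the route with the highest local preference -- a much
    simplified stand-in for the BGP best-path decision process."""
    return max(routes, key=lambda r: r["local_pref"])

routes = [
    {"provider": "zayo",  "local_pref": 200},  # primary backbone path
    {"provider": "alt-a", "local_pref": 150},  # alternative provider
]

# Normal operation: the primary provider carries the traffic.
assert best_path(routes)["provider"] == "zayo"

# During the incident: filter out the failing provider's routes,
# so traffic migrates to the alternative path.
filtered = [r for r in routes if r["provider"] != "zayo"]
assert best_path(filtered)["provider"] == "alt-a"
```

In practice this kind of change is made with route filters or local-preference policy on the edge routers rather than application code; the sketch only shows why withdrawing one provider's routes is sufficient to move traffic.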
As touched on above, this outage was caused by an upstream provider outage. Per our provider, Zayo, a transcontinental US backbone link had experienced a hardware failure and was blackholing IP traffic routed over it. Even after the problematic link was removed from service, traffic continued to blackhole due to lingering control plane issues and stale routes on Zayo’s network, greatly extending the time to resolution.
Of note, not all customer traffic was impacted during the incident; however, deliverability from Packet’s US datacenters to broadband ISPs suffered an outsized impact. In addition, the outage only affected traffic routed between specific (seemingly random) source and destination IP address pairs, while other customer traffic remained unaffected. Even after Packet had completely removed Zayo from its provider mix, some large broadband and cellular networks (also Zayo customers) experienced degraded connectivity to both Packet and other destinations, given their dependence on Zayo products and services.
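One plausible mechanism for the "specific source/destination pairs" pattern, offered here as an assumption rather than Zayo's confirmed behavior, is equal-cost multipath (ECMP) flow hashing: routers typically hash the flow tuple to pick one of several parallel paths, so a single failed link blackholes only the flows that hash onto it. A minimal sketch, with hypothetical link names and example addresses:

```python
# Illustrative ECMP sketch (assumption): the flow tuple is hashed to
# choose one of several equal-cost links, so only flows that hash onto
# the failed link are blackholed. Link names and IPs are hypothetical.
import hashlib

PATHS = ["link-a", "link-b", "link-c"]  # hypothetical parallel backbone links
FAILED = "link-b"                        # the link with the hardware failure

def pick_path(src_ip, dst_ip):
    """Deterministically map a (src, dst) pair onto one path."""
    digest = hashlib.sha256(f"{src_ip}->{dst_ip}".encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

flows = [("10.0.0.1", "198.51.100.7"),
         ("10.0.0.2", "198.51.100.7"),
         ("10.0.0.3", "203.0.113.9")]
for src, dst in flows:
    path = pick_path(src, dst)
    status = "blackholed" if path == FAILED else "ok"
    print(f"{src} -> {dst}: {path} ({status})")
```

Because the hash is deterministic per flow, the same address pair fails consistently while neighboring pairs work, which matches the "random but repeatable" impact described above.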
As with any incident of this magnitude, Packet will be building upon this experience to strengthen its operations, implementing various corrective actions that include:
- Expanded internal connectivity monitoring and alerting of on-call operations teams
- Expanded network automation and routing protocol improvements, including the ability to more effectively route and load balance across all available backbone paths, and to let an operator trigger automated traffic re-routes across all of Packet’s datacenter locations in the event of a major provider incident
- Increased resilience in Packet’s provisioning systems and services, reducing dependencies on third-party network providers and inter-datacenter connectivity
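To make the first corrective action concrete, here is a minimal sketch of the kind of reachability check that could feed on-call alerting. The probe targets, timeout, and alerting hook are hypothetical examples, not Packet's actual monitoring stack.

```python
# Minimal reachability-monitoring sketch. Targets, timeout, and the
# alert callback are hypothetical; a real system would probe many
# destinations per provider path and page on-call via an alerting service.
import socket

TARGETS = [("198.51.100.1", 443), ("203.0.113.1", 443)]  # example probe targets

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_alert(targets, alert):
    """Probe each target; invoke the alert hook if any are unreachable."""
    failures = [(h, p) for h, p in targets if not reachable(h, p)]
    if failures:
        alert(f"Reachability check failed for: {failures}")
    return failures
```

Running checks like this from multiple datacenters, and comparing results across provider paths, is what lets an operator distinguish a single bad upstream link from a broader outage.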
All times are in UTC.
08:49 - We received an alert from our monitoring tools showing reachability issues in several of Packet’s global datacenter locations.
08:52 - Our engineering team confirmed the reachability issues. An internal incident was created and Packet’s operational staffing was called into action, along with our platform and network engineering escalation contacts.
09:18 - Engineers started investigating the issue and related provisioning impact in our NRT1 availability zone.
10:11 - Engineers noticed that SJC1 device provisioning was also affected.
10:13 - Routing issue was identified.
12:10 - Network Engineers confirmed that the issue was due to an outage at our upstream provider (Zayo). Traffic was re-routed to other paths and connectivity began to recover.
13:32 - A fix was implemented and Network Engineers monitored the results.
15:22 - Incident resolved. All services returned to normal.