Start Date: Sunday, June 2nd 2019, 8:49 UTC
End Date: Sunday, June 2nd 2019, 15:22 UTC
Problem Location: Global Cloud Locations
Problem Description: Intermittent loss of connectivity and degraded routing for Packet Public Cloud locations.
On Sunday, 06/02/2019 at approximately 08:49 UTC, Packet’s internal network monitoring identified significant reachability and performance issues. Initial investigation pointed to routing loops and blackholed traffic to specific internal and Internet destinations. The root cause was found to be a major outage impacting Packet’s upstream network provider, Zayo. The network engineering team then worked to migrate traffic to an alternative network, removing Zayo from the network mix and applying appropriate route filters to return service to normal. Subsequently, Packet network engineers optimized specific routing policies to improve latency across the backbone, given the new traffic mix. Customers were briefly impacted by upstream connectivity issues in specific Packet Cloud regions, and were then subject to subpar routing (high latency) and some specific unreachable paths for the duration of the outage. No server interface connectivity was impacted.
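The traffic migration described above can be sketched in simplified form. This is a hedged illustration only: the provider names, local-preference values, and selection logic below are hypothetical stand-ins for BGP best-path selection, not Packet's actual routing configuration.

```python
# Hypothetical sketch: filtering a failing provider's routes shifts
# best-path selection to an alternative path. All names/values are
# illustrative, not Packet's actual configuration.

def best_path(routes):
    """Pick the route with the highest local preference -- a much
    simplified stand-in for the BGP best-path decision process."""
    return max(routes, key=lambda r: r["local_pref"])

routes = [
    {"provider": "zayo",  "local_pref": 200},  # primary backbone path
    {"provider": "alt-a", "local_pref": 150},  # alternative provider
]

# Normal operation: the primary provider carries the traffic.
assert best_path(routes)["provider"] == "zayo"

# During the incident: filter out the failing provider's routes,
# so traffic migrates to the alternative path.
filtered = [r for r in routes if r["provider"] != "zayo"]
assert best_path(filtered)["provider"] == "alt-a"
```

In practice this kind of change is made with route filters or local-preference policy on the edge routers rather than application code; the sketch only shows why withdrawing one provider's routes is sufficient to move traffic.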
As touched on above, this outage was caused by an upstream provider outage. Per our provider, Zayo, a transcontinental US backbone link had experienced a hardware failure and was blackholing IP traffic routed over it. Even after the problematic link was removed from service, traffic continued to blackhole due to lingering control plane issues and stale routes on Zayo’s network, greatly extending the time to resolution.
Of note, not all customer traffic was impacted during the incident; however, deliverability from Packet’s US datacenters to broadband ISPs suffered an outsized impact. In addition, the outage only affected traffic routed between specific (seemingly random) source and destination IP address pairs, while other customer traffic remained unaffected. Even after Packet had completely removed Zayo from its provider mix, some large broadband and cellular networks (also Zayo customers) experienced degraded connectivity to both Packet and other destinations, given their dependence on Zayo products and services.
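One plausible mechanism for the "specific source/destination pairs" pattern, offered here as an assumption rather than Zayo's confirmed behavior, is equal-cost multipath (ECMP) flow hashing: routers typically hash the flow tuple to pick one of several parallel paths, so a single failed link blackholes only the flows that hash onto it. A minimal sketch, with hypothetical link names and example addresses:

```python
# Illustrative ECMP sketch (assumption): the flow tuple is hashed to
# choose one of several equal-cost links, so only flows that hash onto
# the failed link are blackholed. Link names and IPs are hypothetical.
import hashlib

PATHS = ["link-a", "link-b", "link-c"]  # hypothetical parallel backbone links
FAILED = "link-b"                        # the link with the hardware failure

def pick_path(src_ip, dst_ip):
    """Deterministically map a (src, dst) pair onto one path."""
    digest = hashlib.sha256(f"{src_ip}->{dst_ip}".encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

flows = [("10.0.0.1", "198.51.100.7"),
         ("10.0.0.2", "198.51.100.7"),
         ("10.0.0.3", "203.0.113.9")]
for src, dst in flows:
    path = pick_path(src, dst)
    status = "blackholed" if path == FAILED else "ok"
    print(f"{src} -> {dst}: {path} ({status})")
```

Because the hash is deterministic per flow, the same address pair fails consistently while neighboring pairs work, which matches the "random but repeatable" impact described above.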
As with any incident of this magnitude, Packet will be building upon this experience to strengthen its operations, implementing various corrective actions that include:
- Expanded internal connectivity monitoring and alerting of on-call operations teams
- Expanded network automation and routing protocol improvements, including the ability to more effectively route and load balance across all available backbone paths, and to let an operator trigger automated traffic re-routes across all of Packet’s datacenter locations in the event of a major provider incident
- Increased resilience in Packet’s provisioning systems and services, reducing dependencies on third-party network providers and inter-datacenter connectivity
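To make the first corrective action concrete, here is a minimal sketch of the kind of reachability check that could feed on-call alerting. The probe targets, timeout, and alerting hook are hypothetical examples, not Packet's actual monitoring stack.

```python
# Minimal reachability-monitoring sketch. Targets, timeout, and the
# alert callback are hypothetical; a real system would probe many
# destinations per provider path and page on-call via an alerting service.
import socket

TARGETS = [("198.51.100.1", 443), ("203.0.113.1", 443)]  # example probe targets

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_alert(targets, alert):
    """Probe each target; invoke the alert hook if any are unreachable."""
    failures = [(h, p) for h, p in targets if not reachable(h, p)]
    if failures:
        alert(f"Reachability check failed for: {failures}")
    return failures
```

Running checks like this from multiple datacenters, and comparing results across provider paths, is what lets an operator distinguish a single bad upstream link from a broader outage.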
All times are in UTC.
08:49 - We received an alert from our monitoring tools showing reachability issues in several of Packet’s global datacenter locations.
08:52 - Our engineering team confirmed the reachability issues. An internal incident was created and Packet’s operational staffing was called into action, along with our platform and network engineering escalation contacts.
09:18 - Engineers started investigating the issue and related provisioning impact in our NRT1 availability zone.
10:11 - Engineers noticed that SJC1 device provisioning was also affected.
10:13 - Routing issue was identified.
12:10 - Network Engineers confirmed that the issue was due to an outage at our upstream provider (Zayo). Traffic was re-routed to other paths and connectivity began to recover.
13:32 - A fix was implemented and Network Engineers monitored the results.
15:22 - Incident resolved. All services returned to normal.