Customer Portal
Incident Report for Equinix Metal
Postmortem

Sequence of Events

At 16:24 UTC, a number of TLS certificates for our core Consul, Vault, and Nomad infrastructure failed to regenerate. This caused a communication failure in those services, which cascaded to the running services and the load balancers directing traffic. The problem was identified quickly, but because of the nature of service discovery, recovery took several hours as we brought up each layer and deployed services.

Root Cause

The TLS certificates for our core service discovery layer expired because they failed to auto-regenerate. This impacted our orchestration layer and our load balancers.
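As an illustration of how this class of failure can be caught earlier, here is a minimal sketch of an expiry check that flags a certificate for renewal before it lapses. It is not the tooling used in our infrastructure. The function names, the 14-day threshold, and the sample timestamp are assumptions made for the example. The timestamp format is the one Python's ssl module reports for a certificate's notAfter field.

```python
import ssl
from datetime import datetime, timedelta, timezone


def time_until_expiry(not_after: str, now: datetime) -> timedelta:
    """Return the time remaining before a certificate's notAfter timestamp.

    `not_after` uses the format reported by Python's ssl module,
    for example "Jun 26 16:24:00 2019 GMT".
    `now` must be a timezone-aware datetime.
    """
    # cert_time_to_seconds converts the notAfter string to seconds since the epoch (UTC).
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return expires - now


def needs_renewal(not_after: str, now: datetime,
                  threshold: timedelta = timedelta(days=14)) -> bool:
    """Return True when the certificate expires within `threshold` (hypothetical default)."""
    return time_until_expiry(not_after, now) <= threshold


if __name__ == "__main__":
    # Illustrative timestamp only; the seconds value is an assumption.
    sample_not_after = "Jun 26 16:24:00 2019 GMT"
    current = datetime.now(timezone.utc)
    print(needs_renewal(sample_not_after, current))
```

A check like this, run on a schedule against each certificate, would raise an alert when auto-regeneration silently stops working, rather than at the moment of expiry.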

Customer Impact

All access to the API and Portal was down until 20:20 UTC, when the API came back online. The Portal remained impacted until 21:25 UTC. The Packet Connect portal interface was fully restored at 23:03 UTC.

Timeline

All times are in UTC.

16:24 TLS certificates expire

16:30 Response begins

17:01 Problem identified and resolution begins

19:10 Nomad infrastructure is back online, jobs can now be resubmitted

20:20 API Infrastructure is back online and being served

21:25 Portal is back online

21:36 Staff Portal is back online

23:03 Packet Connect fully restored

23:24 Incident fully resolved
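The downtime per service can be derived from the timeline above. The following is a small sketch of that arithmetic, assuming all timestamps fall on the same UTC day and measuring from the 16:24 certificate expiry. The variable and function names are chosen for the example.

```python
from datetime import datetime

# Start of impact and recovery milestones, taken from the timeline (UTC, same day).
START = "16:24"
MILESTONES = {
    "API back online": "20:20",
    "Portal back online": "21:25",
    "Packet Connect fully restored": "23:03",
    "Incident fully resolved": "23:24",
}


def elapsed(start: str, end: str) -> str:
    """Return the time between two HH:MM timestamps on the same day as 'XhYYm'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


def downtime_table() -> dict:
    """Map each milestone to the time elapsed since the start of impact."""
    return {name: elapsed(START, time) for name, time in MILESTONES.items()}


if __name__ == "__main__":
    for name, duration in downtime_table().items():
        print(f"{name}: {duration}")
```

Under these assumptions, the API was unavailable for 3h56m, the Portal for 5h01m, Packet Connect for 6h39m, and the incident lasted 7h00m in total.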

Posted Jun 28, 2019 - 00:23 UTC

Resolved
This incident has been resolved.
Posted Jun 26, 2019 - 23:24 UTC
Monitoring
We are thrilled to report that our customer portal & API are back up and running. We will continue to monitor for any residual errors or issues. Should you encounter a hiccup along the way, do reach out via support@packet.com.
Posted Jun 26, 2019 - 21:28 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jun 26, 2019 - 19:50 UTC
Update
We have found and fixed the issue that we had previously identified. In the course of bringing that system back online, a second issue was identified, and we are working on that second problem now. We appreciate your continued patience.
Posted Jun 26, 2019 - 19:11 UTC
Identified
We have identified the issue and are currently implementing a fix.
Posted Jun 26, 2019 - 17:27 UTC
Investigating
We are currently investigating an influx of error messages with our customer portal. Please reach out to support@packet.com should you have questions, or otherwise need assistance.
Posted Jun 26, 2019 - 16:46 UTC
This incident affected: Equinix Metal API and Equinix Metal Portal.