At 16:24 UTC, a number of TLS certificates for our core Consul, Vault, and Nomad infrastructure failed to regenerate. This caused a communication failure in those services, which cascaded down to the running services and the load balancers directing traffic. The problem was quickly identified, but because the other layers depend on service discovery, recovery took several hours as we brought up each layer in turn and redeployed services.
TLS certificates for our core service discovery layer expired because they failed to auto-regenerate. The expiration impacted our orchestration layer and our load balancers.
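One way to catch this class of failure before it becomes an outage is to alert on the remaining lifetime of each certificate, independently of the regeneration job. The following is a minimal sketch in Python, not a description of our actual tooling: the function names and the 7-day renewal window are hypothetical, and it assumes the expiry time is available as a `notAfter` string in the format returned by `SSLSocket.getpeercert()`.

```python
import ssl
import time

# Hypothetical alert threshold: warn when fewer than 7 days remain.
RENEWAL_WINDOW_SECONDS = 7 * 24 * 3600


def seconds_until_expiry(not_after: str, now: float) -> float:
    """Return the seconds left before a certificate's notAfter time.

    not_after uses the format found in the dict returned by
    SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired.
    """
    return ssl.cert_time_to_seconds(not_after) - now


def needs_renewal(not_after: str, now: float,
                  window: float = RENEWAL_WINDOW_SECONDS) -> bool:
    """Return True when the certificate expires within the window."""
    return seconds_until_expiry(not_after, now) <= window


if __name__ == "__main__":
    # Illustrative value only; a real check would read notAfter from
    # each certificate in the service discovery layer.
    example_not_after = "Jan  5 09:34:43 2018 GMT"
    if needs_renewal(example_not_after, time.time()):
        print("certificate is expired or inside the renewal window")
```

A check like this is deliberately separate from the regeneration path, so a silent failure to regenerate still produces an alert while there is time to act.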
All access to the API and Portal was down until 20:20 UTC, when the API came back online. Portal remained impacted until 21:25 UTC. The Packet Connect portal interface was fully restored at 23:03 UTC.
All times are in UTC.
16:24 TLS certs expire
16:30 Response begins
17:01 Problem identified and resolution begins
19:10 Nomad infrastructure is back online, jobs can now be resubmitted
20:20 API infrastructure is back online and serving traffic
21:25 Portal is back online
21:36 Staff Portal is back online
23:03 Packet Connect fully restored
23:24 Incident fully resolved