13th December 2024 - Link Blog
OpenAI's postmortem for API, ChatGPT & Sora Facing Issues (via) OpenAI had an outage across basically everything for four hours on Wednesday. They've now published a detailed postmortem which includes some fascinating technical details about their "hundreds of Kubernetes clusters globally".
The culprit was a newly deployed telemetry system:
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. [...]
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane. [...]
DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution.
It's always DNS.
Recent articles
- Changes in the system prompt between Claude Opus 4.6 and 4.7 - 18th April 2026
- Join us at PyCon US 2026 in Long Beach - we have new AI and security tracks this year - 17th April 2026
- Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 - 16th April 2026