OpenAI's postmortem for API, ChatGPT & Sora Facing Issues (via) OpenAI had an outage across basically everything for four hours on Wednesday. They've now published a detailed postmortem which includes some fascinating technical details about their "hundreds of Kubernetes clusters globally".
The culprit was a newly deployed telemetry system:
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. [...]
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane. [...]
DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution.
It's always DNS.
Recent articles
- LLM 0.27, the annotated release notes: GPT-5 and improved tool calling - 11th August 2025
- Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!" - 10th August 2025
- My Lethal Trifecta talk at the Bay Area AI Security Meetup - 9th August 2025