Simon Willison’s Weblog

Subscribe
Atom feed for devops

3 items tagged “devops”

2024

OpenAI’s postmortem for API, ChatGPT & Sora Facing Issues (via) OpenAI had an outage across basically everything for four hours on Wednesday. They've now published a detailed postmortem which includes some fascinating technical details about their "hundreds of Kubernetes clusters globally".

The culprit was a newly deployed telemetry system:

Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. [...]

The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane. [...]

DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution.

It's always DNS.

# 13th December 2024, 5:29 am / devops, dns, kubernetes, openai, chatgpt, postmortem

2013

What are good and easy practices for frequent web deployments?

At Lanyrd we use a combination of Fabric to drive our deploy scripts, git to get the code on to the servers, puppet for configuration management and Jenkins to run continuous integration tests and provide a “deploy the site” button.

[... 262 words]

2010

Tracking Every Release. How Etsy use Graphite to monitor their continuous deployment releases.

# 10th December 2010, 10:04 am / continuous-deployment, devops, etsy, monitoring, recovered