7 items tagged “kubernetes”
2024
OpenAI’s postmortem for API, ChatGPT & Sora Facing Issues (via) OpenAI had an outage across basically everything for four hours on Wednesday. They've now published a detailed postmortem which includes some fascinating technical details about their "hundreds of Kubernetes clusters globally".
The culprit was a newly deployed telemetry system:
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. [...]
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane. [...]
DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution.
It's always DNS.
We used this model [periodically transmitting configuration to different hosts] to distribute translations, feature flags, configuration, search indexes, etc at Airbnb. But instead of SQLite we used Sparkey, a KV file format developed by Spotify. In early years there was a Cron job on every box that pulled that service’s thingies; then once we switched to Kubernetes we used a daemonset & host tagging (taints?) to pull a variety of thingies to each host and then ensure the services that use the thingies only ran on the hosts that had the thingies.
2023
Carving the Scheduler Out of Our Orchestrator (via) Thomas Ptacek describes Fly’s new custom-built alternative to Nomad and Kubernetes in detail, including why they eventually needed to build something custom to best serve their platform. In doing so he provides the best explanation I’ve ever seen of what an orchestration system actually does.
2022
Two reasons Kubernetes is so complex (via) I like how this article proposes that Kubernetes isn’t trying to be a tool for deploying containers—it’s more like an operating system for a cluster of machines, responsible for the same kind of goals as a regular operating system such as resource sharing and portability. And since everything is built as control loops which attempt to modify actual state to fit the declarative desired state, errors can occur asynchronously seconds or even minutes after the desired state has been updated.
2021
Weeknotes: Learning Kubernetes, learning Web Components
I’ve been mainly climbing the learning curve for Kubernetes and Web Components this week. I also released Datasette 0.59.1 with Python 3.10 compatibility and an updated Docker image.
[... 1,101 words]Smooth sailing with Kubernetes (via) Scott McCloud (of Understanding Comics) authored this comic introduction to Kubernetes, and it’s a really good explanation of the core concepts. I’d love to have something like this for Datasette—I still feel like I’m a long way from being able to explain the project with anything like this amount of clarity.
2018
Experiences with running PostgreSQL on Kubernetes (via) Fascinating interview that makes a solid argument for the idea that running stateful data stores like PostgreSQL or Cassandra is made harder, not easier when you add an orchestration tool like Kubernetes into the mix.