<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: devops</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/devops.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-12-13T05:29:10+00:00</updated><author><name>Simon Willison</name></author><entry><title>OpenAI's postmortem for API, ChatGPT &amp; Sora Facing Issues</title><link href="https://simonwillison.net/2024/Dec/13/openai-postmortem/#atom-tag" rel="alternate"/><published>2024-12-13T05:29:10+00:00</published><updated>2024-12-13T05:29:10+00:00</updated><id>https://simonwillison.net/2024/Dec/13/openai-postmortem/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://status.openai.com/incidents/ctrsv3lwd797"&gt;OpenAI&amp;#x27;s postmortem for API, ChatGPT &amp;amp; Sora Facing Issues&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI had an outage across basically everything for four hours on Wednesday. They've now published a detailed postmortem which includes some fascinating technical details about their "hundreds of Kubernetes clusters globally".&lt;/p&gt;
&lt;p&gt;The culprit was a newly deployed telemetry system:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. [...]&lt;/p&gt;
&lt;p&gt;The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane. [...]&lt;/p&gt;
&lt;p&gt;DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's always DNS.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/therealadamg/status/1867393379287650778"&gt;@therealadamg&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/dns"&gt;dns&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/devops"&gt;devops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kubernetes"&gt;kubernetes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postmortem"&gt;postmortem&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="dns"/><category term="devops"/><category term="kubernetes"/><category term="postmortem"/><category term="openai"/><category term="chatgpt"/></entry><entry><title>What are good and easy practices for frequent web deployments?</title><link href="https://simonwillison.net/2013/Jan/8/what-are-good-and/#atom-tag" rel="alternate"/><published>2013-01-08T10:32:00+00:00</published><updated>2013-01-08T10:32:00+00:00</updated><id>https://simonwillison.net/2013/Jan/8/what-are-good-and/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-good-and-easy-practices-for-frequent-web-deployments/answer/Simon-Willison"&gt;What are good and easy practices for frequent web deployments?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At Lanyrd we use a combination of Fabric to drive our deploy scripts, Git to get the code onto the servers, Puppet for configuration management and Jenkins to run continuous integration tests and provide a "deploy the site" button.&lt;/p&gt;

&lt;p&gt;Here are a few important techniques I've learned:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Use symlink switching to keep the previous version of the code around, so you can switch back if there are problems (that said, we've never actually used this capability, but it's nice for atomic deploys as well).&lt;/li&gt;&lt;li&gt;Have your build script rename your static asset files (CSS/JS/etc) to include part of the MD5 hash of the file contents in their filename. This means you can upload them to your static host provider (we use S3) before you run a deploy, guaranteeing that freshly deployed templates will point to the right files. It also keeps the older versions around in case you need to roll back.&lt;/li&gt;&lt;li&gt;Having one button that deploys the site is invaluable.&lt;/li&gt;&lt;li&gt;Deploys need to be almost "free" in terms of impact on site performance. If it doesn't cost anything to deploy the site, people will be free to deploy often and push out small fixes, which is good for the health of your codebase.&lt;/li&gt;&lt;li&gt;Get new engineers to deploy on the first day! Doing so forces you/them to get a full development and deployment environment up and running for them on day one, which means that they can start doing real work on day two.&lt;/li&gt;&lt;/ul&gt;
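
&lt;p&gt;The first two techniques in the list above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the actual Lanyrd deploy scripts: the function names, the eight-character digest length and the "current" symlink layout are all hypothetical choices made for the example, and the symlink swap assumes a POSIX filesystem.&lt;/p&gt;

```python
import hashlib
import os
import shutil
from pathlib import Path


def hashed_name(path, digest_chars=8):
    """Return the filename with part of the file's MD5 hex digest before the suffix.

    digest_chars is an illustrative default, not a value taken from the post.
    """
    path = Path(path)
    digest = hashlib.md5(path.read_bytes()).hexdigest()[:digest_chars]
    return f"{path.stem}.{digest}{path.suffix}"


def build_hashed_copy(path, out_dir):
    """Copy a static asset into out_dir under its content-hashed name.

    The original file is left in place, so older hashed copies stay
    available if a rollback is needed.
    """
    path = Path(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / hashed_name(path)
    shutil.copyfile(path, target)
    return target


def switch_symlink(release_dir, current_link):
    """Point current_link at release_dir by renaming a temporary symlink.

    os.replace performs a rename, which on POSIX systems atomically
    replaces the old symlink, so readers never see a missing link.
    """
    tmp_link = f"{current_link}.tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)
```

&lt;p&gt;Rolling back under this sketch would mean calling &lt;code&gt;switch_symlink&lt;/code&gt; again with the previous release directory, which is why the older code and older hashed assets are kept around.&lt;/p&gt;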
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lanyrd"&gt;lanyrd&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/devops"&gt;devops&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-integration"/><category term="startups"/><category term="quora"/><category term="lanyrd"/><category term="devops"/></entry><entry><title>Tracking Every Release</title><link href="https://simonwillison.net/2010/Dec/10/etsy/#atom-tag" rel="alternate"/><published>2010-12-10T10:04:00+00:00</published><updated>2010-12-10T10:04:00+00:00</updated><id>https://simonwillison.net/2010/Dec/10/etsy/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://codeascraft.etsy.com/2010/12/08/track-every-release/"&gt;Tracking Every Release&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;How Etsy use Graphite to monitor their continuous deployment releases.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/monitoring"&gt;monitoring&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/devops"&gt;devops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/etsy"&gt;etsy&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-deployment"/><category term="monitoring"/><category term="devops"/><category term="recovered"/><category term="etsy"/></entry></feed>