11 items tagged “ops”
Over the years, across multiple deployments, DynamoDB has learned that it’s not just the end state and the start state that matter; there could be times when the newly deployed software doesn’t work and needs a rollback. The rolled-back state might be different from the initial state of the software. The rollback procedure is often missed in testing and can lead to customer impact. DynamoDB runs a suite of upgrade and downgrade tests at a component level before every deployment. Then, the software is rolled back on purpose and tested by running functional tests. DynamoDB has found this process valuable for catching issues that otherwise would make it hard to roll back if needed.
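The upgrade-then-deliberate-rollback process in that quote can be sketched roughly as follows. This is a minimal illustration, not DynamoDB's actual tooling: `upgrade_downgrade_check`, `deploy` and `run_functional_tests` are hypothetical names standing in for whatever deployment and test hooks a given system provides.

```python
from typing import Callable, List


def upgrade_downgrade_check(
    deploy: Callable[[str], None],            # hypothetical hook: installs a version
    run_functional_tests: Callable[[], bool],  # hypothetical hook: True if tests pass
    old_version: str,
    new_version: str,
) -> List[str]:
    """Deploy new_version, then roll back to old_version on purpose.

    Functional tests run after each step. Returns the failed steps;
    an empty list means both the upgrade and the rollback passed.
    """
    failures = []
    for step, version in (("upgrade", new_version), ("rollback", old_version)):
        deploy(version)
        if not run_functional_tests():
            failures.append(f"{step} to {version}")
    return failures
```

The point of running the tests a second time is that the rolled-back state is not necessarily the state the system started in, for example if the upgrade changed stored data that the old version then has to read.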
Roblox Return to Service 10/28-10/31 2021 (via) A particularly good example of a public postmortem on an outage. Roblox was down for 72 hours last year, as a result of an extremely complex set of circumstances which took a lot of effort to uncover. It’s interesting to think through what kind of monitoring you would need to have in place to help identify the root cause of this kind of issue. # 21st January 2022, 4:41 pm
Operations engineering does not consist of firefighting your shitty software, it is the science of delivering value to users.
Zero-downtime Redis upgrade discussion. GitHub have a short window of scheduled downtime in order to upgrade their Redis server. I asked in their comments if they’d considered trying to run the upgrade with no downtime at all using Redis replication, and Ryan Tomayko has posted some interesting replies. # 28th May 2010, 2:50 pm
Round-robin Django setup with nginx. An nginx trick I didn’t know: a low proxy_connect_timeout value (e.g. 2 seconds) combined with the proxy_next_upstream setting means that if one of your backends breaks, a user won’t even see an error; they’ll just have a short delay before getting a response from a working server. # 21st December 2009, 3:43 pm
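The effect of that combination can be approximated in a few lines of Python. This is only a sketch of the behaviour, not nginx's implementation: `connect_with_failover` and its parameters are hypothetical names, and the backend addresses in the usage note are made up.

```python
import socket
from typing import Callable, Iterable, Tuple

Backend = Tuple[str, int]  # (host, port)


def connect_with_failover(
    backends: Iterable[Backend],
    connect_timeout: float = 2.0,
    connect: Callable[..., socket.socket] = socket.create_connection,
) -> Tuple[Backend, socket.socket]:
    """Return the first backend that accepts a connection, plus its socket."""
    last_error = None
    for backend in backends:
        try:
            # The short timeout bounds how long a dead backend can delay us.
            return backend, connect(backend, timeout=connect_timeout)
        except OSError as exc:  # refused connections and timeouts both land here
            last_error = exc    # remember the failure, move on to the next backend
    raise ConnectionError("no backend accepted a connection") from last_error
```

Called as `connect_with_failover([("10.0.0.1", 8000), ("10.0.0.2", 8000)])`, a broken first backend costs the caller at most `connect_timeout` seconds before the second one is tried, which is the "short delay" the entry describes.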
Announcing Kong: A server description and deployment testing tool. An ultra simple website monitoring tool written in Django which makes it easy to manage a list of Twill scripts for testing different sites. It was developed at the Lawrence Journal-World. Eric showed me a demo of this a year or so ago, and I’ve been hoping they would open source it. # 18th November 2009, 12:47 pm
Using Graphics Card Memory as Swap (via) Interesting idea: “Graphic cards contain a lot of very fast RAM, typically between 64 and 512 MB. With Linux, it’s possible to use it as swap space, or even as RAM disk.” # 3rd November 2009, 11:01 am
I loathe [hardware load balancers]. They’re expensive, restrictive, slow, and generally cause you a lot more pain and suffering than they’re worth. At my last job, one of my projects was to convert most of one of our existing clusters from a load-balancing appliance to use keepalived. Why would we do this? Because the $100k worth of appliance wasn’t capable of doing the job that $15k worth of commodity hardware and an installation of keepalived were handling with ease.