Simon Willison’s Weblog

6 items tagged “observability”


Roblox Return to Service 10/28-10/31 2021 (via) A particularly good example of a public postmortem on an outage. Roblox was down for 72 hours last year, as a result of an extremely complex set of circumstances which took a lot of effort to uncover. It’s interesting to think through what kind of monitoring you would need to have in place to help identify the root cause of this kind of issue. # 21st January 2022, 4:41 pm


When I was a performance consultant I’d show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn’t touch: Which were observability tools and which were experimental tools. “I’ll start with observability tools only” is something I’d say at the start of every engagement.

Brendan Gregg # 8th June 2021, 7:33 pm


Instead of seeing instrumentation as a last-ditch effort of strings and metrics, we must think about propagating the full context of a request and emitting it at regular pulses. No pull request should ever be accepted unless the engineer can answer the question, “How will I know if this breaks?”

Charity Majors # 19th July 2020, 4:05 pm


Logs vs. metrics: a false dichotomy (via) Nick Stenning discusses the differences between logs and metrics: most notably that metrics can be derived from logs but logs cannot be reconstituted starting with time-series metrics. # 3rd August 2019, 4:46 pm

Targeted diagnostic logging in production (via) Will Sargent defines diagnostic logging as “debug logging statements with an audience”, and proposes controlling this style if logging via a feature flat system to allow detailed logging to be turned on in production against a selected subset if users in order to help debug difficult problems. Lots of great background material in the topic of observability here too. # 24th July 2019, 5:44 am

Metrics are lossily compressed logs. Traces are logs with parent child relationships between entries. The only reason we have three terms is because getting value from them has required different compromises to make them cost effective.

Clint Sharp # 25th February 2019, 10:15 pm