<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: observability</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/observability.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-07-26T13:06:38+00:00</updated><author><name>Simon Willison</name></author><entry><title>Did you know about Instruments?</title><link href="https://simonwillison.net/2024/Jul/26/did-you-know-about-instruments/#atom-tag" rel="alternate"/><published>2024-07-26T13:06:38+00:00</published><updated>2024-07-26T13:06:38+00:00</updated><id>https://simonwillison.net/2024/Jul/26/did-you-know-about-instruments/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://registerspill.thorstenball.com/p/did-you-know-about-instruments"&gt;Did you know about Instruments?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Thorsten Ball shows how the macOS Instruments app (installed as part of Xcode) can be used to run a CPU profiler against &lt;em&gt;any&lt;/em&gt; application - not just code written in Swift/Objective-C.&lt;/p&gt;
&lt;p&gt;I tried this against a Python process running &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; executing a Llama 3.1 prompt with my new &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt; plugin and captured this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a deep nested stack trace showing _PyFunction_Vectorcall from python3.10 calling PyCFuncPtr_call _ctypes.cpython-310-darwin.so which then calls ggml_ methods in libggml.dylib" src="https://static.simonwillison.net/static/2024/instruments-ggml.jpg" /&gt;&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://lobste.rs/s/kr9od0/did_you_know_about_instruments"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/profiling"&gt;profiling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;&lt;/p&gt;



</summary><category term="profiling"/><category term="python"/><category term="observability"/></entry><entry><title>All you need is Wide Events, not “Metrics, Logs and Traces”</title><link href="https://simonwillison.net/2024/Feb/27/all-you-need-is-wide-events-not-metrics-logs-and-traces/#atom-tag" rel="alternate"/><published>2024-02-27T22:57:14+00:00</published><updated>2024-02-27T22:57:14+00:00</updated><id>https://simonwillison.net/2024/Feb/27/all-you-need-is-wide-events-not-metrics-logs-and-traces/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics"&gt;All you need is Wide Events, not “Metrics, Logs and Traces”&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve heard great things about Meta’s internal observability platform Scuba. Here’s an explanation from ex-Meta engineer Ivan Burmistrov describing the value it provides and comparing it to the widely used OpenTelemetry stack.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=39529775"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;&lt;/p&gt;



</summary><category term="facebook"/><category term="observability"/></entry><entry><title>Roblox Return to Service 10/28-10/31 2021</title><link href="https://simonwillison.net/2022/Jan/21/roblox/#atom-tag" rel="alternate"/><published>2022-01-21T16:41:00+00:00</published><updated>2022-01-21T16:41:00+00:00</updated><id>https://simonwillison.net/2022/Jan/21/roblox/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/"&gt;Roblox Return to Service 10/28-10/31 2021&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A particularly good example of a public postmortem on an outage. Roblox was down for 72 hours last year, as a result of an extremely complex set of circumstances which took a lot of effort to uncover. It’s interesting to think through what kind of monitoring you would need to have in place to help identify the root cause of this kind of issue.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/benbjohnson/status/1484288578918047745"&gt;@benbjohnson&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postmortem"&gt;postmortem&lt;/a&gt;&lt;/p&gt;



</summary><category term="ops"/><category term="observability"/><category term="postmortem"/></entry><entry><title>Quoting Brendan Gregg</title><link href="https://simonwillison.net/2021/Jun/8/observability/#atom-tag" rel="alternate"/><published>2021-06-08T19:33:16+00:00</published><updated>2021-06-08T19:33:16+00:00</updated><id>https://simonwillison.net/2021/Jun/8/observability/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://www.brendangregg.com/blog/2021-05-23/what-is-observability.html"&gt;&lt;p&gt;When I was a performance consultant I'd show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn't touch: Which were observability tools and which were experimental tools. "I'll start with observability tools only" is something I'd say at the start of every engagement.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://www.brendangregg.com/blog/2021-05-23/what-is-observability.html"&gt;Brendan Gregg&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/brendan-gregg"&gt;brendan-gregg&lt;/a&gt;&lt;/p&gt;



</summary><category term="performance"/><category term="observability"/><category term="brendan-gregg"/></entry><entry><title>Quoting Charity Majors</title><link href="https://simonwillison.net/2020/Jul/19/charity-majors/#atom-tag" rel="alternate"/><published>2020-07-19T16:05:08+00:00</published><updated>2020-07-19T16:05:08+00:00</updated><id>https://simonwillison.net/2020/Jul/19/charity-majors/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://increment.com/testing/i-test-in-production/"&gt;&lt;p&gt;Instead of seeing instrumentation as a last-ditch effort of strings and metrics, we must think about propagating the full context of a request and emitting it at regular pulses. No pull request should ever be accepted unless the engineer can answer the question, “How will I know if this breaks?”&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://increment.com/testing/i-test-in-production/"&gt;Charity Majors&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/charity-majors"&gt;charity-majors&lt;/a&gt;&lt;/p&gt;



</summary><category term="observability"/><category term="charity-majors"/></entry><entry><title>Logs vs. metrics: a false dichotomy</title><link href="https://simonwillison.net/2019/Aug/3/logs-vs-metrics/#atom-tag" rel="alternate"/><published>2019-08-03T16:46:55+00:00</published><updated>2019-08-03T16:46:55+00:00</updated><id>https://simonwillison.net/2019/Aug/3/logs-vs-metrics/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://whiteink.com/2019/logs-vs-metrics-a-false-dichotomy/"&gt;Logs vs. metrics: a false dichotomy&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Nick Stenning discusses the differences between logs and metrics: most notably that metrics can be derived from logs but logs cannot be reconstituted starting with time-series metrics.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/mipsytipsy/status/1157503142134607872"&gt;Charity Majors&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/logging"&gt;logging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;&lt;/p&gt;



</summary><category term="logging"/><category term="observability"/></entry><entry><title>Targeted diagnostic logging in production</title><link href="https://simonwillison.net/2019/Jul/24/targeted-diagnostic-logging-production/#atom-tag" rel="alternate"/><published>2019-07-24T05:44:39+00:00</published><updated>2019-07-24T05:44:39+00:00</updated><id>https://simonwillison.net/2019/Jul/24/targeted-diagnostic-logging-production/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tersesystems.com/blog/2019/07/22/targeted-diagnostic-logging-in-production/"&gt;Targeted diagnostic logging in production&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Will Sargent defines diagnostic logging as “debug logging statements with an audience”, and proposes controlling this style of logging via a feature flag system, so that detailed logging can be turned on in production for a selected subset of users to help debug difficult problems. Lots of great background material on the topic of observability here too.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/mipsytipsy/status/1153889935536975872"&gt;Charity Majors&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/logging"&gt;logging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;&lt;/p&gt;



</summary><category term="logging"/><category term="observability"/></entry><entry><title>Quoting Clint Sharp</title><link href="https://simonwillison.net/2019/Feb/25/clint-sharp/#atom-tag" rel="alternate"/><published>2019-02-25T22:15:45+00:00</published><updated>2019-02-25T22:15:45+00:00</updated><id>https://simonwillison.net/2019/Feb/25/clint-sharp/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/clintsharp/status/1098848313170784256"&gt;&lt;p&gt;Metrics are lossily compressed logs. Traces are logs with parent child relationships between entries. The only reason we have three terms is because getting value from them has required different compromises to make them cost effective.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/clintsharp/status/1098848313170784256"&gt;Clint Sharp&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/logs"&gt;logs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observability"&gt;observability&lt;/a&gt;&lt;/p&gt;



</summary><category term="logs"/><category term="observability"/></entry></feed>