<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: git-scraping</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/git-scraping.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-28T22:45:10+00:00</updated><author><name>Simon Willison</name></author><entry><title>simonw/actions-latest</title><link href="https://simonwillison.net/2025/Dec/28/actions-latest/#atom-tag" rel="alternate"/><published>2025-12-28T22:45:10+00:00</published><updated>2025-12-28T22:45:10+00:00</updated><id>https://simonwillison.net/2025/Dec/28/actions-latest/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/actions-latest"&gt;simonw/actions-latest&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Today in extremely niche projects, I got fed up with Claude Code creating GitHub Actions workflows for me that used stale actions: &lt;code&gt;actions/setup-python@v4&lt;/code&gt;, for example, when the latest is &lt;code&gt;actions/setup-python@v6&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I couldn't find a good single place listing those latest versions, so I had Claude Code for web (via my phone, I'm out on errands) build a Git scraper to publish those versions in one place:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://simonw.github.io/actions-latest/versions.txt"&gt;https://simonw.github.io/actions-latest/versions.txt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tell your coding agent of choice to fetch that any time it wants to write a new GitHub Actions workflow.&lt;/p&gt;
&lt;p&gt;(I may well bake this into a Skill.)&lt;/p&gt;
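&lt;p&gt;The underlying idea is simple enough to sketch. This is a hypothetical Python fragment, not the actual scraper: it assumes GitHub's documented &lt;code&gt;releases/latest&lt;/code&gt; API returns a &lt;code&gt;tag_name&lt;/code&gt; like &lt;code&gt;v6.1.2&lt;/code&gt;, which gets truncated to the major version pin:&lt;/p&gt;

```python
# Hypothetical sketch of the versions.txt idea - not the actual scraper.
# GitHub's documented endpoint:
#   GET https://api.github.com/repos/{owner}/{repo}/releases/latest
# returns JSON including a "tag_name" field such as "v6.1.2".

def major(tag):
    """Reduce a release tag like 'v6.1.2' to its major version 'v6'."""
    return tag.split(".")[0]

def versions_line(repo, tag_name):
    """Format one versions.txt-style line, e.g. 'actions/setup-python@v6'."""
    return f"{repo}@{major(tag_name)}"

# With tags fetched from the API (omitted here to stay self-contained;
# these example values are illustrative):
latest = {"actions/checkout": "v5.0.0", "actions/setup-python": "v6.1.2"}
versions_txt = "\n".join(versions_line(r, t) for r, t in sorted(latest.items()))
```

&lt;p&gt;A real run would loop over a curated list of popular actions and commit the output file whenever it changes.&lt;/p&gt;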
&lt;p&gt;Here's the &lt;a href="https://gistpreview.github.io/?7883c719a25802afa5cdde7d3ed68b32/index.html"&gt;first&lt;/a&gt; and &lt;a href="https://gistpreview.github.io/?0ddaa82aac2c062ff157c7a01db0a274/page-001.html"&gt;second&lt;/a&gt; transcript I used to build this, shared using my &lt;a href="https://simonwillison.net/2025/Dec/25/claude-code-transcripts/"&gt;claude-code-transcripts&lt;/a&gt; tool (which just &lt;a href="https://github.com/simonw/claude-code-transcripts/issues/15"&gt;gained a search feature&lt;/a&gt;).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="ai"/><category term="github-actions"/><category term="git-scraping"/><category term="generative-ai"/><category term="llms"/><category term="coding-agents"/><category term="claude-code"/></entry><entry><title>uv-init-demos</title><link href="https://simonwillison.net/2025/Dec/24/uv-init-demos/#atom-tag" rel="alternate"/><published>2025-12-24T22:05:23+00:00</published><updated>2025-12-24T22:05:23+00:00</updated><id>https://simonwillison.net/2025/Dec/24/uv-init-demos/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/uv-init-demos"&gt;uv-init-demos&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;code&gt;uv&lt;/code&gt; has a useful &lt;code&gt;uv init&lt;/code&gt; command for setting up new Python projects, but it comes with a bunch of different options like &lt;code&gt;--app&lt;/code&gt; and &lt;code&gt;--package&lt;/code&gt; and &lt;code&gt;--lib&lt;/code&gt;, and I wasn't sure how they differed.&lt;/p&gt;
&lt;p&gt;So I created this GitHub repository which demonstrates all of those options, generated using this &lt;a href="https://github.com/simonw/uv-init-demos/blob/main/update-projects.sh"&gt;update-projects.sh&lt;/a&gt; script (&lt;a href="https://gistpreview.github.io/?9cff2d3b24ba3d5f423b34abc57aec13"&gt;thanks, Claude&lt;/a&gt;), which will run on a schedule via GitHub Actions to capture any changes made by future releases of &lt;code&gt;uv&lt;/code&gt;.
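&lt;p&gt;The core of such a script is just running &lt;code&gt;uv init&lt;/code&gt; once per flag, each into its own directory. Here's a minimal Python sketch of that shape - not the actual &lt;code&gt;update-projects.sh&lt;/code&gt;, which is a shell script, and the directory naming is invented:&lt;/p&gt;

```python
# Sketch of the idea behind update-projects.sh, expressed in Python.
# The real project uses a shell script; directory names here are invented.
import shutil
import subprocess
import tempfile
from pathlib import Path

MODES = ["--app", "--package", "--lib"]  # the uv init flags being compared

def build_commands(base):
    """One 'uv init FLAG DIR' command per mode, each in its own directory."""
    return [
        ["uv", "init", mode, str(Path(base) / mode.lstrip("-"))]
        for mode in MODES
    ]

# Only actually invoke uv if it is installed on this machine.
if shutil.which("uv"):
    with tempfile.TemporaryDirectory() as tmp:
        for cmd in build_commands(tmp):
            subprocess.run(cmd, check=True)
```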


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="python"/><category term="github-actions"/><category term="git-scraping"/><category term="uv"/></entry><entry><title>aavetis/PRarena</title><link href="https://simonwillison.net/2025/Oct/1/prarena/#atom-tag" rel="alternate"/><published>2025-10-01T23:59:40+00:00</published><updated>2025-10-01T23:59:40+00:00</updated><id>https://simonwillison.net/2025/Oct/1/prarena/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/aavetis/PRarena"&gt;aavetis/PRarena&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Albert Avetisian runs this repository on GitHub, which uses the GitHub Search API to track the number of PRs that can be credited to a collection of different coding agents. The repo runs &lt;a href="https://github.com/aavetis/PRarena/blob/main/collect_data.py"&gt;this collect_data.py script&lt;/a&gt; every three hours &lt;a href="https://github.com/aavetis/PRarena/blob/main/.github/workflows/pr%E2%80%91stats.yml"&gt;using GitHub Actions&lt;/a&gt; to collect the data, then updates the &lt;a href="https://prarena.ai/"&gt;PR Arena site&lt;/a&gt; with a visual leaderboard.&lt;/p&gt;
&lt;p&gt;The result is this neat chart showing adoption of different agents over time, along with their PR success rate:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Line and bar chart showing PR metrics over time from 05/26 to 10/01. The left y-axis shows &amp;quot;Number of PRs&amp;quot; from 0 to 1,800,000, the right y-axis shows &amp;quot;Success Rate (%)&amp;quot; from 0% to 100%, and the x-axis shows &amp;quot;Time&amp;quot; with dates. Five line plots track success percentages: &amp;quot;Copilot Success % (Ready)&amp;quot; and &amp;quot;Copilot Success % (All)&amp;quot; (both blue, top lines around 90-95%), &amp;quot;Codex Success % (Ready)&amp;quot; and &amp;quot;Codex Success % (All)&amp;quot; (both brown/orange, middle lines declining from 80% to 60%), and &amp;quot;Cursor Success % (Ready)&amp;quot; and &amp;quot;Cursor Success % (All)&amp;quot; (both purple, middle lines around 75-85%), &amp;quot;Devin Success % (Ready)&amp;quot; and &amp;quot;Devin Success % (All)&amp;quot; (both teal/green, lower lines around 65%), and &amp;quot;Codegen Success % (Ready)&amp;quot; and &amp;quot;Codegen Success % (All)&amp;quot; (both brown, declining lines). Stacked bar charts show total and merged PRs for each tool: light blue and dark blue for Copilot, light red and dark red for Codex, light purple and dark purple for Cursor, light green and dark green for Devin, and light orange for Codegen. The bars show increasing volumes over time, with the largest bars appearing at 10/01 reaching approximately 1,700,000 total PRs." src="https://static.simonwillison.net/static/2025/ai-agents-chart.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I found this today while trying to pull off the exact same trick myself! I got as far as creating the following table before finding Albert's work and abandoning my own project.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Search term&lt;/th&gt;
&lt;th&gt;Total PRs&lt;/th&gt;
&lt;th&gt;Merged PRs&lt;/th&gt;
&lt;th&gt;% merged&lt;/th&gt;
&lt;th&gt;Earliest&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://claude.com/product/claude-code"&gt;Claude Code&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr in:body "Generated with Claude Code"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22Generated+with+Claude+Code%22&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;146,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22Generated+with+Claude+Code%22+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;123,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;84.2%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/turlockmike/hataraku/pull/83"&gt;Feb 21st&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/features/copilot"&gt;GitHub Copilot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr author:copilot-swe-agent[bot]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Acopilot-swe-agent%5Bbot%5D&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;247,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Acopilot-swe-agent%5Bbot%5D+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;152,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;61.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/abbhardwa/Relational-Database-Query-Parser/pull/2"&gt;March 7th&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://developers.openai.com/codex/cloud/"&gt;Codex Cloud&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr in:body "chatgpt.com" label:codex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22chatgpt.com%22+label%3Acodex&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;1,900,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+in%3Abody+%22chatgpt.com%22+label%3Acodex+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;1,600,000&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;84.2%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/adrianadiwidjaja/my-flask-app/pull/1"&gt;April 23rd&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://jules.google/"&gt;Google Jules&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is:pr author:google-labs-jules[bot]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Agoogle-labs-jules%5Bbot%5D&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;35,400&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/search?q=is%3Apr+author%3Agoogle-labs-jules%5Bbot%5D+is%3Amerged&amp;amp;type=pullrequests&amp;amp;s=created&amp;amp;o=asc"&gt;27,800&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;78.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/yukikurage/memento-proto/pull/2"&gt;May 22nd&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;(Those "earliest" links are a little questionable: I tried to filter out false positives and find the oldest one that appeared to really be from the agent in question.)&lt;/p&gt;
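&lt;p&gt;Each count in the table above comes from a single search query. Against the API rather than the web UI, the same trick can use GitHub's documented issue search endpoint, whose JSON response includes a &lt;code&gt;total_count&lt;/code&gt; field. A sketch of the URL construction (the &lt;code&gt;per_page=1&lt;/code&gt; trick just minimizes the payload; authentication and rate-limit handling are omitted):&lt;/p&gt;

```python
# Sketch: counting PRs matching a search term via the GitHub Search API.
# The endpoint and total_count field are documented; the fetch itself is
# omitted here so the example stays self-contained (real use needs an
# Authorization header and respect for rate limits).
from urllib.parse import urlencode

def search_count_url(query):
    """Build a search URL whose JSON response carries a total_count field."""
    return "https://api.github.com/search/issues?" + urlencode(
        {"q": query, "per_page": 1}
    )

url = search_count_url('is:pr in:body "Generated with Claude Code"')
```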
&lt;p&gt;It looks like OpenAI's Codex Cloud is &lt;em&gt;massively&lt;/em&gt; ahead of the competition right now in terms of numbers of PRs both opened and merged on GitHub.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: To clarify, these numbers are for the category of &lt;strong&gt;autonomous coding agents&lt;/strong&gt; - those systems where you assign a cloud-based agent a task or issue and the output is a PR against your repository. They do not (and cannot) capture the popularity of many forms of AI tooling that don't result in an easily identifiable pull request.&lt;/p&gt;
&lt;p&gt;Claude Code for example will be dramatically under-counted here because its version of an autonomous coding agent comes in the form of a somewhat obscure GitHub Actions workflow &lt;a href="https://docs.claude.com/en/docs/claude-code/github-actions"&gt;buried in the documentation&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="ai"/><category term="git-scraping"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="coding-agents"/><category term="claude-code"/><category term="async-coding-agents"/><category term="jules"/></entry><entry><title>simonw/ollama-models-atom-feed</title><link href="https://simonwillison.net/2025/Mar/22/ollama-models-atom-feed/#atom-tag" rel="alternate"/><published>2025-03-22T22:04:57+00:00</published><updated>2025-03-22T22:04:57+00:00</updated><id>https://simonwillison.net/2025/Mar/22/ollama-models-atom-feed/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/ollama-models-atom-feed"&gt;simonw/ollama-models-atom-feed&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I set up a GitHub Actions + GitHub Pages Atom feed of recent model data scraped from the Ollama &lt;a href="https://ollama.com/search?o=newest"&gt;latest models&lt;/a&gt; page - Ollama remains one of the easiest ways to run models on a laptop, so a new model release from them is worth hearing about.&lt;/p&gt;
&lt;p&gt;I built the scraper by pasting example HTML &lt;a href="https://claude.ai/share/c96d6bb9-a976-45f9-82c2-8599c2d6d492"&gt;into Claude&lt;/a&gt; and asking for a Python script to convert it to Atom - here's &lt;a href="https://github.com/simonw/ollama-models-atom-feed/blob/main/to_atom.py"&gt;the script&lt;/a&gt; we wrote together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 25th March 2025&lt;/strong&gt;: The first version of this included all 160+ models in a single feed. I've upgraded the script to output two feeds - the original &lt;a href="https://simonw.github.io/ollama-models-atom-feed/atom.xml"&gt;atom.xml&lt;/a&gt; one and a new &lt;a href="https://simonw.github.io/ollama-models-atom-feed/atom-recent-20.xml"&gt;atom-recent-20.xml&lt;/a&gt; feed containing just the most recent 20 items.&lt;/p&gt;
&lt;p&gt;I modified the script using Google's &lt;a href="https://simonwillison.net/2025/Mar/25/gemini/"&gt;new Gemini 2.5 Pro&lt;/a&gt; model, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat to_atom.py | llm -m gemini-2.5-pro-exp-03-25 \
  -s 'rewrite this script so that instead of outputting Atom to stdout it saves two files, one called atom.xml with everything and another called atom-recent-20.xml with just the most recent 20 items - remove the output option entirely'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/358b5caa015de53dee0fbc96415ae6d6"&gt;full transcript&lt;/a&gt;.
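&lt;p&gt;The two-feed output described above is a one-line change once feed generation is a function of the entry list. A simplified sketch of that shape, using &lt;code&gt;ElementTree&lt;/code&gt; - the real &lt;code&gt;to_atom.py&lt;/code&gt; differs in details like entry fields and dates:&lt;/p&gt;

```python
# Simplified sketch of the atom.xml / atom-recent-20.xml split.
# The real to_atom.py includes more entry fields (dates, links, summaries).
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def build_feed(entries):
    """Serialize a minimal Atom feed from a list of {'title', 'id'} dicts."""
    ET.register_namespace("", ATOM)
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "Ollama models"
    for item in entries:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = item["title"]
        ET.SubElement(entry, f"{{{ATOM}}}id").text = item["id"]
    return ET.tostring(feed, encoding="unicode")

def render_feeds(entries):
    """Entries are assumed newest-first; the second feed is just a slice."""
    return {
        "atom.xml": build_feed(entries),
        "atom-recent-20.xml": build_feed(entries[:20]),
    }
```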


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/atom"&gt;atom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;&lt;/p&gt;



</summary><category term="atom"/><category term="github"/><category term="projects"/><category term="ai"/><category term="github-actions"/><category term="git-scraping"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="ollama"/></entry><entry><title>Building and deploying a custom site using GitHub Actions and GitHub Pages</title><link href="https://simonwillison.net/2025/Mar/18/actions-pages/#atom-tag" rel="alternate"/><published>2025-03-18T20:17:34+00:00</published><updated>2025-03-18T20:17:34+00:00</updated><id>https://simonwillison.net/2025/Mar/18/actions-pages/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/github-actions/github-pages"&gt;Building and deploying a custom site using GitHub Actions and GitHub Pages&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I figured out a minimal example of how to use GitHub Actions to run custom scripts to build a website and then publish that static site to GitHub Pages. I turned &lt;a href="https://github.com/simonw/minimal-github-pages-from-actions/"&gt;the example&lt;/a&gt; into a template repository, which should make getting started for a new project extremely quick.&lt;/p&gt;
&lt;p&gt;I've needed this for various projects over the years, but today I finally put these notes together while setting up &lt;a href="https://github.com/simonw/recent-california-brown-pelicans"&gt;a system&lt;/a&gt; for scraping the &lt;a href="https://www.inaturalist.org/"&gt;iNaturalist&lt;/a&gt; API for recent sightings of the California Brown Pelican and converting those into an Atom feed that I can subscribe to in &lt;a href="https://netnewswire.com/"&gt;NetNewsWire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a Brown Pelican sighting Atom feed in NetNewsWire showing a list of entries on the left sidebar and detailed view of &amp;quot;Brown Pelican at Art Museum, Isla Vista, CA 93117, USA&amp;quot; on the right with date &amp;quot;MAR 13, 2025 AT 10:40 AM&amp;quot;, coordinates &amp;quot;34.4115542997, -119.8500448&amp;quot;, and a photo of three brown pelicans in water near a dock with copyright text &amp;quot;(c) Ery, all rights reserved&amp;quot;" src="https://static.simonwillison.net/static/2025/pelicans-netnewswire.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I got Claude &lt;a href="https://claude.ai/share/533a1d59-60db-4686-bd50-679dd01a585e"&gt;to write&lt;/a&gt; me &lt;a href="https://github.com/simonw/recent-california-brown-pelicans/blob/81f87b378b6626e97eeca0719e89c87ace141816/to_atom.py"&gt;the script&lt;/a&gt; that converts the scraped JSON to atom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I just &lt;a href="https://sfba.social/@kueda/114185945871929778"&gt;found out&lt;/a&gt; iNaturalist have their own atom feeds! Here's their own &lt;a href="https://www.inaturalist.org/observations.atom?verifiable=true&amp;amp;taxon_id=123829"&gt;feed of recent Pelican observations&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/atom"&gt;atom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/netnewswire"&gt;netnewswire&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/inaturalist"&gt;inaturalist&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;



</summary><category term="atom"/><category term="github"/><category term="netnewswire"/><category term="inaturalist"/><category term="github-actions"/><category term="git-scraping"/><category term="ai-assisted-programming"/></entry><entry><title>Cutting-edge web scraping techniques at NICAR</title><link href="https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-tag" rel="alternate"/><published>2025-03-08T19:25:36+00:00</published><updated>2025-03-08T19:25:36+00:00</updated><id>https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/nicar-2025-scraping/blob/main/README.md"&gt;Cutting-edge web scraping techniques at NICAR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here's the handout for a workshop I presented this morning at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR 2025&lt;/a&gt; on web scraping, focusing on lesser-known tips and tricks that became possible only with recent developments in LLMs.&lt;/p&gt;
&lt;p&gt;For workshops like this I like to work off an extremely detailed handout, so that people can move at their own pace or catch up later if they didn't get everything done.&lt;/p&gt;
&lt;p&gt;The workshop consisted of four parts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Building a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraper&lt;/a&gt; - an automated scraper in GitHub Actions that records changes to a resource over time&lt;/li&gt;
&lt;li&gt;Using in-browser JavaScript and then &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; to extract useful information&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; with both OpenAI and Google Gemini to extract structured data from unstructured websites&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2024/Oct/17/video-scraping/"&gt;Video scraping&lt;/a&gt; using &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;I released several new tools in preparation for this workshop (I call this "NICAR Driven Development"):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;git-scraper-template&lt;/a&gt; template repository for quickly setting up new Git scrapers, which I &lt;a href="https://simonwillison.net/2025/Feb/26/git-scraper-template/"&gt;wrote about here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;LLM schemas&lt;/a&gt;, finally adding structured schema support to my LLM tool&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har&lt;/a&gt;  for archiving pages as HTML Archive files - though I cut this from the workshop for time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also came up with a fun way to distribute API keys for workshop participants: I &lt;a href="https://claude.ai/share/8d3330c8-7fd4-46d1-93d4-a3bd05915793"&gt;had Claude build me&lt;/a&gt; a web page where I can create an encrypted message with a passphrase, then share a URL to that page with users and give them the passphrase to unlock the encrypted message. You can try that at &lt;a href="https://tools.simonwillison.net/encrypt"&gt;tools.simonwillison.net/encrypt&lt;/a&gt; - or &lt;a href="https://tools.simonwillison.net/encrypt#5ZeXCdZ5pqCcHqE1y0aGtoIijlUW+ipN4gjQV4A2/6jQNovxnDvO6yoohgxBIVWWCN8m6ppAdjKR41Qzyq8Keh0RP7E="&gt;use this link&lt;/a&gt; and enter the passphrase "demo":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a message encryption/decryption web interface showing the title &amp;quot;Encrypt / decrypt message&amp;quot; with two tab options: &amp;quot;Encrypt a message&amp;quot; and &amp;quot;Decrypt a message&amp;quot; (highlighted). Below shows a decryption form with text &amp;quot;This page contains an encrypted message&amp;quot;, a passphrase input field with dots, a blue &amp;quot;Decrypt message&amp;quot; button, and a revealed message saying &amp;quot;This is a secret message&amp;quot;." src="https://static.simonwillison.net/static/2025/encrypt-decrypt.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;



</summary><category term="scraping"/><category term="speaking"/><category term="ai"/><category term="git-scraping"/><category term="shot-scraper"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="nicar"/><category term="claude-artifacts"/><category term="prompt-to-app"/></entry><entry><title>simonw/git-scraper-template</title><link href="https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag" rel="alternate"/><published>2025-02-26T05:34:05+00:00</published><updated>2025-02-26T05:34:05+00:00</updated><id>https://simonwillison.net/2025/Feb/26/git-scraper-template/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/git-scraper-template"&gt;simonw/git-scraper-template&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built this new GitHub template repository in preparation for a workshop I'm giving at &lt;a href="https://www.ire.org/training/conferences/nicar-2025/"&gt;NICAR&lt;/a&gt; (the data journalism conference) next week on &lt;a href="https://github.com/simonw/nicar-2025-scraping/"&gt;Cutting-edge web scraping techniques&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the topics I'll be covering is &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.&lt;/p&gt;
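&lt;p&gt;The shape of such a scraping workflow is worth spelling out. This is a hedged sketch - the cron schedule, fetch command and action version are illustrative, not taken from any particular repo:&lt;/p&gt;

```yaml
name: Scrape
on:
  workflow_dispatch:
  schedule:
    - cron: "6 * * * *"  # hourly, at six minutes past (illustrative)
permissions:
  contents: write  # allow the workflow to push commits
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the latest data
        run: curl --silent https://example.com/data.json | jq . > data.json
      - name: Commit if anything changed
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add -A
          git diff --cached --quiet || git commit -m "Latest data"
          git push
```

&lt;p&gt;The &lt;code&gt;git diff --cached --quiet&lt;/code&gt; guard means the workflow only commits when the scraped content actually changed.&lt;/p&gt;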
&lt;p&gt;This template repository is designed to be the fastest possible way to get started with a new Git scraper: simply &lt;a href="https://github.com/new?template_name=git-scraper-template&amp;amp;template_owner=simonw"&gt;create a new repository from the template&lt;/a&gt;, paste the URL you want to scrape into the &lt;strong&gt;description&lt;/strong&gt; field, and the repository will be initialized with a custom script that scrapes and stores that URL.&lt;/p&gt;
&lt;p&gt;It's modeled after my earlier &lt;a href="https://github.com/simonw/shot-scraper-template"&gt;shot-scraper-template&lt;/a&gt; tool which I described in detail in &lt;a href="https://simonwillison.net/2022/Mar/14/shot-scraper-template/"&gt;Instantly create a GitHub repository to take screenshots of a web page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;git-scraper-template&lt;/code&gt; repo took &lt;a href="https://github.com/simonw/git-scraper-template/issues/2#issuecomment-2683871054"&gt;some help from Claude&lt;/a&gt; to figure out. It uses a &lt;a href="https://github.com/simonw/git-scraper-template/blob/a2b12972584099d7c793ee4b38303d94792bf0f0/download.sh"&gt;custom script&lt;/a&gt; to download the provided URL and derive a filename to use based on the URL and the content type, detected using &lt;code&gt;file --mime-type -b "$file_path"&lt;/code&gt; against the downloaded file.&lt;/p&gt;
&lt;p&gt;It also detects if the downloaded content is JSON and, if it is, pretty-prints it using &lt;code&gt;jq&lt;/code&gt; - I find this is a quick way to generate much more useful diffs when the content changes.
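&lt;p&gt;The filename-derivation and pretty-printing steps translate to a few lines in any language. Here's a Python rendering of the idea - the real &lt;code&gt;download.sh&lt;/code&gt; is a shell script and uses &lt;code&gt;file --mime-type&lt;/code&gt;, and the fallback rules here are invented:&lt;/p&gt;

```python
# Python rendering of the download.sh idea: name the file from the URL
# and detected content type, and pretty-print JSON for better diffs.
# The real script is shell; the fallback rules here are invented.
import json
import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse

def derive_filename(url, mime_type):
    """Last URL path segment (or the host) plus an extension for the type."""
    parsed = urlparse(url)
    stem = PurePosixPath(parsed.path).stem or parsed.netloc
    ext = mimetypes.guess_extension(mime_type) or ".bin"
    return stem + ext

def maybe_pretty_print(raw):
    """Pretty-print JSON content so future commits produce readable diffs."""
    try:
        return json.dumps(json.loads(raw), indent=2)
    except ValueError:
        return raw
```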


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="nicar"/></entry><entry><title>Using a Tailscale exit node with GitHub Actions</title><link href="https://simonwillison.net/2025/Feb/23/tailscale-exit-node-with-github-actions/#atom-tag" rel="alternate"/><published>2025-02-23T02:49:32+00:00</published><updated>2025-02-23T02:49:32+00:00</updated><id>https://simonwillison.net/2025/Feb/23/tailscale-exit-node-with-github-actions/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/tailscale/tailscale-github-actions"&gt;Using a Tailscale exit node with GitHub Actions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New TIL. I started running a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; against doge.gov to track changes made to that website over time. The DOGE site runs behind Cloudflare, which was blocking requests from the GitHub Actions IP range, but I figured out how to run a Tailscale exit node on my Apple TV and use that to proxy my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; requests.&lt;/p&gt;
&lt;p&gt;The scraper is running in &lt;a href="https://github.com/simonw/scrape-doge-gov"&gt;simonw/scrape-doge-gov&lt;/a&gt;. It uses the new &lt;a href="https://shot-scraper.datasette.io/en/stable/har.html"&gt;shot-scraper har&lt;/a&gt; command I added in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.6"&gt;shot-scraper 1.6&lt;/a&gt; (and improved in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.7"&gt;shot-scraper 1.7&lt;/a&gt;).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="scraping"/><category term="github-actions"/><category term="tailscale"/><category term="til"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>New improved commit messages for scrape-hacker-news-by-domain</title><link href="https://simonwillison.net/2024/Sep/6/improved-commit-messages-csv-diff/#atom-tag" rel="alternate"/><published>2024-09-06T05:40:01+00:00</published><updated>2024-09-06T05:40:01+00:00</updated><id>https://simonwillison.net/2024/Sep/6/improved-commit-messages-csv-diff/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/issues/6"&gt;New improved commit messages for scrape-hacker-news-by-domain&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
My &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain"&gt;simonw/scrape-hacker-news-by-domain&lt;/a&gt; repo has a very specific purpose. Once an hour it scrapes the Hacker News &lt;a href="https://news.ycombinator.com/from?site=simonwillison.net"&gt;/from?site=simonwillison.net&lt;/a&gt; page (and the equivalent &lt;a href="https://news.ycombinator.com/from?site=datasette.io"&gt;for datasette.io&lt;/a&gt;) using my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; tool and stashes the parsed links, scores and comment counts in JSON files in that repo.&lt;/p&gt;
&lt;p&gt;It does this mainly so I can subscribe to GitHub's Atom feed of the commit log - visit &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main"&gt;simonw/scrape-hacker-news-by-domain/commits/main&lt;/a&gt; and add &lt;code&gt;.atom&lt;/code&gt; to the URL to get that.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://netnewswire.com/"&gt;NetNewsWire&lt;/a&gt; will inform me within about an hour if any of my content has made it to Hacker News, and the repo will track the score and comment count for me over time. I wrote more about how this works in &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#scrape-a-web-page"&gt;Scraping web pages from the command line with shot-scraper&lt;/a&gt; back in March 2022.&lt;/p&gt;
&lt;p&gt;Prior to the latest improvement, the commit messages themselves were pretty uninformative. The message had the date, and to actually see which Hacker News post it was referring to, I had to click through to the commit and look at the diff.&lt;/p&gt;
&lt;p&gt;I built my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; tool a while back to help address this problem: it can produce a slightly more human-readable version of a diff between two CSV or JSON files, ideally suited for including in a commit message attached to a &lt;a href="https://simonwillison.net/tags/git-scraping/"&gt;git scraping&lt;/a&gt; repo like this one.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commit/35aa3c6c03507d89dd2eb7afa54839b2575b0e33"&gt;got that working&lt;/a&gt;, but there was still room for improvement. I recently learned that any Hacker News thread has an undocumented URL at &lt;code&gt;/latest?id=x&lt;/code&gt; which displays the most recently added comments at the top.&lt;/p&gt;
&lt;p&gt;I wanted that in my commit messages, so I could quickly click a link to see the most recent comments on a thread.&lt;/p&gt;
&lt;p&gt;So... I added one more feature to &lt;code&gt;csv-diff&lt;/code&gt;: a new &lt;a href="https://github.com/simonw/csv-diff/issues/38"&gt;--extra option&lt;/a&gt; lets you specify a Python format string to be used to add extra fields to the displayed difference.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/blob/main/.github/workflows/scrape.yml"&gt;GitHub Actions workflow&lt;/a&gt; now runs this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csv-diff simonwillison-net.json simonwillison-net-new.json \
  --key id --format json \
  --extra latest 'https://news.ycombinator.com/latest?id={id}' \
  &amp;gt;&amp;gt; /tmp/commit.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This generates the diff between the two versions, using the &lt;code&gt;id&lt;/code&gt; property in the JSON to tie records together. It adds a &lt;code&gt;latest&lt;/code&gt; field linking to that URL.&lt;/p&gt;
&lt;p&gt;The commits now &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commit/bda23fc358d978392d38933083ba1c49f50c107a"&gt;look like this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fri Sep 6 05:22:32 UTC 2024. 1 row changed. id: 41459472 points: &amp;quot;25&amp;quot; =&amp;gt; &amp;quot;27&amp;quot; numComments: &amp;quot;7&amp;quot; =&amp;gt; &amp;quot;8&amp;quot; extras: latest: https://news.ycombinator.com/latest?id=41459472" src="https://static.simonwillison.net/static/2024/hacker-news-commit.jpg" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;



</summary><category term="hacker-news"/><category term="json"/><category term="projects"/><category term="github-actions"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>interactive-feed</title><link href="https://simonwillison.net/2024/Jul/5/interactive-feed/#atom-tag" rel="alternate"/><published>2024-07-05T23:39:01+00:00</published><updated>2024-07-05T23:39:01+00:00</updated><id>https://simonwillison.net/2024/Jul/5/interactive-feed/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/sammorrisdesign/interactive-feed"&gt;interactive-feed&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sam Morris maintains this project which gathers interactive, graphic and data visualization stories from various newsrooms around the world and publishes them on &lt;a href="https://twitter.com/InteractiveFeed"&gt;Twitter&lt;/a&gt;, &lt;a href="https://botsin.space/@Interactives"&gt;Mastodon&lt;/a&gt; and &lt;a href="https://staging.bsky.app/profile/interactives.bsky.social"&gt;Bluesky&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It runs automatically using GitHub Actions, and gathers data using a number of different techniques - XML feeds, custom API integrations (for the NYT, Guardian and Washington Post) and in some cases by scraping index pages on news websites &lt;a href="https://github.com/sammorrisdesign/interactive-feed/blob/1652b7b6a698ad97f88b542cfdd94a90be4f119c/src/fetchers.js#L221-L251"&gt;using CSS selectors and cheerio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The data it collects is archived as JSON in the &lt;a href="https://github.com/sammorrisdesign/interactive-feed/tree/main/data"&gt;data/ directory&lt;/a&gt; of the repository.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/palewire/status/1809361645799452977"&gt;@palewire&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bluesky"&gt;bluesky&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="git-scraping"/><category term="mastodon"/><category term="bluesky"/></entry><entry><title>Figure out who's leaving the company: dump, diff, repeat</title><link href="https://simonwillison.net/2024/Feb/9/figure-out-whos-leaving-the-company/#atom-tag" rel="alternate"/><published>2024-02-09T05:44:31+00:00</published><updated>2024-02-09T05:44:31+00:00</updated><id>https://simonwillison.net/2024/Feb/9/figure-out-whos-leaving-the-company/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rachelbythebay.com/w/2024/02/08/ldap/"&gt;Figure out who&amp;#x27;s leaving the company: dump, diff, repeat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous one to figure out who has joined or left the company.&lt;/p&gt;
&lt;p&gt;I suggest using Git for this - a form of Git scraping - as then you get a detailed commit log of changes over time effectively for free.&lt;/p&gt;
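&lt;p&gt;As a rough sketch of that diff step (the one-username-per-line dump format here is invented purely for illustration):&lt;/p&gt;

```python
# Sketch of the dump/diff idea: compare two snapshots of a
# machine-readable employee directory to spot joiners and leavers.
# Commit each snapshot to Git and you get this history for free.

def diff_directory(previous, current):
    """Return (joined, left) between two newline-delimited dumps."""
    before = set(previous.split())
    after = set(current.split())
    joined = sorted(after - before)
    left = sorted(before - after)
    return joined, left

monday = "alice\nbob\ncarol"
friday = "alice\ncarol\ndave"
joined, left = diff_directory(monday, friday)
print("joined:", joined)  # joined: ['dave']
print("left:", left)      # left: ['bob']
```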
&lt;p&gt;I really enjoyed Rachel's closing thought:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Incidentally, if someone gets mad about you running this sort of thing, you probably don't want to work there anyway. On the other hand, if you're able to build such tools without IT or similar getting "threatened" by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don't tend to last.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=39311507"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rachel-kroll"&gt;rachel-kroll&lt;/a&gt;&lt;/p&gt;



</summary><category term="git"/><category term="git-scraping"/><category term="rachel-kroll"/></entry><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
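&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; body is ordinary Python. As a standalone sketch (using a couple of made-up instances shaped like the example above, where &lt;code&gt;users&lt;/code&gt; can be null and &lt;code&gt;statuses&lt;/code&gt; arrives as a string), the same aggregation looks like this:&lt;/p&gt;

```python
import json

# Standalone version of the aggregation in the --convert snippet:
# collapse a whole instances.json snapshot into one row of totals.
# Note the data quirks: "users" may be null, "statuses" is a string.

def summarise(content):
    instances = json.loads(content)
    return [{
        "id": "all",
        "users": sum(d["users"] or 0 for d in instances),
        "statuses": sum(int(d["statuses"] or 0) for d in instances),
    }]

snapshot = json.dumps([
    {"name": "pleroma.otter.sh", "users": 5, "statuses": "54870"},
    {"name": "example.social", "users": None, "statuses": "100"},
])
print(summarise(snapshot))
# [{'id': 'all', 'users': 5, 'statuses': 54970}]
```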
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
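&lt;p&gt;Here's a sketch of that join using the truncated schema above (a real &lt;code&gt;git-history&lt;/code&gt; database has more columns, and an &lt;code&gt;item_version_detail&lt;/code&gt; view that does this join for you):&lt;/p&gt;

```python
import sqlite3

# Demonstrate joining item_version against commits to get a dated
# series of user/status counts. Schema and values are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY, _commit INTEGER REFERENCES commits(id),
    id TEXT, users INTEGER, statuses INTEGER
);
INSERT INTO commits VALUES (1, 'abc123', '2022-11-20T01:00:00');
INSERT INTO item_version VALUES (1, 1, 'all', 4694000, 374000000);
""")
rows = db.execute("""
    SELECT commits.commit_at, item_version.users, item_version.statuses
    FROM item_version
    JOIN commits ON item_version._commit = commits.id
    ORDER BY commits.commit_at
""").fetchall()
print(rows)  # [('2022-11-20T01:00:00', 4694000, 374000000)]
```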
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created  user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We need to make one change to that data - we need to convert the &lt;code&gt;date&lt;/code&gt; column from a string to a JavaScript date object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Measuring traffic during the Half Moon Bay Pumpkin Festival</title><link href="https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-tag" rel="alternate"/><published>2022-10-19T15:41:09+00:00</published><updated>2022-10-19T15:41:09+00:00</updated><id>https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-tag</id><summary type="html">
    &lt;p&gt;This weekend was the &lt;a href="https://pumpkinfest.miramarevents.com/" rel="nofollow"&gt;50th annual Half Moon Bay Pumpkin Festival&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We live in El Granada, a tiny town 8 minutes drive from Half Moon Bay. There is a single road (coastal highway one) between the two towns, and the festival is locally notorious for its impact on traffic.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat" rel="nofollow"&gt;Natalie&lt;/a&gt; suggested that we measure the traffic and try and see the impact for ourselves!&lt;/p&gt;
&lt;p&gt;Here's the end result for Saturday. Read on for details on how we created it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-collecting-the-data" class="anchor" aria-hidden="true" href="#collecting-the-data"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Collecting the data&lt;/h4&gt;
&lt;p&gt;I built a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/" rel="nofollow"&gt;git scraper&lt;/a&gt; to gather data from the Google Maps &lt;a href="https://developers.google.com/maps/documentation/directions/overview" rel="nofollow"&gt;Directions API&lt;/a&gt;. It turns out if you pass &lt;code&gt;departure_time=now&lt;/code&gt; to that API it returns the current estimated time in traffic as part of the response.&lt;/p&gt;
&lt;p&gt;I picked a location in Half Moon Bay and a location in El Granada and constructed the following URL (pretty-printed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://maps.googleapis.com/maps/api/directions/json?
  origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
  &amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
  &amp;amp;departure_time=now
  &amp;amp;key=$GOOGLE_MAPS_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two locations here are defined using Google Plus codes. Here they are on Google Maps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/FH78%2BQJ+Half+Moon+Bay,+CA,+USA" rel="nofollow"&gt;FH78+QJ Half Moon Bay, CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/GG49%2BCH+El+Granada+CA,+USA" rel="nofollow"&gt;GG49+CH El Granada CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
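&lt;p&gt;The value I'd eventually extract from each response is a single nested lookup. As an illustrative sketch (this is my own Python, not code from the repository; the function name and trimmed-down sample payload are made up), the extraction looks like this:&lt;/p&gt;

```python
import json

def duration_in_traffic_seconds(payload):
    # Pull routes[0].legs[0].duration_in_traffic.value out of a
    # Directions API response; return None when it is missing.
    try:
        data = json.loads(payload)
        return data["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    except (KeyError, IndexError, ValueError):
        return None

# A trimmed-down example of the response shape the API returns:
sample = json.dumps({
    "routes": [{"legs": [{"duration_in_traffic": {"value": 1110}}]}]
})
print(duration_in_traffic_seconds(sample))  # 1110
print(duration_in_traffic_seconds("{}"))    # None
```

&lt;p&gt;Responses without &lt;code&gt;departure_time=now&lt;/code&gt; won't include &lt;code&gt;duration_in_traffic&lt;/code&gt; at all, which is why the missing-key case matters.&lt;/p&gt;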
&lt;p&gt;I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in &lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;this repository&lt;/a&gt; to fetch this API data, pretty-print it with &lt;code&gt;jq&lt;/code&gt; and write it to the repository:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape traffic&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*/5 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;GOOGLE_MAPS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.GOOGLE_MAPS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|        &lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; one.json&lt;/span&gt;
&lt;span class="pl-s"&gt;        sleep 3&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&amp;amp;destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; two.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git pull --rebase&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using a GitHub Actions secret called &lt;code&gt;GOOGLE_MAPS_KEY&lt;/code&gt; to store the Google Maps API key.&lt;/p&gt;
&lt;p&gt;This workflow runs every 5 minutes (more-or-less - GitHub Actions doesn't necessarily stick to the schedule). It fetches the two JSON results and writes them to files called &lt;code&gt;one.json&lt;/code&gt; and &lt;code&gt;two.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I've built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-analyzing-the-data-and-drawing-some-charts" class="anchor" aria-hidden="true" href="#analyzing-the-data-and-drawing-some-charts"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Analyzing the data and drawing some charts&lt;/h4&gt;
&lt;p&gt;The trick with git scraping is that the data you care about ends up captured in &lt;a href="https://github.com/simonw/scrape-hmb-traffic/commits/main"&gt;the git commit log&lt;/a&gt;. The challenge is how to extract that back out again and turn it into something useful.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/" rel="nofollow"&gt;git-history tool&lt;/a&gt; is designed to solve this. It's a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.&lt;/p&gt;
&lt;p&gt;Normally I run it against CSV or JSON files containing an array of rows - effectively tabular data already, where I just want to record what has changed in between commits.&lt;/p&gt;
&lt;p&gt;For this project, I was storing the raw JSON output by the Google Maps API. I didn't care about most of the information in there: I really just wanted the &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.&lt;/p&gt;
&lt;p&gt;To cut a long story short, here's the incantation that worked:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db one.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;git-history file&lt;/code&gt; command is used to load the history for a specific file - in this case it's the file &lt;code&gt;one.json&lt;/code&gt;, which will be loaded into a new SQLite database file called &lt;code&gt;hmb.db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; code uses &lt;code&gt;json.loads(content)&lt;/code&gt; to load the JSON for the current file version, then pulls out the &lt;code&gt;["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/code&gt; nested value from it.&lt;/p&gt;
&lt;p&gt;If that's missing (e.g. in an earlier commit, when I hadn't yet added the &lt;code&gt;departure_time=now&lt;/code&gt; parameter to the URL) an exception will be caught and the function will return an empty list.&lt;/p&gt;
&lt;p&gt;If the &lt;code&gt;duration_in_traffic&lt;/code&gt; value is present, the function returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[{"id": "one", "duration_in_traffic": duration_in_traffic}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; likes lists of dictionaries. It's usually being run against files that contain many different rows, where the &lt;code&gt;id&lt;/code&gt; column can be used to de-dupe rows across commits and spot what has changed.&lt;/p&gt;
&lt;p&gt;In this case, each file only has a single interesting value.&lt;/p&gt;
&lt;p&gt;Two more options are used here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--full-versions&lt;/code&gt; - tells &lt;code&gt;git-history&lt;/code&gt; to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a &lt;code&gt;null&lt;/code&gt; if a value has not changed in order to save space, but our data is tiny here so we don't need any clever optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--id id&lt;/code&gt; specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After running the above command, the resulting schema includes these tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [duration_in_traffic] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;commits&lt;/code&gt; table includes the date of the commit - &lt;code&gt;commit_at&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;item_version&lt;/code&gt; table has that &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;,
    duration_in_traffic
&lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
&lt;span class="pl-k"&gt;join&lt;/span&gt;
    commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That query returns data that looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;commit_at&lt;/th&gt;
&lt;th&gt;duration_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:09:06+00:00&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:17:38+00:00&lt;/td&gt;
&lt;td&gt;1016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:30:06+00:00&lt;/td&gt;
&lt;td&gt;1391&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A couple of problems here. First, the &lt;code&gt;commit_at&lt;/code&gt; column is in UTC, not local time. Second, &lt;code&gt;duration_in_traffic&lt;/code&gt; is in seconds, which aren't particularly easy to read.&lt;/p&gt;
&lt;p&gt;Here's a SQLite fix for these two issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;datetime(commits.commit_at, '-7 hours')&lt;/code&gt; parses the UTC string as a datetime, then subtracts 7 hours from it to convert from UTC to California local time.&lt;/p&gt;
&lt;p&gt;I wrap that in &lt;code&gt;time()&lt;/code&gt; here because for the chart I want to render I know everything will be on the same day.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mins_in_traffic&lt;/code&gt; now shows minutes, not seconds.&lt;/p&gt;
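&lt;p&gt;Those two SQLite expressions can be tried standalone, outside the git-history database. This is my own sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module against one sample value from the table above:&lt;/p&gt;

```python
import sqlite3

# Run the same conversion the query uses: parse a UTC timestamp,
# shift it by -7 hours, keep just the time-of-day, and turn
# seconds in traffic into whole minutes (integer division).
conn = sqlite3.connect(":memory:")
row = conn.execute(
    "select time(datetime(?, '-7 hours')), ? / 60",
    ("2022-10-15T17:09:06+00:00", 1110),
).fetchone()
print(row)  # ('10:09:06', 18)
```

&lt;p&gt;SQLite's date functions accept the ISO 8601 timestamps with timezone offsets that git-history stores, and dividing two integers gives an integer result, so no explicit casting is needed.&lt;/p&gt;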
&lt;p&gt;We now have enough data to render a chart!&lt;/p&gt;
&lt;p&gt;But... we only have one of the two directions of traffic here. To process the numbers from &lt;code&gt;two.json&lt;/code&gt; as well I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db two.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id --namespace item2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; the same as the previous command. It's running against &lt;code&gt;two.json&lt;/code&gt; instead of &lt;code&gt;one.json&lt;/code&gt;, and it's using the &lt;code&gt;--namespace item2&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This causes it to populate a new table called &lt;code&gt;item2_version&lt;/code&gt; instead of &lt;code&gt;item_version&lt;/code&gt;, which is a cheap trick to avoid having to figure out how to load both files into the same table.&lt;/p&gt;
&lt;h2&gt;&lt;a id="user-content-two-lines-on-one-chart" class="anchor" aria-hidden="true" href="#two-lines-on-one-chart"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Two lines on one chart&lt;/h2&gt;
&lt;p&gt;I rendered an initial single line chart using &lt;a href="https://datasette.io/plugins/datasette-vega" rel="nofollow"&gt;datasette-vega&lt;/a&gt;, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.&lt;/p&gt;
&lt;p&gt;Since I now had one table for each direction of traffic (&lt;code&gt;item_version&lt;/code&gt; and &lt;code&gt;item2_version&lt;/code&gt;) I decided to combine those into a single table, suitable for pasting into Google Sheets.&lt;/p&gt;
&lt;p&gt;Here's the SQL I came up with to do that:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with item1 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
),
item2 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item2_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item2_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  item1.&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;mins_in_traffic&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic_other_way
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  item1
  &lt;span class="pl-k"&gt;join&lt;/span&gt; item2 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item1&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses two CTEs (Common Table Expressions - the &lt;code&gt;with X as&lt;/code&gt; pieces) using the pattern I explained earlier - now called &lt;code&gt;item1&lt;/code&gt; and &lt;code&gt;item2&lt;/code&gt;. Having defined these two CTEs, I can join them together on the &lt;code&gt;t&lt;/code&gt; column, which is the time of day.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db?&amp;amp;install=datasette-copyable#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;Try running this query&lt;/a&gt; in Datasette Lite.&lt;/p&gt;
&lt;p&gt;Here's the output of that query for Saturday (10am to 8pm):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;th&gt;mins_in_traffic_other_way&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:47:38&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:57:37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:08:20&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:22:27&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:38:42&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:52:35&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:03:23&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:15:16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:27:51&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:37:48&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:46:41&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:55:03&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:05:10&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:17:57&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:32:55&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:44:53&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:55:22&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:21&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:17:48&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:31:04&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:41:59&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:51:48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:00:09&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:11:17&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:25:48&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:39:41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:51:11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:59:34&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:10:50&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:25:43&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:53:06&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:11:34&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:40:29&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:12:07&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:58:17&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20:05:13&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the same chart for Sunday:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This chart shows the same thing but for Sunday" src="https://static.simonwillison.net/static/2022/pumpkin-sunday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://docs.google.com/spreadsheets/d/1JOimtkugZBF_YQxqn0Gn6NiIhNz-OMH2rpOZtmECAY4/edit#gid=0" rel="nofollow"&gt;Google Sheet is here&lt;/a&gt; - the two days have two separate tabs within the sheet.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-building-the-sqlite-database-in-github-actions" class="anchor" aria-hidden="true" href="#building-the-sqlite-database-in-github-actions"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Building the SQLite database in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I did most of the development work for this project on my laptop, running &lt;code&gt;git-history&lt;/code&gt; and &lt;code&gt;datasette&lt;/code&gt; locally for speed of iteration.&lt;/p&gt;
&lt;p&gt;Once I had everything working, I decided to automate the process of building the SQLite database as well.&lt;/p&gt;
&lt;p&gt;I made the following changes to my GitHub Actions workflow:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;fetch-depth&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Needed by git-history&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Set up Python 3.10&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v4&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3.10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pip&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;pip install -r requirements.txt&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; env:&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; run&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build SQLite database&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        rm -f hmb.db # Recreate from scratch each time&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db one.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "one", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db two.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "two", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id --namespace item2&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I also added a &lt;code&gt;requirements.txt&lt;/code&gt; file containing just &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note how the &lt;code&gt;actions/checkout@v3&lt;/code&gt; step now has &lt;code&gt;fetch-depth: 0&lt;/code&gt; - this is necessary because &lt;code&gt;git-history&lt;/code&gt; needs to loop through the entire repository history, but &lt;code&gt;actions/checkout@v3&lt;/code&gt; defaults to only fetching the most recent commit.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;setup-python&lt;/code&gt; step uses &lt;code&gt;cache: "pip"&lt;/code&gt;, which causes it to cache installed dependencies from &lt;code&gt;requirements.txt&lt;/code&gt; between runs.&lt;/p&gt;
&lt;p&gt;Because that big &lt;code&gt;git-history&lt;/code&gt; step creates a &lt;code&gt;hmb.db&lt;/code&gt; SQLite database, the "Commit and push" step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.&lt;/p&gt;
&lt;p&gt;Normally I wouldn't do this, because Git isn't a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won't continue to be updated beyond the end of the pumpkin festival.&lt;/p&gt;
&lt;p&gt;End result: &lt;a href="https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"&gt;hmb.db is available&lt;/a&gt; in the GitHub repository.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-querying-it-using-datasette-lite" class="anchor" aria-hidden="true" href="#querying-it-using-datasette-lite"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Querying it using Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt; is my repackaged version of my Datasette server-side Python application which runs entirely in the user's browser, using WebAssembly.&lt;/p&gt;
&lt;p&gt;A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.&lt;/p&gt;
&lt;p&gt;These database files need to be served with CORS headers. Every file served by GitHub includes these headers!&lt;/p&gt;
&lt;p&gt;Which means the following URL can be used to open up the latest &lt;code&gt;hmb.db&lt;/code&gt; file directly in Datasette in the browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(This takes advantage of a &lt;a href="https://simonwillison.net/2022/Sep/16/weeknotes/" rel="nofollow"&gt;feature I added&lt;/a&gt; to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)&lt;/p&gt;
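&lt;p&gt;Datasette Lite implements that URL rewrite in JavaScript; here's a minimal Python sketch of the same transformation (the function name is my own, for illustration):&lt;/p&gt;

```python
def github_blob_to_raw(url):
    """Rewrite a GitHub file page URL to the corresponding raw file URL.

    https://github.com/OWNER/REPO/blob/BRANCH/PATH
      becomes
    https://raw.githubusercontent.com/OWNER/REPO/BRANCH/PATH
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        return url  # leave non-GitHub URLs alone
    parts = url[len(prefix):].split("/", 3)
    if len(parts) == 4 and parts[2] == "blob":
        owner, repo, _, rest = parts
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"
    return url

print(github_blob_to_raw(
    "https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"
))
# https://raw.githubusercontent.com/simonw/scrape-hmb-traffic/main/hmb.db
```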
&lt;p&gt;URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And finally... Datasette Lite &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/" rel="nofollow"&gt;has plugin support&lt;/a&gt;. Adding &lt;code&gt;&amp;amp;install=datasette-copyable&lt;/code&gt; to the URL adds the &lt;a href="https://datasette.io/plugins/datasette-copyable" rel="nofollow"&gt;datasette-copyable&lt;/a&gt; plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&amp;amp;install=datasette-copyable#/hmb.copyable?sql=with+item1+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%2C%0Aitem2+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%0Aselect%0A++item1.%2A%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&amp;amp;_table_format=github" rel="nofollow"&gt;an example&lt;/a&gt; of that plugin in action.&lt;/p&gt;
&lt;p&gt;This was a fun little project that brought together a whole bunch of things I've been working on over the past few years. Here's some more of my writing on these different techniques and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/series/git-scraping/" rel="nofollow"&gt;Git scraping&lt;/a&gt; is the key technique I'm using here to collect the data&lt;/li&gt;
&lt;li&gt;I've written a lot about &lt;a href="https://simonwillison.net/tags/githubactions/" rel="nofollow"&gt;GitHub Actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;These are my notes about &lt;a href="https://simonwillison.net/tags/githistory/" rel="nofollow"&gt;git-history&lt;/a&gt;, the tool I used to turn a commit history into a SQLite database&lt;/li&gt;
&lt;li&gt;Here's my series of posts about &lt;a href="https://simonwillison.net/series/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="natalie-downe"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="git-history"/><category term="datasette-lite"/><category term="half-moon-bay"/></entry><entry><title>Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022</title><link href="https://simonwillison.net/2022/Oct/16/half-moon-bay-pumpkin-festival-traffic/#atom-tag" rel="alternate"/><published>2022-10-16T03:56:51+00:00</published><updated>2022-10-16T03:56:51+00:00</updated><id>https://simonwillison.net/2022/Oct/16/half-moon-bay-pumpkin-festival-traffic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It’s the Half Moon Bay Pumpkin Festival this weekend... and its impact on the traffic between our little town of El Granada and Half Moon Bay—8 minutes drive away—is notorious. So I built a git scraper that archives estimated driving times from the Google Maps Navigation API, and used git-history to turn that scraped data into a SQLite database and visualize it on a chart.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1581493679738363904"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="git-scraping"/><category term="git-history"/><category term="half-moon-bay"/></entry><entry><title>Automatically opening issues when tracked file content changes</title><link href="https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-tag" rel="alternate"/><published>2022-04-28T17:18:14+00:00</published><updated>2022-04-28T17:18:14+00:00</updated><id>https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-tag</id><summary type="html">
    &lt;p&gt;I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.&lt;/p&gt;
&lt;h4&gt;Extracting GZipMiddleware from Starlette&lt;/h4&gt;
&lt;p&gt;Here's why I needed to solve this problem.&lt;/p&gt;
&lt;p&gt;I want to add gzip support to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; open source project. Datasette builds on the Python &lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; standard, and &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; provides an extremely well tested, robust &lt;a href="https://www.starlette.io/middleware/#gzipmiddleware"&gt;GZipMiddleware class&lt;/a&gt; that adds gzip support to any ASGI application. As with everything else in Starlette, it's &lt;em&gt;really&lt;/em&gt; good code.&lt;/p&gt;
&lt;p&gt;The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.&lt;/p&gt;
&lt;p&gt;So I decided to extract the &lt;code&gt;GZipMiddleware&lt;/code&gt; class into a separate Python package, under the same BSD license as Starlette itself.&lt;/p&gt;
&lt;p&gt;The result is my new &lt;a href="https://pypi.org/project/asgi-gzip/"&gt;asgi-gzip&lt;/a&gt; package, now available on PyPI.&lt;/p&gt;
&lt;h4&gt;What if Starlette fixes a bug?&lt;/h4&gt;
&lt;p&gt;The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the &lt;code&gt;GZipMiddleware&lt;/code&gt; class? How can I make sure to apply those same fixes to my extracted copy?&lt;/p&gt;
&lt;p&gt;As I thought about this challenge, I realized I had most of the solution already.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;&lt;/strong&gt; is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.&lt;/p&gt;
&lt;p&gt;It may seem redundant to do this against a file that already &lt;a href="https://github.com/encode/starlette/commits/master/starlette/middleware/gzip.py"&gt;lives in version control&lt;/a&gt; elsewhere - but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.&lt;/p&gt;
&lt;p&gt;I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.&lt;/p&gt;
&lt;p&gt;Since I already run all of my projects out of GitHub issues, automatically opening an issue against the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip repository&lt;/a&gt; would be ideal.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt; does exactly that: it implements the Git scraping pattern against the &lt;a href="https://github.com/encode/starlette/blob/master/starlette/middleware/gzip.py"&gt;gzip.py module&lt;/a&gt; in Starlette, and files an issue any time it detects changes to that file.&lt;/p&gt;
&lt;p&gt;Starlette haven't made any changes to that file since I started tracking it, so I created &lt;a href="https://github.com/simonw/issue-when-changed"&gt;a test repo&lt;/a&gt; to try this out.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/issue-when-changed/issues/3"&gt;one of the example issues&lt;/a&gt;. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/issue-when-changed.jpg" alt="Screenshot of an open issue page. The issues is titled &amp;quot;gzip.py was updated&amp;quot; and contains a visual diff showing the change to a file. A commit that references the issue is listed too." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The implementation is contained entirely in this &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt;. I kept it to a single file to make it easy to copy, paste, and adapt for other projects.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/actions/github-script"&gt;actions/github-script&lt;/a&gt;, which makes it easy to do things like file new issues using JavaScript.&lt;/p&gt;
&lt;p&gt;Here's a heavily annotated copy:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Track the Starlette version of this&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run on repo pushes, and if a user clicks the "run this action" button,&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; and on a schedule at 5:21am UTC every day&lt;/span&gt;
&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;21 5 * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Without this block I got this error when the action ran:&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; HttpError: Resource not accessible by integration&lt;/span&gt;
&lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to create issues&lt;/span&gt;
  &lt;span class="pl-ent"&gt;issues&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to commit back to the repository&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contents&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;check&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/github-script@v6&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Using env: here to demonstrate how an action like this can&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; be adjusted to take dynamic inputs&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;URL&lt;/span&gt;: &lt;span class="pl-s"&gt;https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py&lt;/span&gt;
        &lt;span class="pl-ent"&gt;FILE_NAME&lt;/span&gt;: &lt;span class="pl-s"&gt;tracking/gzip.py&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;script&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { URL, FILE_NAME } = process.env;&lt;/span&gt;
&lt;span class="pl-s"&gt;          // promisify pattern for getting an await version of child_process.exec&lt;/span&gt;
&lt;span class="pl-s"&gt;          const util = require("util");&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Used exec_ here because 'exec' variable name is already used:&lt;/span&gt;
&lt;span class="pl-s"&gt;          const exec_ = util.promisify(require("child_process").exec);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use curl to download the file&lt;/span&gt;
&lt;span class="pl-s"&gt;          await exec_(`curl -o ${FILE_NAME} ${URL}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use 'git diff' to detect if the file has changed since last time&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { stdout } = await exec_(`git diff ${FILE_NAME}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          if (stdout) {&lt;/span&gt;
&lt;span class="pl-s"&gt;            // There was a diff to that file&lt;/span&gt;
&lt;span class="pl-s"&gt;            const title = `${FILE_NAME} was updated`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            const body =&lt;/span&gt;
&lt;span class="pl-s"&gt;              `${URL} changed:` +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n\n```diff\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              stdout +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n```\n\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "Close this issue once those changes have been integrated here";&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issue = await github.rest.issues.create({&lt;/span&gt;
&lt;span class="pl-s"&gt;              owner: context.repo.owner,&lt;/span&gt;
&lt;span class="pl-s"&gt;              repo: context.repo.repo,&lt;/span&gt;
&lt;span class="pl-s"&gt;              title: title,&lt;/span&gt;
&lt;span class="pl-s"&gt;              body: body,&lt;/span&gt;
&lt;span class="pl-s"&gt;            });&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issueNumber = issue.data.number;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // Now commit and reference that issue number, so the commit shows up&lt;/span&gt;
&lt;span class="pl-s"&gt;            // listed at the bottom of the issue page&lt;/span&gt;
&lt;span class="pl-s"&gt;            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // https://til.simonwillison.net/github-actions/commit-if-file-changed&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.name "Automated"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.email "actions@users.noreply.github.com"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git add -A`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git commit -m "${commitMessage}" || exit 0`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git pull --rebase`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git push`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt; repository I keep the fetched &lt;code&gt;gzip.py&lt;/code&gt; file in a &lt;code&gt;tracking/&lt;/code&gt; directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.&lt;/p&gt;
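&lt;p&gt;Stripped of the GitHub-specific parts, the core of the pattern is: fetch the file, diff it against the tracked copy, and act only if something changed. Here's an illustrative Python sketch of that core (not the code the workflow itself runs, which shells out to &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;git diff&lt;/code&gt;):&lt;/p&gt;

```python
import difflib
from pathlib import Path

def detect_change(tracked_path, fetched_text):
    """Return a unified diff if fetched_text differs from the tracked copy
    on disk (creating or updating that copy), else an empty string."""
    tracked = Path(tracked_path)
    old = tracked.read_text() if tracked.exists() else ""
    diff = "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        fetched_text.splitlines(keepends=True),
        fromfile=f"a/{tracked.name}",
        tofile=f"b/{tracked.name}",
    ))
    tracked.write_text(fetched_text)
    return diff
```

In the workflow, a non-empty diff is the signal to file an issue, with the diff itself as the issue body.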
&lt;h4&gt;More interesting applications&lt;/h4&gt;
&lt;p&gt;I built this to solve my "tell me when Starlette update their &lt;code&gt;gzip.py&lt;/code&gt; file" problem, but clearly this pattern has much more interesting uses.&lt;/p&gt;
&lt;p&gt;You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.&lt;/p&gt;
&lt;p&gt;There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!&lt;/p&gt;
&lt;h4&gt;Update: October 13th 2022&lt;/h4&gt;
&lt;p&gt;Almost six months after writing about this... it triggered for the first time!&lt;/p&gt;
&lt;p&gt;Here's the issue that the script opened: &lt;a href="https://github.com/simonw/asgi-gzip/issues/4"&gt;#4: tracking/gzip.py was updated&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.2"&gt;version 0.2&lt;/a&gt; of the package.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="gzip"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="asgi"/><category term="github-actions"/><category term="git-scraping"/><category term="github-issues"/></entry><entry><title>Scraping web pages from the command line with shot-scraper</title><link href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-tag" rel="alternate"/><published>2022-03-14T01:29:56+00:00</published><updated>2022-03-14T01:29:56+00:00</updated><id>https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;I've added a powerful new capability to my &lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt; command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.&lt;/p&gt;
&lt;p&gt;Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.&lt;/p&gt;
&lt;p&gt;It's also a really neat web scraping tool.&lt;/p&gt;
&lt;h4&gt;shot-scraper&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;introduced shot-scraper&lt;/a&gt; last Thursday. It's a Python utility that wraps &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/simonwillison-net.png" alt="Screenshot of my blog homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Since Thursday &lt;code&gt;shot-scraper&lt;/code&gt; has had &lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;a flurry of releases&lt;/a&gt;, adding features like &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#saving-a-web-page-to-pdf"&gt;PDF exports&lt;/a&gt;, the ability to dump the Chromium &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#dumping-out-an-accessibility-tree"&gt;accessibility tree&lt;/a&gt; and the ability to take screenshots of &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#websites-that-need-authentication"&gt;authenticated web pages&lt;/a&gt;. But the most exciting new feature landed today.&lt;/p&gt;
&lt;h4&gt;Executing JavaScript and returning the result&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.9"&gt;Release 0.9&lt;/a&gt; takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or you can return a JSON object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to use functions like &lt;code&gt;setTimeout()&lt;/code&gt; - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript datasette.io "
new Promise(done =&amp;gt; setTimeout(
  () =&amp;gt; {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Test page title&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    shot-scraper javascript datasette.io "&lt;/span&gt;
&lt;span class="pl-s"&gt;      if (document.title != 'Datasette') {&lt;/span&gt;
&lt;span class="pl-s"&gt;        throw 'Wrong title detected';&lt;/span&gt;
&lt;span class="pl-s"&gt;      }"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="scrape-a-web-page"&gt;Using this to scrape a web page&lt;/h4&gt;
&lt;p&gt;The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.&lt;/p&gt;
&lt;p&gt;Posts from my blog occasionally show up on &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; - sometimes I spot them, sometimes I don't.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/from?site=simonwillison.net"&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;/a&gt; is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/news-ycombinator-com-from.png" alt="Screenshot of the Hacker News listing for my domain" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So... let's write a scraper for it.&lt;/p&gt;
&lt;p&gt;I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.athing'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;title&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.score'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;submitter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.hnuser'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'?id='&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Only posts with comments have a comments link&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
    &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
  &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;filter&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;includes&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'comment'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;id&lt;span class="pl-kos"&gt;,&lt;/span&gt; title&lt;span class="pl-kos"&gt;,&lt;/span&gt; url&lt;span class="pl-kos"&gt;,&lt;/span&gt; dt&lt;span class="pl-kos"&gt;,&lt;/span&gt; points&lt;span class="pl-kos"&gt;,&lt;/span&gt; submitter&lt;span class="pl-kos"&gt;,&lt;/span&gt; commentsUrl&lt;span class="pl-kos"&gt;,&lt;/span&gt; numComments&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.&lt;/p&gt;
&lt;p&gt;I'm using &lt;code&gt;document.querySelectorAll('.athing')&lt;/code&gt; to loop through each element that matches that selector.&lt;/p&gt;
&lt;p&gt;I wrap that in &lt;code&gt;Array.from(...)&lt;/code&gt;, passing a mapping function as the second argument. That function is called once for each element, and extracts the details I need.&lt;/p&gt;
&lt;p&gt;The resulting array contains 30 items that look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Track changes to CLI tools by recording their help output&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://simonwillison.net/2022/Feb/2/help-scraping/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dt"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2022-03-13T05:36:13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"submitter"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;appwiz&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"commentsUrl"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://news.ycombinator.com/item?id=30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"numComments"&lt;/span&gt;: &lt;span class="pl-c1"&gt;19&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Running it with shot-scraper&lt;/h4&gt;
&lt;p&gt;Now that I have a recipe for a scraper, I can run it in the terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;shot-scraper javascript &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('.athing'), el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;  const title = el.querySelector('.titleline a').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const points = parseInt(el.nextSibling.querySelector('.score').innerText);&lt;/span&gt;
&lt;span class="pl-s"&gt;  const url = el.querySelector('.titleline a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const dt = el.nextSibling.querySelector('.age').title;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const submitter = el.nextSibling.querySelector('.hnuser').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsUrl = el.nextSibling.querySelector('.age a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const id = commentsUrl.split('?id=')[1];&lt;/span&gt;
&lt;span class="pl-s"&gt;  // Only posts with comments have a comments link&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsLink = Array.from(&lt;/span&gt;
&lt;span class="pl-s"&gt;    el.nextSibling.querySelectorAll('a')&lt;/span&gt;
&lt;span class="pl-s"&gt;  ).filter(el =&amp;gt; el &amp;amp;&amp;amp; el.innerText.includes('comment'))[0];&lt;/span&gt;
&lt;span class="pl-s"&gt;  let numComments = 0;&lt;/span&gt;
&lt;span class="pl-s"&gt;  if (commentsLink) {&lt;/span&gt;
&lt;span class="pl-s"&gt;    numComments = parseInt(commentsLink.innerText.split()[0]);&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;/span&gt;
&lt;span class="pl-s"&gt;  return {id, title, url, dt, points, submitter, commentsUrl, numComments};&lt;/span&gt;
&lt;span class="pl-s"&gt;})&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; simonwillison-net.json&lt;/pre&gt;&lt;/div&gt;  
&lt;p&gt;&lt;code&gt;simonwillison-net.json&lt;/code&gt; is now a JSON file containing the scraped data.&lt;/p&gt;
&lt;h4&gt;Running the scraper in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I want to keep track of changes to this data structure over time. My preferred technique for that is something I call &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - the core idea is to keep the data in a Git repository and commit a new copy any time it changes. This provides a cheap and robust history of those changes.&lt;/p&gt;
&lt;p&gt;Running the scraper in GitHub Actions means I don't need to administer my own server to keep this running.&lt;/p&gt;
&lt;p&gt;So I built exactly that, in the &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain"&gt;simonw/scrape-hacker-news-by-domain&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;The GitHub Actions workflow is in &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/blob/485841482a39869759e39f4d8dee21b9adc963d7/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt;. It runs the above command once an hour, then pushes a commit back to the repository if the file has changed since the last run.&lt;/p&gt;
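&lt;p&gt;A minimal sketch of what that kind of workflow looks like - the exact action versions, cron schedule and the idea of keeping the JavaScript in a &lt;code&gt;scrape.js&lt;/code&gt; file here are illustrative assumptions, not the real workflow linked above:&lt;/p&gt;

```yaml
name: Scrape Hacker News links
on:
  workflow_dispatch:
  schedule:
    # Once an hour
    - cron: '32 * * * *'
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install shot-scraper and its browser
        run: |
          pip install shot-scraper
          shot-scraper install
      - name: Run the scraper
        run: shot-scraper javascript 'https://news.ycombinator.com/from?site=simonwillison.net' "$(cat scrape.js)" > simonwillison-net.json
      - name: Commit and push if anything changed
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git diff --quiet --cached || git commit -m "Latest data: $(date -u)"
          git push
```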
&lt;p&gt;The &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json"&gt;commit history of simonwillison-net.json&lt;/a&gt; will show me any time a new link from my site appears on Hacker News, or a comment is added.&lt;/p&gt;
&lt;p&gt;(Fun GitHub trick: add &lt;code&gt;.atom&lt;/code&gt; to the end of that URL to get &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json.atom"&gt;an Atom feed of those commits&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.&lt;/p&gt;
&lt;p&gt;I can see myself using this technique &lt;em&gt;a lot&lt;/em&gt; in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="github"/><category term="hacker-news"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper: automated screenshots for documentation, built on Playwright</title><link href="https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-tag" rel="alternate"/><published>2022-03-10T00:13:30+00:00</published><updated>2022-03-10T00:13:30+00:00</updated><id>https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt; is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; and &lt;a href="https://simonwillison.net/2022/Feb/2/help-scraping/"&gt;help scraping&lt;/a&gt; techniques.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 13th March 2022:&lt;/strong&gt; The new &lt;code&gt;shot-scraper javascript&lt;/code&gt; command can now be used to &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/"&gt;scrape web pages from the command line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 14th October 2022:&lt;/strong&gt; &lt;a href="https://simonwillison.net/2022/Oct/14/automating-screenshots/"&gt;Automating screenshots for the Datasette documentation using shot-scraper&lt;/a&gt; offers a tutorial introduction to using the tool.&lt;/p&gt;
&lt;h4&gt;The problem&lt;/h4&gt;
&lt;p&gt;I like to include screenshots in documentation. I recently &lt;a href="https://simonwillison.net/2022/Feb/27/datasette-tutorials/"&gt;started writing end-user tutorials&lt;/a&gt; for Datasette, which are particularly image heavy (&lt;a href="https://datasette.io/tutorials/explore"&gt;for example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.&lt;/p&gt;
&lt;h4&gt;Introducing shot-scraper&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is a tool for automating this process. You can install it using &lt;code&gt;pip&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install shot-scraper
shot-scraper install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That second &lt;code&gt;shot-scraper install&lt;/code&gt; line will install the browser it needs to do its job - more on that later.&lt;/p&gt;
&lt;p&gt;You can use it in two ways. To take a one-off screenshot, you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://simonwillison.net/ -o simonwillison.png
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://simonwillison.net/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;simonwillison.png&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.example.com/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;height&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;quality&lt;/span&gt;: &lt;span class="pl-c1"&gt;80&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.jpg&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then use &lt;code&gt;shot-scraper multi&lt;/code&gt; to execute every screenshot in one go:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html"&gt;The documentation&lt;/a&gt; describes all of the available options you can use when taking a screenshot.&lt;/p&gt;
&lt;p&gt;Each option can be provided to the &lt;code&gt;shot-scraper&lt;/code&gt; one-off tool, or can be embedded in the YAML file for use with &lt;code&gt;shot-scraper multi&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;JavaScript and CSS selectors&lt;/h4&gt;
&lt;p&gt;The default behaviour for &lt;code&gt;shot-scraper&lt;/code&gt; is to take a full page screenshot, using a browser width of 1280px.&lt;/p&gt;
&lt;p&gt;For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--selector&lt;/code&gt; option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.&lt;/p&gt;
&lt;p&gt;What if you want to modify the page in addition to selecting a specific area?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--javascript&lt;/code&gt; option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.&lt;/p&gt;
&lt;p&gt;The combination of these two options - also available as &lt;code&gt;javascript:&lt;/code&gt; and &lt;code&gt;selector:&lt;/code&gt; keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.&lt;/p&gt;
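&lt;p&gt;For example, a &lt;code&gt;shots.yml&lt;/code&gt; entry that hides a page's navigation before capturing just the main content area could look like this - the &lt;code&gt;nav&lt;/code&gt; and &lt;code&gt;.content&lt;/code&gt; selectors are hypothetical, substitute ones that exist on the page you're shooting:&lt;/p&gt;

```yaml
- url: https://www.example.com/
  # Runs after load, before the screenshot is taken
  javascript: |
    document.querySelector('nav').style.display = 'none';
  # Capture only this element, not the full page
  selector: .content
  output: content.png
```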
&lt;h4 id="a-complex-example"&gt;A complex example&lt;/h4&gt;
&lt;p&gt;To prove to myself that the tool works, I decided to try replicating this screenshot from &lt;a href="https://datasette.io/tutorials/explore"&gt;my tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I made the original using &lt;a href="https://cleanshot.com/"&gt;CleanShot X&lt;/a&gt;, manually adding the two pink arrows:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/select-facets-original.jpg" alt="A screenshot of a portion of the table interface in Datasette, with a menu open and two pink arrows pointing to menu items" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is pretty tricky!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It's not &lt;a href="https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez"&gt;this whole page&lt;/a&gt;, just a subset of the page&lt;/li&gt;
&lt;li&gt;The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot&lt;/li&gt;
&lt;li&gt;There are two pink arrows superimposed on the image&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I decided to use just one arrow for the moment, which should hopefully result in a clearer image.&lt;/p&gt;
&lt;p&gt;I started by &lt;a href="https://github.com/simonw/shot-scraper/issues/9#issuecomment-1063314278"&gt;creating my own pink arrow SVG&lt;/a&gt; using Figma:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/pink-arrow.png" alt="A big pink arrow, with a drop shadow" style="width: 200px; max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.&lt;/p&gt;
&lt;p&gt;With the JavaScript figured out, I pasted it into a YAML file called &lt;code&gt;shot.yml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez&lt;/span&gt;
  &lt;span class="pl-ent"&gt;javascript&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;    new Promise(resolve =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Run in a promise so we can sleep 1s at the end&lt;/span&gt;
&lt;span class="pl-s"&gt;      function remove(el) { el.parentNode.removeChild(el);}&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove header and footer&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('header'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('footer'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove most of the children of .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('.content &amp;gt; *:not(.table-wrapper,.suggested-facets)')).map(remove)&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Bit of breathing room for the screenshot&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.body.style.marginTop = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a bit of padding to .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      var content = document.querySelector('.content');&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.width = '820px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.padding = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Open the menu - it's an SVG so we need to use dispatchEvent here&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove all but table header and first 11 rows&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a pink SVG arrow&lt;/span&gt;
&lt;span class="pl-s"&gt;      let div = document.createElement('div');&lt;/span&gt;
&lt;span class="pl-s"&gt;      div.innerHTML = `&amp;lt;svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;g filter="url(#a)"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/g&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feFlood flood-opacity="0" result="BackgroundImageFix"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feOffset dy="4"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feGaussianBlur stdDeviation="2"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feComposite in2="hardAlpha" operator="out"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;/filter&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;/svg&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;      let svg = div.firstChild;&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.appendChild(svg);&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.position = 'relative';&lt;/span&gt;
&lt;span class="pl-s"&gt;      svg.style.position = 'absolute';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Give the menu time to finish fading in&lt;/span&gt;
&lt;span class="pl-s"&gt;      setTimeout(() =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        // Position arrow pointing to the 'facet by this' menu item&lt;/span&gt;
&lt;span class="pl-s"&gt;        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.left = (pos.left - pos.width) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.top = (pos.top - 20) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        resolve();&lt;/span&gt;
&lt;span class="pl-s"&gt;      }, 1000);&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;annotated-screenshot.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;selector&lt;/span&gt;: &lt;span class="pl-s"&gt;.content&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this command to generate the screenshot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shot.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated &lt;code&gt;annotated-screenshot.png&lt;/code&gt; image looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/annotated-screenshot.png" alt="A screenshot of the table with the menu open and a single pink arrow pointing to the 'facet by this' menu item" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm pretty happy with this! I think it works very well as a proof of concept for the process.&lt;/p&gt;
&lt;h4 id="how-it-works-playwright"&gt;How it works: Playwright&lt;/h4&gt;
&lt;p&gt;I built the &lt;a href="https://github.com/simonw/shot-scraper/tree/44995cd45ca6c56d34c5c3d131217f7b9170f6f7"&gt;first prototype&lt;/a&gt; of &lt;code&gt;shot-scraper&lt;/code&gt; using Puppeteer, because I had &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/"&gt;used that before&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I noticed that the &lt;a href="https://www.npmjs.com/package/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt; package I was using hadn't had an update in two years, which reminded me to check out Playwright.&lt;/p&gt;
&lt;p&gt;I've been looking for an excuse to learn &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt; for a while now, and this project turned out to be ideal.&lt;/p&gt;
&lt;p&gt;Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.&lt;/p&gt;
&lt;p&gt;Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.&lt;/p&gt;
&lt;p&gt;The second prototype used the &lt;a href="https://github.com/simonw/shot-scraper/tree/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0"&gt;Playwright CLI utility&lt;/a&gt; instead, &lt;a href="https://github.com/simonw/shot-scraper/blob/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0/shot_scraper/cli.py#L39-L50"&gt;executed via npx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
    [
        &lt;span class="pl-s"&gt;"npx"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"playwright"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"--full-page"&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;output&lt;/span&gt;,
    ],
    &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
)&lt;/pre&gt;
&lt;p&gt;This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.&lt;/p&gt;
&lt;p&gt;I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official &lt;a href="https://playwright.dev/python/docs/intro"&gt;Playwright for Python&lt;/a&gt; package.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install playwright
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.&lt;/p&gt;
&lt;p&gt;I was curious how they pulled this off, so I dug inside the &lt;code&gt;playwright&lt;/code&gt; Python package in my &lt;code&gt;site-packages&lt;/code&gt; folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.&lt;/p&gt;
&lt;p&gt;Thanks to Playwright, the entire implementation of &lt;code&gt;shot-scraper&lt;/code&gt; is currently just &lt;a href="https://github.com/simonw/shot-scraper/blob/0.3/shot_scraper/cli.py"&gt;181 lines of Python code&lt;/a&gt; - it's all glue code tying together a &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; CLI interface with some code that calls Playwright to do the actual work.&lt;/p&gt;
&lt;p&gt;I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; Electron application.&lt;/p&gt;
&lt;h4&gt;Hooking shot-scraper up to GitHub Actions&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;shot-scraper&lt;/code&gt; very much with GitHub Actions in mind.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-demo"&gt;shot-scraper-demo&lt;/a&gt; repository is my first live demo of the tool.&lt;/p&gt;
&lt;p&gt;Once a day, it runs &lt;a href="https://github.com/simonw/shot-scraper-demo/blob/3fdd9d3e79f95d9d396aeefd5bf65e85a7700ef4/.github/workflows/shots.yml"&gt;this shots.yml&lt;/a&gt; file, generates two screenshots and commits them back to the repository.&lt;/p&gt;
&lt;p&gt;One of them is the tutorial screenshot described above.&lt;/p&gt;
&lt;p&gt;The other is a screenshot of the list of "recently spotted owls" from &lt;a href="https://www.owlsnearme.com/?place=127871"&gt;this page&lt;/a&gt; on &lt;a href="https://www.owlsnearme.com/"&gt;owlsnearme.com&lt;/a&gt;. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.&lt;/p&gt;
&lt;p&gt;I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, &lt;a href="https://github.com/simonw/shot-scraper-demo/commit/bc86510f49b6f8d6728c9f1880b999c83361dd5a#diff-897c3444fbbb2033cbba5840da4994d01c3f396e0cdf4b0613d7f410db9887e0"&gt;like this one&lt;/a&gt; (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).&lt;/p&gt;
&lt;p&gt;Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I had ambitious plans to add utilities to the tool that would &lt;a href="https://github.com/simonw/shot-scraper/issues/9"&gt;help with annotations&lt;/a&gt;, such as adding pink arrows and drawing circles around different elements on the page.&lt;/p&gt;
&lt;p&gt;I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.&lt;/p&gt;
&lt;p&gt;So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.&lt;/p&gt;
&lt;p&gt;I'm also very interested to see what kinds of things other people use this for.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="documentation"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="puppeteer"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
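&lt;p&gt;The linked workflow has the full details; the core loop can be sketched in a few lines of Python (the &lt;code&gt;scrape_help&lt;/code&gt; helper and the file naming here are illustrative, not the repository's actual code):&lt;/p&gt;

```python
import subprocess
from pathlib import Path


def scrape_help(commands, out_dir):
    """Record each command's --help output to a .txt file.

    `commands` is a list of argument lists, e.g. [["aws", "s3", "cp"]].
    Illustrative sketch - the real loop lives in simonw/help-scraper.
    """
    for args in commands:
        path = Path(out_dir) / ("-".join(args) + ".txt")
        path.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(args + ["--help"], capture_output=True, text=True)
        # Some tools print their help to stderr instead of stdout
        path.write_text(result.stdout or result.stderr)
```

&lt;p&gt;Run something like this daily from a scheduled GitHub Actions workflow that commits the resulting files, and the Git history becomes the changelog.&lt;/p&gt;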
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you check out the previous version, showing that 11 commands were added in this morning's new release alone.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awsapichanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history&lt;/a&gt; of the changes they are making there too. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind-the-scenes material from the movie The Mitchells vs. the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>Weeknotes: Shaving some beautiful yaks</title><link href="https://simonwillison.net/2021/Dec/1/beautiful-yaks/#atom-tag" rel="alternate"/><published>2021-12-01T03:43:18+00:00</published><updated>2021-12-01T03:43:18+00:00</updated><id>https://simonwillison.net/2021/Dec/1/beautiful-yaks/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been mostly &lt;a href="https://en.wiktionary.org/wiki/yak_shaving"&gt;shaving yaks&lt;/a&gt; this week - two in particular: the Datasette table refactor and the next release of &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;. I also built and released my first Web Component!&lt;/p&gt;
&lt;h4&gt;A Web Component for embedding Datasette tables&lt;/h4&gt;
&lt;p&gt;A longer term goal that I have for Datasette is to figure out a good way of using it to build dashboards, tying together summaries and visualizations of the latest data from a bunch of different sources.&lt;/p&gt;
&lt;p&gt;I'm excited about the potential of &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Web_Components"&gt;Web Components&lt;/a&gt; to help solve this problem.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/datasette-notebook"&gt;datasette-notebook&lt;/a&gt; project is a &lt;em&gt;very&lt;/em&gt; early experiment in this direction: it's a Datasette notebook that provides a Markdown wiki (persisted to SQLite) to which I plan to add the ability to embed tables and visualizations in wiki pages - forming a hybrid of a wiki, dashboarding system and Notion/Airtable-style database.&lt;/p&gt;
&lt;p&gt;It does almost none of those things right now, which is why I've not really talked about it here.&lt;/p&gt;
&lt;p&gt;Web Components offer a standards-based mechanism for creating custom HTML tags. Imagine being able to embed a Datasette table on a page by adding the following to your HTML:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;datasette-table&lt;/span&gt;
    &lt;span class="pl-c1"&gt;url&lt;/span&gt;="&lt;span class="pl-s"&gt;https://global-power-plants.datasettes.com/global-power-plants/global-power-plants.json&lt;/span&gt;"
&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;datasette-table&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's exactly what &lt;a href="https://github.com/simonw/datasette-table"&gt;datasette-table&lt;/a&gt; lets you do! Here's &lt;a href="https://simonw.github.io/datasette-table/"&gt;a demo&lt;/a&gt; of it in action.&lt;/p&gt;
&lt;p&gt;This is version 0.1.0 - it works, but I've not even started to flesh it out.&lt;/p&gt;
&lt;p&gt;I did learn a bunch of things building it though: it's my first Web Component, my first time using &lt;a href="https://lit.dev/"&gt;Lit&lt;/a&gt;, my first time using &lt;a href="https://vitejs.dev/"&gt;Vite&lt;/a&gt; and the first JavaScript library I've ever packaged and &lt;a href="https://www.npmjs.com/package/datasette-table"&gt;published to npm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's a detailed TIL on &lt;a href="https://til.simonwillison.net/npm/publish-web-component"&gt;Publishing a Web Component to npm&lt;/a&gt; encapsulating everything I've learned from this project so far.&lt;/p&gt;
&lt;p&gt;This is also my first piece of yak shaving this week: I built this partly to make progress on &lt;code&gt;datasette-notebook&lt;/code&gt;, but also because my big Datasette refactor involves finalizing the design of the JSON API for version 1.0. I realized that I don't actually have a project that makes full use of that API, which has been hindering my attempts to redesign it. Having one or more Web Components that consume the API will be a fantastic way for me to eat my own dog food.&lt;/p&gt;
&lt;h4&gt;Link: rel="alternate" for Datasette tables&lt;/h4&gt;
&lt;p&gt;Here's an interesting problem that came up while I was working on the &lt;code&gt;datasette-table&lt;/code&gt; component.&lt;/p&gt;
&lt;p&gt;As designed right now, you need to figure out the JSON URL for a table and pass that to the component.&lt;/p&gt;
&lt;p&gt;This is &lt;em&gt;usually&lt;/em&gt; a case of adding &lt;code&gt;.json&lt;/code&gt; to the path, while preserving any query string parameters - but there's a nasty edge-case: if your SQLite table itself ends with the string &lt;code&gt;.json&lt;/code&gt; (which could happen! Especially since Datasette promises to work with any existing SQLite database) the URL becomes this instead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/mydb/table.json?_format=json
&lt;/code&gt;&lt;/pre&gt;
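&lt;p&gt;That derivation rule is simple enough to capture in a hypothetical helper function (my own sketch, not part of Datasette itself):&lt;/p&gt;

```python
from urllib.parse import urlsplit, urlunsplit


def json_url(table_url):
    """Derive the JSON API URL for a Datasette table page.

    Mirrors the rule described above: append ".json" to the path,
    unless the table name itself already ends in ".json", in which
    case fall back to the ?_format=json query string parameter.
    """
    scheme, netloc, path, query, fragment = urlsplit(table_url)
    if path.endswith(".json"):
        # /mydb/table.json -> /mydb/table.json?_format=json
        query = (query + "&" if query else "") + "_format=json"
    else:
        path = path + ".json"
    return urlunsplit((scheme, netloc, path, query, fragment))
```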
&lt;p&gt;Telling users of my component that they need to first construct the JSON URL for their page isn't the best experience: I'd much rather let people paste in the URL to the HTML version and derive the JSON from that.&lt;/p&gt;
&lt;p&gt;This is made more complex by the fact that, thanks to &lt;code&gt;--cors&lt;/code&gt;, the Web Component can be embedded on any page. And for &lt;code&gt;datasette-notebook&lt;/code&gt; I'd like to provide a feature where any URLs to Datasette instances - no matter where they are hosted - are turned into embedded tables automatically.&lt;/p&gt;
&lt;p&gt;To do this, I need an efficient way to tell that an arbitrary URL corresponds to a Datasette table.&lt;/p&gt;
&lt;p&gt;My latest idea here is to use a combination of HTTP &lt;code&gt;HEAD&lt;/code&gt; requests and a &lt;code&gt;Link: rel="alternate"&lt;/code&gt; header - something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % curl -I 'https://latest.datasette.io/fixtures/compound_three_primary_keys'
HTTP/1.1 200 OK
date: Sat, 27 Nov 2021 20:09:36 GMT
server: uvicorn
Link: https://latest.datasette.io/fixtures/compound_three_primary_keys.json; rel="alternate"; type="application/datasette+json"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This would allow a (hopefully fast) &lt;code&gt;fetch()&lt;/code&gt; call from JavaScript to confirm that a URL is a Datasette table, and get back the JSON that should be fetched by the component in order to render it on the page.&lt;/p&gt;
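&lt;p&gt;Parsing that header out of a &lt;code&gt;HEAD&lt;/code&gt; response could look something like this - a sketch, with the function name my own invention (a real component would first issue the &lt;code&gt;fetch()&lt;/code&gt; call and read the header from the response):&lt;/p&gt;

```python
import re


def datasette_json_alternate(link_header):
    """Extract the application/datasette+json alternate URL from a
    Link header value, if present. Returns None otherwise."""
    if not link_header:
        return None
    for part in link_header.split(","):
        # Each part looks like: <url>; rel="alternate"; type="..."
        # (the example above omits the angle brackets, so treat them
        # as optional here)
        match = re.match(r'\s*<?([^>;]+)>?\s*;\s*(.*)', part)
        if not match:
            continue
        url, params = match.groups()
        if 'rel="alternate"' in params and 'type="application/datasette+json"' in params:
            return url.strip()
    return None
```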
&lt;p&gt;I have a prototype of this in &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;Datasette issue #1533&lt;/a&gt;. I think it's a promising approach!&lt;/p&gt;
&lt;p&gt;It's also now part of the ever-growing table refactor. Adding custom headers to page responses is currently far harder than it should be.&lt;/p&gt;
&lt;h4&gt;sqlite-utils STRICT tables&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.sqlite.org/releaselog/3_37_0.html"&gt;SQLite 3.37.0&lt;/a&gt; came out at the weekend with a long-awaited feature: &lt;a href="https://www.sqlite.org/stricttables.html"&gt;STRICT tables&lt;/a&gt;, which enforce column types such that you get an error if you try to insert a string into an integer column.&lt;/p&gt;
&lt;p&gt;(This has been a long-standing complaint about SQLite by people who love strong typing, and D. Richard Hipp finally shipped the change for them with some salty release notes saying it's "for developers who prefer that kind of thing.")&lt;/p&gt;
&lt;p&gt;I started researching how to add support for this to my &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html"&gt;sqlite-utils Python library&lt;/a&gt;. You can follow my thinking in &lt;a href="https://github.com/simonw/sqlite-utils/issues/344"&gt;sqlite-utils issue #344&lt;/a&gt; - I'm planning to add a &lt;code&gt;strict=True&lt;/code&gt; option to methods that create tables, but for the moment I've shipped &lt;a href="https://github.com/simonw/sqlite-utils/commit/e3f108e0f339e3d87ce48541bbca8f891bfaf040"&gt;new introspection properties&lt;/a&gt; for seeing if a table uses strict mode or not.&lt;/p&gt;
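&lt;p&gt;Here's a minimal demonstration of what STRICT enforcement looks like from Python's &lt;code&gt;sqlite3&lt;/code&gt; module, guarded for older bundled SQLite versions that don't support the keyword:&lt;/p&gt;

```python
import sqlite3

# STRICT tables require SQLite 3.37.0+; on older versions the STRICT
# keyword is a syntax error, so check the library version first.
strict_enforced = None
if sqlite3.sqlite_version_info >= (3, 37, 0):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE counts (name TEXT, value INTEGER) STRICT")
    db.execute("INSERT INTO counts VALUES ('a', 1)")  # types match: fine
    try:
        # A regular table would happily store this string in the
        # INTEGER column; a STRICT table rejects it with an error.
        db.execute("INSERT INTO counts VALUES ('b', 'not a number')")
        strict_enforced = False
    except sqlite3.DatabaseError:
        strict_enforced = True
```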
&lt;h4&gt;git-history update&lt;/h4&gt;
&lt;p&gt;My other big yak this week has been work on &lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;. I'm determined to get it into a stable state such that I can write it up, produce a tutorial and maybe produce a video demonstration as well - but I keep on finding things I want to change about how it works.&lt;/p&gt;
&lt;p&gt;The big challenge is how to most effectively represent the history of a bunch of different items over time in a relational database schema.&lt;/p&gt;
&lt;p&gt;I started with a &lt;code&gt;item&lt;/code&gt; table that presents just the most recent version of each item, and an &lt;code&gt;item_version&lt;/code&gt; table with a row for every subsequent version.&lt;/p&gt;
&lt;p&gt;That table got pretty big, with vast amounts of duplicated data in it.&lt;/p&gt;
&lt;p&gt;So I've been working on an optimization where columns are only included in an &lt;code&gt;item_version&lt;/code&gt; row &lt;a href="https://github.com/simonw/git-history/issues/21"&gt;if they have changed since the previous version&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The problem there is what to do about &lt;code&gt;null&lt;/code&gt; - does &lt;code&gt;null&lt;/code&gt; mean "this column didn't change" or does it mean "this column was set from some other value back to &lt;code&gt;null&lt;/code&gt;"?&lt;/p&gt;
&lt;p&gt;After a few different attempts I've decided to solve this with a many-to-many table, so for any row in the &lt;code&gt;item_version&lt;/code&gt; table you can see which columns were explicitly changed by that version.&lt;/p&gt;
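&lt;p&gt;A sketch of that schema idea, with illustrative table and column names rather than git-history's actual schema:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE columns (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE item (id INTEGER PRIMARY KEY, item_key TEXT UNIQUE);
CREATE TABLE item_version (
    id INTEGER PRIMARY KEY,
    item_id INTEGER REFERENCES item(id),
    version INTEGER,
    status TEXT  -- NULL here is ambiguous on its own...
);
-- ...so this many-to-many table records which columns each version
-- explicitly set, distinguishing "unchanged" from "set back to NULL"
CREATE TABLE version_columns (
    version_id INTEGER REFERENCES item_version(id),
    column_id INTEGER REFERENCES columns(id),
    PRIMARY KEY (version_id, column_id)
);
""")

db.execute("INSERT INTO columns (id, name) VALUES (1, 'status')")
db.execute("INSERT INTO item (id, item_key) VALUES (1, 'incident-42')")
# Version 2 explicitly set status back to NULL - the m2m row says so:
db.execute(
    "INSERT INTO item_version (id, item_id, version, status) VALUES (2, 1, 2, NULL)"
)
db.execute("INSERT INTO version_columns (version_id, column_id) VALUES (2, 1)")

changed = [row[0] for row in db.execute("""
    SELECT columns.name FROM version_columns
    JOIN columns ON columns.id = version_columns.column_id
    WHERE version_id = 2
""")]
```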
&lt;p&gt;This is all working pretty nicely now, but still needs documentation, and tests, and then a solid write-up and tutorial and demos and a video... hopefully tomorrow!&lt;/p&gt;
&lt;p&gt;One of my design decisions for this tool has been to use an underscore prefix for "reserved columns", such that non-reserved columns can be safely used by the arbitrary data that is being tracked by the tool.&lt;/p&gt;
&lt;p&gt;Having columns with names like &lt;code&gt;_id&lt;/code&gt; and &lt;code&gt;_item&lt;/code&gt; has highlighted several bugs with Datasette's handling of these column names, since Datasette itself tries to use things like &lt;code&gt;?_search=&lt;/code&gt; for special query string parameters. I released &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-4"&gt;Datasette 0.59.4&lt;/a&gt; with some relevant fixes.&lt;/p&gt;
&lt;h4&gt;A beautiful yak&lt;/h4&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/yak.jpg" alt="A very beautiful yak" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As a consummate yak shaver, this beautiful yak that &lt;a href="https://www.reddit.com/r/interestingasfuck/comments/qtpm0x/this_white_yak_in_tibet/"&gt;showed up on Reddit&lt;/a&gt; a few weeks ago has me absolutely delighted.  I've not been able to determine the photography credit.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.7"&gt;0.7&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;7 releases total&lt;/a&gt;) - 2021-11-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.4"&gt;0.59.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;102 releases total&lt;/a&gt;) - 2021-11-30
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-table"&gt;datasette-table&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-table/releases/tag/0.1.0"&gt;0.1.0&lt;/a&gt; - 2021-11-28
&lt;br /&gt;A Web Component for embedding a Datasette table on a page&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/caddy/pause-retry-traffic"&gt;Pausing traffic and retrying in Caddy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/npm/publish-web-component"&gt;Publishing a Web Component to npm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/datasette/reuse-click-for-register-commands"&gt;Reusing an existing Click tool with register_commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/ignore-both-flake8-and-mypy"&gt;Ignoring a line in both flake8 and mypy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/npm"&gt;npm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="npm"/><category term="datasette"/><category term="web-components"/><category term="weeknotes"/><category term="git-scraping"/></entry><entry><title>Weeknotes: Apache proxies in Docker containers, refactoring Datasette</title><link href="https://simonwillison.net/2021/Nov/22/apache-proxies-datasette/#atom-tag" rel="alternate"/><published>2021-11-22T05:43:44+00:00</published><updated>2021-11-22T05:43:44+00:00</updated><id>https://simonwillison.net/2021/Nov/22/apache-proxies-datasette/#atom-tag</id><summary type="html">
    &lt;p&gt;Updates to six major projects this week, plus finally some concrete progress towards Datasette 1.0.&lt;/p&gt;
&lt;h4&gt;Fixing Datasette's proxy bugs&lt;/h4&gt;
&lt;p&gt;Now that Datasette has had its fourth birthday I've decided to really push towards hitting &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;the 1.0 milestone&lt;/a&gt;. The key property of that release will be a stable JSON API, stable plugin hooks and a stable, documented context for custom templates. There's quite a lot of mostly unexciting work needed to get there.&lt;/p&gt;
&lt;p&gt;As I work through the issues in that milestone I'm encountering some that I filed more than two years ago!&lt;/p&gt;
&lt;p&gt;Two of those made it into the &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-3"&gt;Datasette 0.59.3&lt;/a&gt; bug fix release earlier this week.&lt;/p&gt;
&lt;p&gt;The majority of the work in that release though related to Datasette's &lt;a href="https://docs.datasette.io/en/stable/settings.html#base-url"&gt;base_url feature&lt;/a&gt;, designed to help people who run Datasette behind a proxy.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;base_url&lt;/code&gt; lets you run Datasette like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette --setting base_url=/prefix/ fixtures.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you do this, Datasette will change its URLs to start with that prefix - so the homepage will live at &lt;code&gt;/prefix/&lt;/code&gt;, the database index page at &lt;code&gt;/prefix/fixtures/&lt;/code&gt;, tables at &lt;code&gt;/prefix/fixtures/facetable&lt;/code&gt; etc.&lt;/p&gt;
&lt;p&gt;The reason you would want this is if you are running a larger website, and you intend to proxy traffic to &lt;code&gt;/prefix/&lt;/code&gt; to a separate Datasette instance.&lt;/p&gt;
&lt;p&gt;The Datasette documentation includes &lt;a href="https://docs.datasette.io/en/stable/deploying.html#running-datasette-behind-a-proxy"&gt;suggested nginx and Apache configurations&lt;/a&gt; for doing exactly that.&lt;/p&gt;
&lt;p&gt;This feature has been &lt;a href="https://github.com/simonw/datasette/issues?q=is%3Aissue+base_url"&gt;a magnet for bugs&lt;/a&gt; over the years! People keep finding new parts of the Datasette interface that fail to link to the correct pages when run in this mode.&lt;/p&gt;
&lt;p&gt;The principal cause of these bugs is that I don't use Datasette in this way myself, so I wasn't testing it nearly as thoroughly as it needed to be.&lt;/p&gt;
&lt;p&gt;So the first step in finally solving these issues once and for all was to get my own instance of Datasette up and running behind an Apache proxy.&lt;/p&gt;
&lt;p&gt;Since I like to deploy live demos to Cloud Run, I decided to try and run Apache and Datasette in the same container. This took a &lt;em&gt;lot&lt;/em&gt; of figuring out. You can follow my progress on this in these two issue threads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette/issues/1521"&gt;#1521: Docker configuration for exercising Datasette behind Apache mod_proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/datasette/issues/1522"&gt;#1522: Deploy a live instance of demos/apache-proxy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The short version: I got it working! My Docker implementation now lives in the &lt;a href="https://github.com/simonw/datasette/tree/0.59.3/demos/apache-proxy"&gt;demos/apache-proxy&lt;/a&gt; directory and the live demo itself is deployed to &lt;a href="https://datasette-apache-proxy-demo.fly.dev/prefix/"&gt;datasette-apache-proxy-demo.fly.dev/prefix/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(I ended up deploying it to &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt; after running into a bug when deployed to Cloud Run that I couldn't replicate on my own laptop.)&lt;/p&gt;
&lt;p&gt;My final implementation uses a Debian base container with Supervisord to manage the two processes.&lt;/p&gt;
&lt;p&gt;With a working live environment, I was finally able to track down the root cause of the bugs. My notes on
&lt;a href="https://github.com/simonw/datasette/issues/1519"&gt;#1519: base_url is omitted in JSON and CSV views&lt;/a&gt; document how I found and solved them, and updated the associated test to hopefully avoid them ever coming back in the future.&lt;/p&gt;
&lt;h4&gt;The big Datasette table refactor&lt;/h4&gt;
&lt;p&gt;The single most complicated part of the Datasette codebase is the code behind the table view - the page that lets you browse, facet, search, filter and paginate through the contents of a table (&lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties"&gt;this page here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;It's got very thorough tests, but the actual implementation is mostly &lt;a href="https://github.com/simonw/datasette/blob/main/datasette/views/table.py#L303-L992"&gt;a 600 line class method&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was already difficult to work with, but the changes I want to make for Datasette 1.0 have proven too much for it. I need to refactor.&lt;/p&gt;
&lt;p&gt;Apart from making that view easier to change and maintain, a major goal I have is for it to support a much more flexible JSON syntax. I want the JSON version to default to just returning minimal information about the table, then allow &lt;code&gt;?_extra=x&lt;/code&gt; parameters to opt into additional information - like facets, suggested facets, full counts, SQL schema information and so on.&lt;/p&gt;
&lt;p&gt;This means I want to break up that 600 line method into a bunch of separate methods, each of which can be opted-in-to by the calling code.&lt;/p&gt;
&lt;p&gt;The HTML interface should then build on top of the JSON, requesting the extras that it knows it will need and passing the resulting data through to the template. This helps solve the challenge of having a stable template context that I can document in advance of Datasette 1.0.&lt;/p&gt;
&lt;p&gt;I've been putting this off for over a year now, because it's a &lt;em&gt;lot&lt;/em&gt; of work. But no longer! This week I finally started to get stuck in.&lt;/p&gt;
&lt;p&gt;I don't know if I'll stick with it, but my initial attempt at this is a little unconventional. Inspired by how &lt;a href="https://docs.pytest.org/en/6.2.x/fixture.html#back-to-fixtures"&gt;pytest fixtures work&lt;/a&gt; I'm experimenting with a form of dependency injection, in a new (very alpha) library I've released called &lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The key idea behind &lt;code&gt;asyncinject&lt;/code&gt; is to provide a way for class methods to indicate their dependencies as named parameters, in the same way as pytest fixtures do.&lt;/p&gt;
&lt;p&gt;When you call a method, the code can spot which dependencies have not yet been resolved and execute them before executing the method.&lt;/p&gt;
&lt;p&gt;Crucially, since they are all &lt;code&gt;async def&lt;/code&gt; methods they can be &lt;em&gt;executed in parallel&lt;/em&gt;. I'm cautiously excited about this - Datasette has a bunch of opportunities for parallel queries - fetching a single page of table rows, calculating a &lt;code&gt;count(*)&lt;/code&gt; for the entire table, executing requested facets and calculating suggested facets are all queries that could potentially run in parallel rather than in serial.&lt;/p&gt;
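&lt;p&gt;The core mechanic can be sketched in a few lines - this is a toy illustration of the idea, not asyncinject's actual API:&lt;/p&gt;

```python
import asyncio
import inspect


async def resolve(obj, method_name):
    """Call an async method, first resolving any parameters whose
    names match other async methods on the same object - and running
    those dependencies concurrently, pytest-fixture style."""
    method = getattr(obj, method_name)
    deps = [
        name for name in inspect.signature(method).parameters
        if hasattr(obj, name) and inspect.iscoroutinefunction(getattr(obj, name))
    ]
    # Independent dependencies execute in parallel
    results = await asyncio.gather(*(resolve(obj, name) for name in deps))
    return await method(**dict(zip(deps, results)))


class TableView:
    async def count(self):
        await asyncio.sleep(0.01)  # stand-in for a count(*) query
        return 100

    async def rows(self):
        await asyncio.sleep(0.01)  # stand-in for fetching a page of rows
        return ["row1", "row2"]

    async def page(self, count, rows):
        # count() and rows() resolve concurrently before this runs
        return {"count": count, "rows": rows}
```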
&lt;p&gt;What about the GIL, you might ask? Datasette's database queries are handled by the &lt;code&gt;sqlite3&lt;/code&gt; module, and that module releases the GIL once it gets into SQLite C code. So theoretically I should be able to use more than one core for all of this.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/asyncinject/blob/0.2a0/README.md"&gt;asyncinject README&lt;/a&gt; has more details, including code examples. This may turn out to be a terrible idea! But it's really fun to explore, and I'll be able to tell for sure if this is a useful, maintainable and performant approach once I have Datasette's table view running on top of it.&lt;/p&gt;
&lt;h4&gt;git-history and sqlite-utils&lt;/h4&gt;
&lt;p&gt;I made some big improvements to my &lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt; tool, which automates the process of turning a JSON (or other) file that has been version-tracked in a GitHub repository (see &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;) into a SQLite database that can be used to explore changes to it over time.&lt;/p&gt;
&lt;p&gt;The biggest was a major change to the database schema. Previously, the tool used full Git SHA hashes as foreign keys in the largest table.&lt;/p&gt;
&lt;p&gt;The problem here is that a SHA hash string is 40 characters long, and if they are being used as a foreign key that's a LOT of extra weight added to the largest table.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; has a &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-lookup-tables"&gt;table.lookup() method&lt;/a&gt; which is designed to make creating "lookup" tables - where a string is stored in a unique column but an integer ID can be used for things like foreign keys - as easy as possible.&lt;/p&gt;
&lt;p&gt;That method was previously quite limited, but in &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-19"&gt;3.19&lt;/a&gt; - both released this week - I expanded it to cover the more advanced needs of my &lt;code&gt;git-history&lt;/code&gt; tool.&lt;/p&gt;
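&lt;p&gt;The underlying pattern is easy to see in plain &lt;code&gt;sqlite3&lt;/code&gt; - this is a hand-rolled sketch of what &lt;code&gt;table.lookup()&lt;/code&gt; handles for you:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table commits (id integer primary key, hash text unique)")


def lookup(conn, sha):
    # Store the 40 character SHA string once; hand back a small
    # integer id to use as the foreign key in the largest tables.
    conn.execute("insert or ignore into commits (hash) values (?)", (sha,))
    return conn.execute(
        "select id from commits where hash = ?", (sha,)
    ).fetchone()[0]


first = lookup(conn, "a" * 40)
again = lookup(conn, "a" * 40)
print(first, again)  # 1 1 - the same row both times
```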
&lt;p&gt;The great thing about building stuff on top of your own libraries is that you can discover new features that you need along the way - and then ship them promptly without them blocking your progress!&lt;/p&gt;
&lt;h4&gt;Some other highlights&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.6"&gt;s3-credentials 0.6&lt;/a&gt; adds a &lt;code&gt;--dry-run&lt;/code&gt; option that you can use to show what the tool would do without making any actual changes to your AWS account. I found myself wanting this while continuing to work on the ability to &lt;a href="https://github.com/simonw/s3-credentials/issues/12"&gt;specify a folder prefix&lt;/a&gt; within S3 that the bucket credentials should be limited to.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.12"&gt;datasette-publish-vercel 0.12&lt;/a&gt; applies some pull requests from Romain Clement that I had left unreviewed for far too long, and adds the ability to customize the &lt;code&gt;vercel.json&lt;/code&gt; file used for the deployment - useful for things like setting up additional custom redirects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/2.0"&gt;datasette-graphql 2.0&lt;/a&gt; updates that plugin to &lt;a href="https://github.com/graphql-python/graphene/wiki/v3-release-notes"&gt;Graphene 3.0&lt;/a&gt;, a major update to that library. I had to break backwards compatiblity in very minor ways, hence the 2.0 version number.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/csvs-to-sqlite/releases/tag/1.3"&gt;csvs-to-sqlite 1.3&lt;/a&gt; is the first relase of that tool in just over a year. William Rowell contributed a new feature that allows you to populate "fixed" database columns on your imported records, see &lt;a href="https://github.com/simonw/csvs-to-sqlite/pull/81"&gt;PR #81&lt;/a&gt; for details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/graphlib-topologicalsorter"&gt;Planning parallel downloads with TopologicalSorter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/cog-to-update-help-in-readme"&gt;Using cog to update --help in a Markdown README file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/cloudrun/using-build-args-with-cloud-run"&gt;Using build-arg variables with Cloud Run deployments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/fly/custom-subdomain-fly"&gt;Assigning a custom subdomain to a Fly app&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases"&gt;18 releases total&lt;/a&gt;) - 2021-11-22
&lt;br /&gt;Datasette plugin for publishing data using Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;6 releases total&lt;/a&gt;) - 2021-11-21
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.19"&gt;3.19&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;90 releases total&lt;/a&gt;) - 2021-11-21
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.3"&gt;0.59.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;101 releases total&lt;/a&gt;) - 2021-11-20
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-redirect-to-https"&gt;datasette-redirect-to-https&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-redirect-to-https/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2021-11-20
&lt;br /&gt;Datasette plugin that redirects all non-https requests to https&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.6"&gt;0.6&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;6 releases total&lt;/a&gt;) - 2021-11-18
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/csvs-to-sqlite/releases/tag/1.3"&gt;1.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/csvs-to-sqlite/releases"&gt;13 releases total&lt;/a&gt;) - 2021-11-18
&lt;br /&gt;Convert CSV files into a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/2.0"&gt;2.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-graphql/releases"&gt;32 releases total&lt;/a&gt;) - 2021-11-17
&lt;br /&gt;Datasette plugin providing an automatic GraphQL API for your SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.2a0"&gt;0.2a0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-17
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apache"&gt;apache&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/proxies"&gt;proxies&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/refactoring"&gt;refactoring&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/supervisord"&gt;supervisord&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="apache"/><category term="proxies"/><category term="refactoring"/><category term="supervisord"/><category term="docker"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/></entry><entry><title>Weeknotes: git-history, created for a Git scraping workshop</title><link href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag" rel="alternate"/><published>2021-11-15T04:10:50+00:00</published><updated>2021-11-15T04:10:50+00:00</updated><id>https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;My main project this week was a 90 minute workshop I delivered about Git scraping at &lt;a href="https://escoladedados.org/coda2021/"&gt;Coda.Br 2021&lt;/a&gt;, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, &lt;strong&gt;git-history&lt;/strong&gt;, plus smaller improvements to a range of other projects.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;I still need to do a detailed write-up of this one (update: &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history: a tool for analyzing scraped data collected using Git and SQLite&lt;/a&gt;), but on Thursday I released a brand new tool called &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;, which I describe as "tools for analyzing Git history using SQLite".&lt;/p&gt;
&lt;p&gt;This tool is the missing link in the &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping pattern&lt;/a&gt; I described here last October.&lt;/p&gt;
&lt;p&gt;Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.&lt;/p&gt;
&lt;p&gt;The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?&lt;/p&gt;
&lt;p&gt;I've written one-off Python scripts for this a few times (here's &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/6f6bcb9437c0d44c4bcf94c111c631cc50bc2744/build_database.py"&gt;my CDC vaccinations one&lt;/a&gt;, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.&lt;/p&gt;
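&lt;p&gt;The core of those one-off scripts is a loop over every historic version of the tracked file. A rough sketch of that loop, shelling out to &lt;code&gt;git&lt;/code&gt; (the function name is mine, not taken from any of those scripts):&lt;/p&gt;

```python
import subprocess


def file_versions(repo, path):
    # Walk the commits that touched the file, oldest first, and
    # yield the file's full contents as of each commit.
    shas = subprocess.run(
        ["git", "log", "--reverse", "--format=%H", "--", path],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout.split()
    for sha in shas:
        yield subprocess.run(
            ["git", "show", f"{sha}:{path}"],
            cwd=repo, capture_output=True, text=True, check=True,
        ).stdout
```

Each yielded version can then be parsed as JSON and written to SQLite.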
&lt;p&gt;The tool has &lt;a href="https://datasette.io/tools/git-history"&gt;a comprehensive README&lt;/a&gt;, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file incidents.db incidents.json --id IncidentID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This assumes that &lt;code&gt;incidents.json&lt;/code&gt; contains a JSON array of incidents (reported fires for example) and that each incident has an &lt;code&gt;IncidentID&lt;/code&gt; identifier key. It will then loop through the Git history of that file right from the start, creating an &lt;code&gt;item_versions&lt;/code&gt; table that tracks every change made to each of those items - using &lt;code&gt;IncidentID&lt;/code&gt; to decide if a row represents a new incident or an update to a previous one.&lt;/p&gt;
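&lt;p&gt;The new-or-updated decision at the heart of that loop looks something like this - a simplified sketch of the idea, not the tool's actual code:&lt;/p&gt;

```python
def diff_versions(previous, current, id_key):
    # Use the --id key to tell brand new items apart from changed
    # versions of items we have already seen.
    prev = {item[id_key]: item for item in previous}
    new, changed = [], []
    for item in current:
        if item[id_key] not in prev:
            new.append(item)
        elif item != prev[item[id_key]]:
            changed.append(item)
    return new, changed


new, changed = diff_versions(
    [{"IncidentID": 1, "status": "active"}],
    [{"IncidentID": 1, "status": "contained"},
     {"IncidentID": 2, "status": "active"}],
    "IncidentID",
)
print(new)      # [{'IncidentID': 2, 'status': 'active'}]
print(changed)  # [{'IncidentID': 1, 'status': 'contained'}]
```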
&lt;p&gt;I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt; (now at 202 repos and counting).&lt;/p&gt;
&lt;h4&gt;Workshop: Raspando dados com o GitHub Actions e analisando com Datasette&lt;/h4&gt;
&lt;p&gt;The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.&lt;/p&gt;
&lt;p&gt;The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.&lt;/p&gt;
&lt;p&gt;I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took &lt;a href="https://simonwillison.net/2020/Sep/26/weeknotes-software-carpentry-sqlite/"&gt;last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.&lt;/p&gt;
&lt;p&gt;The worst part of running a workshop is inevitably the part where you try to get everyone set up with a functional development environment on their own machines (see &lt;a href="https://xkcd.com/1987/"&gt;XKCD 1987&lt;/a&gt;). This time round I skipped that entirely by encouraging my students to use &lt;strong&gt;&lt;a href="https://gitpod.io/"&gt;GitPod&lt;/a&gt;&lt;/strong&gt;, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2021/start-datasette-gitpod.gif" alt="Animated demo of GitPod showing how to run Datasette and have it proxy a port" /&gt;&lt;/p&gt;

&lt;p&gt;(It's similar to &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt;, but Codespaces is not yet available to free customers outside of the beta.)&lt;/p&gt;
&lt;p&gt;I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.&lt;/p&gt;
&lt;p&gt;This worked &lt;strong&gt;so well&lt;/strong&gt;. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.&lt;/p&gt;
&lt;p&gt;The workshop exercises are available &lt;a href="https://docs.google.com/document/d/1TCatZP5gQNfFjZJ5M77wMlf9u_05Z3BZnjp6t1SA6UU/edit"&gt;in this Google Doc&lt;/a&gt;, and I hope to extract some of them out into official tutorials for various tools later on.&lt;/p&gt;
&lt;h4&gt;Datasette 0.59.2&lt;/h4&gt;
&lt;p&gt;Yesterday was Datasette's fourth birthday - the four year anniversary of &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the initial release announcement&lt;/a&gt;! I celebrated by releasing a minor bug-fix, &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;Datasette 0.59.2&lt;/a&gt;, the release notes for which are quoted below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column names with a leading underscore now work correctly when used as a facet. (&lt;a href="https://github.com/simonw/datasette/issues/1506"&gt;#1506&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applying &lt;code&gt;?_nocol=&lt;/code&gt; to a column no longer removes that column from the filtering interface. (&lt;a href="https://github.com/simonw/datasette/issues/1503"&gt;#1503&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Official Datasette Docker container now uses Debian Bullseye as the base image. (&lt;a href="https://github.com/simonw/datasette/issues/1497"&gt;#1497&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That first change was inspired by ongoing work on &lt;code&gt;git-history&lt;/code&gt;, where I decided to use an &lt;code&gt;_id&lt;/code&gt; underscore prefix pattern for columns that were reserved for use by that tool in order &lt;a href="https://github.com/simonw/git-history/issues/14"&gt;to avoid clashing with column names&lt;/a&gt; in the provided source data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.18&lt;/h4&gt;
&lt;p&gt;Today I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; - initially also to provide a feature I wanted for &lt;code&gt;git-history&lt;/code&gt; (a way to &lt;a href="https://github.com/simonw/sqlite-utils/issues/339"&gt;populate additional columns&lt;/a&gt; when creating a row using &lt;code&gt;table.lookup()&lt;/code&gt;) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.&lt;/p&gt;
&lt;h4&gt;s3-credentials 0.5&lt;/h4&gt;
&lt;p&gt;Earlier in the week I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;version 0.5&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.&lt;/p&gt;
&lt;p&gt;The biggest new feature is the ability to create temporary credentials that expire after a given time limit.&lt;/p&gt;
&lt;p&gt;This is achieved using &lt;code&gt;STS.assume_role()&lt;/code&gt;, where STS is &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;Security Token Service&lt;/a&gt;. I've been wanting to learn this API for quite a while now.&lt;/p&gt;
&lt;p&gt;Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.&lt;/p&gt;
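&lt;p&gt;Building that scoped-down policy looks roughly like this. The helper names and bucket are placeholders of mine - see the &lt;code&gt;s3-credentials&lt;/code&gt; source for the policy documents it actually generates:&lt;/p&gt;

```python
import json


def scoped_policy(bucket, prefix="*"):
    # Hypothetical helper: a read-only policy limited to one bucket,
    # passed as the extra Policy document when assuming the role.
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/{prefix}",
            ],
        }],
    }


def temporary_credentials(role_arn, bucket, duration=900):
    # Requires boto3 and AWS credentials - shown but not executed here.
    import boto3
    return boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3-credentials",
        Policy=json.dumps(scoped_policy(bucket)),
        DurationSeconds=duration,  # 15 minutes to 12 hours
    )["Credentials"]


policy = scoped_policy("my-bucket")
```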
&lt;p&gt;I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.18"&gt;3.18&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;88 releases total&lt;/a&gt;) - 2021-11-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;0.59.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;100 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hello-world"&gt;datasette-hello-world&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hello-world/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-hello-world/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;The hello world of Datasette plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-12
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-11
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/kubernetes/basic-datasette-in-kubernetes"&gt;Basic Datasette in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/deno/annotated-deno-deploy-demo"&gt;Annotated code for a demo of WebSocket chat in Deno Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/tesseract-ocr-javascript"&gt;Using Tesseract.js to OCR every image on a page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="my-talks"/><category term="teaching"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: CDC vaccination history fixes, developing in GitHub Codespaces</title><link href="https://simonwillison.net/2021/Sep/28/weeknotes/#atom-tag" rel="alternate"/><published>2021-09-28T01:53:49+00:00</published><updated>2021-09-28T01:53:49+00:00</updated><id>https://simonwillison.net/2021/Sep/28/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I spent the last week mostly surrounded by boxes: we're completing our move to the new place and life is mostly unpacking now. I did find some time to fix some issues with my &lt;a href="https://cdc-vaccination-history.datasette.io/"&gt;CDC vaccination history&lt;/a&gt; Datasette instance though.&lt;/p&gt;
&lt;h4&gt;Fixing my CDC vaccination history site&lt;/h4&gt;
&lt;p&gt;I started tracking changes made to the &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total"&gt;CDC's COVID Data Tracker&lt;/a&gt; website back in February. I created &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;a git scraper repository&lt;/a&gt; for it as part of my &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk on git scraping&lt;/a&gt; (notes and video) at this year's NICAR data journalism conference.&lt;/p&gt;
&lt;p&gt;Since then it's been quietly ticking along, recording the latest data in a git repository that now has &lt;a href="https://github.com/simonw/cdc-vaccination-history/commits/main"&gt;335 commits&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In March I &lt;a href="https://github.com/simonw/cdc-vaccination-history/commit/bf88c1e6cc3e5b6344a7dfea5d2a70dcb0552847#diff-87ee5504a3e25ac558b343724c905f2f7949e8cec3d92b9c4300bb922afa164f"&gt;added a script&lt;/a&gt; to build the collected historic data into a SQLite database and publish it to Vercel using GitHub. That started breaking a few weeks ago, and it turned out that was because the database file had grown in size to the point where it was too large to deploy to Vercel (~100MB).&lt;/p&gt;
&lt;p&gt;I got a bug report about this, so I took some time to &lt;a href="https://github.com/simonw/cdc-vaccination-history/issues/8"&gt;move the deployment over&lt;/a&gt; to Google Cloud Run which doesn't have a documented size limit (though in my experience starts to creak once you go above about 2GB.)&lt;/p&gt;
&lt;p&gt;I also started publishing the raw collected data &lt;a href="https://github.com/simonw/cdc-vaccination-history/issues/9"&gt;directly as a CSV file&lt;/a&gt;, partly as an excuse to learn &lt;a href="https://til.simonwillison.net/googlecloud/gsutil-bucket"&gt;how to publish to Google Cloud Storage&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-template-request&lt;/h4&gt;
&lt;p&gt;I released an extremely simple plugin this week called &lt;a href="https://datasette.io/plugins/datasette-template-request"&gt;datasette-template-request&lt;/a&gt; - all it does is expose Datasette's &lt;a href="https://docs.datasette.io/en/stable/internals.html#request-object"&gt;request object&lt;/a&gt; in the context passed to &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html"&gt;custom templates&lt;/a&gt;, for people who want to update their custom page based on incoming request parameters.&lt;/p&gt;
&lt;p&gt;More notable is how I built the plugin: this is the first plugin I've developed, tested and released entirely in my browser using the new &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt; online development environment.&lt;/p&gt;
&lt;p&gt;I created the new repo using my &lt;a href="https://github.com/simonw/datasette-plugin-template-repository"&gt;Datasette plugin template repository&lt;/a&gt;, opened it up in Codespaces, implemented the plugin and tests, tried it out using the port forwarding feature and then published it to PyPI using the &lt;a href="https://github.com/simonw/datasette-template-request/blob/0.1/.github/workflows/publish.yml"&gt;publish.yml&lt;/a&gt; workflow.&lt;/p&gt;
&lt;p&gt;Not having to even open a text editor on my laptop (let alone get a new Python development environment up and running) felt really good. I should turn this into a tutorial.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-template-request"&gt;datasette-template-request&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-template-request/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2021-09-23
&lt;br /&gt;Expose the Datasette request object to custom templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-notebook"&gt;datasette-notebook&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-notebook/releases/tag/0.1a1"&gt;0.1a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-notebook/releases"&gt;2 releases total&lt;/a&gt;) - 2021-09-22
&lt;br /&gt;A markdown wiki and dashboarding system for Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-render-markdown/releases/tag/2.0"&gt;2.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-render-markdown/releases"&gt;8 releases total&lt;/a&gt;) - 2021-09-22
&lt;br /&gt;Datasette plugin for rendering Markdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.17.1"&gt;3.17.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;87 releases total&lt;/a&gt;) - 2021-09-22
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.22"&gt;0.22&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases"&gt;28 releases total&lt;/a&gt;) - 2021-09-21
&lt;br /&gt;Save data from Twitter to a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/googlecloud_gsutil-bucket.md"&gt;Publishing to a public Google Cloud bucket with gsutil&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/javascript_lit-with-skypack.md"&gt;Loading lit from Skypack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-codespaces"&gt;github-codespaces&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="weeknotes"/><category term="covid19"/><category term="git-scraping"/><category term="github-codespaces"/></entry><entry><title>Flat Data</title><link href="https://simonwillison.net/2021/May/19/flat-data/#atom-tag" rel="alternate"/><published>2021-05-19T01:05:54+00:00</published><updated>2021-05-19T01:05:54+00:00</updated><id>https://simonwillison.net/2021/May/19/flat-data/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://octo.github.com/projects/flat-data"&gt;Flat Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New project from the GitHub OCTO (the Office of the CTO, love that backronym) somewhat inspired by my work on Git scraping: I’m really excited to see GitHub embracing git for CSV/JSON data in this way. Flat incorporates a reusable Action for scraping and storing data (using Deno), a VS Code extension for setting up those workflows and a very nicely designed Flat Viewer web app for browsing CSV and JSON data hosted on GitHub.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deno"&gt;deno&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="deno"/><category term="git-scraping"/></entry><entry><title>Weeknotes: SpatiaLite 5, Datasette on Azure, more CDC vaccination history</title><link href="https://simonwillison.net/2021/Mar/28/weeknotes/#atom-tag" rel="alternate"/><published>2021-03-28T05:19:57+00:00</published><updated>2021-03-28T05:19:57+00:00</updated><id>https://simonwillison.net/2021/Mar/28/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I got SpatiaLite 5 working in the Datasette Docker image, improved the CDC vaccination history git scraper, figured out Datasette on Azure and we closed on a new home!&lt;/p&gt;

&lt;h4 id="spatialite-5-datasette"&gt;SpatiaLite 5 for Datasette&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.gaia-gis.it/fossil/libspatialite/wiki?name=5.0.0-doc"&gt;SpatiaLite 5&lt;/a&gt; came out earlier this year with a bunch of exciting improvements, most notably an implementation of &lt;a href="https://www.gaia-gis.it/fossil/libspatialite/wiki?name=KNN"&gt;KNN&lt;/a&gt; (K-nearest neighbours) - a way to efficiently answer the question "what are the 10 closest rows to this latitude/longitude point".&lt;/p&gt;
&lt;p&gt;I love building &lt;a href="https://www.owlsnearme.com/"&gt;X near me&lt;/a&gt; websites so I expect I'll be using this a &lt;em&gt;lot&lt;/em&gt; in the future.&lt;/p&gt;
&lt;p&gt;I spent a bunch of time this week figuring out how best to install it into a Docker container for use with Datasette. I finally cracked it in &lt;a href="https://github.com/simonw/datasette/issues/1249"&gt;issue 1249&lt;/a&gt; and the &lt;a href="https://github.com/simonw/datasette/blob/3fcfc8513465339ac5f055296cbb67f5262af02b/Dockerfile"&gt;Dockerfile&lt;/a&gt; in the Datasette repository now builds with the SpatiaLite 5.0 extension, using a pattern &lt;a href="https://til.simonwillison.net/docker/debian-unstable-packages"&gt;I figured out&lt;/a&gt; for installing Debian unstable packages into a Debian stable base container.&lt;/p&gt;
&lt;p&gt;When Datasette 0.56 is released the official Datasette Docker image will bundle SpatiaLite 5.0.&lt;/p&gt;
&lt;h4 id="cdc-vaccination-datasette"&gt;CDC vaccination history in Datasette&lt;/h4&gt;
&lt;p&gt;I'm tracking the CDC's per-state vaccination numbers in my &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;cdc-vaccination-history&lt;/a&gt; repository, as described in my &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;Git scraping lightning talk&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Scraping data into a git repository to track changes to it over time is easy. What's harder is extracting that data back out of the commit history in order to analyze and visualize it later.&lt;/p&gt;
&lt;p&gt;To demonstrate how this can work I added a &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/1fd1003f34ec512cf0b89c68fe609e130c7fe3f1/build_database.py"&gt;build_database.py&lt;/a&gt; script to that repository which iterates through the git history and uses it to build a SQLite database containing daily state reports. I also added &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/1fd1003f34ec512cf0b89c68fe609e130c7fe3f1/.github/workflows/scrape.yml#L29-L57"&gt;steps to the GitHub Actions workflow&lt;/a&gt; to publish that SQLite database using Datasette and Vercel.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; visualization plugin there too. Here's &lt;a href="https://cdc-vaccination-history.datasette.io/cdc/daily_reports?_sort=id&amp;amp;Location__exact=CA#g.mark=bar&amp;amp;g.x_column=Date&amp;amp;g.x_type=temporal&amp;amp;g.y_column=Doses_Administered&amp;amp;g.y_type=quantitative"&gt;a chart&lt;/a&gt; showing the number of doses administered over time in California.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart of vaccines distributed in California, which is going up at a healthy pace" src="https://static.simonwillison.net/static/2021/cdc-vaccines-california.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This morning I &lt;a href="https://github.com/simonw/cdc-vaccination-history/commit/1b274d3aab5cc58ae6c79411dbc15d28d8bd0c8b#diff-87ee5504a3e25ac558b343724c905f2f7949e8cec3d92b9c4300bb922afa164f"&gt;started capturing&lt;/a&gt; the CDC's per-county data too, but I've not yet written code to load that into Datasette. [UPDATE: that table is now available: &lt;a href="https://cdc-vaccination-history.datasette.io/cdc/daily_reports_counties"&gt;cdc/daily_reports_counties&lt;/a&gt;]&lt;/p&gt;
&lt;h4&gt;Datasette on Azure&lt;/h4&gt;
&lt;p&gt;I'm keen to make Datasette easy to deploy in as many places as possible. I already have mechanisms for publishing to Heroku, Cloud Run, Vercel and Fly.io - today I worked out the recipe needed for &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/"&gt;Azure Functions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I haven't bundled it into a &lt;code&gt;datasette-publish-azure&lt;/code&gt; plugin yet but that's the next step. In the meantime the &lt;a href="https://github.com/simonw/azure-functions-datasette"&gt;azure-functions-datasette&lt;/a&gt; repo has a working example with instructions on how to deploy it.&lt;/p&gt;
&lt;p&gt;Thanks go to Anthony Shaw for &lt;a href="https://github.com/Azure/azure-functions-python-library/issues/75#issuecomment-808553496"&gt;building out the ASGI wrapper&lt;/a&gt; needed to run ASGI applications like Datasette on Azure Functions.&lt;/p&gt;
&lt;h4&gt;iam-to-sqlite&lt;/h4&gt;
&lt;p&gt;I spend way too much time &lt;a href="https://twitter.com/search?q=from%3Asimonw%20iam&amp;amp;src=typed_query"&gt;whinging about IAM&lt;/a&gt; on Twitter. I'm certain that properly learning IAM will unlock the entire world of AWS, but I have so far been unable to overcome my discomfort with it long enough to actually figure it out.&lt;/p&gt;
&lt;p&gt;After &lt;a href="https://twitter.com/simonw/status/1374494730088706058"&gt;yet another unproductive whinge&lt;/a&gt; this week I guilted myself into putting in some effort, and it's already started to pay off: I figured out how to dump out all existing IAM data (users, groups, roles and policies) as JSON using the &lt;code&gt;aws  iam get-account-authorization-details&lt;/code&gt; command, and got so excited about it that I built &lt;a href="https://github.com/simonw/iam-to-sqlite"&gt;iam-to-sqlite&lt;/a&gt; as a wrapper around that command that writes the results into SQLite so I can browse them using Datasette!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Datasette showing IAM database tables" src="https://static.simonwillison.net/static/2021/iam-to-sqlite.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm increasingly realizing that the key to me understanding how pretty much any service works is to pull their JSON into a SQLite database so I can explore it as relational tables.&lt;/p&gt;
&lt;h4&gt;A useful trick for writing weeknotes&lt;/h4&gt;
&lt;p&gt;When writing weeknotes like these, it's really useful to be able to see all of the commits from the past week across many different projects.&lt;/p&gt;
&lt;p&gt;Today I realized you can use GitHub search for this. Run a search for &lt;code&gt;author:simonw created:&amp;gt;2021-03-20&lt;/code&gt; and filter to commits, ordered by "Recently committed".&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/search?l=&amp;amp;o=desc&amp;amp;q=author%3Asimonw+created%3A%3E2021-03-20&amp;amp;s=committer-date&amp;amp;type=Commits"&gt;Here's that search for me&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Django pull request accepted!&lt;/h4&gt;
&lt;p&gt;I had &lt;a href="https://github.com/django/django/pull/14171"&gt;a pull request&lt;/a&gt; accepted to Django this week! It was a documentation fix for the &lt;a href="https://docs.djangoproject.com/en/dev/ref/models/expressions/#django.db.models.expressions.RawSQL"&gt;RawSQL query expression&lt;/a&gt; - I found a pattern for using it as part of an &lt;code&gt;.filter(id__in=RawSQL(...))&lt;/code&gt; query that wasn't covered by the documentation.&lt;/p&gt;
&lt;h4&gt;And we found a new home&lt;/h4&gt;
&lt;p&gt;One other project this week: Natalie and I closed on a new home! We're moving to El Granada, a tiny town just north of Half Moon Bay, on the coast 40 minutes south of San Francisco. We'll be ten minutes from the ocean, with plenty of &lt;a href="https://pinnipeds-near-me.now.sh/"&gt;pinnipeds&lt;/a&gt; and &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;pelicans&lt;/a&gt;. Exciting!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cleo asleep on the deck with the Pacific ocean in the distance" src="https://static.simonwillison.net/static/2021/new-house-cleo.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/docker/gdb-python-docker"&gt;Running gdb against a Python process in a running Docker container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/tracing-every-statement"&gt;Tracing every executed Python statement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/docker/debian-unstable-packages"&gt;Installing packages from Debian unstable in a Docker image based on stable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/postgresql/closest-locations-to-a-point"&gt;Closest locations to a point&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/zeit-now/redirecting-all-paths-on-vercel"&gt;Redirecting all paths on a Vercel instance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/azure/all-traffic-to-subdomain"&gt;Writing an Azure Function that serves all traffic to a subdomain&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.9.3"&gt;0.9.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases"&gt;15 releases total&lt;/a&gt;) - 2021-03-26
&lt;br /&gt;Datasette plugin for publishing data using Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-transform"&gt;sqlite-transform&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-transform/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-transform/releases"&gt;6 releases total&lt;/a&gt;) - 2021-03-24
&lt;br /&gt;Tool for running transformations on columns in a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/0.5a0"&gt;0.5a0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/django-sql-dashboard/releases"&gt;12 releases total&lt;/a&gt;) - 2021-03-24
&lt;br /&gt;Django app for building dashboards using raw SQL queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/iam-to-sqlite"&gt;iam-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/iam-to-sqlite/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2021-03-24
&lt;br /&gt;Load Amazon IAM data into a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/tableau-to-sqlite"&gt;tableau-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/tableau-to-sqlite/releases/tag/0.2.1"&gt;0.2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/tableau-to-sqlite/releases"&gt;4 releases total&lt;/a&gt;) - 2021-03-22
&lt;br /&gt;Fetch data from Tableau into a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/c64"&gt;c64&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/c64/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2021-03-21
&lt;br /&gt;Experimental package of ASGI utilities extracted from Datasette&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/azure"&gt;azure&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="azure"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/></entry><entry><title>Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA</title><link href="https://simonwillison.net/2021/Mar/7/weeknotes/#atom-tag" rel="alternate"/><published>2021-03-07T07:29:00+00:00</published><updated>2021-03-07T07:29:00+00:00</updated><id>https://simonwillison.net/2021/Mar/7/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see &lt;a href="https://simonwillison.net/2021/Feb/28/vaccinateca/"&gt;last week&lt;/a&gt;).&lt;/p&gt;
&lt;h4&gt;NICAR 2021&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR&lt;/a&gt; stands for the National Institute for Computer Assisted Reporting - an acronym that reflects the age of the organization, which started teaching journalists data-driven reporting back in 1989, long before the term "data journalism" became commonplace.&lt;/p&gt;
&lt;p&gt;This was my third NICAR and it has now firmly established itself at the top of my list of favourite conferences. Every year it attracts over 1,000 of the highest quality data nerds - from data journalism veterans who've been breaking stories for decades to journalists who are just getting started with data and want to start learning Python or polish up their skills with Excel.&lt;/p&gt;
&lt;p&gt;I presented &lt;a href="https://nicar21.pathable.co/meetings/virtual/xEmubEJvwB5mv3Dfn"&gt;an hour-long workshop&lt;/a&gt; on Datasette, which I'm planning to turn into the first official Datasette tutorial. I also got to pre-record a five-minute lightning talk about Git scraping.&lt;/p&gt;
&lt;p&gt;I published &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;the video and notes for that&lt;/a&gt; yesterday. It really seemed to strike a nerve at the conference: I showed how you can set up a scheduled scraper using GitHub Actions with just a few lines of YAML configuration, and do so entirely through the GitHub web interface without even opening a text editor.&lt;/p&gt;
&lt;p&gt;Pretty much every data journalist wants to run scrapers, and understands the friction involved in maintaining your own dedicated server and crontabs and storage and backups for running them. Being able to do this for free on GitHub's infrastructure drops that friction down to almost nothing.&lt;/p&gt;
&lt;p&gt;The lightning talk led to a last-minute GitHub Actions and Git scraping &lt;a href="https://nicar21.pathable.co/meetings/virtual/FTTWfJicMwFLP849H"&gt;office hours session&lt;/a&gt; being added to the schedule, and I was delighted to have &lt;a href="https://github.com/rdmurphy"&gt;Ryan Murphy&lt;/a&gt; from the LA Times join that session to demonstrate the incredible things the LA Times have been doing with scrapers and GitHub Actions. You can see some of their scrapers in the &lt;a href="https://github.com/datadesk/california-coronavirus-scrapers"&gt;datadesk/california-coronavirus-scrapers&lt;/a&gt; repo.&lt;/p&gt;
&lt;h4&gt;VaccinateCA&lt;/h4&gt;
&lt;p&gt;The race continues to build out a Django backend for the &lt;a href="https://www.vaccinateca.com/"&gt;VaccinateCA&lt;/a&gt; project, to collect data on vaccine availability from people making calls on that organization's behalf.&lt;/p&gt;
&lt;p&gt;The new backend is getting perilously close to launch. I'm leaning heavily on the Django admin for this, refreshing my knowledge of how to customize it with things like &lt;a href="https://docs.djangoproject.com/en/3.1/ref/contrib/admin/actions/"&gt;admin actions&lt;/a&gt; and &lt;a href="https://docs.djangoproject.com/en/3.1/ref/contrib/admin/#django.contrib.admin.ModelAdmin.list_filter"&gt;custom filters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's been quite a while since I've done anything sophisticated with the Django admin and it has evolved a LOT. In the past I've advised people to drop the admin for custom view functions the moment they want to do anything out-of-the-ordinary - I don't think that advice holds any more. It's got really good over the years!&lt;/p&gt;
&lt;p&gt;A very smart thing the team at VaccinateCA did a month ago is to start logging the full incoming POST bodies for every API request handled by their existing Netlify functions (which then write to Airtable).&lt;/p&gt;
&lt;p&gt;This has given me an invaluable tool for testing out the new replacement API: I wrote &lt;a href="https://gist.github.com/simonw/83e66d618f07aa3b19d2f1db58be78b8"&gt;a script&lt;/a&gt; which replays those API logs against my new implementation - allowing me to test that every one of several thousand previously recorded API requests will run without errors against my new code.&lt;/p&gt;
&lt;p&gt;Since this is so valuable, I've written code that will log API requests to the new stack directly to the database. Normally I'd shy away from a database table for logging data like this, but the expected traffic is the low thousands of API requests a day - and a few thousand extra database rows per day is a tiny price to pay for having such a high level of visibility into how the API is being used.&lt;/p&gt;
&lt;p&gt;(I'm also logging the API requests to PostgreSQL using Django's JSONField, which means I can analyze them in depth later on using PostgreSQL's JSON functionality!)&lt;/p&gt;
&lt;h4&gt;YouTube subtitles&lt;/h4&gt;
&lt;p&gt;I decided to add proper subtitles to my &lt;a href="https://www.youtube.com/watch?v=2CjA-03yK8I&amp;amp;t=1s"&gt;lightning talk video&lt;/a&gt;, and was delighted to learn that the YouTube subtitle editor pre-populates with an automatically generated transcript, which you can then edit in place to fix up spelling, grammar and remove the various "um" and "so" filler words.&lt;/p&gt;
&lt;p&gt;This makes creating high quality captions extremely productive. I've also added them to the 17 minute &lt;a href="https://simonwillison.net/2021/Feb/7/video/"&gt;Introduction to Datasette and sqlite-utils&lt;/a&gt; video that's embedded on the &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt; homepage - editing the transcript for that only took about half an hour.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/django/testing-django-admin-with-pytest"&gt;Writing tests for the Django admin with pytest-django&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/django/show-timezone-in-django-admin"&gt;Show the timezone for datetimes in the Django admin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/mediawiki/mediawiki-sqlite-macos"&gt;How to run MediaWiki with SQLite on a macOS laptop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/youtube"&gt;youtube&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca"&gt;vaccinate-ca&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="django-admin"/><category term="youtube"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="vaccinate-ca"/><category term="nicar"/></entry><entry><title>Git scraping, the five minute lightning talk</title><link href="https://simonwillison.net/2021/Mar/5/git-scraping/#atom-tag" rel="alternate"/><published>2021-03-05T00:44:15+00:00</published><updated>2021-03-05T00:44:15+00:00</updated><id>https://simonwillison.net/2021/Mar/5/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;I prepared a lightning talk about &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; for the &lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR 2021&lt;/a&gt; data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC's vaccination data using the GitHub web interface. Here's the video.&lt;/p&gt;
&lt;div class="resp-container"&gt;
    &lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/2CjA-03yK8I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;
&lt;/div&gt;
&lt;h4&gt;Notes from the talk&lt;/h4&gt;
&lt;p&gt;Here's &lt;a href="https://m.pge.com/#outages"&gt;the PG&amp;amp;E outage map&lt;/a&gt; that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.&lt;/p&gt;
&lt;p&gt;I scraped that outage data into &lt;a href="https://github.com/simonw/pge-outages"&gt;simonw/pge-outages&lt;/a&gt; - here's the &lt;a href="https://github.com/simonw/pge-outages/commits"&gt;commit history&lt;/a&gt; (over 40,000 commits now!)&lt;/p&gt;
&lt;p&gt;The scraper code itself &lt;a href="https://github.com/simonw/disaster-scrapers/blob/3eed6eca820e14e2f89db3910d1aece72717d387/pge.py"&gt;is here&lt;/a&gt;. I wrote about the project in detail in &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - my database of outages is at &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages"&gt;pge-outages.simonwillison.net&lt;/a&gt; and the animation I made of outages over time is attached to &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;this tweet&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;The much simpler scraper for the &lt;a href="https://www.fire.ca.gov/incidents"&gt;www.fire.ca.gov/incidents&lt;/a&gt; website is at &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the video I used that as the template to create a new scraper for CDC vaccination data - their website is &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations"&gt;https://covid.cdc.gov/covid-data-tracker/#vaccinations&lt;/a&gt; and the API I found using the browser developer tools is &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new CDC scraper and the data it has scraped lives in &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;simonw/cdc-vaccination-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can find more examples of Git scraping in the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="scraping"/><category term="my-talks"/><category term="github-actions"/><category term="git-scraping"/><category term="annotated-talks"/><category term="nicar"/></entry><entry><title>Weeknotes: sqlite-utils 3.0 alpha, Git scraping in the zeitgeist</title><link href="https://simonwillison.net/2020/Nov/7/weeknotes-sqlite-utils-git-scraping/#atom-tag" rel="alternate"/><published>2020-11-07T02:17:55+00:00</published><updated>2020-11-07T02:17:55+00:00</updated><id>https://simonwillison.net/2020/Nov/7/weeknotes-sqlite-utils-git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;Natalie and I decided to escape San Francisco for election week, and have been holed up in Fort Bragg on the Northern California coast. I've mostly been on vacation, but I did find time to make some significant changes to &lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;. Plus notes on an exciting Git scraping project.&lt;/p&gt;
&lt;h4&gt;Better search in the sqlite-utils 3.0 alpha&lt;/h4&gt;
&lt;p&gt;I practice &lt;a href="https://www.google.com/search?channel=cus2&amp;amp;client=firefox-b-1-d&amp;amp;q=semver"&gt;semantic versioning&lt;/a&gt; with sqlite-utils, which means it only gets a major version bump if I break backwards compatibility in some way.&lt;/p&gt;
&lt;p&gt;My goal is to avoid breaking backwards compatibility as much as possible, and I was proud to have made it all the way to &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-23"&gt;version 2.23&lt;/a&gt; representing 23 new feature releases since the 2.0 release without breaking any documented features!&lt;/p&gt;
&lt;p&gt;Sadly this run has come to an end: I realized that the &lt;code&gt;table.search()&lt;/code&gt; method was poorly designed, and I also needed to grab back the &lt;code&gt;-c&lt;/code&gt; command-line option (a shortcut for &lt;code&gt;--csv&lt;/code&gt; output) to be used for another purpose.&lt;/p&gt;
&lt;p&gt;The chances that either of these changes will break anyone are pretty small, but semantic versioning dictates a major version bump so here we are.&lt;/p&gt;
&lt;p&gt;I shipped a &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.0a0"&gt;3.0 alpha&lt;/a&gt; today, which should hopefully become a stable release very shortly (&lt;a href="https://github.com/simonw/sqlite-utils/milestone/4"&gt;milestone here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The big new feature is &lt;code&gt;sqlite-utils search&lt;/code&gt; - a command-line tool for executing searches against a full-text search enabled table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sqlite-utils search 24ways-fts4.db articles maps -c title
[{"rowid": 163, "title": "Get To Grips with Slippy Maps", "rank": -10.028754920576421},
 {"rowid": 220, "title": "Finding Your Way with Static Maps", "rank": -9.952534352591737},
 {"rowid": 27, "title": "Putting Design on the Map", "rank": -5.667327088267961},
 {"rowid": 168, "title": "Unobtrusively Mapping Microformats with jQuery", "rank": -4.662224207228984},
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://sqlite-utils.readthedocs.io/en/latest/cli.html#cli-search"&gt;full documentation&lt;/a&gt; for the new command.&lt;/p&gt;
&lt;p&gt;Notably, this command works against both FTS4 and FTS5 tables in SQLite - despite FTS4 not shipping with a built-in ranking function. I'm using my &lt;a href="https://github.com/simonw/sqlite-fts4"&gt;sqlite-fts4&lt;/a&gt; package for this, which I described back in January 2019 in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Git scraping to predict the election&lt;/h4&gt;
&lt;p&gt;It's not quite over yet but the end is in sight, and one of the best tools to track the late arriving vote counts is &lt;a href="https://alex.github.io/nyt-2020-election-scraper/battleground-state-changes.html"&gt;this Election 2020 results site&lt;/a&gt; built by Alex Gaynor and a growing cohort of contributors.&lt;/p&gt;
&lt;p&gt;The site is a beautiful example of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; in action, and I'm thrilled that it links to my article in the README!&lt;/p&gt;
&lt;p&gt;Take a look &lt;a href="https://github.com/alex/nyt-2020-election-scraper"&gt;at the repo&lt;/a&gt; to see how it works. Short version: this &lt;a href="https://github.com/alex/nyt-2020-election-scraper/blob/01060c06c35442c0654e18b84e22394ef3ef5a9c/.github/workflows/scrape.yml"&gt;GitHub Action workflow&lt;/a&gt; grabs the latest snapshot of this &lt;a href="https://static01.nyt.com/elections-assets/2020/data/api/2020-11-03/votes-remaining-page/national/president.json"&gt;undocumented New York Times JSON API&lt;/a&gt; once every five minutes and commits it to the repository. It then runs &lt;a href="https://github.com/alex/nyt-2020-election-scraper/blob/01060c06c35442c0654e18b84e22394ef3ef5a9c/print-battleground-state-changes"&gt;this Python script&lt;/a&gt; which iterates through the Git history and generates an HTML summary showing the different batches of new votes that were reported and their impact on the overall race.&lt;/p&gt;
&lt;p&gt;The resulting report is published to GitHub pages - resulting in a site that can handle a great deal of traffic and is updated entirely by code running in scheduled actions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the generated report" src="https://static.simonwillison.net/static/2020/election-data-git-scraper.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is a perfect use-case for Git scraping: it takes a JSON endpoint that represents the current state of the world and turns it into a sequence of historic snapshots, then uses those snapshots to build a unique and useful new source of information to help people understand what's going on.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.0a0"&gt;sqlite-utils 3.0a0&lt;/a&gt; - 2020-11-07&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-fts4/releases/tag/1.0.1"&gt;sqlite-fts4 1.0.1&lt;/a&gt; - 2020-11-06&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-fts4/releases/tag/1.0"&gt;sqlite-fts4 1.0&lt;/a&gt; - 2020-11-06&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/csvs-to-sqlite/releases/tag/1.2"&gt;csvs-to-sqlite 1.2&lt;/a&gt; - 2020-11-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/releases/tag/0.51.1"&gt;datasette 0.51.1&lt;/a&gt; - 2020-11-01&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alex-gaynor"&gt;alex-gaynor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/elections"&gt;elections&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="alex-gaynor"/><category term="elections"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/></entry><entry><title>nyt-2020-election-scraper</title><link href="https://simonwillison.net/2020/Nov/6/nyt-2020-election-scraper/#atom-tag" rel="alternate"/><published>2020-11-06T14:24:36+00:00</published><updated>2020-11-06T14:24:36+00:00</updated><id>https://simonwillison.net/2020/Nov/6/nyt-2020-election-scraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/alex/nyt-2020-election-scraper"&gt;nyt-2020-election-scraper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot of the world right now” data source and turn it into a git revision history over time.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alex-gaynor"&gt;alex-gaynor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/elections"&gt;elections&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/new-york-times"&gt;new-york-times&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;



</summary><category term="alex-gaynor"/><category term="data-journalism"/><category term="elections"/><category term="git"/><category term="new-york-times"/><category term="git-scraping"/></entry><entry><title>Datasette Weekly: Datasette 0.50, git scraping, extracting columns</title><link href="https://simonwillison.net/2020/Oct/10/datasette-weekly-1/#atom-tag" rel="alternate"/><published>2020-10-10T21:00:30+00:00</published><updated>2020-10-10T21:00:30+00:00</updated><id>https://simonwillison.net/2020/Oct/10/datasette-weekly-1/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://datasette.substack.com/p/datasette-050-git-scraping-extracting"&gt;Datasette Weekly: Datasette 0.50, git scraping, extracting columns&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The first edition of the new Datasette Weekly newsletter—covering Datasette 0.50, Git scraping, extracting columns with sqlite-utils and featuring datasette-graphql as the first “plugin of the week”

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1315031815166410752"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/email"&gt;email&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;



</summary><category term="email"/><category term="projects"/><category term="sqlite"/><category term="graphql"/><category term="datasette"/><category term="git-scraping"/><category term="sqlite-utils"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th January 2022:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data. The &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt;, for example, tracks changes made to New York Times headlines, offering a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to XHR requests and sorting by size (largest first) reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
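&lt;p&gt;The PG&amp;amp;E outages item above hints at the general pattern - replay each commit's snapshot of the scraped file and pull a value out of it - which can be sketched in a few lines of standard-library Python. This is an illustrative stand-in, not the GitPython code from that project: the &lt;code&gt;PercentContained&lt;/code&gt; field name and the sample commits are invented for the example.&lt;/p&gt;

```python
import json

def extract_series(snapshots, field):
    """snapshots: iterable of (iso_timestamp, json_text) pairs, oldest first.

    Returns (timestamp, value) rows ready to load into a database table.
    """
    rows = []
    for ts, text in snapshots:
        record = json.loads(text)
        rows.append((ts, record.get(field)))
    return rows

# Three hypothetical commit snapshots of a scraped JSON file:
commits = [
    ("2020-09-28T06:00:00Z", '{"PercentContained": 56}'),
    ("2020-09-29T06:00:00Z", '{"PercentContained": 90}'),
    ("2020-09-30T06:00:00Z", '{"PercentContained": 92}'),
]
series = extract_series(commits, "PercentContained")
```

&lt;p&gt;In the real projects the snapshot pairs would come from walking the repository's commit history (for example with GitPython's &lt;code&gt;iter_commits&lt;/code&gt;) rather than a hard-coded list.&lt;/p&gt;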
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry></feed>