<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: Git scraping</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/series/git-scraping.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2022-11-20T07:00:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-series" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
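&lt;p&gt;As a rough sketch (not the exact code I ran), here's what totalling up one committed snapshot looks like - note that &lt;code&gt;users&lt;/code&gt; can be &lt;code&gt;null&lt;/code&gt; and &lt;code&gt;statuses&lt;/code&gt; sometimes arrives as a string, so both need defensive coercion:&lt;/p&gt;

```python
import json


def snapshot_totals(content):
    """Sum users and statuses across one instances.json snapshot."""
    instances = json.loads(content)
    return {
        # "users" can be null - treat missing values as 0
        "users": sum(d["users"] or 0 for d in instances),
        # "statuses" is sometimes a string like "54870" - coerce to int
        "statuses": sum(int(d["statuses"] or 0) for d in instances),
    }


# A tiny made-up snapshot for illustration:
example = json.dumps([
    {"name": "pleroma.otter.sh", "users": 5, "statuses": "54870"},
    {"name": "example.social", "users": None, "statuses": None},
])
print(snapshot_totals(example))  # {'users': 5, 'statuses': 54870}
```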
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
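&lt;p&gt;That join is straightforward. Here's an illustrative version run against an in-memory copy of the relevant tables (a simplified sketch with made-up rows, not the view &lt;code&gt;git-history&lt;/code&gt; itself generates):&lt;/p&gt;

```python
import sqlite3

# Recreate just the columns needed for the join.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits ([id] INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY,
    _commit INTEGER REFERENCES commits(id),
    id TEXT, users INTEGER, statuses INTEGER
);
""")
db.executemany("INSERT INTO commits VALUES (?, ?, ?)", [
    (1, "abc123", "2022-11-20 01:00:00"),
    (2, "def456", "2022-11-20 06:00:00"),
])
db.executemany(
    "INSERT INTO item_version (_commit, id, users, statuses) VALUES (?, ?, ?, ?)",
    [(1, "all", 4694000, 374000000), (2, "all", 4716000, 374217860)],
)

# Join each snapshot row against its commit to recover the timestamp.
rows = db.execute("""
    SELECT commits.commit_at, item_version.users, item_version.statuses
    FROM item_version
    JOIN commits ON item_version._commit = commits.id
    ORDER BY commits.commit_at
""").fetchall()
print(rows[0])  # ('2022-11-20 01:00:00', 4694000, 374000000)
```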
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We need to make one change to that data - we need to convert the &lt;code&gt;date&lt;/code&gt; column from a string to a JavaScript date object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Measuring traffic during the Half Moon Bay Pumpkin Festival</title><link href="https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-series" rel="alternate"/><published>2022-10-19T15:41:09+00:00</published><updated>2022-10-19T15:41:09+00:00</updated><id>https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-series</id><summary type="html">
    &lt;p&gt;This weekend was the &lt;a href="https://pumpkinfest.miramarevents.com/" rel="nofollow"&gt;50th annual Half Moon Bay Pumpkin Festival&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We live in El Granada, a tiny town 8 minutes drive from Half Moon Bay. There is a single road (coastal highway one) between the two towns, and the festival is locally notorious for its impact on traffic.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat" rel="nofollow"&gt;Natalie&lt;/a&gt; suggested that we measure the traffic and try and see the impact for ourselves!&lt;/p&gt;
&lt;p&gt;Here's the end result for Saturday. Read on for details on how we created it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-collecting-the-data" class="anchor" aria-hidden="true" href="#collecting-the-data"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Collecting the data&lt;/h4&gt;
&lt;p&gt;I built a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/" rel="nofollow"&gt;git scraper&lt;/a&gt; to gather data from the Google Maps &lt;a href="https://developers.google.com/maps/documentation/directions/overview" rel="nofollow"&gt;Directions API&lt;/a&gt;. It turns out if you pass &lt;code&gt;departure_time=now&lt;/code&gt; to that API it returns the current estimated time in traffic as part of the response.&lt;/p&gt;
&lt;p&gt;I picked a location in Half Moon Bay and a location in El Granada and constructed the following URL (pretty-printed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://maps.googleapis.com/maps/api/directions/json?
  origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
  &amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
  &amp;amp;departure_time=now
  &amp;amp;key=$GOOGLE_MAPS_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two locations here are defined using Google Plus codes. Here they are on Google Maps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/FH78%2BQJ+Half+Moon+Bay,+CA,+USA" rel="nofollow"&gt;FH78+QJ Half Moon Bay, CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/GG49%2BCH+El+Granada+CA,+USA" rel="nofollow"&gt;GG49+CH El Granada CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in &lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;this repository&lt;/a&gt; to fetch this API data, pretty-print it with &lt;code&gt;jq&lt;/code&gt; and write it to the repository:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape traffic&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*/5 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;GOOGLE_MAPS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.GOOGLE_MAPS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|        &lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; one.json&lt;/span&gt;
&lt;span class="pl-s"&gt;        sleep 3&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&amp;amp;destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; two.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git pull --rebase&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using a GitHub Actions secret called &lt;code&gt;GOOGLE_MAPS_KEY&lt;/code&gt; to store the Google Maps API key.&lt;/p&gt;
&lt;p&gt;This workflow runs every 5 minutes (more-or-less - GitHub Actions doesn't necessarily stick to the schedule). It fetches the two JSON results and writes them to files called &lt;code&gt;one.json&lt;/code&gt; and &lt;code&gt;two.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I've built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-analyzing-the-data-and-drawing-some-charts" class="anchor" aria-hidden="true" href="#analyzing-the-data-and-drawing-some-charts"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Analyzing the data and drawing some charts&lt;/h4&gt;
&lt;p&gt;The trick with git scraping is that the data you care about ends up captured in &lt;a href="https://github.com/simonw/scrape-hmb-traffic/commits/main"&gt;the git commit log&lt;/a&gt;. The challenge is how to extract that back out again and turn it into something useful.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/" rel="nofollow"&gt;git-history tool&lt;/a&gt; is designed to solve this. It's a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.&lt;/p&gt;
&lt;p&gt;Normally I run it against CSV or JSON files containing an array of rows - effectively tabular data already, where I just want to record what has changed in between commits.&lt;/p&gt;
&lt;p&gt;For this project, I was storing the raw JSON output by the Google Maps API. I didn't care about most of the information in there: I really just wanted the &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.&lt;/p&gt;
&lt;p&gt;To cut a long story short, here's the incantation that worked:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db one.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;git-history file&lt;/code&gt; command is used to load the history for a specific file - in this case it's the file &lt;code&gt;one.json&lt;/code&gt;, which will be loaded into a new SQLite database file called &lt;code&gt;hmb.db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; code uses &lt;code&gt;json.loads(content)&lt;/code&gt; to load the JSON for the current file version, then pulls out the &lt;code&gt;["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/code&gt; nested value from it.&lt;/p&gt;
&lt;p&gt;If that's missing (e.g. in an earlier commit, when I hadn't yet added the &lt;code&gt;departure_time=now&lt;/code&gt; parameter to the URL) an exception will be caught and the function will return an empty list.&lt;/p&gt;
&lt;p&gt;If the &lt;code&gt;duration_in_traffic&lt;/code&gt; value is present, the function returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[{"id": "one", "duration_in_traffic": duration_in_traffic}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; likes lists of dictionaries. It's usually being run against files that contain many different rows, where the &lt;code&gt;id&lt;/code&gt; column can be used to de-dupe rows across commits and spot what has changed.&lt;/p&gt;
&lt;p&gt;In this case, each file only has a single interesting value.&lt;/p&gt;
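&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; snippet can be tried out on its own as a regular Python function before handing it to &lt;code&gt;git-history&lt;/code&gt;. Here's a rough standalone sketch - the sample JSON is a made-up, minimal version of the Google Directions API response shape, not real captured data:&lt;/p&gt;

```python
import json


def convert(content, item_id="one"):
    # Mirrors the --convert snippet: pull out the nested
    # duration_in_traffic value, or return no rows if it's missing.
    try:
        duration = json.loads(content)["routes"][0]["legs"][0][
            "duration_in_traffic"]["value"]
        return [{"id": item_id, "duration_in_traffic": duration}]
    except Exception:
        return []


# Minimal stand-in for the stored API response:
sample = json.dumps({
    "routes": [{"legs": [{"duration_in_traffic": {"value": 1110}}]}]
})
print(convert(sample))            # [{'id': 'one', 'duration_in_traffic': 1110}]
print(convert('{"routes": []}'))  # [] - the bare except swallows missing keys
```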
&lt;p&gt;Two more options are used here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--full-versions&lt;/code&gt; - tells &lt;code&gt;git-history&lt;/code&gt; to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a &lt;code&gt;null&lt;/code&gt; if a value has not changed in order to save space, but our data is tiny here so we don't need any clever optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--id id&lt;/code&gt; specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After running the above command, the resulting schema includes these tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [duration_in_traffic] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;commits&lt;/code&gt; table includes the date of the commit - &lt;code&gt;commit_at&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;item_version&lt;/code&gt; table has that &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;,
    duration_in_traffic
&lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
&lt;span class="pl-k"&gt;join&lt;/span&gt;
    commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That query returns data that looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;commit_at&lt;/th&gt;
&lt;th&gt;duration_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:09:06+00:00&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:17:38+00:00&lt;/td&gt;
&lt;td&gt;1016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:30:06+00:00&lt;/td&gt;
&lt;td&gt;1391&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A couple of problems here. First, the &lt;code&gt;commit_at&lt;/code&gt; column is in UTC, not local time. And &lt;code&gt;duration_in_traffic&lt;/code&gt; is in seconds, which aren't particularly easy to read.&lt;/p&gt;
&lt;p&gt;Here's a SQLite fix for these two issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;datetime(commits.commit_at, '-7 hours')&lt;/code&gt; parses the UTC string as a datetime, then subtracts 7 hours from it to convert to local California time (UTC-7 during daylight saving time).&lt;/p&gt;
&lt;p&gt;I wrap that in &lt;code&gt;time()&lt;/code&gt; here because for the chart I want to render I know everything will be on the same day.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mins_in_traffic&lt;/code&gt; now shows minutes, not seconds.&lt;/p&gt;
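&lt;p&gt;The same two conversions are easy to sanity-check in plain Python - this sketch reproduces what the SQLite &lt;code&gt;time(datetime(..., '-7 hours'))&lt;/code&gt; and &lt;code&gt;/ 60&lt;/code&gt; expressions are doing:&lt;/p&gt;

```python
from datetime import datetime, timedelta


def to_local_minutes(commit_at, duration_in_traffic):
    # Subtract 7 hours from the UTC timestamp (California is UTC-7
    # during daylight saving time) and floor-divide seconds by 60,
    # matching SQLite's integer division.
    local = datetime.fromisoformat(commit_at) - timedelta(hours=7)
    return local.time().isoformat(), duration_in_traffic // 60


print(to_local_minutes("2022-10-15T17:09:06+00:00", 1110))
# ('10:09:06', 18)
```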
&lt;p&gt;We now have enough data to render a chart!&lt;/p&gt;
&lt;p&gt;But... we only have one of the two directions of traffic here. To process the numbers from &lt;code&gt;two.json&lt;/code&gt; as well I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db two.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id --namespace item2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; the same as the previous command. It's running against &lt;code&gt;two.json&lt;/code&gt; instead of &lt;code&gt;one.json&lt;/code&gt;, and it's using the &lt;code&gt;--namespace item2&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This causes it to populate a new table called &lt;code&gt;item2_version&lt;/code&gt; instead of &lt;code&gt;item_version&lt;/code&gt;, which is a cheap trick to avoid having to figure out how to load both files into the same table.&lt;/p&gt;
&lt;h2&gt;&lt;a id="user-content-two-lines-on-one-chart" class="anchor" aria-hidden="true" href="#two-lines-on-one-chart"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Two lines on one chart&lt;/h2&gt;
&lt;p&gt;I rendered an initial single line chart using &lt;a href="https://datasette.io/plugins/datasette-vega" rel="nofollow"&gt;datasette-vega&lt;/a&gt;, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.&lt;/p&gt;
&lt;p&gt;Since I now had one table for each direction of traffic (&lt;code&gt;item_version&lt;/code&gt; and &lt;code&gt;item2_version&lt;/code&gt;) I decided to combine those into a single table, suitable for pasting into Google Sheets.&lt;/p&gt;
&lt;p&gt;Here's the SQL I came up with to do that:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with item1 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
),
item2 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item2_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item2_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  item1.&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;mins_in_traffic&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic_other_way
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  item1
  &lt;span class="pl-k"&gt;join&lt;/span&gt; item2 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item1&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses two CTEs (Common Table Expressions - the &lt;code&gt;with X as&lt;/code&gt; pieces) using the pattern I explained earlier - now called &lt;code&gt;item1&lt;/code&gt; and &lt;code&gt;item2&lt;/code&gt;. Having defined these two CTEs, I can join them together on the &lt;code&gt;t&lt;/code&gt; column, which is the time of day.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db?&amp;amp;install=datasette-copyable#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;Try running this query&lt;/a&gt; in Datasette Lite.&lt;/p&gt;
&lt;p&gt;Here's the output of that query for Saturday (10am to 8pm):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;th&gt;mins_in_traffic_other_way&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:47:38&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:57:37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:08:20&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:22:27&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:38:42&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:52:35&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:03:23&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:15:16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:27:51&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:37:48&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:46:41&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:55:03&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:05:10&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:17:57&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:32:55&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:44:53&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:55:22&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:21&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:17:48&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:31:04&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:41:59&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:51:48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:00:09&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:11:17&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:25:48&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:39:41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:51:11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:59:34&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:10:50&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:25:43&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:53:06&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:11:34&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:40:29&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:12:07&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:58:17&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20:05:13&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the same chart for Sunday:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This chart shows the same thing but for Sunday" src="https://static.simonwillison.net/static/2022/pumpkin-sunday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://docs.google.com/spreadsheets/d/1JOimtkugZBF_YQxqn0Gn6NiIhNz-OMH2rpOZtmECAY4/edit#gid=0" rel="nofollow"&gt;Google Sheet is here&lt;/a&gt; - the two days have two separate tabs within the sheet.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-building-the-sqlite-database-in-github-actions" class="anchor" aria-hidden="true" href="#building-the-sqlite-database-in-github-actions"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Building the SQLite database in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I did most of the development work for this project on my laptop, running &lt;code&gt;git-history&lt;/code&gt; and &lt;code&gt;datasette&lt;/code&gt; locally for speed of iteration.&lt;/p&gt;
&lt;p&gt;Once I had everything working, I decided to automate the process of building the SQLite database as well.&lt;/p&gt;
&lt;p&gt;I made the following changes to my GitHub Actions workflow:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;fetch-depth&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Needed by git-history&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Set up Python 3.10&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v4&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3.10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pip&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;pip install -r requirements.txt&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; env:&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; run&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build SQLite database&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        rm -f hmb.db # Recreate from scratch each time&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db one.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "one", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db two.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "two", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id --namespace item2&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I also added a &lt;code&gt;requirements.txt&lt;/code&gt; file containing just &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note how the &lt;code&gt;actions/checkout@v3&lt;/code&gt; step now has &lt;code&gt;fetch-depth: 0&lt;/code&gt; - this is necessary because &lt;code&gt;git-history&lt;/code&gt; needs to loop through the entire repository history, but &lt;code&gt;actions/checkout@v3&lt;/code&gt; defaults to only fetching the most recent commit.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;setup-python&lt;/code&gt; step uses &lt;code&gt;cache: "pip"&lt;/code&gt;, which causes it to cache installed dependencies from &lt;code&gt;requirements.txt&lt;/code&gt; between runs.&lt;/p&gt;
&lt;p&gt;Because that big &lt;code&gt;git-history&lt;/code&gt; step creates a &lt;code&gt;hmb.db&lt;/code&gt; SQLite database, the "Commit and push" step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.&lt;/p&gt;
&lt;p&gt;Normally I wouldn't do this, because Git isn't a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won't continue to be updated beyond the end of the pumpkin festival.&lt;/p&gt;
&lt;p&gt;End result: &lt;a href="https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"&gt;hmb.db is available&lt;/a&gt; in the GitHub repository.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-querying-it-using-datasette-lite" class="anchor" aria-hidden="true" href="#querying-it-using-datasette-lite"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Querying it using Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt; is my repackaged version of my Datasette server-side Python application which runs entirely in the user's browser, using WebAssembly.&lt;/p&gt;
&lt;p&gt;A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.&lt;/p&gt;
&lt;p&gt;These database files need to be served with CORS headers. Every file served by GitHub includes these headers!&lt;/p&gt;
&lt;p&gt;Which means the following URL can be used to open up the latest &lt;code&gt;hmb.db&lt;/code&gt; file directly in Datasette in the browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(This takes advantage of a &lt;a href="https://simonwillison.net/2022/Sep/16/weeknotes/" rel="nofollow"&gt;feature I added&lt;/a&gt; to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)&lt;/p&gt;
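&lt;p&gt;The idea behind that rewrite can be sketched as a small hypothetical helper - this is an illustration of the URL transformation, not Datasette Lite's actual implementation, which may handle more cases:&lt;/p&gt;

```python
import re


def github_blob_to_raw(url):
    # Turn a github.com "blob" page URL into the corresponding
    # raw.githubusercontent.com URL for the file itself.
    m = re.match(r"https://github\.com/([^/]+)/([^/]+)/blob/(.+)", url)
    if m is None:
        return url  # not a GitHub blob page - leave unchanged
    owner, repo, path = m.groups()
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{path}"


print(github_blob_to_raw(
    "https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"))
# https://raw.githubusercontent.com/simonw/scrape-hmb-traffic/main/hmb.db
```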
&lt;p&gt;URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And finally... Datasette Lite &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/" rel="nofollow"&gt;has plugin support&lt;/a&gt;. Adding &lt;code&gt;&amp;amp;install=datasette-copyable&lt;/code&gt; to the URL adds the &lt;a href="https://datasette.io/plugins/datasette-copyable" rel="nofollow"&gt;datasette-copyable&lt;/a&gt; plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&amp;amp;install=datasette-copyable#/hmb.copyable?sql=with+item1+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%2C%0Aitem2+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%0Aselect%0A++item1.%2A%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&amp;amp;_table_format=github" rel="nofollow"&gt;an example&lt;/a&gt; of that plugin in action.&lt;/p&gt;
&lt;p&gt;This was a fun little project that brought together a whole bunch of things I've been working on over the past few years. Here's some more of my writing on these different techniques and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/series/git-scraping/" rel="nofollow"&gt;Git scraping&lt;/a&gt; is the key technique I'm using here to collect the data&lt;/li&gt;
&lt;li&gt;I've written a lot about &lt;a href="https://simonwillison.net/tags/githubactions/" rel="nofollow"&gt;GitHub Actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;These are my notes about &lt;a href="https://simonwillison.net/tags/githistory/" rel="nofollow"&gt;git-history&lt;/a&gt;, the tool I used to turn a commit history into a SQLite database&lt;/li&gt;
&lt;li&gt;Here's my series of posts about &lt;a href="https://simonwillison.net/series/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="natalie-downe"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="git-history"/><category term="datasette-lite"/><category term="half-moon-bay"/></entry><entry><title>Automatically opening issues when tracked file content changes</title><link href="https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-series" rel="alternate"/><published>2022-04-28T17:18:14+00:00</published><updated>2022-04-28T17:18:14+00:00</updated><id>https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-series</id><summary type="html">
    &lt;p&gt;I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.&lt;/p&gt;
&lt;h4&gt;Extracting GZipMiddleware from Starlette&lt;/h4&gt;
&lt;p&gt;Here's why I needed to solve this problem.&lt;/p&gt;
&lt;p&gt;I want to add gzip support to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; open source project. Datasette builds on the Python &lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; standard, and &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; provides an extremely well tested, robust &lt;a href="https://www.starlette.io/middleware/#gzipmiddleware"&gt;GZipMiddleware class&lt;/a&gt; that adds gzip support to any ASGI application. As with everything else in Starlette, it's &lt;em&gt;really&lt;/em&gt; good code.&lt;/p&gt;
&lt;p&gt;The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.&lt;/p&gt;
&lt;p&gt;So I decided to extract the &lt;code&gt;GZipMiddleware&lt;/code&gt; class into a separate Python package, under the same BSD license as Starlette itself.&lt;/p&gt;
&lt;p&gt;The result is my new &lt;a href="https://pypi.org/project/asgi-gzip/"&gt;asgi-gzip&lt;/a&gt; package, now available on PyPI.&lt;/p&gt;
&lt;h4&gt;What if Starlette fixes a bug?&lt;/h4&gt;
&lt;p&gt;The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the &lt;code&gt;GZipMiddleware&lt;/code&gt; class? How can I make sure to apply those same fixes to my extracted copy?&lt;/p&gt;
&lt;p&gt;As I thought about this challenge, I realized I had most of the solution already.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;&lt;/strong&gt; is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.&lt;/p&gt;
&lt;p&gt;It may seem redundant to do this against a file that already &lt;a href="https://github.com/encode/starlette/commits/master/starlette/middleware/gzip.py"&gt;lives in version control&lt;/a&gt; elsewhere - but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.&lt;/p&gt;
&lt;p&gt;I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.&lt;/p&gt;
&lt;p&gt;Since I already run all of my projects out of GitHub issues, automatically opening an issue against the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip repository&lt;/a&gt; would be ideal.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt; does exactly that: it implements the Git scraping pattern against the &lt;a href="https://github.com/encode/starlette/blob/master/starlette/middleware/gzip.py"&gt;gzip.py module&lt;/a&gt; in Starlette, and files an issue any time it detects changes to that file.&lt;/p&gt;
&lt;p&gt;Starlette haven't made any changes to that file since I started tracking it, so I created &lt;a href="https://github.com/simonw/issue-when-changed"&gt;a test repo&lt;/a&gt; to try this out.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/issue-when-changed/issues/3"&gt;one of the example issues&lt;/a&gt;. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/issue-when-changed.jpg" alt="Screenshot of an open issue page. The issues is titled &amp;quot;gzip.py was updated&amp;quot; and contains a visual diff showing the change to a file. A commit that references the issue is listed too." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The implementation is contained entirely in this &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt;. I deliberately kept it to a single file so it's easy to copy, paste and adapt for other projects.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/actions/github-script"&gt;actions/github-script&lt;/a&gt;, which makes it easy to do things like file new issues using JavaScript.&lt;/p&gt;
&lt;p&gt;Here's a heavily annotated copy:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Track the Starlette version of this&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run on repo pushes, and if a user clicks the "run this action" button,&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; and on a schedule at 5:21am UTC every day&lt;/span&gt;
&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;21 5 * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Without this block I got this error when the action ran:&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; HttpError: Resource not accessible by integration&lt;/span&gt;
&lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to create issues&lt;/span&gt;
  &lt;span class="pl-ent"&gt;issues&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to commit back to the repository&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contents&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;check&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/github-script@v6&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Using env: here to demonstrate how an action like this can&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; be adjusted to take dynamic inputs&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;URL&lt;/span&gt;: &lt;span class="pl-s"&gt;https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py&lt;/span&gt;
        &lt;span class="pl-ent"&gt;FILE_NAME&lt;/span&gt;: &lt;span class="pl-s"&gt;tracking/gzip.py&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;script&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { URL, FILE_NAME } = process.env;&lt;/span&gt;
&lt;span class="pl-s"&gt;          // promisify pattern for getting an await version of child_process.exec&lt;/span&gt;
&lt;span class="pl-s"&gt;          const util = require("util");&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Used exec_ here because 'exec' variable name is already used:&lt;/span&gt;
&lt;span class="pl-s"&gt;          const exec_ = util.promisify(require("child_process").exec);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use curl to download the file&lt;/span&gt;
&lt;span class="pl-s"&gt;          await exec_(`curl -o ${FILE_NAME} ${URL}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use 'git diff' to detect if the file has changed since last time&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { stdout } = await exec_(`git diff ${FILE_NAME}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          if (stdout) {&lt;/span&gt;
&lt;span class="pl-s"&gt;            // There was a diff to that file&lt;/span&gt;
&lt;span class="pl-s"&gt;            const title = `${FILE_NAME} was updated`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            const body =&lt;/span&gt;
&lt;span class="pl-s"&gt;              `${URL} changed:` +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n\n```diff\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              stdout +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n```\n\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "Close this issue once those changes have been integrated here";&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issue = await github.rest.issues.create({&lt;/span&gt;
&lt;span class="pl-s"&gt;              owner: context.repo.owner,&lt;/span&gt;
&lt;span class="pl-s"&gt;              repo: context.repo.repo,&lt;/span&gt;
&lt;span class="pl-s"&gt;              title: title,&lt;/span&gt;
&lt;span class="pl-s"&gt;              body: body,&lt;/span&gt;
&lt;span class="pl-s"&gt;            });&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issueNumber = issue.data.number;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // Now commit and reference that issue number, so the commit shows up&lt;/span&gt;
&lt;span class="pl-s"&gt;            // listed at the bottom of the issue page&lt;/span&gt;
&lt;span class="pl-s"&gt;            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // https://til.simonwillison.net/github-actions/commit-if-file-changed&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.name "Automated"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.email "actions@users.noreply.github.com"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git add -A`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git commit -m "${commitMessage}" || exit 0`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git pull --rebase`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git push`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt; repository I keep the fetched &lt;code&gt;gzip.py&lt;/code&gt; file in a &lt;code&gt;tracking/&lt;/code&gt; directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.&lt;/p&gt;
&lt;h4&gt;More interesting applications&lt;/h4&gt;
&lt;p&gt;I built this to solve my "tell me when Starlette update their &lt;code&gt;gzip.py&lt;/code&gt; file" problem, but clearly this pattern has much more interesting uses.&lt;/p&gt;
&lt;p&gt;You could point this at any web page to get a new GitHub issue opened when that page's content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.&lt;/p&gt;
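&lt;p&gt;The core detection step doesn't depend on GitHub at all. Here's a minimal Python sketch of the same idea, using a stored hash in place of the &lt;code&gt;git diff&lt;/code&gt; step in the workflow above (the function and file names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical sketch: compare a hash of the freshly fetched content against
# the hash recorded on the previous run. In the workflow above "git diff"
# plays this role; here a small state file does.
import hashlib
from pathlib import Path

def content_changed(new_content: bytes, state_file: Path) -> bool:
    """Record the hash of new_content and report whether it changed."""
    digest = hashlib.sha256(new_content).hexdigest()
    previous = state_file.read_text() if state_file.exists() else None
    state_file.write_text(digest)
    return digest != previous
```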
&lt;p&gt;There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!&lt;/p&gt;
&lt;h4&gt;Update: October 13th 2022&lt;/h4&gt;
&lt;p&gt;Almost six months after writing about this... it triggered for the first time!&lt;/p&gt;
&lt;p&gt;Here's the issue that the script opened: &lt;a href="https://github.com/simonw/asgi-gzip/issues/4"&gt;#4: tracking/gzip.py was updated&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.2"&gt;version 0.2&lt;/a&gt; of the package.&lt;/p&gt;
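&lt;p&gt;The change makes sense: if a response already carries a &lt;code&gt;Content-Encoding&lt;/code&gt; header, gzipping it again would double-encode the body. A minimal sketch of that guard (paraphrased - not Starlette's actual code):&lt;/p&gt;

```python
# Paraphrased sketch of the guard the fix implies: only compress a
# response that is not already content-encoded.
def should_gzip(headers: dict) -> bool:
    """Return True if the response headers carry no Content-Encoding."""
    return "content-encoding" not in {k.lower() for k in headers}
```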
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="gzip"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="asgi"/><category term="github-actions"/><category term="git-scraping"/><category term="github-issues"/></entry><entry><title>Scraping web pages from the command line with shot-scraper</title><link href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-series" rel="alternate"/><published>2022-03-14T01:29:56+00:00</published><updated>2022-03-14T01:29:56+00:00</updated><id>https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-series</id><summary type="html">
    &lt;p&gt;I've added a powerful new capability to my &lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt; command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.&lt;/p&gt;
&lt;p&gt;Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.&lt;/p&gt;
&lt;p&gt;It's also a really neat web scraping tool.&lt;/p&gt;
&lt;h4&gt;shot-scraper&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;introduced shot-scraper&lt;/a&gt; last Thursday. It's a Python utility that wraps &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/simonwillison-net.png" alt="Screenshot of my blog homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Since Thursday &lt;code&gt;shot-scraper&lt;/code&gt; has had &lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;a flurry of releases&lt;/a&gt;, adding features like &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#saving-a-web-page-to-pdf"&gt;PDF exports&lt;/a&gt;, the ability to dump the Chromium &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#dumping-out-an-accessibility-tree"&gt;accessibility tree&lt;/a&gt; and the ability to take screenshots of &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#websites-that-need-authentication"&gt;authenticated web pages&lt;/a&gt;. But the most exciting new feature landed today.&lt;/p&gt;
&lt;h4&gt;Executing JavaScript and returning the result&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.9"&gt;Release 0.9&lt;/a&gt; takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or you can return a JSON object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to use functions like &lt;code&gt;setTimeout()&lt;/code&gt; - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript datasette.io "
new Promise(done =&amp;gt; setTimeout(
  () =&amp;gt; {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Test page title&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    shot-scraper javascript datasette.io "&lt;/span&gt;
&lt;span class="pl-s"&gt;      if (document.title != 'Datasette') {&lt;/span&gt;
&lt;span class="pl-s"&gt;        throw 'Wrong title detected';&lt;/span&gt;
&lt;span class="pl-s"&gt;      }"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="scrape-a-web-page"&gt;Using this to scrape a web page&lt;/h4&gt;
&lt;p&gt;The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.&lt;/p&gt;
&lt;p&gt;Posts from my blog occasionally show up on &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; - sometimes I spot them, sometimes I don't.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/from?site=simonwillison.net"&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;/a&gt; is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/news-ycombinator-com-from.png" alt="Screenshot of the Hacker News listing for my domain" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So... let's write a scraper for it.&lt;/p&gt;
&lt;p&gt;I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.athing'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;title&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.score'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;submitter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.hnuser'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'?id='&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Only posts with comments have a comments link&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
    &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
  &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;filter&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;includes&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'comment'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;id&lt;span class="pl-kos"&gt;,&lt;/span&gt; title&lt;span class="pl-kos"&gt;,&lt;/span&gt; url&lt;span class="pl-kos"&gt;,&lt;/span&gt; dt&lt;span class="pl-kos"&gt;,&lt;/span&gt; points&lt;span class="pl-kos"&gt;,&lt;/span&gt; submitter&lt;span class="pl-kos"&gt;,&lt;/span&gt; commentsUrl&lt;span class="pl-kos"&gt;,&lt;/span&gt; numComments&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.&lt;/p&gt;
&lt;p&gt;I'm using &lt;code&gt;document.querySelectorAll('.athing')&lt;/code&gt; to loop through each element that matches that selector.&lt;/p&gt;
&lt;p&gt;I wrap that with &lt;code&gt;Array.from(...)&lt;/code&gt;, passing a mapping function as the second argument. That way, for each element I can extract the details that I need.&lt;/p&gt;
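&lt;p&gt;This works because &lt;code&gt;Array.from()&lt;/code&gt; accepts a mapping function as an optional second argument. A quick standalone illustration:&lt;/p&gt;

```javascript
// Array.from(iterable, mapFn) converts any iterable (including a NodeList)
// into a real array, applying mapFn to each element along the way.
const lengths = Array.from(["a", "bb", "ccc"], s => s.length);
console.log(lengths); // logs: [ 1, 2, 3 ]
```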
&lt;p&gt;The resulting array contains 30 items that look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Track changes to CLI tools by recording their help output&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://simonwillison.net/2022/Feb/2/help-scraping/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dt"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2022-03-13T05:36:13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"submitter"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;appwiz&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"commentsUrl"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://news.ycombinator.com/item?id=30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"numComments"&lt;/span&gt;: &lt;span class="pl-c1"&gt;19&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Running it with shot-scraper&lt;/h4&gt;
&lt;p&gt;Now that I have a recipe for a scraper, I can run it in the terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;shot-scraper javascript &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('.athing'), el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;  const title = el.querySelector('.titleline a').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const points = parseInt(el.nextSibling.querySelector('.score').innerText);&lt;/span&gt;
&lt;span class="pl-s"&gt;  const url = el.querySelector('.titleline a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const dt = el.nextSibling.querySelector('.age').title;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const submitter = el.nextSibling.querySelector('.hnuser').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsUrl = el.nextSibling.querySelector('.age a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const id = commentsUrl.split('?id=')[1];&lt;/span&gt;
&lt;span class="pl-s"&gt;  // Only posts with comments have a comments link&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsLink = Array.from(&lt;/span&gt;
&lt;span class="pl-s"&gt;    el.nextSibling.querySelectorAll('a')&lt;/span&gt;
&lt;span class="pl-s"&gt;  ).filter(el =&amp;gt; el &amp;amp;&amp;amp; el.innerText.includes('comment'))[0];&lt;/span&gt;
&lt;span class="pl-s"&gt;  let numComments = 0;&lt;/span&gt;
&lt;span class="pl-s"&gt;  if (commentsLink) {&lt;/span&gt;
&lt;span class="pl-s"&gt;    numComments = parseInt(commentsLink.innerText.split()[0]);&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;/span&gt;
&lt;span class="pl-s"&gt;  return {id, title, url, dt, points, submitter, commentsUrl, numComments};&lt;/span&gt;
&lt;span class="pl-s"&gt;})&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; simonwillison-net.json&lt;/pre&gt;&lt;/div&gt;  
&lt;p&gt;&lt;code&gt;simonwillison-net.json&lt;/code&gt; is now a JSON file containing the scraped data.&lt;/p&gt;
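&lt;p&gt;From here the file can be processed with any JSON-aware tool. A quick Python sketch (the sample records below are illustrative, not real data):&lt;/p&gt;

```python
# Load the scraped posts and pull out the highest-scoring ones.
# The "points" and "title" keys match the scraper output shown above.
import json

def top_posts(posts, n=5):
    """Return the n posts with the most points, highest first."""
    return sorted(posts, key=lambda p: p["points"], reverse=True)[:n]

# Illustrative data - a real run would use json.load(open("simonwillison-net.json"))
sample = json.loads('[{"title": "A", "points": 5}, {"title": "B", "points": 12}]')
print(top_posts(sample, n=1))  # [{'title': 'B', 'points': 12}]
```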
&lt;h4&gt;Running the scraper in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I want to keep track of changes to this data structure over time. My preferred technique for that is something I call &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - the core idea is to keep the data in a Git repository and commit a fresh copy any time it changes. This provides a cheap and robust history of changes over time.&lt;/p&gt;
&lt;p&gt;Running the scraper in GitHub Actions means I don't need to administer my own server to keep it running.&lt;/p&gt;
&lt;p&gt;So I built exactly that, in the &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain"&gt;simonw/scrape-hacker-news-by-domain&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;The GitHub Actions workflow is in &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/blob/485841482a39869759e39f4d8dee21b9adc963d7/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt;. It runs the above command once an hour, then pushes a commit back to the repository should the file have any changes since last time it ran.&lt;/p&gt;
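&lt;p&gt;The "push a commit only if the file changed" step in a workflow like that typically boils down to a few lines of shell. This is a sketch of the pattern, not the exact contents of &lt;code&gt;scrape.yml&lt;/code&gt;:&lt;/p&gt;

```yaml
- name: Commit and push if the data changed
  run: |-
    git config user.name "Automated"
    git config user.email "actions@users.noreply.github.com"
    git add -A
    timestamp=$(date -u)
    # "|| exit 0" makes the step succeed when there is nothing to commit
    git commit -m "Latest data: ${timestamp}" || exit 0
    git push
```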
&lt;p&gt;The &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json"&gt;commit history of simonwillison-net.json&lt;/a&gt; will show me any time a new link from my site appears on Hacker News, or a comment is added.&lt;/p&gt;
&lt;p&gt;(Fun GitHub trick: add &lt;code&gt;.atom&lt;/code&gt; to the end of that URL to get &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json.atom"&gt;an Atom feed of those commits&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.&lt;/p&gt;
&lt;p&gt;I can see myself using this technique &lt;em&gt;a lot&lt;/em&gt; in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="github"/><category term="hacker-news"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper: automated screenshots for documentation, built on Playwright</title><link href="https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-series" rel="alternate"/><published>2022-03-10T00:13:30+00:00</published><updated>2022-03-10T00:13:30+00:00</updated><id>https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt; is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; and &lt;a href="https://simonwillison.net/2022/Feb/2/help-scraping/"&gt;help scraping&lt;/a&gt; techniques.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 13th March 2022:&lt;/strong&gt; The new &lt;code&gt;shot-scraper javascript&lt;/code&gt; command can now be used to &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/"&gt;scrape web pages from the command line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 14th October 2022:&lt;/strong&gt; &lt;a href="https://simonwillison.net/2022/Oct/14/automating-screenshots/"&gt;Automating screenshots for the Datasette documentation using shot-scraper&lt;/a&gt; offers a tutorial introduction to using the tool.&lt;/p&gt;
&lt;h4&gt;The problem&lt;/h4&gt;
&lt;p&gt;I like to include screenshots in documentation. I recently &lt;a href="https://simonwillison.net/2022/Feb/27/datasette-tutorials/"&gt;started writing end-user tutorials&lt;/a&gt; for Datasette, which are particularly image heavy (&lt;a href="https://datasette.io/tutorials/explore"&gt;for example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.&lt;/p&gt;
&lt;h4&gt;Introducing shot-scraper&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is a tool for automating this process. You can install it using &lt;code&gt;pip&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install shot-scraper
shot-scraper install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That second &lt;code&gt;shot-scraper install&lt;/code&gt; line will install the browser it needs to do its job - more on that later.&lt;/p&gt;
&lt;p&gt;You can use it in two ways. To take a one-off screenshot, you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://simonwillison.net/ -o simonwillison.png
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://simonwillison.net/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;simonwillison.png&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.example.com/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;height&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;quality&lt;/span&gt;: &lt;span class="pl-c1"&gt;80&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.jpg&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then use &lt;code&gt;shot-scraper multi&lt;/code&gt; to execute every screenshot in one go:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html"&gt;The documentation&lt;/a&gt; describes all of the available options you can use when taking a screenshot.&lt;/p&gt;
&lt;p&gt;Each option can be provided to the &lt;code&gt;shot-scraper&lt;/code&gt; one-off tool, or can be embedded in the YAML file for use with &lt;code&gt;shot-scraper multi&lt;/code&gt;.&lt;/p&gt;
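&lt;p&gt;One way to picture this is that each YAML entry is a dictionary of overrides merged into the tool's defaults. Here's a hedged Python sketch of that merge, assuming shot-scraper's default browser width of 1280px; the fallback output name and the exact option set are assumptions for illustration, not the tool's real internals:&lt;/p&gt;

```python
# Illustrative sketch: each YAML shot entry is a dict of overrides
# merged into defaults. The 1280px width matches shot-scraper's default;
# the other defaults and the fallback filename are assumptions.
DEFAULTS = {"width": 1280, "height": None, "quality": None, "selector": None}

def resolve_shot(shot):
    """Merge one YAML entry (as a dict) with shot-scraper-style defaults."""
    if "url" not in shot:
        raise ValueError("each shot needs a url")
    resolved = dict(DEFAULTS)
    resolved.update(shot)
    resolved.setdefault("output", "screenshot.png")
    return resolved
```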
&lt;h4&gt;JavaScript and CSS selectors&lt;/h4&gt;
&lt;p&gt;The default behaviour for &lt;code&gt;shot-scraper&lt;/code&gt; is to take a full page screenshot, using a browser width of 1280px.&lt;/p&gt;
&lt;p&gt;For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--selector&lt;/code&gt; option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.&lt;/p&gt;
&lt;p&gt;What if you want to modify the page in addition to selecting a specific area?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--javascript&lt;/code&gt; option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.&lt;/p&gt;
&lt;p&gt;The combination of these two options - also available as &lt;code&gt;javascript:&lt;/code&gt; and &lt;code&gt;selector:&lt;/code&gt; keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.&lt;/p&gt;
&lt;h4 id="a-complex-example"&gt;A complex example&lt;/h4&gt;
&lt;p&gt;To prove to myself that the tool works, I decided to try replicating this screenshot from &lt;a href="https://datasette.io/tutorials/explore"&gt;my tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I made the original using &lt;a href="https://cleanshot.com/"&gt;CleanShot X&lt;/a&gt;, manually adding the two pink arrows:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/select-facets-original.jpg" alt="A screenshot of a portion of the table interface in Datasette, with a menu open and two pink arrows pointing to menu items" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is pretty tricky!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It's not &lt;a href="https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez"&gt;this whole page&lt;/a&gt;, just a subset of the page&lt;/li&gt;
&lt;li&gt;The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot&lt;/li&gt;
&lt;li&gt;There are two pink arrows superimposed on the image&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I decided to use just one arrow for the moment, which should hopefully result in a clearer image.&lt;/p&gt;
&lt;p&gt;I started by &lt;a href="https://github.com/simonw/shot-scraper/issues/9#issuecomment-1063314278"&gt;creating my own pink arrow SVG&lt;/a&gt; using Figma:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/pink-arrow.png" alt="A big pink arrow, with a drop shadow" style="width: 200px; max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.&lt;/p&gt;
&lt;p&gt;With the JavaScript figured out, I pasted it into a YAML file called &lt;code&gt;shot.yml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez&lt;/span&gt;
  &lt;span class="pl-ent"&gt;javascript&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;    new Promise(resolve =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Run in a promise so we can sleep 1s at the end&lt;/span&gt;
&lt;span class="pl-s"&gt;      function remove(el) { el.parentNode.removeChild(el);}&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove header and footer&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('header'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('footer'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove most of the children of .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('.content &amp;gt; *:not(.table-wrapper,.suggested-facets)')).map(remove)&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Bit of breathing room for the screenshot&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.body.style.marginTop = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a bit of padding to .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      var content = document.querySelector('.content');&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.width = '820px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.padding = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Open the menu - it's an SVG so we need to use dispatchEvent here&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove all but table header and first 11 rows&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a pink SVG arrow&lt;/span&gt;
&lt;span class="pl-s"&gt;      let div = document.createElement('div');&lt;/span&gt;
&lt;span class="pl-s"&gt;      div.innerHTML = `&amp;lt;svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;g filter="url(#a)"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/g&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feFlood flood-opacity="0" result="BackgroundImageFix"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feOffset dy="4"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feGaussianBlur stdDeviation="2"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feComposite in2="hardAlpha" operator="out"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;/filter&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;/svg&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;      let svg = div.firstChild;&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.appendChild(svg);&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.position = 'relative';&lt;/span&gt;
&lt;span class="pl-s"&gt;      svg.style.position = 'absolute';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Give the menu time to finish fading in&lt;/span&gt;
&lt;span class="pl-s"&gt;      setTimeout(() =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        // Position arrow pointing to the 'facet by this' menu item&lt;/span&gt;
&lt;span class="pl-s"&gt;        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.left = (pos.left - pos.width) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.top = (pos.top - 20) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        resolve();&lt;/span&gt;
&lt;span class="pl-s"&gt;      }, 1000);&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;annotated-screenshot.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;selector&lt;/span&gt;: &lt;span class="pl-s"&gt;.content&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this command to generate the screenshot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shot.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated &lt;code&gt;annotated-screenshot.png&lt;/code&gt; image looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/annotated-screenshot.png" alt="A screenshot of the table with the menu open and a single pink arrow pointing to the 'facet by this' menu item" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm pretty happy with this! I think it works very well as a proof of concept for the process.&lt;/p&gt;
&lt;h4 id="how-it-works-playwright"&gt;How it works: Playwright&lt;/h4&gt;
&lt;p&gt;I built the &lt;a href="https://github.com/simonw/shot-scraper/tree/44995cd45ca6c56d34c5c3d131217f7b9170f6f7"&gt;first prototype&lt;/a&gt; of &lt;code&gt;shot-scraper&lt;/code&gt; using Puppeteer, because I had &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/"&gt;used that before&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I noticed that the &lt;a href="https://www.npmjs.com/package/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt; package I was using hadn't had an update in two years, which reminded me to check out Playwright.&lt;/p&gt;
&lt;p&gt;I've been looking for an excuse to learn &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt; for a while now, and this project turned out to be ideal.&lt;/p&gt;
&lt;p&gt;Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.&lt;/p&gt;
&lt;p&gt;Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.&lt;/p&gt;
&lt;p&gt;The second prototype used the &lt;a href="https://github.com/simonw/shot-scraper/tree/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0"&gt;Playwright CLI utility&lt;/a&gt; instead, &lt;a href="https://github.com/simonw/shot-scraper/blob/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0/shot_scraper/cli.py#L39-L50"&gt;executed via npx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
    [
        &lt;span class="pl-s"&gt;"npx"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"playwright"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"--full-page"&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;output&lt;/span&gt;,
    ],
    &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
)&lt;/pre&gt;
&lt;p&gt;This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.&lt;/p&gt;
&lt;p&gt;I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official &lt;a href="https://playwright.dev/python/docs/intro"&gt;Playwright for Python&lt;/a&gt; package.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install playwright
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.&lt;/p&gt;
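&lt;p&gt;To give a flavour of that API, here's a rough sketch of what taking an element screenshot looks like with Playwright's sync API. This is a simplified illustration rather than shot-scraper's actual implementation, and running it requires both &lt;code&gt;pip install playwright&lt;/code&gt; and &lt;code&gt;playwright install&lt;/code&gt; (hence the deferred import):&lt;/p&gt;

```python
def take_shot(url, output, selector=None, javascript=None, width=1280):
    """Rough sketch of a screenshot using Playwright's sync API.

    Simplified from what shot-scraper does; needs `pip install playwright`
    and `playwright install` to actually run, so the import is deferred.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": 720})
        page.goto(url)
        if javascript:
            page.evaluate(javascript)  # runs after load, before the shot
        if selector:
            page.locator(selector).screenshot(path=output)
        else:
            page.screenshot(path=output, full_page=True)
        browser.close()
```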
&lt;p&gt;I was curious how they pulled this off, so I dug inside the &lt;code&gt;playwright&lt;/code&gt; Python package in my &lt;code&gt;site-packages&lt;/code&gt; folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.&lt;/p&gt;
&lt;p&gt;Thanks to Playwright, the entire implementation of &lt;code&gt;shot-scraper&lt;/code&gt; is currently just &lt;a href="https://github.com/simonw/shot-scraper/blob/0.3/shot_scraper/cli.py"&gt;181 lines of Python code&lt;/a&gt; - it's all glue code tying together a &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; CLI interface with some code that calls Playwright to do the actual work.&lt;/p&gt;
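&lt;p&gt;The CLI surface itself is equally compact. As a dependency-free sketch of the glue-code side (shot-scraper really uses Click, so this argparse version is purely illustrative), the options described above map to a parser like this:&lt;/p&gt;

```python
import argparse

def build_parser():
    """CLI surface sketch; shot-scraper itself uses Click, not argparse."""
    parser = argparse.ArgumentParser(prog="shot-scraper")
    parser.add_argument("url", help="URL of the page to screenshot")
    parser.add_argument("-o", "--output", default="shot.png")
    parser.add_argument("-s", "--selector", help="CSS selector to capture")
    parser.add_argument("--javascript", help="JS to run before the shot")
    parser.add_argument("-w", "--width", type=int, default=1280)
    return parser
```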
&lt;p&gt;I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; Electron application.&lt;/p&gt;
&lt;h4&gt;Hooking shot-scraper up to GitHub Actions&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;shot-scraper&lt;/code&gt; very much with GitHub Actions in mind.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-demo"&gt;shot-scraper-demo&lt;/a&gt; repository is my first live demo of the tool.&lt;/p&gt;
&lt;p&gt;Once a day, it runs &lt;a href="https://github.com/simonw/shot-scraper-demo/blob/3fdd9d3e79f95d9d396aeefd5bf65e85a7700ef4/.github/workflows/shots.yml"&gt;this shots.yml&lt;/a&gt; file, generates two screenshots and commits them back to the repository.&lt;/p&gt;
&lt;p&gt;One of them is the tutorial screenshot described above.&lt;/p&gt;
&lt;p&gt;The other is a screenshot of the list of "recently spotted owls" from &lt;a href="https://www.owlsnearme.com/?place=127871"&gt;this page&lt;/a&gt; on &lt;a href="https://www.owlsnearme.com/"&gt;owlsnearme.com&lt;/a&gt;. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.&lt;/p&gt;
&lt;p&gt;I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, &lt;a href="https://github.com/simonw/shot-scraper-demo/commit/bc86510f49b6f8d6728c9f1880b999c83361dd5a#diff-897c3444fbbb2033cbba5840da4994d01c3f396e0cdf4b0613d7f410db9887e0"&gt;like this one&lt;/a&gt; (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).&lt;/p&gt;
&lt;p&gt;Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I had ambitious plans to add utilities to the tool that would &lt;a href="https://github.com/simonw/shot-scraper/issues/9"&gt;help with annotations&lt;/a&gt;, such as adding pink arrows and drawing circles around different elements on the page.&lt;/p&gt;
&lt;p&gt;I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.&lt;/p&gt;
&lt;p&gt;So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.&lt;/p&gt;
&lt;p&gt;I'm also very interested to see what kinds of things other people use this for.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="documentation"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="puppeteer"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-series" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-series</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you check out the previous version, showing that there were 11 commands added just in this morning's new release.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awschanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history&lt;/a&gt; of the changes they are making there too. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
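&lt;p&gt;For illustration, here's a minimal Python sketch of how a client might discover the JSON alternate from a header like that. It assumes the bracket-less header form quoted above rather than implementing the full RFC 8288 grammar, and the function names are illustrative:&lt;/p&gt;

```python
def parse_link_header(value):
    """Parse a Link header of the form shown above into URL plus attributes.

    Deliberately minimal: handles the bracket-less form quoted in this
    post, not the full RFC 8288 grammar.
    """
    url, *params = [part.strip() for part in value.split(";")]
    attrs = {}
    for param in params:
        key, _, val = param.partition("=")
        attrs[key.strip()] = val.strip().strip('"')
    return url, attrs

def find_json_alternate(headers):
    """Return the alternate-JSON URL advertised by a response, if any."""
    value = headers.get("Link", "")
    if not value:
        return None
    url, attrs = parse_link_header(value)
    if attrs.get("rel") == "alternate" and "json" in attrs.get("type", ""):
        return url
    return None
```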
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind the scenes material from the movie the Mitchells vs the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-series" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-series</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintains an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
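&lt;p&gt;The core of that scraper loop is small enough to sketch. Here's a minimal illustration in Python (the real scraper is a shell step in a GitHub Actions workflow; the file name and commit message here are hypothetical):&lt;/p&gt;

```python
import json
import subprocess

def snapshot(raw: str) -> str:
    # Pretty-print with stable key order so that successive
    # snapshots produce small, readable diffs - the role jq
    # plays in the shell version of this kind of scraper.
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"

def commit_if_changed(path: str = "incidents.json") -> None:
    # git commit exits non-zero when the file is unchanged,
    # which conveniently makes a no-change run a no-op.
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", "Latest data"])
```

&lt;p&gt;Run on a schedule, this produces a commit only when the underlying data actually changed.&lt;/p&gt;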
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to identify which records have changed over time, as opposed to being newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool; see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions, the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
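&lt;p&gt;The idea can be sketched as a small function - this is an illustration of the sparse-version concept, not git-history's actual implementation:&lt;/p&gt;

```python
def diff_version(previous: dict, current: dict):
    # Record only the columns whose value differs from the previous
    # version, plus the set of changed column names - so a genuine
    # change *to* None can be told apart from "column unchanged".
    changed = {
        key for key in current
        if key not in previous or current[key] != previous[key]
    }
    sparse = {key: current[key] for key in changed}
    return sparse, changed

v1 = {"AcresBurned": 500, "PercentContained": 10, "Type": "Wildfire"}
v2 = {"AcresBurned": 1200, "PercentContained": None, "Type": "Wildfire"}
sparse, changed = diff_version(v1, v2)
# sparse keeps the None for PercentContained because that column
# appears in the changed set; Type is omitted entirely
```
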
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for data that has been scraped to not fit this ideal shape.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and there's a bunch of data in there that's quite noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even if there are no changes, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the &lt;code&gt;--convert&lt;/code&gt; option to the script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
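&lt;p&gt;The compile-a-string-into-a-function step can be sketched like this - an illustrative approximation using &lt;code&gt;exec&lt;/code&gt;, not git-history's actual implementation:&lt;/p&gt;

```python
import json
import textwrap

def compile_convert(code: str):
    # Wrap the user-supplied body in a generator function definition,
    # then exec it in a namespace that exposes the json module -
    # roughly how a --convert string could become a callable that
    # runs against each Git version of the file in turn.
    body = textwrap.indent(textwrap.dedent(code), "    ")
    source = "def convert(content):\n" + body
    namespace = {"json": json}
    exec(source, namespace)
    return namespace["convert"]

convert = compile_convert('''
data = json.loads(content)
for item in data["Incidents"]:
    yield item
''')

rows = list(convert('{"Incidents": [{"UniqueId": "a1", "Name": "Dixie Fire"}]}'))
```

&lt;p&gt;Because the body uses &lt;code&gt;yield&lt;/code&gt;, the compiled function is a generator - which is why a conversion snippet can both skip bad versions with a bare &lt;code&gt;return&lt;/code&gt; and emit any number of records.&lt;/p&gt;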
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos now built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Git scraping, the five minute lightning talk</title><link href="https://simonwillison.net/2021/Mar/5/git-scraping/#atom-series" rel="alternate"/><published>2021-03-05T00:44:15+00:00</published><updated>2021-03-05T00:44:15+00:00</updated><id>https://simonwillison.net/2021/Mar/5/git-scraping/#atom-series</id><summary type="html">
    &lt;p&gt;I prepared a lightning talk about &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; for the &lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR 2021&lt;/a&gt; data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC's vaccination data using the GitHub web interface. Here's the video.&lt;/p&gt;
&lt;div class="resp-container"&gt;
    &lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/2CjA-03yK8I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;
&lt;/div&gt;
&lt;h4&gt;Notes from the talk&lt;/h4&gt;
&lt;p&gt;Here's &lt;a href="https://m.pge.com/#outages"&gt;the PG&amp;amp;E outage map&lt;/a&gt; that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.&lt;/p&gt;
&lt;p&gt;I scraped that outage data into &lt;a href="https://github.com/simonw/pge-outages"&gt;simonw/pge-outages&lt;/a&gt; - here's the &lt;a href="https://github.com/simonw/pge-outages/commits"&gt;commit history&lt;/a&gt; (over 40,000 commits now!)&lt;/p&gt;
&lt;p&gt;The scraper code itself &lt;a href="https://github.com/simonw/disaster-scrapers/blob/3eed6eca820e14e2f89db3910d1aece72717d387/pge.py"&gt;is here&lt;/a&gt;. I wrote about the project in detail in &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - my database of outages is at &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages"&gt;pge-outages.simonwillison.net&lt;/a&gt; and the animation I made of outages over time is attached to &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;this tweet&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;The much simpler scraper for the &lt;a href="https://www.fire.ca.gov/incidents"&gt;www.fire.ca.gov/incidents&lt;/a&gt; website is at &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the video I used that as the template to create a new scraper for CDC vaccination data - their website is &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations"&gt;https://covid.cdc.gov/covid-data-tracker/#vaccinations&lt;/a&gt; and the API I found using the browser developer tools is &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new CDC scraper and the data it has scraped lives in &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;simonw/cdc-vaccination-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can find more examples of Git scraping in the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="scraping"/><category term="my-talks"/><category term="github-actions"/><category term="git-scraping"/><category term="annotated-talks"/><category term="nicar"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-series" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th January 2022:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - The &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt; tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, recording the current state of a resource and capturing its changes over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size (largest first) reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
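&lt;p&gt;If you'd rather avoid the &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; dependencies, the same fetch-and-pretty-print step can be sketched in Python using just the standard library. This is an illustrative equivalent rather than the workflow's actual code; passing &lt;code&gt;sort_keys=True&lt;/code&gt; goes a step further than &lt;code&gt;jq .&lt;/code&gt; by also stabilizing key order between fetches:&lt;/p&gt;

```python
import json
import urllib.request

URL = "https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"

def pretty(raw):
    """Re-serialize JSON with stable indentation (and key order) so diffs stay readable."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"

def scrape(url=URL, path="incidents.json"):
    """Fetch the JSON feed and write a diff-friendly copy to disk."""
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode("utf-8")
    with open(path, "w") as f:
        f.write(pretty(raw))
```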
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Tracking PG&amp;E outages by scraping to a git repo</title><link href="https://simonwillison.net/2019/Oct/10/pge-outages/#atom-series" rel="alternate"/><published>2019-10-10T23:32:14+00:00</published><updated>2019-10-10T23:32:14+00:00</updated><id>https://simonwillison.net/2019/Oct/10/pge-outages/#atom-series</id><summary type="html">
    &lt;p&gt;PG&amp;amp;E have &lt;a href="https://twitter.com/bedwardstiek/status/1182047040932470784"&gt;cut off power&lt;/a&gt; to several million people in northern California, supposedly as a precaution against wildfires.&lt;/p&gt;

&lt;p&gt;As it happens, I've been scraping and recording PG&amp;amp;E's outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!&lt;/p&gt;

&lt;p&gt;The original data lives in &lt;a href="https://github.com/simonw/pge-outages"&gt;a GitHub repo&lt;/a&gt; (more importantly in &lt;a href="https://github.com/simonw/pge-outages/commits/master"&gt;the commit history&lt;/a&gt; of that repo).&lt;/p&gt;

&lt;p&gt;Reading JSON in a Git repo isn't particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The result is &lt;code&gt;https://pge-outages.simonwillison.net/&lt;/code&gt; (no longer available)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update from 27th October 2019&lt;/strong&gt;: I also used the data to create this animation (first shared &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;on Twitter&lt;/a&gt;):&lt;/p&gt;

&lt;video style="max-width: 100%" src="https://static.simonwillison.net/static/2019/outages.mp4" controls="controls"&gt;
  Your browser does not support the video tag.
&lt;/video&gt;

&lt;h3 id="thedatamodeloutagesandsnapshots"&gt;The data model: outages and snapshots&lt;/h3&gt;

&lt;p&gt;The three key tables to understand are &lt;code&gt;outages&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt; and &lt;code&gt;outage_snapshots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PG&amp;amp;E assign an outage ID to every outage - where an outage is usually something that affects a few dozen customers. I store these in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages?_sort_desc=outageStartTime"&gt;outages table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots?_sort_desc=id"&gt;snapshots table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting table is &lt;code&gt;outage_snapshots&lt;/code&gt;. Every time I see an outage in the JSON feed, I record a new copy of its data as an &lt;code&gt;outage_snapshot&lt;/code&gt; row. This allows me to reconstruct the full history of any outage, in 10 minute increments.&lt;/p&gt;

&lt;p&gt;Here are &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outage_snapshots?snapshot=1269"&gt;all of the outages&lt;/a&gt; that were represented in &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots/1269"&gt;snapshot 1269&lt;/a&gt; - captured at 4:10pm Pacific Time today.&lt;/p&gt;

&lt;p&gt;I can run &lt;code&gt;select sum(estCustAffected) from outage_snapshots where snapshot = 1269&lt;/code&gt; (&lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+sum%28estCustAffected%29+from+outage_snapshots+where+snapshot+%3D+%3Aid&amp;amp;id=1269"&gt;try it here&lt;/a&gt;) to count up the total PG&amp;amp;E estimate of the number of affected customers - it's 545,706!&lt;/p&gt;

&lt;p&gt;I've installed &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; which means I can render graphs. Here's my first attempt at a graph showing &lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+snapshots.id%2C+title+as+snapshotTime%2C+hash%2C+sum%28outage_snapshots.estCustAffected%29+as+totalEstCustAffected%0D%0Afrom+snapshots+join+outage_snapshots+on+snapshots.id+%3D+outage_snapshots.snapshot%0D%0Agroup+by+snapshots.id+order+by+snapshots.id+desc+limit+150#g.mark=line&amp;amp;g.x_column=snapshotTime&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=totalEstCustAffected&amp;amp;g.y_type=quantitative"&gt;the number of estimated customers affected over time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(I don't know why there's a dip towards the end of the graph).&lt;/p&gt;

&lt;p&gt;I also defined &lt;a href="https://pge-outages.simonwillison.net/pge-outages/most_recent_snapshot"&gt;a SQL view&lt;/a&gt; which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&amp;amp;E website hasn't gone down) and renders them using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-map.jpg" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-map.jpg" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="thingstobeawareof"&gt;Things to be aware of&lt;/h3&gt;

&lt;p&gt;There are a huge number of unanswered questions about this data. I've just been looking at PG&amp;amp;E's JSON and making guesses about what things like &lt;code&gt;estCustAffected&lt;/code&gt; mean. Without official documentation we can only guess at how accurate this data is, or how it should be interpreted.&lt;/p&gt;

&lt;p&gt;Some things to question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the quality of this data? Does it accurately reflect what's actually going on out there?&lt;/li&gt;

&lt;li&gt;What's the exact meaning of the different columns - &lt;code&gt;estCustAffected&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;, &lt;code&gt;hazardFlag&lt;/code&gt; etc?&lt;/li&gt;

&lt;li&gt;Various columns (&lt;code&gt;lastUpdateTime&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;) appear to be integer &lt;a href="https://en.wikipedia.org/wiki/Unix_time"&gt;unix timestamps&lt;/a&gt;. What timezone were they recorded in? Do they include DST etc?&lt;/li&gt;
&lt;/ul&gt;
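&lt;p&gt;One note on that last question: an integer unix timestamp is by definition a count of seconds since 1970-01-01 UTC, so if these fields really are unix timestamps the timezone question answers itself; the open question is whether PG&amp;amp;E actually populates them that way. Decoding one as UTC in Python looks like this (the timestamp value below is illustrative, not taken from the feed):&lt;/p&gt;

```python
from datetime import datetime, timezone

def decode(ts):
    """Interpret an integer unix timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# An illustrative value: 1570752000 decodes to midnight UTC on 2019-10-11
print(decode(1570752000).isoformat())
```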

&lt;h3 id="howitworks"&gt;How it works&lt;/h3&gt;

&lt;p&gt;I originally wrote the scraper &lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;back in October 2017&lt;/a&gt; during the North Bay fires, and moved it to run on Circle CI based on my work building &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;a commit history of San Francisco's trees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's pretty simple: every 10 minutes &lt;a href="https://circleci.com/gh/simonw/disaster-scrapers"&gt;a Circle CI job&lt;/a&gt; runs which scrapes &lt;a href="https://apim.pge.com/cocoutage/outages/getOutagesRegions?regionType=city&amp;amp;expand=true"&gt;the JSON feed&lt;/a&gt; that powers the PG&amp;amp;E website's &lt;a href="https://www.pge.com/myhome/outages/outage/index.shtml"&gt;outage map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The JSON is then committed to my &lt;a href="https://github.com/simonw/pge-outages"&gt;pge-outages GitHub repository&lt;/a&gt;, overwriting the existing &lt;a href="https://github.com/simonw/pge-outages/blob/master/pge-outages.json"&gt;pge-outages.json file&lt;/a&gt;. There's some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.&lt;/p&gt;

&lt;h3 id="buildingthedatasette"&gt;Building the Datasette&lt;/h3&gt;

&lt;p&gt;The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.&lt;/p&gt;

&lt;p&gt;After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.&lt;/p&gt;

&lt;p&gt;The code that generates the database can be found in &lt;a href="https://github.com/simonw/pge-outages/blob/master/build_database.py"&gt;build_database.py&lt;/a&gt;. I used &lt;a href="https://gitpython.readthedocs.io/en/stable/"&gt;GitPython&lt;/a&gt; to read data from the git repository and my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils library&lt;/a&gt; to create and update the database.&lt;/p&gt;
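&lt;p&gt;The overall shape of that script can be sketched with the standard library alone. To be clear, this is a simplified sketch rather than the real &lt;code&gt;build_database.py&lt;/code&gt; (which uses GitPython and sqlite-utils), and it covers only a subset of the schema described above; the &lt;code&gt;"id"&lt;/code&gt; key for outages is an assumption about the feed's JSON:&lt;/p&gt;

```python
import json
import sqlite3
import subprocess

def file_history(repo_dir, path):
    """Yield (commit_hash, file_contents) for every commit touching path, oldest first."""
    commits = subprocess.run(
        ["git", "log", "--reverse", "--format=%H", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.split()
    for commit in commits:
        contents = subprocess.run(
            ["git", "show", "{}:{}".format(commit, path)],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
        yield commit, contents

def load_snapshots(db, history):
    """Record each (commit, JSON text) pair as a snapshot plus its outage_snapshots rows."""
    db.execute("create table if not exists snapshots (id integer primary key, hash text)")
    db.execute(
        "create table if not exists outage_snapshots "
        "(snapshot integer, outage text, estCustAffected integer)"
    )
    for commit, contents in history:
        snapshot_id = db.execute(
            "insert into snapshots (hash) values (?)", (commit,)
        ).lastrowid
        # "id" is an assumed key name; estCustAffected is the field discussed above
        for outage in json.loads(contents):
            db.execute(
                "insert into outage_snapshots values (?, ?, ?)",
                (snapshot_id, outage["id"], outage.get("estCustAffected")),
            )
    db.commit()
```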

&lt;h3 id="deployment"&gt;Deployment&lt;/h3&gt;

&lt;p&gt;Since this is a large database that changes every ten minutes, I couldn't use the usual &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.&lt;/p&gt;

&lt;p&gt;Instead, I'm running it on a VPS instance. I ended up trying out Digital Ocean for this, after &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;an enjoyable Twitter conversation&lt;/a&gt; about good options for stateful (as opposed to stateless) hosting.&lt;/p&gt;

&lt;h3 id="nextsteps"&gt;Next steps&lt;/h3&gt;

&lt;p&gt;I'm putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology - or take what I've done and spin up much more interesting forks of it.&lt;/p&gt;

&lt;p&gt;If you build something interesting with this please let me know, via email (swillison is my Gmail) or &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="digitalocean"/><category term="sqlite-utils"/></entry><entry><title>Generating a commit log for San Francisco's official list of trees</title><link href="https://simonwillison.net/2019/Mar/13/tree-history/#atom-series" rel="alternate"/><published>2019-03-13T14:49:48+00:00</published><updated>2019-03-13T14:49:48+00:00</updated><id>https://simonwillison.net/2019/Mar/13/tree-history/#atom-series</id><summary type="html">
    &lt;p&gt;San Francisco has a &lt;a href="https://datasf.org/"&gt;neat open data portal&lt;/a&gt; (as do an &lt;a href="https://opendatainception.io/"&gt;increasingly large number&lt;/a&gt; of cities these days). For a few years my favourite file on there has been &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Street Tree List&lt;/a&gt;, a list of all 190,000 trees in the city maintained by the Department of Public Works.&lt;/p&gt;
&lt;p&gt;I’ve been using that file for Datasette demos &lt;a href="https://simonwillison.net/2017/Nov/25/new-in-datasette/"&gt;for a while now&lt;/a&gt;, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; is a living document.&lt;/p&gt;
&lt;p&gt;Back in September 2017 I built a &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;scraping project relating to hurricane Irma&lt;/a&gt;. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.&lt;/p&gt;
&lt;p&gt;I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.&lt;/p&gt;
&lt;h3&gt;&lt;a id="sftreehistory_8"&gt;&lt;/a&gt;sf-tree-history&lt;/h3&gt;
&lt;p&gt;Here’s the result: &lt;a href="https://github.com/simonw/sf-tree-history"&gt;sf-tree-history&lt;/a&gt;, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt;, a &lt;code&gt;README&lt;/code&gt;, and a &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;Circle CI configuration&lt;/a&gt; that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.&lt;/p&gt;
&lt;p&gt;The most interesting part of the repo is the &lt;a href="https://github.com/simonw/sf-tree-history/commits/master"&gt;commit history&lt;/a&gt; itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/7ab432cdcb8d7914cfea4a5b59803f38cade532b"&gt;7ab432cdcb8d7914cfea4a5b59803f38cade532b&lt;/a&gt; from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/d6b258959af9546909b2eee836f0156ed88cd45d"&gt;d6b258959af9546909b2eee836f0156ed88cd45d&lt;/a&gt; from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;ca66d9a5fdd632549301d249c487004a5b68abf2&lt;/a&gt; lists 2151 rows changed, 1280 rows added! I found an old copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; on my laptop from April 2018, so for fun I loaded it into the repository and used &lt;code&gt;git commit amend&lt;/code&gt; to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view &lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;sadly truncates it&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;a id="csvdiff_18"&gt;&lt;/a&gt;csv-diff&lt;/h3&gt;
&lt;p&gt;One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.&lt;/p&gt;
&lt;p&gt;The result is &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt;, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.&lt;/p&gt;
&lt;p&gt;Using it looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {&amp;quot;id&amp;quot;: &amp;quot;3&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Bailey&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;1&amp;quot;}

1 row removed

  {&amp;quot;id&amp;quot;: &amp;quot;2&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Pancakes&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;2&amp;quot;}

1 row changed

  Row 1
    age: &amp;quot;4&amp;quot; =&amp;gt; &amp;quot;5&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/csv-diff/blob/master/README.md"&gt;csv-diff README&lt;/a&gt; has further details on the tool.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Circle_CI_44"&gt;&lt;/a&gt;Circle CI&lt;/h3&gt;
&lt;p&gt;My favourite thing about the &lt;code&gt;sf-tree-history&lt;/code&gt; project is that it costs me nothing to run - either in hosting costs or (hopefully) in terms of ongoing maintenance.&lt;/p&gt;
&lt;p&gt;The git repository is hosted for free on GitHub. Because it’s a public project, &lt;a href="https://circleci.com/"&gt;Circle CI&lt;/a&gt; will run tasks against it for free.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;.circleci/config.yml&lt;/a&gt; does the rest. It uses Circle’s &lt;a href="https://circleci.com/docs/2.0/workflows/#scheduling-a-workflow"&gt;cron syntax&lt;/a&gt; to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv &amp;quot;https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD&amp;quot;
git add Street_Tree_List.csv
git config --global user.email &amp;quot;treebot@example.com&amp;quot;
git config --global user.name &amp;quot;Treebot&amp;quot;
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID &amp;gt; message.txt
git commit -F message.txt &amp;amp;&amp;amp; \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This script does all of the work.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First it backs up the existing &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; as &lt;code&gt;Street_Tree_List-old.csv&lt;/code&gt;, in order to be able to run a comparison later.&lt;/li&gt;
&lt;li&gt;It downloads the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; from the San Francisco data portal&lt;/li&gt;
&lt;li&gt;It adds the file to the git index and sets itself an identity for use in the commit&lt;/li&gt;
&lt;li&gt;It installs my &lt;code&gt;csv-diff&lt;/code&gt; utility &lt;a href="https://pypi.org/project/csv-diff/"&gt;from PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It uses &lt;code&gt;csv-diff&lt;/code&gt; to create a diff of the two files, and writes that diff to a new file called &lt;code&gt;message.txt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finally, it attempts to create a new commit using &lt;code&gt;message.txt&lt;/code&gt; as the commit message, then pushes the result to GitHub&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the &lt;code&gt;run&lt;/code&gt; block return a non-0 exit code. &lt;code&gt;git commit&lt;/code&gt; returns a non-0 exit code if you attempt to run it but none of the files have changed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git commit ... &amp;amp;&amp;amp; git push ... || true&lt;/code&gt; ensures that if &lt;code&gt;git commit&lt;/code&gt; succeeds the &lt;code&gt;git push&lt;/code&gt; command will be run, BUT if it fails the &lt;code&gt;|| true&lt;/code&gt; will still return a 0 exit code for the overall line - so Circle CI will not mark the build as failed.&lt;/p&gt;
&lt;p&gt;There’s one last trick here: I’m using &lt;code&gt;git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master&lt;/code&gt; to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are &lt;a href="https://circleci.com/docs/2.0/env-vars/"&gt;the recommended way&lt;/a&gt; to configure secrets such that they cannot be viewed by anyone browsing &lt;a href="https://circleci.com/gh/simonw/sf-tree-history"&gt;your Circle CI builds&lt;/a&gt;. I created a &lt;a href="https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line"&gt;personal GitHub auth token&lt;/a&gt; for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Next_steps_78"&gt;&lt;/a&gt;Next steps&lt;/h3&gt;
&lt;p&gt;I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff utility&lt;/a&gt;) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/san-francisco"&gt;san-francisco&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="san-francisco"/><category term="git-scraping"/></entry><entry><title>Changelogs to help understand the fires in the North Bay</title><link href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-series" rel="alternate"/><published>2017-10-10T06:48:07+00:00</published><updated>2017-10-10T06:48:07+00:00</updated><id>https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-series</id><summary type="html">
&lt;p&gt;The situation in the counties north of San Francisco &lt;a href="http://www.sfgate.com/bayarea/article/Latest-on-North-Bay-fires-A-really-rough-12263721.php"&gt;is horrifying right now&lt;/a&gt;. I’ve repurposed some of &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;the tools I built for the Irma Response project&lt;/a&gt; last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.&lt;/p&gt;
&lt;p&gt;I’m scraping a number of sources relevant to the crisis, and making the data available in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a repository on GitHub&lt;/a&gt;. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself, it’s the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;this history of changes&lt;/a&gt; to &lt;a href="https://srcity.org/610/Emergency-Information"&gt;srcity.org/610/Emergency-Information&lt;/a&gt; to see when they edited the information.&lt;/p&gt;
&lt;p&gt;The sources I’m tracking right now are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://srcity.org/610/Emergency-Information"&gt;Santa Rosa Fire Department’s Emergency Information&lt;/a&gt; page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;History of changes to that page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.pge.com/#outages"&gt;PG&amp;amp;E power outages&lt;/a&gt;. This is probably the highest quality dataset with the &lt;a href="https://github.com/simonw/irma-scraped-data/commit/50ab3d3f3a5f117054e3209c7f0d520e6b483f0e#diff-2432d375ba73b2c87c88f55b12a0a2f0"&gt;neatest commit messages&lt;/a&gt;. The &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/pge-outages-individual.json"&gt;commit history of these&lt;/a&gt; shows exactly when new outages are reported and how many customers were affected.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://roadconditions.sonoma-county.org/"&gt;Road Conditions in the County of Sonoma&lt;/a&gt;. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/sonoma-road-conditions.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;California Highway Patrol Incidents, extracted from a KML feed on &lt;a href="http://quickmap.dot.ca.gov/"&gt;quickmap.dot.ca.gov&lt;/a&gt;. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/chp-incidents.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The code for the scrapers can be &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/north_bay.py"&gt;found in north_bay.py&lt;/a&gt;. Please leave comments, feedback or suggestions on other useful potential sources of data &lt;a href="https://github.com/simonw/simonwillisonblog/issues/4"&gt;in this GitHub issue&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry><entry><title>Scraping hurricane Irma</title><link href="https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-series" rel="alternate"/><published>2017-09-10T06:21:17+00:00</published><updated>2017-09-10T06:21:17+00:00</updated><id>https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-series</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://www.irmaresponse.org/"&gt;Irma Response project&lt;/a&gt; is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The &lt;a href="https://irma-api.herokuapp.com/"&gt;Irma API&lt;/a&gt; is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the &lt;a href="https://www.irmashelters.org/"&gt;irmashelters.org&lt;/a&gt; website.&lt;/p&gt;
&lt;p&gt;To aid this effort, I built a collection of screen scrapers that pull data from a number of different websites and APIs. That data is then stored in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a Git repository&lt;/a&gt;, providing a clear history of changes made to the various sources that are being tracked.&lt;/p&gt;
&lt;p&gt;Some of the scrapers also publish their findings to Slack in a format designed to make it obvious when key events happen, such as new shelters being added or removed from public listings.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Tracking_changes_over_time_8"&gt;&lt;/a&gt;Tracking changes over time&lt;/h3&gt;
&lt;p&gt;A key goal of this screen scraping mechanism is to allow changes to the underlying data sources to be tracked over time. This is achieved using git, via the GitHub API. Each scraper pulls down data from a source (an API or a website) and reformats that data into a sanitized JSON format. That JSON is then written to the git repository. If the data has changed since the last time the scraper ran, those changes will be captured by git and made available in the commit log.&lt;/p&gt;
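&lt;p&gt;Writing through the GitHub API means the scrapers don’t need a local checkout at all. Here’s a rough sketch of that write step using the GitHub contents endpoint; the repository name and token are placeholders, and this isn’t my actual scraper code:&lt;/p&gt;

```python
import base64
import json
import urllib.request

# Placeholder repository; the real data lives in simonw/irma-scraped-data
API = "https://api.github.com/repos/OWNER/REPO/contents/{path}"


def encode_content(text):
    """The GitHub contents API expects the new file body as base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def put_file(path, text, message, token, sha=None):
    """Create or update a file in the repository. Pass sha (the blob sha
    of the existing file) when updating; omit it when creating."""
    body = {"message": message, "content": encode_content(text)}
    if sha:
        body["sha"] = sha
    req = urllib.request.Request(
        API.format(path=path),
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"token {token}"},
    )
    return urllib.request.urlopen(req)
```

&lt;p&gt;Each successful PUT creates a commit, so the history accumulates automatically on GitHub’s side.&lt;/p&gt;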
&lt;p&gt;Recent changes tracked by the scraper collection can be seen here: &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master"&gt;https://github.com/simonw/irma-scraped-data/commits/master&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Generating_useful_commit_messages_14"&gt;&lt;/a&gt;Generating useful commit messages&lt;/h3&gt;
&lt;p&gt;The most complex code for most of the scrapers isn’t in fetching the data: it’s in generating useful, human-readable commit messages that summarize the underlying change. For example, here is &lt;a href="https://github.com/simonw/irma-scraped-data/commit/7919aeff0913ec26d1bea8dc"&gt;a commit message&lt;/a&gt; generated by the scraper that tracks the &lt;a href="http://www.floridadisaster.org/shelters/summary.aspx"&gt;http://www.floridadisaster.org/shelters/summary.aspx&lt;/a&gt; page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;florida-shelters.json: 2 shelters added

Added shelter: Atwater Elementary School (Sarasota County)
Added shelter: DEBARY ELEMENTARY SCHOOL (Volusia County)
Change detected on http://www.floridadisaster.org/shelters/summary.aspx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full commit also shows the changes to the underlying JSON, but the human-readable message provides enough information that people who are not JSON-literate programmers can still derive value from the commit.&lt;/p&gt;
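&lt;p&gt;Messages like that can be produced by diffing the previous snapshot against the new one. A hypothetical sketch of the idea (the real code is in the irma-scrapers repository):&lt;/p&gt;

```python
def shelter_message(url, before, after):
    """Build a human-readable commit message from two snapshots of a
    shelter list, summarizing additions and removals in plain English."""
    old, new = set(before), set(after)
    added, removed = sorted(new - old), sorted(old - new)
    summary = []
    if added:
        summary.append(f"{len(added)} shelters added")
    if removed:
        summary.append(f"{len(removed)} shelters removed")
    lines = ["florida-shelters.json: " + ", ".join(summary)]
    lines += [f"Added shelter: {name}" for name in added]
    lines += [f"Removed shelter: {name}" for name in removed]
    lines.append(f"Change detected on {url}")
    return "\n".join(lines)
```

&lt;p&gt;The scraper only needs to remember the names it saw last time; the set difference does the rest.&lt;/p&gt;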
&lt;h3&gt;&lt;a id="Publishing_to_Slack_26"&gt;&lt;/a&gt;Publishing to Slack&lt;/h3&gt;
&lt;p&gt;The Irma Response team use Slack to co-ordinate their efforts. You can join their Slack here: &lt;a href="https://irma-response-slack.herokuapp.com/"&gt;https://irma-response-slack.herokuapp.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Some of the scrapers publish detected changes in their data source to Slack, as links to the commits generated for each change. The human-readable message is posted directly to the channel.&lt;/p&gt;
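&lt;p&gt;Posting to Slack needs little more than an incoming webhook. This sketch assumes a webhook URL you’ve configured in Slack; it isn’t the actual bot code:&lt;/p&gt;

```python
import json
import urllib.request


def slack_payload(message, commit_url):
    """Combine the human-readable commit message with a link to the commit."""
    return {"text": f"{message}\n{commit_url}"}


def post_to_slack(webhook_url, message, commit_url):
    """Send the payload to a Slack incoming webhook."""
    data = json.dumps(slack_payload(message, commit_url)).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```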
&lt;p&gt;&lt;img style="width: 100%" src="http://static.simonwillison.net.s3.amazonaws.com/static/2017/irma-slack.jpg" alt="Bot publishing to Slack" /&gt;&lt;/p&gt;
&lt;p&gt;The source code for all of the scrapers can be found at &lt;a href="https://github.com/simonw/irma-scrapers"&gt;https://github.com/simonw/irma-scrapers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This entry started out as a &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/README.md"&gt;README file&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry></feed>