<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: Git scraping</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/series/git-scraping.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2022-11-20T07:00:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-series" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
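&lt;p&gt;As a rough sketch (not the exact code I ran), here's what totalling up one committed snapshot looks like - note that &lt;code&gt;users&lt;/code&gt; can be &lt;code&gt;null&lt;/code&gt; and &lt;code&gt;statuses&lt;/code&gt; sometimes arrives as a string, so both need defensive coercion:&lt;/p&gt;

```python
import json


def snapshot_totals(content):
    """Sum users and statuses across one instances.json snapshot."""
    instances = json.loads(content)
    return {
        # "users" can be null - treat missing values as 0
        "users": sum(d["users"] or 0 for d in instances),
        # "statuses" is sometimes a string like "54870" - coerce to int
        "statuses": sum(int(d["statuses"] or 0) for d in instances),
    }


# A tiny made-up snapshot for illustration:
example = json.dumps([
    {"name": "pleroma.otter.sh", "users": 5, "statuses": "54870"},
    {"name": "example.social", "users": None, "statuses": None},
])
print(snapshot_totals(example))  # {'users': 5, 'statuses': 54870}
```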
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
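&lt;p&gt;That join is straightforward. Here's an illustrative version run against an in-memory copy of the relevant tables (a simplified sketch with made-up rows, not the view &lt;code&gt;git-history&lt;/code&gt; itself generates):&lt;/p&gt;

```python
import sqlite3

# Recreate just the columns needed for the join.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits ([id] INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY,
    _commit INTEGER REFERENCES commits(id),
    id TEXT, users INTEGER, statuses INTEGER
);
""")
db.executemany("INSERT INTO commits VALUES (?, ?, ?)", [
    (1, "abc123", "2022-11-20 01:00:00"),
    (2, "def456", "2022-11-20 06:00:00"),
])
db.executemany(
    "INSERT INTO item_version (_commit, id, users, statuses) VALUES (?, ?, ?, ?)",
    [(1, "all", 4694000, 374000000), (2, "all", 4716000, 374217860)],
)

# Join each snapshot row against its commit to recover the timestamp.
rows = db.execute("""
    SELECT commits.commit_at, item_version.users, item_version.statuses
    FROM item_version
    JOIN commits ON item_version._commit = commits.id
    ORDER BY commits.commit_at
""").fetchall()
print(rows[0])  # ('2022-11-20 01:00:00', 4694000, 374000000)
```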
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We need to make one change to that data - we need to convert the &lt;code&gt;date&lt;/code&gt; column from a string to a JavaScript date object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Measuring traffic during the Half Moon Bay Pumpkin Festival</title><link href="https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-series" rel="alternate"/><published>2022-10-19T15:41:09+00:00</published><updated>2022-10-19T15:41:09+00:00</updated><id>https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-series</id><summary type="html">
    &lt;p&gt;This weekend was the &lt;a href="https://pumpkinfest.miramarevents.com/" rel="nofollow"&gt;50th annual Half Moon Bay Pumpkin Festival&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We live in El Granada, a tiny town 8 minutes drive from Half Moon Bay. There is a single road (coastal highway one) between the two towns, and the festival is locally notorious for its impact on traffic.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat" rel="nofollow"&gt;Natalie&lt;/a&gt; suggested that we measure the traffic and try and see the impact for ourselves!&lt;/p&gt;
&lt;p&gt;Here's the end result for Saturday. Read on for details on how we created it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-collecting-the-data" class="anchor" aria-hidden="true" href="#collecting-the-data"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Collecting the data&lt;/h4&gt;
&lt;p&gt;I built a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/" rel="nofollow"&gt;git scraper&lt;/a&gt; to gather data from the Google Maps &lt;a href="https://developers.google.com/maps/documentation/directions/overview" rel="nofollow"&gt;Directions API&lt;/a&gt;. It turns out if you pass &lt;code&gt;departure_time=now&lt;/code&gt; to that API it returns the current estimated time in traffic as part of the response.&lt;/p&gt;
&lt;p&gt;I picked a location in Half Moon Bay and a location in El Granada and constructed the following URL (pretty-printed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://maps.googleapis.com/maps/api/directions/json?
  origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
  &amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
  &amp;amp;departure_time=now
  &amp;amp;key=$GOOGLE_MAPS_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two locations here are defined using Google Plus codes. Here they are on Google Maps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/FH78%2BQJ+Half+Moon+Bay,+CA,+USA" rel="nofollow"&gt;FH78+QJ Half Moon Bay, CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/GG49%2BCH+El+Granada+CA,+USA" rel="nofollow"&gt;GG49+CH El Granada CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in &lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;this repository&lt;/a&gt; to fetch this API data, pretty-print it with &lt;code&gt;jq&lt;/code&gt; and write it to the repository:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape traffic&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*/5 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;GOOGLE_MAPS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.GOOGLE_MAPS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|        &lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; one.json&lt;/span&gt;
&lt;span class="pl-s"&gt;        sleep 3&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&amp;amp;destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; two.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git pull --rebase&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using a GitHub Actions secret called &lt;code&gt;GOOGLE_MAPS_KEY&lt;/code&gt; to store the Google Maps API key.&lt;/p&gt;
&lt;p&gt;This workflow runs every 5 minutes (more-or-less - GitHub Actions doesn't necessarily stick to the schedule). It fetches the two JSON results and writes them to files called &lt;code&gt;one.json&lt;/code&gt; and &lt;code&gt;two.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I've built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-analyzing-the-data-and-drawing-some-charts" class="anchor" aria-hidden="true" href="#analyzing-the-data-and-drawing-some-charts"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Analyzing the data and drawing some charts&lt;/h4&gt;
&lt;p&gt;The trick with git scraping is that the data you care about ends up captured in &lt;a href="https://github.com/simonw/scrape-hmb-traffic/commits/main"&gt;the git commit log&lt;/a&gt;. The challenge is how to extract that back out again and turn it into something useful.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/" rel="nofollow"&gt;git-history tool&lt;/a&gt; is designed to solve this. It's a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.&lt;/p&gt;
&lt;p&gt;Normally I run it against CSV or JSON files containing an array of rows - effectively tabular data already, where I just want to record what has changed in between commits.&lt;/p&gt;
&lt;p&gt;For this project, I was storing the raw JSON output by the Google Maps API. I didn't care about most of the information in there: I really just wanted the &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.&lt;/p&gt;
&lt;p&gt;To cut a long story short, here's the incantation that worked:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db one.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;git-history file&lt;/code&gt; command is used to load the history for a specific file - in this case it's the file &lt;code&gt;one.json&lt;/code&gt;, which will be loaded into a new SQLite database file called &lt;code&gt;hmb.db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; code uses &lt;code&gt;json.loads(content)&lt;/code&gt; to load the JSON for the current file version, then pulls out the &lt;code&gt;["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/code&gt; nested value from it.&lt;/p&gt;
&lt;p&gt;If that's missing (e.g. in an earlier commit, when I hadn't yet added the &lt;code&gt;departure_time=now&lt;/code&gt; parameter to the URL) an exception will be caught and the function will return an empty list.&lt;/p&gt;
&lt;p&gt;If the &lt;code&gt;duration_in_traffic&lt;/code&gt; value is present, the function returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[{"id": "one", "duration_in_traffic": duration_in_traffic}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; likes lists of dictionaries. It's usually being run against files that contain many different rows, where the &lt;code&gt;id&lt;/code&gt; column can be used to de-dupe rows across commits and spot what has changed.&lt;/p&gt;
&lt;p&gt;In this case, each file only has a single interesting value.&lt;/p&gt;
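&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; snippet can be tried out on its own as a regular Python function before handing it to &lt;code&gt;git-history&lt;/code&gt;. Here's a rough standalone sketch - the sample JSON is a made-up, minimal version of the Google Directions API response shape, not real captured data:&lt;/p&gt;

```python
import json


def convert(content, item_id="one"):
    # Mirrors the --convert snippet: pull out the nested
    # duration_in_traffic value, or return no rows if it's missing.
    try:
        duration = json.loads(content)["routes"][0]["legs"][0][
            "duration_in_traffic"]["value"]
        return [{"id": item_id, "duration_in_traffic": duration}]
    except Exception:
        return []


# Minimal stand-in for the stored API response:
sample = json.dumps({
    "routes": [{"legs": [{"duration_in_traffic": {"value": 1110}}]}]
})
print(convert(sample))            # [{'id': 'one', 'duration_in_traffic': 1110}]
print(convert('{"routes": []}'))  # [] - the bare except swallows missing keys
```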
&lt;p&gt;Two more options are used here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--full-versions&lt;/code&gt; - tells &lt;code&gt;git-history&lt;/code&gt; to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a &lt;code&gt;null&lt;/code&gt; if a value has not changed in order to save space, but our data is tiny here so we don't need any clever optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--id id&lt;/code&gt; specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After running the above command, the resulting schema includes these tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [duration_in_traffic] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;commits&lt;/code&gt; table includes the date of the commit - &lt;code&gt;commit_at&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;item_version&lt;/code&gt; table has that &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;,
    duration_in_traffic
&lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
&lt;span class="pl-k"&gt;join&lt;/span&gt;
    commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That query returns data that looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;commit_at&lt;/th&gt;
&lt;th&gt;duration_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:09:06+00:00&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:17:38+00:00&lt;/td&gt;
&lt;td&gt;1016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:30:06+00:00&lt;/td&gt;
&lt;td&gt;1391&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A couple of problems here. First, the &lt;code&gt;commit_at&lt;/code&gt; column is in UTC, not local time. And &lt;code&gt;duration_in_traffic&lt;/code&gt; is in seconds, which aren't particularly easy to read.&lt;/p&gt;
&lt;p&gt;Here's a SQLite fix for these two issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;datetime(commits.commit_at, '-7 hours')&lt;/code&gt; parses the UTC string as a datetime, then subtracts 7 hours from it to convert to local California time (UTC-7 during daylight saving time).&lt;/p&gt;
&lt;p&gt;I wrap that in &lt;code&gt;time()&lt;/code&gt; here because for the chart I want to render I know everything will be on the same day.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mins_in_traffic&lt;/code&gt; now shows minutes, not seconds.&lt;/p&gt;
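&lt;p&gt;The same two conversions are easy to sanity-check in plain Python - this sketch reproduces what the SQLite &lt;code&gt;time(datetime(..., '-7 hours'))&lt;/code&gt; and &lt;code&gt;/ 60&lt;/code&gt; expressions are doing:&lt;/p&gt;

```python
from datetime import datetime, timedelta


def to_local_minutes(commit_at, duration_in_traffic):
    # Subtract 7 hours from the UTC timestamp (California is UTC-7
    # during daylight saving time) and floor-divide seconds by 60,
    # matching SQLite's integer division.
    local = datetime.fromisoformat(commit_at) - timedelta(hours=7)
    return local.time().isoformat(), duration_in_traffic // 60


print(to_local_minutes("2022-10-15T17:09:06+00:00", 1110))
# ('10:09:06', 18)
```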
&lt;p&gt;We now have enough data to render a chart!&lt;/p&gt;
&lt;p&gt;But... we only have one of the two directions of traffic here. To process the numbers from &lt;code&gt;two.json&lt;/code&gt; as well I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db two.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id --namespace item2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; the same as the previous command. It's running against &lt;code&gt;two.json&lt;/code&gt; instead of &lt;code&gt;one.json&lt;/code&gt;, and it's using the &lt;code&gt;--namespace item2&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This causes it to populate a new table called &lt;code&gt;item2_version&lt;/code&gt; instead of &lt;code&gt;item_version&lt;/code&gt;, which is a cheap trick to avoid having to figure out how to load both files into the same table.&lt;/p&gt;
&lt;h2&gt;&lt;a id="user-content-two-lines-on-one-chart" class="anchor" aria-hidden="true" href="#two-lines-on-one-chart"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Two lines on one chart&lt;/h2&gt;
&lt;p&gt;I rendered an initial single line chart using &lt;a href="https://datasette.io/plugins/datasette-vega" rel="nofollow"&gt;datasette-vega&lt;/a&gt;, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.&lt;/p&gt;
&lt;p&gt;Since I now had one table for each direction of traffic (&lt;code&gt;item_version&lt;/code&gt; and &lt;code&gt;item2_version&lt;/code&gt;) I decided to combine those into a single table, suitable for pasting into Google Sheets.&lt;/p&gt;
&lt;p&gt;Here's the SQL I came up with to do that:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with item1 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
),
item2 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item2_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item2_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  item1.&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;mins_in_traffic&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic_other_way
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  item1
  &lt;span class="pl-k"&gt;join&lt;/span&gt; item2 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item1&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses two CTEs (Common Table Expressions - the &lt;code&gt;with X as&lt;/code&gt; pieces) using the pattern I explained earlier - now called &lt;code&gt;item1&lt;/code&gt; and &lt;code&gt;item2&lt;/code&gt;. Having defined these two CTEs, I can join them together on the &lt;code&gt;t&lt;/code&gt; column, which is the time of day.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db?&amp;amp;install=datasette-copyable#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;Try running this query&lt;/a&gt; in Datasette Lite.&lt;/p&gt;
&lt;p&gt;Here's the output of that query for Saturday (10am to 8pm):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;th&gt;mins_in_traffic_other_way&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:47:38&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:57:37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:08:20&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:22:27&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:38:42&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:52:35&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:03:23&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:15:16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:27:51&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:37:48&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:46:41&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:55:03&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:05:10&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:17:57&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:32:55&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:44:53&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:55:22&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:21&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:17:48&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:31:04&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:41:59&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:51:48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:00:09&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:11:17&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:25:48&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:39:41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:51:11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:59:34&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:10:50&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:25:43&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:53:06&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:11:34&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:40:29&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:12:07&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:58:17&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20:05:13&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the same chart for Sunday:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This chart shows the same thing but for Sunday" src="https://static.simonwillison.net/static/2022/pumpkin-sunday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://docs.google.com/spreadsheets/d/1JOimtkugZBF_YQxqn0Gn6NiIhNz-OMH2rpOZtmECAY4/edit#gid=0" rel="nofollow"&gt;Google Sheet is here&lt;/a&gt; - the two days have two separate tabs within the sheet.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-building-the-sqlite-database-in-github-actions" class="anchor" aria-hidden="true" href="#building-the-sqlite-database-in-github-actions"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Building the SQLite database in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I did most of the development work for this project on my laptop, running &lt;code&gt;git-history&lt;/code&gt; and &lt;code&gt;datasette&lt;/code&gt; locally for speed of iteration.&lt;/p&gt;
&lt;p&gt;Once I had everything working, I decided to automate the process of building the SQLite database as well.&lt;/p&gt;
&lt;p&gt;I made the following changes to my GitHub Actions workflow:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;fetch-depth&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Needed by git-history&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Set up Python 3.10&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v4&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3.10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pip&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;pip install -r requirements.txt&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; env:&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; run&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build SQLite database&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        rm -f hmb.db # Recreate from scratch each time&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db one.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "one", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db two.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "two", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id --namespace item2&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I also added a &lt;code&gt;requirements.txt&lt;/code&gt; file containing just &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note how the &lt;code&gt;actions/checkout@v3&lt;/code&gt; step now has &lt;code&gt;fetch-depth: 0&lt;/code&gt; - this is necessary because &lt;code&gt;git-history&lt;/code&gt; needs to loop through the entire repository history, but &lt;code&gt;actions/checkout@v3&lt;/code&gt; defaults to only fetching the most recent commit.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;setup-python&lt;/code&gt; step uses &lt;code&gt;cache: "pip"&lt;/code&gt;, which causes it to cache installed dependencies from &lt;code&gt;requirements.txt&lt;/code&gt; between runs.&lt;/p&gt;
&lt;p&gt;Because that big &lt;code&gt;git-history&lt;/code&gt; step creates a &lt;code&gt;hmb.db&lt;/code&gt; SQLite database, the "Commit and push" step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.&lt;/p&gt;
&lt;p&gt;Normally I wouldn't do this, because Git isn't a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won't continue to be updated beyond the end of the pumpkin festival.&lt;/p&gt;
&lt;p&gt;End result: &lt;a href="https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"&gt;hmb.db is available&lt;/a&gt; in the GitHub repository.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-querying-it-using-datasette-lite" class="anchor" aria-hidden="true" href="#querying-it-using-datasette-lite"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Querying it using Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt; is my repackaged version of my Datasette server-side Python application which runs entirely in the user's browser, using WebAssembly.&lt;/p&gt;
&lt;p&gt;A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.&lt;/p&gt;
&lt;p&gt;These database files need to be served with CORS headers. Every file served by GitHub includes these headers!&lt;/p&gt;
&lt;p&gt;Which means the following URL can be used to open up the latest &lt;code&gt;hmb.db&lt;/code&gt; file directly in Datasette in the browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(This takes advantage of a &lt;a href="https://simonwillison.net/2022/Sep/16/weeknotes/" rel="nofollow"&gt;feature I added&lt;/a&gt; to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)&lt;/p&gt;
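&lt;p&gt;The idea behind that rewrite can be sketched as a small hypothetical helper - this is an illustration of the URL transformation, not Datasette Lite's actual implementation, which may handle more cases:&lt;/p&gt;

```python
import re


def github_blob_to_raw(url):
    # Turn a github.com "blob" page URL into the corresponding
    # raw.githubusercontent.com URL for the file itself.
    m = re.match(r"https://github\.com/([^/]+)/([^/]+)/blob/(.+)", url)
    if m is None:
        return url  # not a GitHub blob page - leave unchanged
    owner, repo, path = m.groups()
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{path}"


print(github_blob_to_raw(
    "https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"))
# https://raw.githubusercontent.com/simonw/scrape-hmb-traffic/main/hmb.db
```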
&lt;p&gt;URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And finally... Datasette Lite &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/" rel="nofollow"&gt;has plugin support&lt;/a&gt;. Adding &lt;code&gt;&amp;amp;install=datasette-copyable&lt;/code&gt; to the URL adds the &lt;a href="https://datasette.io/plugins/datasette-copyable" rel="nofollow"&gt;datasette-copyable&lt;/a&gt; plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&amp;amp;install=datasette-copyable#/hmb.copyable?sql=with+item1+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%2C%0Aitem2+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%0Aselect%0A++item1.%2A%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&amp;amp;_table_format=github" rel="nofollow"&gt;an example&lt;/a&gt; of that plugin in action.&lt;/p&gt;
&lt;p&gt;This was a fun little project that brought together a whole bunch of things I've been working on over the past few years. Here's some more of my writing on these different techniques and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/series/git-scraping/" rel="nofollow"&gt;Git scraping&lt;/a&gt; is the key technique I'm using here to collect the data&lt;/li&gt;
&lt;li&gt;I've written a lot about &lt;a href="https://simonwillison.net/tags/githubactions/" rel="nofollow"&gt;GitHub Actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;These are my notes about &lt;a href="https://simonwillison.net/tags/githistory/" rel="nofollow"&gt;git-history&lt;/a&gt;, the tool I used to turn a commit history into a SQLite database&lt;/li&gt;
&lt;li&gt;Here's my series of posts about &lt;a href="https://simonwillison.net/series/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="natalie-downe"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="git-history"/><category term="datasette-lite"/><category term="half-moon-bay"/></entry><entry><title>Automatically opening issues when tracked file content changes</title><link href="https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-series" rel="alternate"/><published>2022-04-28T17:18:14+00:00</published><updated>2022-04-28T17:18:14+00:00</updated><id>https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-series</id><summary type="html">
    &lt;p&gt;I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.&lt;/p&gt;
&lt;h4&gt;Extracting GZipMiddleware from Starlette&lt;/h4&gt;
&lt;p&gt;Here's why I needed to solve this problem.&lt;/p&gt;
&lt;p&gt;I want to add gzip support to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; open source project. Datasette builds on the Python &lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; standard, and &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; provides an extremely well tested, robust &lt;a href="https://www.starlette.io/middleware/#gzipmiddleware"&gt;GZipMiddleware class&lt;/a&gt; that adds gzip support to any ASGI application. As with everything else in Starlette, it's &lt;em&gt;really&lt;/em&gt; good code.&lt;/p&gt;
&lt;p&gt;The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.&lt;/p&gt;
&lt;p&gt;So I decided to extract the &lt;code&gt;GZipMiddleware&lt;/code&gt; class into a separate Python package, under the same BSD license as Starlette itself.&lt;/p&gt;
&lt;p&gt;The result is my new &lt;a href="https://pypi.org/project/asgi-gzip/"&gt;asgi-gzip&lt;/a&gt; package, now available on PyPI.&lt;/p&gt;
&lt;h4&gt;What if Starlette fixes a bug?&lt;/h4&gt;
&lt;p&gt;The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the &lt;code&gt;GZipMiddleware&lt;/code&gt; class? How can I make sure to apply those same fixes to my extracted copy?&lt;/p&gt;
&lt;p&gt;As I thought about this challenge, I realized I had most of the solution already.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;&lt;/strong&gt; is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.&lt;/p&gt;
&lt;p&gt;It may seem redundant to do this against a file that already &lt;a href="https://github.com/encode/starlette/commits/master/starlette/middleware/gzip.py"&gt;lives in version control&lt;/a&gt; elsewhere - but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.&lt;/p&gt;
&lt;p&gt;I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.&lt;/p&gt;
&lt;p&gt;Since I already run all of my projects out of GitHub issues, automatically opening an issue against the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip repository&lt;/a&gt; would be ideal.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt; does exactly that: it implements the Git scraping pattern against the &lt;a href="https://github.com/encode/starlette/blob/master/starlette/middleware/gzip.py"&gt;gzip.py module&lt;/a&gt; in Starlette, and files an issue any time it detects changes to that file.&lt;/p&gt;
&lt;p&gt;Starlette haven't made any changes to that file since I started tracking it, so I created &lt;a href="https://github.com/simonw/issue-when-changed"&gt;a test repo&lt;/a&gt; to try this out.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/issue-when-changed/issues/3"&gt;one of the example issues&lt;/a&gt;. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/issue-when-changed.jpg" alt="Screenshot of an open issue page. The issues is titled &amp;quot;gzip.py was updated&amp;quot; and contains a visual diff showing the change to a file. A commit that references the issue is listed too." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The implementation is contained entirely in this &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt;. I deliberately kept it to a single file so it's easy to copy, paste and adapt for other projects.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/actions/github-script"&gt;actions/github-script&lt;/a&gt;, which makes it easy to do things like file new issues using JavaScript.&lt;/p&gt;
&lt;p&gt;Here's a heavily annotated copy:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Track the Starlette version of this&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run on repo pushes, and if a user clicks the "run this action" button,&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; and on a schedule at 5:21am UTC every day&lt;/span&gt;
&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;21 5 * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Without this block I got this error when the action ran:&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; HttpError: Resource not accessible by integration&lt;/span&gt;
&lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to create issues&lt;/span&gt;
  &lt;span class="pl-ent"&gt;issues&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to commit back to the repository&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contents&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;check&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/github-script@v6&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Using env: here to demonstrate how an action like this can&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; be adjusted to take dynamic inputs&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;URL&lt;/span&gt;: &lt;span class="pl-s"&gt;https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py&lt;/span&gt;
        &lt;span class="pl-ent"&gt;FILE_NAME&lt;/span&gt;: &lt;span class="pl-s"&gt;tracking/gzip.py&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;script&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { URL, FILE_NAME } = process.env;&lt;/span&gt;
&lt;span class="pl-s"&gt;          // promisify pattern for getting an await version of child_process.exec&lt;/span&gt;
&lt;span class="pl-s"&gt;          const util = require("util");&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Used exec_ here because 'exec' variable name is already used:&lt;/span&gt;
&lt;span class="pl-s"&gt;          const exec_ = util.promisify(require("child_process").exec);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use curl to download the file&lt;/span&gt;
&lt;span class="pl-s"&gt;          await exec_(`curl -o ${FILE_NAME} ${URL}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use 'git diff' to detect if the file has changed since last time&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { stdout } = await exec_(`git diff ${FILE_NAME}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          if (stdout) {&lt;/span&gt;
&lt;span class="pl-s"&gt;            // There was a diff to that file&lt;/span&gt;
&lt;span class="pl-s"&gt;            const title = `${FILE_NAME} was updated`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            const body =&lt;/span&gt;
&lt;span class="pl-s"&gt;              `${URL} changed:` +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n\n```diff\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              stdout +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n```\n\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "Close this issue once those changes have been integrated here";&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issue = await github.rest.issues.create({&lt;/span&gt;
&lt;span class="pl-s"&gt;              owner: context.repo.owner,&lt;/span&gt;
&lt;span class="pl-s"&gt;              repo: context.repo.repo,&lt;/span&gt;
&lt;span class="pl-s"&gt;              title: title,&lt;/span&gt;
&lt;span class="pl-s"&gt;              body: body,&lt;/span&gt;
&lt;span class="pl-s"&gt;            });&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issueNumber = issue.data.number;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // Now commit and reference that issue number, so the commit shows up&lt;/span&gt;
&lt;span class="pl-s"&gt;            // listed at the bottom of the issue page&lt;/span&gt;
&lt;span class="pl-s"&gt;            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // https://til.simonwillison.net/github-actions/commit-if-file-changed&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.name "Automated"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.email "actions@users.noreply.github.com"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git add -A`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git commit -m "${commitMessage}" || exit 0`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git pull --rebase`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git push`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt; repository I keep the fetched &lt;code&gt;gzip.py&lt;/code&gt; file in a &lt;code&gt;tracking/&lt;/code&gt; directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.&lt;/p&gt;
&lt;h4&gt;More interesting applications&lt;/h4&gt;
&lt;p&gt;I built this to solve my "tell me when Starlette update their &lt;code&gt;gzip.py&lt;/code&gt; file" problem, but clearly this pattern has much more interesting uses.&lt;/p&gt;
&lt;p&gt;You could point this at any web page to get a new GitHub issue opened when that page's content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.&lt;/p&gt;
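&lt;p&gt;The core detection step doesn't depend on GitHub at all. Here's a minimal Python sketch of the same idea, using a stored hash in place of the &lt;code&gt;git diff&lt;/code&gt; step in the workflow above (the function and file names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical sketch: compare a hash of the freshly fetched content against
# the hash recorded on the previous run. In the workflow above "git diff"
# plays this role; here a small state file does.
import hashlib
from pathlib import Path

def content_changed(new_content: bytes, state_file: Path) -> bool:
    """Record the hash of new_content and report whether it changed."""
    digest = hashlib.sha256(new_content).hexdigest()
    previous = state_file.read_text() if state_file.exists() else None
    state_file.write_text(digest)
    return digest != previous
```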
&lt;p&gt;There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!&lt;/p&gt;
&lt;h4&gt;Update: October 13th 2022&lt;/h4&gt;
&lt;p&gt;Almost six months after writing about this... it triggered for the first time!&lt;/p&gt;
&lt;p&gt;Here's the issue that the script opened: &lt;a href="https://github.com/simonw/asgi-gzip/issues/4"&gt;#4: tracking/gzip.py was updated&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.2"&gt;version 0.2&lt;/a&gt; of the package.&lt;/p&gt;
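&lt;p&gt;The change makes sense: if a response already carries a &lt;code&gt;Content-Encoding&lt;/code&gt; header, gzipping it again would double-encode the body. A minimal sketch of that guard (paraphrased - not Starlette's actual code):&lt;/p&gt;

```python
# Paraphrased sketch of the guard the fix implies: only compress a
# response that is not already content-encoded.
def should_gzip(headers: dict) -> bool:
    """Return True if the response headers carry no Content-Encoding."""
    return "content-encoding" not in {k.lower() for k in headers}
```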
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="gzip"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="asgi"/><category term="github-actions"/><category term="git-scraping"/><category term="github-issues"/></entry><entry><title>Scraping web pages from the command line with shot-scraper</title><link href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-series" rel="alternate"/><published>2022-03-14T01:29:56+00:00</published><updated>2022-03-14T01:29:56+00:00</updated><id>https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/#atom-series</id><summary type="html">
    &lt;p&gt;I've added a powerful new capability to my &lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt; command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.&lt;/p&gt;
&lt;p&gt;Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.&lt;/p&gt;
&lt;p&gt;It's also a really neat web scraping tool.&lt;/p&gt;
&lt;h4&gt;shot-scraper&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;introduced shot-scraper&lt;/a&gt; last Thursday. It's a Python utility that wraps &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/simonwillison-net.png" alt="Screenshot of my blog homepage" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Since Thursday &lt;code&gt;shot-scraper&lt;/code&gt; has had &lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;a flurry of releases&lt;/a&gt;, adding features like &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#saving-a-web-page-to-pdf"&gt;PDF exports&lt;/a&gt;, the ability to dump the Chromium &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#dumping-out-an-accessibility-tree"&gt;accessibility tree&lt;/a&gt; and the ability to take screenshots of &lt;a href="https://github.com/simonw/shot-scraper/blob/0.9/README.md#websites-that-need-authentication"&gt;authenticated web pages&lt;/a&gt;. But the most exciting new feature landed today.&lt;/p&gt;
&lt;h4&gt;Executing JavaScript and returning the result&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.9"&gt;Release 0.9&lt;/a&gt; takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or you can return a JSON object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to use functions like &lt;code&gt;setTimeout()&lt;/code&gt; - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper javascript datasette.io "
new Promise(done =&amp;gt; setTimeout(
  () =&amp;gt; {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Test page title&lt;/span&gt;
  &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;    shot-scraper javascript datasette.io "&lt;/span&gt;
&lt;span class="pl-s"&gt;      if (document.title != 'Datasette') {&lt;/span&gt;
&lt;span class="pl-s"&gt;        throw 'Wrong title detected';&lt;/span&gt;
&lt;span class="pl-s"&gt;      }"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="scrape-a-web-page"&gt;Using this to scrape a web page&lt;/h4&gt;
&lt;p&gt;The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.&lt;/p&gt;
&lt;p&gt;Posts from my blog occasionally show up on &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; - sometimes I spot them, sometimes I don't.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/from?site=simonwillison.net"&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;/a&gt; is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/news-ycombinator-com-from.png" alt="Screenshot of the Hacker News listing for my domain" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So... let's write a scraper for it.&lt;/p&gt;
&lt;p&gt;I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.athing'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;title&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.score'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.titleline a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;submitter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.hnuser'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.age a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;href&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'?id='&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Only posts with comments have a comments link&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Array&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;from&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
    &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;nextSibling&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelectorAll&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'a'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
  &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;filter&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;el&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;includes&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'comment'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-s1"&gt;numComments&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;parseInt&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;commentsLink&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;id&lt;span class="pl-kos"&gt;,&lt;/span&gt; title&lt;span class="pl-kos"&gt;,&lt;/span&gt; url&lt;span class="pl-kos"&gt;,&lt;/span&gt; dt&lt;span class="pl-kos"&gt;,&lt;/span&gt; points&lt;span class="pl-kos"&gt;,&lt;/span&gt; submitter&lt;span class="pl-kos"&gt;,&lt;/span&gt; commentsUrl&lt;span class="pl-kos"&gt;,&lt;/span&gt; numComments&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.&lt;/p&gt;
&lt;p&gt;I'm using &lt;code&gt;document.querySelectorAll('.athing')&lt;/code&gt; to loop through each element that matches that selector.&lt;/p&gt;
&lt;p&gt;I wrap that with &lt;code&gt;Array.from(...)&lt;/code&gt;, passing a mapping function as the second argument. That way, for each element I can extract the details that I need.&lt;/p&gt;
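&lt;p&gt;This works because &lt;code&gt;Array.from()&lt;/code&gt; accepts a mapping function as an optional second argument. A quick standalone illustration:&lt;/p&gt;

```javascript
// Array.from(iterable, mapFn) converts any iterable (including a NodeList)
// into a real array, applying mapFn to each element along the way.
const lengths = Array.from(["a", "bb", "ccc"], s => s.length);
console.log(lengths); // logs: [ 1, 2, 3 ]
```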
&lt;p&gt;The resulting array contains 30 items that look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Track changes to CLI tools by recording their help output&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://simonwillison.net/2022/Feb/2/help-scraping/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"dt"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2022-03-13T05:36:13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"submitter"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;appwiz&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"commentsUrl"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://news.ycombinator.com/item?id=30658310&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"numComments"&lt;/span&gt;: &lt;span class="pl-c1"&gt;19&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Running it with shot-scraper&lt;/h4&gt;
&lt;p&gt;Now that I have a recipe for a scraper, I can run it in the terminal like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;shot-scraper javascript &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://news.ycombinator.com/from?site=simonwillison.net&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('.athing'), el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;  const title = el.querySelector('.titleline a').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const points = parseInt(el.nextSibling.querySelector('.score').innerText);&lt;/span&gt;
&lt;span class="pl-s"&gt;  const url = el.querySelector('.titleline a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const dt = el.nextSibling.querySelector('.age').title;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const submitter = el.nextSibling.querySelector('.hnuser').innerText;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsUrl = el.nextSibling.querySelector('.age a').href;&lt;/span&gt;
&lt;span class="pl-s"&gt;  const id = commentsUrl.split('?id=')[1];&lt;/span&gt;
&lt;span class="pl-s"&gt;  // Only posts with comments have a comments link&lt;/span&gt;
&lt;span class="pl-s"&gt;  const commentsLink = Array.from(&lt;/span&gt;
&lt;span class="pl-s"&gt;    el.nextSibling.querySelectorAll('a')&lt;/span&gt;
&lt;span class="pl-s"&gt;  ).filter(el =&amp;gt; el &amp;amp;&amp;amp; el.innerText.includes('comment'))[0];&lt;/span&gt;
&lt;span class="pl-s"&gt;  let numComments = 0;&lt;/span&gt;
&lt;span class="pl-s"&gt;  if (commentsLink) {&lt;/span&gt;
&lt;span class="pl-s"&gt;    numComments = parseInt(commentsLink.innerText.split()[0]);&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;/span&gt;
&lt;span class="pl-s"&gt;  return {id, title, url, dt, points, submitter, commentsUrl, numComments};&lt;/span&gt;
&lt;span class="pl-s"&gt;})&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; simonwillison-net.json&lt;/pre&gt;&lt;/div&gt;  
&lt;p&gt;&lt;code&gt;simonwillison-net.json&lt;/code&gt; is now a JSON file containing the scraped data.&lt;/p&gt;
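&lt;p&gt;From here the file can be processed with any JSON-aware tool. A quick Python sketch (the sample records below are illustrative, not real data):&lt;/p&gt;

```python
# Load the scraped posts and pull out the highest-scoring ones.
# The "points" and "title" keys match the scraper output shown above.
import json

def top_posts(posts, n=5):
    """Return the n posts with the most points, highest first."""
    return sorted(posts, key=lambda p: p["points"], reverse=True)[:n]

# Illustrative data - a real run would use json.load(open("simonwillison-net.json"))
sample = json.loads('[{"title": "A", "points": 5}, {"title": "B", "points": 12}]')
print(top_posts(sample, n=1))  # [{'title': 'B', 'points': 12}]
```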
&lt;h4&gt;Running the scraper in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I want to keep track of changes to this data structure over time. My preferred technique for that is something I call &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; - the core idea is to keep the data in a Git repository and commit a fresh copy any time it changes. This provides a cheap and robust history of changes over time.&lt;/p&gt;
&lt;p&gt;Running the scraper in GitHub Actions means I don't need to administer my own server to keep it running.&lt;/p&gt;
&lt;p&gt;So I built exactly that, in the &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain"&gt;simonw/scrape-hacker-news-by-domain&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;The GitHub Actions workflow is in &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/blob/485841482a39869759e39f4d8dee21b9adc963d7/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt;. It runs the above command once an hour, then pushes a commit back to the repository should the file have any changes since last time it ran.&lt;/p&gt;
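&lt;p&gt;The "push a commit only if the file changed" step in a workflow like that typically boils down to a few lines of shell. This is a sketch of the pattern, not the exact contents of &lt;code&gt;scrape.yml&lt;/code&gt;:&lt;/p&gt;

```yaml
- name: Commit and push if the data changed
  run: |-
    git config user.name "Automated"
    git config user.email "actions@users.noreply.github.com"
    git add -A
    timestamp=$(date -u)
    # "|| exit 0" makes the step succeed when there is nothing to commit
    git commit -m "Latest data: ${timestamp}" || exit 0
    git push
```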
&lt;p&gt;The &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json"&gt;commit history of simonwillison-net.json&lt;/a&gt; will show me any time a new link from my site appears on Hacker News, or a comment is added.&lt;/p&gt;
&lt;p&gt;(Fun GitHub trick: add &lt;code&gt;.atom&lt;/code&gt; to the end of that URL to get &lt;a href="https://github.com/simonw/scrape-hacker-news-by-domain/commits/main/simonwillison-net.json.atom"&gt;an Atom feed of those commits&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.&lt;/p&gt;
&lt;p&gt;I can see myself using this technique &lt;em&gt;a lot&lt;/em&gt; in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="github"/><category term="hacker-news"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="shot-scraper"/></entry><entry><title>shot-scraper: automated screenshots for documentation, built on Playwright</title><link href="https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-series" rel="alternate"/><published>2022-03-10T00:13:30+00:00</published><updated>2022-03-10T00:13:30+00:00</updated><id>https://simonwillison.net/2022/Mar/10/shot-scraper/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt; is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; and &lt;a href="https://simonwillison.net/2022/Feb/2/help-scraping/"&gt;help scraping&lt;/a&gt; techniques.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 13th March 2022:&lt;/strong&gt; The new &lt;code&gt;shot-scraper javascript&lt;/code&gt; command can now be used to &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/"&gt;scrape web pages from the command line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 14th October 2022:&lt;/strong&gt; &lt;a href="https://simonwillison.net/2022/Oct/14/automating-screenshots/"&gt;Automating screenshots for the Datasette documentation using shot-scraper&lt;/a&gt; offers a tutorial introduction to using the tool.&lt;/p&gt;
&lt;h4&gt;The problem&lt;/h4&gt;
&lt;p&gt;I like to include screenshots in documentation. I recently &lt;a href="https://simonwillison.net/2022/Feb/27/datasette-tutorials/"&gt;started writing end-user tutorials&lt;/a&gt; for Datasette, which are particularly image heavy (&lt;a href="https://datasette.io/tutorials/explore"&gt;for example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.&lt;/p&gt;
&lt;h4&gt;Introducing shot-scraper&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; is a tool for automating this process. You can install it using &lt;code&gt;pip&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install shot-scraper
shot-scraper install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That second &lt;code&gt;shot-scraper install&lt;/code&gt; line will install the browser it needs to do its job - more on that later.&lt;/p&gt;
&lt;p&gt;You can use it in two ways. To take a one-off screenshot, you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://simonwillison.net/ -o simonwillison.png
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://simonwillison.net/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;simonwillison.png&lt;/span&gt;
- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://www.example.com/&lt;/span&gt;
  &lt;span class="pl-ent"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;height&lt;/span&gt;: &lt;span class="pl-c1"&gt;400&lt;/span&gt;
  &lt;span class="pl-ent"&gt;quality&lt;/span&gt;: &lt;span class="pl-c1"&gt;80&lt;/span&gt;
  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;example.jpg&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then use &lt;code&gt;shot-scraper multi&lt;/code&gt; to execute every screenshot in one go:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html"&gt;The documentation&lt;/a&gt; describes all of the available options you can use when taking a screenshot.&lt;/p&gt;
&lt;p&gt;Each option can be provided to the &lt;code&gt;shot-scraper&lt;/code&gt; one-off tool, or can be embedded in the YAML file for use with &lt;code&gt;shot-scraper multi&lt;/code&gt;.&lt;/p&gt;
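&lt;p&gt;One way to picture this is that each YAML entry is a dictionary of overrides merged into the tool's defaults. Here's a hedged Python sketch of that merge, assuming shot-scraper's default browser width of 1280px; the fallback output name and the exact option set are assumptions for illustration, not the tool's real internals:&lt;/p&gt;

```python
# Illustrative sketch: each YAML shot entry is a dict of overrides
# merged into defaults. The 1280px width matches shot-scraper's default;
# the other defaults and the fallback filename are assumptions.
DEFAULTS = {"width": 1280, "height": None, "quality": None, "selector": None}

def resolve_shot(shot):
    """Merge one YAML entry (as a dict) with shot-scraper-style defaults."""
    if "url" not in shot:
        raise ValueError("each shot needs a url")
    resolved = dict(DEFAULTS)
    resolved.update(shot)
    resolved.setdefault("output", "screenshot.png")
    return resolved
```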
&lt;h4&gt;JavaScript and CSS selectors&lt;/h4&gt;
&lt;p&gt;The default behaviour for &lt;code&gt;shot-scraper&lt;/code&gt; is to take a full page screenshot, using a browser width of 1280px.&lt;/p&gt;
&lt;p&gt;For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--selector&lt;/code&gt; option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.&lt;/p&gt;
&lt;p&gt;What if you want to modify the page in addition to selecting a specific area?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--javascript&lt;/code&gt; option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.&lt;/p&gt;
&lt;p&gt;The combination of these two options - also available as &lt;code&gt;javascript:&lt;/code&gt; and &lt;code&gt;selector:&lt;/code&gt; keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.&lt;/p&gt;
&lt;h4 id="a-complex-example"&gt;A complex example&lt;/h4&gt;
&lt;p&gt;To prove to myself that the tool works, I decided to try replicating this screenshot from &lt;a href="https://datasette.io/tutorials/explore"&gt;my tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I made the original using &lt;a href="https://cleanshot.com/"&gt;CleanShot X&lt;/a&gt;, manually adding the two pink arrows:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/select-facets-original.jpg" alt="A screenshot of a portion of the table interface in Datasette, with a menu open and two pink arrows pointing to menu items" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is pretty tricky!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It's not &lt;a href="https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez"&gt;this whole page&lt;/a&gt;, just a subset of the page&lt;/li&gt;
&lt;li&gt;The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot&lt;/li&gt;
&lt;li&gt;There are two pink arrows superimposed on the image&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I decided to use just one arrow for the moment, which should hopefully result in a clearer image.&lt;/p&gt;
&lt;p&gt;I started by &lt;a href="https://github.com/simonw/shot-scraper/issues/9#issuecomment-1063314278"&gt;creating my own pink arrow SVG&lt;/a&gt; using Figma:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/pink-arrow.png" alt="A big pink arrow, with a drop shadow" style="width: 200px; max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.&lt;/p&gt;
&lt;p&gt;With the JavaScript figured out, I pasted it into a YAML file called &lt;code&gt;shot.yml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&amp;amp;type=prez&lt;/span&gt;
  &lt;span class="pl-ent"&gt;javascript&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;    new Promise(resolve =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Run in a promise so we can sleep 1s at the end&lt;/span&gt;
&lt;span class="pl-s"&gt;      function remove(el) { el.parentNode.removeChild(el);}&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove header and footer&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('header'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      remove(document.querySelector('footer'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove most of the children of .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('.content &amp;gt; *:not(.table-wrapper,.suggested-facets)')).map(remove)&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Bit of breathing room for the screenshot&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.body.style.marginTop = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a bit of padding to .content&lt;/span&gt;
&lt;span class="pl-s"&gt;      var content = document.querySelector('.content');&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.width = '820px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.padding = '10px';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Open the menu - it's an SVG so we need to use dispatchEvent here&lt;/span&gt;
&lt;span class="pl-s"&gt;      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Remove all but table header and first 11 rows&lt;/span&gt;
&lt;span class="pl-s"&gt;      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Add a pink SVG arrow&lt;/span&gt;
&lt;span class="pl-s"&gt;      let div = document.createElement('div');&lt;/span&gt;
&lt;span class="pl-s"&gt;      div.innerHTML = `&amp;lt;svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;g filter="url(#a)"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/g&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feFlood flood-opacity="0" result="BackgroundImageFix"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feOffset dy="4"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feGaussianBlur stdDeviation="2"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feComposite in2="hardAlpha" operator="out"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;              &amp;lt;feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          &amp;lt;/filter&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        &amp;lt;/defs&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;/svg&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;      let svg = div.firstChild;&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.appendChild(svg);&lt;/span&gt;
&lt;span class="pl-s"&gt;      content.style.position = 'relative';&lt;/span&gt;
&lt;span class="pl-s"&gt;      svg.style.position = 'absolute';&lt;/span&gt;
&lt;span class="pl-s"&gt;      // Give the menu time to finish fading in&lt;/span&gt;
&lt;span class="pl-s"&gt;      setTimeout(() =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        // Position arrow pointing to the 'facet by this' menu item&lt;/span&gt;
&lt;span class="pl-s"&gt;        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.left = (pos.left - pos.width) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        svg.style.top = (pos.top - 20) + 'px';&lt;/span&gt;
&lt;span class="pl-s"&gt;        resolve();&lt;/span&gt;
&lt;span class="pl-s"&gt;      }, 1000);&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;  &lt;span class="pl-ent"&gt;output&lt;/span&gt;: &lt;span class="pl-s"&gt;annotated-screenshot.png&lt;/span&gt;
  &lt;span class="pl-ent"&gt;selector&lt;/span&gt;: &lt;span class="pl-s"&gt;.content&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this command to generate the screenshot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper multi shot.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated &lt;code&gt;annotated-screenshot.png&lt;/code&gt; image looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/annotated-screenshot.png" alt="A screenshot of the table with the menu open and a single pink arrow pointing to the 'facet by this' menu item" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm pretty happy with this! I think it works very well as a proof of concept for the process.&lt;/p&gt;
&lt;h4 id="how-it-works-playwright"&gt;How it works: Playwright&lt;/h4&gt;
&lt;p&gt;I built the &lt;a href="https://github.com/simonw/shot-scraper/tree/44995cd45ca6c56d34c5c3d131217f7b9170f6f7"&gt;first prototype&lt;/a&gt; of &lt;code&gt;shot-scraper&lt;/code&gt; using Puppeteer, because I had &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/"&gt;used that before&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I noticed that the &lt;a href="https://www.npmjs.com/package/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt; package I was using hadn't had an update in two years, which reminded me to check out Playwright.&lt;/p&gt;
&lt;p&gt;I've been looking for an excuse to learn &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt; for a while now, and this project turned out to be ideal.&lt;/p&gt;
&lt;p&gt;Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.&lt;/p&gt;
&lt;p&gt;Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.&lt;/p&gt;
&lt;p&gt;The second prototype used the &lt;a href="https://github.com/simonw/shot-scraper/tree/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0"&gt;Playwright CLI utility&lt;/a&gt; instead, &lt;a href="https://github.com/simonw/shot-scraper/blob/b3318b2f27ca1526d5a9f06de50cf9900dd4d8d0/shot_scraper/cli.py#L39-L50"&gt;executed via npx&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
    [
        &lt;span class="pl-s"&gt;"npx"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"playwright"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"--full-page"&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;output&lt;/span&gt;,
    ],
    &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
)&lt;/pre&gt;
&lt;p&gt;This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.&lt;/p&gt;
&lt;p&gt;I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official &lt;a href="https://playwright.dev/python/docs/intro"&gt;Playwright for Python&lt;/a&gt; package.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install playwright
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.&lt;/p&gt;
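&lt;p&gt;To give a flavour of that API, here's a rough sketch of what taking an element screenshot looks like with Playwright's sync API. This is a simplified illustration rather than shot-scraper's actual implementation, and running it requires both &lt;code&gt;pip install playwright&lt;/code&gt; and &lt;code&gt;playwright install&lt;/code&gt; (hence the deferred import):&lt;/p&gt;

```python
def take_shot(url, output, selector=None, javascript=None, width=1280):
    """Rough sketch of a screenshot using Playwright's sync API.

    Simplified from what shot-scraper does; needs `pip install playwright`
    and `playwright install` to actually run, so the import is deferred.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": 720})
        page.goto(url)
        if javascript:
            page.evaluate(javascript)  # runs after load, before the shot
        if selector:
            page.locator(selector).screenshot(path=output)
        else:
            page.screenshot(path=output, full_page=True)
        browser.close()
```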
&lt;p&gt;I was curious how they pulled this off, so I dug inside the &lt;code&gt;playwright&lt;/code&gt; Python package in my &lt;code&gt;site-packages&lt;/code&gt; folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.&lt;/p&gt;
&lt;p&gt;Thanks to Playwright, the entire implementation of &lt;code&gt;shot-scraper&lt;/code&gt; is currently just &lt;a href="https://github.com/simonw/shot-scraper/blob/0.3/shot_scraper/cli.py"&gt;181 lines of Python code&lt;/a&gt; - it's all glue code tying together a &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; CLI interface with some code that calls Playwright to do the actual work.&lt;/p&gt;
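&lt;p&gt;The CLI surface itself is equally compact. As a dependency-free sketch of the glue-code side (shot-scraper really uses Click, so this argparse version is purely illustrative), the options described above map to a parser like this:&lt;/p&gt;

```python
import argparse

def build_parser():
    """CLI surface sketch; shot-scraper itself uses Click, not argparse."""
    parser = argparse.ArgumentParser(prog="shot-scraper")
    parser.add_argument("url", help="URL of the page to screenshot")
    parser.add_argument("-o", "--output", default="shot.png")
    parser.add_argument("-s", "--selector", help="CSS selector to capture")
    parser.add_argument("--javascript", help="JS to run before the shot")
    parser.add_argument("-w", "--width", type=int, default=1280)
    return parser
```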
&lt;p&gt;I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; Electron application.&lt;/p&gt;
&lt;h4&gt;Hooking shot-scraper up to GitHub Actions&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;shot-scraper&lt;/code&gt; very much with GitHub Actions in mind.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/shot-scraper-demo"&gt;shot-scraper-demo&lt;/a&gt; repository is my first live demo of the tool.&lt;/p&gt;
&lt;p&gt;Once a day, it runs &lt;a href="https://github.com/simonw/shot-scraper-demo/blob/3fdd9d3e79f95d9d396aeefd5bf65e85a7700ef4/.github/workflows/shots.yml"&gt;this shots.yml&lt;/a&gt; file, generates two screenshots and commits them back to the repository.&lt;/p&gt;
&lt;p&gt;One of them is the tutorial screenshot described above.&lt;/p&gt;
&lt;p&gt;The other is a screenshot of the list of "recently spotted owls" from &lt;a href="https://www.owlsnearme.com/?place=127871"&gt;this page&lt;/a&gt; on &lt;a href="https://www.owlsnearme.com/"&gt;owlsnearme.com&lt;/a&gt;. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.&lt;/p&gt;
&lt;p&gt;I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, &lt;a href="https://github.com/simonw/shot-scraper-demo/commit/bc86510f49b6f8d6728c9f1880b999c83361dd5a#diff-897c3444fbbb2033cbba5840da4994d01c3f396e0cdf4b0613d7f410db9887e0"&gt;like this one&lt;/a&gt; (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).&lt;/p&gt;
&lt;p&gt;Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I had ambitious plans to add utilities to the tool that would &lt;a href="https://github.com/simonw/shot-scraper/issues/9"&gt;help with annotations&lt;/a&gt;, such as adding pink arrows and drawing circles around different elements on the page.&lt;/p&gt;
&lt;p&gt;I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.&lt;/p&gt;
&lt;p&gt;So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.&lt;/p&gt;
&lt;p&gt;I'm also very interested to see what kinds of things other people use this for.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="documentation"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/><category term="puppeteer"/><category term="playwright"/><category term="shot-scraper"/></entry><entry><title>Help scraping: track changes to CLI tools by recording their --help using Git</title><link href="https://simonwillison.net/2022/Feb/2/help-scraping/#atom-series" rel="alternate"/><published>2022-02-02T23:46:35+00:00</published><updated>2022-02-02T23:46:35+00:00</updated><id>https://simonwillison.net/2022/Feb/2/help-scraping/#atom-series</id><summary type="html">
    &lt;p&gt;I've been experimenting with a new variant of &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; this week which I'm calling &lt;strong&gt;Help scraping&lt;/strong&gt;. The key idea is to track changes made to CLI tools over time by recording the output of their &lt;code&gt;--help&lt;/code&gt; commands in a Git repository.&lt;/p&gt;
&lt;p&gt;My new &lt;a href="https://github.com/simonw/help-scraper"&gt;help-scraper GitHub repository&lt;/a&gt; is my first implementation of this pattern.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/.github/workflows/scrape.yml"&gt;this GitHub Actions workflow&lt;/a&gt; to record the &lt;code&gt;--help&lt;/code&gt; output for the Amazon Web Services &lt;code&gt;aws&lt;/code&gt; CLI tool, and also for the &lt;code&gt;flyctl&lt;/code&gt; tool maintained by the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; hosting platform.&lt;/p&gt;
&lt;p&gt;The workflow runs once a day. It loops through every available AWS command (using &lt;a href="https://github.com/simonw/help-scraper/blob/cd18c5d7c1ac7c3851823dcabaa21ee920d73720/aws_commands.py"&gt;this script&lt;/a&gt;) and records the output of that command's CLI help option to a &lt;code&gt;.txt&lt;/code&gt; file in the repository - then commits the result at the end.&lt;/p&gt;
&lt;p&gt;The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.&lt;/p&gt;
&lt;p&gt;Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/aws/aws-cli/blob/develop/CHANGELOG.rst#12247"&gt;the official release notes&lt;/a&gt; - 12 bullet points, spanning 12 different AWS services.&lt;/p&gt;
&lt;p&gt;My help scraper caught the details of the release in &lt;a href="https://github.com/simonw/help-scraper/commit/cd18c5d7c1ac7c3851823dcabaa21ee920d73720#diff-c2559859df8912eb13a6017d81019bf5452cead3e6495744e2d0c82202bf33ac"&gt;this commit&lt;/a&gt; - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.&lt;/p&gt;
&lt;p&gt;The AWS CLI tool is &lt;em&gt;enormous&lt;/em&gt;. Running &lt;code&gt;find aws -name '*.txt' | wc -l&lt;/code&gt; in that repository counts help pages for 11,401 individual commands - or 11,390 if you check out the previous version, showing that there were 11 commands added just in this morning's new release.&lt;/p&gt;
&lt;p&gt;There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on &lt;a href="https://github.com/boto/botocore/commits/develop"&gt;the botocore GitHub history&lt;/a&gt;, which exposes changes to the underlying JSON - and there are projects like &lt;a href="https://awsapichanges.info/"&gt;awschanges.info&lt;/a&gt; which try to turn those sources of data into something more readable.&lt;/p&gt;
&lt;p&gt;But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes &lt;a href="https://simonwillison.net/2022/Jan/31/release-notes/"&gt;with the detail I like from them&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I implemented this for &lt;code&gt;flyctl&lt;/code&gt; first, because I wanted to see what changes were being made that might impact my &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; plugin which shells out to that tool. Then I realized it could be applied to AWS as well.&lt;/p&gt;
&lt;h4&gt;Help scraping my own projects&lt;/h4&gt;
&lt;p&gt;I got the initial idea for this technique from a change I made to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io"&gt;sqlite-utils&lt;/a&gt; projects a few weeks ago.&lt;/p&gt;
&lt;p&gt;Both tools offer CLI commands with &lt;code&gt;--help&lt;/code&gt; output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.&lt;/p&gt;
&lt;p&gt;So, I added documentation pages that list the output of &lt;code&gt;--help&lt;/code&gt; for each of the CLI commands, generated using the &lt;a href="https://nedbatchelder.com/code/cog"&gt;Cog&lt;/a&gt; file generation tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html"&gt;sqlite-utils CLI reference&lt;/a&gt; (39 commands!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datasette.io/en/stable/cli-reference.html"&gt;datasette CLI reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the &lt;code&gt;--help&lt;/code&gt; output - here's &lt;a href="https://github.com/simonw/sqlite-utils/commits/main/docs/cli-reference.rst"&gt;that history for sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a short jump from that to the idea of combining it with &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; to generate history for other tools.&lt;/p&gt;
&lt;h4&gt;Bonus trick: GraphQL schema scraping&lt;/h4&gt;
&lt;p&gt;I've started making selective use of the &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; GraphQL API as part of &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;my plugin&lt;/a&gt; for publishing Datasette instances to that platform.&lt;/p&gt;
&lt;p&gt;Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: &lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?&lt;/p&gt;
&lt;p&gt;It turns out I can! There's an NPM package called &lt;a href="https://www.npmjs.com/package/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt; which can extract the GraphQL schema from any GraphQL server and write it out to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npx get-graphql-schema https://api.fly.io/graphql &amp;gt; /tmp/fly.graphql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've added that to my &lt;code&gt;help-scraper&lt;/code&gt; repository too - so now I have a &lt;a href="https://github.com/simonw/help-scraper/commits/main/flyctl/fly.graphql"&gt;commit history&lt;/a&gt; of the changes they are making there too. Here's &lt;a href="https://github.com/simonw/help-scraper/commit/f11072ff23f0d654395be7c2b1e98e84dbbc26a3#diff-c9cd49cf2aa3b983457e2812ba9313cc254aba74aaba9a36d56c867e32221589"&gt;an example&lt;/a&gt; from this morning.&lt;/p&gt;
&lt;h3&gt;Other weeknotes&lt;/h3&gt;
&lt;p&gt;I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to &lt;a href="https://github.com/simonw/datasette/milestone/7"&gt;that milestone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This week I did &lt;a href="https://github.com/simonw/datasette/issues/1533"&gt;a bunch of work&lt;/a&gt; adding a &lt;code&gt;Link: https://...; rel="alternate"; type="application/datasette+json"&lt;/code&gt; HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.&lt;/p&gt;
&lt;p&gt;(I had originally planned &lt;a href="https://github.com/simonw/datasette/issues/1534"&gt;to also support&lt;/a&gt; &lt;code&gt;Accept: application/json&lt;/code&gt; request headers for this, but I've been put off that idea by the discovery that Cloudflare &lt;a href="https://twitter.com/simonw/status/1478470282931163137"&gt;deliberately ignores&lt;/a&gt; the &lt;code&gt;Vary: Accept&lt;/code&gt; header.)&lt;/p&gt;
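&lt;p&gt;For illustration, here's a minimal Python sketch of how a client might discover the JSON alternate from a header like that. It assumes the bracket-less header form quoted above rather than implementing the full RFC 8288 grammar, and the function names are illustrative:&lt;/p&gt;

```python
def parse_link_header(value):
    """Parse a Link header of the form shown above into URL plus attributes.

    Deliberately minimal: handles the bracket-less form quoted in this
    post, not the full RFC 8288 grammar.
    """
    url, *params = [part.strip() for part in value.split(";")]
    attrs = {}
    for param in params:
        key, _, val = param.partition("=")
        attrs[key.strip()] = val.strip().strip('"')
    return url, attrs

def find_json_alternate(headers):
    """Return the alternate-JSON URL advertised by a response, if any."""
    value = headers.get("Link", "")
    if not value:
        return None
    url, attrs = parse_link_header(value)
    if attrs.get("rel") == "alternate" and "json" in attrs.get("type", ""):
        return url
    return None
```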
&lt;p&gt;Unrelated to Datasette: I also started a new Twitter thread, gathering &lt;a href="https://twitter.com/simonw/status/1487673496977113088"&gt;behind the scenes material from the movie the Mitchells vs the Machines&lt;/a&gt;. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.&lt;/p&gt;
&lt;p&gt;The last time I did this &lt;a href="https://twitter.com/simonw/status/1077737871602110466"&gt;was for Into the Spider-Verse&lt;/a&gt; (from the same studio) and that thread ended up running for more than a year!&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/only-run-integration"&gt;Opt-in integration tests with pytest --integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/graphql/get-graphql-schema"&gt;get-graphql-schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/python-3-11"&gt;Testing against Python 3.11 preview using GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="graphql"/><category term="weeknotes"/><category term="github-actions"/><category term="git-scraping"/><category term="fly"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-series" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-series</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintains an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
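&lt;p&gt;The core of that scraper loop is small enough to sketch. Here's a minimal illustration in Python (the real scraper is a shell step in a GitHub Actions workflow; the file name and commit message here are hypothetical):&lt;/p&gt;

```python
import json
import subprocess

def snapshot(raw: str) -> str:
    # Pretty-print with stable key order so that successive
    # snapshots produce small, readable diffs - the role jq
    # plays in the shell version of this kind of scraper.
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"

def commit_if_changed(path: str = "incidents.json") -> None:
    # git commit exits non-zero when the file is unchanged,
    # which conveniently makes a no-change run a no-op.
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", "Latest data"])
```

&lt;p&gt;Run on a schedule, this produces a commit only when the underlying data actually changed.&lt;/p&gt;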
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to identify which records have changed over time, as opposed to being newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool; see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions, the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
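&lt;p&gt;The idea can be sketched as a small function - this is an illustration of the sparse-version concept, not git-history's actual implementation:&lt;/p&gt;

```python
def diff_version(previous: dict, current: dict):
    # Record only the columns whose value differs from the previous
    # version, plus the set of changed column names - so a genuine
    # change *to* None can be told apart from "column unchanged".
    changed = {
        key for key in current
        if key not in previous or current[key] != previous[key]
    }
    sparse = {key: current[key] for key in changed}
    return sparse, changed

v1 = {"AcresBurned": 500, "PercentContained": 10, "Type": "Wildfire"}
v2 = {"AcresBurned": 1200, "PercentContained": None, "Type": "Wildfire"}
sparse, changed = diff_version(v1, v2)
# sparse keeps the None for PercentContained because that column
# appears in the changed set; Type is omitted entirely
```
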
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for data that has been scraped to not fit this ideal shape.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and there's a bunch of data in there that's quite noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even if there are no changes, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the &lt;code&gt;--convert&lt;/code&gt; option to the script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
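&lt;p&gt;The compile-a-string-into-a-function step can be sketched like this - an illustrative approximation using &lt;code&gt;exec&lt;/code&gt;, not git-history's actual implementation:&lt;/p&gt;

```python
import json
import textwrap

def compile_convert(code: str):
    # Wrap the user-supplied body in a generator function definition,
    # then exec it in a namespace that exposes the json module -
    # roughly how a --convert string could become a callable that
    # runs against each Git version of the file in turn.
    body = textwrap.indent(textwrap.dedent(code), "    ")
    source = "def convert(content):\n" + body
    namespace = {"json": json}
    exec(source, namespace)
    return namespace["convert"]

convert = compile_convert('''
data = json.loads(content)
for item in data["Incidents"]:
    yield item
''')

rows = list(convert('{"Incidents": [{"UniqueId": "a1", "Name": "Dixie Fire"}]}'))
```

&lt;p&gt;Because the body uses &lt;code&gt;yield&lt;/code&gt;, the compiled function is a generator - which is why a conversion snippet can both skip bad versions with a bare &lt;code&gt;return&lt;/code&gt; and emit any number of records.&lt;/p&gt;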
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos now built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Git scraping, the five minute lightning talk</title><link href="https://simonwillison.net/2021/Mar/5/git-scraping/#atom-series" rel="alternate"/><published>2021-03-05T00:44:15+00:00</published><updated>2021-03-05T00:44:15+00:00</updated><id>https://simonwillison.net/2021/Mar/5/git-scraping/#atom-series</id><summary type="html">
    &lt;p&gt;I prepared a lightning talk about &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; for the &lt;a href="https://www.ire.org/training/conferences/nicar-2021/"&gt;NICAR 2021&lt;/a&gt; data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC's vaccination data using the GitHub web interface. Here's the video.&lt;/p&gt;
&lt;div class="resp-container"&gt;
    &lt;iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/2CjA-03yK8I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;
&lt;/div&gt;
&lt;h4&gt;Notes from the talk&lt;/h4&gt;
&lt;p&gt;Here's &lt;a href="https://m.pge.com/#outages"&gt;the PG&amp;amp;E outage map&lt;/a&gt; that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.&lt;/p&gt;
&lt;p&gt;I scraped that outage data into &lt;a href="https://github.com/simonw/pge-outages"&gt;simonw/pge-outages&lt;/a&gt; - here's the &lt;a href="https://github.com/simonw/pge-outages/commits"&gt;commit history&lt;/a&gt; (over 40,000 commits now!)&lt;/p&gt;
&lt;p&gt;The scraper code itself &lt;a href="https://github.com/simonw/disaster-scrapers/blob/3eed6eca820e14e2f89db3910d1aece72717d387/pge.py"&gt;is here&lt;/a&gt;. I wrote about the project in detail in &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; - my database of outages is at &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages"&gt;pge-outages.simonwillison.net&lt;/a&gt; and the animation I made of outages over time is attached to &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;this tweet&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;The much simpler scraper for the &lt;a href="https://www.fire.ca.gov/incidents"&gt;www.fire.ca.gov/incidents&lt;/a&gt; website is at &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the video I used that as the template to create a new scraper for CDC vaccination data - their website is &lt;a href="https://covid.cdc.gov/covid-data-tracker/#vaccinations"&gt;https://covid.cdc.gov/covid-data-tracker/#vaccinations&lt;/a&gt; and the API I found using the browser developer tools is &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new CDC scraper and the data it has scraped lives in &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;simonw/cdc-vaccination-history&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can find more examples of Git scraping in the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicar"&gt;nicar&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="scraping"/><category term="my-talks"/><category term="github-actions"/><category term="git-scraping"/><category term="annotated-talks"/><category term="nicar"/></entry><entry><title>Git scraping: track changes over time by scraping to a Git repository</title><link href="https://simonwillison.net/2020/Oct/9/git-scraping/#atom-series" rel="alternate"/><published>2020-10-09T18:27:23+00:00</published><updated>2020-10-09T18:27:23+00:00</updated><id>https://simonwillison.net/2020/Oct/9/git-scraping/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Git scraping&lt;/strong&gt; is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th March 2021:&lt;/strong&gt; I presented a version of this post as &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;a five minute lightning talk at NICAR 2021&lt;/a&gt;, which includes a live coding demo of building a new git scraper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 5th January 2022:&lt;/strong&gt; I released a tool called &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history&lt;/a&gt; that helps analyze data that has been collected using this technique.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - The &lt;a href="https://twitter.com/nyt_diff"&gt;@nyt_diff Twitter account&lt;/a&gt; tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication's editorial process.&lt;/p&gt;
&lt;p&gt;We already have a great tool for efficiently tracking changes to text over time: &lt;strong&gt;Git&lt;/strong&gt;. And &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; (and other CI systems) make it easy to create a scraper that runs every few minutes, recording the current state of a resource and capturing its changes over time in the commit history.&lt;/p&gt;
&lt;p&gt;Here's a recent example. Fires continue to rage in California, and the &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE website&lt;/a&gt; offers an &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incident map&lt;/a&gt; showing the latest fire activity around the state.&lt;/p&gt;
&lt;p&gt;Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size (largest first) reveals this endpoint:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's a 241KB JSON endpoint with full details of the various fires around the state.&lt;/p&gt;
&lt;p&gt;So... I started running a git scraper against it. My scraper lives in the &lt;a href="https://github.com/simonw/ca-fires-history"&gt;simonw/ca-fires-history&lt;/a&gt; repository on GitHub.&lt;/p&gt;
&lt;p&gt;Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using &lt;code&gt;jq&lt;/code&gt; and commits it back to the repo if it has changed.&lt;/p&gt;
&lt;p&gt;This means I now have a &lt;a href="https://github.com/simonw/ca-fires-history/commits/main"&gt;commit log&lt;/a&gt; of changes to that information about fires in California. Here's an &lt;a href="https://github.com/simonw/ca-fires-history/commit/7b0f42d4bf198885ab2b41a22a8da47157572d18"&gt;example commit&lt;/a&gt; showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/git-scraping.png" alt="Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;.github/workflows/scrape.yml&lt;/a&gt; which looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape latest data&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;6,26,46 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;scheduled&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Check out this repo&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Fetch latest data&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . &amp;gt; incidents.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push if it changed&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "Latest data: ${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's not a lot of code!&lt;/p&gt;
&lt;p&gt;It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.&lt;/p&gt;
&lt;p&gt;The scraper itself works by fetching the JSON using &lt;code&gt;curl&lt;/code&gt;, piping it through &lt;code&gt;jq .&lt;/code&gt; to pretty-print it and saving the result to &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
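&lt;p&gt;If you'd rather avoid the &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; dependencies, the same fetch-and-pretty-print step can be sketched in Python using just the standard library. This is an illustrative equivalent rather than the workflow's actual code; passing &lt;code&gt;sort_keys=True&lt;/code&gt; goes a step further than &lt;code&gt;jq .&lt;/code&gt; by also stabilizing key order between fetches:&lt;/p&gt;

```python
import json
import urllib.request

URL = "https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"

def pretty(raw):
    """Re-serialize JSON with stable indentation (and key order) so diffs stay readable."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"

def scrape(url=URL, path="incidents.json"):
    """Fetch the JSON feed and write a diff-friendly copy to disk."""
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode("utf-8")
    with open(path, "w") as f:
        f.write(pretty(raw))
```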
&lt;p&gt;The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in &lt;a href="https://til.simonwillison.net/til/til/github-actions_commit-if-file-changed.md"&gt;this TIL&lt;/a&gt; a few months ago.&lt;/p&gt;
&lt;p&gt;I have a whole bunch of repositories running git scrapers now. I've been labeling them with the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; so they show up in one place on GitHub (other people have started using that topic as well).&lt;/p&gt;
&lt;p&gt;I've written about some of these &lt;a href="https://simonwillison.net/tags/gitscraping/"&gt;in the past&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;Scraping hurricane Irma&lt;/a&gt; back in September 2017 is when I first came up with the idea to use a Git repository in this way.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;Changelogs to help understand the fires in the North Bay&lt;/a&gt; from October 2017 describes an early attempt at scraping fire-related information.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;Generating a commit log for San Francisco’s official list of trees&lt;/a&gt; remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have &lt;a href="https://github.com/simonw/sf-tree-history/find/master"&gt;a commit log&lt;/a&gt; of changes to it stretching back over more than a year. This example uses my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt; utility to generate human-readable commit messages.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; documents my attempts to track the impact of PG&amp;amp;E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=24732943"&gt;Comment thread&lt;/a&gt; on this post over on Hacker News.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="github"/><category term="projects"/><category term="scraping"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Tracking PG&amp;E outages by scraping to a git repo</title><link href="https://simonwillison.net/2019/Oct/10/pge-outages/#atom-series" rel="alternate"/><published>2019-10-10T23:32:14+00:00</published><updated>2019-10-10T23:32:14+00:00</updated><id>https://simonwillison.net/2019/Oct/10/pge-outages/#atom-series</id><summary type="html">
    &lt;p&gt;PG&amp;amp;E have &lt;a href="https://twitter.com/bedwardstiek/status/1182047040932470784"&gt;cut off power&lt;/a&gt; to several million people in northern California, supposedly as a precaution against wildfires.&lt;/p&gt;

&lt;p&gt;As it happens, I've been scraping and recording PG&amp;amp;E's outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!&lt;/p&gt;

&lt;p&gt;The original data lives in &lt;a href="https://github.com/simonw/pge-outages"&gt;a GitHub repo&lt;/a&gt; (more importantly in &lt;a href="https://github.com/simonw/pge-outages/commits/master"&gt;the commit history&lt;/a&gt; of that repo).&lt;/p&gt;

&lt;p&gt;Reading JSON in a Git repo isn't particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The result is &lt;code&gt;https://pge-outages.simonwillison.net/&lt;/code&gt; (no longer available)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update from 27th October 2019&lt;/strong&gt;: I also used the data to create this animation (first shared &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;on Twitter&lt;/a&gt;):&lt;/p&gt;

&lt;video style="max-width: 100%" src="https://static.simonwillison.net/static/2019/outages.mp4" controls="controls"&gt;
  Your browser does not support the video tag.
&lt;/video&gt;

&lt;h3 id="thedatamodeloutagesandsnapshots"&gt;The data model: outages and snapshots&lt;/h3&gt;

&lt;p&gt;The three key tables to understand are &lt;code&gt;outages&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt; and &lt;code&gt;outage_snapshots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PG&amp;amp;E assign an outage ID to every outage - where an outage is usually something that affects a few dozen customers. I store these in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages?_sort_desc=outageStartTime"&gt;outages table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots?_sort_desc=id"&gt;snapshots table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting table is &lt;code&gt;outage_snapshots&lt;/code&gt;. Every time I see an outage in the JSON feed, I record a new copy of its data as an &lt;code&gt;outage_snapshot&lt;/code&gt; row. This allows me to reconstruct the full history of any outage, in 10 minute increments.&lt;/p&gt;

&lt;p&gt;Here are &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outage_snapshots?snapshot=1269"&gt;all of the outages&lt;/a&gt; that were represented in &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots/1269"&gt;snapshot 1269&lt;/a&gt; - captured at 4:10pm Pacific Time today.&lt;/p&gt;

&lt;p&gt;I can run &lt;code&gt;select sum(estCustAffected) from outage_snapshots where snapshot = 1269&lt;/code&gt; (&lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+sum%28estCustAffected%29+from+outage_snapshots+where+snapshot+%3D+%3Aid&amp;amp;id=1269"&gt;try it here&lt;/a&gt;) to count up the total PG&amp;amp;E estimate of the number of affected customers - it's 545,706!&lt;/p&gt;

&lt;p&gt;I've installed &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; which means I can render graphs. Here's my first attempt at a graph showing &lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+snapshots.id%2C+title+as+snapshotTime%2C+hash%2C+sum%28outage_snapshots.estCustAffected%29+as+totalEstCustAffected%0D%0Afrom+snapshots+join+outage_snapshots+on+snapshots.id+%3D+outage_snapshots.snapshot%0D%0Agroup+by+snapshots.id+order+by+snapshots.id+desc+limit+150#g.mark=line&amp;amp;g.x_column=snapshotTime&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=totalEstCustAffected&amp;amp;g.y_type=quantitative"&gt;the number of estimated customers affected over time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(I don't know why there's a dip towards the end of the graph).&lt;/p&gt;

&lt;p&gt;I also defined &lt;a href="https://pge-outages.simonwillison.net/pge-outages/most_recent_snapshot"&gt;a SQL view&lt;/a&gt; which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&amp;amp;E website hasn't gone down) and renders them using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-map.jpg" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-map.jpg" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="thingstobeawareof"&gt;Things to be aware of&lt;/h3&gt;

&lt;p&gt;There are a huge number of unanswered questions about this data. I've just been looking at PG&amp;amp;E's JSON and making guesses about what things like &lt;code&gt;estCustAffected&lt;/code&gt; mean. Without official documentation we can only guess at how accurate this data is, or how it should be interpreted.&lt;/p&gt;

&lt;p&gt;Some things to question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the quality of this data? Does it accurately reflect what's actually going on out there?&lt;/li&gt;

&lt;li&gt;What's the exact meaning of the different columns - &lt;code&gt;estCustAffected&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;, &lt;code&gt;hazardFlag&lt;/code&gt; etc?&lt;/li&gt;

&lt;li&gt;Various columns (&lt;code&gt;lastUpdateTime&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;) appear to be integer &lt;a href="https://en.wikipedia.org/wiki/Unix_time"&gt;unix timestamps&lt;/a&gt;. What timezone were they recorded in? Do they include DST etc?&lt;/li&gt;
&lt;/ul&gt;
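&lt;p&gt;One note on that last question: an integer unix timestamp is by definition a count of seconds since 1970-01-01 UTC, so if these fields really are unix timestamps the timezone question answers itself; the open question is whether PG&amp;amp;E actually populates them that way. Decoding one as UTC in Python looks like this (the timestamp value below is illustrative, not taken from the feed):&lt;/p&gt;

```python
from datetime import datetime, timezone

def decode(ts):
    """Interpret an integer unix timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# An illustrative value: 1570752000 decodes to midnight UTC on 2019-10-11
print(decode(1570752000).isoformat())
```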

&lt;h3 id="howitworks"&gt;How it works&lt;/h3&gt;

&lt;p&gt;I originally wrote the scraper &lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;back in October 2017&lt;/a&gt; during the North Bay fires, and moved it to run on Circle CI based on my work building &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;a commit history of San Francisco's trees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's pretty simple: every 10 minutes &lt;a href="https://circleci.com/gh/simonw/disaster-scrapers"&gt;a Circle CI job&lt;/a&gt; runs which scrapes &lt;a href="https://apim.pge.com/cocoutage/outages/getOutagesRegions?regionType=city&amp;amp;expand=true"&gt;the JSON feed&lt;/a&gt; that powers the PG&amp;amp;E website's &lt;a href="https://www.pge.com/myhome/outages/outage/index.shtml"&gt;outage map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The JSON is then committed to my &lt;a href="https://github.com/simonw/pge-outages"&gt;pge-outages GitHub repository&lt;/a&gt;, overwriting the existing &lt;a href="https://github.com/simonw/pge-outages/blob/master/pge-outages.json"&gt;pge-outages.json file&lt;/a&gt;. There's some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.&lt;/p&gt;

&lt;h3 id="buildingthedatasette"&gt;Building the Datasette&lt;/h3&gt;

&lt;p&gt;The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.&lt;/p&gt;

&lt;p&gt;After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.&lt;/p&gt;

&lt;p&gt;The code that generates the database can be found in &lt;a href="https://github.com/simonw/pge-outages/blob/master/build_database.py"&gt;build_database.py&lt;/a&gt;. I used &lt;a href="https://gitpython.readthedocs.io/en/stable/"&gt;GitPython&lt;/a&gt; to read data from the git repository and my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils library&lt;/a&gt; to create and update the database.&lt;/p&gt;
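&lt;p&gt;The overall shape of that script can be sketched with the standard library alone. To be clear, this is a simplified sketch rather than the real &lt;code&gt;build_database.py&lt;/code&gt; (which uses GitPython and sqlite-utils), and it covers only a subset of the schema described above; the &lt;code&gt;"id"&lt;/code&gt; key for outages is an assumption about the feed's JSON:&lt;/p&gt;

```python
import json
import sqlite3
import subprocess

def file_history(repo_dir, path):
    """Yield (commit_hash, file_contents) for every commit touching path, oldest first."""
    commits = subprocess.run(
        ["git", "log", "--reverse", "--format=%H", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.split()
    for commit in commits:
        contents = subprocess.run(
            ["git", "show", "{}:{}".format(commit, path)],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
        yield commit, contents

def load_snapshots(db, history):
    """Record each (commit, JSON text) pair as a snapshot plus its outage_snapshots rows."""
    db.execute("create table if not exists snapshots (id integer primary key, hash text)")
    db.execute(
        "create table if not exists outage_snapshots "
        "(snapshot integer, outage text, estCustAffected integer)"
    )
    for commit, contents in history:
        snapshot_id = db.execute(
            "insert into snapshots (hash) values (?)", (commit,)
        ).lastrowid
        # "id" is an assumed key name; estCustAffected is the field discussed above
        for outage in json.loads(contents):
            db.execute(
                "insert into outage_snapshots values (?, ?, ?)",
                (snapshot_id, outage["id"], outage.get("estCustAffected")),
            )
    db.commit()
```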

&lt;h3 id="deployment"&gt;Deployment&lt;/h3&gt;

&lt;p&gt;Since this is a large database that changes every ten minutes, I couldn't use the usual &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.&lt;/p&gt;

&lt;p&gt;Instead, I'm running it on a VPS instance. I ended up trying out Digital Ocean for this, after &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;an enjoyable Twitter conversation&lt;/a&gt; about good options for stateful (as opposed to stateless) hosting.&lt;/p&gt;

&lt;h3 id="nextsteps"&gt;Next steps&lt;/h3&gt;

&lt;p&gt;I'm putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology - or take what I've done and spin up much more interesting forks of it.&lt;/p&gt;

&lt;p&gt;If you build something interesting with this please let me know, via email (swillison is my Gmail) or &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="digitalocean"/><category term="sqlite-utils"/></entry><entry><title>Generating a commit log for San Francisco's official list of trees</title><link href="https://simonwillison.net/2019/Mar/13/tree-history/#atom-series" rel="alternate"/><published>2019-03-13T14:49:48+00:00</published><updated>2019-03-13T14:49:48+00:00</updated><id>https://simonwillison.net/2019/Mar/13/tree-history/#atom-series</id><summary type="html">
    &lt;p&gt;San Francisco has a &lt;a href="https://datasf.org/"&gt;neat open data portal&lt;/a&gt; (as do an &lt;a href="https://opendatainception.io/"&gt;increasingly large number&lt;/a&gt; of cities these days). For a few years my favourite file on there has been &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Street Tree List&lt;/a&gt;, a list of all 190,000 trees in the city maintained by the Department of Public Works.&lt;/p&gt;
&lt;p&gt;I’ve been using that file for Datasette demos &lt;a href="https://simonwillison.net/2017/Nov/25/new-in-datasette/"&gt;for a while now&lt;/a&gt;, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; is a living document.&lt;/p&gt;
&lt;p&gt;Back in September 2017 I built a &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;scraping project relating to hurricane Irma&lt;/a&gt;. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.&lt;/p&gt;
&lt;p&gt;I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.&lt;/p&gt;
&lt;h3&gt;&lt;a id="sftreehistory_8"&gt;&lt;/a&gt;sf-tree-history&lt;/h3&gt;
&lt;p&gt;Here’s the result: &lt;a href="https://github.com/simonw/sf-tree-history"&gt;sf-tree-history&lt;/a&gt;, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt;, a &lt;code&gt;README&lt;/code&gt;, and a &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;Circle CI configuration&lt;/a&gt; that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.&lt;/p&gt;
&lt;p&gt;The most interesting part of the repo is the &lt;a href="https://github.com/simonw/sf-tree-history/commits/master"&gt;commit history&lt;/a&gt; itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/7ab432cdcb8d7914cfea4a5b59803f38cade532b"&gt;7ab432cdcb8d7914cfea4a5b59803f38cade532b&lt;/a&gt; from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/d6b258959af9546909b2eee836f0156ed88cd45d"&gt;d6b258959af9546909b2eee836f0156ed88cd45d&lt;/a&gt; from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;ca66d9a5fdd632549301d249c487004a5b68abf2&lt;/a&gt; lists 2151 rows changed, 1280 rows added! I found an old copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; on my laptop from April 2018, so for fun I loaded it into the repository and used &lt;code&gt;git commit amend&lt;/code&gt; to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view &lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;sadly truncates it&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;a id="csvdiff_18"&gt;&lt;/a&gt;csv-diff&lt;/h3&gt;
&lt;p&gt;One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.&lt;/p&gt;
&lt;p&gt;The result is &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt;, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.&lt;/p&gt;
&lt;p&gt;Using it looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {&amp;quot;id&amp;quot;: &amp;quot;3&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Bailey&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;1&amp;quot;}

1 row removed

  {&amp;quot;id&amp;quot;: &amp;quot;2&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Pancakes&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;2&amp;quot;}

1 row changed

  Row 1
    age: &amp;quot;4&amp;quot; =&amp;gt; &amp;quot;5&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/csv-diff/blob/master/README.md"&gt;csv-diff README&lt;/a&gt; has further details on the tool.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Circle_CI_44"&gt;&lt;/a&gt;Circle CI&lt;/h3&gt;
&lt;p&gt;My favourite thing about the &lt;code&gt;sf-tree-history&lt;/code&gt; project is that it costs me nothing to run - either in hosting costs or (hopefully) in terms of ongoing maintenance.&lt;/p&gt;
&lt;p&gt;The git repository is hosted for free on GitHub. Because it’s a public project, &lt;a href="https://circleci.com/"&gt;Circle CI&lt;/a&gt; will run tasks against it for free.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;.circleci/config.yml&lt;/a&gt; does the rest. It uses Circle’s &lt;a href="https://circleci.com/docs/2.0/workflows/#scheduling-a-workflow"&gt;cron syntax&lt;/a&gt; to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv &amp;quot;https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD&amp;quot;
git add Street_Tree_List.csv
git config --global user.email &amp;quot;treebot@example.com&amp;quot;
git config --global user.name &amp;quot;Treebot&amp;quot;
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID &amp;gt; message.txt
git commit -F message.txt &amp;amp;&amp;amp; \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This script does all of the work.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First it backs up the existing &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; as &lt;code&gt;Street_Tree_List-old.csv&lt;/code&gt;, in order to be able to run a comparison later.&lt;/li&gt;
&lt;li&gt;It downloads the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; from the San Francisco data portal&lt;/li&gt;
&lt;li&gt;It adds the file to the git index and sets itself an identity for use in the commit&lt;/li&gt;
&lt;li&gt;It installs my &lt;code&gt;csv-diff&lt;/code&gt; utility &lt;a href="https://pypi.org/project/csv-diff/"&gt;from PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It uses &lt;code&gt;csv-diff&lt;/code&gt; to create a diff of the two files, and writes that diff to a new file called &lt;code&gt;message.txt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finally, it attempts to create a new commit using &lt;code&gt;message.txt&lt;/code&gt; as the commit message, then pushes the result to GitHub&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the &lt;code&gt;run&lt;/code&gt; block return a non-0 exit code. &lt;code&gt;git commit&lt;/code&gt; returns a non-0 exit code if you attempt to run it but none of the files have changed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git commit ... &amp;amp;&amp;amp; git push ... || true&lt;/code&gt; ensures that if &lt;code&gt;git commit&lt;/code&gt; succeeds the &lt;code&gt;git push&lt;/code&gt; command will be run, BUT if it fails the &lt;code&gt;|| true&lt;/code&gt; will still return a 0 exit code for the overall line - so Circle CI will not mark the build as failed.&lt;/p&gt;
&lt;p&gt;There’s one last trick here: I’m using &lt;code&gt;git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master&lt;/code&gt; to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are &lt;a href="https://circleci.com/docs/2.0/env-vars/"&gt;the recommended way&lt;/a&gt; to configure secrets such that they cannot be viewed by anyone browsing &lt;a href="https://circleci.com/gh/simonw/sf-tree-history"&gt;your Circle CI builds&lt;/a&gt;. I created a &lt;a href="https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line"&gt;personal GitHub auth token&lt;/a&gt; for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Next_steps_78"&gt;&lt;/a&gt;Next steps&lt;/h3&gt;
&lt;p&gt;I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff utility&lt;/a&gt;) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/san-francisco"&gt;san-francisco&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="san-francisco"/><category term="git-scraping"/></entry><entry><title>Changelogs to help understand the fires in the North Bay</title><link href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-series" rel="alternate"/><published>2017-10-10T06:48:07+00:00</published><updated>2017-10-10T06:48:07+00:00</updated><id>https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/#atom-series</id><summary type="html">
&lt;p&gt;The situation in the counties north of San Francisco &lt;a href="http://www.sfgate.com/bayarea/article/Latest-on-North-Bay-fires-A-really-rough-12263721.php"&gt;is horrifying right now&lt;/a&gt;. I’ve repurposed some of &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;the tools I built for the Irma Response project&lt;/a&gt; last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.&lt;/p&gt;
&lt;p&gt;I’m scraping a number of sources relevant to the crisis, and making the data available in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a repository on GitHub&lt;/a&gt;. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself, it’s the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;this history of changes&lt;/a&gt; to &lt;a href="https://srcity.org/610/Emergency-Information"&gt;srcity.org/610/Emergency-Information&lt;/a&gt; to see when they edited the information.&lt;/p&gt;
&lt;p&gt;The sources I’m tracking right now are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://srcity.org/610/Emergency-Information"&gt;Santa Rosa Fire Department’s Emergency Information&lt;/a&gt; page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/santa-rosa-emergency.json"&gt;History of changes to that page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.pge.com/#outages"&gt;PG&amp;amp;E power outages&lt;/a&gt;. This is probably the highest quality dataset with the &lt;a href="https://github.com/simonw/irma-scraped-data/commit/50ab3d3f3a5f117054e3209c7f0d520e6b483f0e#diff-2432d375ba73b2c87c88f55b12a0a2f0"&gt;neatest commit messages&lt;/a&gt;. The &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/pge-outages-individual.json"&gt;commit history of these&lt;/a&gt; shows exactly when new outages are reported and how many customers were affected.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://roadconditions.sonoma-county.org/"&gt;Road Conditions in the County of Sonoma&lt;/a&gt;. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/sonoma-road-conditions.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;California Highway Patrol Incidents, extracted from a KML feed on &lt;a href="http://quickmap.dot.ca.gov/"&gt;quickmap.dot.ca.gov&lt;/a&gt;. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master/chp-incidents.json"&gt;History of changes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The code for the scrapers can be &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/north_bay.py"&gt;found in north_bay.py&lt;/a&gt;. Please leave comments, feedback or suggestions on other useful potential sources of data &lt;a href="https://github.com/simonw/simonwillisonblog/issues/4"&gt;in this GitHub issue&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry><entry><title>Scraping hurricane Irma</title><link href="https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-series" rel="alternate"/><published>2017-09-10T06:21:17+00:00</published><updated>2017-09-10T06:21:17+00:00</updated><id>https://simonwillison.net/2017/Sep/10/scraping-irma/#atom-series</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://www.irmaresponse.org/"&gt;Irma Response project&lt;/a&gt; is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The &lt;a href="https://irma-api.herokuapp.com/"&gt;Irma API&lt;/a&gt; is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the &lt;a href="https://www.irmashelters.org/"&gt;irmashelters.org&lt;/a&gt; website.&lt;/p&gt;
&lt;p&gt;To aid this effort, I built a collection of screen scrapers that pull data from a number of different websites and APIs. That data is then stored in &lt;a href="https://github.com/simonw/irma-scraped-data/"&gt;a Git repository&lt;/a&gt;, providing a clear history of changes made to the various sources that are being tracked.&lt;/p&gt;
&lt;p&gt;Some of the scrapers also publish their findings to Slack in a format designed to make it obvious when key events happen, such as new shelters being added or removed from public listings.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Tracking_changes_over_time_8"&gt;&lt;/a&gt;Tracking changes over time&lt;/h3&gt;
&lt;p&gt;A key goal of this screen scraping mechanism is to allow changes to the underlying data sources to be tracked over time. This is achieved using git, via the GitHub API. Each scraper pulls down data from a source (an API or a website) and reformats that data into a sanitized JSON format. That JSON is then written to the git repository. If the data has changed since the last time the scraper ran, those changes will be captured by git and made available in the commit log.&lt;/p&gt;
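&lt;p&gt;Writing through the GitHub API means the scrapers don’t need a local checkout at all. Here’s a rough sketch of that write step using the GitHub contents endpoint; the repository name and token are placeholders, and this isn’t my actual scraper code:&lt;/p&gt;

```python
import base64
import json
import urllib.request

# Placeholder repository; the real data lives in simonw/irma-scraped-data
API = "https://api.github.com/repos/OWNER/REPO/contents/{path}"


def encode_content(text):
    """The GitHub contents API expects the new file body as base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def put_file(path, text, message, token, sha=None):
    """Create or update a file in the repository. Pass sha (the blob sha
    of the existing file) when updating; omit it when creating."""
    body = {"message": message, "content": encode_content(text)}
    if sha:
        body["sha"] = sha
    req = urllib.request.Request(
        API.format(path=path),
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"token {token}"},
    )
    return urllib.request.urlopen(req)
```

&lt;p&gt;Each successful PUT creates a commit, so the history accumulates automatically on GitHub’s side.&lt;/p&gt;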
&lt;p&gt;Recent changes tracked by the scraper collection can be seen here: &lt;a href="https://github.com/simonw/irma-scraped-data/commits/master"&gt;https://github.com/simonw/irma-scraped-data/commits/master&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Generating_useful_commit_messages_14"&gt;&lt;/a&gt;Generating useful commit messages&lt;/h3&gt;
&lt;p&gt;The most complex code for most of the scrapers isn’t in fetching the data: it’s in generating useful, human-readable commit messages that summarize the underlying change. For example, here is &lt;a href="https://github.com/simonw/irma-scraped-data/commit/7919aeff0913ec26d1bea8dc"&gt;a commit message&lt;/a&gt; generated by the scraper that tracks the &lt;a href="http://www.floridadisaster.org/shelters/summary.aspx"&gt;http://www.floridadisaster.org/shelters/summary.aspx&lt;/a&gt; page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;florida-shelters.json: 2 shelters added

Added shelter: Atwater Elementary School (Sarasota County)
Added shelter: DEBARY ELEMENTARY SCHOOL (Volusia County)
Change detected on http://www.floridadisaster.org/shelters/summary.aspx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full commit also shows the changes to the underlying JSON, but the human-readable message provides enough information that people who are not JSON-literate programmers can still derive value from the commit.&lt;/p&gt;
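&lt;p&gt;Messages like that can be produced by diffing the previous snapshot against the new one. A hypothetical sketch of the idea (the real code is in the irma-scrapers repository):&lt;/p&gt;

```python
def shelter_message(url, before, after):
    """Build a human-readable commit message from two snapshots of a
    shelter list, summarizing additions and removals in plain English."""
    old, new = set(before), set(after)
    added, removed = sorted(new - old), sorted(old - new)
    summary = []
    if added:
        summary.append(f"{len(added)} shelters added")
    if removed:
        summary.append(f"{len(removed)} shelters removed")
    lines = ["florida-shelters.json: " + ", ".join(summary)]
    lines += [f"Added shelter: {name}" for name in added]
    lines += [f"Removed shelter: {name}" for name in removed]
    lines.append(f"Change detected on {url}")
    return "\n".join(lines)
```

&lt;p&gt;The scraper only needs to remember the names it saw last time; the set difference does the rest.&lt;/p&gt;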
&lt;h3&gt;&lt;a id="Publishing_to_Slack_26"&gt;&lt;/a&gt;Publishing to Slack&lt;/h3&gt;
&lt;p&gt;The Irma Response team use Slack to co-ordinate their efforts. You can join their Slack here: &lt;a href="https://irma-response-slack.herokuapp.com/"&gt;https://irma-response-slack.herokuapp.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Some of the scrapers publish detected changes in their data source to Slack, as links to the commits generated for each change. The human-readable message is posted directly to the channel.&lt;/p&gt;
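&lt;p&gt;Posting to Slack needs little more than an incoming webhook. This sketch assumes a webhook URL you’ve configured in Slack; it isn’t the actual bot code:&lt;/p&gt;

```python
import json
import urllib.request


def slack_payload(message, commit_url):
    """Combine the human-readable commit message with a link to the commit."""
    return {"text": f"{message}\n{commit_url}"}


def post_to_slack(webhook_url, message, commit_url):
    """Send the payload to a Slack incoming webhook."""
    data = json.dumps(slack_payload(message, commit_url)).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```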
&lt;p&gt;&lt;img style="width: 100%" src="http://static.simonwillison.net.s3.amazonaws.com/static/2017/irma-slack.jpg" alt="Bot publishing to Slack" /&gt;&lt;/p&gt;
&lt;p&gt;The source code for all of the scrapers can be found at &lt;a href="https://github.com/simonw/irma-scrapers"&gt;https://github.com/simonw/irma-scrapers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This entry started out as a &lt;a href="https://github.com/simonw/irma-scrapers/blob/master/README.md"&gt;README file&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crisishacking"&gt;crisishacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="scraping"/><category term="crisishacking"/><category term="git-scraping"/></entry></feed>