Simon Willison's Weblog: git-scraping

New improved commit messages for scrape-hacker-news-by-domain

2024-09-06T05:40:01+00:00

New improved commit messages for scrape-hacker-news-by-domain

My simonw/scrape-hacker-news-by-domain repo has a very specific purpose. Once an hour it scrapes the Hacker News /from?site=simonwillison.net page (and the equivalent for datasette.io) using my shot-scraper tool and stashes the parsed links, scores and comment counts in JSON files in that repo.

It does this mainly so I can subscribe to GitHub's Atom feed of the commit log - visit simonw/scrape-hacker-news-by-domain/commits/main and add .atom to the URL to get that.

NetNewsWire will inform me within about an hour if any of my content has made it to Hacker News, and the repo will track the score and comment count for me over time. I wrote more about how this works in Scraping web pages from the command line with shot-scraper back in March 2022.

Prior to the latest improvement, the commit messages themselves were pretty uninformative. The message had the date, and to actually see which Hacker News post it was referring to, I had to click through to the commit and look at the diff.

I built my csv-diff tool a while back to help address this problem: it can produce a slightly more human-readable version of a diff between two CSV or JSON files, ideally suited for including in a commit message attached to a git scraping repo like this one.

I got that working, but there was still room for improvement. I recently learned that any Hacker News thread has an undocumented URL at /latest?id=x which displays the most recently added comments at the top.

I wanted that in my commit messages, so I could quickly click a link to see the most recent comments on a thread.

So... I added one more feature to csv-diff: a new --extra option lets you specify a Python format string to be used to add extra fields to the displayed difference.

My GitHub Actions workflow now runs this command:

csv-diff simonwillison-net.json simonwillison-net-new.json \
  --key id --format json \
  --extra latest 'https://news.ycombinator.com/latest?id={id}' \
  >> /tmp/commit.txt

This generates the diff between the two versions, using the id property in the JSON to tie records together. It adds a latest field linking to that URL.

The commits now look like this:

Tags: hacker-news, json, projects, github-actions, git-scraping, shot-scraper

interactive-feed

2024-07-05T23:39:01+00:00

interactive-feed

Sam Morris maintains this project which gathers interactive, graphic and data visualization stories from various newsrooms around the world and publishes them on Twitter, Mastodon and Bluesky.

It runs automatically using GitHub Actions, and gathers data using a number of different techniques - XML feeds, custom API integrations (for the NYT, Guardian and Washington Post) and in some cases by scraping index pages on news websites using CSS selectors and cheerio.

The data it collects is archived as JSON in the data/ directory of the repository.

Via @palewire

Tags: data-journalism, git-scraping, mastodon, bluesky

Figure out who's leaving the company: dump, diff, repeat

2024-02-09T05:44:31+00:00

Figure out who's leaving the company: dump, diff, repeat

Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous to figure out who has joined or left the company.

I suggest using Git for this—a form of Git scraping—as then you get a detailed commit log of changes over time effectively for free.

I really enjoyed Rachel’s closing thought: “Incidentally, if someone gets mad about you running this sort of thing, you probably don’t want to work there anyway. On the other hand, if you’re able to build such tools without IT or similar getting ”threatened“ by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don’t tend to last.”

Via Hacker News

Tags: git, git-scraping

Tracking Mastodon user numbers over time with a bucket of tricks

2022-11-20T07:00:54+00:00

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.

I've set up a new git scraper to track the number of registered user accounts on known Mastodon instances over time.

It's only been running for a few hours, but it's already collected enough data to render this chart:

I'm looking forward to seeing how this trend continues to develop over the next days and weeks.

Scraping the data

My scraper works by tracking https://instances.social/ - a website that lists a large number (but not all) of the Mastodon instances that are out there.

That site publishes an instances.json array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:

{
    "name": "pleroma.otter.sh",
    "title": "Otterland",
    "short_description": null,
    "description": "Otters does squeak squeak",
    "uptime": 0.944757,
    "up": true,
    "https_score": null,
    "https_rank": null,
    "ipv6": true,
    "openRegistrations": false,
    "users": 5,
    "statuses": "54870",
    "connections": 9821,
}

I have a GitHub Actions workflow running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:

https://github.com/simonw/scrape-instances-social

Since each instance includes a users count, the commit history of my instances.json file tells the story of Mastodon's growth over time.

Building a database

A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.

My git-history tool is designed to do exactly that.

For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that users field for each instance.

Here's how to run git-history against that file's commit history to generate tables showing how that count has changed over time:

git-history file counts.db instances.json \
  --convert "return [
    {
        'id': 'all',
        'users': sum(d['users'] or 0 for d in json.loads(content)),
        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),
    }
  ]" --id id

I'm creating a file called counts.db that shows the history of the instances.json file.

The real trick here though is that --convert argument. I'm using that to compress each snapshot down to a single row that looks like this:

{
    "id": "all",
    "users": 4717781,
    "statuses": 374217860
}

Normally git-history expects to work against an array of objects, tracking the history of changes to each one based on their id property.

Here I'm tricking it a bit - I only return a single object with the ID of all. This means that git-history will only track the history of changes to that single object.

It works though! The result is a counts.db file which is currently 52KB and has the following schema (truncated to the most interesting bits):

CREATE TABLE [commits] (
   [id] INTEGER PRIMARY KEY,
   [namespace] INTEGER REFERENCES [namespaces]([id]),
   [hash] TEXT,
   [commit_at] TEXT
);
CREATE TABLE [item_version] (
   [_id] INTEGER PRIMARY KEY,
   [_item] INTEGER REFERENCES [item]([_id]),
   [_version] INTEGER,
   [_commit] INTEGER REFERENCES [commits]([id]),
   [id] TEXT,
   [users] INTEGER,
   [statuses] INTEGER,
   [_item_full_hash] TEXT
);

Each item_version row will tell us the number of users and statuses at a particular point in time, based on a join against that commits table to find the commit_at date.

Publishing the database

For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.

I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.

I used my s3-credentials tool to create a bucket for this:

~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created  user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}

This created a new bucket called scrape-instances-social configured to work as a website and allow public access.

It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

I enabled a CORS policy on the bucket like this:

s3-credentials set-cors-policy scrape-instances-social

Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:

    - name: Build and publish database using git-history
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |-
        # First download previous database to save some time
        wget https://scrape-instances-social.s3.amazonaws.com/counts.db
        # Update with latest commits
        ./build-count-history.sh
        # Upload to S3
        s3-credentials put-object scrape-instances-social counts.db counts.db \
          --access-key $AWS_ACCESS_KEY_ID \
          --secret-key $AWS_SECRET_ACCESS_KEY

git-history knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.

Exploring the data

Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in Datasette Lite - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.

https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db

Any time anyone follows this link their browser will fetch the latest copy of the counts.db file directly from S3.

The most interesting page in there is the item_version_detail SQL view, which joins against the commits table to show the date of each change:

https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail

(Datasette Lite lets you link directly to pages within Datasette itself via a #hash.)

Plotting a chart

Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an Observable notebook.

Observable has the ability to query SQLite databases (that are served via CORS) directly these days!

Here's my notebook:

https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time

There are only four cells needed to create the chart shown above.

First, we need to open the SQLite database from the remote URL:

database = SQLiteDatabaseClient.open(
  "https://scrape-instances-social.s3.amazonaws.com/counts.db"
)

Next we need to use an Obervable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a query variable:

SELECT _commit_at as date, users, statuses
FROM item_version_detail

We need to make one change to that data - we need to convert the date column from a string to a JavaScript date object:

points = query.map((d) => ({
  date: new Date(d.date),
  users: d.users,
  statuses: d.statuses
}))

Finally, we can plot the data using the Observable Plot charting library like this:

Plot.plot({
  y: {
    grid: true,
    label: "Total users over time across all tracked instances"
  },
  marks: [Plot.line(points, { x: "date", y: "users" })],
  marginLeft: 100
})

I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.

A bunch of tricks combined

This project combines a whole bunch of tricks I've been pulling together over the past few years:

Git scraping is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time
git-history is my tool for turning a scraped Git history into a SQLite database that's easier to work with
s3-credentials makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating
Datasette Lite means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side Datasette Python application on a machine somewhere
And finally, combining the above means I can take advantage of Observable notebooks for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3

Tags: github, projects, datasette, observable, github-actions, git-scraping, git-history, s3-credentials, datasette-lite, mastodon

Measuring traffic during the Half Moon Bay Pumpkin Festival

2022-10-19T15:41:09+00:00

This weekend was the 50th annual Half Moon Bay Pumpkin Festival.

We live in El Granada, a tiny town 8 minutes drive from Half Moon Bay. There is a single road (coastal highway one) between the two towns, and the festival is locally notorious for its impact on traffic.

Natalie suggested that we measure the traffic and try and see the impact for ourselves!

Here's the end result for Saturday. Read on for details on how we created it.

Collecting the data

I built a git scraper to gather data from the Google Maps Directions API. It turns out if you pass departure_time=now to that API it returns the current estimated time in traffic as part of the response.

I picked a location in Half Moon Bay an a location in El Granada and constructed the following URL (pretty-printed):

https://maps.googleapis.com/maps/api/directions/json?
  origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
  &destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
  &departure_time=now
  &key=$GOOGLE_MAPS_KEY

The two locations here are defined using Google Plus codes. Here they are on Google Maps:

I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in this repository to fetch this API data, pretty-print it with jq and write it to the repsoitory:

name: Scrape traffic

on:
  push:
  workflow_dispatch:
  schedule:
  - cron:  '*/5 * * * *'

jobs:
  shot-scraper:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Scrape
      env:
        GOOGLE_MAPS_KEY: ${{ secrets.GOOGLE_MAPS_KEY }}
      run: |        
        curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&departure_time=now&key=$GOOGLE_MAPS_KEY" | jq > one.json
        sleep 3
        curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&departure_time=now&key=$GOOGLE_MAPS_KEY" | jq > two.json
    - name: Commit and push
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "${timestamp}" || exit 0
        git pull --rebase
        git push

I'm using a GitHub Actions secret called GOOGLE_MAPS_KEY to store the Google Maps API key.

This workflow runs every 5 minutes (more-or-less - GitHub Actions doesn't necessarily stick to the schedule). It fetches the two JSON results and writes them to files called one.json and two.json

... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I've built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.

Analyzing the data and drawing some charts

The trick with git scraping is that the data you care about ends up captured in the git commit log. The challenge is how to extract that back out again and turn it into something useful.

My git-history tool is designed to solve this. It's a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.

Normally I run it against CSV or JSON files containing an array of rows - effectively tabular data already, where I just want to record what has changed in between commits.

For this project, I was storing the raw JSON output by the Google Maps API. I didn't care about most of the information in there: I really just wanted the duration_in_traffic value.

git-history can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.

To cut a long story short, here's the incantation that worked:

git-history file hmb.db one.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id

The git-history file command is used to load the history for a specific file - in this case it's the file one.json, which will be loaded into a new SQLite database file called hm.db.

The --convert code uses json.loads(content) to load the JSON for the current file version, then pulls out the ["routes"][0]["legs"][0]["duration_in_traffic"]["value"] nested value from it.

If that's missing (e.g. in an earlier commit, when I hadn't yet added the departure_time=now parameter to the URL) an exception will be caught and the function will return an empty list.

If the duration_in_traffic value is present, the function returns the following:

[{"id": "one", "duration_in_traffic": duration_in_traffic}]

git-history likes lists of dictionaries. It's usually being run against files that contain many different rows, where the id column can be used to de-dupe rows across commits and spot what has changed.

In this case, each file only has a single interesting value.

Two more options are used here:

--full-versions - tells git-history to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a null if a value has not changed in order to save space, but our data is tiny here so we don't need any clever optimizations.
--id id specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.

After running the above command, the resulting schema includes these tables:

CREATE TABLE [commits] (
   [id] INTEGER PRIMARY KEY,
   [namespace] INTEGER REFERENCES [namespaces]([id]),
   [hash] TEXT,
   [commit_at] TEXT
);
CREATE TABLE [item_version] (
   [_id] INTEGER PRIMARY KEY,
   [_item] INTEGER REFERENCES [item]([_id]),
   [_version] INTEGER,
   [_commit] INTEGER REFERENCES [commits]([id]),
   [id] TEXT,
   [duration_in_traffic] INTEGER
);

The commits table includes the date of the commit - commit_at.

The item_version table has that duration_in_traffic value.

So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:

select
    commits.commit_at,
    duration_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at

That query returns data that looks like this:

commit_at	duration_in_traffic
2022-10-15T17:09:06+00:00	1110
2022-10-15T17:17:38+00:00	1016
2022-10-15T17:30:06+00:00	1391

A couple of problems here. First, the commit_at column is in UTC, not local time. And duration_in_traffic is in seconds, which aren't particularly easy to read.

Here's a SQLite fix for these two issues:

select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at

t	mins_in_traffic
10:09:06	18
10:17:38	16
10:30:06	23

datetime(commits.commit_at, '-7 hours') parses the UTC string as a datetime, and then subsracts 7 hours from it to get the local time in California converted from UTC.

I wrap that in time() here because for the chart I want to render I know everything will be on the same day.

mins_in_traffic now shows minutes, not seconds.

We now have enough data to render a chart!

But... we only have one of the two directions of traffic here. To process the numbers from two.json as well I ran this:

git-history file hmb.db two.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id --namespace item2

This is almost the same as the previous command. It's running against two.json instead of one.json, and it's using the --namespace item2 option.

This causes it to populate a new table called item2_version instead of item_version, which is a cheap trick to avoid having to figure out how to load both files into the same table.

Two lines on one chart

I rendered an initial single line chart using datasette-vega, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.

Since I now had one table for each direction of traffic (item_version and item_version2) I decided to combine those into a single table, suitable for pasting into Google Sheets.

Here's the SQL I came up with to do that:

with item1 as (
  select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
  from
    item_version
    join commits on item_version._commit = commits.id
  order by
    commits.commit_at
),
item2 as (
  select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
  from
    item2_version
    join commits on item2_version._commit = commits.id
  order by
    commits.commit_at
)
select
  item1.*,
  item2.mins_in_traffic as mins_in_traffic_other_way
from
  item1
  join item2 on item1.t = item2.t

This uses two CTEs (Common Table Expressions - the with X as pieces) using the pattern I explained earlier - now called item1 and item2. Having defined these two CTEs, I can join them together on the t column, which is the time of day.

Try running this query in Datasette Lite.

Here's the output of that query for Saturday (10am to 8pm):

t	mins_in_traffic	mins_in_traffic_other_way
10:09:06	18	8
10:17:38	16	8
10:30:06	23	9
10:47:38	23	9
10:57:37	23	9
11:08:20	26	9
11:22:27	26	9
11:38:42	26	9
11:52:35	25	9
12:03:23	24	9
12:15:16	21	9
12:27:51	22	9
12:37:48	22	10
12:46:41	21	10
12:55:03	21	10
13:05:10	21	11
13:17:57	21	11
13:32:55	21	11
13:44:53	19	12
13:55:22	21	14
14:05:21	22	14
14:17:48	23	15
14:31:04	22	15
14:41:59	21	14
14:51:48	18	14
15:00:09	18	15
15:11:17	15	14
15:25:48	14	15
15:39:41	11	14
15:51:11	14	15
15:59:34	15	15
16:10:50	19	16
16:25:43	19	18
16:53:06	19	18
17:11:34	18	16
17:40:29	11	11
18:12:07	10	11
18:58:17	8	9
20:05:13	7	7

I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:

Here's the same chart for Sunday:

Our Google Sheet is here - the two days have two separate tabs within the sheet.

Building the SQLite database in GitHub Actions

I did most of the development work for this project on my laptop, running git-history and datasette locally for speed of iteration.

Once I had everything working, I decided to automate the process of building the SQLite database as well.

I made the following changes to my GitHub Actions workflow:

jobs:
  shot-scraper:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
      with:
        fetch-depth: 0 # Needed by git-history
    - name: Set up Python 3.10
      uses: actions/setup-python@v4
      with:
        python-version: "3.10"
        cache: "pip"
    - run: pip install -r requirements.txt
    - name: Scrape
      # Same as before...
      # env:
      # run
    - name: Build SQLite database
      run: |
        rm -f hmb.db # Recreate from scratch each time
        git-history file hmb.db one.json \
        --convert '
        try:
            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
            return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
        except Exception as ex:
            return []
        ' \
          --full-versions \
          --id id
        git-history file hmb.db two.json \
        --convert '
        try:
            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
            return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
        except Exception as ex:
            return []
        ' \
          --full-versions \
          --id id --namespace item2
    - name: Commit and push
      # Same as before...

I also added a requirements.txt file containing just git-history.

Note how the actions/checkout@v3 step now has fetch-depth: 0 - this is necessary because git-history needs to loop through the entire repository history, but actions/checkout@v3 defaults to only fetching the most recent commit.

The setup-python step uses cache: "pip", which causes it to cache installed dependencies from requirements.txt between runs.

Because that big git-history step creates a hmb.db SQLite database, the "Commit and push" step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.

Normally I wouldn't do this, because Git isn't a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won't continue to be updated beyond the end of the pumpkin festival.

End result: hmb.db is available in the GitHub repository.

Querying it using Datasette Lite

Datasette Lite is my repackaged version of my Datasette server-side Python application which runs entirely in the user's browser, using WebAssembly.

A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.

These database files need to be served with CORS headers. Every file served by GitHub includes these headers!

Which means the following URL can be used to open up the latest hmb.db file directly in Datasette in the browser:

https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db

(This takes advantage of a feature I added to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)

URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:

https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t

And finally... Datasette Lite has plugin support. Adding &install=datasette-copyable to the URL adds the datasette-copyable plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).

Here's an example of that plugin in action.

This was a fun little project that brought together a whole bunch of things I've been working on over the past few years. Here's some more of my writing on these different techniques and tools:

Git scraping is the key technique I'm using here to collect the data
I've written a lot about GitHub Actions
These are my notes about git-history, the tool I used to turn a commit history into a SQLite database
Here's my series of posts about Datasette Lite

Tags: data-journalism, projects, sqlite, datasette, git-scraping, git-history, datasette-lite, half-moon-bay

Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022

2022-10-16T03:56:51+00:00

Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022

It’s the Half Moon Bay Pumpkin Festival this weekend... and its impact on the traffic between our little town of El Granada and Half Moon Bay—8 minutes drive away—is notorious. So I built a git scraper that archives estimated driving times from the Google Maps Navigation API, and used git-history to turn that scraped data into a SQLite database and visualize it on a chart.

Via @simonw

Tags: projects, git-scraping, git-history, half-moon-bay

Automatically opening issues when tracked file content changes

2022-04-28T17:18:14+00:00

I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.

Extracting GZipMiddleware from Starlette

Here's why I needed to solve this problem.

I want to add gzip support to my Datasette open source project. Datasette builds on the Python ASGI standard, and Starlette provides an extremely well tested, robust GZipMiddleware class that adds gzip support to any ASGI application. As with everything else in Starlette, it's really good code.

The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.

So I decided to extract the GZipMiddleware class into a separate Python package, under the same BSD license as Starlette itself.

The result is my new asgi-gzip package, now available on PyPI.

What if Starlette fixes a bug?

The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the GZipMiddleware class? How can I make sure to apply those same fixes to my extracted copy?

As I thought about this challenge, I realized I had most of the solution already.

Git scraping is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.

It may seem redundant to do this against a file that already lives in version control elsewhere - but in addition to tracking changes, Git scraping can offfer a cheap and easy way to add automation that triggers when a change is detected.

I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.

Since I already run all of my projects out of GitHub issues, automatically opening an issue against the asgi-gzip repository would be ideal.

My track.yml workflow does exactly that: it implements the Git scraping pattern against the gzip.py module in Starlette, and files an issue any time it detects changes to that file.

Starlette haven't made any changes to that file since I started tracking it, so I created a test repo to try this out.

Here's one of the example issues. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.

How it works

The implementation is contained entirely in this track.yml workflow. I designed this to be contained as a single file to make it easy to copy and paste it to adapt it for other projects.

It uses actions/github-script, which makes it easy to do things like file new issues using JavaScript.

Here's a heavily annotated copy:

name: Track the Starlette version of this

# Run on repo pushes, and if a user clicks the "run this action" button,
# and on a schedule at 5:21am UTC every day
on:
  push:
  workflow_dispatch:
  schedule:
  - cron:  '21 5 * * *'

# Without this block I got this error when the action ran:
# HttpError: Resource not accessible by integration
permissions:
  # Allow the action to create issues
  issues: write
  # Allow the action to commit back to the repository
  contents: write

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/github-script@v6
      # Using env: here to demonstrate how an action like this can
      # be adjusted to take dynamic inputs
      env:
        URL: https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py
        FILE_NAME: tracking/gzip.py
      with:
        script: |
          const { URL, FILE_NAME } = process.env;
          // promisify pattern for getting an await version of child_process.exec
          const util = require("util");
          // Used exec_ here because 'exec' variable name is already used:
          const exec_ = util.promisify(require("child_process").exec);
          // Use curl to download the file
          await exec_(`curl -o ${FILE_NAME} ${URL}`);
          // Use 'git diff' to detect if the file has changed since last time
          const { stdout } = await exec_(`git diff ${FILE_NAME}`);
          if (stdout) {
            // There was a diff to that file
            const title = `${FILE_NAME} was updated`;
            const body =
              `${URL} changed:` +
              "\n\n```diff\n" +
              stdout +
              "\n```\n\n" +
              "Close this issue once those changes have been integrated here";
            const issue = await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: title,
              body: body,
            });
            const issueNumber = issue.data.number;
            // Now commit and reference that issue number, so the commit shows up
            // listed at the bottom of the issue page
            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;
            // https://til.simonwillison.net/github-actions/commit-if-file-changed
            await exec_(`git config user.name "Automated"`);
            await exec_(`git config user.email "actions@users.noreply.github.com"`);
            await exec_(`git add -A`);
            await exec_(`git commit -m "${commitMessage}" || exit 0`);
            await exec_(`git pull --rebase`);
            await exec_(`git push`);
          }

In the asgi-gzip repository I keep the fetched gzip.py file in a tracking/ directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.

More interesting applications

I built this to solve my "tell me when Starlette update their gzip.py file" problem, but clearly this pattern has much more interesting uses.

You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust , shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.

There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!

Update: October 13th 2022

Almost six months after writing about this... it triggered for the first time!

Here's the issue that the script opened: #4: tracking/gzip.py was updated.

I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released version 0.2 of the package.

Tags: github, gzip, projects, python, datasette, asgi, github-actions, git-scraping

Scraping web pages from the command line with shot-scraper

2022-03-14T01:29:56+00:00

I've added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.

Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.

It's also a really neat web scraping tool.

shot-scraper

I introduced shot-scraper last Thursday. It's a Python utility that wraps Playwright, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.

% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'

Since Thursday shot-scraper has had a flurry of releases, adding features like PDF exports, the ability to dump the Chromium accessibilty tree and the ability to take screenshots of authenticated web pages. But the most exciting new feature landed today.

Executing JavaScript and returning the result

Release 0.9 takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:

% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"

Or you can return a JSON object:

% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}

Or if you want to use functions like setTimeout() - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:

% shot-scraper javascript datasette.io "
new Promise(done => setInterval(
  () => {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"

Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:

- name: Test page title
  run: |-
    shot-scraper javascript datasette.io "
      if (document.title != 'Datasette') {
        throw 'Wrong title detected';
      }"

Using this to scrape a web page

The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.

Posts from my blog occasionally show up on Hacker News - sometimes I spot them, sometimes I don't.

https://news.ycombinator.com/from?site=simonwillison.net is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official Hacker News API.

So... let's write a scraper for it.

I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:

Array.from(document.querySelectorAll('.athing'), el => {
  const title = el.querySelector('.titleline a').innerText;
  const points = parseInt(el.nextSibling.querySelector('.score').innerText);
  const url = el.querySelector('.titleline a').href;
  const dt = el.nextSibling.querySelector('.age').title;
  const submitter = el.nextSibling.querySelector('.hnuser').innerText;
  const commentsUrl = el.nextSibling.querySelector('.age a').href;
  const id = commentsUrl.split('?id=')[1];
  // Only posts with comments have a comments link
  const commentsLink = Array.from(
    el.nextSibling.querySelectorAll('a')
  ).filter(el => el && el.innerText.includes('comment'))[0];
  let numComments = 0;
  if (commentsLink) {
    numComments = parseInt(commentsLink.innerText.split()[0]);
  }
  return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})

The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.

I'm using document.querySelectorAll('.itemlist .athing') to loop through each element that matches that selector.

I wrap that with Array.from(...) so I can use the .map() method. Then for each element I can extract out the details that I need.

The resulting array contains 30 items that look like this:

[
  {
    "id": "30658310",
    "title": "Track changes to CLI tools by recording their help output",
    "url": "https://simonwillison.net/2022/Feb/2/help-scraping/",
    "dt": "2022-03-13T05:36:13",
    "submitter": "appwiz",
    "commentsUrl": "https://news.ycombinator.com/item?id=30658310",
    "numComments": 19
  }
]

Running it with shot-scraper

Now that I have a recipe for a scraper, I can run it in the terminal like this:

shot-scraper javascript 'https://news.ycombinator.com/from?site=simonwillison.net' "
Array.from(document.querySelectorAll('.athing'), el => {
  const title = el.querySelector('.titleline a').innerText;
  const points = parseInt(el.nextSibling.querySelector('.score').innerText);
  const url = el.querySelector('.titleline a').href;
  const dt = el.nextSibling.querySelector('.age').title;
  const submitter = el.nextSibling.querySelector('.hnuser').innerText;
  const commentsUrl = el.nextSibling.querySelector('.age a').href;
  const id = commentsUrl.split('?id=')[1];
  // Only posts with comments have a comments link
  const commentsLink = Array.from(
    el.nextSibling.querySelectorAll('a')
  ).filter(el => el && el.innerText.includes('comment'))[0];
  let numComments = 0;
  if (commentsLink) {
    numComments = parseInt(commentsLink.innerText.split()[0]);
  }
  return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})" > simonwillison-net.json

simonwillison-net.json is now a JSON file containing the scraped data.

Running the scraper in GitHub Actions

I want to keep track of changes to this data structure over time. My preferred technique for that is something I call Git scraping - the core idea is to keep the data in a Git repository and commit an update any time it updates. This provides a cheap and robust history of changes over time.

Running the scraper in GitHub Actions means I don't need to administrate my own server to keep this running.

So I built exactly that, in the simonw/scrape-hacker-news-by-domain repository.

The GitHub Actions workflow is in .github/workflows/scrape.yml. It runs the above command once an hour, then pushes a commit back to the repository should the file have any changes since last time it ran.

The commit history of simonwillison-net.json will show me any time a new link from my site appears on Hacker News, or a comment is added.

(Fun GitHub trick: add .atom to the end of that URL to get an Atom feed of those commits.)

The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.

I can see myself using this technique a lot in the future.

Tags: github, hacker-news, scraping, github-actions, git-scraping, shot-scraper

shot-scraper: automated screenshots for documentation, built on Playwright

2022-03-10T00:13:30+00:00

shot-scraper is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my git scraping and help scraping techniques.

Update 13th March 2022: The new shot-scraper javascript command can now be used to scrape web pages from the command line.

Update 14th October 2022: Automating screenshots for the Datasette documentation using shot-scraper offers a tutorial introduction to using the tool.

The problem

I like to include screenshots in documentation. I recently started writing end-user tutorials for Datasette, which are particularly image heavy (for example).

As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.

Introducing shot-scraper

shot-scraper is a tool for automating this process. You can install it using pip like this:

pip install shot-scraper
shot-scraper install

That second shot-scraper install line will install the browser it needs to do its job - more on that later.

You can use it in two ways. To take a one-off screenshot, you can run it like this:

shot-scraper https://simonwillison.net/ -o simonwillison.png

Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:

- url: https://simonwillison.net/
  output: simonwillison.png
- url: https://www.example.com/
  width: 400
  height: 400
  quality: 80
  output: example.jpg

And then use shot-scraper multi to execute every screenshot in one go:

% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'

The documentation describes all of the available options you can use when taking a screenshot.

Each option can be provided to the shot-scraper one-off tool, or can be embedded in the YAML file for use with shot-scraper multi.

JavaScript and CSS selectors

The default behaviour for shot-scraper is to take a full page screenshot, using a browser width of 1280px.

For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.

The --selector option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.

What if you want to modify the page in addition to selecting a specific area?

The --javascript option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.

The combination of these two options - also available as javascript: and selector: keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.

A complex example

To prove to myself that the tool works, I decided to try replicating this screenshot from my tutorial.

I made the original using CleanShot X, manually adding the two pink arrows:

This is pretty tricky!

It's not this whole page, just a subset of the page
The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot
There are two pink arrows superimposed on the image

I decided to do use just one arrow for the moment, which should hopefully result in a clearer image.

I started by creating my own pink arrow SVG using Figma:

I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.

With the JavaScript figured out, I pasted it into a YAML file called shot.yml:

- url: https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&type=prez
  javascript: |
    new Promise(resolve => {
      // Run in a promise so we can sleep 1s at the end
      function remove(el) { el.parentNode.removeChild(el);}
      // Remove header and footer
      remove(document.querySelector('header'));
      remove(document.querySelector('footer'));
      // Remove most of the children of .content
      Array.from(document.querySelectorAll('.content > *:not(.table-wrapper,.suggested-facets)')).map(remove)
      // Bit of breathing room for the screenshot
      document.body.style.marginTop = '10px';
      // Add a bit of padding to .content
      var content = document.querySelector('.content');
      content.style.width = '820px';
      content.style.padding = '10px';
      // Open the menu - it's an SVG so we need to use dispatchEvent here
      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));
      // Remove all but table header and first 11 rows
      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);
      // Add a pink SVG arrow
      let div = document.createElement('div');
      div.innerHTML = `<svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg">
        <g filter="url(#a)">
          <path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/>
        </g>
        <defs>
          <filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB">
              <feFlood flood-opacity="0" result="BackgroundImageFix"/>
              <feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/>
              <feOffset dy="4"/>
              <feGaussianBlur stdDeviation="2"/>
              <feComposite in2="hardAlpha" operator="out"/>
              <feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/>
              <feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/>
              <feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/>
          </filter>
        </defs>
      </svg>`;
      let svg = div.firstChild;
      content.appendChild(svg);
      content.style.position = 'relative';
      svg.style.position = 'absolute';
      // Give the menu time to finish fading in
      setTimeout(() => {
        // Position arrow pointing to the 'facet by this' menu item
        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();
        svg.style.left = (pos.left - pos.width) + 'px';
        svg.style.top = (pos.top - 20) + 'px';
        resolve();
      }, 1000);
    });
  output: annotated-screenshot.png
  selector: .content

And ran this command to generate the screenshot:

shot-scraper multi shot.yml

The generated annotated-screenshot.png image looks like this:

I'm pretty happy with this! I think it works very well as a proof of concept for the process.

How it works: Playwright

I built the first prototype of shot-scraper using Puppeteer, because I had used that before.

Then I noticed that the puppeteer-cli package I was using hadn't had an update in two years, which reminded me to check out Playwright.

I've been looking for an excuse to learn Playwright for a while now, and this project turned out to be ideal.

Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.

Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.

The second prototype used the Playwright CLI utility instead, executed via npx:

subprocess.run(
    [
        "npx",
        "playwright",
        "screenshot",
        "--full-page",
        url,
        output,
    ],
    capture_output=True,
)

This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.

I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official Playwright for Python package.

pip install playwright

It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.

I was curious how they pulled this off, so I dug inside the playwright Python package in my site-packages folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.

Thanks to Playwright, the entire implementation of shot-scraper is currently just 181 lines of Python code - it's all glue code tying together a Click CLI interface with some code that calls Playwright to do the actual work.

I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my Datasette Desktop Electron application.

Hooking shot-scraper up to GitHub Actions

I built shot-scraper very much with GitHub Actions in mind.

My shot-scraper-demo repository is my first live demo of the tool.

Once a day, it runs this shots.yml file, generates two screenshots and commits them back to the repository.

One of them is the tutorial screenshot described above.

The other is a screenshot of the list of "recently spotted owls" from this page on owlsnearme.com. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.

I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, like this one (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).

Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!

What's next?

I had ambitious plans to add utilities to the tool that would help with annotations, such as adding pink arrows and drawing circles around different elements on the page.

I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.

So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.

I'm also very interested to see what kinds of things other people use this for.

Tags: documentation, projects, scraping, github-actions, git-scraping, puppeteer, playwright, shot-scraper

Help scraping: track changes to CLI tools by recording their --help using Git

2022-02-02T23:46:35+00:00

I've been experimenting with a new variant of Git scraping this week which I'm calling Help scraping. The key idea is to track changes made to CLI tools over time by recording the output of their --help commands in a Git repository.

My new help-scraper GitHub repository is my first implementation of this pattern.

It uses this GitHub Actions workflow to record the --help output for the Amazon Web Services aws CLI tool, and also for the flyctl tool maintained by the Fly.io hosting platform.

The workflow runs once a day. It loops through every available AWS command (using this script) and records the output of that command's CLI help option to a .txt file in the repository - then commits the result at the end.

The result is a version history of changes made to those help files. It's essentially a much more detailed version of a changelog - capturing all sorts of details that might not be reflected in the official release notes for the tool.

Here's an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.

Here are the official release notes - 12 bullet points, spanning 12 different AWS services.

My help scraper caught the details of the release in this commit - 89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what's changed in a whole lot more detail.

The AWS CLI tool is enormous. Running find aws -name '*.txt' | wc -l in that repository counts help pages for 11,401 individual commands - or 11,390 if you checkout the previous version, showing that there were 11 commands added just in this morning's new release.

There are plenty of other ways of tracking changes made to AWS. I've previously kept an eye on the botocore GitHub history, which exposes changes to the underlying JSON - and there are projects like awschanges.info which try to turn those sources of data into something more readable.

But I think there's something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes with the detail I like from them!

I implemented this for flyctl first, because I wanted to see what changes were being made that might impact my datasette-publish-fly plugin which shells out to that tool. Then I realized it could be applied to AWS as well.

Help scraping my own projects

I got the initial idea for this technique from a change I made to my Datasette and sqlite-utils projects a few weeks ago.

Both tools offer CLI commands with --help output - but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.

So, I added documentation pages that list the output of --help for each of the CLI commands, generated using the Cog file generation tool:

sqlite-utils CLI reference (39 commands!)
datasette CLI reference

Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the --help output - here's that history for sqlite-utils.

It was a short jump from that to the idea of combining it with Git scraping to generate history for other tools.

Bonus trick: GraphQL schema scraping

I've started making selective use of the Fly.io GraphQL API as part of my plugin for publishing Datasette instances to that platform.

Their GraphQL API is openly available, but it's not extensively documented - presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: Using the undocumented Fly GraphQL API.

This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?

It turns out I can! There's an NPM package called get-graphql-schema which can extract the GraphQL schema from any GraphQL server and write it out to disk:

npx get-graphql-schema https://api.fly.io/graphql > /tmp/fly.graphql

I've added that to my help-scraper repository too - so now I have a commit history of changes of changes they are making there too. Here's an example from this morning.

Other weeknotes

I've decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I'm trying to make at least one commit every day that takes me closer to that milestone.

This week I did a bunch of work adding a Link: https://...; rel="alternate"; type="application/datasette+json" HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.

(I had originally planned to also support Accept: application/json request headers for this, but I've been put off that idea by the discovery that Cloudflare deliberately ignores the Vary: Accept header.)

Unrelated to Datasette: I also started a new Twitter thread, gathering behind the scenes material from the movie the Mitchells vs the Machines. There's been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season - and I've been enjoying trying to tie it all together in a thread.

The last time I did this was for Into the Spider-Verse (from the same studio) and that thread ended up running for more than a year!

TIL this week

Tags: git, github, projects, scraping, graphql, weeknotes, github-actions, git-scraping, fly

Weeknotes: Shaving some beautiful yaks

2021-12-01T03:43:18+00:00

I've been mostly shaving yaks this week - two in particular: the Datasette table refactor and the next release of git-history. I also built and released my first Web Component!

A Web Component for embedding Datasette tables

A longer term goal that I have for Datasette is to figure out a good way of using it to build dashboards, tying together summaries and visualizations of the latest data from a bunch of different sources.

I'm excited about the potential of Web Components to help solve this problem.

My datasette-notebook project is a very early experiment in this direction: it's a Datasette notebook that provides a Markdown wiki (persisted to SQLite) to which I plan to add the ability to embed tables and visualizations in wiki pages - forming a hybrid of a wiki, dashboarding system and Notion/Airtable-style database.

It does almost none of those things right now, which is why I've not really talked about it here.

Web Components offer a standards-based mechanism for creating custom HTML tags. Imagine being able to embed a Datasette table on a page by adding the following to your HTML:

<datasette-table
    url="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants.json"
></datasette-table>

That's exactly what datasette-table lets you do! Here's a demo of it in action.

This is version 0.1.0 - it works, but I've not even started to flesh it out.

I did learn a bunch of things building it though: it's my first Web Component, my first time using Lit, my first time using Vite and the first JavaScript library I've ever packaged and published to npm.

Here's a detailed TIL on Publishing a Web Component to npm encapsulating everything I've learned from this project so far.

This is also my first piece of yak shaving this week: I built this partly to make progress on datasette-notebook, but also because my big Datasette refactor involves finalizing the design of the JSON API for version 1.0. I realized that I don't actually have a project that makes full use of that API, which has been hindering my attempts to redesign it. Having one or more Web Components that consume the API will be a fantastic way for me to eat my own dog food.

Link: rel="alternate" for Datasette tables

Here's an interesting problem that came up while I was working on the datasette-table component.

As designed right now, you need to figure out the JSON URL for a table and pass that to the component.

This is usually a case of adding .json to the path, while preserving any query string parameters - but there's a nasty edge-case: if your SQLite table itself ends with the string .json (which could happen! Especially since Datasette promises to work with any existing SQLite database) the URL becomes this instead:

/mydb/table.json?_format=json

Telling users of my component that they need to first construct the JSON URL for their page isn't the best experience: I'd much rather let people paste in the URL to the HTML version and derive the JSON from that.

This is made more complex by the fact that, thanks to --cors, the Web Component can be embedded on any page. And for datasette-notebook I'd like to provide a feature where any URLs to Datasette instances - no matter where they are hosted - are turned into embedded tables automatically.

To do this, I need an efficient way to tell that an arbitrary URL corresponds to a Datasette table.

My latest idea here is to use a combination of HTTP HEAD requests and a Link: rel="alternate" header - something like this:

~ % curl -I 'https://latest.datasette.io/fixtures/compound_three_primary_keys'
HTTP/1.1 200 OK
date: Sat, 27 Nov 2021 20:09:36 GMT
server: uvicorn
Link: https://latest.datasette.io/fixtures/compound_three_primary_keys.json; rel="alternate"; type="application/datasette+json"

This would allow a (hopefully fast) fetch() call from JavaScript to confirm that a URL is a Datasette table, and get back the JSON that should be fetched by the component in order to render it on the page.

I have a prototype of this in Datasette issue #1533. I think it's a promising approach!

It's also now part of the ever-growing table refactor. Adding custom headers to page responses is currently far harder than it should be.

sqlite-utils STRICT tables

SQLite 3.37.0 came out at the weekend with a long-awaited feature: STRICT tables, which enforce column types such that you get an error if you try to insert a string into an integer column.

(This has been a long-standing complaint about SQLite by people who love strong typing, and D. Richard Hipp finally shipped the change for them with some salty release notes saying it's "for developers who prefer that kind of thing.")

I started researching how to add support for this to my sqlite-utils Python library. You can follow my thinking in sqlite-utils issue #344 - I'm planning to add a strict=True option to methods that create tables, but for the moment I've shipped new introspection properties for seeing if a table uses strict mode or not.

git-history update

My other big yak this week has been work on git-history. I'm determined to get it into a stable state such that I can write it up, produce a tutorial and maybe produce a video demonstration as well - but I keep on finding things I want to change about how it works.

The big challenge is how to most effectively represent the history of a bunch of different items over time in a relational database schema.

I started with a item table that presents just the most recent version of each item, and an item_version table with a row for every subsequent version.

That table got pretty big, with vast amounts of duplicated data in it.

So I've been working on an optimization where columns are only included in an item_version row if they have changed since the previous version.

The problem there is what to do about null - does null mean "this column didn't change" or does it mean "this column was set from some other value back to null"?

After a few different attempts I've decided to solve this with a many-to-many table, so for any row in the item_version table you can see which columns were explicitly changed by that version.

This is all working pretty nicely now, but still needs documentation, and tests, and then a solid write-up and tutorial and demos and a video... hopefully tomorrow!

One of my design decisions for this tool has been to use an underscore prefix for "reserved columns", such that non-reserved columns can be safely used by the arbitrary data that is being tracked by the tool.

Having columns with names like _id and _item has highlighted several bugs with Datasette's handling of these column names, since Datasette itself tries to use things like ?_search= for special query string parameters. I released Datasette 0.59.4 with some relevant fixes.

A beautiful yak

As a consumate yak shaver this beautiful yak that showed up on Reddit a few weeks ago has me absolutely delighted. I've not been able to determine the photography credit.

Releases this week

s3-credentials: 0.7 - (7 releases total) - 2021-11-30
A tool for creating credentials for accessing S3 buckets
datasette: 0.59.4 - (102 releases total) - 2021-11-30
An open source multi-tool for exploring and publishing data
datasette-table: 0.1.0 - 2021-11-28
A Web Component for embedding a Datasette table on a page

TIL this week

Tags: projects, npm, datasette, web-components, weeknotes, git-scraping

Weeknotes: Apache proxies in Docker containers, refactoring Datasette

2021-11-22T05:43:44+00:00

Updates to six major projects this week, plus finally some concrete progress towards Datasette 1.0.

Fixing Datasette's proxy bugs

Now that Datasette has had its fourth birthday I've decided to really push towards hitting the 1.0 milestone. The key property of that release will be a stable JSON API, stable plugin hooks and a stable, documented context for custom templates. There's quite a lot of mostly unexciting work needed to get there.

As I work through the issues in that milestone I'm encountering some that I filed more than two years ago!

Two of those made it into the Datasette 0.59.3 bug fix release earlier this week.

The majority of the work in that release though related to Datasette's base_url feature, designed to help people who run Datasette behind a proxy.

base_url lets you run Datasette like this:

datasette --setting base_url=/prefix/ fixtures.db

When you do this, Datasette will change its URLs to start with that prefix - so the hompage will live at /prefix/, the database index page at /prefix/fixtures/, tables at /prefix/fixtures/facetable etc.

The reason you would want this is if you are running a larger website, and you intend to proxy traffic to /prefix/ to a separate Datasette instance.

The Datasette documentation includes suggested nginx and Apache configurations for doing exactly that.

This feature has been a magnet for bugs over the years! People keep finding new parts of the Datasette interface that fail to link to the correct pages when run in this mode.

The principle cause of these bugs is that I don't use Datasette in this way myself, so I wasn't testing it nearly as thoroughly as it needed.

So the first step in finally solving these issues once and for all was to get my own instance of Datasette up and running behind an Apache proxy.

Since I like to deploy live demos to Cloud Run, I decided to try and run Apache and Datasette in the same container. This took a lot of figuring out. You can follow my progress on this in these two issue threads:

The short version: I got it working! My Docker implementation now lives in the demos/apache-proxy directory and the live demo itself is deployed to datasette-apache-proxy-demo.fly.dev/prefix/.

(I ended up deploying it to Fly after running into a bug when deployed to Cloud Run that I couldn't replicate on my own laptop.)

My final implementation uses a Debian base container with Supervisord to manage the two processes.

With a working live environment, I was finally able to track down the root cause of the bugs. My notes on #1519: base_url is omitted in JSON and CSV views document how I found and solved them, and updated the associated test to hopefully avoid them ever coming back in the future.

The big Datasette table refactor

The single most complicated part of the Datasette codebase is the code behind the table view - the page that lets you browse, facet, search, filter and paginate through the contents of a table (this page here).

It's got very thorough tests, but the actual implementation is mostly a 600 line class method.

It was already difficult to work with, but the changes I want to make for Datasette 1.0 have proven too much for it. I need to refactor.

Apart from making that view easier to change and maintain, a major goal I have is for it to support a much more flexible JSON syntax. I want the JSON version to default to just returning minimal information about the table, then allow ?_extra=x parameters to opt into additional information - like facets, suggested facets, full counts, SQL schema information and so on.

This means I want to break up that 600 line method into a bunch of separate methods, each of which can be opted-in-to by the calling code.

The HTML interface should then build on top of the JSON, requesting the extras that it knows it will need and passing the resulting data through to the template. This helps solve the challenge of having a stable template context that I can document in advance of Datasette 1.0

I've been putting this off for over a year now, because it's a lot of work. But no longer! This week I finally started to get stuck in.

I don't know if I'll stick with it, but my initial attempt at this is a little unconventional. Inspired by how pytest fixtures work I'm experimenting with a form of dependency injection, in a new (very alpha) library I've released called asyncinject.

The key idea behind asyncinject is to provide a way for class methods to indicate their dependencies as named parameters, in the same way as pytest fixtures do.

When you call a method, the code can spot which dependencies have not yet been resolved and execute them before executing the method.

Crucially, since they are all async def methods they can be executed in parallel. I'm cautiously excited about this - Datasette has a bunch of opportunities for parallel queries - fetching a single page of table rows, calculating a count(*) for the entire table, executing requested facets and calculating suggested facets are all queries that could potentially run in parallel rather than in serial.

What about the GIL, you might ask? Datasette's database queries are handled by the sqlite3 module, and that module releases the GIL once it gets into SQLite C code. So theoretically I should be able to use more than one core for this all.

The asyncinject README has more details, including code examples. This may turn out to be a terrible idea! But it's really fun to explore, and I'll be able to tell for sure if this is a useful, maintainable and performant approach once I have Datasette's table view running on top of it.

git-history and sqlite-utils

I made some big improvements to my git-history tool, which automates the process of turning a JSON (or other) file that has been version-tracked in a GitHub repository (see Git scraping) into a SQLite database that can be used to explore changes to it over time.

The biggest was a major change to the database schema. Previously, the tool used full Git SHA hashes as foreign keys in the largest table.

The problem here is that a SHA hash string is 40 characters long, and if they are being used as a foreign key that's a LOT of extra weight added to the largest table.

sqlite-utils has a table.lookup() method which is designed to make creating "lookup" tables - where a string is stored in a unique column but an integer ID can be used for things like foreign keys - as easy as possible.

That method was previously quite limited, but in sqlite-utils 3.18 and 3.19 - both released this week - I expanded it to cover the more advanced needs of my git-history tool.

The great thing about building stuff on top of your own libraries is that you can discover new features that you need along the way - and then ship them promptly without them blocking your progress!

Some other highlights

s3-credentials 0.6 adds a --dry-run option that you can use to show what the tool would do without making any actual changes to your AWS account. I found myself wanting this while continuing to work on the ability to specify a folder prefix within S3 that the bucket credentials should be limited to.
datasette-publish-vercel 0.12 applies some pull requests from Romain Clement that I had left unreviewed for far too long, and adds the ability to customize the vercel.json file used for the deployment - useful for things like setting up additional custom redirects.
datasette-graphql 2.0 updates that plugin to Graphene 3.0, a major update to that library. I had to break backwards compatiblity in very minor ways, hence the 2.0 version number.

csvs-to-sqlite 1.3 is the first relase of that tool in just over a year. William Rowell contributed a new feature that allows you to populate "fixed" database columns on your imported records, see PR #81 for details.

TIL this week

Releases this week

datasette-publish-vercel: 0.12 - (18 releases total) - 2021-11-22
Datasette plugin for publishing data using Vercel
git-history: 0.4 - (6 releases total) - 2021-11-21
Tools for analyzing Git history using SQLite
sqlite-utils: 3.19 - (90 releases total) - 2021-11-21
Python CLI utility and library for manipulating SQLite databases
datasette: 0.59.3 - (101 releases total) - 2021-11-20
An open source multi-tool for exploring and publishing data
datasette-redirect-to-https: 0.1 - 2021-11-20
Datasette plugin that redirects all non-https requests to https
s3-credentials: 0.6 - (6 releases total) - 2021-11-18
A tool for creating credentials for accessing S3 buckets
csvs-to-sqlite: 1.3 - (13 releases total) - 2021-11-18
Convert CSV files into a SQLite database
datasette-graphql: 2.0 - (32 releases total) - 2021-11-17
Datasette plugin providing an automatic GraphQL API for your SQLite databases
asyncinject: 0.2a0 - (2 releases total) - 2021-11-17
Run async workflows using pytest-fixtures-style dependency injection

Tags: apache, proxies, refactoring, supervisord, docker, datasette, weeknotes, git-scraping, sqlite-utils

Weeknotes: git-history, created for a Git scraping workshop

2021-11-15T04:10:50+00:00

My main project this week was a 90 minute workshop I delivered about Git scraping at Coda.Br 2021, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, git-history, plus smaller improvements to a range of other projects.

git-history

I still need to do a detailed write-up of this one (update: git-history: a tool for analyzing scraped data collected using Git and SQLite), but on Thursday I released a brand new tool called git-history, which I describe as "tools for analyzing Git history using SQLite".

This tool is the missing link in the Git scraping pattern I described here last October.

Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.

The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?

I've written one-off Python scripts for this a few times (here's my CDC vaccinations one, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.

The tool has a comprehensive README, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with Datasette:

git-convert file incidents.db incidents.json --id IncidentID

This assumes that incidents.json contains a JSON array of incidents (reported fires for example) and that each incident has a IncidentID identifier key. It will then loop through the Git history of that file right from the start, creating an item_versions table that tracks every change made to each of those items - using IncidentID to decide if a row represents a new incident or an update to a previous one.

I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the git-scraping GitHub topic (now at 202 repos and counting).

Workshop: Raspando dados com o GitHub Actions e analisando com Datasette

The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.

The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.

I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took last year.

Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.

The worst part of running a workshop is inevitably the part where you try and get everyone setup with a functional development environment on their own machines (see XKCD 1987). This time round I skipped that entirely by encouraging my students to use GitPod, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.

(It's similar to GitHub Codespaces, but Codespaces is not yet available to free customers outside of the beta.)

I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.

This worked so well. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.

The workshop exercises are available in this Google Doc, and I hope to extract some of them out into official tutorials for various tools later on.

Datasette 0.58.2

Yesterday was Datasette's fourth birthday - the four year anniversary of the initial release announcement! I celebrated by releasing a minor bug-fix, Datasette 0.58.2, the release notes for which are quoted below:

Column names with a leading underscore now work correctly when used as a facet. (#1506)
Applying ?_nocol= to a column no longer removes that column from the filtering interface. (#1503)
Official Datasette Docker container now uses Debian Bullseye as the base image. (#1497)

That first change was inspired by ongoing work on git-history, where I decided to use a _id underscoper prefix pattern for columns that were reserved for use by that tool in order to avoid clashing with column names in the provided source data.

sqlite-utils 3.18

Today I released sqlite-utils 3.18 - initially also to provide a feature I wanted for git-history (a way to populate additional columns when creating a row using table.lookup()) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.

s3-credentials 0.5

Earlier in the week I released version 0.5 of s3-credentials - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.

The biggest new feature is the ability to create temporary credentials, that expire after a given time limit.

This is achived using STS.assume_role(), where STS is Security Token Service. I've been wanting to learn this API for quite a while now.

Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.

I've learned a huge amount about AWS, IAM and S3 through developming this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!

Releases this week

sqlite-utils: 3.18 - (88 releases total) - 2021-11-15
Python CLI utility and library for manipulating SQLite databases
datasette: 0.59.2 - (100 releases total) - 2021-11-14
An open source multi-tool for exploring and publishing data
datasette-hello-world: 0.1.1 - (2 releases total) - 2021-11-14
The hello world of Datasette plugins
git-history: 0.3.1 - (5 releases total) - 2021-11-12
Tools for analyzing Git history using SQLite
s3-credentials: 0.5 - (5 releases total) - 2021-11-11
A tool for creating credentials for accessing S3 buckets

TIL this week

Tags: aws, projects, s3, talks, teaching, datasette, weeknotes, git-scraping, sqlite-utils, git-history, s3-credentials

Weeknotes: CDC vaccination history fixes, developing in GitHub Codespaces

2021-09-28T01:53:49+00:00

I spent the last week mostly surrounded by boxes: we're completing our move to the new place and life is mostly unpacking now. I did find some time to fix some issues with my CDC vaccination history Datasette instance though.

Fixing my CDC vaccination history site

I started tracking changes made to the CDC's COVID Data Tracker website back in Feburary. I created a git scraper repository for it as part of my five minute lightning talk on git scraping (notes and video) at this year's NICAR data journalism conference.

Since then it's been quietly ticking along, recording the latest data in a git repository that now has 335 commits.

In March I added a script to build the collected historic data into a SQLite database and publish it to Vercel using GitHub. That started breaking a few weeks ago, and it turnoud out that was because the database file had grown in size to the point where it was too large to deploy to Vercel (~100MB).

I got a bug report about this, so I took some time to move the deployment over to Google Cloud Run which doesn't have a documented size limit (though in my experience starts to creak once you go above about 2GB.)

I also started publishing the raw collected data directly as a CSV file, partly as an excuse to learn how to publish to Google Cloud Storage.

datasette-template-request

I released an extremely simple plugin this week called datasette-template-request - all it does is expose Datasette's request object in the context passed to custom templates, for people who want to update their custom page based on incoming request parameters.

More notable is how I built the plugin: this is the first plugin I've developed, tested and released entirely in my browser using the new GitHub Codespaces online development environment.

I created the new repo using my Datasette plugin template repository, opened it up in Codespaces, implemented the plugin and tests, tried it out using the port forwarding feature and then published it to PyPI using the publish.yml workflow.

Not having to even open a text editor on my laptop (let alone get a new Python development environment up and running) felt really good. I should turn this into a tutorial.

Releases this week

datasette-template-request: 0.1 - 2021-09-23
Expose the Datasette request object to custom templates
datasette-notebook: 0.1a1 - (2 releases total) - 2021-09-22
A markdown wiki and dashboarding system for Datasette
datasette-render-markdown: 2.0 - (8 releases total) - 2021-09-22
Datasette plugin for rendering Markdown
sqlite-utils: 3.17.1 - (87 releases total) - 2021-09-22
Python CLI utility and library for manipulating SQLite databases
twitter-to-sqlite: 0.22 - (28 releases total) - 2021-09-21
Save data from Twitter to a SQLite database

TIL this week

Tags: github, projects, weeknotes, covid19, git-scraping, github-codespaces

Flat Data

2021-05-19T01:05:54+00:00

Flat Data

New project from the GitHub OCTO (the Office of the CTO, love that backronym) somewhat inspired by my work on Git scraping: I’m really excited to see GitHub embracing git for CSV/JSON data in this way. Flat incorporates a reusable Action for scraping and storing data (using Deno), a VS Code extension for setting up those workflows and a very nicely designed Flat Viewer web app for browsing CSV and JSON data hosted on GitHub.

Tags: github, deno, git-scraping

Weeknotes: SpatiaLite 5, Datasette on Azure, more CDC vaccination history

2021-03-28T05:19:57+00:00

This week I got SpatiaLite 5 working in the Datasette Docker image, improved the CDC vaccination history git scraper, figured out Datasette on Azure and we closed on a new home!

SpatiaLite 5 for Datasette

SpatiaLite 5 came out earlier this year with a bunch of exciting improvements, most notably an implementation of KNN (K-nearest neighbours) - a way to efficiently answer the question "what are the 10 closest rows to this latitude/longitude point".

I love building X near me websites so I expect I'll be using this a lot in the future.

I spent a bunch of time this week figuring out how best to install it into a Docker container for use with Datasette. I finally cracked it in issue 1249 and the Dockerfile in the Datasette repository now builds with the SpatiaLite 5.0 extension, using a pattern I figured out for installing Debian unstable packages into a Debian stable base container.

When Datasette 0.56 is released the official Datasette Docker image will bundle SpatiaLite 5.0.

CDC vaccination history in Datasette

I'm tracking the CDC's per-state vaccination numbers in my cdc-vaccination-history repository, as described in my Git scraping lightning talk.

Scraping data into a git repository to track changes to it over time is easy. What's harder is extracting that data back out of the commit history in order to analyze and visualize it later.

To demonstrate how this can work I added a build_database.py script to that repository which iterates through the git history and uses it to build a SQLite database containing daily state reports. I also added steps to the GitHub Actions workflow to publish that SQLite database using Datasette and Vercel.

I installed the datasette-vega visualization plugin there too. Here's a chart showing the number of doses administered over time in California.

This morning I started capturing the CDC's per-county data too, but I've not yet written code to load that into Datasette. [UPDATE: that table is now available: cdc/daily_reports_counties]

Datasette on Azure

I'm keen to make Datasette easy to deploy in as many places as possible. I already have mechanisms for publishing to Heroku, Cloud Run, Vercel and Fly.io - today I worked out the recipe needed for Azure Functions.

I haven't bundled it into a datasette-publish-azure plugin yet but that's the next step. In the meantime the azure-functions-datasette repo has a working example with instructions on how to deploy it.

Thanks go to Anthony Shaw for building out the ASGI wrapper needed to run ASGI applications like Datasette on Azure Functions.

iam-to-sqlite

I spend way too much time whinging about IAM on Twitter. I'm certain that properly learning IAM will unlock the entire world of AWS, but I have so far been unable to overcome my discomfort with it long enough to actually figure it out.

After yet another unproductive whinge this week I guilted myself into putting in some effort, and it's already started to pay off: I figured out how to dump out all existing IAM data (users, groups, roles and policies) as JSON using the aws iam get-account-authorization-details command, and got so excited about it that I built iam-to-sqlite as a wrapper around that command that writes the results into SQLite so I can browse them using Datasette!

I'm increasingly realizing that the key to me understanding how pretty much any service works is to pull their JSON into a SQLite database so I can explore it as relational tables.

A useful trick for writing weeknotes

When writing weeknotes like these, it's really useful to be able to see all of the commits from the past week across many different projects.

Today I realized you can use GitHub search for this. Run a search for author:simonw created:>2021-03-20 and filter to commits, ordered by "Recently committed".

Here's that search for me.

Django pull request accepted!

I had a pull request accepted to Django this week! It was a documentation fix for the RawSQL query expression - I found a pattern for using it as part of an .filter(id__in=RawSQL(...)) query that wasn't covered by the documentation.

And we found a new home

One other project this week: Natalie and I closed on a new home! We're moving to El Granada, a tiny town just north of Half Moon Bay, on the coast 40 minutes south of San Francisco. We'll be ten minutes from the ocean, with plenty of pinnipeds and pelicans. Exciting!

TIL this week

Releases this week

datasette-publish-vercel: 0.9.3 - (15 releases total) - 2021-03-26
Datasette plugin for publishing data using Vercel
sqlite-transform: 0.5 - (6 releases total) - 2021-03-24
Tool for running transformations on columns in a SQLite database
django-sql-dashboard: 0.5a0 - (12 releases total) - 2021-03-24
Django app for building dashboards using raw SQL queries
iam-to-sqlite: 0.1 - 2021-03-24
Load Amazon IAM data into a SQLite database
tableau-to-sqlite: 0.2.1 - (4 releases total) - 2021-03-22
Fetch data from Tableau into a SQLite database
c64: 0.1a0 - 2021-03-21
Experimental package of ASGI utilities extracted from Datasette

Tags: aws, azure, datasette, weeknotes, git-scraping

Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA

2021-03-07T07:29:00+00:00

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

NICAR 2021

NICAR stands for the National Institute for Computer Assisted Reporting - an acronym that reflects the age of the organization, which started teaching journalists data-driven reporting back in 1989, long before the term "data journalism" became commonplace.

This was my third NICAR and it's now firly established itself at the top of the list of my favourite conferences. Every year it attracts over 1,000 of the highest quality data nerds - from data journalism veterans who've been breaking stories for decades to journalists who are just getting started with data and want to start learning Python or polish up their skills with Excel.

I presented an hour long workshop on Datasette, which I'm planning to turn into the first official Datasette tutorial. I also got to pre-record a five minute lightning talk about Git scraping.

I published the video and notes for that yesterday. It really seemed to strike a nerve at the conference: I showed how you can set up a scheduled scraper using GitHub Actions with just a few lines of YAML configuration, and do so entirely through the GitHub web interface without even opening a text editor.

Pretty much every data journalist wants to run scrapers, and understands the friction involved in maintaining your own dedicated server and crontabs and storage and backups for running them. Being able to do this for free on GitHub's infrastructure drops that friction down to almost nothing.

The lightning talk lead to a last-minute GitHub Actions and Git scraping office hours session being added to the schedule, and I was delighted to have Ryan Murphy from the LA Times join that session to demonstrate the incredible things the LA Times have been doing with scrapers and GitHub Actions. You can see some of their scrapers in the datadesk/california-coronavirus-scrapers repo.

VaccinateCA

The race continues to build out a Django backend for the VaccinateCA project, to collect data on vaccine availability from people making calls on that organization's behalf.

The new backend is getting perilously close to launch. I'm leaning heavily on the Django admin for this, refreshing my knowledge of how to customize it with things like admin actions and custom filters.

It's been quite a while since I've done anything sophisticated with the Django admin and it has evolved a LOT. In the past I've advised people to drop the admin for custom view functions the moment they want to do anything out-of-the-ordinary - I don't think that advice holds any more. It's got really good over the years!

A very smart thing the team at VaccinateCA did a month ago is to start logging the full incoming POST bodies for every API request handled by their existing Netlify functions (which then write to Airtable).

This has given me an invaluable tool for testing out the new replacement API: I wrote a script which replays those API logs against my new implementation - allowing me to test that every one of several thousand previously recorded API requests will run without errors against my new code.

Since this is so valuable, I've written code that will log API requests to the new stack directly to the database. Normally I'd shy away from a database table for logging data like this, but the expected traffic is the low thousands of API requests a day - and a few thousand extra database rows per day is a tiny price to pay for having such a high level of visibility into how the API is being used.

(I'm also logging the API requests to PostgreSQL using Django's JSONField, which means I can analyze them in depth later on using PostgreSQL's JSON functionality!)

YouTube subtitles

I decided to add proper subtitles to my lightning talk video, and was delighted to learn that the YouTube subtitle editor pre-populates with an automatically generated transcript, which you can then edit in place to fix up spelling, grammar and remove the various "um" and "so" filler words.

This makes creating high quality captions extremely productive. I've also added them to the 17 minute Introduction to Datasette and sqlite-utils video that's embedded on the datasette.io homepage - editing the transcript for that only took about half an hour.

TIL this week

Tags: data-journalism, youtube, datasette, weeknotes, git-scraping, vaccinate-ca, nicar

Git scraping, the five minute lightning talk

2021-03-05T00:44:15+00:00

I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC's vaccination data using the GitHub web interface. Here's the video.

Notes from the talk

Here's the PG&E outage map that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.

I scraped that outage data into simonw/pge-outages - here's the commit history (over 40,000 commits now!)

The scraper code itself is here. I wrote about the project in detail in Tracking PG&E outages by scraping to a git repo - my database of outages database is at pge-outages.simonwillison.net and the animation I made of outages over time is attached to this tweet.

Here's a video animation of PG&E's outages from October 5th up until just a few minutes ago pic.twitter.com/50K3BrROZR
- Simon Willison (@simonw) October 28, 2019

The much simpler scraper for the www.fire.ca.gov/incidents website is at simonw/ca-fires-history.

In the video I used that as the template to create a new scraper for CDC vaccination data - their website is https://covid.cdc.gov/covid-data-tracker/#vaccinations and the API I found using the browser developer tools is https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data.

The new CDC scraper and the data it has scraped lives in simonw/cdc-vaccination-history.

You can find more examples of Git scraping in the git-scraping GitHub topic.

Tags: data-journalism, scraping, talks, github-actions, git-scraping, annotated-talks, nicar

Weeknotes: sqlite-utils 3.0 alpha, Git scraping in the zeitgeist

2020-11-07T02:17:55+00:00

Natalie and I decided to escape San Francisco for election week, and have been holed up in Fort Bragg on the Northern California coast. I've mostly been on vacation, but I did find time to make some significant changes to sqlite-utils. Plus notes on an exciting Git scraping project.

Better search in the sqlite-utils 3.0 alpha

I practice semantic versioning with sqlite-utils, which means it only gets a major version bump if I break backwards compatibility in some way.

My goal is to avoid breaking backwards compatibility as much as possible, and I was proud to have made it all the way to version 2.23 representing 23 new feature releases since the 2.0 release without breaking any documented features!

Sadly this run has come to an end: I realized that the table.search() method was poorly designed, and I also needed to grab back the -c command-line option (a shortcut for --csv output) to be used for another purpose.

The chances that either of these changes will break anyone are pretty small, but semantic versioning dictates a major version bump so here we are.

I shipped a 3.0 alpha today, which should hopefully become a stable release very shortly (milestone here).

The big new feature is sqlite-utils search - a command-line tool for executing searches against a full-text search enabled table:

$ sqlite-utils search 24ways-fts4.db articles maps -c title
[{"rowid": 163, "title": "Get To Grips with Slippy Maps", "rank": -10.028754920576421},
 {"rowid": 220, "title": "Finding Your Way with Static Maps", "rank": -9.952534352591737},
 {"rowid": 27, "title": "Putting Design on the Map", "rank": -5.667327088267961},
 {"rowid": 168, "title": "Unobtrusively Mapping Microformats with jQuery", "rank": -4.662224207228984},

Here's full documentation for the new command.

Notably, this command works against both FTS4 and FTS5 tables in SQLite - despite FTS4 not shipping with a built-in ranking function. I'm using my sqlite-fts4 package for this, which I described back in January 2019 in Exploring search relevance algorithms with SQLite.

Git scraping to predict the election

It's not quite over yet but the end is in sight, and one of the best tools to track the late arriving vote counts is this Election 2020 results site built by Alex Gaynor and a growing cohort of contributors.

The site is a beautiful example of Git scraping in action, and I'm thrilled that it links to my article in the README!

Take a look at the repo to see how it works. Short version: this GitHub Action workflow grabs the latest snapshot of this undocumented New York Times JSON API once every five minutes and commits it to the repository. It then runs this Python script which iterates through the Git history and generates an HTML summary showing the different batches of new votes that were reported and their impact on the overall race.

The resulting report is published to GitHub pages - resulting in a site that can handle a great deal of traffic and is updated entirely by code running in scheduled actions.

This is a perfect use-case for Git scraping: it takes a JSON endpoint that represents the current state of the world and turns it into a sequence of historic snapshots, then uses those snapshots to build a unique and useful new source of information to help people understand what's going on.

Releases this week

sqlite-utils 3.0a0 - 2020-11-07
sqlite-fts4 1.0.1 - 2020-11-06
sqlite-fts4 1.0 - 2020-11-06
csvs-to-sqlite 1.2 - 2020-11-03
datasette 0.51.1 - 2020-11-01

Tags: alex-gaynor, elections, weeknotes, git-scraping, sqlite-utils

nyt-2020-election-scraper

2020-11-06T14:24:36+00:00

nyt-2020-election-scraper

Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot if the world right now” data source and turn it into a git revision history over time.

Tags: alex-gaynor, data-journalism, elections, git, new-york-times, git-scraping

Datasette Weekly: Datasette 0.50, git scraping, extracting columns

2020-10-10T21:00:30+00:00

Datasette Weekly: Datasette 0.50, git scraping, extracting columns

The first edition of the new Datasette Weekly newsletter—covering Datasette 0.50, Git scraping, extracting columns with sqlite-utils and featuring datasette-graphql as the first “plugin of the week”

Via @simonw

Tags: email, projects, sqlite, graphql, datasette, git-scraping, sqlite-utils

Git scraping: track changes over time by scraping to a Git repository

2020-10-09T18:27:23+00:00

Git scraping is the name I've given a scraping technique that I've been experimenting with for a few years now. It's really effective, and more people should use it.

Update 5th March 2021: I presented a version of this post as a five minute lightning talk at NICAR 2021, which includes a live coding demo of building a new git scraper.

Update 5th January 2022: I released a tool called git-history that helps analyze data that has been collected using this technique.

The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data - The @nyt_diff Twitter account tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication's editorial process.

We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.

Here's a recent example. Fires continue to rage in California, and the CAL FIRE website offers an incident map showing the latest fire activity around the state.

Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first reveals this endpoint:

https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents

That's a 241KB JSON endpoints with full details of the various fires around the state.

So... I started running a git scraper against it. My scraper lives in the simonw/ca-fires-history repository on GitHub.

Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using jq and commits it back to the repo if it has changed.

This means I now have a commit log of changes to that information about fires in California. Here's an example commit showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.

The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It's in a file called .github/workflows/scrape.yml which looks like this:

name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron:  '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

That's not a lot of code!

It runs on a schedule at 6, 26 and 46 minutes past the hour - I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

The scraper itself works by fetching the JSON using curl, piping it through jq . to pretty-print it and saving the result to incidents.json.

The "commit and push if it changed" block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in this TIL a few months ago.

I have a whole bunch of repositories running git scrapers now. I've been labeling them with the git-scraping topic so they show up in one place on GitHub (other people have started using that topic as well).

I've written about some of these in the past:

Scraping hurricane Irma back in September 2017 is when I first came up with the idea to use a Git repository in this way.
Changelogs to help understand the fires in the North Bay from October 2017 describes an early attempt at scraping fire-related information.
Generating a commit log for San Francisco’s official list of trees remains my favourite application of this technique. The City of San Francisco maintains a frequently updated CSV file of 190,000 trees in the city, and I have a commit log of changes to it stretching back over more than a year. This example uses my csv-diff utility to generate human-readable commit messages.
Tracking PG&E outages by scraping to a git repo documents my attempts to track the impact of PG&E's outages last year by scraping their outage map. I used the GitPython library to turn the values recorded in the commit history into a database that let me run visualizations of changes over time.
Tracking FARA by deploying a data API using GitHub Actions and Cloud Run shows how I track new registrations for the US Foreign Agents Registration Act (FARA) in a repository and deploy the latest version of the data using Datasette.

I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It's an extremely effective way of turning all sorts of interesting data sources into a changelog over time.

Comment thread on this post over on Hacker News.

Tags: git, github, projects, scraping, github-actions, git-scraping

Weeknotes: datasette-auth-passwords, a Datasette logo and a whole lot more

2020-07-17T03:41:13+00:00

All sorts of project updates this week.

datasette-auth-passwords

Datasette 0.44 added authentication support as a core concept, but left the actual implementation details up to the plugins.

I released datasette-auth-passwords on Monday. It's an implementation of the most obvious form of authentication (as opposed to GitHub SSO or bearer tokens or existing domain cookies): usernames and passwords, typed into a form.

Implementing passwords responsibly is actually pretty tricky, due to the need to effectively hash them. After some research I ended up mostly copying how Django does it (never a bad approach): I'm using 260,000 salted pbkdf2_hmac iterations, taking advantage of the Python standard library. I wrote this up in a TIL.

The plugin currently only supports hard-coded password hashes that are fed to Datasette via an environment variable - enough to set up a password-protected Datasette instance with a couple of users, but not really good for anything more complex than that. I have an open issue for implementing database-backed password accounts, although again the big challenge is figuring out how to responsible store those password hashes.

I've set up a live demo of the password plugin at datasette-auth-passwords-demo.datasette.io - you can sign into it to reveal a private database that's only available to authenticated users.

Datasette website and logo

I'm finally making good progress on a website for Datasette. As part of that I've been learning to use Figma, which I used to create a Datasette logo.

Figma is really neat: it's an entirely web-based vector image editor, aimed at supporting the kind of design work that goes into websites and apps. It has full collaborative editing for teams but it's free for single users. Most importantly it has extremely competent SVG exports.

I've added the logo to the latest version of the Datasette docs, and I have an open pull request to sphinx_rtd_theme to add support for setting a custom link target on the logo so I can link back to the rest of the official site, when it goes live.

TIL search snippet highlighting

My TIL site has a search engine, but it didn't do snippet highlighting. I reused the pattern I described in Fast Autocomplete Search for Your Website - implemented server-side rather than client-side this time - to add that functionality. The implementation is here - here's a demo of it in action.

SRCCON schedule

I'm attending (virtually) the SRCCON 2020 journalism conference this week, and Datasette is part of the Projects, Products, & Research track.

As a demo, I set up a Datasette powered copy of the conference schedule at srccon-2020.datasette.io - it's running the datasette-ics plugin which means it can provide a URL that can be subscribed to in Google or Apple Calendar.

The site runs out of the simonw/srccon-2020-datasette repository, which uses a GitHub Action to download the schedule JSON, modify it a little (mainly to turn the start and end dates into ISO datestamps), save it to a SQLite database with sqlite-utils and publish it to Vercel.

Covid 19 population data

My Covid-19 tracker publishes updated numbers of cases and deaths from the New York Times, the LA Times and Johns Hopkins university on an hourly basis.

One thing that was missing was county population data. US counties are identified in the data by their FIPS codes, which offers a mechanism for joining against population estimates pulled from the US Census.

Thanks to Aaron King I've now incorporated that data into the site, as a new us_census_county_populations_2019 table.

I used that data to define a SQL view - latest_ny_times_counties_with_populations - which shows the latest New York Times county data with new derived cases_per_million and deaths_per_million columns.

Tweaks to this blog

For many years this blog's main content has sat on the left of the page - which looks increasingly strange as screens get wider and wider. As of this commit the main layout is centered, which I think looks much nicer.

I also ran a data migration to fix some old internal links.

Miscellaneous

I gave a (virtual) talk at Django London on Monday about Datasette. I've taken to sharing a Google Doc for this kind of talk, which I prepare before the talk with notes and then update afterwards to reflect additional material from the Q&A. Here's the document from Monday's talk.

San Francisco Public Works maintain a page of tree removal notifications showing trees that are scheduled for removal. I like those trees. They don't provide an archive of notifications from that page, so I've set up a git scraping GitHub repository that scrapes the page daily and maintains a history of its contents in the commit log.

I updated datasette-publish-fly for compatibility with Datasette 0.44 and Python 3.6.

I made a few tweaks to my GitHub profile README, which is now Apache 2 licensed so people know they can adapt it for their own purposes.

I released github-to-sqlite 2.3 with a new option for fetching information for just specific repositories.

The Develomentor podcast published an interview with me about my career, and how it's been mostly defined by side-projects.

TIL this week

Tags: design, passwords, projects, datasette, weeknotes, covid19, git-scraping

Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite

2020-04-29T19:41:11+00:00

I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.

Archiving coronavirus.data.gov.uk

The UK goverment have a new portal website sharing detailed Coronavirus data for regions around the country, at coronavirus.data.gov.uk.

As with everything else built in 2020, it's a big single-page JavaScript app. Matthew Somerville investigated what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.

One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew's article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from https://c19downloads.azureedge.net/downloads/data/data_latest.json.

Any time I see a file like this my first questions are how often does it change - and what kind of changes are being made to it?

I've written about scraping to a git repository (see my new gitscraping tag) a bunch in the past:

Scraping hurricane Irma - September 2017
Changelogs to help understand the fires in the North Bay - October 2017
Generating a commit log for San Francisco’s official list of trees - March 2019
Tracking PG&E outages by scraping to a git repo - October 2019
Deploying a data API using GitHub Actions and Cloud Run - January 2020

Now that I've figured out a really clean way to Commit a file if it changed in a GitHub Action knocking out new versions of this pattern is really quick.

simonw/coronavirus-data-gov-archive is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a commit history of changes made to the underlying data.

The first implementation was extremely simple - here's the entire action:

name: Fetch latest data

on:
push:
repository_dispatch:
schedule:
    - cron:  '25 * * * *'

jobs:
scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
    uses: actions/checkout@v2
    - name: Fetch latest data
    run: |-
        curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . > data_latest.json
        curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . > utlas.geojson
        curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . > countries.geojson
        curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . > regions.geojson
    - name: Commit and push if it changed
    run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

It uses a combination of curl and jq (both available in the default worker environment) to pull down the data and pretty-print it (better for readable diffs), then commits the result.

Matthew Somerville pointed out that inefficient polling sets a bad precedent. Here I'm hitting azureedge.net, the Azure CDN, so that didn't particularly worry me - but since I want this pattern to be used widely it's good to provide a best-practice example.

Figuring out the best way to make conditional get requests in a GitHub Action lead me down something of a rabbit hole. I wanted to use curl's new ETag support but I ran into a curl bug, so I ended up rolling a simple Python CLI tool called conditional-get to solve my problem. In the time it took me to release that tool (just a few hours) a new curl release came out with a fix for that bug!

Here's the workflow using my conditional-get tool. See the issue thread for all of the other potential solutions, including a really neat Action shell-script solution by Alf Eaton.

To my absolute delight, the project has already been forked once by Daniel Langer to capture Canadian Covid-19 cases!

New Datasette features

I pushed two new features to Datasette master, ready for release in 0.41.

Configuration directory mode

This is an idea I had while building datasette-publish-now. Datasette instances can be run with custom metadata, custom plugins and custom templates. I'm increasingly finding myself working on projects that run using something like this:

$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json
    --template-dir=templates \
    --plugins-dir=plugins

Directory configuration mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead by handled by creating the following layout:

my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadatata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py

Then run Datasette directly targetting that directory:

$ datasette my-project/

See issue #731 for more details. Directory configuration mode is documented here.

Define custom pages using templates/pages

In niche-museums.com, powered by Datasette I described how I built the www.niche-museums.com website as a heavily customized Datasette instance.

That site has /about and /map pages which are served by custom templates - but I had to do some gnarly hacks with empty about.db and map.db files to get them to work.

Issue #648 introduces a new mechanism for creating this kind of page: create a templates/pages/map.html template file and custom 404 handling code will ensure that any hits to /map serve the rendered contents of that template.

This could work really well with the datasette-template-sql plugin, which allows templates to execute abritrary SQL queries (ala PHP or ColdFusion).

Here's the new documentation on custom pages, including details of how to use the new custom_status(), custom_header() and custom_redirect() template functions to go beyond just returning HTML.

photos-to-sqlite

My Dogsheep personal analytics project brings my tweets, GitHub activity, Swarm checkins and more together in one place. But the big missing feature is my photos.

As-of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new photos-to-sqlite tool. I can run the following SQL query and get back ten random photos!

select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' || 
    sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10

photos.simonwillison.net is running a modified version of my heic-to-jpeg image converting and resizing proxy, which I'll release at some point soon.

There's still plenty of work to do - I still need to import EXIF data (including locations) into SQLite, and I plan to use osxphotos to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.

TIL this week

Generated using this query.

Tags: git, http, matthew-somerville, photos, projects, datasette, weeknotes, covid19, git-scraping

Tracking FARA by deploying a data API using GitHub Actions and Cloud Run

2020-01-21T07:51:11+00:00

I'm using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.

FARA background

The Foreign Agents Registration Act (FARA) law that requires "certain agents of foreign principals who are engaged in political activities or other activities specified under the statute to make periodic public disclosure of their relationship with the foreign principal, as well as activities, receipts and disbursements in support of those activities".

The law was introduced in 1938 in response to the large number of German propaganda agents that were operating in the U.S. prior to the war.

Basically, if you are in the United States as a lobbyist for a foreign government you need to register under FARA. It was used in 23 criminal cases during World War II, but hasn't had much use since it was ammended in 1966. Although... if you consult the list of recent cases you'll see some very interesting recent activity involving Russia and Ukraine.

It's also for spies! Quoting the FARA FAQ:

Finally, 50 U.S.C. § 851, requires registration of persons who have knowledge of or have received instruction or assignment in espionage, counterespionage or sabotage service or tactics of a foreign country or political party.

I imagine most spies operate in violation of this particular law and don't take steps to register themselves.

It's all still pretty fascinating though, in part because it gets updated. A lot. Almost every business day in fact.

Tracking FARA history

I know this because seven months ago I set up a scraper for it. Every twelve hours I have code which downloads the four bulk CSVs published by the Justice department and saves them to a git repository. It's the same trick I've been using to track San Francisco's database of trees and PG&E's outage map.

I've been running the scraper using Circle CI, but this weekend I decided to switch it over to GitHub Actions to get a better idea for how they work.

Deploying it as an API

I also wanted to upgrade my script to also deploy a fresh Datasette instance of the data using Google Cloud Run. I wrote a script to do this on a manual basis last year, but I never combined it with the daily scraper. Combining the two means I can offer a Datasette-powered API directly against the latest data.

https://fara.datasettes.com is that API - it now updates twice a day, assuming there are some changes to the underlying data.

Putting it all together

The final GitHub action workflow can be seen here. I'm going to present an annotated version here.

on:
  repository_dispatch:
  schedule:
    - cron:  '0 0,12 * * *'

This sets when the workflow should be triggered. I'm running it twice a day - at midnight and noon UTC (the 0,12 cron syntax).

The repository_dispatch key means I can also trigger it manually by running the following curl command - useful for testing:

curl -XPOST https://api.github.com/repos/simonw/fara-history/dispatches \
    -H 'Authorization: token MY_PERSONAL_TOKEN_HERE' \
    -d '{"event_type": "trigger_action"}' \
    -H 'Accept: application/vnd.github.everest-preview+json'

Next comes the job itself, which I called scheduled and set to run on the latest Ubuntu:

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:

Next comes the steps. Each step is run in turn, in an isolated process (presumably a container) but with access to the current working directory.

- uses: actions/checkout@v2
  name: Check out repo
- name: Set up Python
  uses: actions/setup-python@v1
  with:
    python-version: 3.8

The first two steps checkout the fara-history repository and install Python 3.8.

- uses: actions/cache@v1
  name: Configure pip caching
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

This step should set up a cache so that pip doesn't have to download fresh dependencies on every run. Unfortunately it doesn't seem to actually work - it only works for push and pull_request events, but my workflow is triggered by schedule and repository_dispatch. There's an open issue about this.

- name: Install Python dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt

This step installs my dependencies from requirements.txt.

- name: Fetch, update and commit FARA data
  run: . update_and_commit_all.sh
- name: Build fara.db database
  run: python build_database.py

Now we're getting to the fun stuff. My update_and_commit_all.sh script downloads the four zip files from the FARA.gov site, unzips them, sorts them, diffs them against the previously stored files and commits the new copy to GitHub if they have changed. See my explanation of csv-diff for more on this - though sadly only one of the files has a reliable row identifier so I can't generate great commit messages for most of them.

My build_database.py script uses sqlite-utils to convert the CSV files into a SQLite database.

Now that we've got a SQLite database, we can deploy it to Google Cloud Run using Datasette.

But should we run a deploy at all? If the database hasn't changed, there's no point in deploying it. How can we tell if the database file has changed from the last one that was published?

Datasette has a mechanism for deriving a content hash of a database, part of a performance optimization which is no longer turned on by default and may be removed in the future.

You can generate JSON that includes hash using the datasette inspect command. The jq tool can then be used to extract out just the hash:

$ datasette inspect fara.db | jq '.fara.hash' -r
fbc9cbaca6de1e232fc14494faa06cc8d4cb9f379d0d568e4711e9a218800906

The -r option to jq causes it to return just the raw string, without quote marks.

Datasette's /-/databases.json introspection URL reveals the hashes of the currently deployed database. Here's how to pull the currently deployed hash:

$ curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r
a6c0ab26589bde0d225c5a45044e0adbfa3840b95fbb263d01fd8fb0d2460ed5

If those two hashes differ then we should deploy the new database.

GitHub Actions have a slightly bizarre mechanism for defining "output variables" for steps, which can then be used to conditionally run further steps.

Here's the step that sets those variables, followed by the step that conditionally installs the Google Cloud CLI tools using their official action:

- name: Set variables to decide if we should deploy
  id: decide_variables
  run: |-
    echo "##[set-output name=latest;]$(datasette inspect fara.db | jq '.fara.hash' -r)"
    echo "##[set-output name=deployed;]$(curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r)"
- name: Set up Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
  with:
    version: '275.0.0'
    service_account_email: ${{ secrets.GCP_SA_EMAIL }}
    service_account_key: ${{ secrets.GCP_SA_KEY }}

Having installed the Google Cloud tools, I can deploy my database using Datasette:

- name: Deploy to Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  run: |-
    gcloud components install beta
    gcloud config set run/region us-central1
    gcloud config set project datasette-222320
    datasette publish cloudrun fara.db --service fara-history -m metadata.json

This was by far the hardest part to figure out.

First, I needed to create a Google Cloud service account with an accompanying service key.

I tried and failed to do this using the CLI, so I switched to their web console following these and then these instructions.

Having downloaded the key JSON file, I converted it to base64 and pasted it into a GitHub Actions secret (hidden away in the repository settings area) called GCP_SA_KEY.

cat ~/Downloads/datasette-222320-2ad02afe6d82.json \
    | base64 | pbcopy

The service account needed permissions in order to run a build through Cloud Build and then deploy the result through Cloud Run. I spent a bunch of time trying out different combinations and eventually gave up and gave the account "Editor" permissions across my entire project. This is bad. I am hoping someone can help me understand what the correct narrow set of permissions are, and how to apply them.

It also took me a while to figure out that I needed to run these three commands before I could deploy to my project. The first one installs the Cloud Run tools, the second set up some required configuration:

gcloud components install beta
gcloud config set run/region us-central1
gcloud config set project datasette-222320

But... having done all of the above, the following command run from an action successfully deploys the site!

datasette publish cloudrun fara.db \
    --service fara-history -m metadata.json

DNS

Google Cloud Run deployments come with extremely ugly default URLs. For this project, that URL is https://fara-history-j7hipcg4aq-uc.a.run.app/.

I wanted something nicer. I own datasettes.com and manage the DNS via Cloudflare, which means I can point subdomains at Cloud Run instances.

This is a two-step process

I set fara.datasettes.com as a DNS-only (no proxying) CNAME for ghs.googlehosted.com.
In the Google Cloud Console I used Cloud Run -> Manage Custom Domains (a button in the header) -> Add Mapping to specify that fara.datasettes.com should map to my fara-history service (the --service argument from datasette publish earlier).

I had previously verified my domain ownership - I forget quite how I did it. Domains purchased through Google Domains get to skip this step.

Next steps

This was a lot of fiddling around. I'm hoping that by writing this up in detail I'll be able to get this working much faster next time.

I think this model - GitHub Actions that pull data, build a database and deploy to Cloud Run using datasette publish - is incredibly promising. The end result should be an API that costs cents-to-dollars a month to operate thanks to Cloud Run's scale-to-zero architecture. And hopefully by publishing this all on GitHub it will be as easy as possible for other people to duplicate it for their own projects.

Tags: continuous-deployment, continuous-integration, data-journalism, github, projects, datasette, cloudrun, github-actions, git-scraping

Tracking PG&E outages by scraping to a git repo

2019-10-10T23:32:14+00:00

PG&E have cut off power to several million people in northern California, supposedly as a precaution against wildfires.

As it happens, I've been scraping and recording PG&E's outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!

The original data lives in a GitHub repo (more importantly in the commit history of that repo).

Reading JSON in a Git repo isn't particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with Datasette.

The result is https://pge-outages.simonwillison.net/ (no longer available)

Update from 27th October 2019: I also used the data to create this animation (first shared on Twitter):

Your browser does not support the video tag.

The data model: outages and snapshots

The three key tables to understand are outages, snapshots and outage_snapshots.

PG&E assign an outage ID to every outage - where an outage is usually something that affects a few dozen customers. I store these in the outages table.

Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the snapshots table.

The most interesting table is outage_snapshots. Every time I see an outage in the JSON feed, I record a new copy of its data as an outage_snapshot row. This allows me to reconstruct the full history of any outage, in 10 minute increments.

Here are all of the outages that were represented in snapshot 1269 - captured at 4:10pm Pacific Time today.

I can run select sum(estCustAffected) from outage_snapshots where snapshot = 1269 (try it here) to count up the total PG&E estimate of the number of affected customers - it's 545,706!

I've installed datasette-vega which means I can render graphs. Here's my first attempt at a graph showing the number of estimated customers affected over time.

(I don't know why there's a dip towards the end of the graph).

I also defined a SQL view which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&E website hasn't gone down) and renders them using datasette-cluster-map.

Things to be aware of

There are a huge amount of unanswered questions about this data. I've just been looking at PG&E's JSON and making guesses about what things like estCustAffected means. Without official documentation we can only guess as to how accurate this data is, or how it should be interpreted.

Some things to question:

What's the quality of this data? Does it reflect accurately on what's actually going on out there?
What's the exact meaning of the different columns - estCustAffected, currentEtor, autoEtor, hazardFlag etc?
Various columns (lastUpdateTime, currentEtor, autoEtor) appear to be integer unix timestamps. What timezone were they recorded in? Do they include DST etc?

How it works

I originally wrote the scraper back in October 2017 during the North Bay fires, and moved it to run on Circle CI based on my work building a commit history of San Francisco's trees.

It's pretty simple: every 10 minutes a Circle CI job runs which scrapes the JSON feed that powers the PG&E website's outage map.

The JSON is then committed to my pge-outages GitHub repository, over-writing the existing pge-outages.json file. There's some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.

Building the Datasette

The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.

After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.

The code that generates the database can be found in build_database.py. I used GitPython to read data from the git repository and my sqlite-utils library to create and update the database.

Deployment

Since this is a large database that changes every ten minutes, I couldn't use the usual datasette publish trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.

Instead, I'm running it on a VPS instance. I ended up trying out Digital Ocean for this, after an enjoyable Twitter conversation about good options for stateful (as opposed to stateless) hosting.

Next steps

I'm putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology - or take what I've done and spin up much more interesting forks of it.

If you build something interesting with this please let me know, via email (swillison is my Gmail) or on Twitter.

Tags: data-journalism, projects, scraping, sqlite, datasette, git-scraping, digitalocean, sqlite-utils

Generating a commit log for San Francisco's official list of trees

2019-03-13T14:49:48+00:00

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

I’ve been using that file for Datasette demos for a while now, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: Street_Tree_List.csv is a living document.

Back in September 2017 I built a scraping project relating to hurricane Irma. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.

I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.

sf-tree-history

Here’s the result: sf-tree-history, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of Street_Tree_List.csv, a README, and a Circle CI configuration that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.

The most interesting part of the repo is the commit history itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:

7ab432cdcb8d7914cfea4a5b59803f38cade532b from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.
d6b258959af9546909b2eee836f0156ed88cd45d from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.
ca66d9a5fdd632549301d249c487004a5b68abf2 lists 2151 rows changed, 1280 rows added! I found an old copy of Street_Tree_List.csv on my laptop from April 2018, so for fun I loaded it into the repository and used git commit amend to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view sadly truncates it.

csv-diff

One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.

The result is csv-diff, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.

Using it looks like this:

$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {"id": "3", "name": "Bailey", "age": "1"}

1 row removed

  {"id": "2", "name": "Pancakes", "age": "2"}

1 row changed

  Row 1
    age: "4" => "5"

The csv-diff README has further details on the tool.

Circle CI

My favourite thing about the sf-tree-history project is that it costs me nothing to run - either in hosting costs or (hopefully) in terms of ongoing maintenance.

The git repository is hosted for free on GitHub. Because it’s a public project, Circle CI will run tasks against it for free.

My .circleci/config.yml does the rest. It uses Circle’s cron syntax to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):

cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv "https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD"
git add Street_Tree_List.csv
git config --global user.email "treebot@example.com"
git config --global user.name "Treebot"
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID > message.txt
git commit -F message.txt && \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true

This script does all of the work.

First it backs up the existing Street_Tree_list.csv as Street_Tree_List-old.csv, in order to be able to run a comparison later.
It downloads the latest copy of Street_Tree_List.csv from the San Francisco data portal
It adds the file to the git index and sets itself an identity for use in the commit
It installs my csv-diff utility from PyPI
It uses csv-diff to create a diff of the two files, and writes that diff to a new file called message.txt
Finally, it attempts to create a new commit using message.txt as the commit message, then pushes the result to GitHub

The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the run block return a non-0 exit code. git commit returns a non-0 exit code if you attempt to run it but none of the files have changed.

git commit ... && git push ... || true ensures that if git commit succeeds the git push command will be run, BUT if it fails the || true will still return a 0 exit code for the overall line - so Circle CI will not mark the build as failed.

There’s one last trick here: I’m using git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are the recommended way to configure secrets such that they cannot be viewed by anyone browsing your Circle CI builds. I created a personal GitHub auth token for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.

Next steps

I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my csv-diff utility) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!

Tags: csv, data-journalism, git, projects, san-francisco, git-scraping

Changelogs to help understand the fires in the North Bay

2017-10-10T06:48:07+00:00

The situation in the counties north of San Francisco is horrifying right now. I’ve repurposed some of the tools I built to for the Irma Response project last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.

I’m scraping a number of sources relevant to the crisis, and making the data available in a repository on GitHub. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself, it’s the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on this history of changes to srcity.org/610/Emergency-Information to see when they edited the information.

The sources I’m tracking right now are:

The Santa Rosa Fire Department’s Emergency Information page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. History of changes to that page.
PG&E power outages. This is probably the highest quality dataset with the neatest commit messages. The commit history of these shows exactly when new outages are reported and how many customers were affected.
Road Conditions in the County of Sonoma. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. History of changes.
California Highway Patrol Incidents, extracted from a KML feed on quickmap.dot.ca.gov. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. History of changes.

The code for the scrapers can be found in north_bay.py. Please leave comments, feedback or suggestions on other useful potential sources of data in this GitHub issue.

Tags: screenscraping, crisishacking, git-scraping

Scraping hurricane Irma

2017-09-10T06:21:17+00:00

The Irma Response project is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The Irma API is an attempt to gather key information in one place, verify it and publish it in a reuseable way. It currently powers the irmashelters.org website.

To aid this effort, I built a collection of screen scrapers that pull data from a number of different websites and APIs. That data is then stored in a Git repository, providing a clear history of changes made to the various sources that are being tracked.

Some of the scrapers also publish their findings to Slack in a format designed to make it obvious when key events happen, such as new shelters being added or removed from public listings.

Tracking changes over time

A key goal of this screen scraping mechanism is to allow changes to the underlying data sources to be tracked over time. This is achieved using git, via the GitHub API. Each scraper pulls down data from a source (an API or a website) and reformats that data into a sanitized JSON format. That JSON is then written to the git repository. If the data has changed since the last time the scraper ran, those changes will be captured by git and made available in the commit log.

Recent changes tracked by the scraper collection can be seen here: https://github.com/simonw/irma-scraped-data/commits/master

Generating useful commit messages

The most complex code for most of the scrapers isn’t in fetching the data: it’s in generating useful, human-readable commit messages that summarize the underlying change. For example, here is a commit message generated by the scraper that tracks the http://www.floridadisaster.org/shelters/summary.aspx page:

florida-shelters.json: 2 shelters added

Added shelter: Atwater Elementary School (Sarasota County)
Added shelter: DEBARY ELEMENTARY SCHOOL (Volusia County)
Change detected on http://www.floridadisaster.org/shelters/summary.aspx

The full commit also shows the changes to the underlying JSON, but the human-readable message provides enough information that people who are not JSON-literate programmers can still derive value from the commit.

Publishing to Slack

The Irma Response team use Slack to co-ordinate their efforts. You can join their Slack here: https://irma-response-slack.herokuapp.com/

Some of the scrapers publish detected changes in their data source to Slack, as links to the commits generated for each change. The human-readable message is posted directly to the channel.

The source code for all of the scrapers can be found at https://github.com/simonw/irma-scrapers

This Entry started out as README file.

Tags: screenscraping, crisishacking, git-scraping