Recent entries
Weeknotes: DjangoCon, SQLite in Django, datasette-gunicorn 18 hours ago
I spent most of this week at DjangoCon in San Diego—my first outside-of-the-Bay-Area conference since the before-times.
It was a most excellent event. I spent a lot of time in the corridor track—actually the sitting-outside-in-the-sunshine track, catching up with people I haven’t seen in several years.
I gave a talk titled "Massively increase your productivity on personal projects with comprehensive documentation and automated tests", with the alternative title "Coping strategies for the serial project hoarder". I’ll do a full write-up of this once the video is made available in a few weeks time, but in the meantime the talk materials can be found here:
- Supporting notes and links
- Slides on Speaker Deck
- Video for paying DjangoCon attendees (public video coming soon)
I also gave a lightning talk about AI and magic, which was effectively the five minute oral version of my recent blog post Is the AI spell-casting metaphor harmful or helpful?
Benchmarking SQLite in Django
I also hung around for the first day of the DjangoCon sprints.
For over a decade, the Django documentation has warned against using SQLite in production—recommending PostgreSQL or MySQL instead.
I asked Django Fellow Carlton Gibson what it would take to update that advice for 2022. He suggested that what we really needed was a solid idea of how well modern SQLite performs with Django, under a variety of different settings.
So I spent some time running benchmarks, using my new django_sqlite_benchmark repository.
You can follow the full details of my experiments in these issues:
- #2: Locust test to exercise /counter/xxx endpoint which runs benchmarks against a simple Django view that increments a counter stored in a SQLite table
- #3: Load test for larger writes runs a benchmark using a script that inserts larger JSON objects into a database table. I also tried this against PostgreSQL, getting very similar numbers to SQLite.
- #4: Benchmark endpoint that doesn’t interact with database benchmarks a simple “hello world” view that doesn’t use SQLite at all—as a baseline for comparison
I used Locust for all of these tests, and wrote up a TIL about using it as well.
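If you haven’t used Locust before, the scripts involved can be very short. Here’s a minimal sketch of what a Locust file for the counter benchmark could look like (the /counter/ endpoint name comes from issue #2 above; the rest is illustrative, not my exact script):

```python
# locustfile.py - minimal sketch of a Locust load test for the counter view.
# The /counter/<id> endpoint comes from issue #2; everything else is illustrative.
import random

from locust import HttpUser, task


class CounterUser(HttpUser):
    @task
    def increment_counter(self):
        # Hits the Django view that increments a counter stored in a SQLite table
        self.client.get(f"/counter/{random.randint(1, 100)}")
```

Run it with locust -f locustfile.py --host http://localhost:8000 and ramp up the number of simulated users from the Locust web UI.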
Here’s the TLDR version of the results: SQLite in its default “journal” mode starts returning “database locked” errors pretty quickly as the write load increases. But... if you switch to “wal” mode (here’s how) those errors straight up vanish!
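Switching an existing database to WAL mode is a one-off operation, because SQLite persists the journal mode in the database file itself. Here’s roughly what that looks like from Python (db.sqlite3 is Django’s default SQLite filename):

```python
# One-off script to switch a SQLite database file into WAL mode.
# The journal mode is stored in the database itself, so this only needs to run once.
import sqlite3

conn = sqlite3.connect("db.sqlite3")  # path to your Django SQLite database
conn.execute("PRAGMA journal_mode=WAL;")
print(conn.execute("PRAGMA journal_mode;").fetchone())  # should show ('wal',)
conn.close()
```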
I was expecting WAL mode to improve things, but I thought I’d still be able to hit errors even with it enabled. No—it turns out that, at least for the amount of traffic I could generate on my laptop, WAL mode proved easily capable of handling the load.
Even without WAL mode, bumping the SQLite “timeout” option up to 20s solved most of the errors.
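That timeout is set through the OPTIONS dictionary in the Django database settings, which gets passed through to sqlite3.connect(). A sketch of the relevant settings.py fragment:

```python
# settings.py fragment (sketch): raise the SQLite lock timeout from the default
# 5 seconds to 20 seconds. OPTIONS is passed through to sqlite3.connect().
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "db.sqlite3",
        "OPTIONS": {
            "timeout": 20,  # seconds to wait for a lock before raising an error
        },
    }
}
```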
Even more interestingly: I tried using Gunicorn (and Uvicorn) to run multiple Django workers at once. I was certain this would lead to problems, as SQLite isn’t designed to handle writes from multiple processes at once... or so I thought. It turned out SQLite’s use of file locking meant everything worked far better than I expected—and upping the number of worker processes from 1 to 4 resulted in approximately a 4x increase in throughput.
I shouldn’t be surprised by this, if only because every time I’ve tried to push SQLite in a new direction it’s impressed me with how much more capable it is than I expected.
But still, these results are very exciting. They still need more thorough testing and more eyes than just mine, but I think they indicate that SQLite should absolutely be considered a viable option for running Django in production in 2022.
datasette-gunicorn
Datasette has always run as a single process. It uses Uvicorn to serve requests, but it hard-codes Uvicorn to a single worker (here).
Based on my experiments with SQLite and Django—in particular how running multiple worker processes gave me an increase in how much traffic I could handle—I decided to try the same thing with Datasette itself.
Gunicorn remains one of the most well regarded options for deploying Python web applications. It acts as a process monitor, balancing requests between different workers and restarting anything that fails with an error.
I decided to experiment with this through the medium of a Datasette plugin. So I built datasette-gunicorn, a plugin that adds an extra command to Datasette that lets you start it like this:
datasette gunicorn my.db --workers 4
It takes most of the same arguments as Datasette’s regular datasette serve command, plus that new -w/--workers option for setting the number of workers.
Initial benchmarks were very positive: 21 requests a second with a single worker, increasing to 75 requests/second with four! Not bad for an initial experiment. I also tested it serving a static page through Datasette and got over 500 requests a second, at which point Locust warned that it would need to be run on a separate machine for a full load test.
In writing the plugin I had to figure out how to build a new command that mostly copied parameters from the existing datasette serve Click command—I wrote a TIL about how I ended up doing that.
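The core of the trick is that Click commands expose their options as a .params list, so a new command can borrow them from an existing one. Here’s a rough sketch of the idea (not the actual datasette-gunicorn code; the excluded option names are made up):

```python
# Sketch: a new Click command that reuses most of the options from "datasette serve".
# This is illustrative, not the actual datasette-gunicorn implementation.
import click
from datasette.cli import serve  # the existing "datasette serve" Click command


@click.command(name="gunicorn")
@click.option("-w", "--workers", type=int, default=1, help="Number of Gunicorn workers")
def gunicorn_command(workers, **kwargs):
    "Start Datasette using Gunicorn"
    ...


# Copy across the parameters from "datasette serve", skipping any that don't apply
# (the names excluded here are purely illustrative).
gunicorn_command.params.extend(
    param for param in serve.params if param.name not in {"reload", "uds"}
)
```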
shot-scraper 1.0
Also this week: I released shot-scraper 1.0.
Despite the exciting version number this actually only has two small new features. Here’s the full changelog:
- New shot-scraper html URL command (documented here) for outputting the final HTML of a page, after JavaScript has been executed. #96
- shot-scraper javascript has a new -r/--raw option for outputting the result of the JavaScript expression as a raw string rather than JSON encoded (shot-scraper javascript documentation). #95
- Tutorial: Automating screenshots for the Datasette documentation using shot-scraper.
I bumped it to 1.0 because shot-scraper is mature enough now that I’m ready to commit to not breaking existing features (at least without shipping a 2.0, which I hope to avoid for as long as possible).
I’m always trying to get braver about stamping a 1.0 release on my main projects.
(I really, really need to get Datasette 1.0 shipped soon.)
Releases this week
- datasette-gunicorn: 0.1—2022-10-22
  Plugin for running Datasette using Gunicorn
- shot-scraper: 1.0—(23 releases total)—2022-10-15
  A command-line utility for taking automated screenshots of websites
- asgi-gzip: 0.2—(2 releases total)—2022-10-13
  gzip middleware for ASGI applications, extracted from Starlette
Measuring traffic during the Half Moon Bay Pumpkin Festival four days ago
This weekend was the 50th annual Half Moon Bay Pumpkin Festival.
We live in El Granada, a tiny town 8 minutes drive from Half Moon Bay. There is a single road (coastal highway one) between the two towns, and the festival is locally notorious for its impact on traffic.
Natalie suggested that we measure the traffic and try and see the impact for ourselves!
Here’s the end result for Saturday. Read on for details on how we created it.

Collecting the data
I built a git scraper to gather data from the Google Maps Directions API. It turns out if you pass departure_time=now to that API it returns the current estimated time in traffic as part of the response.
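Here’s a quick sketch of what fetching that estimate looks like in Python, using the same two locations described below. The duration_in_traffic value is measured in seconds:

```python
# Sketch: ask the Directions API for the current drive time, in seconds.
# Requires a GOOGLE_MAPS_KEY environment variable; the locations are the
# Plus codes used later in this post.
import os

import requests

response = requests.get(
    "https://maps.googleapis.com/maps/api/directions/json",
    params={
        "origin": "GG49+CH, Half Moon Bay CA",
        "destination": "FH78+QJ, Half Moon Bay, CA",
        "departure_time": "now",
        "key": os.environ["GOOGLE_MAPS_KEY"],
    },
)
leg = response.json()["routes"][0]["legs"][0]
print(leg["duration_in_traffic"]["value"], "seconds in traffic")
```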
I picked a location in Half Moon Bay and a location in El Granada and constructed the following URL (pretty-printed):
https://maps.googleapis.com/maps/api/directions/json?
origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
&destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
&departure_time=now
&key=$GOOGLE_MAPS_KEY
The two locations here are defined using Google Plus codes. Here they are on Google Maps:
I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in this repository to fetch this API data, pretty-print it with jq and write it to the repository:
name: Scrape traffic
on:
push:
workflow_dispatch:
schedule:
- cron: '*/5 * * * *'
jobs:
shot-scraper:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Scrape
env:
GOOGLE_MAPS_KEY: ${{ secrets.GOOGLE_MAPS_KEY }}
run: |
curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&departure_time=now&key=$GOOGLE_MAPS_KEY" | jq > one.json
sleep 3
curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&departure_time=now&key=$GOOGLE_MAPS_KEY" | jq > two.json
- name: Commit and push
run: |-
git config user.name "Automated"
git config user.email "actions@users.noreply.github.com"
git add -A
timestamp=$(date -u)
git commit -m "${timestamp}" || exit 0
git pull --rebase
git push
I’m using a GitHub Actions secret called GOOGLE_MAPS_KEY to store the Google Maps API key.
This workflow runs every 5 minutes (more-or-less—GitHub Actions doesn’t necessarily stick to the schedule). It fetches the two JSON results and writes them to files called one.json and two.json.
... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I’ve built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.
Analyzing the data and drawing some charts
The trick with git scraping is that the data you care about ends up captured in the git commit log. The challenge is how to extract that back out again and turn it into something useful.
My git-history tool is designed to solve this. It’s a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.
Normally I run it against CSV or JSON files containing an array of rows—effectively tabular data already, where I just want to record what has changed in between commits.
For this project, I was storing the raw JSON output by the Google Maps API. I didn’t care about most of the information in there: I really just wanted the duration_in_traffic value.
git-history can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.
To cut a long story short, here’s the incantation that worked:
git-history file hmb.db one.json \
--convert '
try:
duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
return []
' \
--full-versions \
--id id
The git-history file command is used to load the history for a specific file—in this case it’s the file one.json, which will be loaded into a new SQLite database file called hmb.db.
The --convert code uses json.loads(content) to load the JSON for the current file version, then pulls out the ["routes"][0]["legs"][0]["duration_in_traffic"]["value"] nested value from it.
If that’s missing (e.g. in an earlier commit, when I hadn’t yet added the departure_time=now parameter to the URL) an exception will be caught and the function will return an empty list.
If the duration_in_traffic value is present, the function returns the following:
[{"id": "one", "duration_in_traffic": duration_in_traffic}]
git-history likes lists of dictionaries. It’s usually being run against files that contain many different rows, where the id column can be used to de-dupe rows across commits and spot what has changed.
In this case, each file only has a single interesting value.
Two more options are used here:
- --full-versions—tells git-history to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a null if a value has not changed in order to save space, but our data is tiny here so we don’t need any clever optimizations.
- --id id specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.
After running the above command, the resulting schema includes these tables:
CREATE TABLE [commits] (
[id] INTEGER PRIMARY KEY,
[namespace] INTEGER REFERENCES [namespaces]([id]),
[hash] TEXT,
[commit_at] TEXT
);
CREATE TABLE [item_version] (
[_id] INTEGER PRIMARY KEY,
[_item] INTEGER REFERENCES [item]([_id]),
[_version] INTEGER,
[_commit] INTEGER REFERENCES [commits]([id]),
[id] TEXT,
[duration_in_traffic] INTEGER
);
The commits table includes the date of the commit—commit_at.
The item_version table has that duration_in_traffic value.
So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:
select
commits.commit_at,
duration_in_traffic
from
item_version
join
commits on item_version._commit = commits.id
order by
commits.commit_at
That query returns data that looks like this:
| commit_at | duration_in_traffic |
|---|---|
| 2022-10-15T17:09:06+00:00 | 1110 |
| 2022-10-15T17:17:38+00:00 | 1016 |
| 2022-10-15T17:30:06+00:00 | 1391 |
A couple of problems here. First, the commit_at column is in UTC, not local time. And duration_in_traffic is in seconds, which aren’t particularly easy to read.
Here’s a SQLite fix for these two issues:
select
time(datetime(commits.commit_at, '-7 hours')) as t,
duration_in_traffic / 60 as mins_in_traffic
from
item_version
join
commits on item_version._commit = commits.id
order by
commits.commit_at
| t | mins_in_traffic |
|---|---|
| 10:09:06 | 18 |
| 10:17:38 | 16 |
| 10:30:06 | 23 |
datetime(commits.commit_at, '-7 hours') parses the UTC string as a datetime, then subtracts 7 hours from it to convert UTC to local California time.
I wrap that in time() here because for the chart I want to render I know everything will be on the same day.
mins_in_traffic now shows minutes, not seconds.
We now have enough data to render a chart!
But... we only have one of the two directions of traffic here. To process the numbers from two.json as well I ran this:
git-history file hmb.db two.json \
--convert '
try:
duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
return []
' \
--full-versions \
--id id --namespace item2
This is almost the same as the previous command. It’s running against two.json instead of one.json, and it’s using the --namespace item2 option.
This causes it to populate a new table called item2_version instead of item_version, which is a cheap trick to avoid having to figure out how to load both files into the same table.
Two lines on one chart
I rendered an initial single line chart using datasette-vega, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.
Since I now had one table for each direction of traffic (item_version and item2_version) I decided to combine those into a single table, suitable for pasting into Google Sheets.
Here’s the SQL I came up with to do that:
with item1 as (
select
time(datetime(commits.commit_at, '-7 hours')) as t,
duration_in_traffic / 60 as mins_in_traffic
from
item_version
join commits on item_version._commit = commits.id
order by
commits.commit_at
),
item2 as (
select
time(datetime(commits.commit_at, '-7 hours')) as t,
duration_in_traffic / 60 as mins_in_traffic
from
item2_version
join commits on item2_version._commit = commits.id
order by
commits.commit_at
)
select
item1.*,
item2.mins_in_traffic as mins_in_traffic_other_way
from
item1
join item2 on item1.t = item2.t
This uses two CTEs (Common Table Expressions—the with X as pieces), each using the query pattern I explained earlier, now named item1 and item2. Having defined these two CTEs, I can join them together on the t column, which is the time of day.
Try running this query in Datasette Lite.
Here’s the output of that query for Saturday (10am to 8pm):
| t | mins_in_traffic | mins_in_traffic_other_way |
|---|---|---|
| 10:09:06 | 18 | 8 |
| 10:17:38 | 16 | 8 |
| 10:30:06 | 23 | 9 |
| 10:47:38 | 23 | 9 |
| 10:57:37 | 23 | 9 |
| 11:08:20 | 26 | 9 |
| 11:22:27 | 26 | 9 |
| 11:38:42 | 26 | 9 |
| 11:52:35 | 25 | 9 |
| 12:03:23 | 24 | 9 |
| 12:15:16 | 21 | 9 |
| 12:27:51 | 22 | 9 |
| 12:37:48 | 22 | 10 |
| 12:46:41 | 21 | 10 |
| 12:55:03 | 21 | 10 |
| 13:05:10 | 21 | 11 |
| 13:17:57 | 21 | 11 |
| 13:32:55 | 21 | 11 |
| 13:44:53 | 19 | 12 |
| 13:55:22 | 21 | 14 |
| 14:05:21 | 22 | 14 |
| 14:17:48 | 23 | 15 |
| 14:31:04 | 22 | 15 |
| 14:41:59 | 21 | 14 |
| 14:51:48 | 18 | 14 |
| 15:00:09 | 18 | 15 |
| 15:11:17 | 15 | 14 |
| 15:25:48 | 14 | 15 |
| 15:39:41 | 11 | 14 |
| 15:51:11 | 14 | 15 |
| 15:59:34 | 15 | 15 |
| 16:10:50 | 19 | 16 |
| 16:25:43 | 19 | 18 |
| 16:53:06 | 19 | 18 |
| 17:11:34 | 18 | 16 |
| 17:40:29 | 11 | 11 |
| 18:12:07 | 10 | 11 |
| 18:58:17 | 8 | 9 |
| 20:05:13 | 7 | 7 |
I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:

Here’s the same chart for Sunday:

Our Google Sheet is here—the two days have two separate tabs within the sheet.
Building the SQLite database in GitHub Actions
I did most of the development work for this project on my laptop, running git-history and datasette locally for speed of iteration.
Once I had everything working, I decided to automate the process of building the SQLite database as well.
I made the following changes to my GitHub Actions workflow:
jobs:
shot-scraper:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0 # Needed by git-history
- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: "3.10"
cache: "pip"
- run: pip install -r requirements.txt
- name: Scrape
# Same as before...
# env:
# run
- name: Build SQLite database
run: |
rm -f hmb.db # Recreate from scratch each time
git-history file hmb.db one.json \
--convert '
try:
duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
return []
' \
--full-versions \
--id id
git-history file hmb.db two.json \
--convert '
try:
duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
return []
' \
--full-versions \
--id id --namespace item2
- name: Commit and push
# Same as before...
I also added a requirements.txt file containing just git-history.
Note how the actions/checkout@v3 step now has fetch-depth: 0—this is necessary because git-history needs to loop through the entire repository history, but actions/checkout@v3 defaults to only fetching the most recent commit.
The setup-python step uses cache: "pip", which causes it to cache installed dependencies from requirements.txt between runs.
Because that big git-history step creates a hmb.db SQLite database, the “Commit and push” step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.
Normally I wouldn’t do this, because Git isn’t a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won’t continue to be updated beyond the end of the pumpkin festival.
End result: hmb.db is available in the GitHub repository.
Querying it using Datasette Lite
Datasette Lite is a repackaged version of my Datasette server-side Python application which runs entirely in the user’s browser, using WebAssembly.
A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.
These database files need to be served with CORS headers. Every file served by GitHub includes these headers!
Which means the following URL can be used to open up the latest hmb.db file directly in Datasette in the browser:
https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db
(This takes advantage of a feature I added to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)
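If you want to host a database file somewhere other than GitHub you need to remember those CORS headers. Here’s a minimal sketch of a local static file server that adds the necessary header, using only the Python standard library:

```python
# serve_with_cors.py - minimal static file server that adds the CORS header
# Datasette Lite needs in order to fetch a database file. Sketch only.
from http.server import HTTPServer, SimpleHTTPRequestHandler


class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CORSRequestHandler).serve_forever()
```

Run that in the directory containing hmb.db and you can point the ?url= parameter at http://localhost:8000/hmb.db instead.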
URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:
And finally... Datasette Lite has plugin support. Adding &install=datasette-copyable to the URL adds the datasette-copyable plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).
Here’s an example of that plugin in action.
This was a fun little project that brought together a whole bunch of things I’ve been working on over the past few years. Here’s some more of my writing on these different techniques and tools:
- Git scraping is the key technique I’m using here to collect the data
- I’ve written a lot about GitHub Actions
- These are my notes about git-history, the tool I used to turn a commit history into a SQLite database
- Here’s my series of posts about Datasette Lite
Automating screenshots for the Datasette documentation using shot-scraper nine days ago
I released shot-scraper back in March as a tool for keeping screenshots in documentation up-to-date.
It’s very easy for feature screenshots in documentation for a web application to drift out-of-date with the latest design of the software itself.
shot-scraper is a command-line tool that aims to solve this.
You can use it to take one-off screenshots like this:
shot-scraper https://latest.datasette.io/ --height 800
Or you can define multiple screenshots in a single YAML file—let’s call this shots.yml:
- url: https://latest.datasette.io/
height: 800
output: index.png
- url: https://latest.datasette.io/fixtures
height: 800
output: database.png
And run them all at once like this:
shot-scraper multi shots.yml
This morning I used shot-scraper to replace all of the existing screenshots in the Datasette documentation with up-to-date, automated equivalents.
I decided to use this as an opportunity to create a more detailed tutorial for how to use shot-scraper for this kind of screenshot automation project.
Four screenshots to replace
Datasette’s documentation included four screenshots that I wanted to replace with automated equivalents.
full_text_search.png illustrates the full-text search feature:

advanced_export.png displays Datasette’s “advanced export” dialog:

binary_data.png displays just a small fragment of a table with binary download links:

facets.png demonstrates faceting against a table:

I’ll walk through each screenshot in turn.
full_text_search.png
I decided to use a different example for the new screenshot, because I don’t currently have a live instance for that table running against the most recent Datasette release.
I went with https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date—a search against the UK register of members interests for “hamper” (see Exploring the UK Register of Members Interests with SQL and Datasette).
The existing image in the documentation was 960 pixels wide, so I stuck with that and tried a few iterations until I found a height that I liked.
I installed shot-scraper and ran the following, in my /tmp directory:
shot-scraper 'https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date' \
-h 585 \
-w 960
This produced a register-of-members-interests-datasettes-com-regmem-items.png file which looked good when I opened it in Preview.
I turned that into the following YAML in my shots.yml file:
- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date
height: 585
width: 960
output: regmem-search.png
Running shot-scraper multi shots.yml against that file produced this regmem-search.png image:

advanced_export.png
This next image isn’t a full page screenshot—it’s just a small fragment of the page.
shot-scraper can take partial screenshots based on one or more CSS selectors. Given a CSS selector the tool draws a box around just that element and uses that to take the screenshot—adding optional padding.
Here’s the recipe for the advanced export box—I used the same register-of-members-interests.datasettes.com example for it as this had enough rows to trigger all of the advanced options to be displayed:
shot-scraper 'https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper' \
-s '#export' \
-p 10
The -p 10 here specifies 10px of padding, needed to capture the drop shadow on the box.
Here’s the equivalent YAML:
- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper
selector: "#export"
output: advanced-export.png
padding: 10
And the result:

binary_data.png
This screenshot required a different trick.
I wanted to take a screenshot of the table on this page.
The full table looks like this, with three rows:

I only wanted the first two of these to be shown in the screenshot though.
shot-scraper has the ability to execute JavaScript on the page before the screenshot is taken. This can be used to remove elements first.
Here’s the JavaScript I came up with to remove all but the first two rows (actually the first three, because the table header counts as a row too):
Array.from(
document.querySelectorAll('tr:nth-child(n+3)'),
el => el.parentNode.removeChild(el)
);
I did it this way so that if I add any more rows to that test table in the future the code will still remove everything but the first two.
The CSS selector tr:nth-child(n+3) selects all rows that are not the first three (one header plus two content rows).
Here’s how to run that from the command-line, and then take a 10 pixel padded screenshot of just the table on the page after it has been modified by the JavaScript:
shot-scraper 'https://latest.datasette.io/fixtures/binary_data' \
-j 'Array.from(document.querySelectorAll("tr:nth-child(n+3)"), el => el.parentNode.removeChild(el));' \
-s table -p 10
The YAML I added to shots.yml:
- url: https://latest.datasette.io/fixtures/binary_data
selector: table
javascript: |-
Array.from(
document.querySelectorAll('tr:nth-child(n+3)'),
el => el.parentNode.removeChild(el)
);
padding: 10
output: binary-data.png
And the resulting image:

facets.png
I left the most complex screenshot to last.
For the faceting screenshot, I wanted to include the “suggested facet” links at the top of the page, a set of active facets and then the first three rows of the following table.
But... the table has quite a lot of columns. For a neater screenshot I only wanted to include a subset of columns in the final shot.
Here’s the screenshot I ended up taking:

And the YAML recipe:
- url: https://congress-legislators.datasettes.com/legislators/legislator_terms?_facet=type&_facet=party&_facet=state&_facet_size=10
selectors_all:
- .suggested-facets a
- tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))
padding: 10
output: faceting-details.png
The key trick I’m using here is that selectors_all list.
The usual shot-scraper selector option finds the first element on the page matching the specified CSS selector and takes a screenshot of that.
--selector-all—or the YAML equivalent selectors_all—instead finds EVERY element that matches any of the specified selectors and draws a bounding box containing all of them.
I wanted that bounding box to surround a subset of the table cells on the page. I used this CSS selector to indicate that subset:
tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))
Here’s what GPT-3 says if you ask it to explain the selector:
Explain this CSS selector:
tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))
This selector is selecting all table cells in rows that are not the fourth row or greater, and are not in columns that are the 11th column or greater.
(See also this TIL.)
Automating everything using GitHub Actions
Here’s the full shots.yml YAML needed to generate all four of these screenshots:
- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date
height: 585
width: 960
output: regmem-search.png
- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper
selector: "#export"
output: advanced-export.png
padding: 10
- url: https://congress-legislators.datasettes.com/legislators/legislator_terms?_facet=type&_facet=party&_facet=state&_facet_size=10
selectors_all:
- .suggested-facets a
- tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))
padding: 10
output: faceting-details.png
- url: https://latest.datasette.io/fixtures/binary_data
selector: table
javascript: |-
Array.from(
document.querySelectorAll('tr:nth-child(n+3)'),
el => el.parentNode.removeChild(el)
);
padding: 10
output: binary-data.png
Running shot-scraper multi shots.yml against this file takes all four screenshots.
But I want this to be fully automated! So I turned to GitHub Actions.
A while ago I created a template repository for setting up GitHub Actions to take screenshots using shot-scraper and write them back to the same repo. I wrote about that in Instantly create a GitHub repository to take screenshots of a web page.
I had previously used that recipe to create my datasette-screenshots repository—with its own shots.yml file.
So I added the new YAML to that existing file, committed the change, waited a minute and the result was all four images stored in that repository!
My datasette-screenshots workflow actually has two key changes from my default template. First, it takes every screenshot twice—once as a retina image and once as a regular image:
- name: Take retina shots
run: |
shot-scraper multi shots.yml --retina
- name: Take non-retina shots
run: |
mkdir -p non-retina
cd non-retina
shot-scraper multi ../shots.yml
cd ..
This provides me with both a high quality image and a smaller, faster-loading image for each screenshot.
Secondly, it runs oxipng to optimize the PNGs before committing them to the repo:
- name: Optimize PNGs
run: |-
oxipng -o 4 -i 0 --strip safe *.png
oxipng -o 4 -i 0 --strip safe non-retina/*.png
The shot-scraper documentation describes this pattern in more detail.
With all of that in place, simply committing a change to the shots.yml file is enough to generate and store the new screenshots.
Linking to the images
One last problem to solve: I want to include these images in my documentation, which means I need a way to link to them.
I decided to use GitHub to host these directly, via the raw.githubusercontent.com domain—which is fronted by the Fastly CDN.
I care about up-to-date images, but I also want different versions of the Datasette documentation to reflect the corresponding design in their screenshots—so I needed a way to snapshot those screenshots to a known version.
Repository tags are one way to do this.
I tagged the datasette-screenshots repository with 0.62, since that’s the version of Datasette that the screenshots were taken for.
This gave me the following URLs for the images:
- https://raw.githubusercontent.com/simonw/datasette-screenshots/0.62/advanced-export.png (retina)
- https://raw.githubusercontent.com/simonw/datasette-screenshots/0.62/non-retina/regmem-search.png
- https://raw.githubusercontent.com/simonw/datasette-screenshots/0.62/binary-data.png (retina)
- https://raw.githubusercontent.com/simonw/datasette-screenshots/0.62/non-retina/faceting-details.png
To save on page loading time I decided to use the non-retina URLs for the two larger images.
Here’s the commit that updated the Datasette documentation to link to these new images (and deleted the old images from the repo).
You can see the new images in the documentation on these pages:
Weeknotes: Publishing data using Datasette Cloud 11 days ago
My initial preview releases of Datasette Cloud (the SaaS version of my Datasette open source project) have focused on private data collaboration.
Users can create private spaces for their data, and then invite in team members to collaborate in that space.
Each space runs in its own Firecracker container (hosted on Fly), completely isolated from other instances.
The more time I’ve spent with the preview, the more I’ve felt that something crucial is missing.
Datasette can do a lot of things, but in my opinion the thing it is better at than anything else is publishing data.
A question I like to ask people in my office hours sessions is “what job have you hired Datasette to solve?”. One of the most common answers is “to publish data online”.
So Datasette Cloud really needs to be able to help people publish their data to a wider audience! Leaving that feature out means leaving a lot of the value of the open source product off the table.
I’d been contemplating a number of more elaborate strategies for this: I’m looking forward to being able to use LiteFS to host globally distributed read-replicas for example.
But the simplest thing that could possibly work would just be to allow people to toggle individual tables from private to public.
I decided to build that first.
I try to build as much of the functionality of Datasette Cloud as possible as open source plugins. So I built a new plugin: datasette-public.
The plugin hooks into Datasette’s authentication and permissions system, which is already used for the existing private spaces feature.
It adds a new SQLite table called _public_tables—and a new permission rule which grants access to any user if the table they are trying to access is listed there.
Then it adds a little bit of UI which users can use to add or remove a table from that list of public tables.
And that’s pretty much it! It’s a simple implementation but it works very nicely as a first draft of the new feature.
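To give a flavour of how a plugin like this hooks into Datasette’s permission system, here’s a rough sketch of the relevant plugin hook. This is illustrative only, not the actual datasette-public source, and the _public_tables schema shown here is an assumption:

```python
# Illustrative sketch of a permission_allowed hook, not the real datasette-public code.
# Assumes a _public_tables table with a table_name column in each database.
from datasette import hookimpl


@hookimpl
def permission_allowed(datasette, actor, action, resource):
    async def inner():
        if action != "view-table" or resource is None:
            return None  # no opinion; defer to other permission checks
        database_name, table_name = resource
        db = datasette.get_database(database_name)
        result = await db.execute(
            "select 1 from _public_tables where table_name = ?", [table_name]
        )
        if result.first():
            return True  # the table has been marked as public
        return None

    return inner
```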
Here’s an example of a table I published using the feature:
https://simon.datasette.cloud/data/global-power-plants
Working on Datasette Cloud has been fantastic for ironing out details in Datasette itself. Here are a couple of new Datasette issues that emerged from this work:
- #1829: Table/database that is private due to inherited permissions does not show padlock
- #1831: If user can see table but NOT database/instance nav links should not display
My next area of focus is going to be around the Datasette and Datasette Cloud APIs.
Datasette Cloud gets a lot more interesting once it’s possible to use authenticated API calls to write data to it—rather than just supporting uploaded CSVs.
Working on this will drive some long-needed work around writable APIs in Datasette itself—something that until now has been entirely the realm of plugins such as datasette-insert.
Releases this week
- datasette-public: 0.2—(2 releases total)—2022-10-07
  Make specific Datasette tables visible to the public
- datasette-sentry: 0.3—(6 releases total)—2022-10-06
  Datasette plugin for configuring Sentry
- datasette-search-all: 1.1—(8 releases total)—2022-10-05
  Datasette plugin for searching all searchable tables at once
Is the AI spell-casting metaphor harmful or helpful? 18 days ago
For a few weeks now I’ve been promoting spell-casting as a metaphor for prompt design against generative AI systems such as GPT-3 and Stable Diffusion.
Here’s an example, in this snippet from my recent Changelog podcast episode.
Relevant section towards the end (transcription assisted by Whisper):
When you’re working with these, you’re not a programmer anymore. You’re a wizard, right? I always wanted to be a wizard. We get to be wizards now. And we’re learning these spells. We don’t know why they work. Why does Neuromancer work? Who knows? Nobody knows. But you add it to your spell book and then you combine it with other spells. And if you’re unlucky and combine them in the wrong way, you might get demons coming out at you.
I had an interesting debate on Twitter this morning about whether or not this metaphor is harmful or helpful. There are some very interesting points to discuss!
The short version: I’m now convinced that the value of this metaphor changes based on the audience.
The key challenge here is to avoid implying that these systems are “magical” in that they are incomprehensible and mysterious. As such, I believe the metaphor is only appropriate when you’re talking to people who are working with these systems from a firm technical perspective.
Expanding the spell-casting metaphor
When I compare prompts to spells and I’m talking to another software engineer, here’s the message I am trying to convey:
Writing prompts is not like writing regular code. There is no API reference or programming language specification that will let you predict exactly what will happen.
Instead, you have to experiment: try different fragments of prompts and see what works. As you get a feel for these fragments you can then start exploring what happens when you combine them together.
Over time you will start to develop an intuition for what works. You’ll build your own collection of fragments and patterns, and exchange those with other people.
The weird thing about this process is that no-one can truly understand exactly how each fragment works—not even the creators of the models. We’ve learned that “Trending on artstation” produces better images with Stable Diffusion—but we can only ever develop a vague intuition for why.
It honestly feels more like fictional spell-casting than programming. Each fragment is a new spell that you have learned and can add to your spell book.
It’s confusing, and surprising, and a great deal of fun.
For me, this captures my experience working with prompts pretty accurately. My hope is that this is a useful way to tempt other programmers into exploring this fascinating new area.
The other thing I like about this metaphor is that, to my mind, it touches on some of the risks of generative AI as well.
Fiction is full of tales of magic gone wrong: of wizards who lost control of forces that they did not fully understand.
When I think about prompt injection attacks I imagine good wizards and evil wizards casting spells and counter-spells at each other! Software vulnerabilities in plain English totally fit my mental model of casting spells.
But in debating this on Twitter I realized that whether this metaphor makes sense to you relies pretty heavily on which specific magic system comes to mind for you.
I was raised on Terry Pratchett’s Discworld, which has a fantastically rich and deeply satirical magic system. Incorrect incantations frequently produce demons! Discworld wizards are mostly academics who spend more time thinking about lunch than practicing magic. The most interesting practitioners are the witches, for whom the most useful magic is more like applied psychology (“headology” in the books).
If your mental model of “magic” is unexplained supernatural phenomena and fairies granting wishes then my analogy doesn’t really fit.
Magic as a harmful metaphor for AI
The argument for this metaphor causing harm is tied to the larger challenge of helping members of the public understand what is happening in this field.
Look behind the curtain: Don’t be dazzled by claims of ‘artificial intelligence’ by Emily M. Bender is a useful summary of some of these challenges.
In Technology Is Magic, Just Ask The Washington Post from 2015 Jon Evans makes the case that treating technology as “magic” runs a risk of people demanding solutions to societal problems that cannot be delivered.
Understanding exactly what these systems are capable of and how they work is hard enough for people with twenty years of software engineering experience, let alone everyone else.
The last thing people need is to be told that these systems are “magic”—something that is permanently beyond their understanding and control.
These systems are not magic. They’re mathematics. It turns out that if you throw enough matrix multiplication and example data (literally terabytes of it) at a problem, you can get a system that can appear to do impossible things.
But implying that they are magic—or even that they are “intelligent”—does not give people a useful mental model. GPT-3 is not a wizard, and it’s not intelligent: it’s a stochastic parrot, capable of nothing more than predicting which word should come next to form a sentence that best matches the corpus it has been trained on.
This matters to me a great deal. In conversations I have had around AI ethics the only universal answer I’ve found is that it is ethical to help people understand what these systems can do and how they work.
So I plan to be more intentional with my metaphors. I’ll continue to enthuse about spell-casting with fellow nerds who aren’t at risk of assuming these systems are incomprehensible magic, but I’ll keep searching for better ways to help explain these systems to everyone else.
Software engineering practices 22 days ago
Gergely Orosz started a Twitter conversation asking about recommended “software engineering practices” for development teams.
(I really like his rejection of the term “best practices” here: I always feel it’s prescriptive and misguiding to announce something as “best”.)
I decided to flesh some of my replies out into a longer post.
- Documentation in the same repo as the code
- Mechanisms for creating test data
- Rock solid database migrations
- Templates for new projects and components
- Automated code formatting
- Tested, automated process for new development environments
- Automated preview environments
Documentation in the same repo as the code
The most important characteristic of internal documentation is trust: do people trust that documentation both exists and is up-to-date?
If they don’t, they won’t read it or contribute to it.
The best trick I know of for improving the trustworthiness of documentation is to put it in the same repository as the code it documents, for a few reasons:
- You can enforce documentation updates as part of your code review process. If a PR changes code in a way that requires documentation updates, the reviewer can ask for those updates to be included.
- You get versioned documentation. If you’re using an older version of a library you can consult the documentation for that version. If you’re using the current main branch you can see documentation for that, without confusion over what corresponds to the most recent “stable” release.
- You can integrate your documentation with your automated tests! I wrote about this in Documentation unit tests, which describes a pattern for introspecting code and then ensuring that the documentation at least has a section header that matches specific concepts, such as plugin hooks or configuration options.
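As a taste of what a documentation unit test can look like, here’s a heavily simplified sketch. The module name, docs path and heading convention are all invented for the example:

```python
# test_docs.py - simplified sketch of a "documentation unit test".
# The module name, docs path and heading convention are invented for this example.
import pathlib
import re

import pytest

import myproject.plugin_hooks as plugin_hooks

DOCS = (pathlib.Path(__file__).parent / "docs" / "plugin_hooks.md").read_text()
# Collect every "## hook_name(...)" heading from the documentation
DOCUMENTED = set(re.findall(r"^## (\w+)\(", DOCS, re.MULTILINE))

HOOKS = [name for name in dir(plugin_hooks) if not name.startswith("_")]


@pytest.mark.parametrize("hook", HOOKS)
def test_every_hook_is_documented(hook):
    assert hook in DOCUMENTED, f"{hook} is missing from docs/plugin_hooks.md"
```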
Mechanisms for creating test data
When you work on large products, your customers will inevitably find surprising ways to stress or break your system. They might create an event with over a hundred different types of ticket for example, or an issue thread with a thousand comments.
These can expose performance issues that don’t affect the majority of your users, but can still lead to service outages or other problems.
Your engineers need a way to replicate these situations in their own development environments.
One way to handle this is to provide tooling to import production data into local environments. This has privacy and security implications—what if a developer laptop gets stolen that happens to have a copy of your largest customer’s data?
A better approach is to have a robust system in place for generating test data, that covers a variety of different scenarios.
You might have a button somewhere that creates an issue thread with a thousand fake comments, with a note referencing the bug that this helps emulate.
Any time a new edge case shows up, you can add a new recipe to that system. That way engineers can replicate problems locally without needing copies of production data.
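In a Django application one of those recipes can be as simple as a function that lives alongside the rest of your test utilities. A sketch, with invented model names:

```python
# test_data_recipes.py - sketch of a test data "recipe", with invented model names.
from myapp.models import Comment, Issue


def issue_with_many_comments(num_comments=1000):
    "Replicates the performance problems caused by very long comment threads"
    issue = Issue.objects.create(title="Stress test: huge comment thread")
    Comment.objects.bulk_create(
        Comment(issue=issue, body=f"Fake comment number {i}")
        for i in range(num_comments)
    )
    return issue
```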
Rock solid database migrations
The hardest part of large-scale software maintenance is inevitably the bit where you need to change your database schema.
(I’m confident that one of the biggest reasons NoSQL databases became popular over the last decade was the pain people had associated with relational databases due to schema changes. Of course, NoSQL database schema modifications are still necessary, and often they’re even more painful!)
So you need to invest in a really good, version-controlled mechanism for managing schema changes. And a way to run them in production without downtime.
If you do not have this your engineers will respond by being fearful of schema changes. Which means they’ll come up with increasingly complex hacks to avoid them, which piles on technical debt.
This is a deep topic. I mostly use Django for large database-backed applications, and Django has the best migration system I’ve ever personally experienced. If I’m working without Django I try to replicate its approach as closely as possible:
- The database knows which migrations have already been applied. This means when you run the “migrate” command it can run just the ones that are still needed—important for managing multiple databases, e.g. production, staging, test and development environments.
- A single command that applies pending migrations, and updates the database rows that record which migrations have been run.
- Optional: rollbacks. Django migrations can be rolled back, which is great for iterating in a development environment, but using that in production is actually quite rare: I’ll often ship a new migration that reverses the change rather than using a rollback, partly to keep the record of the mistake in version control.
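For reference, here’s what a typical Django migration looks like: a small, version-controlled Python file (app, model and field names invented):

```python
# tickets/migrations/0042_ticket_notes.py - an illustrative Django migration.
# App, model and field names are invented for this example.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("tickets", "0041_previous_migration")]

    operations = [
        migrations.AddField(
            model_name="ticket",
            name="notes",
            field=models.TextField(blank=True, default=""),
        ),
    ]
```

Running python manage.py migrate applies any migrations that have not yet been recorded in the django_migrations table, which is the mechanism behind the first bullet above.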
Even harder is the challenge of making schema changes without any downtime. I’m always interested in reading about new approaches for this—GitHub’s gh-ost is a neat solution for MySQL.
An interesting consideration here is that it’s rarely possible to have application code and database schema changes go out at the exact same instant in time. As a result, to avoid downtime you need to design every schema change with this in mind. The process needs to be:
- Design a new schema change that can be applied without changing the application code that uses it.
- Ship that change to production, upgrading your database while keeping the old code working.
- Now ship new application code that uses the new schema.
- Ship a new schema change that cleans up any remaining work—dropping columns that are no longer used, for example.
This process is a pain. It’s difficult to get right. The only way to get good at it is to practice it a lot over time.
My rule is this: schema changes should be boring and common, as opposed to being exciting and rare.
Templates for new projects and components
If you’re working with microservices, your team will inevitably need to build new ones.
If you’re working in a monorepo, you’ll still have elements of your codebase with similar structures—components and feature implementations of some sort.
Be sure to have really good templates in place for creating these “the right way”—with the right directory structure, a README and a test suite with a single, dumb passing test.
I like to use the Python cookiecutter tool for this. I’ve also used GitHub template repositories, and I even have a neat trick for combining the two.
These templates need to be maintained and kept up-to-date. The best way to do that is to make sure they are being used—every time a new project is created is a chance to revise the template and make sure it still reflects the recommended way to do things.
Automated code formatting
This one’s easy. Pick a code formatting tool for your language—like Black for Python or Prettier for JavaScript (I’m so jealous of how Go has gofmt built in)—and run its “check” mode in your CI flow.
Don’t argue with its defaults, just commit to them.
This saves an incredible amount of time in two places:
- As an individual, you get back all of that mental energy you used to spend thinking about the best way to format your code and can spend it on something more interesting.
- As a team, your code reviews can entirely skip the pedantic arguments about code formatting. Huge productivity win!
Tested, automated process for new development environments
The most painful part of any software project is inevitably setting up the initial development environment.
The moment your team grows beyond a couple of people, you should invest in making this work better.
At the very least, you need a documented process for creating a new environment—and it has to be known-to-work, so any time someone is onboarded using it they should be encouraged to fix any problems in the documentation or accompanying scripts as they encounter them.
Much better is an automated process: a single script that gets everything up and running. Tools like Docker have made this a LOT easier over the past decade.
I’m increasingly convinced that the best-in-class solution here is cloud-based development environments. The ability to click a button on a web page and have a fresh, working development environment running a few seconds later is a game-changer for large development teams.
Gitpod and Codespaces are two of the most promising tools I’ve tried in this space.
I’ve seen developers lose hours a week to issues with their development environment. Eliminating that across a large team is the equivalent of hiring several new full-time engineers!
Automated preview environments
Reviewing a pull request is a lot easier if you can actually try out the changes.
The best way to do this is with automated preview environments, directly linked to from the PR itself.
These are getting increasingly easy to offer. Vercel, Netlify, Render and Heroku all have features that can do this. Building a custom system on top of something like Google Cloud Run or Fly Machines is also possible with a bit of work.
This is another one of those things which requires some up-front investment but will pay itself off many times over through increased productivity and quality of reviews.