Simon Willison’s Weblog

Weeknotes: Datasette Lite, s3-credentials, shot-scraper, datasette-edit-templates and more

Despite distractions from AI I managed to make progress on a bunch of different projects this week, including new releases of s3-credentials and shot-scraper, a new datasette-edit-templates plugin and a small but neat improvement to Datasette Lite.

Better GitHub support for Datasette Lite

Datasette Lite is Datasette running in WebAssembly. Originally intended as a cool tech demo it’s quickly becoming a key component of the wider Datasette ecosystem—just this week I saw that mySociety are using it to help people explore their WhatDoTheyKnow Authorities Dataset.

One of the neat things about Datasette Lite is that you can feed it URLs to CSV files, SQLite database files and even SQL initialization scripts and it will fetch them into your browser and serve them up inside Datasette. I wrote more about this capability in Joining CSV files in your browser using Datasette Lite.

There’s just one catch: because those URLs are fetched by JavaScript running in your browser, they need to be served from a host that sets the Access-Control-Allow-Origin: * header (see MDN). This is not an easy thing to explain to people!

The good news here is that GitHub makes every public file (and every Gist) hosted on GitHub available as static hosting with that magic header.

The bad news is that you have to know how to construct that URL! GitHub’s “raw” links redirect to that URL, but JavaScript fetch() calls can’t follow redirects if they don’t have that header—and GitHub’s redirects do not.

So you need to know that if you want to load the SQLite database file from this page on GitHub:

https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite

You first need to rewrite that URL to the following, which is served with the correct CORS header:

https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite

Asking human’s to do that by hand isn’t reasonable. So I added some code!

const githubUrl = /^https:\/\/github.com\/(.*)\/(.*)\/blob\/(.*)(\?raw=true)?$/;

function fixUrl(url) {
  const matches = githubUrl.exec(url);
  if (matches) {
    return `https://raw.githubusercontent.com/${matches[1]}/${matches[2]}/${matches[3]}`;
  }
  return url;
}

Fun aside: GitHub Copilot auto-completed that return statement for me, correctly guessing the URL string I needed based on the regular expression I had defined several lines earlier.

Now any time you feed Datasette Lite a URL, if it’s a GitHub page it will automatically rewrite it to the CORS-enabled equivalent on the raw.githubusercontent.com domain.

Some examples:

datasette-edit-templates

I started working on this plugin a couple of years ago but didn’t get it working. This week I finally closed the initial issue and shipped a first alpha release.

It’s pretty fun. On first launch it creates a _templates_ table in your database. Then it allows the root user (run datasette data.db --root and click the link to sign in as root) to edit Datasette’s default set of Jinja templates, writing their changes to that new table.

Datasette uses those templates straight away. It turns the whole of Datasette into an interface for editing itself.

Here’s an animated demo showing the plugin in action:

Animated screenshot. The Datasette app menu now has a Edit templates item, which goes to a page listing all of the templates. If you edit the _footer.html template to add an exclamation mark on the next page the Datasette footer shows that change.

The implementation is currently a bit gnarly, but I’ve filed an issue in Datasette core to help clear some of it up.

s3-credentials get-objects and put-objects

I built s3-credentials to solve my number one frustration with AWS S3: the surprising level of complexity involved in issuing IAM credentials that could only access a specific S3 bucket. I introduced it in s3-credentials: a tool for creating credentials for S3 buckets.

Once you’ve created credentials, you need to be able to do stuff with them. I find the default AWS CLI tools relatively unintuitive, so s3-credentials has continued to grow other commands as and when I feel the need for them.

The latest version, 0.14, adds two more: get-objects and put-objects.

These let you do things like this:

s3-credentials get-objects my-bucket -p "*.txt" -p "static/*.css"

This downloads every key in my-bucket with a name that matches either of those patterns.

s3-credentials put-objects my-bucket one.txt ../other-directory

This uploads one.txt and the whole other-directory folder with all of its contents.

As with most of my projects, the GitHub issues threads for each of these include a blow-by-blow account of how I finalized their design—#68 for put-objects and #78 for get-objects.

shot-scraper --log-requests

shot-scraper is my tool for automating screenshots, built on top of Playwright.

Its latest feature was inspired by Datasette Lite.

I have an ongoing ambition to get Datasette Lite to work entirely offline, using Service Workers.

The first step is to get it to work without loading external resources—it currently hits PyPI and a separate CDN multiple times to download wheels every time you load the application.

To do that, I need a reliable list of all of the assets that it’s fetching.

Wouldn’t it be handy If I could run a command and get a list of those resources?

The following command now does exactly that:

shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests requests.log

Here’ the --wait-for is needed to ensure shot-scraper doesn’t terminate until the application has fully loaded—detected by waiting for a <h2> element to be added to the page.

The --log-requests bit is a new feature in shot-scraper 0.15: it logs out a newline-delimited JSON file with details of all of the resources fetched during the run. That file starts like this:

{"method": "GET", "url": "https://lite.datasette.io/", "size": 10516, "timing": {...}}
{"method": "GET", "url": "https://plausible.io/js/script.manual.js", "size": 1005, "timing": {...}}
{"method": "GET", "url": "https://latest.datasette.io/-/static/app.css?cead5a", "size": 16230, "timing": {...}}
{"method": "GET", "url": "https://lite.datasette.io/webworker.js", "size": 4875, "timing": {...}}
{"method": "GET", "url": "https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js", "size": null, "timing": {...}}

This is already pretty useful... but wouldn’t it be more useful if I could explore that data in Datasette?

That’s what this recipe does:

shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests - | \
  sqlite-utils insert /tmp/datasette-lite.db log - --flatten --nl

It’s piping the newline-delimited JSON to sqlite-utils insert which then inserts it, using the --flatten option to turn that nested timing object into a flat set of columns.

I decided to share it by turning it into a SQL dump and publishing that to this Gist. I did that using the sqlite-utils memory command to convert it to a SQL dump like so:

shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests - | \
  sqlite-utils memory stdin:nl --flatten --dump > dump.sql

stdin:nl means “read from standard input and treat that as newline-delimited JSON”. Then I run a select * command and use --dump to output that to dump.sql, which I pasted into a new Gist.

So now I can open the result in Datasette Lite!

Datasette on Sandstorm

Sandstorm is “an open source platform for self-hosting web apps”. You can think of it as an easy to use UI over a Docker-like container platform—once you’ve installed it on a server you can use it to manage and install applications that have been bundled for it.

Jacob Weisz has been doing exactly that for Datasette. The result is Datasette in the Sandstorm App Market.

The listing for Datasette on the Sandstorm App Market, with a prominent DEMO button

You can see how it works in the ocdtrekkie/datasette-sandstorm repo. I helped out by building a small datasette-sandstorm-support plugin to show how permissions and authentication can work against Sandstorm’s custom HTTP headers.

Releases this week

TIL this week

This is Weeknotes: Datasette Lite, s3-credentials, shot-scraper, datasette-edit-templates and more by Simon Willison, posted on 16th September 2022.

Next: I don't know how to solve prompt injection

Previous: Prompt injection attacks against GPT-3