Simon Willison’s Weblog

Weeknotes: sqlite-transform 1.1, Datasette 0.58.1, datasette-graphql 1.5

21st July 2021

Work on Project Pelican inspires new features and improvements across a number of different projects.

Project Pelican is an upcoming collaborative data release I’m working on with an embargo—hence the code name. It involves around 700MB of CSV data converted to SQLite, which has led to some small but notable improvements in several of my tools.

Datasette 0.58 and 0.58.1

I published annotated release notes for Datasette 0.58 last week. In addition to several plugin hook improvements, the big change was a 10x performance improvement to Datasette facets, which was discovered during work on Project Pelican. Conversations about this on the SQLite Forum led to a new optimization by D. Richard Hipp landing in SQLite trunk!

I released Datasette 0.58.1 shortly afterwards with a fix for a rare intermittent race condition that I introduced back in January.

sqlite-transform 1.0 and 1.1

The data I am working with for Project Pelican has a confusing characteristic: some of the columns contain integers almost consistently, but have a few rows that contain text instead: a - character, a number with commas like 1,232, or a string such as >40.

I decided to address this by adding extra columns: for each affected column, a new integer column named after the original with an _int suffix, containing just the number, if one is available. So if the column was called total there would be a total_int integer column reflecting its value.

My sqlite-transform tool could almost do this: it has a lambda command that lets you execute Python code against every value in a column to convert it in some way.

In sqlite-transform 1.0 I added the ability to write that converted value to a different output column—so now I can do this:

sqlite-transform lambda project-pelican.db \
  sometable total \
    --code '
if (
    value.isdigit()
    or value.replace(",", "").isdigit()
    or value.startswith("<")
):
    return int(value.replace(",", "").replace("<", ""))
' --output total_int --output-type integer

This command-line script compiles the Python code in --code once and then executes it against every value in the total column. The return value is written to the newly created total_int column.

Under the hood this works by defining a custom SQLite function called transform() using the compiled Python code, then executing the SQL equivalent of the following:

update sometable set total_int = transform(total)

The Python function tracks how many times it has been called and uses that to update a progress bar using the tqdm Python library.
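
To illustrate the pattern (a minimal sketch, not sqlite-transform’s actual implementation), here is the equivalent done directly with Python’s sqlite3 module and tqdm, reusing the hypothetical sometable/total/total_int names from the example above:

import sqlite3

from tqdm import tqdm

db = sqlite3.connect("project-pelican.db")

# Make sure the output column exists (ignore the error if it already does)
try:
    db.execute("alter table sometable add column total_int integer")
except sqlite3.OperationalError:
    pass

# Count the rows up front so the progress bar knows its total
total_rows = db.execute("select count(*) from sometable").fetchone()[0]
progress = tqdm(total=total_rows, desc="Transforming")

def transform(value):
    # Placeholder conversion logic; sqlite-transform compiles whatever
    # you pass to --code instead
    progress.update(1)
    if value is None:
        return None
    cleaned = str(value).replace(",", "").replace("<", "")
    return int(cleaned) if cleaned.isdigit() else None

# Register the Python function so SQL can call it as transform()
db.create_function("transform", 1, transform)
db.execute("update sometable set total_int = transform(total)")
db.commit()
progress.close()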

sqlite-transform 1.1 added a --silent option to disable that progress bar, because when I ran it in GitHub Actions the progress bar output turned into too many separate lines of text.

conditional-get 0.3

The Project Pelican Datasette instance is built by GitHub Actions, which starts by downloading a 700MB CSV file.

I decided to use the GitHub Actions cache in conjunction with a conditional HTTP GET request to avoid downloading the CSV file if it hadn’t changed since the last time the action ran.

My conditional-get tool can help here: it downloads files and stores their ETag in a JSON file, then on subsequent requests it uses that ETag to perform a conditional GET, skipping the download if the content at the target URL has not changed since last time.
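
The underlying mechanism is standard HTTP: save the ETag that comes back with a download, send it in an If-None-Match header on the next request, and skip writing the file if the server replies 304 Not Modified. Here’s a rough sketch of that idea using httpx; the JSON file layout and function name are my own illustration, not conditional-get’s internals:

import json
import pathlib

import httpx

ETAGS_PATH = pathlib.Path("etags.json")  # hypothetical location for the ETag cache

def conditional_download(url, output_path, key=None):
    # Download url to output_path unless the server says it is unchanged
    key = key or url
    etags = json.loads(ETAGS_PATH.read_text()) if ETAGS_PATH.exists() else {}
    headers = {}
    if key in etags:
        headers["If-None-Match"] = etags[key]
    response = httpx.get(url, headers=headers, follow_redirects=True)
    if response.status_code == 304:
        return False  # Not modified: keep the previously downloaded copy
    response.raise_for_status()
    pathlib.Path(output_path).write_bytes(response.content)
    if "etag" in response.headers:
        etags[key] = response.headers["etag"]
        ETAGS_PATH.write_text(json.dumps(etags, indent=2))
    return True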

One catch though: the CSV files for Project Pelican live in Google Cloud Storage and are protected by a signed URL. The JSON file used the full URL as the key to look up an ETag, so if the URL changed (due to a fresh signature) the tool would not be able to find the correct ETag.

conditional-get 0.3 fixes that with a new --key option for specifying a custom key within the JSON file under which the ETag is stored. So now my GitHub Actions workflow looks, in part, like this:

name: Build and deploy database

on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
    - name: Check out repo
      uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - uses: actions/cache@v2
      name: Configure pip caching
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
        restore-keys: |
          ${{ runner.os }}-pip-
    - uses: actions/cache@v2
      name: Configure HTTP download caching
      with:
        path: ~/data
        key: ${{ runner.os }}-http-download-cache
    - name: Install Python dependencies
      run: |
        pip install -r requirements.txt
    - name: Download CSVs
      env:
        JWT: ${{ secrets.JWT }}
      run: |
        DOWNLOAD_DIR=~/data ./download.sh

Where requirements.txt contains conditional-get and download.sh looks like this:

#!/bin/bash
set -euf -o pipefail

download_dir="${DOWNLOAD_DIR:-.}"
mkdir -p $download_dir
filename=states.csv
signed_url=$(curl -s 'https://api.example.com/graphql' \
-H "authorization: JWT $JWT" \
-H 'content-type: application/json' \
--data-raw $'{"operationName":"CreateFileDownloadURI","variables":{"input":{"fileName":"'$filename$'","projectId":"123"}},"query":"mutation CreateFileDownloadURI($input: FileURIInput\u0021) {\\n  createFileDownloadUri(input: $input) {\\n    ok {\\n      name\\n      uri\\n      __typename\\n    }\\n    err\\n    __typename\\n  }\\n}\\n"}' \
--compressed | jq -r .data.createFileDownloadUri.ok.uri)
conditional-get --etags $download_dir/etags.json \
  --key $filename -v $signed_url -o $download_dir/$filename

I’m making a GraphQL call here with curl to retrieve the signed URL, then passing that signed URL to the following command to perform the conditional GET request:

conditional-get --etags $download_dir/etags.json \
  --key $filename -v $signed_url -o $download_dir/$filename

datasette-graphql 1.5

Unrelated to Project Pelican (despite that project making calls to a separate GraphQL API): datasette-graphql 1.5 finally closed an issue that had been open for a while: switching internal API calls over to using datasette.client.

Datasette exposes two key documented APIs: there’s the HTTP/JSON API that can be used to query and introspect Datasette over the network, and there’s the internal Python API, which allows plugins to interact with the rest of Datasette.

Last year I realized that there are often times when a plugin might wish to access that same JSON API that is exposed by Datasette to the public.

Datasette uses ASGI internally, and the HTTPX Python library has a mechanism for executing requests directly against a Python ASGI interface without using the network... so I combined the two and introduced the datasette.client interface for making calls from within Datasette to its own HTTP API.
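
For example, a plugin can fetch rows from a table without making any real network requests. This is a minimal sketch using the documented datasette.client.get() method; the database and table names are hypothetical:

# A rough sketch: an async helper a plugin could call to read rows from
# Datasette's own table JSON API in-process. Database and table names
# here are hypothetical.
async def rows_for_table(datasette, database="mydb", table="sometable"):
    response = await datasette.client.get(
        f"/{database}/{table}.json?_shape=array&_size=10"
    )
    response.raise_for_status()  # datasette.client returns httpx responses
    return response.json()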

My dogsheep-beta search plugin was the first to take advantage of this: it uses that mechanism to make internal API calls to calculate facet counts without re-implementing faceting.

datasette-graphql exposes a GraphQL API over Datasette tables, including the ability to apply filters (rows where age > 25 etc). It does this by translating GraphQL field requests into calls to the Datasette table JSON API.

This used to work with some inelegant hijacking of Datasette’s undocumented internal view classes, which made me nervous because it meant that the GraphQL plugin would break any time I modified Datasette core.

In 1.5 I switched it over to using the datasette.client mechanism instead, which should be a much more stable target since changes to Datasette’s JSON API are made in a much more controlled fashion.

In writing the tests for this I ran into the race condition mentioned above—so releasing this plugin ended up being the trigger for pushing out the Datasette 0.58.1 release with that fix.

Releases this week

TIL this week