Weeknotes: sqlite-transform 1.1, Datasette 0.58.1, datasette-graphql 1.5
21st July 2021
Work on Project Pelican inspires new features and improvements across a number of different projects.
Project Pelican is an upcoming collaborative data release I’m working on with an embargo—hence the code name. It involves around 700MB of CSV data converted to SQLite, which has lead to some small but notable improvements in several of my tools.
Datasette 0.58 and 0.58.1
I published annotated release notes for Datasette 0.58 last week. In addition to several plugin hooks improvement the big improvement was a 10x performance improvement to Datasette facets, which was discovered during work on Project Pelican. Conversations about this on the SQLite Forum lead to a new optimization by D. Richard Hipp landing in SQLite trunk!
I release Datasette 0.58.1 shortly afterwards with a fix for a rare intermittent race condition that I introduced back in January.
sqlite-transform 1.0 and 1.1
The data I am working with for Project Pelican has a confusing characteristic: some of the columns contain integers almost consistently, but have a few rows that contain text: a -
character, or a number with commas like 1,232
or a string such as >40
).
I decided to address this by adding more columns—the original column with a _int
suffix of type integer
containing just the number, if one is available. So if the column was called total
there would be a total_int
integer column reflecting its value.
My sqlite-transform tool could almost do this: it has a lambda
command that lets you execute Python code against every value in a column to convert it in some way.
In sqlite-transform 1.0 I added the ability to write that converted value to a different output column—so now I can do this:
sqlite-transform lambda project-pelican.db \
sometable total \
--code '
if (
value.isdigit() or
value.replace(",", "").isdigit()
or value.startswith("<"):
return int(value.replace(",", "").replace("<", ""))
' --output total_int --output-type integer
This command-line script compiles the Python code in --code
once and then executes it against every value in the total
column. The return value is written to the newly created total_int
column.
Under the hood this works by defining a custom SQLite function called transform()
using the compiled Python code, then executing the SQL equivalent of the following:
update sometable set total_int = transform(total)
The Python function tracks how many times it has been called and uses that to update a progress bar using the tqdm Python library.
sqlite-transform 1.1 added a --silent
option to disable that progress bar, because when I ran it in GitHub Actions the progress bar output turned into too many separate lines of text.
conditional-get 0.3
The Project Pelican Datasette instance is built by GitHub Actions, which starts by downloading a 700MB CSV file.
I decided to use the GitHub Actions cache in conjunction with a conditional HTTP GET request to avoid downloading the CSV file if it hadn’t changed since the last time the action ran.
My conditional-get tool can help here: it downloads files but stores their ETag in a JSON file, then on subsequent requests it uses that ETag to perform a conditional GET and avoids downloading the content of the target URL if it has not changed since last time.
One catch though: the CSV files for Project Pelican live in Google Cloud Storage and are protected by a signed URL. The JSON file used the full URL as the key to lookup an ETag, so if the URL has changed (due to a fresh signature) the tool would not be able to find the correct ETag.
conditional-get 0.3 fixes that with a new --key
option that can be used to specify a custom key within the JSON file to store the ETag. So now my GitHub Action script partially looks like this:
name: Build and deploy database
on:
push:
branches:
- main
workflow_dispatch:
jobs:
build_and_deploy:
runs-on: ubuntu-latest
steps:
- name: Check out repo
uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.9
- uses: actions/cache@v2
name: Configure pip caching
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
restore-keys: |
${{ runner.os }}-pip-
- uses: actions/cache@v2
name: Configure HTTP download caching
with:
path: ~/data
key: ${{ runner.os }}-http-download-cache
- name: Install Python dependencies
run: |
pip install -r requirements.txt
- name: Download CSVs
env:
JWT: ${{ secrets.JWT }}
run: |
DOWNLOAD_DIR=~/data ./download.sh
Where requirements.txt
contains conditional-get
and download.sh
looks like this:
#!/bin/bash
set -euf -o pipefail
download_dir="${DOWNLOAD_DIR:-.}"
mkdir -p $download_dir
filename=states.csv
signed_url=$(curl -s 'https://api.example.com/graphql' \
-H "authorization: JWT $JWT" \
-H 'content-type: application/json' \
--data-raw $'{"operationName":"CreateFileDownloadURI","variables":{"input":{"fileName":"'$filename$'","projectId":"123"}},"query":"mutation CreateFileDownloadURI($input: FileURIInput\u0021) {\\n createFileDownloadUri(input: $input) {\\n ok {\\n name\\n uri\\n __typename\\n }\\n err\\n __typename\\n }\\n}\\n"}' \
--compressed | jq -r .data.createFileDownloadUri.ok.uri)
conditional-get --etags $download_dir/etags.json \
--key $filename -v $signed_url -o $download_dir/$filename
I’m making a GraphQL call here from curl to retrieve the signed URL, then passing that signed URL to the following to perform the conditional GET request:
conditional-get --etags $download_dir/etags.json \
--key $filename -v $signed_url -o $download_dir/$filename
datasette-graphql 1.5
Unrelated to Project Pelican (despite that project making calls to a separate GraphQL API): datasette-graphql 1.5 finally closed an issue that had been open for a while: switching internal API calls over to using datasette.client
.
Datasette exposes two key documented APIs: there’s the HTTP/JSON API that can be used to query and introspect Datasette over the network, and there’s the internal Python API which allows plugins to interact with the rest of Datasette.
Last year I realized that there are often times when a plugin might wish to access that same JSON API that is exposed by Datasette to the public.
Datasette uses ASGI internally, and the HTTPX Python library has a mechanism for executing requests directly against a Python ASGI interface without using the network... so I combined the two and introduced the datasette.client interface for making calls from within Datasette to its own HTTP API.
My dogsheep-beta search plugin was the first to take advantage of this: it uses that mechanism to make internal API calls to calculate facet counts without re-implementing faceting.
datasette-graphql
exposes a GraphQL API over Datasette tables, including the ability to apply filters (rows where age > 25 etc). It does this by translating GraphQL field requests into calls to the Datasette table JSON API.
This used to work with some inelegant hijacking of Datasette’s undocumented internal view classes, which made me nervous because it meant that the GraphQL plugin would break any time I modified Datasette core.
In 1.5 I switched it over to using the datasette.client
mechanism instead, which should be a much more stable target since changes to Datasette’s JSON API are made in a much more controlled fashion.
In writing the tests for this I ran into the race condition mentioned above—so releasing this plugin ended up being the trigger for pushing out the Datasette 0.58.1 release with that fix.
Releases this week
-
sqlite-transform: 1.1—(8 releases total)—2021-07-21
Tool for running transformations on columns in a SQLite database -
datasette-export-notebook: 1.0—(5 releases total)—2021-07-17
Datasette plugin providing instructions for exporting data to Jupyter or Observable -
datasette-graphql: 1.5—(31 releases total)—2021-07-16
Datasette plugin providing an automatic GraphQL API for your SQLite databases -
datasette: 0.58.1—(94 releases total)—2021-07-16
An open source multi-tool for exploring and publishing data -
conditional-get: 0.3—(4 releases total)—2021-07-14
CLI tool for fetching data using HTTP conditional get
TIL this week
More recent articles
- A selfish personal argument for releasing code as Open Source - 24th January 2025
- Anthropic's new Citations API - 24th January 2025
- Six short video demos of LLM and Datasette projects - 22nd January 2025