Simon Willison’s Weblog

On search, management, github, markdown, careers, ...

 

Recent entries

Refactoring databases with sqlite-utils extract three hours ago

Yesterday I described the new sqlite-utils transform mechanism for applying SQLite table transformations that go beyond those supported by ALTER TABLE. The other new feature in sqlite-utils 2.20 builds on that capability to allow you to refactor a database table by extracting columns into separate tables. I’ve called it sqlite-utils extract.

The problem

Much of the data I work with in Datasette starts off as a CSV file published by an organization or government. Since CSV files aren’t relational databases, they are often denormalized. It’s particularly common to see something like this:

Organization Group Code | Organization Group | Department Code | Department | Union Code | Union
1 | Public Protection | POL | Police | 911 | POA
4 | Community Health | DPH | Public Health | 250 | SEIU, Local 1021, Misc
1 | Public Protection | FIR | Fire Department | 798 | Firefighters,Local 798, Unit 1
1 | Public Protection | POL | Police | 911 | POA

This is an extract from the San Francisco Employee Compensation dataset from DataSF.

The sqlite-utils extract command-line tool, and the table.extract() Python method that underlies it, can be used to extract these duplicated column pairs out into separate tables with foreign key relationships from the main table.

How to refactor that data

Here’s how to use sqlite-utils to clean up and refactor that compensation data.

First, grab the data. It’s a 150MB CSV file containing over 600,000 rows:

curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'

Use sqlite-utils insert to load that into a SQLite database:

sqlite-utils insert salaries.db salaries salaries.csv --csv

Fire up Datasette to check that the data looks right:

datasette salaries.db

There’s a catch here: the schema for the generated table (shown at the bottom of http://localhost:8001/salaries/salaries) reveals that, because we imported from CSV, every column is a text column. Since some of this data is numeric we should convert those columns to numbers, so we can do things like sort the table by the highest salary.

We can do that using sqlite-utils transform:

sqlite-utils transform salaries.db salaries \
  --type 'Employee Identifier' integer \
  --type Salaries float \
  --type Overtime float \
  --type 'Other Salaries' float \
  --type 'Total Salary' float \
  --type 'Retirement' float \
  --type 'Health and Dental' float \
  --type 'Other Benefits' float \
  --type 'Total Benefits' float \
  --type 'Total Compensation' float

Having run that command, here’s the new database schema:

$ sqlite3 salaries.db '.schema salaries'
CREATE TABLE IF NOT EXISTS "salaries" (
   [rowid] INTEGER PRIMARY KEY,
   [Year Type] TEXT,
   [Year] TEXT,
   [Organization Group Code] TEXT,
   [Organization Group] TEXT,
   [Department Code] TEXT,
   [Department] TEXT,
   [Union Code] TEXT,
   [Union] TEXT,
   [Job Family Code] TEXT,
   [Job Family] TEXT,
   [Job Code] TEXT,
   [Job] TEXT,
   [Employee Identifier] INTEGER,
   [Salaries] FLOAT,
   [Overtime] FLOAT,
   [Other Salaries] FLOAT,
   [Total Salary] FLOAT,
   [Retirement] FLOAT,
   [Health and Dental] FLOAT,
   [Other Benefits] FLOAT,
   [Total Benefits] FLOAT,
   [Total Compensation] FLOAT
);
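
The same type conversion can be done from Python using the table.transform() method described in yesterday’s post. Here’s a minimal sketch, assuming the salaries.db file created above:

from sqlite_utils import Database

db = Database("salaries.db")
# Convert the text columns produced by the CSV import into numeric types
db["salaries"].transform(
    types={
        "Employee Identifier": int,
        "Salaries": float,
        "Overtime": float,
        "Other Salaries": float,
        "Total Salary": float,
        "Retirement": float,
        "Health and Dental": float,
        "Other Benefits": float,
        "Total Benefits": float,
        "Total Compensation": float,
    }
)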

Now we can start extracting those columns. We do this using several rounds of the sqlite-utils extract command, one for each duplicated pair of columns.

For Organization Group Code and Organization Group:

sqlite-utils extract salaries.db salaries \
   'Organization Group Code' 'Organization Group' \
  --table 'organization_groups' \
  --fk-column 'organization_group_id' \
  --rename 'Organization Group Code' code \
  --rename 'Organization Group' name

This takes about 12 minutes on my laptop, and displays a progress bar as it runs. (Open issue for making that faster).
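
If you are working in Python, the equivalent call to the underlying table.extract() method should look something like this (a sketch based on the CLI options above; check the sqlite-utils documentation for the exact signature):

from sqlite_utils import Database

db = Database("salaries.db")
# Pull the two organization group columns out into a lookup table,
# leaving behind an organization_group_id foreign key column
db["salaries"].extract(
    ["Organization Group Code", "Organization Group"],
    table="organization_groups",
    fk_column="organization_group_id",
    rename={
        "Organization Group Code": "code",
        "Organization Group": "name",
    },
)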

Here’s the refactored database schema:

$ sqlite3 salaries.db .schema
CREATE TABLE [organization_groups] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE IF NOT EXISTS "salaries" (
   [rowid] INTEGER PRIMARY KEY,
   [Year Type] TEXT,
   [Year] TEXT,
   [organization_group_id] INTEGER,
   [Department Code] TEXT,
   [Department] TEXT,
   [Union Code] TEXT,
   [Union] TEXT,
   [Job Family Code] TEXT,
   [Job Family] TEXT,
   [Job Code] TEXT,
   [Job] TEXT,
   [Employee Identifier] INTEGER,
   [Salaries] FLOAT,
   [Overtime] FLOAT,
   [Other Salaries] FLOAT,
   [Total Salary] FLOAT,
   [Retirement] FLOAT,
   [Health and Dental] FLOAT,
   [Other Benefits] FLOAT,
   [Total Benefits] FLOAT,
   [Total Compensation] FLOAT,
   FOREIGN KEY(organization_group_id) REFERENCES organization_groups(id)
);
CREATE UNIQUE INDEX [idx_organization_groups_code_name]
    ON [organization_groups] ([code], [name]);

Now fire up Datasette to confirm it had the desired effect:

datasette salaries.db

Here’s what that looks like:

Screenshot of the first few columns of the table, showing links displayed in the new organization_group_id column

Note that the new organization_group_id column still shows the name of the organization group, because Datasette automatically de-references foreign key relationships when it displays a table and uses any column called name (or title or value) as the label for a link to the record.

Let’s extract the other columns. This will take a while:

sqlite-utils extract salaries.db salaries \
   'Department Code' 'Department' \
  --table 'departments' \
  --fk-column 'department_id' \
  --rename 'Department Code' code \
  --rename 'Department' name

sqlite-utils extract salaries.db salaries \
   'Union Code' 'Union' \
  --table 'unions' \
  --fk-column 'union_id' \
  --rename 'Union Code' code \
  --rename 'Union' name

sqlite-utils extract salaries.db salaries \
   'Job Family Code' 'Job Family' \
  --table 'job_families' \
  --fk-column 'job_family_id' \
  --rename 'Job Family Code' code \
  --rename 'Job Family' name

sqlite-utils extract salaries.db salaries \
   'Job Code' 'Job' \
  --table 'jobs' \
  --fk-column 'job_id' \
  --rename 'Job Code' code \
  --rename 'Job' name

Our finished schema looks like this:

$ sqlite3 salaries.db .schema
CREATE TABLE [organization_groups] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [departments] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [unions] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [job_families] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [jobs] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE IF NOT EXISTS "salaries" (
   [rowid] INTEGER PRIMARY KEY,
   [Year Type] TEXT,
   [Year] TEXT,
   [organization_group_id] INTEGER REFERENCES [organization_groups]([id]),
   [department_id] INTEGER REFERENCES [departments]([id]),
   [union_id] INTEGER REFERENCES [unions]([id]),
   [job_family_id] INTEGER REFERENCES [job_families]([id]),
   [job_id] INTEGER,
   [Employee Identifier] INTEGER,
   [Salaries] FLOAT,
   [Overtime] FLOAT,
   [Other Salaries] FLOAT,
   [Total Salary] FLOAT,
   [Retirement] FLOAT,
   [Health and Dental] FLOAT,
   [Other Benefits] FLOAT,
   [Total Benefits] FLOAT,
   [Total Compensation] FLOAT,
   FOREIGN KEY(job_id) REFERENCES jobs(id)
);
CREATE UNIQUE INDEX [idx_organization_groups_code_name]
    ON [organization_groups] ([code], [name]);
CREATE UNIQUE INDEX [idx_departments_code_name]
    ON [departments] ([code], [name]);
CREATE UNIQUE INDEX [idx_unions_code_name]
    ON [unions] ([code], [name]);
CREATE UNIQUE INDEX [idx_job_families_code_name]
    ON [job_families] ([code], [name]);
CREATE UNIQUE INDEX [idx_jobs_code_name]
    ON [jobs] ([code], [name]);

We’ve also shrunk our database file quite a bit. Before the transformations salaries.db was 159MB. It’s now just 70MB—that’s less than half the size!

I used datasette publish cloudrun to publish a copy of my final database here:

https://sf-employee-compensation.datasettes.com/salaries/salaries

Here’s the command I used to publish it:

datasette publish cloudrun salaries.db \
  --service sf-employee-compensation \
  --title "San Francisco Employee Compensation (as-of 21 Sep 2020)" \
  --source "DataSF" \
  --source_url "https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd" \
  --about "About this project" \
  --about_url "https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/" \
  --install datasette-block-robots \
  --install datasette-vega \
  --install datasette-copyable \
  --install datasette-graphql

Bonus: explore salaries with GraphQL

You may have noticed that the datasette publish command above finished with --install datasette-graphql. This installs the datasette-graphql plugin as part of the deployment to Cloud Run, which means we can query the salary data using GraphQL as an alternative to SQL!

Here’s a GraphQL query that shows the ten highest paid employees, including their various expanded foreign key references:

{
  salaries(sort_desc: Total_Compensation, first: 10) {
    nodes {
      Year_Type
      Year
      union_id {
        id
        name
      }
      job_id {
        id
        name
      }
      job_family_id {
        id
        name
      }
      department_id {
        id
        name
      }
      organization_group_id {
        id
        name
      }
      Salaries
      Overtime
      Other_Salaries
      Total_Salary
      Retirement
      Health_and_Dental
      Other_Benefits
      Total_Benefits
      Total_Compensation
      rowid
      Employee_Identifier
    }
  }
}

You can try that query out here in the GraphiQL API explorer.
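
Since datasette-graphql serves a standard GraphQL-over-HTTP endpoint, you can also run queries like that from code. Here’s a sketch using the Python requests library; it assumes the instance exposes a per-database endpoint at /graphql/salaries and accepts a JSON POST body with a "query" key, which is the usual GraphQL convention:

import requests

query = """
{
  salaries(sort_desc: Total_Compensation, first: 10) {
    nodes {
      Year
      Total_Compensation
      job_id { name }
      department_id { name }
    }
  }
}
"""

response = requests.post(
    "https://sf-employee-compensation.datasettes.com/graphql/salaries",
    json={"query": query},
)
# Results come back nested under data.salaries.nodes
for node in response.json()["data"]["salaries"]["nodes"]:
    print(node["Total_Compensation"], node["job_id"]["name"], node["department_id"]["name"])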

Executing advanced ALTER TABLE operations in SQLite 18 hours ago

SQLite’s ALTER TABLE has some significant limitations: it can’t drop columns, it can’t alter NOT NULL status, it can’t change column types. Since I spend a lot of time with SQLite these days I’ve written some code to fix this—both from Python and as a command-line utility.

To SQLite’s credit, not only are these limitations well explained in the documentation but the explanation is accompanied by a detailed description of the recommended workaround. The short version looks something like this (there’s a code sketch of these steps right after the list):

  • Start a transaction
  • Create a new temporary table with the exact shape you would like
  • Copy all of your old data across using INSERT INTO temp_table SELECT * FROM old_table
  • Drop the old table
  • Rename the temp table to the old table
  • Commit the transaction
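
Here are those steps sketched using Python’s built-in sqlite3 module. The items table and its columns are made up for illustration; the point is the shape of the pattern:

import sqlite3

conn = sqlite3.connect("example.db")

# "with conn" wraps the block in a transaction: commit on success, rollback on error
with conn:
    # Create a new table with the exact shape we want (here, dropping a "status" column)
    conn.execute("CREATE TABLE items_new (id INTEGER PRIMARY KEY, name TEXT)")
    # Copy the data we want to keep across from the old table
    conn.execute("INSERT INTO items_new (id, name) SELECT id, name FROM items")
    # Drop the old table
    conn.execute("DROP TABLE items")
    # Rename the new table to take its place
    conn.execute("ALTER TABLE items_new RENAME TO items")

conn.close()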

My sqlite-utils tool and Python library aim to make working with SQLite as convenient as possible. So I set out to build a utility method for performing this kind of large scale table transformation. I’ve called it table.transform(...).

Here are some simple examples of what it can do, lifted from the documentation:

# Convert the 'age' column to an integer, and 'weight' to a float
table.transform(types={"age": int, "weight": float})

# Rename the 'age' column to 'initial_age':
table.transform(rename={"age": "initial_age"})

# Drop the 'age' column:
table.transform(drop={"age"})

# Make `user_id` the new primary key
table.transform(pk="user_id")

# Make the 'age' and 'weight' columns NOT NULL
table.transform(not_null={"age", "weight"})

# Make age allow NULL and switch weight to being NOT NULL:
table.transform(not_null={"age": False, "weight": True})

# Set default age to 1:
table.transform(defaults={"age": 1})

# Now remove the default from that column:
table.transform(defaults={"age": None})

Each time the table.transform(...) method runs it will create a brand new table, copy the data across and then drop the old table. You can combine multiple operations together in a single call, avoiding copying the table multiple times.

The table.transform_sql(...) method returns the SQL that would be executed instead of executing it directly, useful if you want to handle even more complex requirements.
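
For example, something like this should print the statements for a combined retype-and-rename operation without touching the table. It’s a sketch (the example.db file, example_table and weight_kg names are just for illustration); as I read the documentation, transform_sql() takes the same arguments as transform() and returns a list of SQL strings:

from sqlite_utils import Database

table = Database("example.db")["example_table"]  # any sqlite-utils table object
# Preview the SQL for a combined operation without executing anything
for statement in table.transform_sql(
    types={"age": int},
    rename={"weight": "weight_kg"},  # illustrative rename
):
    print(statement)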

The “sqlite-utils transform” command-line tool

Almost every feature in sqlite-utils is available in both the Python library and as a command-line utility, and .transform() is no exception. The sqlite-utils transform command can be used to apply complex table transformations directly from the command-line.

Here’s an example, starting with the fixtures.db database that powers Datasette’s unit tests:

$ wget https://latest.datasette.io/fixtures.db
$ sqlite3 fixtures.db '.schema facetable'
CREATE TABLE facetable (
    pk integer primary key,
    created text,
    planet_int integer,
    on_earth integer,
    state text,
    city_id integer,
    neighborhood text,
    tags text,
    complex_array text,
    distinct_some_null,
    FOREIGN KEY ("city_id") REFERENCES [facet_cities](id)
);
$ sqlite-utils transform fixtures.db facetable \
  --type on_earth text \
  --drop complex_array \
  --drop state \
  --rename tags the_tags
$ sqlite3 fixtures.db '.schema facetable'       
CREATE TABLE IF NOT EXISTS "facetable" (
   [pk] INTEGER PRIMARY KEY,
   [created] TEXT,
   [planet_int] INTEGER,
   [on_earth] TEXT,
   [city_id] INTEGER REFERENCES [facet_cities]([id]),
   [neighborhood] TEXT,
   [the_tags] TEXT,
   [distinct_some_null] TEXT
);

You can use the --sql option to see the SQL that would be executed without actually running it:

$ wget https://latest.datasette.io/fixtures.db
$ sqlite-utils transform fixtures.db facetable \
  --type on_earth text \
  --drop complex_array \
  --drop state \
  --rename tags the_tags \
  --sql
CREATE TABLE [facetable_new_442f07e26eef] (
   [pk] INTEGER PRIMARY KEY,
   [created] TEXT,
   [planet_int] INTEGER,
   [on_earth] TEXT,
   [city_id] INTEGER REFERENCES [facet_cities]([id]),
   [neighborhood] TEXT,
   [the_tags] TEXT,
   [distinct_some_null] TEXT
);
INSERT INTO [facetable_new_442f07e26eef] ([pk], [created], [planet_int], [on_earth], [city_id], [neighborhood], [the_tags], [distinct_some_null])
   SELECT [pk], [created], [planet_int], [on_earth], [city_id], [neighborhood], [tags], [distinct_some_null] FROM [facetable];
DROP TABLE [facetable];
ALTER TABLE [facetable_new_442f07e26eef] RENAME TO [facetable];

Plenty more tricks

sqlite-utils has plenty more tricks up its sleeve. I suggest spending some time browsing the Python library reference and the sqlite-utils CLI documentation, or taking a look through the release notes.

Weeknotes: datasette-seaborn, fivethirtyeight-polls four days ago

This week I released Datasette 0.49 and tinkered with datasette-seaborn, dogsheep-beta and polling data from FiveThirtyEight.

datasette-seaborn

Datasette currently has two key plugins for data visualization: datasette-vega for line, bar and scatter charts (powered by Vega-Lite) and datasette-cluster-map for creating clustered marker maps.

I’m always looking for other interesting visualization opportunities. Seaborn 0.11 came out last week and is a very exciting piece of software.

Seaborn focuses on statistical visualization tools—histograms, boxplots and the like—and represents 8 years of development led by Michael Waskom. It’s built on top of Matplotlib and exhibits meticulously good taste.

So I’ve started building datasette-seaborn, a plugin which provides an HTTP interface for generating Seaborn charts using data stored in Datasette.

It’s still very alpha but early results are promising. I’ve chosen to implement it as a custom output renderer—so adding .seaborn to any Datasette table (plus some querystring parameters) will output a rendered PNG chart of that data, just as .json or .csv give you that data in different export formats.

Here’s an example chart generated by the plugin using the fabulous palmerpenguins dataset (intended as a new alternative to Iris).

I generated this image from the following URL:

https://datasette-seaborn-demo.datasette.io/penguins/penguins.seaborn?_size=1000&_seaborn=kdeplot&_seaborn_x=flipper_length_mm&_seaborn_hue=species&_seaborn_multiple=stack

This interface should be considered unstable and likely to change, but it illustrates the key idea behind the plugin: use ?_seaborn_x parameters to feed in options for the chart that you want to render.

The two biggest issues with the plugin right now are that it renders images on the main thread, potentially blocking the event loop, and it passes querystring arguments directly to seaborn without first validating them which is almost certainly a security problem.

So very much an alpha, but it’s a promising start!

dogsheep-beta

Dogsheep Beta provides faceted search across my personal data from a whole variety of different sources.

I demo’d it at PyCon AU a few weeks ago, and promised that a full write-up would follow. I still need to honour that promise! I’m figuring out how to provide a good interactive demo at the moment that doesn’t expose my personal data.

I added sort by date in addition to sort by relevance in version 0.7 this week.

fivethirtyeight-polls

FiveThirtyEight have long published the data behind their stories in their fivethirtyeight/data GitHub repository, and I’ve been using that data to power a Datasette demo ever since I first released the project.

They run an index of their data projects at data.fivethirtyeight.com, and this week I noticed that they list US election polling data there that wasn’t being picked up by my fivethirtyeight.datasettes.com site.

It turns out this is listed in the GitHub repository as a README.md file but without the actual CSV data. Instead, the README links to external CSV files with URLs like https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv

It looks to me like they’re running their own custom web application which provides the CSV data as an export format, rather than keeping that CSV data directly in their GitHub repository.

This makes sense—I imagine they run a lot of custom code to help them manage their data—but does mean that my Datasette publishing scripts weren’t catching their polling data.

Since they release their data as Creative Commons Attribution 4.0 International I decided to start archiving it on GitHub, where it would be easier to automate.

I set up simonw/fivethirtyeight-polls to do just that. It’s a classic implementation of the git-scraping pattern: it runs a workflow script four times a day which fetches their latest CSV files and commits them to a repo. This means I now have a commit history of changes they have made to their polling data!

I updated my FiveThirtyEight Datasette script to publish that data as a new polls.db database, which is now being served at fivethirtyeight.datasettes.com/polls.

And since that Datasette instance runs the datasette-graphql plugin, you can now use GraphQL to explore FiveThirtyEight’s most recent polling data at https://fivethirtyeight.datasettes.com/graphql/polls—here’s an example query.

github-to-sqlite get

github-to-sqlite lets you fetch data from the GitHub API and write it into a SQLite database.

This week I added a sub-command for hitting the API directly and getting back data on the console, inspired by the fetch subcommand in twitter-to-sqlite. This is useful for trying out new APIs, since it both takes into account your GitHub authentication credentials (from an environment variable or an auth.json file) and can handle Link header pagination.

This example fetches all of my repos, paginating across multiple API requests:

github-to-sqlite get /users/simonw/repos --paginate

You can use the --nl option to get back the results as newline-delimited JSON. This makes it easy to pipe them directly into sqlite-utils like this:

github-to-sqlite get /users/simonw/repos --paginate --nl \
  | sqlite-utils insert simonw.db repos - --nl

See Inserting JSON data in the sqlite-utils documentation for an explanation of what this is doing.

TIL this week

Releases this week

Datasette 0.49: The annotated release notes seven days ago

Datasette 0.49 is out. Some notes on what’s new.

API for writable canned queries

Writable canned queries now expose a JSON API, see JSON API for writable canned queries. (#880)

I wrote about writable canned queries when they were introduced in Datasette 0.44 back in June. They provide a mechanism for defining a canned SQL query which can make writes (inserts, updates or deletes) to the underlying SQLite database. They can be protected by Datasette authentication or you can leave them open—for example if you want unauthenticated users to be able to post comments or leave feedback messages.

The missing feature was API support. Datasette 0.49 adds that, so now you can define a canned query that writes to the database and then call it as a JSON API—either without authentication or protected by a mechanism such as that provided by the datasette-auth-tokens plugin.

This feature works with magic parameters, so you can define an API that accepts traffic and automatically logs values like the incoming user agent or IP address. Here’s a canned query defined in a metadata.yml file that logs the user agent and timestamp of each request:

databases:
  logs:
    queries:
      log:
        sql: |-
          INSERT INTO logs (
            user_agent, datetime
          ) VALUES (
            :_header_user_agent, :_now_datetime_utc
          )
        write: true

Create a SQLite database file called logs.db with the correct table like this:

$ sqlite-utils create-table logs.db logs id integer user_agent text datetime text --pk=id

Confirm the created schema with:

$ sqlite3 logs.db .schema
CREATE TABLE [logs] (
   [id] INTEGER PRIMARY KEY,
   [user_agent] TEXT,
   [datetime] TEXT
);
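
If you’d rather create that table from Python, the sqlite-utils library can do the same thing with table.create() (a quick sketch):

from sqlite_utils import Database

db = Database("logs.db")
# Equivalent of the create-table command above
db["logs"].create(
    {"id": int, "user_agent": str, "datetime": str},
    pk="id",
)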

Now start Datasette like so:

$ datasette logs.db -m metadata.yml

And visit http://127.0.0.1:8001/logs/log. You can click the “Run SQL” button there to insert a new log row, which you can then view at http://127.0.0.1:8001/logs/logs.

Next, the API. You can request a JSON response by adding ?_json=1 to the URL, so try this with curl:

$ curl -XPOST 'http://127.0.0.1:8001/logs/log?_json=1'
{"ok": true, "message": "Query executed, 1 row affected", "redirect": null}

You can also set the Accept: application/json header on your request, like so:

$ curl -XPOST 'http://127.0.0.1:8001/logs/log' -H 'Accept: application/json'
{"ok": true, "message": "Query executed, 1 row affected", "redirect": null}

Or by passing _json=1 as part of a POST submission. Let’s try that using requests:

$ ipython
In [1]: import requests
In [2]: requests.post("http://127.0.0.1:8001/logs/log", data={"_json": "1"}).json()
Out[2]: {'ok': True, 'message': 'Query executed, 1 row affected', 'redirect': None}

Path parameters for custom page templates

New mechanism for defining page templates with custom path parameters—a template file called pages/about/{slug}.html will be used to render any requests to /about/something. See Path parameters for pages. (#944)

I added custom page support in Datasette 0.41 back in May, based on the needs of my Niche Museums site. I wanted an easy way to create things like an /about page that returned content from a custom template.

Custom page templates work as a fallback for Datasette 404s. If /about fails to resolve and Datasette was provided a --template-dir on startup, Datasette will check to see if a template exists called templates/pages/about.html.

Datasette 0.49 adds support for path parameters, partially inspired by cookiecutter (which showed me that it’s OK to create files with curly braces in their name). You can now create templates with {slug} style wildcards as part of their filenames, and Datasette will route matching requests to that template.

I shipped a new release of Niche Museums today that takes advantage of that feature. I wanted neater URLs for museum pages—to shift from https://www.niche-museums.com/browse/museums/101 to https://www.niche-museums.com/101.

Here’s how it works. I have a template file called templates/pages/{id}.html. That template takes advantage of the datasette-template-sql plugin, which adds a sql() function that can be called within the template. It starts like this:

<!DOCTYPE html>
<html>
{% set rows = sql("select * from museums where id = :id", {"id": id}) %}
{% if not rows %}
    {{ raise_404("Museum not found") }}
{% endif %}
<head>
    <meta charset="utf-8">
    {% set museum = rows[0] %}
    ... rest of the template here ...

Datasette made the variable id available to the template context, having captured it from the {id}.html filename pattern that matched the incoming URL.

I then use the sql() function to execute a query, passing in id as a query parameter.

If there are no matches, I call the brand new raise_404() template function which cancels the rendering of this template and falls back to Datasette’s default 404 handling.

Otherwise, I set the museum variable to rows[0] and continue rendering the page.

I’ve basically reinvented PHP and ColdFusion in my custom 404 handler. Hooray!

A few other notable changes

register_output_renderer() render functions can now return a Response. (#953)

The register_output_renderer() plugin hook was designed before Datasette had a documented Response class. It asked your plugin to return a custom {"content_type": "...", "body": "...", "status_code": 200, "headers": {}} dictionary.

You can now return a Response instead, and I plan to remove the dictionary version before Datasette 1.0. I released new versions of datasette-ics, datasette-atom, datasette-copyable and the new datasette-yaml that use the new return format.
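
Here’s a rough sketch of what a renderer using the new return style might look like. The .hello extension and the plain-text formatting are made up for illustration, and the render callback can accept a number of other keyword arguments; see the plugin hooks documentation for the full list:

from datasette import hookimpl
from datasette.utils.asgi import Response


def render_hello(rows, columns, **kwargs):
    # Build a plain text representation of the query results
    body = "\n".join(
        ", ".join(str(row[column]) for column in columns) for row in rows
    )
    # Return a Response object instead of the old dictionary format
    return Response.text(body)


@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "hello",
        "render": render_hello,
    }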

New --upgrade option for datasette install. (#945)

I added the datasette install datasette-cluster-map command as a thin wrapper around pip install.

This means you can install new plugins without first figuring out which virtual environment your Datasette is running out of—particularly useful if you install Datasette using Homebrew.

I then realized that this could be used to upgrade Datasette itself—but only if you could run pip install -U. So now datasette install -U datasette will upgrade Datasette in place.

New datasette --pdb option. (#962)

This is useful if you are working on Datasette itself, or a Datasette plugin. Pass the --pdb option and Datasette will start an interactive Python debugger session any time it hits an exception.

datasette --get exit code now reflects the internal HTTP status code. (#947)

I’m excited about the pattern of using datasette --get for running simple soundness checks, e.g. as part of a CI suite. Now that the exit code reflects the status code for the page you can write test scripts that look like this:

# Fail if homepage returns 404 or 500
datasette . --get /

New raise_404() template function for returning 404 errors. (#964)

Demonstrated in the Niche Museums example above.

And the rest:

  • datasette publish heroku now deploys using Python 3.8.5
  • Upgraded CodeMirror to 5.57.0. (#948)
  • Upgraded code style to Black 20.8b1. (#958)
  • Fixed bug where selected facets were not correctly persisted in hidden form fields on the table page. (#963)
  • Renamed the default error template from 500.html to error.html.
  • Custom error pages are now documented, see Custom error pages. (#965)

In writing up these annotated release notes I spotted a bug with writable canned queries, which I have now fixed in Datasette 0.49.1.

Weeknotes: datasette-dump, sqlite-backup, talks 12 days ago

I spent some time this week digging into Python’s sqlite3 internals. I also gave two talks and recorded a third, due to air at PyGotham in October.

sqlite-dump and datasette-backup

I’m running an increasing number of Datasette instances with mutable database files—databases that are updated through a variety of different mechanisms. So I need to start thinking about backups.

Prior to this most of my database files had been relatively disposable: they’re built from other sources of data (often by scheduled GitHub Actions) so backups weren’t necessary since I could always rebuild them from their point of truth.

Creating a straight copy of a SQLite database file isn’t enough for robust backups, because the file may be accepting writes while you are creating the copy.

SQLite has various mechanisms for backups. There’s an online backup API and more recent SQLite versions support a VACUUM INTO command which also optimizes the backed up database.

I figured it would be useful to expose this functionality via a Datasette plugin—one that could allow automated backups to be directly fetched from Datasette over HTTPS. So I started work on datasette-backup.

For the first backup mode, I decided to take advantage of the connection.iterdump() method that’s built into Python’s sqlite3 module. This method is an iterator that outputs plain text SQL that can recreate a database. Crucially it’s a streaming-compatible mechanism—unlike VACUUM INTO, which would require me to create a temporary file the same size as the database I was backing up.
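
The basic pattern looks something like this (a sketch against an arbitrary data.db file, writing to disk; the plugin streams the same lines over HTTP instead):

import sqlite3

conn = sqlite3.connect("data.db")
# iterdump() yields SQL statements one at a time, so the dump can be
# streamed without ever building the whole thing in memory
with open("backup.sql", "w") as fp:
    for statement in conn.iterdump():
        fp.write(statement + "\n")
conn.close()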

I started experimenting with it, and ran into a big problem. I make extensive use of SQLite full-text search, but the .sql dumps generated by .iterdump() break with constraint errors if they include any FTS tables.

After a bit of digging I came across a 13 year old comment about this in the cPython source code itself!

The implementation for .iterdump() turns out to be entirely in Python, and way less complicated than I had expected. So I decided to see if I could get FTS table exports working.

In a classic case of yak shaving, I decided to create a Python library called sqlite-dump to solve this problem. And since my existing cookiecutter templates only cover Datasette Plugins or Click apps I first needed to create a new python-lib template in order to create the library I needed for my plugin.

I got it working! Install the datasette-backup plugin on any Datasette instance to get a /-/backup/name-of-database.sql URL that will produce a streaming SQL dump of any attached database.

A weird bug with SQLite FTS and triggers

While working on datasette-backup I noticed a weird issue with some of my SQLite full-text search enabled databases: they kept getting bigger. Way bigger than I would expect them to.

I eventually noticed that the licenses_fts table in my github-to-sqlite demo database had 7 rows in it, but the accompanying licenses_fts_docsize table had 9,141. I would expect it to only have 7 as well.

I was stumped as to what was going on, so I turned to the official SQLite forum. I only recently discovered how useful this is as a resource. Dan Kennedy, one of the three core SQLite maintainers, replied within an hour and gave me some useful hints. The root cause turned out to be the way SQLite triggers work: by default, SQLite runs in recursive_triggers=off mode (for backwards compatibility with older databases). This means that an INSERT OR REPLACE update to a table that is backed by full-text search may not correctly trigger the updates needed on the FTS table itself.
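
If you want to turn that on for a connection yourself it’s a one-line pragma; here’s a sketch using the standard sqlite3 module against an arbitrary database file:

import sqlite3

conn = sqlite3.connect("data.db")
# Off by default for backwards compatibility; turning it on means INSERT OR REPLACE
# statements fire the triggers that keep the _fts shadow tables in sync
conn.execute("PRAGMA recursive_triggers=on;")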

Since there doesn’t appear to be any disadvantage to running with recursive_triggers=on I’ve now set that as the default for sqlite-utils, as-of version 2.17.

I then added a sqlite-utils rebuild-fts data.db command in version 2.18 which can rebuild the FTS tables in a database and fix the _fts_docsize problem.

Talks

I presented Build your own data warehouse for personal analytics with SQLite and Datasette at PyCon AU last week. The video is here and includes my first public demo of Dogsheep Beta, my new combined search engine for personal analytics data imported using my Dogsheep family of tools. I took questions in this Google Doc, and filled out more detailed answers after the talk.

I gave a talk at PyRVA a couple of days ago called Rapid data analysis with SQLite and Datasette. Here’s the video and Google Doc for that one.

I also pre-recorded my talk for PyGotham: Datasette—an ecosystem of tools for working with Small Data. The conference is in the first week of October and I’ll be hanging out there during the talk answering questions and chatting about the project, safe from the stress of also having to present it live!

TIL this week

Releases this week

Weeknotes: airtable-export, generating screenshots in GitHub Actions, Dogsheep! 19 days ago

This week I figured out how to populate Datasette from Airtable, wrote code to generate social media preview card page screenshots using Puppeteer, and made a big breakthrough with my Dogsheep project.

airtable-export

I wrote about Rocky Beaches in my weeknotes two weeks ago. It’s a new website built by Natalie Downe that showcases great places to go rockpooling (tidepooling in American English), mixing in tide data from NOAA and species sighting data from iNaturalist.

Rocky Beaches is powered by Datasette, using a GitHub Actions workflow that builds the site’s underlying SQLite database using API calls and YAML data stored in the GitHub repository.

Natalie wanted to use Airtable to maintain the structured data for the site, rather than hand-editing a YAML file. So I built airtable-export, a command-line script for sucking down all of the data from an Airtable instance and writing it to disk as YAML or JSON.

You run it like this:

airtable-export out/ mybaseid table1 table2 --key=key

This will create a folder called out/ with a .yml file for each of the tables.

Sadly the Airtable API doesn’t yet provide a mechanism to list all of the tables in a database (a long-running feature request) so you have to list the tables yourself.

We’re now running that command as part of the Rocky Beaches build script, and committing the latest version of the YAML file back to the GitHub repo (thus gaining a full change history for that data).

Social media cards for my TILs

I really like social media cards—og:image HTML meta attributes for Facebook and twitter:image for Twitter. I wanted them for articles on my TIL website since I often share those via Twitter.

One catch: my TILs aren’t very image heavy. So I decided to generate screenshots of the pages and use those as the 2x1 social media card images.

The best way I know of to programmatically generate screenshots is to use Puppeteer, a Node.js library maintained by the Chrome DevTools team for automating a headless instance of the Chrome browser.

My first attempt was to run Puppeteer in an AWS Lambda function on Vercel. I remembered seeing an example of how to do this in the Vercel documentation a few years ago. The example isn’t there any more, but I found the original pull request that introduced it.

Since the example was MIT licensed I created my own fork at simonw/puppeteer-screenshot and updated it to work with the latest Chrome.

It’s pretty resource intensive, so I also added a secret ?key= mechanism so only my own automation code could call my instance running on Vercel.

I needed to store the generated screenshots somewhere. They’re pretty small—on the order of 60KB each—so I decided to store them in my SQLite database itself and use my datasette-media plugin (see Fun with binary data and SQLite) to serve them up.

This worked! Until it didn’t... I ran into a showstopper bug when I realized that the screenshot process relies on the page being live on the site—but when a new article is added it isn’t live yet while the build process runs, so the generated screenshot was of the 404 page.

So I reworked it to generate the screenshots inside the GitHub Action as part of the build script, using puppeteer-cli.

My generate_screenshots.py script handles this, by first shelling out to datasette --get to render the HTML for the page, then running puppeteer to generate the screenshot. Relevant code:

import subprocess
import tempfile
from pathlib import Path

# The full script defines TMP_PATH elsewhere; a temporary directory works for this excerpt
TMP_PATH = Path(tempfile.gettempdir())


def png_for_path(path):
    # Path is e.g. /til/til/python_debug-click-with-pdb.md
    page_html = str(TMP_PATH / "generate-screenshots-page.html")
    # Use datasette to generate HTML
    proc = subprocess.run(["datasette", ".", "--get", path], capture_output=True)
    open(page_html, "wb").write(proc.stdout)
    # Now use puppeteer screenshot to generate a PNG
    proc2 = subprocess.run(
        [
            "puppeteer",
            "screenshot",
            page_html,
            "--viewport",
            "800x400",
            "--full-page=false",
        ],
        capture_output=True,
    )
    png_bytes = proc2.stdout
    return png_bytes

This worked great! Except for one thing... the site is hosted on Vercel, and Vercel has a 5MB response size limit.

Every time my GitHub build script runs it downloads the previous SQLite database file, so it can avoid regenerating screenshots and HTML for pages that haven’t changed.

The addition of the binary screenshots drove the size of the SQLite database over 5MB, so the part of my script that retrieved the previous database no longer worked.

I needed a reliable way to store that 5MB (and probably eventually 10-50MB) database file in between runs of my action.

The best place to put this would be an S3 bucket, but I find the process of setting up IAM permissions for access to a new bucket so infuriating that I couldn’t bring myself to do it.

So... I created a new dedicated GitHub repository, simonw/til-db, and updated my action to store the binary file in that repo—using a force push so the repo doesn’t need to maintain unnecessary version history of the binary asset.

This is an abomination of a hack, and it made me cackle a lot. I tweeted about it and got the suggestion to try Git LFS instead, which would definitely be a more appropriate way to solve this problem.

Rendering Markdown

I write my blog entries in Markdown and transform them into HTML before I post them on my blog. Some day I’ll teach my blog to render Markdown itself, but so far I’ve got by with copying and pasting into Markdown tools.

My favourite Markdown flavour is GitHub’s, which adds a bunch of useful capabilities—most notably the ability to apply syntax highlighting. GitHub expose an API that applies their Markdown formatter and returns the resulting HTML.

I built myself a quick and scrappy tool in JavaScript that sends Markdown through their API and then applies a few DOM manipulations to clean up what comes back. It was a nice opportunity to write some modern vanilla JavaScript using fetch():

async function render(markdown) {
    return (await fetch('https://api.github.com/markdown', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({'mode': 'markdown', 'text': markdown})
    })).text();
}

const button = document.getElementsByTagName('button')[0];
// Assumes the page has a single <textarea> for the Markdown input
const input = document.getElementsByTagName('textarea')[0];
const output = document.getElementById('output');
const preview = document.getElementById('preview');

button.addEventListener('click', async function() {
    const rendered = await render(input.value);
    output.value = rendered;
    preview.innerHTML = rendered;
});

Dogsheep Beta

My most exciting project this week was getting out the first working version of Dogsheep Beta—the search engine that ties together results from my Dogsheep family of tools for personal analytics.

I’m giving a talk about this tonight at PyCon Australia: Build your own data warehouse for personal analytics with SQLite and Datasette. I’ll be writing up detailed notes in the next few days, so watch this space.

TIL this week

Releases this week

Elsewhere

19th September 2020

  • DuckDB (via) This is a really interesting, relatively new database. It’s kind of a weird hybrid between SQLite and PostgreSQL: it uses the PostgreSQL parser but models itself after SQLite in that databases are a single file and the code is designed for use as an embedded library, distributed in a single amalgamation C++ file (SQLite uses a C amalgamation). It features a “columnar-vectorized query execution engine” inspired by MonetDB (also by the DuckDB authors) and is hence designed to run analytical queries really quickly. You can install it using “pip install duckdb”—the resulting module feels similar to Python’s sqlite3, and follows roughly the same DBAPI pattern. #
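
Here’s roughly what that feels like in practice (a quick sketch; the API may have changed in newer releases):

import duckdb

conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE points (x INTEGER, y INTEGER)")
conn.execute("INSERT INTO points VALUES (1, 2), (3, 4)")
# Cursor-style fetch methods, much like Python's sqlite3 module
print(conn.execute("SELECT sum(x), avg(y) FROM points").fetchall())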

17th September 2020

  • Array programming with NumPy - the NumPy paper (via) The NumPy paper is out, published in Nature. I found this enlightening: for an academic paper it’s very understandable, and it filled in quite a few gaps in my mental model of what NumPy is and which problems it addresses, as well as its relationship to the many other tools in the scientific Python stack. #

15th September 2020

  • A manager on Strategic Response mused to myself that most of the world outside the West was effectively the Wild West with myself as the part-time dictator – he meant the statement as a compliment, but it illustrated the immense pressures upon me.

    Sophie Zhang #

  • “I Have Blood on My Hands”: A Whistleblower Says Facebook Ignored Global Political Manipulation (via) Sophie Zhang worked as the data scientist for the Facebook Site Integrity fake engagement team. She gave up her severance package in order to speak out internally about what she saw there, and someone leaked her memo to BuzzFeed News. It’s a hell of a story: she saw bots and coordinated manual accounts used to influence politics in countries all around the world, and found herself constantly making moderation decisions that had lasting political impact. “With no oversight whatsoever, I was left in a situation where I was trusted with immense influence in my spare time". This sounds like a nightmare—imagine taking on responsibility for protecting democracy in so many different places. #

10th September 2020

  • 15 rules for blogging, and my current streak (via) Matt Webb is on a 24-week streak of blogging multiple posts a week and shares his rules on how he’s doing this. These are really good rules. A rule of thumb that has helped me a lot is to fight back against the temptation to make a post as good as I can before I publish it—because that way lies a giant drafts folder and no actual published content. “Perfect is the enemy of shipped”. #

9th September 2020

  • AVIF has landed. AVIF support landed in Chrome 85 a few weeks ago. It’s a new lossy royalty-free image format derived from AV1 video and it’s really impressive—it can achieve similar results to JPEG using a quarter of the file size! Jake digs into AVIF in detail, providing lots of illustrative examples created using the Squoosh online compressor, which now supports AVIF encoding. Jake used the same WebAssembly encoder from Squoosh to decode AVIF images in a web worker so that the demos in his article would work even for browsers that don’t yet support AVIF natively. #

3rd September 2020

  • Render Markdown tool (via) I wrote a quick JavaScript tool for rendering Markdown via the GitHub Markdown API—which includes all of their clever extensions like tables and syntax highlighting—and then stripping out some extraneous HTML to give me back the format I like using for my blog posts. #

1st September 2020

  • Simply put, if you’re in a position of power at work, you’re unlikely to see workplace harassment in front of you. That’s because harassment and bullying are attempts to exert power over people with less of it. People who behave improperly don’t tend to do so with people they perceive as having power already.

    Sarah Milstein #

29th August 2020

  • airtable-export. I wrote a command-line utility for exporting data from Airtable and dumping it to disk as YAML, JSON or newline delimited JSON files. This means you can backup an Airtable database from a GitHub Action and get a commit history of changes made to your data. #