Simon Willison’s Weblog


Recent entries

Weeknotes: s3-credentials prefix and Datasette 0.60 five days ago

A new release of s3-credentials with support for restricting access to keys that start with a prefix, Datasette 0.60 and a write-up of my process for shipping a feature.

s3-credentials --prefix

s3-credentials is my tool for creating limited scope AWS credentials that can only read and write from a specific S3 bucket. I introduced it in this blog entry in November, and I’ve continued to iterate on it since then.

I released s3-credentials 0.9 today with a feature I’ve been planning since I first built the tool: the ability to specify a --prefix and get credentials that are only allowed to operate on keys within a specific folder within the S3 bucket.

This is particularly useful if you are building multi-tenant SaaS applications on top of AWS. You might decide to create a bucket per customer... but S3 limits you to 100 buckets per account by default, with a maximum of 1,000 buckets if you request an increase.

So a bucket per customer won’t scale above 1,000 customers.

The sts.assume_role() API lets you retrieve temporary credentials for S3 that can have limits attached to them—including a policy restricting access to keys within a specific bucket and under a specific prefix. That means you can create limited duration credentials that can only read and write from a specific prefix within a bucket.

Which solves the problem! Each of your customers can have a dedicated prefix within the bucket, and your application can issue restricted tokens that greatly reduce the risk of one customer accidentally seeing files that belong to another.
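Under the hood this kind of restriction comes from calling sts.assume_role() with an inline session policy. Here's a rough boto3 sketch of that idea; the role ARN, session name and policy are illustrative rather than the exact values the tool uses:

import json
import boto3

sts = boto3.client("sts")

# Session policy restricting the temporary credentials to one prefix
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": ["arn:aws:s3:::name-of-bucket/user1410/*"],
    }],
}

response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/s3-credentials-role",  # illustrative
    RoleSessionName="user1410",
    Policy=json.dumps(policy),
    DurationSeconds=900,  # 15 minutes
)
creds = response["Credentials"]
# creds["AccessKeyId"], creds["SecretAccessKey"], creds["SessionToken"]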

Here’s how to use it:

s3-credentials create name-of-bucket --prefix user1410/

This will return a JSON set of credentials—an access key and secret key—that can only be used to read and write keys in that bucket that start with user1410/.

Add --read-only to make those credentials read-only, and --write-only for credentials that can be used to write but not read records.

If you add --duration 15m the returned credentials will only be valid for 15 minutes, using sts.assume_role(). The README includes a detailed description of the changes that will be made to your AWS account by the tool.

You can also add --dry-run to see a text summary of changes without applying them to your account. Here’s an example:

% s3-credentials create name-of-bucket --prefix user1410/ --read-only --dry-run --duration 15m
Would create bucket: 'name-of-bucket'
Would ensure role: 's3-credentials.AmazonS3FullAccess'
Would assume role using following policy for 900 seconds:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "user1410/*"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket/user1410/*"
      ]
    }
  ]
}

As with all things AWS, the magic is in the details of the JSON policy document. The README includes details of exactly what those policies look like. Getting them right was by far the hardest part of building this tool!

s3-credentials integration tests

When writing automated tests, I generally avoid calling any external APIs or making any outbound network traffic. I want the tests to run in an isolated environment, with no risk that some other system that’s having a bad day could cause random test failures.

Since the hardest part of building this tool is having confidence that it does the right thing, I decided to also include a suite of integration tests that actively exercise Amazon S3.

By default, running pytest will skip these:

% pytest
================ test session starts ================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/simon/Dropbox/Development/s3-credentials
plugins: recording-0.12.0, mock-3.6.1
collected 61 items                                  

tests/test_dry_run.py ....                    [  6%]
tests/test_integration.py ssssssss            [ 19%]
tests/test_s3_credentials.py ................ [ 45%]
.................................             [100%]

=========== 53 passed, 8 skipped in 1.21s ===========

Running pytest --integration runs the test suite with those tests enabled. It expects the computer they are running on to have AWS credentials with the ability to create buckets and users—I’m too nervous to add these secrets to GitHub Actions, so I currently only run the integration suite on my own laptop.

These were invaluable for getting confident that the new --prefix option behaved as expected, especially when combined with --read-only and --write-only. Here’s the test_prefix_read_only() test which exercises the --prefix --read-only combination.
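The --integration switch itself is a standard pytest pattern: a conftest.py option that un-skips tests carrying an integration marker. A minimal sketch of that pattern (not necessarily the exact conftest.py in s3-credentials) looks like this:

# conftest.py
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--integration",
        action="store_true",
        default=False,
        help="run integration tests against real AWS",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--integration"):
        # --integration given: run everything, including integration tests
        return
    skip = pytest.mark.skip(reason="use --integration to run")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)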

s3-credentials list-bucket

One more new feature: the s3-credentials list-bucket name-of-bucket command lists all of the keys in a specific bucket.

By default it returns a JSON array, but you can add --nl to get back newline delimited JSON or --csv or --tsv to get back CSV or TSV.

So... a fun thing you can do with the command is pipe the output into sqlite-utils insert to create a SQLite database file of your bucket contents... and then use Datasette to browse it!

% s3-credentials list-bucket static.niche-museums.com --nl \
  | sqlite-utils insert s3.db keys - --nl
% datasette s3.db -o

This will create a s3.db SQLite database with a keys table containing your bucket contents, then open Datasette to let you interact with the table.

A screenshot of the keys table running in Datasette

Datasette 0.60

I shipped several months of work on Datasette a few days ago as Datasette 0.60. I published annotated release notes for that release which describe the background of those changes in detail.

I also released new versions of datasette-pretty-traces and datasette-leaflet-freedraw to take advantage of new features added to Datasette.

How I build a feature

My other big project this week was a blog post: How I build a feature, which goes into detail about the process I use for adding new features to my various projects. I’ve had some great feedback about this, so I’m tempted to write more about general software engineering process stuff here in the future.

Releases this week

TIL this week

Datasette 0.60: The annotated release notes nine days ago

I released Datasette 0.60 today. It’s a big release, incorporating 61 commits and 18 issues. Here are the annotated release notes.

filters_from_request plugin hook

The inspiration for this hook was my ongoing quest to simplify and refactor Datasette’s TableView, the most complex page in the project which provides an interface for filtering and paginating through a table of data.

The main job of that page is to convert a query string—with things like ?country_long=China and &capacity_mw__gt=200 in it—into a SQL query.
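Roughly speaking, a query string like that ends up as SQL along these lines. This is a sketch of the general shape rather than Datasette's exact generated query, using the global power plants table those example parameters come from:

select * from [global-power-plants]
where "country_long" = :p0    -- from ?country_long=China
  and "capacity_mw" > :p1     -- from &capacity_mw__gt=200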

So I extracted part of that logic out into a new plugin hook. I’ve already started using it in datasette-leaflet-freedraw to help support filtering a table by drawing on a map, demo here.

I also used the new hook to refactor Datasette itself. The filters.py module now registers where_filters(), search_filters() and through_filters() implementations against that hook, to support various core pieces of Datasette functionality.

Tracing, write API improvements and performance

I built a new plugin called datasette-pretty-traces to help with my refactoring. It takes Datasette’s existing ?_trace=1 feature, which dumps out a big blob of JSON at the bottom of the page, and turns it into something that’s a bit easier to understand.

The plugin quickly started highlighting all sorts of interesting potential improvements!

After I added tracing to write queries it became apparent that Datasette’s schema introspection code—which runs once when the server starts, and then re-runs any time it notices a change to a database schema—was painfully inefficient.

It writes information about the schema into an in-memory database, which I hope to use in the future to power features like search of all attached tables.

I ended up adding two new documented internal methods for speeding up those writes: db.execute_write_script() and db.execute_write_many(). These are now available for plugins to use as well.
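Here's a rough sketch of what those two methods look like from plugin code. The table and data are invented for illustration; both methods are awaited like Datasette's other internal database methods:

# "db" is a datasette.database.Database object, e.g. from datasette.get_database()

# execute_write_script() runs multiple SQL statements in one go:
await db.execute_write_script("""
    create table if not exists stats (key text primary key, value integer);
    create index if not exists stats_value on stats(value);
""")

# execute_write_many() applies one statement to many parameter sets:
await db.execute_write_many(
    "insert or replace into stats (key, value) values (?, ?)",
    [("tables", 12), ("views", 3), ("columns", 48)],
)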

  • The db.execute_write() internal method now defaults to blocking until the write operation has completed. Previously it defaulted to queuing the write and then continuing to run code while the write was in the queue. (#1579)

Spending time with code that wrote to the database highlighted a design flaw in Datasette’s original write method. I realized that every line of code I had written that used it looked like this:

db.execute_write("insert into ...", block=True)

The block=True parameter means “block until the write has completed”. Without it, the write goes into a queue and code continues executing whether or not the write has been made.

This was clearly the wrong default. I used GitHub code search to check if changing it would be disruptive—it would not—and made the change. I’m glad I caught this before Datasette 1.0!
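After the change, the extra parameter can simply be dropped and the same call now blocks by default:

db.execute_write("insert into ...")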

I noticed that writes to a database with SpatiaLite were failing with an error, because the SpatiaLite module was not being correctly loaded. This release fixes that.

Faceting

A bunch of different fixes for Datasette’s Faceting made it into this release:

  • The number of unique values in a facet is now always displayed. Previously it was only displayed if the user specified ?_facet_size=max. (#1556)
  • Facets of type date or array can now be configured in metadata.json, see Facets in metadata.json. Thanks, David Larlet. (#1552)
  • New ?_nosuggest=1 parameter for table views, which disables facet suggestion. (#1557)
  • Fixed bug where ?_facet_array=tags&_facet=tags would only display one of the two selected facets. (#625)

Other, smaller changes

  • The Datasette() constructor no longer requires the files= argument, and is now documented at Datasette class. (#1563)

A tiny usability improvement, mainly for tests. It means you can write a test that looks like this:

import pytest
from datasette.app import Datasette

@pytest.mark.asyncio
async def test_datasette_homepage():
    ds = Datasette()
    response = await ds.client.get("/")
    assert "<title>Datasette" in response.text

Previously the files= argument was required, so you would have to use Datasette(files=[]).

  • The query string variables exposed by request.args will now include blank strings for arguments such as foo in ?foo=&bar=1 rather than ignoring those parameters entirely. (#1551)

This came out of the refactor—this commit tells the story.

  • Upgraded Pluggy dependency to 1.0. (#1575)

I needed this because Pluggy 1.0 allows multiple implementations of the same hook to be defined within the same file, like this:

@hookimpl(specname="filters_from_request")
def where_filters(request, database, datasette):
    # ...

@hookimpl(specname="filters_from_request")
def search_filters(request, database, table, datasette):
    # ...

I really like Plausible as an analytics product. It does a great job of respecting user privacy while still producing useful numbers. It’s cookie-free, which means it doesn’t trigger a need for GDPR banners in Europe. I’m increasingly using it on all of my projects.

  • New CLI reference page showing the output of --help for each of the datasette sub-commands. This led to several small improvements to the help copy. (#1594)

I first built this for sqlite-utils and liked it so much I brought it to Datasette as well. It’s generated by cog, using this inline script in the reStructuredText.

And the rest

  • Label columns detected for foreign keys are now case-insensitive, so Name or TITLE will be detected in the same way as name or title. (#1544)
  • explain query plan is now allowed with varying amounts of whitespace in the query. (#1588)
  • Fixed bug where writable canned queries could not be used with custom templates. (#1547)
  • Improved fix for a bug where columns with an underscore prefix could result in unnecessary hidden form fields. (#1527)

How I build a feature 10 days ago

I’m maintaining a lot of different projects at the moment. I thought it would be useful to describe the process I use for adding a new feature to one of them, using the new sqlite-utils create-database command as an example.

I like each feature to be represented by what I consider to be the perfect commit—one that bundles together the implementation, the tests, the documentation and a link to an external issue thread.

The sqlite-utils create-database command is very simple: it creates a new, empty SQLite database file. You use it like this:

% sqlite-utils create-database empty.db

Everything starts with an issue

Every piece of work I do has an associated issue. This acts as ongoing work-in-progress notes and lets me record decisions, reference any research, drop in code snippets and sometimes even add screenshots and video—stuff that is really helpful but doesn’t necessarily fit in code comments or commit messages.

Even if it’s a tiny improvement that’s only a few lines of code, I’ll still open an issue for it—sometimes just a few minutes before closing it again as complete.

Any commits that I create that relate to an issue reference the issue number in their commit message. GitHub does a great job of automatically linking these together, bidirectionally so I can navigate from the commit to the issue or from the issue to the commit.

Having an issue also gives me something I can link to from my release notes.

In the case of the create-database command, I opened this issue in November when I had the idea for the feature.

I didn’t do the work until over a month later—but because I had designed the feature in the issue comments I could get started on the implementation really quickly.

Development environment

Being able to quickly spin up a development environment for a project is crucial. All of my projects have a section in the README or the documentation describing how to do this—here’s that section for sqlite-utils.

On my own laptop each project gets a directory, and I use pipenv shell in that directory to activate a directory-specific virtual environment, then pip install -e '.[test]' to install the dependencies and test dependencies.
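Concretely, getting a working environment for sqlite-utils looks something like this—a sketch of the steps described above, assuming pipenv is already installed:

cd sqlite-utils
pipenv shell
pip install -e '.[test]'
pytest  # confirm everything passes before starting work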

Automated tests

All of my features are accompanied by automated tests. This gives me the confidence to boldly make changes to the software in the future without fear of breaking any existing features.

This means that writing tests needs to be as quick and easy as possible—the less friction here the better.

The best way to make writing tests easy is to have a great testing framework in place from the very beginning of the project. My cookiecutter templates (python-lib, datasette-plugin and click-app) all configure pytest and add a tests/ folder with a single passing test, to give me something to start adding tests to.

I can’t say enough good things about pytest. Before I adopted it, writing tests was a chore. Now it’s an activity I genuinely look forward to!

I’m not a religious adherent to writing the tests first—see How to cheat at unit tests with pytest and Black for more thoughts on that—but I’ll write the test first if it’s pragmatic to do so.

In the case of create-database, writing the test first felt like the right thing to do. Here’s the test I started with:

def test_create_database(tmpdir):
    db_path = tmpdir / "test.db"
    assert not db_path.exists()
    result = CliRunner().invoke(
        cli.cli, ["create-database", str(db_path)]
    )
    assert result.exit_code == 0
    assert db_path.exists()

This test uses the tmpdir pytest fixture to provide a temporary directory that will be automatically cleaned up by pytest after the test run finishes.

It checks that the test.db file doesn’t exist yet, then uses the Click framework’s CliRunner utility to execute the create-database command. Then it checks that the command didn’t throw an error and that the file has been created.

Then I run the test, and watch it fail—because I haven’t built the feature yet!

% pytest -k test_create_database

============ test session starts ============
platform darwin -- Python 3.8.2, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/simon/Dropbox/Development/sqlite-utils
plugins: cov-2.12.1, hypothesis-6.14.5
collected 808 items / 807 deselected / 1 selected                           

tests/test_cli.py F                                                   [100%]

================= FAILURES ==================
___________ test_create_database ____________

tmpdir = local('/private/var/folders/wr/hn3206rs1yzgq3r49bz8nvnh0000gn/T/pytest-of-simon/pytest-659/test_create_database0')

    def test_create_database(tmpdir):
        db_path = tmpdir / "test.db"
        assert not db_path.exists()
        result = CliRunner().invoke(
            cli.cli, ["create-database", str(db_path)]
        )
>       assert result.exit_code == 0
E       assert 1 == 0
E        +  where 1 = <Result SystemExit(1)>.exit_code

tests/test_cli.py:2097: AssertionError
========== short test summary info ==========
FAILED tests/test_cli.py::test_create_database - assert 1 == 0
===== 1 failed, 807 deselected in 0.99s ====

The -k option lets me run any tests that match the search string, rather than running the full test suite. I use this all the time.

Other pytest features I often use:

  • pytest -x: runs the entire test suite but quits at the first test that fails
  • pytest --lf: re-runs any tests that failed during the last test run
  • pytest --pdb -x: open the Python debugger at the first failed test (omit the -x to open it at every failed test). This is the main way I interact with the Python debugger. I often use this to help write the tests, since I can add assert False and get a shell inside the test to interact with various objects and figure out how to best run assertions against them.
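Here’s roughly what that assert False trick looks like in practice—an illustrative sketch reusing the CliRunner setup from the test above, not a test I’d actually commit:

def test_scratchpad(tmpdir):
    db_path = tmpdir / "test.db"
    result = CliRunner().invoke(
        cli.cli, ["create-database", str(db_path)]
    )
    # Run with: pytest --pdb -k test_scratchpad
    # The debugger opens at the failure, with `result` and `db_path`
    # available to poke at interactively.
    assert False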

Implementing the feature

Test in place, it’s time to implement the command. I added this code to my existing cli.py module:

@cli.command(name="create-database")
@click.argument(
    "path",
    type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
    required=True,
)
def create_database(path):
    "Create a new empty database file."
    db = sqlite_utils.Database(path)
    db.vacuum()

(I happen to know that the quickest way to create an empty SQLite database file is to run VACUUM against it.)

The test now passes!

I iterated on this implementation a little bit more, to add the --enable-wal option I had designed in the issue comments—and updated the test to match. You can see the final implementation in this commit: 1d64cd2e5b402ff957f9be2d9bb490d313c73989.

If I add a new test and it passes the first time, I’m always suspicious of it. I’ll deliberately break the test (change a 1 to a 2 for example) and run it again to make sure it fails, then change it back again.

Code formatting with Black

Black has increased my productivity as a Python developer by a material amount. I used to spend a whole bunch of brain cycles agonizing over how to indent my code, where to break up long function calls and suchlike. Thanks to Black I never think about this at all—I instinctively run black . in the root of my project and accept whatever style decisions it applies for me.

Linting

I have a few linters set up to run on every commit. I can run these locally too—how to do that is documented here—but I’m often a bit lazy and leave them to run in CI.

In this case one of my linters failed! I accidentally called the new command function create_table() when it should have been called create_database(). The code worked fine due to how the cli.command(name=...) decorator works but mypy complained about the redefined function name. I fixed that in a separate commit.

Documentation

My policy these days is that if a feature isn’t documented it doesn’t exist. Updating existing documentation isn’t much work at all if the documentation already exists, and over time these incremental improvements add up to something really comprehensive.

For smaller projects I use a single README.md which gets displayed on both GitHub and PyPI (and the Datasette website too, for example on datasette.io/tools/git-history).

My larger projects, such as Datasette and sqlite-utils, use Read the Docs and reStructuredText with Sphinx instead.

I like reStructuredText mainly because it has really good support for internal reference links—something that is missing from Markdown, though it can be enabled using MyST.

sqlite-utils uses Sphinx. I have the sphinx-autobuild extension configured, which means I can run a live reloading server with the documentation like so:

cd docs
make livehtml

Any time I’m working on the documentation I have that server running, so I can hit “save” in VS Code and see a preview in my browser a few seconds later.

For Markdown documentation I use the VS Code preview pane directly.

The moment the documentation is live online, I like to add a link to it in a comment on the issue thread.

Committing the change

I run git diff a LOT while hacking on code, to make sure I haven’t accidentally changed something unrelated. This also helps spot things like rogue print() debug statements I may have added.

Before my final commit, I sometimes even run git diff | grep print to check for those.

My goal with the commit is to bundle the test, documentation and implementation. If those are the only files I’ve changed I do this:

git commit -a -m "sqlite-utils create-database command, closes #348"

If this completes the work on the issue I use "closes #N", which causes GitHub to close the issue for me. If it’s not yet ready to close I use "refs #N" instead.

Sometimes there will be unrelated changes in my working directory. If so, I use git add <files> and then commit just with git commit -m message.

Branches and pull requests

create-database is a good example of a feature that can be implemented in a single commit, with no need to work in a branch.

For larger features, I’ll work in a feature branch:

git checkout -b my-feature

I’ll make a commit (often just labelled “WIP prototype, refs #N”) and then push that to GitHub and open a pull request for it:

git push -u origin my-feature 

I ensure the new pull request links back to the issue in its description, then switch my ongoing commentary to comments on the pull request itself.

I’ll sometimes add a task checklist to the opening comment on the pull request, since tasks there get reflected in the GitHub UI anywhere that links to the PR. Then I’ll check those off as I complete them.

An example of a PR I used like this is #361: --lines and --text and --convert and --import.

I don’t like merge commits—I much prefer to keep my main branch history as linear as possible. I usually merge my PRs through the GitHub web interface using the squash feature, which results in a single, clean commit to main with the combined tests, documentation and implementation. Occasionally I will see value in keeping the individual commits, in which case I will rebase merge them.

Another goal here is to keep the main branch releasable at all times. Incomplete work should stay in a branch. This makes turning around and releasing quick bug fixes a lot less stressful!

Release notes, and a release

A feature isn’t truly finished until it’s been released to PyPI.

All of my projects are configured the same way: they use GitHub releases to trigger a GitHub Actions workflow which publishes the new release to PyPI. The sqlite-utils workflow for that is here in publish.yml.

My cookiecutter templates for new projects set up this workflow for me. I just need to create a PyPI token for the project and assign it as a repository secret. See the python-lib cookiecutter README for details.

To push out a new release, I need to increment the version number in setup.py and write the release notes.

I use semantic versioning—a new feature is a minor version bump, a breaking change is a major version bump (I try very hard to avoid these) and a bug fix or documentation-only update is a patch increment.

Since create-database was a new feature, it went out in release 3.21.

My projects that use Sphinx for documentation have changelog.rst files in their repositories. I add the release notes there, linking to the relevant issues and cross-referencing the new documentation. Then I ship a commit that bundles the release notes with the bumped version number, with a commit message that looks like this:

git commit -m "Release 3.21

Refs #348, #364, #366, #368, #371, #372, #374, #375, #376, #379"

Here’s the commit for release 3.21.

Referencing the issue numbers in the release automatically adds a note to their issue threads indicating the release that they went out in.

I generate that list of issue numbers by pasting the release notes into an Observable notebook I built for the purpose: Extract issue numbers from pasted text. Observable is really great for building this kind of tiny interactive utility.

For projects that just have a README I write the release notes in Markdown and paste them directly into the GitHub “new release” form.

I like to duplicate the release notes to GitHub releases for my Sphinx changelog projects too. This is mainly so the datasette.io website will display the release notes on its homepage, which is populated at build time using the GitHub GraphQL API.

To convert my reStructuredText to Markdown I copy and paste the rendered HTML into this brilliant Paste to Markdown tool by Euan Goddard.

A live demo

When possible, I like to have a live demo that I can link to.

This is easiest for features in Datasette core. Datasette’s main branch gets deployed automatically to latest.datasette.io so I can often link to a demo there.

For Datasette plugins, I’ll deploy a fresh instance with the plugin (e.g. this one for datasette-graphql) or (more commonly) add it to my big latest-with-plugins.datasette.io instance—which tries to demonstrate what happens to Datasette if you install dozens of plugins at once (so far it works OK).

Here’s a demo of the datasette-copyable plugin running there: https://latest-with-plugins.datasette.io/github/commits.copyable

Tell the world about it

The last step is to tell the world (beyond the people who meticulously read the release notes) about the new feature.

Depending on the size of the feature, I might do this with a tweet like this one—usually with a screenshot and a link to the documentation. I often extend this into a short Twitter thread, which gives me a chance to link to related concepts and demos or add more screenshots.

For larger or more interesting features I’ll blog about them. I may save this for my weekly weeknotes, but sometimes for particularly exciting features I’ll write up a dedicated blog entry. Some examples include:

I may even assemble a full set of annotated release notes on my blog, where I quote each item from the release in turn and provide some fleshed out examples plus background information on why I built it.

If it’s a new Datasette (or Datasette-adjacent) feature, I’ll try to remember to write about it in the next edition of the Datasette Newsletter.

Finally, if I learned a new trick while building a feature I might extract that into a TIL. If I do that I’ll link to the new TIL from the issue thread.

More examples of this pattern

Here are a bunch of examples of commits that implement this pattern, combining the tests, implementation and documentation into a single unit:

What’s new in sqlite-utils 11 days ago

sqlite-utils is my combined CLI tool and Python library for manipulating SQLite databases. Consider this the annotated release notes for sqlite-utils 3.20 and 3.21, both released in the past week.

sqlite-utils insert --convert with --lines and --text

The sqlite-utils insert command inserts rows into a SQLite database from a JSON, CSV or TSV file, creating a table with the necessary columns if one does not exist already.

It gained three new options in v3.20:

  • sqlite-utils insert ... --lines to insert the lines from a file into a table with a single line column, see Inserting unstructured data with --lines and --text.
  • sqlite-utils insert ... --text to insert the contents of the file into a table with a single text column and a single row.
  • sqlite-utils insert ... --convert allows a Python function to be provided that will be used to convert each row that is being inserted into the database. See Applying conversions while inserting data, including details on special behavior when combined with --lines and --text. (#356)

These features all evolved from an idea I had while re-reading my blog entry from last year, Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool. That blog entry introduced the sqlite-utils convert command, which can run a custom Python function against a column in a table to convert that data in some way.

Given a log file log.txt that looks something like this:

2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&page=2&tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";

I provided this example code to insert lines from a log file into a table with a single line column:

cat log.txt | \
    jq --raw-input '{line: .}' --compact-output | \
    sqlite-utils insert logs.db log - --nl

Since sqlite-utils insert requires JSON, this example first used jq to convert the lines into {"line": "..."} JSON objects.

My first idea was to improve this with the new --lines option, which lets you replace the above with this:

sqlite-utils insert logs.db log log.txt --lines

Using --lines will create a table with a single line column and import every line from the file as a row in that table.

In the article, I then demonstrated how --convert could be used to convert those imported lines into structured rows using a regular expression:

sqlite-utils convert logs.db log line --import re --multi "$(cat <<EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(value):
        pairs[key] = value1 or value2
    return pairs
EOD
)"

The new --convert option to sqlite-utils means you can now achieve the same thing using:

sqlite-utils insert logs.db log log.txt --lines \
  --import re --convert "$(cat <<EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(line):
        pairs[key] = value1 or value2
    return pairs
EOD
)"

Since the --lines option allows you to consume mostly unstructured files split by newlines, I decided to also add an option to consume an entire unstructured file as a single record. I originally called that --all but found the code got messy because it conflicted with Python’s all() built-in, so I renamed it to --text.

Used on its own, --text creates a table with a single column called text:

% sqlite-utils insert logs.db fulllog log.txt --text
% sqlite-utils schema logs.db
CREATE TABLE [fulllog] (
   [text] TEXT
);

But with --convert you can pass a snippet of Python code which can take that text value and convert it into a list of dictionaries, which will then be used to populate the table.

Here’s a fun example. The following one-liner uses the classic feedparser library to parse the Atom feed for my blog and load it into a database table:

curl 'https://simonwillison.net/atom/everything/' | \
  sqlite-utils insert feed.db entries --text --convert '
    feed = feedparser.parse(text)
    return feed.entries' - --import feedparser

The resulting database looks like this:

% sqlite-utils tables feed.db --counts -t
table      count
-------  -------
feed          30

% sqlite-utils schema feed.db
CREATE TABLE [feed] (
   [title] TEXT,
   [title_detail] TEXT,
   [links] TEXT,
   [link] TEXT,
   [published] TEXT,
   [published_parsed] TEXT,
   [updated] TEXT,
   [updated_parsed] TEXT,
   [id] TEXT,
   [guidislink] INTEGER,
   [summary] TEXT,
   [summary_detail] TEXT,
   [tags] TEXT
);

Not bad for a one-liner!

This example uses the --import option to import that feedparser library. This means you’ll need to have that library installed in the same virtual environment as sqlite-utils.

If you run into problems here (maybe due to having installed sqlite-utils via Homebrew) one way to do this is to use the following:

python3 -m pip install feedparser sqlite-utils

Then use python3 -m sqlite_utils in place of sqlite-utils—this will ensure you are running the command from the same virtual environment where you installed the library.

--convert for regular rows

The above examples combine --convert with the --lines and --text options to parse unstructured text into database tables.

But --convert works with the existing sqlite-utils insert options as well.

To review, those are the following:

  • sqlite-utils insert by default expects a JSON file that’s a list of objects, [{"id": 1, "text": "Like"}, {"id": 2, "text": "This"}].
  • sqlite-utils insert --nl accepts newline-delimited JSON, {"id": 1, "text": "Like"}\n{"id": 2, "text": "This"}.
  • sqlite-utils insert --csv and --tsv accept CSV/TSV—with --delimiter and --encoding and --quotechar and --no-headers options for customizing that import, and a --sniff option for automatically detecting those settings.

You can now use --convert to define a Python function that accepts a row dictionary representing each row from the import and modifies that dictionary or returns a fresh one with changes.

Here’s a simple example that produces just the capitalized name, the latitude and the longitude from the WRI’s global power plants CSV file:

curl https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv | \
  sqlite-utils insert plants.db plants - --csv --convert '
  return {
      "name": row["name"].upper(),
      "latitude": float(row["latitude"]),
      "longitude": float(row["longitude"]),
  }'

The resulting database looks like this:

% sqlite-utils schema plants.db
CREATE TABLE [plants] (
   [name] TEXT,
   [latitude] FLOAT,
   [longitude] FLOAT
);

~ % sqlite-utils rows plants.db plants | head -n 3
[{"name": "KAJAKI HYDROELECTRIC POWER PLANT AFGHANISTAN", "latitude": 32.322, "longitude": 65.119},
 {"name": "KANDAHAR DOG", "latitude": 31.67, "longitude": 65.795},
 {"name": "KANDAHAR JOL", "latitude": 31.623, "longitude": 65.792},

sqlite-utils bulk

  • New sqlite-utils bulk command which can import records in the same way as sqlite-utils insert (from JSON, CSV or TSV) and use them to bulk execute a parametrized SQL query. (#375)

With the addition of --lines, --text, --convert and --import the sqlite-utils insert command is now a powerful tool for turning anything into a list of Python dictionaries, which can then in turn be inserted into a SQLite database table.

Which gave me an idea... what if you could use the same mechanisms to execute SQL statements in bulk instead?

Python’s SQLite library supports named parameters in SQL queries, which look like this:

insert into plants (id, name) values (:id, :name)

Those :id and :name parameters can be populated from a Python dictionary. And the .executemany() method can efficiently apply the same SQL query to a big list (or iterator or generator) of dictionaries in one go:

import sqlite3

db = sqlite3.connect("plants.db")  # assumes the plants table already exists
cursor = db.cursor()
cursor.executemany(
    "insert into plants (id, name) values (:id, :name)",
    [{"id": 1, "name": "One"}, {"id": 2, "name": "Two"}]
)
db.commit()

So I implemented the sqlite-utils bulk command, which takes the same import options as sqlite-utils insert—but instead of creating and populating the specified table, it requires a SQL argument with a query that will be executed using the imported rows as arguments.

% sqlite-utils bulk demo.db \
  'insert into plants (id, name) values (:id, :name)' \
  plants.csv --csv

This feels like a powerful new feature, which was very simple to implement because the hard work of importing the data had already been done by the insert command.

Running ANALYZE

This idea came from Forest Gregg, who initially suggested running ANALYZE automatically as part of the sqlite-utils create-index command.

I have to confess: in all of my years of using SQLite, I’d never actually explored the ANALYZE command.

When run, it builds a new table called sqlite_stat1 containing statistics about each of the indexes on the table—indicating how “selective” each index is—effectively how many rows on average you are likely to filter down to if you use the index.

The SQLite query planner can then use this to decide which index to consult. For example, given the following query:

select * from ny_times_us_counties
where state = 'Missouri' and county = 'Greene'

(Try that here.)

If there are indexes on both columns, should the query planner use the state column or the county column?

In this case the state column will filter down to 75,209 rows, while the county column filters to 9,186—so county is clearly the better query plan.

Impressively, SQLite seems to make this kind of decision perfectly well without the sqlite_stat1 table being populated: explain query plan select * from ny_times_us_counties where "county" = 'Greene' and "state" = 'Missouri' returns the following:

SEARCH TABLE ny_times_us_counties USING INDEX idx_ny_times_us_counties_county (county=?)

I’ve not actually found a good example of a query where the sqlite_stat1 table makes a difference yet, but I’m confident such queries exist!

Using SQL, you can run ANALYZE against an entire database by executing ANALYZE;, or against all of the indexes for a specific table with ANALYZE tablename;, or against a specific index by name using ANALYZE indexname;.
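Using the table and index names from the example above:

-- Entire database:
ANALYZE;
-- Every index on one table:
ANALYZE ny_times_us_counties;
-- A single index by name:
ANALYZE idx_ny_times_us_counties_county;
-- Inspect the statistics that get written:
select * from sqlite_stat1;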

There’s one catch with ANALYZE: since running it populates a static sqlite_stat1 table, the data in that table can get out of date. If you insert another million rows into a table, for example, your ANALYZE statistics might no longer reflect ground truth to the point that the query planner starts to make bad decisions.

For sqlite-utils I decided to make ANALYZE an explicit operation. In the Python library you can now run the following:

db.analyze() # Analyze every index in the database
db.analyze("indexname") # Analyze a specific index
db.analyze("tablename") # Analyze every index for that table
# Or the same thing using a table object:
db["tablename"].analyze()

I also added an optional analyze=True parameter to several methods, which you can use to trigger an ANALYZE once that operation completes:

db["tablename"].create_index(["column"], analyze=True)
db["tablename"].insert_rows(rows, analyze=True)
db["tablename"].delete_where(analyze=True)

The sqlite-utils CLI command has equivalent functionality:

# Analyze every index in a database:
% sqlite-utils analyze database.db
# Analyze a specific index:
% sqlite-utils analyze database.db indexname
# Analyze all indexes for a table:
% sqlite-utils analyze database.db tablename

And an --analyze option for various commands:

% sqlite-utils create-index ... --analyze
% sqlite-utils insert ... --analyze
% sqlite-utils upsert ... --analyze

Other smaller changes

  • New sqlite-utils create-database command for creating new empty database files. (#348)

Most sqlite-utils commands such as insert or create-table create the database file for you if it doesn’t already exist, but I decided it would be neat to have an explicit create-database command for deliberately creating an empty database.

Update 13th January 2022: I wrote a detailed description of my process building this command in How I build a feature.

  • The CLI tool can now also be run using python -m sqlite_utils. (#368)

I initially added this to help write a unit test that exercised the tool through a subprocess (see TIL Testing a Click app with streaming input) but it’s a neat pattern in general. datasette gained this through a contribution from Abdussamet Koçak a few years ago.
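Supporting this is just a matter of adding a __main__.py module to the package. A minimal sketch of the usual pattern (not necessarily the exact file in sqlite-utils) looks like this:

# sqlite_utils/__main__.py
from .cli import cli

if __name__ == "__main__":
    cli()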

  • Using --fmt now implies --table, so you don’t need to pass both options. (#374)

A nice tiny usability enhancement. You can now run sqlite-utils rows my.db mytable --fmt rst to get back a reStructuredText table—previously you also needed to add --table.

I sometimes re-read the documentation for older features to remind me what they do, and occasionally an idea for a feature jumps out from that. Implementing these was a very small change.

  • The --nl import option now ignores blank lines in the input. (#376)
  • Fixed bug where streaming input to the insert command with --batch-size 1 would appear to only commit after several rows had been ingested, due to unnecessary input buffering. (#364)

That --nl improvement came from tinkering around trying to fix the bug.

The bug itself was interesting: I initially thought that my entire mechanism for committing on every --batch-size chunk was broken, but it turned out I was unnecessarily buffering data from standard input in order to support the --sniff option for detecting the shape of incoming CSV data.

  • db.supports_strict property showing if the database connection supports SQLite strict tables.
  • table.strict property (see .strict) indicating if the table uses strict mode. (#344)

See previous weeknotes: this is the first part of my ongoing support for the new STRICT tables in SQLite.
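A quick sketch of those two properties in use—the table here is invented for illustration, and the strict keyword needs SQLite 3.37 or higher, which is what db.supports_strict checks for:

import sqlite_utils

db = sqlite_utils.Database(memory=True)
if db.supports_strict:
    db.execute("create table demo (id integer, name text) strict")
    print(db["demo"].strict)  # True
else:
    print("This SQLite version does not support strict tables")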

I’m currently blocked on implementing more due to the need to get a robust mechanism up and running for executing sqlite-utils tests in CI against specific SQLite versions, see issue #346.

Releases this week

TILs this week

Weeknotes: Taking a break in Moss Landing 18 days ago

Took some time off. Saw some whales and sea otters. Added a new spot to Niche Museums.

Natalie took me to Moss Landing for a few days for my birthday. I now think Moss Landing may be one of California’s best kept secrets, for a whole bunch of reasons.

Most importantly, Moss Landing has Elkhorn Slough, California’s second largest estuary and home to 7% of the world’s population of sea otters. And you can kayak there!

We rented a kayak from Kayak Connection and headed out for three hours on the water.

The rules are to stay eight boat lengths (100 feet) away from the otters, or to stop paddling and wait for them to leave if they pop up near your boat. And they pop up a lot!

We saw at least twenty sea otters. The largest can weigh 90lbs (that’s two Cleos) and they were quite happy to ignore us and get on with otter stuff: floating on their backs, diving into the water and playing with each other.

We also saw harbor seals, egrets, herons, avocets and both brown and white pelicans.

A harbour seal in the estuary, making a funny U shape with its body

A beautiful great egret

Moss Landing also sits at the edge of Monterey Bay, which contains the Monterey Submarine Canyon, one of the largest such canyons in the world. Which means cold water and warm water mixing in interesting ways. Which means lots of nutritious tiny sea creatures. Which means whales!

We went whale watching with Blue Ocean Whale Watching, who came recommended by several naturalist friends. They were brilliant—they had an obvious passion for the ocean, shared great information and answered all of our increasingly eccentric questions. Did you know a Blue Whale can use a thousand calories of energy just opening its mouth?

A gray whale in the water

We saw gray whales—expected at this time of year due to their migration from the arctic down south to their breeding lagoons in Baja, and humpback whales—not a usual occurrence at this time of year but evidently the younger whales don’t necessarily stick to the official calendar.

The tail of a humpback whale as it dives down for food

Moss Landing also has a large number of noisy sea lions. This one was asleep on the dock when our ship returned.

A sleeping sea lion

A sea otter floating on its back in the water eating a crab

Then yesterday morning we went for a walk around this peninsula and saw sea otters fishing for crabs just yards away from shore! Plus a juvenile elephant seal who had hauled itself onto the beach.

An exceptionally round juvenile elephant seal, asleep on the beach

We also dropped in to the Shakespeare Society of America—my first niche museum visit of 2022. I wrote about that for Niche Museums.

TIL this week

Weeknotes: datasette-tiddlywiki, filters_from_request one month ago

I made some good progress on the big refactor this week, including extracting some core logic out into a new Datasette plugin hook. I also got distracted by TiddlyWiki and released a new Datasette plugin that lets you run TiddlyWiki inside Datasette.

datasette-tiddlywiki

TiddlyWiki is a fascinating and unique project. Jeremy Ruston has been working on it for 17 years now and I’ve still not seen another piece of software that works even remotely like it.

It’s a full-featured wiki that’s implemented entirely as a single 2.3MB page of HTML and JavaScript, with a plugin system that allows it to be extended in all sorts of interesting ways.

The most unique feature of TiddlyWiki is how it persists data. You can create a brand new wiki by opening tiddlywiki.com/empty.html in your browser, making some edits... and then clicking the circle-tick “Save changes” button to download a copy of the page with your changes baked into it! Then you can open that up on your own computer and keep on using it.

There’s actually a lot more to TiddlyWiki persistence than that: The GettingStarted guide lists dozens of options that vary depending on operating system and browser—it’s worth browsing through them just to marvel at how much innovation has happened around the project just in the persistence space.

One of the options is to run a little server that implements the WebServer API and persists data sent via PUT requests. SQLite is an obvious candidate for a backend, and Datasette makes it pretty easy to provide APIs on top of SQLite... so I decided to experiment with building a Datasette plugin that offers a full persistent TiddlyWiki experience.

datasette-tiddlywiki is the result.
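The core mechanism is small enough to sketch: a register_routes() view that writes incoming PUT requests into a SQLite table. The following is an illustration of that general shape rather than the actual datasette-tiddlywiki code; the route, table and column names are invented:

from datasette import hookimpl
from datasette.utils.asgi import Response


@hookimpl
def register_routes():
    return [
        # Invented route: the real plugin follows the TiddlyWiki WebServer API
        (r"^/-/tiddlywiki/tiddlers/(?P<title>.+)$", tiddler),
    ]


async def tiddler(request, datasette):
    db = datasette.get_database("tiddlywiki")
    title = request.url_vars["title"]
    if request.method == "PUT":
        body = await request.post_body()
        await db.execute_write(
            "insert or replace into tiddlers (title, body) values (?, ?)",
            [title, body.decode("utf-8")],
        )
        return Response.text("", status=204)
    row = (
        await db.execute("select body from tiddlers where title = ?", [title])
    ).first()
    return Response.text(row["body"] if row else "", status=200 if row else 404)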

You can try it out by running datasette install datasette-tiddlywiki and then datasette tiddlywiki.db --create to start the server (with a tiddlywiki.db SQLite database that will be created if it does not already exist.)

Then navigate to http://localhost:8001/-/tiddlywiki to start interacting with your new TiddlyWiki. Any changes you make there will be persisted to the tiddlywiki database.

Animated demo showing creating a new tiddler

I had a running research issue that I updated as I was figuring out how to build it—all sorts of fun TiddlyWiki links and TILs are embedded in that thread. The issue started out in my private “notes” GitHub repository but I transferred it to the datasette-tiddlywiki repository after I had created and published the first version of the plugin.

filters_from_request() plugin hook

My big breakthrough in the ongoing Datasette Table View refactor project was a realization that I could simplify the table logic by extracting some of it out into a new plugin hook.

The new hook is called filters_from_request. It acknowledges that the primary goal of the table page is to convert query string parameters—like ?_search=tony or ?id__gte=6 or ?_where=id+in+(1,+2+,3) into SQL where clauses.

(Here’s a full list of supported table arguments.)

So that’s what filters_from_request() does—given a request object it can return SQL clauses that should be added to the WHERE.
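Here’s a sketch of what a minimal implementation of the hook can look like, based on the documented FilterArguments return value. The ?_since= parameter and the created column are invented for illustration:

from datasette import hookimpl
from datasette.filters import FilterArguments


@hookimpl
def filters_from_request(request, database, table, datasette):
    # Turn an (invented) ?_since=2022-01-01 parameter into a where clause
    since = request.args.get("_since")
    if since:
        return FilterArguments(
            ["created >= :since"],
            params={"since": since},
            human_descriptions=["created since {}".format(since)],
        )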

Datasette now uses those internally to implement ?_where= and ?_search= and ?_through=, see datasette/filters.py.

I always try to accompany a new plugin hook with a plugin that actually uses it—in this case I’ve been updating datasette-leaflet-freedraw to use that hook to add a “draw a shape on a map to filter this table” interface to any table that it detects has a SpatiaLite geometry column. There’s a demo of that here:

https://calands.datasettes.com/calands/CPAD_2020a_SuperUnits?_freedraw=%7B%22type%22%3A%22MultiPolygon%22%2C%22coordinates%22%3A%5B%5B%5B%5B-121.92627%2C37.73597%5D%2C%5B-121.83838%2C37.68382%5D%2C%5B-121.64063%2C37.45742%5D%2C%5B-121.57471%2C37.19533%5D%2C%5B-121.81641%2C36.80928%5D%2C%5B-122.146%2C36.63316%5D%2C%5B-122.56348%2C36.65079%5D%2C%5B-122.89307%2C36.79169%5D%2C%5B-123.06885%2C36.96745%5D%2C%5B-123.09082%2C37.33522%5D%2C%5B-123.0249%2C37.562%5D%2C%5B-122.91504%2C37.77071%5D%2C%5B-122.71729%2C37.92687%5D%2C%5B-122.58545%2C37.96152%5D%2C%5B-122.10205%2C37.96152%5D%2C%5B-121.92627%2C37.73597%5D%5D%5D%5D%7D

Animated demo of drawing a shape on a map and then submitting the form to see items within that map region

Note the new custom ?_freedraw={...} parameter which accepts a GeoJSON polygon and uses it to filter the table—that’s implemented using the new hook.

This isn’t in a full Datasette release yet, but it’s available in the Datasette 0.60a1 alpha (added in 0.60a0) if you want to try it out.

Optimizing populate_table_schemas()

I introduced the datasette-pretty-traces plugin last week—it makes it much easier to see the queries that are running on any given Datasette page.

This week I realized it wasn’t tracking write queries, so I added support for that—and discovered that on first page load after starting up, Datasette spends a lot of time populating its own internal database containing schema information (see Weeknotes: Datasette internals from last year).

Example trace showing a cavalcade of write SQL

I opened a tracking ticket and made a bunch of changes to optimize this. The new code in datasette/utils/internal_db.py uses two new documented internal methods:

db.execute_write_script() and db.execute_write_many()

These are the new methods that were created as part of the optimization work, and they are now part of Datasette’s documented internals.

They are Datasette’s async wrappers around the Python sqlite3 module’s executemany() and executescript() methods.

I also made a breaking change to Datasette’s existing execute_write() and execute_write_fn() methods: their block= argument now defaults to True, where it previously defaulted to False.

Prior to this change, db.execute_write(sql) would put the passed SQL in a queue to be executed once the write connection became available... and then return control to the calling code, whether or not that SQL had actually run: a fire-and-forget mechanism for executing SQL.

The block=True option would change it to blocking until the query had finished executing.

Looking at my own code, I realized I had never once used the fire-and-forget mechanism: I always used block=True to ensure the SQL had finished writing before I moved on.

So clearly block=True was a better default. I made that change in issue 1579.

This is technically a breaking change... but I used the new GitHub code search to see if anyone was using it in a way that would break and could only find one example of it in code not written by me, in datasette-webhook-write—and since they use block=True there anyway this update won’t break their code.

If I’d released Datasette 1.0 I would still consider this a breaking change and bump the major version number, but thankfully I’m still in the 0.x range where I can be a bit less formal about these kinds of thing!

Releases this week

TIL this week

Elsewhere

21st January 2022

  • Roblox Return to Service 10/28-10/31 2021 (via) A particularly good example of a public postmortem on an outage. Roblox was down for 72 hours last year, as a result of an extremely complex set of circumstances which took a lot of effort to uncover. It’s interesting to think through what kind of monitoring you would need to have in place to help identify the root cause of this kind of issue. #21st January 2022, 4:41 pm

20th January 2022

  • How to Add a Favicon to Your Django Site (via) Adam Johnson did the research on the best way to handle favicons—Safari still doesn’t handle SVG icons so the best solution today is a PNG served from the /favicon.ico path. This article inspired me to finally add a proper favicon to Datasette. #20th January 2022, 7:03 am

18th January 2022

17th January 2022

  • SQLime: SQLite Playground (via) Anton Zhiyanov built this useful mobile-friendly online playground for trying things out in SQLite. It uses the sql.js library which compiles SQLite to WebAssembly, so it runs everything in the browser—but it also supports saving your work to Gists via the GitHub API. The JavaScript source code is fun to read: the site doesn’t use npm or Webpack or similar, opting instead to implement everything library-free using modern JavaScript modules and Web Components. #17th January 2022, 7:08 pm

16th January 2022

  • Abusing AWS Lambda to make an Aussie Search Engine (via) Ben Boyter built a search engine that only indexes .au Australian websites, with the novel approach of directly compiling the search index into 250 different ~40MB large lambda functions written in Go, then running searches across 12 million pages by farming them out to all of the lambdas and combining the results. His write-up includes all sorts of details about how he built this, including how he ran the indexer and how he solved the surprisingly hard problem of returning good-enough text snippets for the results. #16th January 2022, 8:52 pm

15th January 2022

13th January 2022

9th January 2022

  • Before May 2021, the master key in MetaMask was called the “Seed Phrase”. Through user research and insights from our customer support team, we have concluded that this name does not properly convey the critical importance that this master key has for user security. This is why we will be changing our naming of this master key to “Secret Recovery Phrase”. Through May and June of 2021, we will be phasing out the use of “seed phrase” in our application and support articles, and eventually exclusively calling it a “Secret Recovery Phrase.” No action is required, this is only a name change. We will be rolling this out on both the extension and the mobile app for all users.

    MetaMask Support # 9th January 2022, 5:44 am

8th January 2022

  • Hashids (via) Confusingly named because it’s not really a hash—this library (available in 40+ languages) offers a way to convert integer IDs to and from short strings of text based on a salt which, if kept secret, should help prevent people from deriving the IDs and using them to measure growth of your service. It works using a base62 alphabet that is shuffled using the salt. #8th January 2022, 7:31 pm

6th January 2022

  • Crypto creates a massively multiplayer online game where the game is “currency speculation”, and it’s very realistic because it really is money, at least if enough people get involved. [...] NFTs add another layer to the game. Instead of just currency speculation, you’re now simulating art speculation too! The fact that you don’t actually own the art and the fact that the art is randomly generated cartoon images of monkeys is entirely beside the point: the point is the speculation, and winning the game by making money. This is, again, a lot of fun to some people, and in addition to the piles of money they also in some very limited sense own a picture of a cartoon monkey that some people recognize as being very expensive, so they can brag without having to actually post screenshots of their bank balance, which nobody believed anyway.

    Laurie Voss # 6th January 2022, 7:35 am