Simon Willison’s Weblog

On gis 34 datasette 277 rust 24 go 15 plugins 41 ...

 

Recent entries

Automatically opening issues when tracked file content changes two hours ago

I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.

Extracting GZipMiddleware from Starlette

Here’s why I needed to solve this problem.

I want to add gzip support to my Datasette open source project. Datasette builds on the Python ASGI standard, and Starlette provides an extremely well rested, robust GZipMiddleware class that adds gzip support to any ASGI application. As with everything else in Starlette, it’s really good code.

The problem is, I don’t want to add the whole of Starlette as a dependency. I’m trying to keep Datasette’s core as small as possible, so I’m very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don’t want the whole thing just for that one class.

So I decided to extract the GZipMiddleware class into a separate Python package, under the same BSD license as Starlette itself.

The result is my new asgi-gzip package, now available on PyPI.

What if Starlette fixes a bug?

The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the GZipMiddleware class? How can I make sure to apply those same fixes to my extracted copy?

As I thought about this challenge, I realized I had most of the solution already.

Git scraping is the name I’ve given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.

It may seem redundant to do this against a file that already lives in version control elsewhere—but in addition to tracking changes, Git scraping can offfer a cheap and easy way to add automation that triggers when a change is detected.

I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.

Since I already run all of my projects out of GitHub issues, automatically opening an issue against the asgi-gzip repository would be ideal.

My track.yml workflow does exactly that: it implements the Git scraping pattern against the gzip.py module in Starlette, and files an issue any time it detects changes to that file.

Starlette haven’t made any changes to that file since I started tracking it, so I created a test repo to try this out.

Here’s one of the example issues. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.

Screenshot of an open issue page. The issues is titled "gzip.py was updated" and contains a visual diff showing the change to a file. A commit that references the issue is listed too.

How it works

The implementation is contained entirely in this track.yml workflow. I designed this to be contained as a single file to make it easy to copy and paste it to adapt it for other projects.

It uses actions/github-script, which makes it easy to do things like file new issues using JavaScript.

Here’s a heavily annotated copy:

name: Track the Starlette version of this

# Run on repo pushes, and if a user clicks the "run this action" button,
# and on a schedule at 5:21am UTC every day
on:
  push:
  workflow_dispatch:
  schedule:
  - cron:  '21 5 * * *'

# Without this block I got this error when the action ran:
# HttpError: Resource not accessible by integration
permissions:
  # Allow the action to create issues
  issues: write
  # Allow the action to commit back to the repository
  contents: write

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/github-script@v6
      # Using env: here to demonstrate how an action like this can
      # be adjusted to take dynamic inputs
      env:
        URL: https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py
        FILE_NAME: tracking/gzip.py
      with:
        script: |
          const { URL, FILE_NAME } = process.env;
          // promisify pattern for getting an await version of child_process.exec
          const util = require("util");
          // Used exec_ here because 'exec' variable name is already used:
          const exec_ = util.promisify(require("child_process").exec);
          // Use curl to download the file
          await exec_(`curl -o ${FILE_NAME} ${URL}`);
          // Use 'git diff' to detect if the file has changed since last time
          const { stdout } = await exec_(`git diff ${FILE_NAME}`);
          if (stdout) {
            // There was a diff to that file
            const title = `${FILE_NAME} was updated`;
            const body =
              `${URL} changed:` +
              "\n\n```diff\n" +
              stdout +
              "\n```\n\n" +
              "Close this issue once those changes have been integrated here";
            const issue = await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: title,
              body: body,
            });
            const issueNumber = issue.data.number;
            // Now commit and reference that issue number, so the commit shows up
            // listed at the bottom of the issue page
            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;
            // https://til.simonwillison.net/github-actions/commit-if-file-changed
            await exec_(`git config user.name "Automated"`);
            await exec_(`git config user.email "actions@users.noreply.github.com"`);
            await exec_(`git add -A`);
            await exec_(`git commit -m "${commitMessage}" || exit 0`);
            await exec_(`git pull --rebase`);
            await exec_(`git push`);
          }

In the asgi-gzip repository I keep the fetched gzip.py file in a tracking/ directory. This directory isn’t included in the Python package that gets uploaded to PyPI—it’s there only so that my code can track changes to it over time.

More interesting applications

I built this to solve my "tell me when Starlette update their gzip.py file" problem, but clearly this pattern has much more interesting uses.

You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust , shared mechanism for alerts—plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.

There’s a lot of potential here for solving all kinds of interesting problems. And it doesn’t cost anything either: GitHub Actions (somehow) remains completely free for public repositories!

Weeknotes: Parallel queries for Datasette, plus some middleware tricks one day ago

A promising new performance optimization for Datasette, plus new datasette-gzip and datasette-total-page-time plugins.

Parallel SQL queries in Datasette

From the start of the project, Datasette has been built on top of Python’s asyncio capabilities—mainly to benefit things like streaming enormous CSV files.

This week I started experimenting with a new way to take advantage of them, by exploring the potential to run multiple SQL queries in parallel.

Consider this Datasette table page:

Screenshot of the commits table, showing a count and suggested facets and activated facets and some table data

That page has to execute quite a few SQL queries:

  • A select count(*) ... to populate the 3,283 rows heading at the top
  • Queries against each column to decide what the “suggested facets” should be (details here)
  • For each of the selected facets (in this case repos and committer) a select name, count(*) from ... group by name order by count(*) desc query
  • The actual select * from ... limit 101 query used to display the actual table

It ends up executing more than 30 queries! Which may seem like a lot, but Many Small Queries Are Efficient In SQLite.

One thing that’s interesting about the above list of queries though is that they don’t actually have any dependencies on each other. There’s no reason not to run all of them in parallel—later queries don’t depend on the results from earlier queries.

I’ve been exploring a fancy way of executing parallel code using pytest-style dependency injection in my asyncinject library. But I decided to do a quick prototype to see what this would look like using asyncio.gather().

It turns out that simpler approach worked surprisingly well!

You can follow my research in this issue, but the short version is that as-of a few days ago the Datasette main branch runs many of the above queries in parallel.

This trace (using the datasette-pretty-traces plugin) illustrates my initial results:

Screenshot of a trace - many SQL queries have overlapping lines

As you can see, the grey lines for many of those SQL queries are now overlapping.

You can add the undocumented ?_noparallel=1 query string parameter to disable parallel execution to compare the difference:

Same trace again, but this time each query ends before the next one begins

One thing that gives me pause: for this particular Datasette deployment (on the cheapest available Cloud Run instance) the overall performance difference between the two is very small.

I need to dig into this deeper: on my laptop I feel like I’m seeing slightly better results, but definitely not conclusively. It may be that multiple cores are not being used effectively here.

Datasette runs SQL queries in a pool of threads. You might expect Python’s infamous GIL (Global Interpreter Lock) to prevent these from executing across multiple cores—but I checked, and the GIL is released in Python’s C code the moment control transfers to SQLite. And since SQLite can happily run multiple threads, my hunch is that this means parallel queries should be able to take advantage of multiple cores. Theoretically at least!

I haven’t yet figured out how to prove this though, and I’m not currently convinced that parallel queries are providing any overall benefit at all. If you have any ideas I’d love to hear them—I have a research issue open, comments welcome!

datasette-gzip

I’ve been putting off investigating gzip support for Datasette for a long time, because it’s easy to add as a separate layer. If you run Datasette behind Cloudflare or an Apache or Nginx proxy configuring gzip can happen there, with very little effort and fantastic performance.

Then I noticed that my Global Power Plants demo returned an HTML table page that weighed in at 420KB... but gzipped was just 16.61KB. Turns out HTML tables have a ton of repeated markup and compress REALLY well!

More importantly: Google Cloud Run doesn’t gzip for you. So all of my Datasette instances that were running on Cloud Run without also using Cloudflare were really suffering.

So this morning I released datasette-gzip, a plugin that gzips content if the browser sends an Accept-Encoding: gzip header.

The plugin is an incredibly thin wrapper around the thorougly proven-in-production GZipMiddleware. So thin that this is the full implementation:

from datasette import hookimpl
from starlette.middleware.gzip import GZipMiddleware

@hookimpl(trylast=True)
def asgi_wrapper(datasette):
    return GZipMiddleware

This kind of thing is exactly why I ported Datasette to ASGI back in 2019—and why I continue to think that the burgeoning ASGI ecosystem is the most under-rated piece of today’s Python web development environment.

The plugin’s tests are a lot more interesting.

That @hookimpl(trylast=True) line is there to ensure that this plugin runs last, after ever other plugin has executed.

This is necessary because there are existing ASGI plugins for Datasette (such as the new datasette-total-page-time) which modify the generated request.

If the gzip plugin runs before they do, they’ll get back a blob of gzipped data rather than the HTML that they were expecting. This is likely to break them.

I wanted to prove to myself that trylast=True would prevent these errors—so I ended up writing a test that demonstrated that the plugin registered with trylast=True was compatible with a transforming content plugin (in the test it just converts everything to uppercase) whereas tryfirst=True would instead result in an error.

Thankfully I have an older TIL on Registering temporary pluggy plugins inside tests that I could lean on to help figure out how to do this.

The plugin is now running on my latest-with-plugins demo instance. Since that instance loads dozens of different plugins it ends up serving a bunch of extra JavaScript and CSS, all of which benefits from gzip:

Screenshot of the Firefox inspector pane showing 419KB of HTML reduced to 16.61KB and 922KB of JavaScript reduced to 187.47KB. Total of 3.08MB page weight but only 546KB transferred.

datasette-total-page-time

To help understand the performance improvements introduced by parallel SQL queries I decided I wanted the Datasette footer to be able to show how long it took for the entire page to load.

This is a tricky thing to do: how do you measure the total time for a page and then include it on that page if the page itself hasn’t finished loading when you render that template?

I came up with a pretty devious middleware trick to solve this, released as the datasette-total-page-time plugin.

The trick is to start a timer when the page load begins, and then end that timer at the very last possible moment as the page is being served back to the user.

Then, inject the following HTML directly after the closing </html> tag (which works fine, even though it’s technically invalid):

<script>
let footer = document.querySelector("footer");
if (footer) {
    let ms = 37.224;
    let s = ` &middot; Page took ${ms.toFixed(3)}ms`;
    footer.innerHTML += s;
}
</script>

This adds the timing information to the page’s <footer> element, if one exists.

You can see this running on this latest-with-plugins page.

Releases this week

TIL this week

Useful tricks with pip install URL and GitHub four days ago

The pip install command can accept a URL to a zip file or tarball. GitHub provides URLs that can create a zip file of any branch, tag or commit in any repository. Combining these is a really useful trick for maintaining Python packages.

pip install URL

The most common way of using pip is with package names from PyPi:

pip install datasette

But the pip install command has a bunch of other abilities—it can install files, pull from various version control systems and most importantly it can install packages from a URL.

I sometimes use this to distribute ad-hoc packages that I don’t want to upload to PyPI. Here’s a quick and simple Datasette plugin I built a while ago that I install using this option:

pip install 'https://static.simonwillison.net/static/2021/datasette_expose_some_environment_variables-0.1-py3-none-any.whl'

(Source code here)

You can also list URLs like this directly in your requirements.txt file, one per line.

datasette install

Datasette has a datasette install command which wraps pip install. It exists purely so that people can install Datasette plugins easily without first having to figure out the location of Datasette’s Python virtual environment.

This works with URLs too, so you can install that plugin like so:

datasette install https://static.simonwillison.net/static/2021/datasette_expose_some_environment_variables-0.1-py3-none-any.whl

The datasette publish commands have an --install option for installing plugin which works with URLs too:

datasette publish cloudrun mydatabase.db \
  --service=plugins-demo \
  --install datasette-vega \
  --install https://static.simonwillison.net/static/2021/datasette_expose_some_environment_variables-0.1-py3-none-any.whl \
  --install datasette-graphql

Installing branches, tags and commits

Any reference in a GitHub repository can be downloaded as a zip file or tarball—that means branches, tags and commits are all available.

If your repository contains a Python package with a setup.py file, those URLs will be compatible with pip install.

This means you can use URLs to install tags, branches and even exact commits!

Some examples:

  • pip install https://github.com/simonw/datasette/archive/refs/heads/main.zip installs the latest main branch from the simonw/datasette repository.
  • pip install https://github.com/simonw/datasette/archive/refs/tags/0.61.1.zip—installs version 0.61.1 of Datasette, via this tag.
  • pip install https://github.com/simonw/datasette/archive/refs/heads/0.60.x.zip—installs the latest head from my 0.60.x branch.
  • pip install https://github.com/simonw/datasette/archive/e64d14e4.zip—installs the package from the snapshot at commit e64d14e413a955a10df88e106a8b5f1572ec8613—note that you can use just the first few characters in the URL rather than the full commit hash.

That last option, installing for a specific commit hash, is particularly useful in requirements.txt files since unlike branches or tags you can be certain that the content will not change in the future.

As you can see, the URLs are all predictable—GitHub has really good URL design. But if you don’t want to remember or look them up you can instead find them using the Code -> Download ZIP menu item for any view onto the repository:

Screenshot of the GitHub web interface - click on the green Code button, then right click on Download ZIP and selecet Copy Link

Installing from a fork

I sometimes use this trick when I find a bug in an open source Python library and need to apply my fix before it has been accepted by upstream.

I create a fork on GitHub, apply my fix and send a pull request to the project.

Then in my requirements.txt file I drop in a URL to the fix in my own repository—with a comment reminding me to switch back to the official package as soon as they’ve applied the bug fix.

Installing pull requests

This is a new trick I discovered this morning: there’s a hard-to-find URL that lets you do the same thing for code in pull requests.

Consider PR #1717 against Datasette, by Tim Sherratt, adding a --timeout option the datasette publish cloudrun command.

I can install that in a fresh environment on my machine using:

pip install https://api.github.com/repos/simonw/datasette/zipball/pull/1717/head

This isn’t as useful as checking out the code directly, since it’s harder to review the code in a text editor—but it’s useful knowing it’s possible.

Installing gists

GitHub Gists also get URLs to zip files. This means it’s possible to create and host a full Python package just using a Gist, by packaging together a setup.py file and one or more Python modules.

Here’s an example Gist containing my datasette-expose-some-environment-variables plugin.

You can right click and copy link on the “Download ZIP” button to get this URL:

https://gist.github.com/simonw/b6dbb230d755c33490087581821d7082/archive/872818f6b928d9393737eee541c3c76d6aa4b1ba.zip

Then pass that to pip install or datasette install to install it.

That Gist has two files—a setup.py file containing the following:

from setuptools import setup

VERSION = "0.1"

setup(
    name="datasette-expose-some-environment-variables",
    description="Expose environment variables in Datasette at /-/env",
    author="Simon Willison",
    license="Apache License, Version 2.0",
    version=VERSION,
    py_modules=["datasette_expose_some_environment_variables"],
    entry_points={
        "datasette": [
            "expose_some_environment_variables = datasette_expose_some_environment_variables"
        ]
    },
    install_requires=["datasette"],
)

And a datasette_expose_some_environment_variables.py file containing the actual plugin:

from datasette import hookimpl
from datasette.utils.asgi import Response
import os

REDACT = {"GPG_KEY"}


async def env(request):
    output = []
    for key, value in os.environ.items():
        if key not in REDACT:
            output.append("{}={}".format(key, value))
    return Response.text("\n".join(output))


@hookimpl
def register_routes():
    return [
        (r"^/-/env$", env)
    ]

Building a Covid sewage Twitter bot (and other weeknotes) 10 days ago

I built a new Twitter bot today: @covidsewage. It tweets a daily screenshot of the latest Covid sewage monitoring data published by Santa Clara county.

I’m increasingly distrustful of Covid numbers as fewer people are tested in ways that feed into the official statistics. But the sewage numbers don’t lie! As the Santa Clara county page explains:

SARS-CoV-2 (the virus that causes COVID-19) is shed in feces by infected individuals and can be measured in wastewater. More cases of COVID-19 in the community are associated with increased levels of SARS-CoV-2 in wastewater, meaning that data from wastewater analysis can be used as an indicator of the level of transmission of COVID-19 in the community.

That page also embeds some beautiful charts of the latest numbers, powered by an embedded Observable notebook built by Zan Armstrong.

Once a day, my bot tweets a screenshot of those latest charts that looks like this:

Screenshot of a tweet that says "Latest Covid sewage charts for the SF Bay Area" with an attached screenshot of some charts. The numbers are trending up in an alarming direction.

How the bot works

The bot runs once a daily using this scheduled GitHub Actions workflow.

Here’s the bit of the workflow that generates the screenshot:

- name: Generate screenshot with shot-scraper
  run: |-
    shot-scraper https://covid19.sccgov.org/dashboard-wastewater \
      -s iframe --wait 3000 -b firefox --retina -o /tmp/covid.png

This uses my shot-scraper screenshot tool, described here previously. It takes a retina screenshot just of the embedded iframe, and uses Firefox because for some reason the default Chromium screenshot failed to load the embed.

This bit sends the tweet:

- name: Tweet the new image
  env:
    TWITTER_CONSUMER_KEY: ${{ secrets.TWITTER_CONSUMER_KEY }}
    TWITTER_CONSUMER_SECRET: ${{ secrets.TWITTER_CONSUMER_SECRET }}
    TWITTER_ACCESS_TOKEN_KEY: ${{ secrets.TWITTER_ACCESS_TOKEN_KEY }}
    TWITTER_ACCESS_TOKEN_SECRET: ${{ secrets.TWITTER_ACCESS_TOKEN_SECRET }}
  run: |-
    tweet-images "Latest Covid sewage charts for the SF Bay Area" \
      /tmp/covid.png --alt "Screenshot of the charts" > latest-tweet.md

tweet-images is a tiny new tool I built for this project. It uses the python-twitter library to send a tweet with one or more images attached to it.

The hardest part of the project was getting the credentials for sending tweets with the bot! I had to go through Twitter’s manual verification flow, presumably because I checked the “bot” option when I applied for the new developer account. I also had to figure out how to extract all four credentials (with write permissions) from the Twitter developer portal.

I wrote up full notes on this in a TIL: How to get credentials for a new Twitter bot.

Datasette for geospatial analysis

I stumbled across datanews/amtrak-geojson, a GitHub repository containing GeoJSON files (from 2015) showing all of the Amtrak stations and sections of track in the USA.

I decided to try exploring it using my geojson-to-sqlite tool, which revealed a bug triggered by records with a geometry but no properties. I fixed that in version 1.0.1, and later shipped version 1.1 with improvements by Chris Amico.

In exploring the Amtrak data I found myself needing to learn how to use the SpatiaLite GUnion function to aggregate multiple geometries together. This resulted in a detailed TIL on using GUnion to combine geometries in SpatiaLite, which further evolved as I used it as a chance to learn how to use Chris’s datasette-geojson-map and sqlite-colorbrewer plugins.

This was so much fun that I was inspired to add a new “uses” page to the official Datasette website: Datasette for geospatial analysis now gathers together links to plugins, tools and tutorials for handling geospatial data.

sqlite-utils 3.26

I’ll quote the release notes for sqlite-utils 3.26 in full:

shot-scraper 0.12

In addition to support for WebKit contributed by Ryan Murphy, shot-scraper 0.12 adds options for taking a screenshot that encompasses all of the elements on a page that match a CSS selector.

In also adds a new --js-selector option, suggested by Tony Hirst. This covers the case where you want to take a screenshot of an element on the page that cannot be easily specified using a CSS selector. For example, this expression takes a screenshot of the first paragraph on a page that includes the text “shot-scraper”:

shot-scraper https://simonwillison.net/2022/Apr/8/weeknotes/ \
  --js-selector 'el.tagName == "P" && el.innerText.includes("shot-scraper")' \
  --padding 15 --retina

And an airship museum!

I finally got to add another listing to my www.niche-museums.com website about small or niche museums I have visited.

The Moffett Field Historical Society museum in Mountain View is situated in the shadow of Hangar One, an airship hangar built in 1933 to house the mighty USS Macon.

It’s the absolute best kind of local history museum. Our docent was a retired pilot who had landed planes on aircraft carriers using the kind of equipment now on display in the museum. They had dioramas and models. They even had a model railway. It was superb.

Releases this week

TIL this week

Pillar Point Stewards, pypi-to-sqlite, improvements to shot-scraper and appreciating datasette-dashboards 19 days ago

This week I helped Natalie launch the Pillar Point Stewards website and built a new tool for loading PyPI package data into SQLite, in order to help promote the excellent datasette-dashboards plugin by Romain Clement.

Pillar Point Stewards

I’ve been helping my wife Natalie Downe build the website for the Pillar Point Stewards initative that she is organizing on behalf of the San Mateo MPA Collaborative and California Academy of Sciences.

We live in El Granada, CA—home to the Pillar Point reef.

The reef has always been mixed-use, with harvesting of sea life such as mussels and clams legal provided the harvesters have an inexpensive fishing license.

Unfortunately, during the pandemic the number of people harvesting the reef raised by an order of magnitude—up to over a thousand people in just a single weekend. This had a major impact on the biodiversity of the reef, as described in Packed at Pillar Point by Anne Marshall-Chalmers for Bay Nature.

Pillar Point Stewards is an initiative to recruit volunteer stewards to go out on the reef during low tides, talking to people and trying to inspire curiosity and discourage unsustainable harvesting practices.

A very small part of the project is the website to support it, which helps volunteers sign up for shifts at low tides.

We re-used some of the work we had previously done for Rocky Beaches, in particular the logic for working with tide times from NOAA to decide when the shifts should be.

Natalie designed the site and built the front-end. I implemented the Django backend and integrated with Auth0 in order to avoid running our own signup and registration flow. This was the inspiration for the datasette-auth0 plugin I released last week.

Signed in volunteers can select their shift times from a calendar:

The signed in homepage, showing a list of upcoming shifts and a calendar view.

We also included an SVG tide chart on each shift page using the tide data from NOAA, which looks like this:

The tide chart shows the tide level throughout the day, highlighting the low tide and showing which portion of the day is covered by the shift

We’ve been building the site in public. You can see how everything works in the natbat/pillarpointstewards GitHub repository, including how the site uses continuous deployment against Fly.

datasette-dashboards

This is not my project, but I’m writing about it here because I only just found out about it and it’s really cool.

Romain Clement built a plugin for Datasette called datasette-dashboards. It’s best explained by checking out his live demo, which looks like this:

A dashboard, showing Job offers statistics - with a line chart, a big number, a donut chart, a nested bar chart and a choropleth map. The elements are arranged in a visually pleasing grid, with the line chart taking up two columns while everything else takes up one.

There are a bunch of clever ideas in this plugin.

It uses YAML syntax to define the different dashboard panels, outsourcing the actual visualization elements to Vega. You can see the YAML for the demo here. Here’s an edited subset of the YAML illustrating some interesting points:

plugins:
  datasette-dashboards:
    job-offers-stats:
      title: Job offers statistics
      description: Gather metrics about job offers
      layout:
        - [analysis-note, offers-day, offers-day, offers-count]
        - [analysis-note, offers-source, offers-day-source, offers-region]
      filters:
        date_start:
          name: Date Start
          type: date
          default: '2021-01-01'
        date_end:
          name: Date End
          type: date
      charts:
        analysis-note:
          library: markdown
          display: |-
            # Analysis details
            ...
        offers-count:
          title: Total number of offers
          db: jobs
          query: SELECT count(*) as count FROM offers_view WHERE TRUE [[ AND date >= date(:date_start) ]] [[ AND date <= date(:date_end) ]];
          library: metric
          display:
            field: count
            prefix:
            suffix: " offers"
        offers-day:
          title: Number of offers by day
          db: jobs
          query: SELECT date(date) as day, count(*) as count FROM offers_view WHERE TRUE [[ AND date >= date(:date_start) ]] [[ AND date <= date(:date_end) ]] GROUP BY day ORDER BY day
          library: vega
          display:
            mark: { type: line, tooltip: true }
            encoding:
              x: { field: day, type: temporal }
              y: { field: count, type: quantitative }

The SQL query for each panel is defined as query:—and can take parameters such as :date_end which are defined by the filters: section. Note that here one of the filters has a type of date, which turns into a <input type="date"> in the filter interface.

For library: vega panels the display: key holds the raw Vega specification, so anything the Vega visualization library can do is available to the plugin.

I didn’t know Vega could render choropleth maps! That map there is defined by this YAML, which loads a GeoJSON file of the regions in France from the gregoiredavid/france-geojson GitHub repository.

display:
  mark: geoshape
  projection: { type: mercator }
  transform:
    - lookup: region
      from:
        data:
          url: https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/regions.geojson
          format: { type: json, property: features }
        key: properties.nom
        fields: [type, geometry]

I think my favourite trick though is the way it handles layout. The layout for the demo is defined thus:

layout:
  - [analysis-note, offers-day, offers-day, offers-count]
  - [analysis-note, offers-source, offers-day-source, offers-region]

This is then implemented using CSS grids! Here’s the template fragment that does the work:

<style>
  @media (min-width: 800px) {
    .dashboard-grid {
      {% if dashboard.layout %}
      grid-template-areas: {% for row in dashboard.layout %}"{% for col in row %}{{ col }} {% endfor %}" {% endfor %};
      {% else %}
      grid-template-columns: repeat(2, 1fr);
      {% endif %}
    }

    {% if dashboard.layout %}
    {% for chart_slug, chart in dashboard.charts.items() %}
    #card-{{ chart_slug }} {
      grid-area: {{ chart_slug }};
    }
    {% endfor %}
    {% endif %}
  }
</style>

Such a clever and elegant trick.

pypi-to-sqlite

I wanted to add datasette-dashboards to the official Datasette plugins directory, but there was a catch: since most of the plugins listed there are written by me, the site has some baked in expectations: in particular, it expects that plugins will all be using the GitHub releases feature (for example) to announce their releases.

Romain’s plugin wasn’t using that feature, instead maintaining its own changelog file.

I’ve been meaning to make the plugin directory more forgiving for a while. I decided to switch from using GitHub releases as the definitive source of release information to using releases published to PyPI (the Python package index) instead.

PyPI offers a stable JSON API: https://pypi.org/pypi/datasette-dashboards/json—which includes information on the package and all of its releases.

To reliably pull that information into datasette.io I decided on a two-step process. First, I set up a Git scraper to archive the data that I cared about into a new repository called pypi-datasette-packages.

That repo stores the current PyPI JSON for every package listed on the Datasette website. This means I can see changes made to those files over time by browsing the commit history. It also means that if PyPI is unavailable I can still build and deploy the site.

Then I wrote a new tool called pypi-to-sqlite to load that data into SQLite database tables. You can try that out like so:

pip install pypi-to-sqlite
pypi-to-sqlite pypi.db datasette-dashboards pypi-to-sqlite --prefix pypi_

That --prefix option causes the tables to be created with the specified prefix in their names.

Here are the three tables generated by that command:

Using data from these tables I was able to rework the SQL view that powers the plugins and tools directories on the site, and now datasette-dashboards has its own page there.

shot-scraper 0.10 and 0.11

shot-scraper is my tool for taking automated screenshots of web pages, built on top of Playwright.

Ben Welsh has been a key early adopter of shot-scraper, using it to power his news-homepages project which takes screenshots of various news websites and then both tweets the results and uploads them to the News Homepages collection on the Internet Archive.

shot-scraper 0.10 is mostly Ben’s work: he contributed both a --timeout option and a --browser option to let you install and use browsers other than the Chromium default!

(Ben needed this because some news homepages were embedding videos in a format that wasn’t supported by Chromium but did work fine in regular Chrome.)

Ryan Cheley also contributed to 0.10—thanks to Ryan, the shot-scraper multi command now continues taking shots even if one of them fails, unless you pass the --fail-on-error flag.

In writing my weeknotes, I decided to use shot-scraper to take a screenshot of the signed in homepage of the www.pillarpointstewards.com site.

In doing so, I found out that Google SSO refuses to work with the default Playwright Chromium! But it does continue to work with Firefox, so I fixed the shot-scraper auth to support the --browser option.

I took the screenshot like this:

shot-scraper auth https://www.pillarpointstewards.com/ -b firefox auth.json
# Now manually sign in with Auth0 and Google
shot-scraper https://www.pillarpointstewards.com/ -b firefox -a auth.json \
  --javascript "
    Array.from(
      document.querySelectorAll('[href^=tel]')
    ).forEach(el => el.innerHTML = '(xxx) xxx-xxxx')"

That --javascript line there redacts the phone numbers that are displayed on the page to signed in volunteers.

I created the second screenshot of just the tide times chart using this:

shot-scraper https://www.pillarpointstewards.com/shifts/182/ \
  -b firefox -a auth.json \
  --selector '.primary h2:nth-child(8)' \
  --selector .day-alone --padding 15

shot-scraper 0.11, released a few minutes ago, contains the new auth --browser feature plus some additional contributions from Ben Welsh, Ryan Murphy and Ian Wootten:

  • New shot-scraper accessibility --timeout option, thanks Ben Welsh. #59
  • shot-scraper auth --browser option for authentication using a browser other than Chromium. #61
  • Using --quality now results in a JPEG file with the correct .jpg extension. Thanks, Ian Wootten. #58
  • New --reduced-motion flag for emulating the “prefers-reduced-motion” media feature. Thanks, Ryan Murphy. #49

Releases this week

TIL this week

Weeknotes: datasette-auth0 one month ago

Datasette 0.61, a Twitter Space and a new Datasette plugin for authenticating against Auth0.

datasette-auth0

I’ve been figuring out how best to integrate with Auth0 for account management and sign-in, for a project I’m working on with Natalie.

I used Auth0 for VIAL with Vaccinate CA last year, following the Auth0 Django tutorial using Python Social Auth.

This time I decided to try and do it from first principles rather than use a library, for reasons I discussed in this Twitter thread.

It turns out Auth0 is using regular OAuth 2, which can be implemented from scratch in just three steps:

  1. Generate a URL to a page on Auth0 that will display the login screen
  2. Implement a page that handles the redirect back from Auth0, which needs to exchange the code from the ?code= parameter for an access token by POSTing it to an authenticated API endpoint
  3. Use that access token to retrieve the authenticated user’s profile

I wrote up the steps in this TIL: Simplest possible OAuth authentication with Auth0

Since it turned out to be pretty straight-forward, I turned it into a new authentication plugin for Datasette: datasette-auth0.

You can try that out with the live demo at datasette-auth0-demo.datasette.io—click on the top right menu icon and select “Sign in with Auth0”.

Animated demo of the datasette-auth0 sign-in flow

The live demo is deployed automatically any time a push to the main branch passes its tests. I’ve implemented this pattern a few times now, so I wrote it up in another TIL: Deploying a live Datasette demo when the tests pass.

Datasette 0.61.0 (and 0.61.1)

I wrote about these in the annotated weeknotes. They were pretty significant releases, representing 86 commits since 0.60.2 released back in January.

New features as documentation

Some of my favourite feature requests for my projects are ones that can be solved by documentation. I had a great example of that this week.

In #420: Transform command with shared context Mehmet Sukan wanted a way to speed up the following operation using sqlite-utils insert --convert—described in What’s new in sqlite-utils 3.20 and 3.21:

cat items.json | jq '.data' | sqlite-utils insert listings.db listings - --convert '
d = enchant.Dict("en_US")
row["is_dictionary_word"] = d.check(row["name"])
' --import=enchant --ignore

The --convert option lets you specify Python code that will be executed against each inserted item. Mehmet’s problem was that the enchant.Dict("en_US") operation is quite expensive, and it was being run every time around the loop.

I started looking into ways to speed this up... and realized that there was already a mechanism for doing this, but it was undocumented and I hadn’t previously realized it was even possible!

The recipe that works looks like this:

echo '[
  {"name": "notaword"},
  {"name": "word"}
]' | python3 -m sqlite_utils insert listings.db listings - --convert '
import enchant
d = enchant.Dict("en_US")

def convert(row):
    global d
    row["is_dictionary_word"] = d.check(row["name"])
'

The result:

% sqlite-utils rows listings.db listings        
[{"name": "notaword", "is_dictionary_word": 0},
 {"name": "word", "is_dictionary_word": 1}]

This takes advantage of a feature of the sqlite-utils Python snippet mechanism, which is implemented here. It first attempts exec() against the provided code to see if it defined a convert(value) function—if that fails, it composes a function body (to cover simple expressions like row["ok"] = True).

So I got to close the issue by adding some documentation showing how to do this!

Another, smaller example this week: when I figured out Extracting web page content using Readability.js and shot-scraper I learned that Playwright can accept and execute async () => {...} functions, enabling this pattern:

shot-scraper javascript https://simonwillison.net/2022/Mar/24/datasette-061/ "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}"

So I added that pattern to the shot-scraper documentation.

SQLite Happy Hour Twitter Space

I hosted my first Twitter Space. The recording and notes can be found in SQLite Happy Hour—a Twitter Spaces conversation about three interesting projects building on SQLite.

I also learned how to download the audio of a Twitter Spaces in two different ways, as documented in my TIL on Exporting and editing a Twitter Spaces recording.

Releases this week

TIL this Weeknotes

Elsewhere

26th April 2022

  • HTML event handler attributes: down the rabbit hole (via) onclick=“myfunction(event)” is an idiom for passing the click event to a function—but how does it work? It turns out the answer is buried deep in the HTML spec—the browser wraps that string of code in a function(event) { ... that string ... } function and makes the event available to its local scope that way. #26th April 2022, 8:35 pm
  • Mac OS 8 emulated in WebAssembly (via) Absolutely incredible project by Mihai Parparita. This is a full, working copy of Mac OS 8 (from 1997) running in your browser via WebAssembly—and it’s fully loaded with games and applications too. I played with Photoshop 3.0 and Civilization and there’s so much more on there to explore too—I finally get to try out HyperCard! #26th April 2022, 7:16 pm
  • Learn Go with tests. I really like this approach to learning a new language: start by learning to write tests (which gets you through hello world, environment setup and test running right from the beginning) and use them to explore the language. I also really like how modern Go development no longer depends on the GOPATH, which I always found really confusing. #26th April 2022, 7:12 pm
  • jq language description (via) I love jq but I’ve always found it difficult to remember how to use it, and the manual hasn’t helped me as much as I would hope. It turns out the jq wiki on GitHub offers an alternative, more detailed description of the language which fits the way my brain works a lot better. #26th April 2022, 7:04 pm
  • A tiny CI system (via) Christian Ştefănescu shares a recipe for building a tiny self-hosted CI system using Git and Redis. A post-receive hook runs when a commit is pushed to the repo and uses redis-cli to push jobs to a list. Then a separate bash script runs a loop with a blocking “redis-cli blpop jobs” operation which waits for new jobs and then executes the CI job as a shell script. #26th April 2022, 3:39 pm

24th April 2022

21st April 2022

  • Web Components as Progressive Enhancement (via) I think this is a key aspect of Web Components I had been missing: since they default to rendering their contents, you can use them as a wrapper around regular HTML elements that can then be progressively enhanced once the JavaScript has loaded. #21st April 2022, 9:33 pm

19th April 2022

  • Glue code to quickly copy data from one Postgres table to another (via) The Python script that Retool used to migrate 4TB of data between two PostgreSQL databases. I find the structure of this script really interesting—it uses Python to spin up a queue full of ID ranges to be transferred and then starts some threads, but then each thread shells out to a command that runs “psql COPY (SELECT ...) TO STDOUT” and pipes the result to “psql COPY xxx FROM STDIN”. Clearly this works really well (“saturate the database’s hardware capacity” according to a comment on HN), and neatly sidesteps any issues with Python’s GIL. #19th April 2022, 4:57 pm
  • Netlify Edge Functions: A new serverless runtime powered by Deno. You can now run Deno scripts directly in Netlify’s edge CDN—bundled as part of their default pricing plan. Interesting that they decided to host it on Deno’s Deno Deploy infrastructure. The hello world example is pleasingly succinct:

    export default () => new Response(“Hello world”) #19th April 2022, 4:46 pm

18th April 2022

13th April 2022

  • Litestream: Live Read Replication (via) The documentation for the read replication implemented in the latest Litestream beta (v0.4.0-beta.2). The design is really simple and clever: the primary runs a web server on a port, and replica instances can then be started with a configured URL pointing to the IP and port of the primary. That’s all it takes to have a SQLite database replicated to multiple hosts, each of which can then conduct read queries against their local copies. #13th April 2022, 2:04 am
  • Datasette for geospatial analysis (via) I added a new page to the Datasette website describing how Datasette can be used for geospatial analysis, pulling together several of the relevant plugins and tools from the Datasette ecosystem. #13th April 2022, 12:48 am

7th April 2022

  • datasette-dashboards (via) Romain Clement’s datasette-dashboards plugin lets you configure dashboards for Datasette using YAML, combining markdown blocks, Vega graphs and single number metrics using a layout powered by CSS grids. This is a beautiful piece of software design, with a very compelling live demo. #7th April 2022, 6:36 pm

29th March 2022

  • WebAssembly in my Browser Desktop Environment (via) Dustin Brett built the WebAssembly demo to end all WebAssembly demos: his daedalOS browser desktop environment simulates a Windows-style operating system, and bundles WebAssembly projects that include v86 for 486 emulation, js-dos for DOS emulation to run Doom, BoxedWine to run Wine applications like Notepad++, Ruffle to emulate Flash, ffmpeg.wasm to power audio and video conversion, WASM-ImageMagick for image conversion, Pyodide for a Python shell and more besides that! #29th March 2022, 1:26 am

24th March 2022

  • geoBoundaries. This looks useful: “The world’s largest open, free and research-ready database of political administrative boundaries.” Founded by the geoLab at William & Mary university, and released under a Creative Commons Attribution license that includes a requirement for a citation. File formats offered include shapefiles, GeoJSON and TopoJSON. #24th March 2022, 2:03 pm
  • DAOs are, I think, one of the best illustrations of the problem with a lot of these Web3 projects: They are trying to find technological solutions that will somehow codify very complex social structures. A lot of them also seem to operate under the assumption that everyone is acting in good faith, and that project members’ interests will generally align—a baffling assumption given the amount of bad actors in the crypto space.

    Molly White # 24th March 2022, 11:07 am

17th March 2022

  • Deno by example (via) Interesting approach to documentation: a big list of annotated examples illustrating the Deno way of solving a bunch of common problems. #17th March 2022, 1:02 am