Simon Willison’s Weblog

Recent entries

Weeknotes number 100 12 hours ago

This entry marks my 100th weeknotes, which I’ve managed to post once a week (plus or minus a few days) consistently since 13th September 2019.

I started writing weeknotes to add some accountability to the work I was doing during my JSK fellowship year at Stanford. The fellowship ended over a year ago but I’ve stuck to the habit—I’ve been finding it really helpful as a structured approach to thinking about my work every week, and it occasionally helps motivate me to get things done enough that I have something I can write about!

Datasette Desktop 0.2.0

My big achievement this week was Datasette Desktop 0.2.0 (and the 0.2.1 patch release that followed). I published annotated release notes for that a few days ago. I’m really pleased with the release—I think Datasette as a desktop application is going to significantly increase the impact of the project.

I also sent out an issue of the Datasette Newsletter promoting the new desktop application.

Datasette Desktop for Windows

I did a quick research spike to investigate the feasibility of publishing a Windows version of Datasette Desktop. To my surprise, I managed to get a working prototype going with just a small amount of work.

Electron claims to solve cross-platform development and it seems to uphold that claim pretty well!

I’m still quite a bit of work away from having a release: I need to learn how to build and sign Windows installers. But this is a very promising first step.

json-flatten

I’ve started thinking about how I can enable Datasette Desktop users to configure plugins without having to hand-edit plugin configuration JSON (the current mechanism).

This made me take another look at a small library I released a couple of years ago, json-flatten, which turns a nested JSON object into a set of flat key/value pairs suitable for editing using an HTML form and then unflattens that data later on.

>>> import json_flatten
>>> json_flatten.flatten({"foo": {"bar": [1, True, None]}})
{'foo.bar.[0]$int': '1', 'foo.bar.[1]$bool': 'True', 'foo.bar.[2]$none': 'None'}
>>> json_flatten.unflatten(_)
{'foo': {'bar': [1, True, None]}}

It turns out a few people have been using the library, and had filed issues—I released version 0.2 with a couple of fixes.

TIL this week

Releases this week

Datasette Desktop 0.2.0: The annotated release notes five days ago

Datasette Desktop is a new macOS desktop application version of Datasette, an “open source multi-tool for exploring and publishing data” built on top of SQLite. I released the first version last week—I’ve just released version 0.2.0 (and a 0.2.1 bug fix) with a whole bunch of critical improvements.

You can see the release notes for 0.2.0 here, but as I’ve done with Datasette in the past I’ve decided to present an annotated version of those release notes providing further background on each of the new features.

The plugin directory

A new plugin directory for installing new plugins and upgrading or uninstalling existing ones. Open it using the “Plugins -> Install and Manage Plugins...” menu item. #74

Demo showing installing and upgrading a plugin

This was the main focus for the release. Plugins are a key component of both Datasette and Datasette Desktop: my goal is for Datasette to provide a robust core for exploring databases, with a wide array of plugins that support any additional kind of visualization, exploration or data manipulation capability that a user might want.

Datasette Desktop goes as far as bundling an entire standalone Python installation just to ensure that plugins will work correctly, and invisibly sets up a dedicated Python virtual environment for plugins to install into when you first run the application.

The first version of the app allowed users to install plugins by pasting their name into a text input field. Version 0.2.0 is a whole lot more sophisticated: the single input field has been replaced by a full plugin directory interface that shows installed vs. available plugins and provides “Install”, “Upgrade” and “Uninstall” buttons depending on the state of the plugin.

When I set out to build this I knew I wanted to hit this JSON API on datasette.io to fetch the list of plugins, and I knew I wanted a simple searchable index page. Then I realized I also wanted faceted search, so I could filter for installed vs not-yet-installed plugins.

Datasette’s built-in table interface already implements faceted search! So I decided to use that, with some custom templates to add the install buttons and display the plugins in a more suitable format.

The first challenge was getting the latest list of plugins into my Datasette instance. I built this into the datasette-app-support plugin using the startup() plugin hook—every time the server starts up it hits that API and populates an in-memory table with the returned data.

The data from the API is then extended with four extra columns:

  • "installed" is set to “installed” or “not installed” depending on whether the plugin has already been installed by the user
  • "Installed_version" is the currently installed version of the plugin
  • "upgrade" is the string “upgrade available” or None—allowing the user to filter for just plugins that can be upgraded
  • "default" is set to 1 if the plugin is a default plugin that came with Datasette

The data needed to build the plugin table is gathered by these three statements:

# Imports live at the top of the plugin module; the rest runs inside the
# async startup() hook, where `datasette` is available.
import os

import httpx

plugins = httpx.get(
    "https://datasette.io/content/plugins.json?_shape=array"
).json()
# Annotate with list of installed plugins
installed_plugins = {
    plugin["name"]: plugin["version"]
    for plugin in (await datasette.client.get("/-/plugins.json")).json()
}
default_plugins = (os.environ.get("DATASETTE_DEFAULT_PLUGINS") or "").split()

The first statement fetches the full list of known plugins from the Datasette plugin directory.

The second makes an internal API call to the Datasette /-/plugins.json endpoint using the datasette.client mechanism to discover what plugins are currently installed and their versions.

The third loads a space-separated list of default plugins from the DATASETTE_DEFAULT_PLUGINS environment variable.

That last one deserves further explanation. Datasette Desktop now ships with some default plugins, and the point of truth for what those are lives in the Electron app codebase—because that’s where the code responsible for installing them is.

Five plugins are now installed by default: datasette-vega, datasette-cluster-map, datasette-pretty-json, datasette-edit-schema and datasette-configure-fts. #81

The plugin directory needs to know what these defaults are so it can avoid showing the “uninstall” button for those plugins. Uninstalling them currently makes no sense because Datasette Desktop installs any missing dependencies when the app starts, which would instantly undo the user’s uninstall decision.

An environment variable felt like the most straightforward way to expose that list of default plugins to the underlying Datasette server!

I plan to make default plugins uninstallable in the future, but doing so requires a mechanism for persisting user preference state which I haven’t built yet (see issue #101).
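
Putting those three pieces together, the rows for the in-memory plugins table can be assembled with the four extra columns described above. This is a minimal sketch of that step—the variable names and version comparison are illustrative assumptions, not the plugin’s actual code:

rows = []
for plugin in plugins:
    name = plugin["name"]
    installed_version = installed_plugins.get(name)
    rows.append({
        **plugin,
        "installed": "installed" if installed_version else "not installed",
        "installed_version": installed_version,
        # "upgrade available" if the directory lists a newer version
        "upgrade": "upgrade available"
        if installed_version and installed_version != plugin.get("version")
        else None,
        "default": 1 if name in default_plugins else 0,
    })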

A log on the loading screen

The application loading screen now shows a log of what is going on. #70

The first time you launch the Datasette Desktop application it creates a virtual environment and installs datasette, datasette-app-support and the five default plugins (plus their dependencies) into that environment.

Animated demo of the Datasette Desktop launch screen showing the log scrolling past

This can take quite a few seconds, during which the original app would show an indeterminate loading indicator.

Personally I hate loading indicators which don’t show the difference between something that’s working and something that’s eternally hung. Since I can’t estimate how long it will take, I decided to pipe the log of what the pip install command is doing to the loading screen itself.

For most users this will be meaningless, but it will hopefully help communicate “I’m installing extra stuff that I need”. Advanced users may find it useful too, especially for bug reporting if something goes wrong.

Under the hood I implemented this using a Node.js EventEmitter. I use the same trick to forward server log output to the “Debug -> Show Server Log” interface.
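
The app itself does this in Node.js, but the underlying pattern is simply reading the installer’s output line by line as it arrives and relaying each line to the UI. Here’s a minimal Python sketch of that pattern, using a subprocess pipe in place of the EventEmitter:

import subprocess
import sys

# Run pip and forward each line of its output as soon as it is produced
proc = subprocess.Popen(
    [sys.executable, "-m", "pip", "install", "datasette"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
for line in proc.stdout:
    print(line, end="")  # in the app, each line is emitted to the loading screen
proc.wait()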

Example CSV files

The welcome screen now invites you to try out the application by opening interesting example CSV files, taking advantage of the new “File -> Open CSV from URL...” feature. #91

Previously Datasette Desktop wouldn’t do anything at all until you opened up a CSV or SQLite database, and I have a hunch that unlike me most people don’t have good examples of those to hand at all times!

The new welcome screen offers example CSV files that can be opened directly from the internet. I implemented this using a new API at datasette.io/content/example_csvs (add .json for the JSON version) which is loaded by code running on that welcome page.

I have two examples at the moment, for the Squirrel Census and the London Fire Brigade’s animal rescue data. I’ll be adding more in the future.

The API itself is a great example of the Baked Data architectural pattern in action: the data itself is stored in this hand-edited YAML file, which is compiled to SQLite every time the site is deployed.
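
The welcome page consumes that API from JavaScript, but you can fetch the same data yourself. Here’s a rough Python equivalent of the lookup, using Datasette’s standard ?_shape=array parameter to get a plain JSON array:

import httpx

examples = httpx.get(
    "https://datasette.io/content/example_csvs.json?_shape=array"
).json()
for example in examples:
    print(example)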

To get this feature working I added a new “Open CSV from URL” capability to the app, which is also available in the File menu. Under the hood this works by passing the provided URL to the new /-/open-csv-from-url API endpoint. The implementation was surprisingly fiddly because I wanted to consume the CSV file using an asynchronous HTTP client—I ended up using an adaptation of some example code from the aiofile README.
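
The general shape of the problem—fetching a CSV file over async HTTP and parsing the rows—looks something like this sketch, which uses httpx streaming rather than the aiofile-based code the plugin actually adapted:

import asyncio
import csv
import io

import httpx

async def fetch_csv_rows(url):
    # Stream the response body asynchronously, then parse it as CSV
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            text = "".join([chunk async for chunk in response.aiter_text()])
    return list(csv.DictReader(io.StringIO(text)))

# Example usage (the URL here is a placeholder):
rows = asyncio.run(fetch_csv_rows("https://example.com/data.csv"))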

Recently opened files and “Open with Datasette”

Recently opened .db and .csv files can now be accessed from the new “File -> Open Recent” menu. Thanks, Kapilan M! #54

This was the project’s first external contribution! Kapilan M figured out a way to hook into the macOS “recent files” mechanism from Electron, and I expanded that to cover SQLite databases in addition to CSV files.

When a recent file is selected, Electron fires the “open-file” event. This same event is fired when a file is opened using “Open With -> Datasette” or dragged onto the application’s dock.

This meant I needed to tell the difference between a CSV file and a SQLite database file, which I do by checking whether the first 16 bytes of the file match the SQLite header: SQLite format 3\0.
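
The check itself is tiny. The app performs it in the Electron code, but the same idea in Python looks like this:

SQLITE_HEADER = b"SQLite format 3\x00"

def is_sqlite_db(path):
    # A SQLite database file always starts with this 16-byte header string
    with open(path, "rb") as fp:
        return fp.read(16) == SQLITE_HEADER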

.db and .csv files can now be opened in Datasette starting from the Finder using “Right Click -> Open With -> Datasette”. #40

Registering Datasette as a file handler for .csv and .db was not at all obvious. It turned out to involve adding the following to the Electron app’s package.json file:

  "build": {
    "appId": "io.datasette.app",
    "mac": {
      "category": "public.app-category.developer-tools",
      "extendInfo": {
        "CFBundleDocumentTypes": [
          {
            "CFBundleTypeExtensions": [
              "csv",
              "tsv",
              "db"
            ],
            "LSHandlerRank": "Alternate"
          }
        ]
      }
    }
  }

The Debug Menu

A new Debug menu can be enabled using “Datasette -> About Datasette -> Enable Debug Menu”.

The debug menu existed previously in development mode, but with 0.2.0 I decided to expose it to end users. I didn’t want to show it to people who weren’t ready to see it, so you have to first enable it using a button on the about menu.

The most interesting option there is “Run Server Manually”.

Most of the time when you are using the app there’s a datasette Python server running under the hood, but it’s entirely managed by the Node.js child_process module.

When developing the application (or associated plugins) it can be useful to manually run that server rather than having it managed by the app, so you can see more detailed error messages or even add the --pdb option to drop into a debugger should something go wrong.

To run that server, you need the Electron app to kill its own version... and you then need to know things like what port it was running on and which environment variables it was using.

Here’s what you see when you click the “Run Server Manually” debug option:

Run server manually? Clicking OK will terminate the Datasette server used by this app. Copy this command to a terminal to manually run a replacement

Here’s that command in full:

DATASETTE_API_TOKEN="0ebb45444ba4cbcbacdbcbb989bb0cd3aa10773c0dfce73c0115868d0cee2afa" DATASETTE_SECRET="4a8ac89d0d269c31d99059933040b4511869c12dfa699a1429ea29ee3310a850" DATASETTE_DEFAULT_PLUGINS="datasette datasette-app-support datasette-vega datasette-cluster-map datasette-pretty-json datasette-edit-schema datasette-configure-fts datasette-leaflet" /Users/simon/.datasette-app/venv/bin/datasette --port 8002 --version-note xyz-for-datasette-app --setting sql_time_limit_ms 10000 --setting max_returned_rows 2000 --setting facet_time_limit_ms 3000 --setting max_csv_mb 0

This is a simulation of the command that the app itself used to launch the server. Pasting that into a terminal will produce an exact copy of the original process—and you can add --pdb or other options to further customize it.

Bonus: Restoring the in-memory database on restart

This didn’t make it into the formal release notes, but it’s a fun bug that I fixed in this release.

Datasette Desktop defaults to opening CSV files in an in-memory database. You can import them into an on-disk database too, but if you just want to start exploring CSV data in Datasette I decided an in-memory database would be a better starting point.

There’s one problem with this: installing a plugin requires a Datasette server restart, and restarting the server clears the content of that in-memory database, causing any tables created from imported CSVs to disappear. This is confusing!

You can follow my progress on this in issue #42: If you open a CSV and then install a plugin the CSV table vanishes. I ended up solving it by adding code that dumps the “temporary” in-memory database to a file on disk before a server restart, restarts the server, then copies that disk backup into memory again.

This works using two custom API endpoints added to the datasette-app-support plugin:

  • POST /-/dump-temporary-to-file with {"path": "/path/to/backup.db"} dumps the contents of that in-memory temporary database to the specified file.
  • POST /-/restore-temporary-from-file with {"path": "/path/to/backup.db"} restores the contents back again.

These APIs are called from the startOrRestart() method any time the server restarts, using a file path generated by Electron like this:

const path = require("path");
const crypto = require("crypto");
const { app } = require("electron");

const backupPath = path.join(
  app.getPath("temp"),
  `backup-${crypto.randomBytes(8).toString("hex")}.db`
);

The file is deleted once it has been restored.

After much experimentation, I ended up using the db.backup(other_connection) method that was added to Python’s sqlite3 module in Python 3.7. Since Datasette Desktop bundles its own copy of Python 3.9 I don’t have to worry about compatibility with older versions at all.
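
Here’s a minimal sketch of the dump-and-restore idea using that API—not the plugin’s exact code, just the core of what the two endpoints do:

import sqlite3

def dump_to_file(memory_conn, path):
    # Copy every page of the in-memory database to a file on disk
    with sqlite3.connect(path) as disk_conn:
        memory_conn.backup(disk_conn)

def restore_from_file(memory_conn, path):
    # Copy the on-disk backup back into the in-memory database
    with sqlite3.connect(path) as disk_conn:
        disk_conn.backup(memory_conn)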

The rest is in the milestone

If you want even more detailed notes on what went into the release, each new feature is included in the 0.2.0 milestone, accompanied by a detailed issue with screenshots (and even a few videos) plus links to the underlying commits.

Datasette Desktop—a macOS desktop application for Datasette 10 days ago

I just released version 0.1.0 of the new Datasette macOS desktop application, the first version that end-users can easily install. I would very much appreciate your help testing it out!

Datasette Desktop

Datasette Desktop screenshot

Datasette is “an open source multi-tool for exploring and publishing data”. It’s a Python web application that lets you explore data held in SQLite databases, plus a growing ecosystem of plugins for visualizing and manipulating those databases.

Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to explore and share with the world.

There’s just one big catch: since it’s a Python web application, those users have needed to figure out how to install and run Python software in order to use it. For people who don’t live and breathe Python and the command-line this turns out to be a substantial barrier to entry!

Datasette Desktop is my latest attempt at addressing this problem. I’ve packaged up Datasette, SQLite and a full copy of Python such that users can download and uncompress a zip file, drag it into their /Applications folder and start using Datasette, without needing to know that there’s a Python web server running under the hood (or even understand what a Python web server is).

Please try it out, and send me feedback and suggestions on GitHub.

What the app does

This initial release has a small but useful set of features:

  • Open an existing SQLite database file and offer all of Datasette’s functionality, including the ability to explore tables and to execute arbitrary SQL queries.
  • Open a CSV file and offer the Datasette table interface (example here). By default this uses an in-memory database that gets cleared when the app shuts down, or you can...
  • Import CSV files into tables in on-disk SQLite databases (including creating a new blank database first).
  • By default the application runs a local web server which only accepts connections from your machine... but you can change that in the “File -> Access Control” menu to allow connections from anyone on your network. This includes Tailscale networks too, allowing you to run the application on your home computer and then access it securely from other devices such as your mobile phone anywhere in the world.
  • You can install plugins! This is the most exciting aspect of this initial release: it’s already in a state where users can customize it and developers can extend it, either with Datasette’s existing plugins (69 and counting) or by writing new ones.

How the app works

There are three components to the app:

  • A macOS wrapper application
  • Datasette itself
  • The datasette-app-support plugin

The first is the macOS application itself. This is currently written with Electron, and bundles a full copy of Python 3.9 (based on python-build-standalone by Gregory Szorc). Bundling Python is essential: the principal goal of the app is to allow people to use Datasette who aren’t ready to figure out how to install their own Python environment. Having an isolated and self-contained Python is also a great way of avoiding making XKCD 1987 even worse.

The macOS application doesn’t actually include Datasette itself. Instead, on first launch it creates a new Python virtual environment (currently in ~/.datasette-app/venv, feedback on that location welcome) and installs the other two components: Datasette and the datasette-app-support plugin.

Having a dedicated virtual environment is what enables the “Install Plugin” menu option. When a plugin is installed the macOS application runs pip install name-of-plugin and then restarts the Datasette server process, causing it to load that new plugin.
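
The Electron app drives this with Node.js child processes; here’s a rough illustration of the flow (the venv path is an assumption based on the default ~/.datasette-app/venv location, and the function name is hypothetical):

import os
import subprocess

VENV_PIP = os.path.expanduser("~/.datasette-app/venv/bin/pip")

def install_plugin(name):
    # Install into the app's dedicated virtual environment...
    subprocess.run([VENV_PIP, "install", name], check=True)
    # ...then the app restarts the Datasette server so the plugin is loaded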

The datasette-app-support plugin is designed exclusively to work with this application. It adds API endpoints that the Electron shell can use to trigger specific actions, such as “import from this CSV file” or “attach this SQLite database”—these are generally triggered by macOS application menu items.

It also adds a custom authentication mechanism. The user of the app should have special permissions: only they should be able to import a CSV file from anywhere on their computer into Datasette. But for the “network share” feature I want other users to be able to access the web application.
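
A hypothetical sketch of a mechanism along these lines—using Datasette’s actor_from_request() plugin hook and a shared secret token—might look like this (the header handling and actor shape are assumptions, not the plugin’s actual design):

import os

from datasette import hookimpl

@hookimpl
def actor_from_request(request):
    # Requests that present the app's secret token act as the privileged user
    expected = os.environ.get("DATASETTE_API_TOKEN")
    token = request.headers.get("authorization", "").replace("Bearer ", "")
    if expected and token == expected:
        return {"id": "app-user"}  # gets the special permissions
    return None  # everyone else (e.g. network share visitors) stays anonymous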

An interesting consequence of installing Datasette on first-run rather than bundling it with the application is that the user will be able to upgrade to future Datasette releases without needing to re-install the application itself.

How I built it

I’ve been building this application completely in public over the past two weeks, writing up my notes and research in GitHub issues as I went (here’s the initial release milestone).

I had to figure out a lot of stuff!

First, Electron. Since almost all of the user-facing interface is provided by the existing Datasette web application, Electron was a natural fit: I needed help powering native menus and bundling everything up as an installable application, which Electron handles extremely well.

I also have ambitions to get a Windows version working in the future, which should share almost all of the same code.

Electron also has fantastic initial developer onboarding. I’d love to achieve a similar level of quality for Datasette some day.

The single biggest challenge was figuring out how to bundle a working copy of the Datasette Python application to run inside the Electron application.

My initial plan (touched on last week) was to compile Datasette and its dependencies into a single executable using PyInstaller or PyOxidizer or py2app.

These tools strip down a Python application to the minimal required set of dependencies and then use various tricks to compress that all into a single binary. They are really clever. For many projects I imagine this would be the right way to go.

I had one big problem though: I wanted to support plugin installation. Datasette plugins can have their own dependencies, and could potentially use any of the code from the Python standard library. This means that a stripped-down Python isn’t actually right for this project: I need a full installation, standard library and all.

Telling the user they had to install Python themselves was an absolute non-starter: the entire point of this project is to make Datasette available to users who are unwilling or unable to jump through those hoops.

Gregory Szorc built PyOxidizer, and as part of that he built python-build-standalone:

This project produces self-contained, highly-portable Python distributions. These Python distributions contain a fully-usable, full-featured Python installation as well as their build artifacts (object files, libraries, etc).

Sounds like exactly what I needed! I opened a research issue, built a proof-of-concept and decided to commit to that as the approach I was going to use. Here’s a TIL that describes how I’m doing this: Bundling Python inside an Electron app

(I find GitHub issue threads to be the ideal way of exploring these kinds of areas. Many of my repositories have a research label specifically to track them.)

The last key step was figuring out how to sign the application, so I could distribute it to other macOS users without them facing this dreaded dialog:

Datasette.app can't be opened because Apple cannot check it for malicious software

It turns out there are two steps to this these days: signing the code with a developer certificate, and then “notarizing” it, which involves uploading the bundle to Apple’s servers, having them scan it for malicious code and attaching the resulting approval to the bundle.

I was expecting figuring this out to be a nightmare. It ended up not too bad: I spent two days on it, but most of the work ended up being done by electron-builder—one of the biggest advantages of working within the Electron ecosystem is that a lot of people have put a lot of effort into these final steps.

I was adamant that my eventual signing and notarization solution should be automated using GitHub Actions: nothing defangs a frustrating build process more than good automation! This made things a bit harder because all of the tutorials and documentation assumed you were working with a GUI, but I got there in the end. I wrote this all up as a TIL: Signing and notarizing an Electron app for distribution using GitHub Actions (see also Attaching a generated file to a GitHub release using Actions).

What’s next

I announced the release last night on Twitter and I’ve already started getting feedback. This has resulted in a growing number of issues under the usability label.

My expectation is that most improvements made for the benefit of Datasette Desktop will benefit the regular Datasette web application too.

There’s also a strategic component to this. I’m investing a lot of development work in Datasette, and I want that work to have the biggest impact possible. Datasette Desktop is an important new distribution channel, which also means that any time I add a new feature to Datasette or build a new plugin the desktop application should see the same benefit as the hosted web application.

If I’m unlucky I’ll find this slows me down: every feature I build will need to include consideration as to how it affects the desktop application.

My intuition currently is that this trade-off will be worthwhile: I don’t think ensuring desktop compatibility will be a significant burden, and the added value from getting new features almost for free through a whole separate distribution channel should hopefully be huge!

TIL this week

Releases this week

Building a desktop application for Datasette (and weeknotes) 20 days ago

This week I started experimenting with a desktop application version of Datasette—with the goal of providing people who aren’t comfortable with the command-line the ability to get Datasette up and running on their own personal computers.

Update 8th September 2021: I made a bunch more progress over the week following this post, see Datasette Desktop—a macOS desktop application for Datasette for details or download the app to try it out.

Screenshot of the new Datasette desktop app prototype with several open windows

Why a desktop application?

On Monday I kicked off an enormous Twitter conversation when I posted:

I wonder how much of the popularity of R among some communities in comparison to Python comes down to the fact that with R you can install the RStudio desktop application and you’re ready to go

This ties into my single biggest complaint about Python: it’s just too hard for people to get started with. Setting up a Python development environment for the first time remains an enormous barrier to entry.

I later put this in stronger terms:

The more I think about this the more frustrated I get, thinking about the enormous amount of human potential that’s squandered because the barriers to getting started learning to program are so much higher than they need to be

Which made me think of glass houses. My own Datasette project has exactly the same problem: to run it locally you need to install Python and then install Datasette! Mac users can use Homebrew, but telling newcomers to install Homebrew first isn’t particularly welcoming either.

Ideally, I’d like people to be able to install a regular desktop application and start using Datasette that way, without even needing to know that it’s written in Python.

There’s been an open issue to get Datasette running as a standalone binary using PyInstaller since November 2017, with quite a bit of research.

But I want a UI as well: I don’t want to have to teach new users how to install and run a command-line application if I can avoid it.

So I decided to spend some time researching Electron to see how hard it would be to make a basic Datasette desktop application a reality.

Progress so far

The code I’ve written so far can be found in the simonw/datasette.app repository on GitHub. The app so far does the following:

  • Run a datasette server on localhost attached to an available port (found using portfinder) which terminates when the app quits.
  • Open a desktop window showing that Datasette instance once the server has started.
  • Allow additional windows onto the same instance to be opened using the “New Window” menu option or the Command+N keyboard shortcut.
  • Provide an “Open Database...” menu option (and Command+O shortcut) which brings up a file picker to allow the user to select a SQLite database file to open—once selected, this is attached to the Datasette instance and any windows showing the Datasette homepage are reloaded.

Here’s a video demo showing these features in action:

It’s very much an MVP, but I’m encouraged by the progress so far. I think this is enough of a proof of concept to be worth turning this into an actual usable product.

How this all works

There are two components to the application.

The first is a thin Electron shell, responsible for launching the Python server, managing windows and configuring the various desktop menu options used to configure it. The code for that lives in main.js.

The second is a custom Datasette plugin that adds extra functionality needed by the application. Currently this consists of a tiny bit of extra CSS to make the footer stick to the bottom of the window, and a custom API endpoint at /-/open-database-file which is called by the menu option for opening a new database.
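
As an example of what such an endpoint can look like, here’s a hypothetical sketch built on Datasette’s register_routes() plugin hook—not the actual plugin code, just the general shape of attaching a database file in response to a POST:

import json

from datasette import hookimpl
from datasette.database import Database
from datasette.utils.asgi import Response

@hookimpl
def register_routes():
    async def open_database_file(request, datasette):
        # Expects a JSON body like {"path": "/path/to/file.db"}
        data = json.loads(await request.post_body())
        datasette.add_database(
            Database(datasette, path=data["path"], is_mutable=True)
        )
        return Response.json({"ok": True})

    return [(r"^/-/open-database-file$", open_database_file)]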

Initial impressions of Electron

I know it’s cool to knock Electron, but in this case it feels like exactly the right tool for the job. Datasette is already a web application—what I need is a way to hide the configuration of that web application behind an icon, and re-present the interface in a way that feels more like a desktop application.

This is my first time building anything with Electron—here are some of my initial impressions.

  • The initial getting started workflow is really good. I started out with their Quick Start and was up and running with a barebones application that I could start making changes to in just a few minutes.
  • The documentation is pretty good, but it leans more towards being an API reference. I found myself googling for examples of different things I wanted to do pretty often.
  • The automated testing situation isn’t great. I’m using Spectron and Mocha for my initial (very thin) tests—I got them up and running in GitHub Actions, but I’ve already run into some limitations:
    • For some reason each time I run the tests an Electron window (and datasette Python process) is left running. I can’t figure out why this is.
    • There doesn’t appear to be a way for tests to trigger menu items, which is frustrating because most of the logic I’ve written so far deals with menu items! There is an open issue for this dating back to May 2016.
  • I haven’t yet managed to package my app. This is clearly going to be the biggest challenge.

Up next: packaging the app

I was hoping to get to this before writing up my progress in these weeknotes, but it looks like it’s going to be quite a challenge.

In order to produce an installable macOS app (I’ll dive into Windows later) I need to do the following:

  • Build a standalone Datasette executable, complete with the custom plugin, using PyInstaller
  • Sign that binary with an Apple developer certificate
  • Build an Electron application that bundles a copy of that datasette binary
  • Sign the resulting Electron application

I’m expecting figuring this out to be a long-winded and frustrating experience, which is more the fault of Apple than of Electron. I’m tracking my progress on this in issue #7.

Datasette 0.59a2

I pushed out a new alpha of Datasette earlier this week, partly driven by work I was doing on datasette.app.

The biggest new feature in this release is a new plugin hook: register_commands()—which lets plugins add additional commands to Datasette, e.g. datasette verify name-of-file.db.

I released a new plugin that exercises this hook called datasette-verify. Past experience has shown me that it’s crucial to ship an example plugin alongside a new hook, to help confirm that the hook design is fit for purpose.

It turns out I didn’t need this for datasette.app after all, but it’s still a great capability to have!
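
A minimal sketch of a plugin using the new hook might look like this—illustrative rather than the actual datasette-verify implementation:

import sqlite3

import click
from datasette import hookimpl

@hookimpl
def register_commands(cli):
    @cli.command()
    @click.argument("files", type=click.Path(exists=True), nargs=-1)
    def verify(files):
        "Verify that SQLite files can be opened by Datasette"
        for filename in files:
            conn = sqlite3.connect(str(filename))
            try:
                conn.execute("select * from sqlite_master")
            except sqlite3.DatabaseError:
                raise click.ClickException(f"Invalid database: {filename}")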

sqlite-utils 3.17

Quoting the release notes in full:

  • The sqlite-utils memory command has a new --analyze option, which runs the equivalent of the analyze-tables command directly against the in-memory database created from the incoming CSV or JSON data. (#320)
  • sqlite-utils insert-files now has the ability to insert file contents in to TEXT columns in addition to the default BLOB. Pass the --text option or use content_text as a column specifier. (#319)

evernote-to-sqlite 0.3.2

As a follow-up to last week’s work on my personal Dogsheep, I decided to re-import my Evernote notes... and found out that Evernote has changed their export mechanism in ways that broke my tool. Most concerningly their exported XML is even less well-formed than it used to be. This new release works around that.

TIL this week

Releases this week

Dynamic content for GitHub repository templates using cookiecutter and GitHub Actions 21 days ago

GitHub repository templates were introduced a couple of years ago to provide a mechanism for creating a brand new GitHub repository starting with an initial set of files.

They have one big limitation: the repositories that they create share the exact same contents as the template repository. They’re basically a replacement for duplicating an existing folder and using that as the starting point for a new project.

I’m a big fan of the Python cookiecutter tool, which provides a way to dynamically create new folder structures from user-provided variables using Jinja templates to generate content.

This morning, inspired by this repo by Bruno Rocha, I finally figured out a neat pattern for combining cookiecutter with repository templates to compensate for that missing dynamic content ability.

The result: datasette-plugin-template-repository for creating new Datasette plugins with a single click, python-lib-template-repository for creating new Python libraries and click-app-template-repository for creating Click CLI tools.

Cookiecutter

I maintain three cookiecutter templates at the moment:

  • datasette-plugin for creating new Datasette plugins
  • python-lib for creating new Python libraries
  • click-app for creating new Click command-line tools

Having installed cookiecutter (pip install cookiecutter) each of these can be used like so:

% cookiecutter gh:simonw/datasette-plugin
plugin_name []: visualize counties
description []: Datasette plugin for visualizing counties
hyphenated [visualize-counties]: 
underscored [visualize_counties]: 
github_username []: simonw
author_name []: Simon Willison
include_static_directory []: y
include_templates_directory []: 

Cookiecutter prompts for some variables defined in a cookiecutter.json file, then generates the project by evaluating the templates.

The challenge was: how can I run this automatically when a new repository is created from a GitHub repository template? And where can I get those variables from?

Bruno’s trick: a self-rewriting repository

Bruno has a brilliant trick for getting this to run, exhibited by this workflow YAML. His workflow starts like this:

name: Rename the project from template

on: [push]

jobs:
  rename-project:
    if: ${{ github.repository != 'rochacbruno/python-project-template' }}
    runs-on: ubuntu-latest
    steps:
       # ...

This means that his workflow only runs on copies of the original repository—the workflow is disabled in the template repository itself by that if: condition.

Then at the end of the workflow he does this:

      - uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: "Ready to clone and code"
          push_options: --force

This does a force push to replace the contents of the repository with whatever was generated by the rest of the workflow script!

This trick was exactly what I needed to get cookiecutter to work with repository templates.

Gathering variables using the GitHub GraphQL API

All three of my existing cookiecutter templates require the following variables:

  • A name to use for the generated folder
  • A one-line description to use in the README and in setup.py
  • The GitHub username of the owner of the package
  • The display name of the owner

I need values for all of these before I can run cookiecutter.

It turns out they are all available from the GitHub GraphQL API, which can be called from the initial workflow copied from the repository template!

Here’s the GitHub Actions step that does that:

- uses: actions/github-script@v4
  id: fetch-repo-and-user-details
  with:
    script: |
      const query = `query($owner:String!, $name:String!) {
        repository(owner:$owner, name:$name) {
          name
          description
          owner {
            login
            ... on User {
              name
            }
            ... on Organization {
              name
            }
          }
        }
      }`;
      const variables = {
        owner: context.repo.owner,
        name: context.repo.repo
      }
      const result = await github.graphql(query, variables)
      console.log(result)
      return result

Here I’m using the actions/github-script action, which provides a pre-configured, authenticated instance of GitHub’s octokit/rest.js JavaScript library. You can then provide custom JavaScript that will be executed by the action.

await github.graphql(query, variables) can then execute a GitHub GraphQL query. The query I’m using here gives me back the current repository’s name and description and the login and display name of the owner of that repository.

GitHub repositories can be owned by either a user or an organization—the ... on User / ... on Organization syntax provides the same result here for both types of nested object.

The output of this GraphQL query looks something like this:

{
  "repository": {
    "name": "datasette-verify",
    "description": "Verify that files can be opened by Datasette",
    "owner": {
      "login": "simonw",
      "name": "Simon Willison"
    }
  }
}

I assigned an id of fetch-repo-and-user-details to that step of the workflow, so that the return value from the script could be accessed as JSON in the next step.

Passing those variables to cookiecutter

Cookiecutter defaults to asking for variables interactively, but it also supports passing in those variables as command-line parameters.

Here’s part of my next workflow steps that executes cookiecutter using the variables collected by the GraphQL query:

- name: Rebuild contents using cookiecutter
  env:
    INFO: ${{ steps.fetch-repo-and-user-details.outputs.result }}
  run: |
    export REPO_NAME=$(echo $INFO | jq -r '.repository.name')
    # Run cookiecutter
    cookiecutter gh:simonw/python-lib --no-input \
      lib_name=$REPO_NAME \
      description="$(echo $INFO | jq -r .repository.description)" \
      github_username="$(echo $INFO | jq -r .repository.owner.login)" \
      author_name="$(echo $INFO | jq -r .repository.owner.name)"

The env: INFO: block exposes an environment variable called INFO to the step, populated with the output of the previous fetch-repo-and-user-details step—a string of JSON.

Then within the body of the step I use jq to extract out the details that I need—first the repository name:

export REPO_NAME=$(echo $INFO | jq -r '.repository.name')

Then I pass the other details directly to cookiecutter as arguments:

cookiecutter gh:simonw/python-lib --no-input \
  lib_name=$REPO_NAME \
  description="$(echo $INFO | jq -r .repository.description)" \
  github_username="$(echo $INFO | jq -r .repository.owner.login)" \
  author_name="$(echo $INFO | jq -r .repository.owner.name)"

jq -r ensures that the raw text value is returned by jq, as opposed to the JSON string value which would be wrapped in double quotes.

Cleaning up at the end

Running cookiecutter in this way creates a folder within the root of the repository that duplicates the repository name, something like this:

datasette-verify/datasette-verify

I actually want the contents of that folder to live in the root, so the next step I run is:

mv $REPO_NAME/* .
mv $REPO_NAME/.gitignore .
mv $REPO_NAME/.github .

Here’s my completed workflow.

This almost worked—but when I tried to run it for the first time I got this error:

![remote rejected] (refusing to allow an integration to create or update .github/workflows/publish.yml)

It turns out the credentials provided to GitHub Actions are forbidden from making modifications to their own workflow files!

I can understand why that limitation is in place, but it’s frustrating here. For the moment, my workaround is to do this just before pushing the final content back to the repository:

mv .github/workflows .github/rename-this-to-workflows

I leave it up to the user to rename that folder back again when they want to enable the workflows that have been generated for them.

Give these a go

I’ve set up two templates using this pattern now:

Both of these work the same way: enter a repository name and description, click “Create repository from template” and watch as GitHub creates the new repository and then, a few seconds later, runs the workflow to execute the cookiecutter template and replace the contents with the final result.

You can see examples of repositories that I created using these templates here:

Weeknotes: Getting my personal Dogsheep up and running again 28 days ago

I gave a talk about Dogsheep at Noisebridge’s Five Minutes of Fame on Thursday. Just one problem: my regular Dogsheep demo was broken, so I ended up building it from scratch again. In doing so I fixed a few bugs in some Dogsheep tools.

Dogsheep on a Digital Ocean droplet

The latest iteration of my personal Dogsheep runs on a $20/month 4GB/2CPU Digital Ocean Droplet running Ubuntu 20.04 LTS.

It runs a private Datasette instance and a bunch of cron jobs to fetch data from Twitter, GitHub, Foursquare Swarm, Pocket and Hacker News.

It also has copies of my Apple Photos and Apple HealthKit data which I upload manually—plus a copy of my genome for good measure.

Some abbreviated notes on how I set it up, copied from a private GitHub Issues thread:

  1. Create a new Ubuntu droplet, and configure its IP address as the A record for dogsheep.simonwillison.net

  2. Install Python 3 and NGINX and SQLite: apt-get install python3 python3-venv nginx sqlite -y

  3. Use letsencrypt to get an HTTPS certificate for it: apt-get update and then apt install certbot python3-certbot-nginx -y, then certbot --nginx -d dogsheep.simonwillison.net

  4. I had to remove the ipv6only=on; bit from the NGINX configuration due to this bug

  5. Created a dogsheep user, useradd -s /bin/bash -d /home/dogsheep/ -m -G

  6. As that user, created a virtual environment: python3 -mvenv datasette-venv and then datasette-venv/bin/pip install wheel and datasette-venv/bin/pip install datasette datasette-auth-passwords

  7. Created a /etc/systemd/system/datasette.service file with these contents

  8. Created a set of blank SQLite database files in WAL mode in /home/dogsheep using the following:

    for f in beta.db twitter.db healthkit.db github.db \
      swarm.db photos.db genome.db simonwillisonblog.db \
      pocket.db hacker-news.db memories.db
    do
      sqlite3 $f vacuum
      # And enable WAL mode:
      sqlite3 $f 'PRAGMA journal_mode=WAL;'
    done
    
  9. Started the Datasette service: service datasette start

  10. Configured NGINX to proxy to localhost port 8001, using this configuration

It’s a few more steps than I’d like, but the end result was a password-protected Datasette instance running against a bunch of SQLite database files on my new server.

With Datasette up and running, the next step was to start loading in data.

Importing my tweets

I started with Twitter. I dropped my Twitter API access credentials into an auth.json file (as described here) and ran the following:

source /home/dogsheep/datasette-venv/bin/activate
pip install twitter-to-sqlite
twitter-to-sqlite user-timeline /home/dogsheep/twitter.db \
  -a /home/dogsheep/auth.json
@simonw [###############################-----] 26299/29684 00:02:06

That pulled in all 29,684 of my personal tweets.

(Actually, first it broke with an error, exposing a bug that had already been reported. I shipped a fix for that and tried again and it worked.)

Favourited tweets were a little harder—I have 39,904 favourited tweets, but the Twitter API only returns the most recent 3,200. I grabbed those more recent ones with:

twitter-to-sqlite favorites /home/dogsheep/twitter.db \
  -a /home/dogsheep/auth.json

Then I requested my Twitter archive, waited 24 hours and uploaded the resulting like.js file to the server, then ran:

twitter-to-sqlite import twitter.db /tmp/like.js

This gave me an archive_like table with the data from that file—but it wasn’t the full tweet representation, just the subset that Twitter expose in the archive export.

The README shows how to inflate those into full tweets:

twitter-to-sqlite statuses-lookup twitter.db \
  --sql='select tweetId from archive_like' \
  --skip-existing
Importing 33,382 tweets [------------------------------------] 0% 00:18:28

Once that was done I wrote additional records into the favorited_by table like so:

sqlite3 twitter.db '
  INSERT OR IGNORE INTO favorited_by (tweet, user)
  SELECT tweetId, 12497 FROM archive_like
'

(12497 is my Twitter user ID.)

I also came up with a SQL view that lets me see just media attached to tweets:

sqlite-utils create-view twitter.db media_details "
select
  json_object('img_src', media_url_https, 'width', 400) as img,
  tweets.full_text,
  tweets.created_at,
  tweets.id as tweet_id,
  users.screen_name,
  'https://twitter.com/' || users.screen_name || '/status/' || tweets.id as tweet_url
from
  media
  join media_tweets on media.id = media_tweets.media_id
  join tweets on media_tweets.tweets_id = tweets.id
  join users on tweets.user = users.id
order by
  tweets.id desc
"

Now I can visit /twitter/media_details?_where=tweet_id+in+(select+tweet+from+favorited_by+where+user+=+12497) to see the most recent media tweets that I’ve favourited!

media_details view showing Twitter media I have liked

Swarm checkins

Swarm checkins were a lot easier. I needed my previously-created Foursquare API token, and swarm-to-sqlite:

pip install swarm-to-sqlite
swarm-to-sqlite /home/dogsheep/swarm.db --token=...

This gave me a full table of my Swarm checkins, which I can visualize using datasette-cluster-map:

Map of my 2021 Swarm checkins

Apple HealthKit

I don’t yet have full automation for my Apple HealthKit data (collected by my Apple Watch) or my Apple Photos—both require me to run scripts on my laptop to create the SQLite database file and then copy the result to the server via scp.

healthkit-to-sqlite runs against the export.zip produced by the Export Data button in the Apple Health app on the iPhone—for me that was a 158MB zip file which I AirDropped to my laptop and converted (after fixing a new bug) like so:

healthkit-to-sqlite ~/Downloads/export.zip healthkit.db
Importing from HealthKit  [-----------------------------]    2%  00:02:25

I uploaded the resulting 1.5GB healthkit.db file and now I can do things like visualize my 2017 San Francisco Half Marathon run on a map:

A map of the half marathon I ran

Apple Photos

For my photos I use dogsheep-photos, which I described last year in Using SQL to find my best photo of a pelican according to Apple Photos. The short version: I run this script on my laptop:

# Upload original photos to my S3 bucket
dogsheep-photos upload photos.db \
  ~/Pictures/Photos\ Library.photoslibrary/originals
dogsheep-photos apple-photos photos.db \
  --image-url-prefix "https://photos.simonwillison.net/i/" \
  --image-url-suffix "?w=600"
scp photos.db dogsheep:/home/dogsheep/photos.db

photos.db is only 171MB—it contains the metadata, including the machine learning labels, but not the photos themselves.

And now I can run queries for things like photos of food I’ve taken in 2021:

Some photos of food

Automation via cron

I’m still working through the last step, which involves setting up cron tasks to refresh my data periodically from various sources. My crontab currently looks like this:

# Twitter
1,11,21,31,41,51 * * * * /home/dogsheep/datasette-venv/bin/twitter-to-sqlite user-timeline /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --since
4,14,24,34,44,54 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite mentions-timeline /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --since
11 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite user-timeline /home/dogsheep/twitter.db cleopaws -a /home/dogsheep/auth.json --since
6,16,26,36,46,56 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite favorites /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --stop_after=50

# Swarm
25 */2 * * * /home/dogsheep/datasette-venv/bin/swarm-to-sqlite /home/dogsheep/swarm.db --token=... --since=2w

# Hacker News data every six hours
35 0,6,12,18 * * * /home/dogsheep/datasette-venv/bin/hacker-news-to-sqlite user /home/dogsheep/hacker-news.db simonw

# Re-build dogsheep-beta search index once an hour
32 * * * * /home/dogsheep/datasette-venv/bin/dogsheep-beta index /home/dogsheep/beta.db /home/dogsheep/dogsheep-beta.yml

I’ll be expanding this out as I configure more of the Dogsheep tools for my personal instance.

TIL this week

Releases this week

Elsewhere

10th September 2021

  • Imagine writing the investment memo for “20% of a picture of a dog” and being like “the most we should pay is probably about $2 million because the whole picture of the dog sold for $4 million three months ago and it can’t realistically have appreciated more than 150% since then; even if the whole picture of the dog is worth, aggressively, $10 million, this share would be worth $2 million.” What nonsense that is!

    Matt Levine # 10th September 2021, 7:27 am

8th September 2021

  • Datasette Desktop 0.1.0 (via) This is the first installable version of the new Datasette Desktop macOS application I’ve been building. Please try it out and leave feedback on Twitter or on the GitHub Discussions thread linked from the release notes. #8th September 2021, 5:14 am

7th September 2021

  • We never shipped a great commercial product. The reason for that is we didn’t focus. We tried to do a little bit of everything. It’s hard enough to maintain the growth of your developer community and build one great commercial product, let alone three or four, and it is impossible to do both, but that’s what we tried to do and we spent an enormous amount of money doing it.

    Solomon Hykes # 7th September 2021, 2:47 pm

6th September 2021

  • Making world-class docs takes effort (via) Curl maintainer Daniel Stenberg writes about his principles for good documentation. I agree with all of these: he emphasizes keeping docs in the repo, avoiding the temptation to exclusively generate them from code, featuring examples and ensuring every API you provide has documentation. Daniel describes an approach similar to the documentation unit tests I’ve been using for my own projects: he has scripts which scan the curl documentation to ensure not only that everything is documented but that each documentation area contains the same sections in the same order. #6th September 2021, 6:58 pm

3rd September 2021

  • Per-project PostgreSQL (via) Jamey Sharp describes an ingenious way of setting up PostgreSQL instances for each of your local development project, without depending on an always-running shared localhost database server. The trick is a shell script which creates a PGDATA folder in the current folder and then instantiates a PostgreSQL server in --single single user mode which listens on a Unix domain socket in that folder, instead of listening on the network. Jamey then uses direnv to automatically configure that PostgreSQL, initializing the DB if necessary, for each of his project folders. #3rd September 2021, 3:06 am

30th August 2021

25th August 2021

  • Allo shows the ultimate failure of Google’s Minimum Viable Product strategy. MVP works when you have almost no competition, or if you are taking a radically different approach to what’s on the market, but it completely falls on its face when you are just straight-up cloning an established competitor. There’s no reason to use a half-baked WhatsApp clone when regular WhatsApp exists.

    Ron Amadeo # 25th August 2021, 10:28 pm

  • API Tokens: A Tedious Survey. Thomas Ptacek reviews different approaches to implementing secure API tokens, from simple random strings stored in a database through various categories of signed token to exotic formats like Macaroons and Biscuits, both new to me.

    Macaroons carry a signed list of restrictions with them, but combine it with a mechanism where a client can add their own additional restrictions, sign the combination and pass the token on to someone else.

    Biscuits are similar, but “embed Datalog programs to evaluate whether a token allows an operation”. #25th August 2021, 12:12 am

24th August 2021

  • SQLModel. A new project by FastAPI creator Sebastián Ramírez: SQLModel builds on top of both SQLAlchemy and Sebastián’s Pydantic validation library to provide a new ORM that’s designed around Python 3’s optional typing. The real brilliance here is that a SQLModel subclass is simultaneously a valid SQLAlchemy ORM model AND a valid Pydantic validation model, saving on duplicate code by allowing the same class to be used both for form/API validation and for interacting with the database. #24th August 2021, 11:16 pm
  • How Discord Stores Billions of Messages (via) Fascinating article from 2017 describing how Discord migrated their primary message store to Cassandra (from MongoDB, but I could easily see them making the same decision if they had started with PostgreSQL or MySQL). The trick with scalable NoSQL databases like Cassandra is that you need to have a very deep understanding of the kinds of queries you will need to answer—and Discord had exactly that. In the article they talk about their desire to eventually migrate to Scylla (a compatible Cassandra alternative written in C++)—in the Hacker News comments they confirm that in 2021 they are using Scylla for a few things but they still have their core messages in Cassandra. #24th August 2021, 9:31 pm

23rd August 2021

  • It’s perhaps a very English thing to find it hard to accept kind words about oneself. If anyone praised me in my early days as a comedy performer I would say, “Oh, nonsense. Shut up. No really, I was dreadful.” I remember going through this red-faced shuffle in the presence of the mighty John Cleese who upbraided me the moment we were alone. ‘You genuinely think you’re being polite and modest, don’t you?’ ‘Well, you know …’ ‘Don’t you see that when someone hears their compliments contradicted they naturally assume that you must think them a fool? [..] ‘It’s so simple. You just say thank you. You just thank them. How hard is that?’

    Stephen Fry # 23rd August 2021, 10:33 pm

  • At critical moments in time, you can raise the aspirations of other people significantly, especially when they are relatively young, simply by suggesting they do something better or more ambitious than what they might have in mind.  It costs you relatively little to do this, but the benefit to them, and to the broader world, may be enormous.

    Tyler Cowen # 23rd August 2021, 8:02 pm

  • The rapid increase of COVID-19 cases among kids has shattered last year’s oft-repeated falsehood that kids don’t get COVID-19, and if they do, it’s not that bad. It was a convenient lie that was easy to believe in part because we kept most of our kids home. With remote learning not an option now, this year we’ll find out how dangerous this virus is for children in the worst way possible.

    Dan Sinker # 23rd August 2021, 1:59 am

22nd August 2021

  • MDN: Subdomain takeovers (via) MDN have a page about subdomain takeover attacks that focuses more on CNAME records: if you have a CNAME pointing to a common delegated hosting provider but haven’t yet provisioned your virtual host there, someone else might beat you to it and use it for an XSS attack.

    “Preventing subdomain takeovers is a matter of order of operations in lifecycle management for virtual hosts and DNS.”

    I now understand why Google Cloud make you “prove” your ownership of a domain before they’ll let you configure it to host e.g. a Cloud Run instance. #22nd August 2021, 5:31 am
  • I stumbled across a nasty XSS hole involving DNS A records. Found out today that an old subdomain that I had assigned an IP address to via a DNS A record was serving unexpected content—turned out I’d shut down the associated VPS and the IP had been recycled to someone else, so their content was now appearing under my domain. It strikes me that if you got really unlucky this could turn into an XSS hole—and that new server could even use Let’s Encrypt to obtain an HTTPS certificate for your subdomain.

    I’ve added “audit your A records” to my personal security checklist. #22nd August 2021, 5:27 am