Simon Willison’s Weblog

On datasettedesktop 5 management 37 sqlite 140 javascript 584 fly 9 ...

 

Recent entries

Finding and reporting an asyncio bug in Python 3.10 seven days ago

I found a bug in Python 3.10 today! Some notes on how I found it and my process for handling it once I figured out what was going on.

Testing Datasette against Python 3.10

I finally got around to attempting to upgrade Datasette to the just-released Python 3.10 this morning. I started by adding "3.10" to the matrix of Python versions that Datasette is tested against by GitHub Actions:

strategy:
  matrix:
    python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]

That line previously looked like this:

python-version: [3.6, 3.7, 3.8, 3.9]

I switched to using quoted strings here based on a tip from Jeff Triplett, who pointed out that 3.10 is actually a floating point number in YAML syntax which will be treated as referring to Python 3.1 instead!

This one-line change in a pull request was all it took to run the tests against the new version of Python... and they failed, spectacularly.

Annoyingly the root error wasn’t displayed directly in the tests, because it was triggered and then caught by Datasette’s own error handling and turned into a 500 HTTP error. So, I needed to run Python 3.10 locally and it the debugger.

Running Python 3.10 locally

I occasionally use pyenv to manage multiple Python versions on my machine. I have this installed through Homebrew.

I had to run brew upgrade pyenv first, because my installed version didn’t know about Python 3.10. Then I ran this:

pyenv install 3.10.0

Which gave me a Python 3.10 installation on my Mac in ~/.pyenv/versions/3.10.0/bin/python.

Finally, I created a new virtual environment using this version of Python.

~/.pyenv/versions/3.10.0/bin/python \
  -m venv /tmp/py310
# Now activate that virtual environment:
source /tmp/py310/bin/activate

I put this in my /tmp directory so I don’t have to remember to clean it up later.

Having done this, I installed my Datasette test dependencies into the new environment and used pytest to run my tests:

% cd ~/Dropbox/Development/datasette
% pip install -e '.[test]'

Running pytest -x --pdb stopped at the first failing test and dropped me into a debugger where I could finally access the full traceback, which looked something like this:

  File "datasette/datasette/views/base.py", line 122, in dispatch_request
    await self.ds.refresh_schemas()
  File "datasette/datasette/app.py", line 344, in refresh_schemas
    await self._refresh_schemas()
  File "datasette/datasette/app.py", line 349, in _refresh_schemas
    await init_internal_db(internal_db)
  File "datasette/datasette/utils/internal_db.py", line 5, in init_internal_db
    await db.execute_write(
  File "datasette/datasette/database.py", line 102, in execute_write
    return await self.execute_write_fn(_inner, block=block)
  File "datasette/datasette/database.py", line 113, in execute_write_fn
    reply_queue = janus.Queue()
  File "janus/__init__.py", line 39, in __init__
    self._async_not_empty = asyncio.Condition(self._async_mutex)
  File "lib/python3.10/asyncio/locks.py", line 234, in __init__
    raise ValueError("loop argument must agree with lock")
ValueError: loop argument must agree with lock

So, something is going long with asyncio.Condition(), which is being called by janus.Queue().

Some background on Janus

The Janus library referenced here describes itself as a “Thread-safe asyncio-aware queue for Python”.

I use it in Datasette for the write queue—a mechanism where writes can be safely made to a SQLite database by first queueing them up in memory.

Janus provides a mechanism that lets asyncio event tasks send and receive messages to Python threads, and vice-versa, via a queue class that is modelled on Python’s own standard library queues. It’s really neat.

Tracking the investigation in issue comments

Any time I’m investigating a bug I make sure that there’s an associated GitHub issue or pull request, to give me somewhere to build up detailed notes of my investigation.

I used my PR #1481 for this. Now that I had an error message—“loop argument must agree with lock”—I did some initial Google searches and found to my surprise that it had very few existing mentions—my first clue that this might be related to the new Python 3.10 release itself.

I tracked down the relevant source code in both Python and the Janus library based on the traceback and linked to them from the issue comments. I pasted in copies of the relevant code too, since GitHub only magically embeds code from the same repository, not code from other repositories. You can see those comments here.

I always like to include copies of the code most likely to be the cause of a bug in an issue comment, to save myself from having to dig around for it again later on and to act as a historical record once the bug has been fixed.

I also started an issue in the Janus repo, with the title Error with Python 3.10: “ValueError: loop argument must agree with lock”. I linked this to my own issue and posted relevant research in both places.

The two lines of code in Janus that caused the bug where these ones (note my added comment):

self._async_mutex = asyncio.Lock()
# "loop argument must agree with lock" exception is raised here:
self._async_not_empty = asyncio.Condition(self._async_mutex)

This is where I began to suspect a Python bug. The above code can be simplified to this:

asyncio.Condition(asyncio.Lock())

This is a documented way of using conditions in Python: you can instantiate conditions with an optional lock, which allows that lock to be shared by more than one condition.

So I tried that in the Python 3.10 interpreter... and it didn’t throw the error:

% ~/.pyenv/versions/3.10.0/bin/python
Python 3.10.0 (default, Oct  7 2021, 13:45:58) [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> print(asyncio.Condition(asyncio.Lock()))
<asyncio.locks.Condition object at 0x1062521d0 [unlocked]>

From browsing the source code I could see that there was something going on here involving the event loop—so on a hunch I tried running that code in an event loop instead, like so:

>>> async def example():
...     print(asyncio.Condition(asyncio.Lock()))
... 
>>> asyncio.run(example())
Traceback (most recent call last):
  ...
  File "<stdin>", line 2, in example
  File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/asyncio/locks.py", line 234, in __init__
    raise ValueError("loop argument must agree with lock")
ValueError: loop argument must agree with lock

There’s the exception!

So it looked like this might be a genuine Python 3.10 bug. After all, I knew that this code worked in prior versions of Python (since Janus and Datasette worked fine there). In fact I could confirm it like so:

~ % python3.9
Python 3.9.7 (default, Sep  3 2021, 12:45:31) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> async def example():
...     print(asyncio.Condition(asyncio.Lock()))
... 
>>> asyncio.run(example())
<asyncio.locks.Condition object at 0x1049aab50 [unlocked]>

Reporting the bug to the Python issue tracker

Python bugs are tracked on bugs.python.org—thankfully they support sign in with GitHub (and Google and even OpenID) so I signed in with my GitHub account to report the bug.

I filed Issue 45416 entitled:

“loop argument must agree with lock” instantiating asyncio.Condition

And included a brief description with my minimal Python script that demonstrated the bug.

Then I added a comment pointing back to one of the related GitHub issues where I had been figuring this out, and another comment that linked to the most relevant test in Python’s own test suite, which I found using GitHub search for the term Condition.

Any time I file an issue against a project—my own or someone else’s—I always like to do a bit of extra research and link to both the likely code at fault and the likely test.

As a maintainer, this saves me a bunch of time—if the issue links straight to the code and tests I don’t have to spend any time digging around looking for them.

My hope is that this saves time for the people I am reporting the bug to as well, which increases the chance that they’ll dig into it and help me find a fix!

Then I tweeted about it. I have a bunch of followers with relevant experience, so I thought it was time to get a few more eyes on the problem.

I included a screenshot of my steps-to-reproduce script to save anyone who might be casually interested from having to click through to the issue itself.

Łukasz Langa to the rescue

Łukasz is the “CPython Developer in Residence”, which means he is paid by the Python Software Foundation to work on Python full-time.

Less than an hour after I filed the issue he posted a comment that both confirmed the bug and provided two short-term workarounds for it!

I really liked his first workaround, which looks like this:

>>> l = asyncio.Lock()
>>> getattr(l, '_get_loop', lambda: None)()
<_UnixSelectorEventLoop running=True closed=False debug=False>

It works by calling the lock._get_loop() method, but only if it exists—so even though that’s an undocumented internal method this should be safe even when run against future Python versions that might remove the method.

You can see that method here—it safely populates the self._loop property, which helps work around the bug.

Submitting a PR to Janus

The best place to apply this workaround is to Janus—so I submitted a PR there which adds the workaround and updates their CI configuration to test against Python 3.10: Janus PR #359. The GitHub Actions tests for that PR are now passing against Python 3.10.

Their README links to a gitter chat so I dropped a link to my PR in there too. Hopefully that can be merged soon!

What I should have done

I’m feeling pretty good about this situation—the bug is set to be fixed, we have a great workaround thanks to Łukasz and I’m optimistic that the upstream library from Datasette will land the fix soon.

If only I’d done this a few weeks earlier!

The Python 3.10 release schedule shipped the first alpha a year ago, in October 2020. It totalled 7 alphas, 4 betas and 2 release candidates before the final 3.10 release on Monday 4th October 2021.

If I’d taken the time to get Datasette’s test suite running against Python 3.10 for any of those pre-releases, I could have helped spot this bug before it landed in the final release!

So lesson learned: Python has alphas and betas and RCs for a reason. For future Pythons I’m going to pay attention to testing against these for my projects before the final release lands.

Weeknotes: CDC vaccination history fixes, developing in GitHub Codespaces 18 days ago

I spent the last week mostly surrounded by boxes: we’re completing our move to the new place and life is mostly unpacking now. I did find some time to fix some issues with my CDC vaccination history Datasette instance though.

Fixing my CDC vaccination history site

I started tracking changes made to the CDC’s COVID Data Tracker website back in Feburary. I created a git scraper repository for it as part of my five minute lightning talk on git scraping (notes and video) at this year’s NICAR data journalism conference.

Since then it’s been quietly ticking along, recording the latest data in a git repository that now has 335 commits.

In March I added a script to build the collected historic data into a SQLite database and publish it to Vercel using GitHub. That started breaking a few weeks ago, and it turnoud out that was because the database file had grown in size to the point where it was too large to deploy to Vercel (~100MB).

I got a bug report about this, so I took some time to move the deployment over to Google Cloud Run which doesn’t have a documented size limit (though in my experience starts to creak once you go above about 2GB.)

I also started publishing the raw collected data directly as a CSV file, partly as an excuse to learn how to publish to Google Cloud Storage.

datasette-template-request

I released an extremely simple plugin this week called datasette-template-request—all it does is expose Datasette’s request object in the context passed to custom templates, for people who want to update their custom page based on incoming request parameters.

More notable is how I built the plugin: this is the first plugin I’ve developed, tested and released entirely in my browser using the new GitHub Codespaces online development environment.

I created the new repo using my Datasette plugin template repository, opened it up in Codespaces, implemented the plugin and tests, tried it out using the port forwarding feature and then published it to PyPI using the publish.yml workflow.

Not having to even open a text editor on my laptop (let alone get a new Python development environment up and running) felt really good. I should turn this into a tutorial.

Releases this week

TIL this week

Weeknotes number 100 27 days ago

This entry marks my 100th weeknotes, which I’ve managed to post once a week (plus or minus a few days) consistently since 13th September 2019.

I started writing weeknotes to add some accountability to the work I was doing during my JSK fellowship year at Stanford. The fellowship ended over a year ago but I’ve stuck to the habit—I’ve been finding it really helpful as a structured approach to thinking about my work every week, and it occasionally helps motivate me to get things done enough that I have something I can write about!

Datasette Desktop 0.2.0

My big achievement this week was Datasette Desktop 0.2.0 (and the 0.2.1 patch release that followed). I published annotated release notes for that a few days ago. I’m really pleased with the release—I think Datasette as a desktop application is going to significantly increase the impact of the project.

I also sent out an issue of the Datasette Newsletter promoting the new desktop application.

Datasette Desktop for Windows

I did a quick research spike to investigate the feasibility of publishing a Windows version of Datasette Desktop. To my surprise, I managed to get a working prototype going with just half a small amount of work:

Electron claims to solve cross-platform development and it seems to uphold that claim pretty well!

I’m still quite a bit of work away from having a release: I need to learn how to build and sign Windows installers. But this is a very promising first step.

json-flatten

I’ve started thinking about how I can enable Datasette Desktop users to configure plugins without having to hand-edit plugin configuration JSON (the current mechanism).

This made me take another look at a small library I released a couple of years ago, json-flatten, which turns a nested JSON object into a set of flat key/value pairs suitable for editing using an HTML form and then unflattens that data later on.

>>> import json_flatten
>>> json_flatten.flatten({"foo": {"bar": [1, True, None]}})
{'foo.bar.[0]$int': '1', 'foo.bar.[1]$bool': 'True', 'foo.bar.[2]$none': 'None'}
>>> json_flatten.unflatten(_)
{'foo': {'bar': [1, True, None]}}

It turns out a few people have been using the library, and had filed issues—I released version 0.2 with a couple of fixes.

TIL this week

Releases this week

Datasette Desktop 0.2.0: The annotated release notes one month ago

Datasette Desktop is a new macOS desktop application version of Datasette, an “open source multi-tool for exploring and publishing data” built on top of SQLite. I released the first version last week—I’ve just released version 0.2.0 (and a 0.2.1 bug fix) with a whole bunch of critical improvements.

You can see the release notes for 0.2.0 here, but as I’ve done with Datasette in the past I’ve decided to present an annotated version of those release notes providing further background on each of the new features.

The plugin directory

A new plugin directory for installing new plugins and upgrading or uninstalling existing ones. Open it using the “Plugins -> Install and Manage Plugins...” menu item. #74

Demo showing installing and upgrading a plugin

This was the main focus for the release. Plugins are a key component of both Datasette and Datasette Desktop: my goal is for Datasette to provide a robust core for exploring databases, with a wide array of plugins that support any additional kind of visualization, exploration or data manipulation capability that a user might want.

Datasette Desktop goes as far as bundling an entire standalone Python installation just to ensure that plugins will work correctly, and invisibly sets up a dedicated Python virtual environment for plugins to install into when you first run the application.

The first version of the app allowed users to install plugins by pasting their name into a text input field. Version 0.2.0 is a whole lot more sophisticated: the single input field has been replaced by a full plugin directory interface that shows installed v.s. available plugins and provides “Install”, “Upgrade” and “Uninstall” buttons depending on the state of the plugin.

When I set out to build this I knew I wanted to hit this JSON API on datasette.io to fetch the list of plugins, and I knew I wanted a simple searchable index page. The I realized I also wanted faceted search, so I could filter for installed vs not-yet-installed plugins.

Datasette’s built-in table interface already implements faceted search! So I decided to use that, with some custom templates to add the install buttons and display the plugins in a more suitable format.

The first challenge was getting the latest list of plugins into my Datasette instance. I built this into the datasette-app-support plugin using the startup() plugin hook—every time the server starts up it hits that API and populates an in-memory table with the returned data.

The data from the API is then extended with four extra columns:

  • "installed" is set to “installed” or “not installed” depending on whether the plugin has already been installed by the user
  • "Installed_version" is the currently installed version of the plugin
  • "upgrade" is the string “upgrade available” or None—allowing the user to filter for just plugins that can be upgraded
  • "default" is set to 1 if the plugin is a default plugin that came with Datasette

The data needed to build the plugin table is gathered by these three lines of code:

plugins = httpx.get(
     "https://datasette.io/content/plugins.json?_shape=array"
).json()
# Annotate with list of installed plugins
installed_plugins = {
    plugin["name"]: plugin["version"]
    for plugin in (await datasette.client.get("/-/plugins.json")).json()
}
default_plugins = (os.environ.get("DATASETTE_DEFAULT_PLUGINS") or "").split()

The first line fetches the full list of known plugins from the Datasette plugin directory

The second makes an internal API call to the Datasette /-/plugins.json endpoint using the datasette.client mechanism to discover what plugins are currently installed and their versions.

The third line loads a space-separated list of default plugins from the DATASETTE_DEFAULT_PLUGINS environment variable.

That last one deserves further explanation. Datasette Desktop now ships with some default plugins, and the point of truth for what those are lives in the Electron app codebase—because that’s where the code responsible for installing them is.

Five plugins are now installed by default: datasette-vega, datasette-cluster-map, datasette-pretty-json, datasette-edit-schema and datasette-configure-fts. #81

The plugin directory needs to know what these defaults are so it can avoid showing the “uninstall” button for those plugins. Uninstalling them currently makes no sense because Datasette Desktop installs any missing dependencies when the app starts, which would instantly undo the user’s uninstall action decision.

An environment variable felt like the most straight-forward way to expose that list of default plugins to the underlying Datasette server!

I plan to make default plugins uninstallable in the future but doing so require a mechanism for persisting user preference state which I haven’t built yet (see issue #101).

A log on the loading screen

The application loading screen now shows a log of what is going on. #70

The first time you launch the Datasette Desktop application it creates a virtual environment and installs datasette, datasette-app-support and the five default plugins (plus their dependencies) into that environment.

Animated demo of the Datasette Desktop launch screen showing the log scrolling past

This can take quite a few seconds, during which the original app would show an indeterminate loading indicator.

Personally I hate loading indicators which don’t show the difference between something that’s working and something that’s eternally hung. Since I can’t estimate how long it will take, I decided to pipe the log of what the pip install command is doing to the loading screen itself.

For most users this will be meaningless, but hopefully will help communicate “I’m installing extra stuff that I need”. Advanced users may find this useful though, especially for bug reporting if something goes wrong.

Under the hood I implemented this using a Node.js EventEmitter. I use the same trick to forward server log output to the “Debug -> Show Sever Log” interface.

Example CSV files

The welcome screen now invites you to try out the application by opening interesting example CSV files, taking advantage of the new “File -> Open CSV from URL...” feature. #91

Previously Datasette Desktop wouldn’t do anything at all until you opened up a CSV or SQLite database, and I have a hunch that unlike me most people don’t have good examples of those to hand at all times!

The new welcome screen offers example CSV files that can be opened directly from the internet. I implemented this using a new API at datasette.io/content/example_csvs (add .json for the JSON version) which is loaded by code running on that welcome page.

I have two examples at the moment, for the Squirrel Census and the London Fire Brigade’s animal rescue data. I’ll be adding more in the future.

The API itself is a great example of the Baked Data architectural pattern in action: the data itself is stored in this hand-edited YAML file, which is compiled to SQLite every time the site is deployed.

To get this feature working I added a new “Open CSV from URL” capability to the app, which is also available in the File menu. Under the hood this works by passing the provided URL to the new /-/open-csv-from-url API endpoint. The implementation of this was surprisingly fiddly as I wanted to consume the CSV file using an asynchronous HTTP client—I ended up using an adaption of some example code from the aiofile README.

Recently opened files and “Open with Datasette”

Recently opened .db and .csv files can now be accessed from the new “File -> Open Recent” menu. Thanks, Kapilan M! #54

This was the project’s first external contribution! Kapilan M figured out a way to hook into the macOS “recent files” mechanism from Electron, and I expanded that to cover SQLite database in addition to CSV files.

When a recent file is selected, Electron fires the “open-file” event. This same event is fired when a file is opened using “Open With -> Datasette” or dragged onto the application’s dock.

This meant I needed to tell the difference between a CSV or a SQLite database file, which I do by checking if the first 16 bytes of the file match the SQLite header of SQLite format 3\0.

.db and .csv files can now be opened in Datasette starting from the Finder using “Right Click -> Open With -> Datasette”. #40

Registering Datasette as a file handler for .csv and .db was not at all obvious. It turned out to involve adding the following to the Electron app’s package.json file:

  "build": {
    "appId": "io.datasette.app",
    "mac": {
      "category": "public.app-category.developer-tools",
      "extendInfo": {
        "CFBundleDocumentTypes": [
          {
            "CFBundleTypeExtensions": [
              "csv",
              "tsv",
              "db"
            ],
            "LSHandlerRank": "Alternate"
          }
        ]
      }

The Debug Menu

A new Debug menu can be enabled using Datasette -> About Datasette -> Enable Debug Menu".

The debug menu existed previously in development mode, but with 0.2.0 I decided to expose it to end users. I didn’t want to show it to people who weren’t ready to see it, so you have to first enable it using a button on the about menu.

The most interesting option there is “Run Server Manually”.

Most of the time when you are using the app there’s a datasette Python server running under the hood, but it’s entirely managed by the Node.js child_process module.

When developing the application (or associated plugins) it can be useful to manually run that server rather than having it managed by the app, so you can see more detailed error messages or even add the --pdb option to drop into a debugger should something go wrong.

To run that server, you need the Electron app to kill its own version... and you then need to know things like what port it was running on and which environment variables it was using.

Here’s what you see when you click the “Run Server Manually” debug option:

Run server manually? Clicking OK will terminate the Datasette server used by this app. Copy this command to a terminal to manually run a replacement

Here’s that command in full:

DATASETTE_API_TOKEN="0ebb45444ba4cbcbacdbcbb989bb0cd3aa10773c0dfce73c0115868d0cee2afa" DATASETTE_SECRET="4a8ac89d0d269c31d99059933040b4511869c12dfa699a1429ea29ee3310a850" DATASETTE_DEFAULT_PLUGINS="datasette datasette-app-support datasette-vega datasette-cluster-map datasette-pretty-json datasette-edit-schema datasette-configure-fts datasette-leaflet" /Users/simon/.datasette-app/venv/bin/datasette --port 8002 --version-note xyz-for-datasette-app --setting sql_time_limit_ms 10000 --setting max_returned_rows 2000 --setting facet_time_limit_ms 3000 --setting max_csv_mb 0

This is a simulation of the command that the app itself used to launch the server. Pasting that into a terminal will produce an exact copy of the original process—and you can add --pdb or other options to further customize it.

Bonus: Restoring the in-memory database on restart

This didn’t make it into the formal release notes, but it’s a fun bug that I fixed in this release.

Datasette Desktop defaults to opening CSV files in an in-memory database. You can import them into an on-disk database too, but if you just want to start exploring CSV data in Datasette I decided an in-memory database would be a better starting point.

There’s one problem with this: installing a plugin requires a Datasette server restart, and restarting the server clears the content of that in-memory database, causing any tables created from imported CSVs to disappear. This is confusing!

You can follow my progress on this in issue #42: If you open a CSV and then install a plugin the CSV table vanishes. I ended up solving it by adding code that dumps the “temporary” in-memory database to a file on disk before a server restart, restarts the server, then copies that disk backup into memory again.

This works using two custom API endpoints added to the datasette-app-support plugin:

  • POST /-/dump-temporary-to-file with {"path": "/path/to/backup.db"} dumps the contents of that in-memory temporary database to the specified file.
  • POST /-/restore-temporary-from-file with {"path": "/path/to/backup.db"} restors the content back again.

These APIs are called from the startOrRestart() method any time the server restarts, using a file path generated by Electron using the following:

backupPath = path.join(
  app.getPath("temp"),
  `backup-${crypto.randomBytes(8).toString("hex")}.db`
);

The file is deleted once it has been restored.

After much experimentation, I ended up using the db.backup(other_connection) method that was added to Python’s sqlite3 module in Python 3.7. Since Datasette Desktop bundles its own copy of Python 3.9 I don’t have to worry about compatibility with older versions at all.

The rest is in the milestone

If you want even more detailed notes on what into the release, each new feature is included in the 0.2.0 milestone, accompanied by a detailed issue with screenshots (and even a few videos) plus links to the underlying commits.

Datasette Desktop—a macOS desktop application for Datasette one month ago

I just released version 0.1.0 of the new Datasette macOS desktop application, the first version that end-users can easily install. I would very much appreciate your help testing it out!

Datasette Desktop

Datasette Desktop screenshot

Datasette is “an open source multi-tool for exploring and publishing data”. It’s a Python web application that lets you explore data held in SQLite databases, plus a growing ecosystem of plugins for visualizing and manipulating those databases.

Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to explore and share with the world.

There’s just one big catch: since it’s a Python web application, those users have needed to figure out how to install and run Python software in order to use it. For people who don’t live and breath Python and the command-line this turns out to be a substantial barrier to entry!

Datasette Desktop is my latest attempt at addressing this problem. I’ve packaged up Datasette, SQLite and a full copy of Python such that users can download and uncompress a zip file, drag it into their /Applications folder and start using Datasette, without needing to know that there’s a Python web server running under the hood (or even understand what a Python web server is).

Please try it out, and send me feedback and suggestions on GitHub.

What the app does

This initial release has a small but useful set of features:

  • Open an existing SQLite database file and offer all of Datasette’s functionality, including the ability to explore tables and to execute arbitrary SQL queries.
  • Open a CSV file and offer the Datasette table interface (example here). By default this uses an in-memory database that gets cleared when the app shuts down, or you can...
  • Import CSV files into tables in on-disk SQLite databases (including creating a new blank database first).
  • By default the application runs a local web server which only accepts connections from your machine... but you can change that in the “File -> Access Control” menu to allow connections from anyone on your network. This includes Tailscale networks too, allowing you to run the application on your home computer and then access it securely from other devices such as your mobile phone anywhere in the world.
  • You can install plugins! This is the most exciting aspect of this initial release: it’s already in a state where users can customize it and developers can extend it, either with Datasette’s existing plugins (69 and counting) or by writing new ones.

How the app works

There are three components to the app:

  • A macOS wrapper application
  • Datasette itself
  • The datasette-app-support plugin

The first is the macOS application itself. This is currently written with Electron, and bundles a full copy of Python 3.9 (based on python-build-standalone by Gregory Szorc). Bundling Python is essential: the principal goal of the app is to allow people to use Datasette who aren’t ready to figure out how to install their own Python environment. Having an isolated and self-contained Python is also a great way of avoiding making XKCD 1987 even worse.

The macOS application doesn’t actually include Datasette itself. Instead, on first launch it creates a new Python virtual environment (currently in ~/.datasette-app/venv, feedback on that location welcome) and installs the other two components: Datasette and the datasette-app-support plugin.

Having a dedicated virtual environment is what enables the “Install Plugin” menu option. When a plugin is installed the macOS application runs pip install name-of-plugin and then restarts the Datasette server process, causing it to load that new plugin.

The datasette-app-support plugin is designed exclusively to work with this application. It adds API endpoints that the Electron shell can use to trigger specific actions, such as “import from this CSV file” or “attach this SQLite database”—these are generally triggered by macOS application menu items.

It also adds a custom authentication mechanism. The user of the app should have special permissions: only they should be able to import a CSV file from anywhere on their computer into Datasette. But for the “network share” feature I want other users to be able to access the web application.

An interesting consequence of installing Datasette on first-run rather than bundling it with the application is that the user will be able to upgrade to future Datasette releases without needing to re-install the application itself.

How I built it

I’ve been building this application completely in public over the past two weeks, writing up my notes and research in GitHub issues as I went (here’s the initial release milestone).

I had to figure out a lot of stuff!

First, Electron. Since almost all of the user-facing interface is provided by the existing Datasette web application, Electron was a natural fit: I needed help powering native menus and bundling everything up as an installable application, which Electron handles extremely well.

I also have ambitions to get a Windows version working in the future, which should share almost all of the same code.

Electron also has fantastic initial developer onboarding. I’d love to achieve a similar level of quality for Datasette some day.

The single biggest challenge was figuring out how to bundle a working copy of the Datasette Python application to run inside the Electron application.

My initial plan (touched on last week) was to compile Datasette and its dependencies into a single executable using PyInstaller or PyOxidizer or py2app.

These tools strip down a Python application to the minimal required set of dependencies and then use various tricks to compress that all into a single binary. They are really clever. For many projects I imagine this would be the right way to go.

I had one big problem though: I wanted to support plugin installation. Datasette plugins can have their own dependencies, and could potentially use any of the code from the Python standard library. This means that a stripped-down Python isn’t actually right for this project: I need a full installation, standard library and all.

Telling the user they had to install Python themselves was an absolute non-starter: the entire point of this project is to make Datasette available to users who are unwilling or unable to jump through those hoops.

Gregory Szorc built PyOxidizer, and as part of that he built python-build-standalone:

This project produces self-contained, highly-portable Python distributions. These Python distributions contain a fully-usable, full-featured Python installation as well as their build artifacts (object files, libraries, etc).

Sounds like exactly what I needed! I opened a research issue, built a proof-of-concept and decided to commit to that as the approach I was going to use. Here’s a TIL that describes how I’m doing this: Bundling Python inside an Electron app

(I find GitHub issue threads to be the ideal way of exploring these kinds of areas. Many of my repositories have a research label specifically to track them.)

The last key step was figuring out how to sign the application, so I could distribute it to other macOS users without them facing this dreaded dialog:

Datasette.app can't be opened because Apple cannot check it for malicious software

It turns out there are two steps to this these days: signing the code with a developer certificate, and then “notarizing” it, which involves uploading the bundle to Apple’s servers, having them scan it for malicious code and attaching the resulting approval to the bundle.

I was expecting figuring this out to be a nightmare. It ended up not too bad: I spent two days on it, but most of the work ended up being done by electron-builder—one of the biggest advantages of working within the Electron ecosystem is that a lot of people have put a lot of effort into these final steps.

I was adamant that my eventual signing and notarization solution should be automated using GitHub Actions: nothing defangs a frustrating build process more than good automation! This made things a bit harder because all of the tutorials and documentation assumed you were working with a GUI, but I got there in the end. I wrote this all up as a TIL: Signing and notarizing an Electron app for distribution using GitHub Actions (see also Attaching a generated file to a GitHub release using Actions).

What’s next

I announced the release last night on Twitter and I’ve already started getting feedback. This has resulted in a growing number of issues under the usability label.

My expectation is that most improvements made for the benefit of Datasette Desktop will benefit the regular Datasette web application too.

There’s also a strategic component to this. I’m investing a lot of development work in Datasette, and I want that work to have the biggest impact possible. Datasette Desktop is an important new distribution channel, which also means that any time I add a new feature to Datasette or build a new plugin the desktop application should see the same benefit as the hosted web application.

If I’m unlucky I’ll find this slows me down: every feature I build will need to include consideration as to how it affects the desktop application.

My intuition currently is that this trade-off will be worthwhile: I don’t think ensuring desktop compatibility will be a significant burden, and the added value from getting new features almost for free through a whole separate distribution channel should hopefully be huge!

TIL this week

Releases this week

Building a desktop application for Datasette (and weeknotes) one month ago

This week I started experimenting with a desktop application version of Datasette—with the goal of providing people who aren’t comfortable with the command-line the ability to get Datasette up and running on their own personal computers.

Update 8th September 2021: I made a bunch more progress over the week following this post, see Datasette Desktop—a macOS desktop application for Datasette for details or download the app to try it out.

Screenshot of the new Datasette desktop app prototype with several open windows

Why a desktop application?

On Monday I kicked off an enormous Twitter conversation when I posted:

I wonder how much of the popularity of R among some communities in comparison to Python comes down to the fact that with R you can install the RStudio desktop application and you’re ready to go

This ties into my single biggest complaint about Python: it’s just too hard for people to get started with. Setting up a Python development environment for the first time remains an enormous barrier to entry.

I later put this in stronger terms:

The more I think about this the more frustrated I get, thinking about the enormous amount of human potential that’s squandered because the barriers to getting started learning to program are so much higher than they need to be

Which made me think of glass houses. My own Datasette project has exactly the same problem: to run it locally you need to install Python and then install Datasette! Mac users can use Homebrew, but telling newcomers to install Homebrew first isn’t particularly welcoming either.

Ideally, I’d like people to be able to install a regular desktop application and start using Datasette that way, without even needing to know that it’s written in Python.

There’s been an open issue to get Datasette running as a standalone binary using PyInstaller since November 2017, with quite a bit of research.

But I want a UI as well: I don’t want to have to teach new users how to install and run a command-line application if I can avoid it.

So I decided to spend some time researching Electron to see how hard it would be to make a basic Datasette desktop application a reality.

Progress so far

The code I’ve written so far can be found in the simonw/datasette.app repository on GitHub. The app so far does the following:

  • Run a datasette server on localhost attached to an available port (found using portfinder) which terminates when the app quits.
  • Open a desktop window showing that Datasette instance once the server has started.
  • Allow additional windows onto the same instance to be opened using the “New Window” menu option or the Command+N keyboard shortcut.
  • Provides an “Open Database...” menu option (and Command+O shortcut) which brings up a file picker to allow the user to select a SQLite database file to open—once selected, this is attached to the Datasette instance and any windows showing the Datasette homepage are reloaded.

Here’s a video demo showing these features in action:

It’s very much an MVP, but I’m encouraged by the progress so far. I think this is enough of a proof of concept to be worth turning this into an actual usable product.

How this all works

There are two components to the application.

The first is a thin Electron shell, responsible for launching the Python server, managing windows and configuring the various desktop menu options used to configure it. The code for that lives in main.js.

The second is a custom Datasette plugin that adds extra functionality needed by the application. Currently this consists of a tiny bit of extra CSS to make the footer stick to the bottom of the window, and a custom API endpoint at /-/open-database-file which is called by the menu option for opening a new database.

Initial impressions of Electron

I know it’s cool to knock Electron, but in this case it feels like exactly the right tool for the job. Datasette is already a web application—what I need is a way to hide the configuration of that web application behind an icon, and re-present the interface in a way that feels more like a desktop application.

This is my first time building anything with Electron—here are some of my initial impressions.

  • The initial getting started workflow is really good. I started out with their Quick Start and was up and running with a barebones application that I could start making changes to in just a few minutes.
  • The documentation is pretty good, but it leans more towards being an API reference. I found myself googling for examples of different things I wanted to do pretty often.
  • The automated testing situation isn’t great. I’m using Spectron and Mocha for my initial (very thin) tests—I got them up and running in GitHub Actions, but I’ve already run into some limitations:
    • For some reason each time I run the tests an Electron window (and datasette Python process) is left running. I can’t figure out why this is.
    • There doesn’t appear to be a way for tests to trigger menu items, which is frustrating because most of the logic I’ve written so far deals with menu items! There is an open issue for this dating back to May 2016.
  • I haven’t yet managed to package my app. This is clearly going to be the biggest challenge.

Up next: packaging the app

I was hoping to get to this before writing up my progress in these weeknotes, but it looks like it’s going to be quite a challenge.

In order to produce an installable macOS app (I’ll dive into Windows later) I need to do the following:

  • Build a standalone Datasette executable, complete with the custom plugin, using PyInstaller
  • Sign that binary with an Apple developer certificate
  • Build an Electron application that bundles a copy of that datasette binary
  • Sign the resulting Electron application

I’m expecting figuring this out to be a long-winded and frustrating experience, which is more the fault of Apple than of Electron. I’m tracking my progress on this in issue #7.

Datasette 0.59a2

I pushed out a new alpha of Datasette earlier this week, partly driven by work I was doing on datasette.app.

The biggest new feature in this release is a new plugin hook: register_commands()—which lets plugins add additional commands to Datasette, e.g. datasette verify name-of-file.db.

I released a new plugin that exercises this hook called datasette-verify. Past experience has shown me that it’s crucial to ship an example plugin alongside a new hook, to help confirm that the hook design is fit for purpose.

It turns out I didn’t need this for datasette.app after all, but it’s still a great capability to have!

sqlite-utils 3.17

Quoting the release notes in full:

  • The sqlite-utils memory command has a new --analyze option, which runs the equivalent of the analyze-tables command directly against the in-memory database created from the incoming CSV or JSON data. (#320)
  • sqlite-utils insert-files now has the ability to insert file contents in to TEXT columns in addition to the default BLOB. Pass the --text option or use content_text as a column specifier. (#319)

evernote-to-sqlite 0.3.2

As a follow-up to last week’s work on my personal Dogsheep, I decided to re-import my Evernote notes... and found out that Evernote has changed their export mechanism in ways that broke my tool. Most concerningly their exported XML is even less well-formed than it used to be. This new release works around that.

TIL this week

Releases this week

Elsewhere

12th October 2021

  • How to win at CORS (via) Jake Archibald’s definitive guide to CORS, including a handy CORS playground interactive tool. Also includes a useful history explaining why we need CORS in the first place. #12th October 2021, 2:07 pm

6th October 2021

  • Abusing Terraform to Upload Static Websites to S3 (via) I found this really interesting. Terraform is infrastructure as code software which mostly handles creating and updating infrastructure resources, so it’s a poor fit for uploading files to S3 and setting the correct Content-Type headers for them. But... in figuring out how to do that, this article taught me a ton about how Terraform works. I wonder if that’s a useful general pattern? Get a tool to do something that it’s poorly designed to handle and see how much you learn about that tool along the way. #6th October 2021, 6:51 pm
  • Writing for distributed teams (via) Vicki Boykis describes how she only sent 11 emails during her first 12 months working at Automattic, because the company culture there revolves around asynchronous communication through durable writing using the P2 custom WordPress theme. “This is a completely different paradigm than I’ve ever worked in, which has been a world usually riddled with information lost to Slack, Confluence, and dozens of email re:re:res.” #6th October 2021, 6:29 pm

29th September 2021

26th September 2021

  • django-upgrade (via) Adam Johnson’s new CLI tool for upgrading Django projects by automatically applying changes to counter deprecations made in different versions of the framework. Uses the Python standard library tokenize module which gives it really quick performance in parsing and rewriting Python code. Exciting to see this kind of codemod approach becoming more common in Python world—JavaScript developers use this kind of thing a lot. #26th September 2021, 5:42 am

24th September 2021

  • New tool: an nginx playground. Julia Evans built a sandbox tool for interactively trying out an nginx configuration and executing test requests through it. I love this kind of tool, and Julia’s explanation of how they built it using a tiny fly.io instance and a network namespace to reduce the amount of damage any malicious usage could cause is really interesting. #24th September 2021, 6:44 pm

23rd September 2021

  • File not found: A generation that grew up with Google is forcing professors to rethink their lesson plans (via) This is fascinating: as-of 2017 university instructors have been increasingly encountering students who have absolutely no idea how files and folders on a computer work. The new generation has a completely different mental model of how applications work, where everything is found using search and data mostly lives inside the application that you use to manipulate it. #23rd September 2021, 10:49 pm
  • Introducing Partytown 🎉: Run Third-Party Scripts From a Web Worker (via) This is just spectacularly clever. Partytown is a 6KB JavaScript library that helps you move gnarly poorly performing third-party scripts out of your main page and into a web worker, so they won’t destroy your page performance. The really clever bit is in how it provides sandboxed access to the page DOM: it uses a devious trick where a proxy object provides getters and setters which then make blocking API calls to a separate service worker, using the mostly-forgotten xhr.open(..., false) parameter that turns off the async default for an XMLHttpRequest call. #23rd September 2021, 6:29 pm

20th September 2021

10th September 2021

  • Imagine writing the investment memo for “20% of a picture of a dog” and being like “the most we should pay is probably about $2 million because the whole picture of the dog sold for $4 million three months ago and it can’t realistically have appreciated more than 150% since then; even if the whole picture of the dog is worth, aggressively, $10 million, this share would be worth $2 milllion.” What nonsense that is!

    Matt Levine # 10th September 2021, 7:27 am

8th September 2021

  • Datasette Desktop 0.1.0 (via) This is the first installable version of the new Datasette Desktop macOS application I’ve been building. Please try it out and leave feedback on Twitter or on the GitHub Discussions thread linked from the release notes. #8th September 2021, 5:14 am

7th September 2021

  • We never shipped a great commercial product. The reason for that is we didn’t focus. We tried to do a little bit of everything. It’s hard enough to maintain the growth of your developer community and build one great commercial product, let alone three or four, and it is impossible to do both, but that’s what we tried to do and we spent an enormous amount of money doing it.

    Solomon Hykes # 7th September 2021, 2:47 pm