Simon Willison’s Weblog

Weeknotes: A flurry of not-quite-finished features

3rd January 2021

My Christmas present to myself this year was to allow myself to spend a week working on stuff I found interesting, rather than sticking to the most important things. This may have been a mistake: it’s left me with a flurry of interesting but not-quite-finished features.

Prettier for Datasette

A couple of years ago I decided to adopt Black, an opinionated code formatter, for all of my Python projects.

This proved to be a much, much bigger productivity boost than I had expected.

It turns out I had been spending a non-trivial portion of my coding brain cycles thinking about code formatting. I was really sweating the details of how data structures should be indented, or how best to stay within an 80 character line length limit.

Letting go and outsourcing all of that nit-picking to Black meant I didn’t have to spend any time thinking about that stuff, at all. It was so liberating! It even changed the way I write unit tests, as I described in How to cheat at unit tests with pytest and Black.

I’ve been hoping to adopt something similar for JavaScript for ages. The leading contender in that space is Prettier, but I’ve held back because I didn’t like the aesthetics of its handling of chained method calls.

That’s a pretty dumb reason to lose out on such a big productivity boost, so I’ve decided to forget about that issue entirely.

Datasette doesn’t have much JavaScript at the moment, but that’s likely to change in the future. So I’ve applied Prettier to the table.js file and configured a GitHub action that will fail if Prettier hasn’t been applied—I wrote that up as a TIL.

JavaScript plugins for Datasette

This is why I’ve been thinking about JavaScript: I’ve been pushing forward on the open issue to provide a JavaScript plugin mechanism that Datasette plugins can hook into.

Datasette’s Python plugins use pluggy, and I absolutely love it—it’s a powerful, simple and elegant way to define plugins and have them cooperate with each other.

Pluggy works by defining “hooks” and allowing plugins to register to respond to those hooks. When a hook is called by Datasette’s core code, each plugin gets to return a value, and the collected values from all of them are made available as a Python list.
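
To illustrate the pattern (this is a generic pluggy example, not Datasette’s actual hook definitions), a hook specification and a couple of plugins look something like this:

import pluggy

hookspec = pluggy.HookspecMarker("demo")
hookimpl = pluggy.HookimplMarker("demo")

class Spec:
    @hookspec
    def suggest_title(self, table):
        "Return a suggested title for this table."

class FirstPlugin:
    @hookimpl
    def suggest_title(self, table):
        return table.upper()

class SecondPlugin:
    @hookimpl
    def suggest_title(self, table):
        return table + "!"

pm = pluggy.PluginManager("demo")
pm.add_hookspecs(Spec)
pm.register(FirstPlugin())
pm.register(SecondPlugin())

# Calling the hook collects one return value from each registered plugin
print(pm.hook.suggest_title(table="plants"))
# ['plants!', 'PLANTS']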

I want the same thing for JavaScript. As an example, consider the “column actions” menu activated by the cog icon on each column on the table page.

I’d like plugins written in JavaScript to be able to return additional menu items to be shown in that menu—similar to how Python plugins can extend the existing table actions menu using the table_actions hook.

I’ve been fleshing out ideas for this in issue 693, and I’m getting close to something that I’m happy with.

A big concern is loading times. I’d like to keep Datasette pages loading as quickly as possible, so I don’t want to add a large chunk of JavaScript to each page to support plugins—and potentially have that block the page load until the script has been fetched and parsed.

But... if a JavaScript plugin is going to register itself with the plugin system, it needs to be able to call an API function to register itself! So at least some plugin JavaScript has to be made available as soon as possible in the lifecycle of the page.

I’ve always admired how Google Analytics deals with this problem. They define a tiny ga() function that queues up any calls made to it, then executes those queued calls as soon as the larger analytics script has been asynchronously loaded:

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-XXXXX-Y', 'auto');
ga('send', 'pageview');

Could I build a plugin system that’s so small I can inline it into the <head> of every page without worrying about the extra page bulk?

I think I can. Here’s the smallest version I’ve reached so far:

var datasette = datasette || {};
datasette.plugins = (() => {
  var registry = {};
  return {
    register: (hook, fn, parameters) => {
      if (!registry[hook]) {
        registry[hook] = [];
      }
      registry[hook].push([fn, parameters]);
    },
    call: (hook, args) => {
      args = args || {};
      var results = [];
      (registry[hook] || []).forEach(([fn, parameters]) => {
        /* Call with the correct arguments */
        var result = fn.apply(
          fn,
          parameters.map((parameter) => args[parameter])
        );
        if (result !== undefined) {
          results.push(result);
        }
      });
      return results;
    },
  };
})();

Which minifies (with uglify-es) to just 250 characters—it fits in a tweet!

var datasette=datasette||{};datasette.plugins=(()=>{var a={};return{register:(t,r,e)=>{a[t]||(a[t]=[]),a[t].push([r,e])},call:(t,r)=>{r=r||{};var e=[];return(a[t]||[]).forEach(([a,t])=>{var s=a.apply(a,t.map(a=>r[a]));void 0!==s&&e.push(s)}),e}}})();

Matthew Somerville suggested an even shorter version using object destructuring to get it down to 200 bytes. I’ll probably adopt that idea in my next iteration.

As part of this work I’ve also started thinking about how I can introduce minification into the Datasette development process, and how I can start writing unit tests for the JavaScript. I tried using Jest at first but now I’m leaning towards Cypress, since that will help support writing browser automation tests against some of the more advanced functionality as well.

datasette insert

Issue 1160 tracks my work on a big new idea for Datasette: datasette insert.

Datasette assumes you already have data in a SQLite database. Getting that data into your database is left as an exercise for the user.

I’ve been building a plethora of additional tools to help with that challenge—things like sqlite-utils and csvs-to-sqlite and yaml-to-sqlite and so on.

I want Datasette itself to include this functionality, starting with a new datasette insert sub-command.

So why bring this functionality into Datasette itself? A few reasons:

  • Helping people install Python software is hard enough already, without them needing to install multiple tools. If I can tell someone to brew install datasette and then datasette insert data.db file.csv, they’ll have a much easier time than if they have to install and learn additional tools as well.
  • This is a fantastic opportunity for plugin hooks. I want a plugin hook that lets people build additional import formats for Datasette—by default it will support CSV, TSV and JSON but I want to be able to add GeoJSON and Shapefiles and DBF files and Excel and all sorts of other things.
  • Crucially, I plan to build both a command-line interface and a web interface within Datasette that lets people upload files—or copy and paste in their content, or specify a URL to import them from. This means any new format plugins will enable not just a command-line interface but a web interface for importing that data as well.
  • It’s still possible that Datasette will evolve beyond just SQLite in the future. I don’t want to have to build yaml-to-postgresql and so on—this new Datasette plugin hook could support importing formats into other databases too in the future.
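
To make that plugin hook idea a little more concrete, here is a purely hypothetical sketch of what a format plugin could look like. The register_insert_formats hook does not exist yet and its name and signature are invented for illustration, but the shape mirrors Datasette’s existing pluggy-based hooks:

from datasette import hookimpl  # the decorator existing Datasette plugins already use
import json

def parse_geojson(fp):
    "Yield one flat dict per GeoJSON feature, ready to be inserted as table rows."
    for feature in json.load(fp).get("features", []):
        yield dict(feature.get("properties", {}), geometry=json.dumps(feature.get("geometry")))

@hookimpl
def register_insert_formats():
    # Hypothetical hook: map a format name to a callable that turns a file into rows
    return {"geojson": parse_geojson}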

I made some initial progress on this but it’s going to be a while before I can land anything on main.

Storing metadata in _metadata tables

Last week I wrote about the new _internal in-memory database in Datasette, which collects details of every attached database, table and column in order to power things like a forthcoming searchable, paginated homepage.

This made me think: what else could I address with in-memory SQLite databases?

I realized that this could overlap with a truly ancient Datasette feature idea: the ability to bundle metadata and templates within the SQLite database files that Datasette exposes. Bundling metadata is particularly interesting as it would allow that metadata to “travel” with the data, contained in the same binary SQLite file.

So I’ve been doing a whole lot of thinking about how this might work—including how the existing metadata.json format passed to Datasette when it starts up could be read into an in-memory database and then further combined with any metadata baked into tables in the various attached databases.

The work is entirely an issue thread at the moment, but I’m excited to spend more time on it in the future. I want to implement column-level metadata as part of this, so you can finally attach information to columns to help describe what they actually represent.

sqlite-utils cached table counts

select count(*) from table can be a slow query against large database tables, in both SQLite and PostgreSQL. This often surprises people—surely it should be trivial for a relational database to cache the total number of rows in a table? It turns out that’s not actually possible, due to MVCC—a relational database needs to be able to correctly answer that query from the point of view of the current transaction, even while other transactions are inserting or deleting rows.

Datasette tries to show the count of rows in a table in various places. This has led to a number of performance issues when running against large tables. Datasette has some existing tricks for handling that, including timing out counts after 50ms and caching counts for databases opened in “immutable” mode, but it’s still a source of problems.

For a while I’ve been contemplating working around this with triggers. SQLite has a robust triggers system—how about writing triggers that increment or decrement a counter any time a row is inserted or deleted from a table?

My initial experiments indicate that this can indeed work really well!

I’ve been implementing this in sqlite-utils. Like most features of that package it comes in two flavours: a Python API you can call from your own code, and a command-line tool you can use to directly manipulate a database.

Enabling cached counts for a table in the Python API documentation describes how this works. You can call db.enable_counts() on a database to install the triggers, after which a _counts table will be automatically updated with a count record for each table.
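
Here’s a minimal sketch of what that looks like from Python, based on the behaviour described above (the printed output is illustrative):

import sqlite_utils

db = sqlite_utils.Database("plants.db")

# Create a table with a row in it
db["plants"].insert({"id": 1, "name": "Fern"}, pk="id")

# Install the INSERT/DELETE counting triggers on the existing tables
db.enable_counts()

# Writes from now on keep the _counts table up to date automatically
db["plants"].insert({"id": 2, "name": "Monstera"})

print(list(db["_counts"].rows))
# e.g. [{'table': 'plants', 'count': 2}]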

The sqlite-utils enable-counts command can be used to configure this from the command-line.
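
For example, running sqlite-utils enable-counts plants.db against a database file should set up the counting triggers on every table in it (based on the command as it’s shipping in sqlite-utils 3.2).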

The triggers it adds for a table look like this:

CREATE TRIGGER [plants_counts_insert] AFTER INSERT ON [plants]
BEGIN
  INSERT OR REPLACE INTO [_counts]
  VALUES (
    'plants',
    COALESCE(
      (SELECT count FROM [_counts] WHERE [table] = 'plants'),
    0
    ) + 1
  );
END
CREATE TRIGGER [plants_counts_delete] AFTER DELETE ON [plants]
BEGIN
  INSERT OR REPLACE INTO [_counts]
  VALUES (
    'plants',
    COALESCE(
      (SELECT count FROM [_counts] WHERE [table] = 'plants'),
    0
    ) - 1
  );
END

I haven’t benchmarked it yet, but my expectation is that this will add minimal overhead to INSERT and DELETE operations against that table. If your table isn’t handling hundreds of writes a second, I expect the overhead to be inconsequential.

This work is mostly done, and should be out in a sqlite-utils release within a few days.

Update 3rd January 2021: this is all now released in sqlite-utils 3.2.

Releases this week

TIL this week