Simon Willison’s Weblog


Recent entries

Weeknotes: datasette-export-notebook, PyInstaller packaged Datasette, CBSAs six days ago

What a terrible week. I’ve found it hard to concentrate on anything substantial. In a mostly futile attempt to distract myself from doomscrolling I’ve mainly been building some experimental output plugins, fiddling with PyInstaller and messing around with shapefiles.

Packaged Datasette with PyInstaller

A long-running goal for Datasette has been to make it as easy to install as possible—something that’s not particularly straightforward for applications written in Python, at least in comparison to toolchains like Rust, Go or Deno.

Back in November 2017 Raffaele Messuti suggested using PyInstaller for this. I revisited that issue while looking through open Datasette issues ordered by least recently updated and decided to try it out again—and it worked! Here’s the resulting TIL, and I’ve attached a bundled datasette macOS binary file to the 0.53 release on GitHub.

There’s one catch: the binary isn’t signed, which means it shows security warnings that have to be worked around if you try to run it on macOS. I’ve started looking into the signing process—I’m going to need an Apple Developer account and to jump through a bunch of different hoops, but it looks like I should be able to get that working. Here’s my issue for that: GitHub Actions workflow to build and sign macOS binary executables—it looks like gon is the missing automation piece I need.

One thing that really impressed me about PyInstaller is the size of the resulting file. On both macOS and Linux it was able to create a roughly 8MB file containing Datasette, all of its dependencies AND a working Python environment. It’s pretty magic!
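
For reference, here’s a rough sketch of the entry-point approach—the launcher file name is my own invention, and a real build needs extra PyInstaller flags (--add-data, --hidden-import) to pick up Datasette’s templates and static assets:

# datasette_launcher.py -- a hypothetical entry point for PyInstaller to bundle.
# Datasette's console script points at the click group in datasette.cli, so a
# tiny launcher is enough for PyInstaller to trace the dependencies from.
from datasette.cli import cli

if __name__ == "__main__":
    cli()

# Then, roughly:
#   pyinstaller --onefile datasette_launcher.py
# plus --add-data / --hidden-import flags for Datasette's templates, static
# files and any plugins you want baked into the binary.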

datasette-css-properties

I wrote this up in detail a couple of days ago: datasette-css-properties is an amusingly weird output plugin that turns the results of a SQL query into CSS custom property definitions which can then be used to style or insert content into the current page.

sqlite-utils 3.2

The big new feature in this release is cached table counts using triggers, which I described last week. Full release notes here.

I’ve opened an issue to take advantage of this optimization in Datasette itself.

datasette-export-notebook

This is an idea I’ve been bouncing around for a while, and during a bout of attempted-coup-induced insomnia I decided to sketch out an initial version.

datasette-export-notebook is a plugin that adds an export-to-notebook option to any table or query.

This provides a page of documentation with copy-and-paste examples for loading data from that table or query into a Jupyter or Observable notebook.

Here’s a live demo. The interface currently looks like this:

Screenshot of the interface, showing copy-and-paste examples for Jupyter and Observable imports
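
To give a flavour, here’s a hedged sketch of the kind of Jupyter snippet the plugin generates—the exact code it produces may differ, but the idea is a one-liner that loads a table’s JSON export into a pandas DataFrame:

import pandas as pd

# Fetch the table's JSON export (?_shape=array returns a plain list of rows)
# and load it straight into a DataFrame. This URL points at the demo instance
# mentioned elsewhere in these notes; swap in your own table or query URL.
df = pd.read_json(
    "https://latest-with-plugins.datasette.io/fixtures/roadside_attractions.json?_shape=array"
)
df.head()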

As often happens when building even simple plugins like this, I identified some small improvements I can make to Datasette.

cbsa-datasette

Core-based statistical areas are a US government concept used for various statistical purposes. They are essentially metropolitan areas, based on central cities and the commuting area that they sit inside.

I built cbsa.datasettes.com this week to provide an API for looking up a CBSA based on a latitude and longitude point. Here’s a location within San Francisco for example:

https://cbsa.datasettes.com/core/by_lat_lon?longitude=-122.51&latitude=37.78

This returns San Francisco-Oakland-Berkeley, CA. Add .json and &_shape=array to the above URL to get a JSON API version.
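
Here’s a quick sketch of calling that JSON version from Python—the query name and parameters are exactly as above, but the column names in the response (NAME below) are an assumption:

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode(
    {"longitude": -122.51, "latitude": 37.78, "_shape": "array"}
)
url = "https://cbsa.datasettes.com/core/by_lat_lon.json?" + params

# Returns a JSON array of matching CBSA rows; the exact column names
# depend on the shapefile, so NAME here is an assumption.
with urllib.request.urlopen(url) as response:
    rows = json.load(response)

print(rows[0]["NAME"] if rows else "No CBSA found for that point")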

The data comes from a shapefile published by the Bureau of Transportation Statistics. I’m using shapefile-to-sqlite to import it into a SpatiaLite database, then publishing it to Cloud Run using this GitHub Actions workflow. Full details in the README.

I built this mainly to act as a simple updated example of how to use Datasette and SpatiaLite to provide an API against data from a shapefile. I published a tutorial about doing this for timezones three years ago, but shapefile-to-sqlite makes it much easier.

Releases this week

TIL this week

APIs from CSS without JavaScript: the datasette-css-properties plugin eight days ago

I built a new Datasette plugin called datasette-css-properties. It’s very, very weird—it adds a .css output extension to Datasette which outputs the result of a SQL query using CSS custom property format. This means you can display the results of database queries using pure CSS and HTML, no JavaScript required!

I was inspired by Custom Properties as State, published by Chris Coyier earlier this week. Chris points out that since CSS custom properties can be defined by an external stylesheet, a crafty API could generate a stylesheet with dynamic properties that could then be displayed on an otherwise static page.

This is a weird idea. Datasette’s plugins system is pretty much designed for weird ideas—my favourite thing about having plugins is that I can try out things like this without any risk of damaging the integrity of the core project.

So I built it! Here are some examples:

roadside_attractions is a table that ships as part of Datasette’s “fixtures” test database, which I write unit tests against and use for quick demos.

The URL of that table within Datasette is /fixtures/roadside_attractions. To get the first row in the table back as CSS properties, simply add a .css extension:

/fixtures/roadside_attractions.css returns this:

:root {
  --pk: '1';
  --name: 'The Mystery Spot';
  --address: '465 Mystery Spot Road, Santa Cruz, CA 95065';
  --latitude: '37.0167';
  --longitude: '-122.0024';
}

You can make use of these properties in an HTML document like so:

<link rel="stylesheet" href="https://latest-with-plugins.datasette.io/fixtures/roadside_attractions.css">
<style>
.attraction-name:after { content: var(--name); }
.attraction-address:after { content: var(--address); }
</style>
<p class="attraction-name">Attraction name: </p>
<p class="attraction-address">Address: </p>

Here that is on CodePen. It outputs this:

Attraction name: The Mystery Spot

Address: 465 Mystery Spot Road, Santa Cruz, CA 95065

Apparently modern screen readers will read these values, so they’re at least somewhat accessible. Sadly users won’t be able to copy and paste their values.

Let’s try something more fun: a stylesheet that changes colour based on the time of the day.

I’m in San Francisco, which is currently 8 hours off UTC. So this SQL query gives me the current hour of the day in my timezone:

SELECT strftime('%H', 'now') - 8

I’m going to define the following sequence of colours:

  • Midnight to 4am: black
  • 4am to 8am: grey
  • 8am to 4pm: yellow
  • 4pm to 6pm: orange
  • 6pm to midnight: black again

Here’s a SQL query for that, using the CASE expression:

SELECT
  CASE
    WHEN strftime('%H', 'now') - 8 BETWEEN 4
    AND 7 THEN 'grey'
    WHEN strftime('%H', 'now') - 8 BETWEEN 8
    AND 15 THEN 'yellow'
    WHEN strftime('%H', 'now') - 8 BETWEEN 16
    AND 18 THEN 'orange'
    ELSE 'black'
  END as [time-of-day-color]

Execute that here, then add the .css extension and you get this:

:root {
  --time-of-day-color: 'yellow';
}

This isn’t quite right. The yellow value is wrapped in single quotes—but that means it won’t work as a colour if used like this:

<style>
nav {
  background-color: var(--time-of-day-color);
}
</style>
<nav>This is the navigation</nav>

To fix this, datasette-css-properties supports a ?_raw= querystring argument for specifying that a specific named column should not be quoted, but should be returned as the exact value that came out of the database.

So we add ?_raw=time-of-day-color to the URL to get this:

:root {
  --time-of-day-color: yellow;
}

(I’m a little nervous about the _raw= feature. It feels like it could be a security hole, potentially as an XSS vector. I have an open issue about that and I’d love to get some feedback—I’m serving the page with the X-Content-Type-Options: nosniff HTTP header which I think should keep things secure but I’m worried there may be attack patterns that I don’t know about.)

Let’s take a moment to admire the full HTML document for this demo:

<link rel="stylesheet" href="https://latest-with-plugins.datasette.io/fixtures.css?sql=SELECT%0D%0A++CASE%0D%0A++++WHEN+strftime(%27%25H%27,+%27now%27)+-+8+BETWEEN+4%0D%0A++++AND+7+THEN+%27grey%27%0D%0A++++WHEN+strftime(%27%25H%27,+%27now%27)+-+8+BETWEEN+8%0D%0A++++AND+15+THEN+%27yellow%27%0D%0A++++WHEN+strftime(%27%25H%27,+%27now%27)+-+8+BETWEEN+16%0D%0A++++AND+18+THEN+%27orange%27%0D%0A++++ELSE+%27black%27%0D%0A++END+as+%5Btime-of-day-color%5D&_raw=time-of-day-color">
<style>
nav {
  background-color: var(--time-of-day-color);
}
</style>
<nav>This is the navigation</nav>

That’s a SQL query URL-encoded into the querystring for a stylesheet, loaded in a <link> element and used to style an element on a page. It’s calling and reacting to an API with not a line of JavaScript required!

Is this plugin useful for anyone? Probably not, but it’s a really fun idea, and it’s a great illustration of how having plugins dramatically reduces the friction against trying things like this out.

Weeknotes: A flurry of not-quite-finished features 13 days ago

My Christmas present to myself this year was to allow myself to spend a week working on stuff I found interesting, rather than sticking to the most important things. This may have been a mistake: it’s left me with a flurry of interesting but not-quite-finished features.

Prettier for Datasette

A couple of years ago I decided to adopt Black, an opinionated code formatter, for all of my Python projects.

This proved to be a much, much bigger productivity boost than I had expected.

It turns out I had been spending a non-trivial portion of my coding brain cycles thinking about code formatting. I was really stressing over details like how data structures should be indented, or how best to hit an 80 character line length limit.

Letting go and outsourcing all of that nit-picking to Black meant I didn’t have to spend any time thinking about that stuff, at all. It was so liberating! It even changed the way I write unit tests, as I described in How to cheat at unit tests with pytest and Black.

I’ve been hoping to adopt something similar for JavaScript for ages. The leading contender in that space is Prettier, but I’ve held back because I didn’t like the aesthetics of its handling of chained method calls.

That’s a pretty dumb reason to lose out on such a big productivity boost, so I’ve decided to forget about that issue entirely.

Datasette doesn’t have much JavaScript at the moment, but that’s likely to change in the future. So I’ve applied Prettier to the table.js file and configured a GitHub action that will fail if Prettier hasn’t been applied—I wrote that up as a TIL.

JavaScript plugins for Datasette

This is why I’ve been thinking about JavaScript: I’ve been pushing forward on the open issue to provide a JavaScript plugin mechanism that Datasette plugins can hook into.

Datasette’s Python plugins use pluggy, and I absolutely love it—it’s a powerful, simple and elegant way to define plugins and have them cooperate with each other.

Pluggy works by defining “hooks” and allowing plugins to register to respond to those hooks. When a hook is called by Datasette’s core code each plugin gets to return a value, and the collected values from all of them are made available as a Python list.

I want the same thing for JavaScript. As an example, consider the “column actions” menu activated by the cog icon on each column on the table page.

I’d like plugins written in JavaScript to be able to return additional menu items to be shown in that menu—similar to how Python plugins can extend the existing table actions menu using the menu_actions hook.

I’ve been fleshing out ideas for this in issue 693, and I’m getting close to something that I’m happy with.

A big concern is loading times. I’d like to keep Datasette pages loading as quickly as possible, so I don’t want to add a large chunk of JavaScript to each page to support plugins—and potentially have that block the page load until the script has been fetched and parsed.

But... if a JavaScript plugin is going to register itself with the plugin system, it needs to be able to call an API function to register itself! So at least some plugin JavaScript has to be made available as soon as possible in the lifecycle of the page.

I’ve always admired how Google Analytics deals with this problem. They define a tiny ga() function to start collecting analytics options, then execute those as soon as the larger script has been asynchronously loaded:

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-XXXXX-Y', 'auto');
ga('send', 'pageview');

Could I build a plugin system that’s so small I can inline it into the <head> of every page without worrying about the extra page bulk?

I think I can. Here’s the smallest version I’ve reached so far:

var datasette = datasette || {};
datasette.plugins = (() => {
  var registry = {};
  return {
    register: (hook, fn, parameters) => {
      if (!registry[hook]) {
        registry[hook] = [];
      }
      registry[hook].push([fn, parameters]);
    },
    call: (hook, args) => {
      args = args || {};
      var results = [];
      (registry[hook] || []).forEach(([fn, parameters]) => {
        /* Call with the correct arguments */
        var result = fn.apply(
          fn,
          parameters.map((parameter) => args[parameter])
        );
        if (result !== undefined) {
          results.push(result);
        }
      });
      return results;
    },
  };
})();

Which minifies (with uglify-es) to just 250 characters—it fits in a tweet!

var datasette=datasette||{};datasette.plugins=(()=>{var a={};return{register:(t,r,e)=>{a[t]||(a[t]=[]),a[t].push([r,e])},call:(t,r)=>{r=r||{};var e=[];return(a[t]||[]).forEach(([a,t])=>{var s=a.apply(a,t.map(a=>r[a]));void 0!==s&&e.push(s)}),e}}})();

Matthew Somerville suggested an even shorter version using object destructuring to get it down to 200 bytes. I’ll probably adopt that idea in my next iteration.

As part of this work I’ve also started thinking about how I can introduce minification into the Datasette development process, and how I can start writing unit tests for the JavaScript. I tried using Jest at first but now I’m leaning towards Cypress, since that will help support writing browser automation tests against some of the more advanced functionality as well.

datasette insert

Issue 1160 tracks my work on a big new idea for Datasette: datasette insert.

Datasette assumes you already have data in a SQLite database. Getting that data into your database is an exercise left for the user.

I’ve been building a plethora of additional tools to help with that challenge—things like sqlite-utils and csvs-to-sqlite and yaml-to-sqlite and so-on.

I want Datasette itself to include this functionality, starting with a new datasette insert sub-command.

So why bring this functionality into Datasette itself? A few reasons:

  • Helping people install Python software is hard enough already, without them needing to install multiple tools. If I can tell someone to brew install datasette and then datasette insert data.db file.csv they’ll have a much easier time than if they have to install and learn additional tools as well.
  • This is a fantastic opportunity for plugin hooks. I want a plugin hook that lets people build additional import formats for Datasette—by default it will support CSV, TSV and JSON but I want to be able to add GeoJSON and Shapefiles and DBF files and Excel and all sorts of other things.
  • Crucially, I plan to build both a command-line interface and a web interface within Datasette that lets people upload files—or copy and paste in their content, or specify a URL to import them from. This means any new format plugins will enable not just a command-line interface but a web interface for importing that data as well.
  • It’s still possible that Datasette will evolve beyond just SQLite in the future. I don’t want to have to build yaml-to-postgresql and so on—this new Datasette plugin hook could support importing formats into other databases too in the future.

I made some initial progress on this but it’s going to be a while before I can land anything to main.

Storing metadata in _metadata tables

Last week I wrote about the new _internal in-memory database in Datasette, which collects details of every attached database, table and column in order to power things like a forthcoming searchable, paginated homepage.

This made me think: what else could I address with in-memory SQLite databases?

I realized that this could overlap with a truly ancient Datasette feature idea: the ability to bundle metadata and templates within the SQLite database files that Datasette exposes. Bundling metadata is particularly interesting as it would allow that metadata to “travel” with the data, contained in the same binary SQLite file.

So I’ve been doing a whole lot of thinking about how this might work—including how the existing metadata.json format passed to Datasette when it starts up could be read into an in-memory database and then further combined with any metadata baked into tables in the various attached databases.

The work is entirely an issue thread at the moment, but I’m excited to spend more time on it in the future. I want to implement column-level metadata as part of this, so you can finally attach information to columns to help describe what they actually represent.

sqlite-utils cached table counts

select count(*) from table can be a slow query against large database tables, in both SQLite and PostgreSQL. This often surprises people—surely it should be trivial for a relational database to cache the total number of rows in a table? It turns out that’s not actually possible, due to MVCC—a relational database needs to be able to correctly answer that query from the point of view of the current transaction, even while other transactions are inserting or deleting rows.

Datasette tries to show the count of rows in a table in various places. This has led to a number of performance issues when running against large tables—Datasette has a number of existing tricks for handling that, including timing out counts after 50ms and caching the counts for databases opened in “immutable” mode, but it’s still a source of problems.

For a while I’ve been contemplating working around this with triggers. SQLite has a robust triggers system—how about writing triggers that increment or decrement a counter any time a row is inserted or deleted from a table?

My initial experiments indicate that this can indeed work really well!

I’ve been implementing this in sqlite-utils. Like most features of that package it comes in two flavours: a Python API you can call from your own code, and a command-line tool you can use to directly manipulate a database.

Enabling cached counts for a table in the Python API documentation describes how this works. You can call db.enable_counts() on a database to install the triggers, after which a _counts table will be automatically updated with a count record for each table.

The sqlite-utils enable-counts command can be used to configure this from the command-line.
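
Here’s a quick sketch of the Python API in action (the plants.db file and its rows are invented for illustration):

import sqlite_utils

db = sqlite_utils.Database("plants.db")  # hypothetical example database
db["plants"].insert({"id": 1, "name": "Mulberry"}, pk="id")

# Install the count-maintaining triggers on every table in the database
db.enable_counts()

db["plants"].insert({"id": 2, "name": "Dogwood"}, pk="id")

# The _counts table now holds one row per table with its cached count
print(list(db["_counts"].rows))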

The triggers it adds for a table look like this:

CREATE TRIGGER [plants_counts_insert] AFTER INSERT ON [plants]
BEGIN
  INSERT OR REPLACE INTO [_counts]
  VALUES (
    'plants',
    COALESCE(
      (SELECT count FROM [_counts] WHERE [table] = 'plants'),
    0
    ) + 1
  );
END
CREATE TRIGGER [plants_counts_delete] AFTER DELETE ON [plants]
BEGIN
  INSERT OR REPLACE INTO [_counts]
  VALUES (
    'plants',
    COALESCE(
      (SELECT count FROM [_counts] WHERE [table] = 'plants'),
    0
    ) - 1
  );
END

I haven’t benchmarked it yet, but my expectation is that this will add a minimal overhead to INSERT and DELETE operations against that table. If your table isn’t handling hundreds of writes a second I expect the overhead to be inconsequential.

This work is mostly done, and should be out in a sqlite-utils release within a few days.

Update 3rd January 2021: this is all now released in sqlite-utils 3.2.

Releases this week

TIL this week

Weeknotes: Datasette internals 20 days ago

I’ve been working on some fundamental changes to Datasette’s internal workings—they’re not quite ready for a release yet, but they’re shaping up in an interesting direction.

One of my goals for Datasette is to be able to handle a truly enormous variety of data in one place. The Datasette Library ticket tracks this effort—I’d like a newsroom (or any other information-based organization) to be able to keep hundreds of databases with potentially thousands of tables all in a single place.

SQLite databases are just files on disk, so if you have a TB of assorted databases of all shapes and sizes I’d like to be able to present them in a single Datasette instance.

If you have a hundred database files each with a hundred tables, that’s 10,000 tables total. This implies a need for pagination of the homepage, plus the ability to search and filter within the tables that are available to the Datasette instance.

Sounds like the kind of problem I’d normally solve with Datasette!

So in issue #1150 I’ve implemented the first part of a solution. On startup, Datasette now creates an in-memory SQLite database representing all of the connected databases and tables, plus those tables’ columns, indexes and foreign keys.
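
The introspection that feeds a catalog like this is all standard SQLite. Here’s a generic sketch—not Datasette’s actual code—of listing every table and its columns in a database file:

import sqlite3

conn = sqlite3.connect("fixtures.db")  # any attached database file

# sqlite_master lists the tables; PRAGMA table_info describes their columns
for (table,) in conn.execute(
    "select name from sqlite_master where type = 'table'"
):
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info([{table}])")]
    print(table, columns)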

For a demo, first sign in as root to the latest.datasette.io demo instance and then visit the private _internal database.

Screenshot of the columns table, which has columns describing the columns that make up all of the tables.

This new internal database is currently private because I don’t want to expose any metadata about tables that may themselves be covered by Datasette’s permissions mechanism.

The new in-memory database represents the schemas of the underlying database files—but what if those change, as new tables are added or modified?

SQLite has a neat trick to help with this: PRAGMA schema_version returns an integer representing the current version of the schema, which changes any time a table is created or modified.

This means I can cache the table schemas and only recalculate them if something has changed. Running PRAGMA schema_version against a connection to a database is an extremely fast query.

I first used this trick in datasette-graphql to cache the results of GraphQL schema introspection, so I’m confident it will work here too.
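
A minimal sketch of the pattern, outside of Datasette:

import sqlite3

conn = sqlite3.connect("fixtures.db")  # hypothetical database file
cached_schema_version = None

def schema_changed(conn):
    """Return True if the schema has changed since we last looked."""
    global cached_schema_version
    version = conn.execute("PRAGMA schema_version").fetchone()[0]
    changed = version != cached_schema_version
    cached_schema_version = version
    return changed

# Only rebuild the expensive cached introspection data when needed
if schema_changed(conn):
    print("Schema changed, refresh the cached table information")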

The problem with permissions

There’s one really gnarly challenge I still need to solve here: permissions.

Datasette’s permissions system, added in Datasette 0.44 back in June, works against plugin hooks. Permissions plugins can answer questions along the lines of "is the current authenticated actor allowed to perform the action view-table against table X?". Every time that question is asked, the plugins are queried via a plugin hook.

When I designed the permissions system I forgot a crucial lesson I’ve learned about permissions systems before: at some point, you’re going to need to answer the question “show me the list of all Xs that this actor has permission to act on”.

That time has now come. If I’m going to render a paginated homepage for Datasette listing 10,000+ tables, I need an efficient way to calculate the subset of those 10,000 tables that the current user is allowed to see.

Looping through and calling that plugin hook 10,000 times isn’t going to cut it.

So I’m starting to rethink permissions a bit—I’m glad I’ve not hit Datasette 1.0 yet!

In my favour is SQLite. Efficiently answering permissions questions in bulk, in a generic, customizable way, is a really hard problem. It’s made a lot easier by the presence of a relational database.

Issue #1152 tracks some of my thinking on this. I have a hunch that the solution is going to involve a new plugin hook, potentially along the lines of “return a fragment of SQL that identifies databases or tables that this user can access”. I can then run that SQL against the new _internal database (potentially combined with SQL returned by other plugins answering the same hook, using an AND or a UNION) to construct a set of tables that the user can access.
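
As a very rough illustration of that idea—nothing here is a real Datasette API, and the catalog table below is a stand-in for _internal—plugin-supplied SQL fragments could be combined and executed as a single query:

import sqlite3

# Hypothetical fragments returned by two different permission plugins, each
# selecting tables the current actor is allowed to see from the catalog
fragments = [
    "select database_name, table_name from tables where database_name = 'fixtures'",
    "select database_name, table_name from tables where table_name like 'public_%'",
]

# Combine with UNION (an AND-style policy would use INTERSECT instead)
sql = " union ".join(fragments)

conn = sqlite3.connect(":memory:")  # stand-in for the _internal database
conn.execute("create table tables (database_name text, table_name text)")
conn.executemany(
    "insert into tables values (?, ?)",
    [("fixtures", "roadside_attractions"), ("content", "public_news")],
)
print(conn.execute(sql).fetchall())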

If I can do that, and then run the query against an in-memory SQLite database, I should be able to provide a paginated, filtered interface to 10,000+ tables on the homepage that easily fulfills my performance goals.

I’d like to ship the new _internal database in a release quite soon, so I may end up implementing a slower version of this first. It’s definitely an interesting problem.

Building a search engine for datasette.io 27 days ago

This week I added a search engine to datasette.io, using the search indexing tool I’ve been building for Dogsheep.

A screenshot of datasette.io search results for ripgrep

Project search for Datasette

The Datasette project has a lot of constituent parts. There’s the project itself and its documentation—171 pages when exported to PDF and counting. Then there are the 48 plugins, sqlite-utils and 21 more tools for creating SQLite databases, the Dogsheep collection and over three years of content I’ve written about the project on my blog.

The new datasette.io search engine provides a faceted search interface to all of this material in one place. It currently searches across:

  • Every section of the latest documentation (415 total)
  • 48 plugin READMEs
  • 22 tool READMEs
  • 63 news items posted on the Datasette website
  • 212 items from my blog
  • Release notes from 557 package releases

I plan to extend it with more data sources in the future.

How it works: Dogsheep Beta

I’m reusing the search engine I originally built for my Dogsheep personal analytics project (see Personal Data Warehouses: Reclaiming Your Data). I call that search engine Dogsheep Beta. The name is a pun.

SQLite has great full-text search built in, and I make extensive use of that in Datasette projects already. But out of the box it’s not quite right for this kind of search engine that spans multiple different content types.

The problem is relevance calculation. I wrote about this in Exploring search relevance algorithms with SQLite—short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.

This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.

To get usable scores, you need everything in a single table. That’s what Dogsheep Beta does: it creates a new table, called search_index, and copies searchable content from the other tables into that new table.
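
Here’s a minimal, generic sketch of that single-table idea (not Dogsheep Beta’s actual schema): copy text from several sources into one FTS5 table, so rank is computed against a single shared corpus:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create virtual table search_index using fts5(type, title, body)")
conn.executemany(
    "insert into search_index (type, title, body) values (?, ?, ?)",
    [
        ("release", "sqlite-utils 3.2", "Cached table counts using triggers"),
        ("blog", "Weeknotes: Datasette internals", "The new _internal in-memory database"),
        ("readme", "datasette-css-properties", "Output plugin returning CSS custom properties"),
    ],
)

# One query, one corpus, so the rank values are directly comparable
for row in conn.execute(
    "select type, title from search_index where search_index match ? order by rank",
    ["triggers"],
):
    print(row)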

This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It’s the denormalized query engine design pattern in action.

Configuring Dogsheep Beta

There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.

Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.

(Having one configuration file handle both indexing and display feels a little inelegant, but it’s extremely productive for iterating on so I’m letting that slide.)

Here’s the full Dogsheep configuration for datasette.io. An annotated extract:

# Index material in the content.db SQLite file
content.db:
  # Define a search type called 'releases'
  releases:
    # Populate that search type by executing this SQL
    sql: |-
      select
        releases.id as key,
        repos.name || ' ' || releases.tag_name as title,
        releases.published_at as timestamp,
        releases.body as search_1,
        1 as is_public
      from
        releases
        join repos on releases.repo = repos.id
    # When displaying a search result, use this SQL to
    # return extra details about the item
    display_sql: |-
      select
        -- highlight() is a custom SQL function
        highlight(render_markdown(releases.body), :q) as snippet,
        html_url
      from releases where id = :key
    # Jinja template fragment to display the result
    display: |-
      <h3>Release: <a href="{{ display.html_url }}">{{ title }}</a></h3>
      <p>{{ display.snippet|safe }}</p>
      <p><small>Released {{ timestamp }}</small></p>

The core pattern here is the sql: key, which defines a SQL query that must return the following columns:

  • key—a unique identifier for this search item
  • title—a title for this indexed document
  • timestamp—a timestamp for when it was created. May be null.
  • search_1—text to be searched. I may add support for search_2 and search_3 later on to store text that will be treated with a lower relevance score.
  • is_public—should this be considered “public” data. This is a holdover from Dogsheep Beta’s application for personal analytics; I don’t actually need it for datasette.io.

To create an index, run the following:

dogsheep-beta index dogsheep-index.db dogsheep-config.yml

The index command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a search_index table in the dogsheep-index.db SQLite database file.

Here’s the search_index table for datasette.io.

When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).

To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.

If a display_sql: query is defined, that query will be executed for each result to populate the {{ display }} object made available to the template. Many Small Queries Are Efficient In SQLite.

Search term highlighting

I spent a bit of time thinking about search highlighting. SQLite has an implementation of highlighting built in—the snippet() function—but it’s not designed to be HTML-aware so there’s a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.

I ended up borrowing a BSD licensed highlighting class from the django-haystack project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results so I figured that’s good enough for me.

I used this one-off site plugin to wrap the highlighting code in a custom SQLite function. This meant I could call it from the display_sql: query in the Dogsheep Beta YAML configuration.
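
The shape of that one-off plugin looks roughly like this—a hedged sketch in which the simple strip-and-wrap highlight() function stands in for the class borrowed from django-haystack:

import re
from datasette import hookimpl

def highlight(text, query):
    """Strip HTML tags, then wrap each query term in a <b> element."""
    plain = re.sub(r"<[^>]+>", "", text or "")
    for term in (query or "").split():
        plain = re.sub(
            re.escape(term), lambda m: "<b>" + m.group(0) + "</b>", plain, flags=re.I
        )
    return plain

@hookimpl
def prepare_connection(conn):
    # Expose highlight(text, query) to SQL queries run through Datasette
    conn.create_function("highlight", 2, highlight)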

A custom template tag would be more elegant, but I don’t yet have a mechanism to expose custom template tags in the Dogsheep Beta rendering mechanism.

Build, index, deploy

The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.

Building the index is just another step of that process.

Here’s the deploy.yml GitHub workflow used by the site. It roughly does the following:

  • Download the current version of the content.db database file. This is so it doesn’t have to re-fetch release and README content that was previously stored there.
  • Download the current version of blog.db, with entries from my blog. This means I don’t have to fetch all entries, just the new ones.
  • Run build_directory.py, the script which fetches data for the plugins and tools pages.
    • This hits the GitHub GraphQL API to find new repositories tagged datasette-io and datasette-plugin and datasette-tool.
    • That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses github-to-sqlite to fetch them.
  • Imports the data from news.yaml into a news table using yaml-to-sqlite
  • Imports the latest PyPI download statistics for my packages from my simonw/package-stats repository, which implements git scraping against the most excellent pypistats.org.
  • Runs the dogsheep-beta index command to build a dogsheep-index.db search index.
  • Runs some soundness checks, e.g. datasette . --get "/plugins", to verify that Datasette is likely to at least return a 200 status for some critical pages once published.
  • Uses datasette publish cloudrun to deploy the results to Google Cloud Run, which hosts the website.

I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script generates screenshots using Puppeteer) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.

Build v.s. buy: how billing models affect your internal culture one month ago

Something to pay attention to when making a build v.s. buy decision is the impact that billing models will have on your usage of a tool.

Take logging for example. If you buy a log aggregation platform like Splunk Cloud or Loggly the pricing is likely based on the quantity of data you ingest per day.

This can set up a weird incentive. If you are already close to the limit of your plan, you’ll find that engineers are discouraged from logging new things.

This can have a subtle effect on your culture. Engineers who don’t want to get into a budgeting conversation will end up avoiding using key tools, and this can cost you a lot of money in terms of invisible lost productivity.

Tools that charge per-head have a similar problem: if your analytics tool charges per head, your junior engineers won’t have access to it. This means you won’t build a culture where engineers use analytics to help make decisions.

This is a very tricky dynamic. On the one hand it’s clearly completely crazy to invest in building your own logging or analytics solutions—you should be spending engineering effort solving the problems that are unique to your company!

But on the other hand, there are significant, hard-to-measure hidden costs of vendors with billing mechanisms that affect your culture in negative ways.

I don’t have a solution to this. It’s just something I’ve encountered that makes the “build v.s. buy” decision a lot more subtle than it can first appear.

It’s also worth noting that this is only a problem in larger engineering organizations. In a small startup the decision chain to “spend more money on logging” is a 30 second conversation with the founder with the credit card—even faster if you’re the founder yourself!

Update: a process solution?

Thinking about this more, I realize that this isn’t a technology problem: it’s a process and culture problem. So there should be a process and cultural solution.

One thing that might work would be to explicitly consider this issue in the vendor selection conversations, then document it once the new tool has been implemented.

A company-wide document listing these tools, with clear guidance as to when it’s appropriate to increase capacity/spend and a documented owner (individual or team) plus contact details could really help overcome the standard engineer’s resistance to having conversations about budget.

This post started out as a comment on Hacker News.

Elsewhere


7th January 2021

  • datasette-css-properties (via) My new Datasette plugin defines a “.css” output format which returns the data from the query as a valid CSS stylesheet defining custom properties for each returned column. This means you can build a page using just HTML and CSS that consumes API data from Datasette, no JavaScript required! Whether this is a good idea or not is left as an exercise for the reader. #7th January 2021, 7:42 pm
  • Custom Properties as State. Fascinating thought experiment by Chris Coyier: since CSS custom properties can be defined in an external stylesheet, we can build APIs that return stylesheets defining dynamically server-side generated CSS values for things like time-of-day colour schemes, or even strings that can be inserted using ::after { content: var(--my-property); }.

    This gave me a very eccentric idea for a Datasette plugin... #7th January 2021, 7:39 pm

6th January 2021

  • When you know something it is almost impossible to imagine what it is like not to know that thing. This is the curse of knowledge, and it is the root of countless misunderstandings and inefficiencies. Smart people who are comfortable with complexity can be especially prone to it!

    If you don’t guard against the curse of knowledge it has the potential to obfuscate all forms of communication, including code. The more specialized your work, the greater the risk that you will communicate in ways that are incomprehensible to the uninitiated.

    Joel Goldberg # 6th January 2021, 7:43 pm

  • brumm.af/shadows (via) I did not know this trick: by defining multiple box-shadow values as a comma separated list you can create much more finely tuned shadow effects. This tool by Philipp Brumm provides a very smart UI for designing shadows. #6th January 2021, 4:12 pm

5th January 2021

  • DALL·E: Creating Images from Text (via) “DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.”. The examples in this paper are astonishing—“an illustration of a baby daikon radish in a tutu walking a dog” generates exactly that. #5th January 2021, 8:31 pm
  • Generally, product-aligned teams deliver better products more rapidly. Again, Conway’s Law is inescapable; if delivering a new feature requires several teams to coordinate, you’ll struggle compared to an org where a single team can execute on a new feature.

    Jacob Kaplan-Moss # 5th January 2021, 4:33 pm

  • Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) (via) ClickHouse by Yandex is an open source column-oriented data warehouse, designed to run analytical queries against TBs of data. They’ve loaded the full GitHub Archive of events since 2011 into a public instance, which is a great way of both exploring GitHub activity and trying out ClickHouse. Here’s a query I just ran that shows number of watch events per year, for example:

    SELECT toYear(created_at) as yyyy, count() FROM github_events WHERE event_type = 'WatchEvent' group by yyyy #5th January 2021, 1:02 am


3rd January 2021

  • sqlite-utils 3.2 (via) As discussed in my weeknotes yesterday, this is the release of sqlite-utils that adds the new “cached table counts via triggers” mechanism. #3rd January 2021, 9:25 pm

29th December 2020

  • You know Google Maps? What I do is, like, build little pieces of Google Maps over and over for people who need them but can’t just use Google Maps because they’re not allowed to for some reason, or another.

    Joe Morrison # 29th December 2020, 8:32 pm

28th December 2020

  • Replicating SQLite with rqlite (via) I’ve been trying out rqlite, a “lightweight, distributed relational database, which uses SQLite as its storage engine”. It’s written in Go and uses the Raft consensus algorithm to allow a cluster of nodes to elect a leader and replicate SQLite statements between them. By default it uses in-memory SQLite databases with an on-disk Raft replication log—here are my notes on running it in “on disk” mode as a way to run multiple Datasette processes against replicated SQLite database files. #28th December 2020, 7:51 pm
  • While copywriting is used to persuade a user to take a certain action, technical writing exists to support the user and remove barriers to getting something done. Good technical writing is hard because writers must get straight to the point without losing or confusing readers.

    Stephanie Morillo # 28th December 2020, 3:58 pm

19th December 2020

  • How Shopify Uses WebAssembly Outside of the Browser (via) I’m fascinated by applications of WebAssembly outside the browser. As a Python programmer I’m excited to see native code libraries being compiled to WASM in a way that lets me call them from Python code via a bridge, but the other interesting application is executing untrusted code in a sandbox.

    Shopify are doing exactly that—they are building a kind-of plugin mechanism where partner code compiled to WASM runs inside their architecture using Fastly’s Lucet. The performance numbers are in the same ballpark as native code.

    Also interesting: they’re recommending AssemblyScript, a TypeScript-style language designed to compile directly to WASM without needing any additional interpreter support, as required by dynamic languages such as JavaScript, Python or Ruby. #19th December 2020, 4:46 pm