Simon Willison’s Weblog

On management, webperformance, sql, asgi, pypi, ...

 

Recent entries

Weeknotes: Improv at Stanford, planning Datasette Cloud six days ago

Last week was the first week of the quarter at Stanford—which is called “shopping week” here because students are expected to try different classes to see which ones they are going to stick with.

I’ve settled on three classes this quarter: Beginning Improvising, Designing Machine Learning and Entrepreneurship: Formation of New Ventures.

Beginning Improvising is the Stanford improv theater course. It’s a big time commitment: three two-hour sessions a week for ten weeks is nearly 60 hours of improv!

It’s already proving to be really interesting though: it turns out the course is a thinly disguised applied psychology course.

Improv is about creating a creative space for other people to shine. The applications to professional teamwork are obvious and fascinating to me. I’ll probably write more about this as the course continues.

Designing Machine Learning is a class at the Stanford d.School taught by Michelle Carney and Emily Callaghan. It focuses on multidisciplinary applications of machine learning, mixing together students from many different disciplines around Stanford.

I took a fast.ai deep learning course last year which gave me a basic understanding of the code side of neural networks, but I’m much more interested in figuring out applications, so this seems like a better fit than a more code-focused course.

The class started out building some initial models using Google’s Teachable Machine tool, which is fascinating. It lets you train transfer learning models for image, audio and pose recognition entirely in your browser—no data is transferred to Google’s servers at all. You can then export those models and use them with a variety of different libraries—I’ve got them to work with both JavaScript and Python already.
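Here’s a minimal sketch of what loading one of those exported models in Python can look like, assuming an image model exported in Keras format as keras_model.h5 with a labels.txt file alongside it (the file names and preprocessing details here are assumptions, so check the sample code that ships with the export):

import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

# Load the exported model and the class labels (one label per line)
model = load_model("keras_model.h5")
labels = [line.strip() for line in open("labels.txt")]

# Teachable Machine image models expect 224x224 input; the exact pixel
# scaling may differ, so verify against the export's own sample snippet
image = Image.open("example.jpg").convert("RGB").resize((224, 224))
data = (np.asarray(image, dtype=np.float32) / 127.5) - 1.0

prediction = model.predict(np.expand_dims(data, axis=0))[0]
print(labels[int(np.argmax(prediction))])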

I’m taking Entrepreneurship: Formation of New Ventures because of the rave reviews I heard from other JSK fellows who took it last quarter. It’s a classic case-study business school class: each session features a guest speaker who is a successful entrepreneur, and the class discusses their case for the first two-thirds of the session while they listen in—then finds out how well the discussion matched what actually happened.

Planning Datasette Cloud

Shopping week kept me pretty busy so I’ve not done much actual development over the past week, but I have started planning out and researching my next major project, which I’m currently calling Datasette Cloud.

Datasette Cloud will be an invite-only hosted SaaS version of Datasette. It’s designed to help get news organizations on board with the software without having to talk them through figuring out their own hosting, so I can help them solve real problems and learn more about how the ecosystem should evolve to support them.

I’d love to be able to run this on serverless hosting platforms like Google Cloud Run or Heroku, but sadly those tools aren’t an option for me due to a key problem: I’m trying to build a stateful service (SQLite databases need to live on a local disk) in 2020.

I posed this challenge on Twitter back in October.

I’ve been exploring my options since then, and I think I’ve settled on a decidedly 2010-era way of doing this: I’m going to run my own instances! So I’ve been exploring hosting Datasette on both AWS Lightsail and Digital Ocean Droplets over the past few months.

My current plan is to have each Datasette Cloud account run as a Datasette instance in its own Docker container, primarily to ensure filesystem isolation: different accounts must not be able to see each other’s database files.

I started another discussion about this on Twitter and had several recommendations for Traefik as a load balancer for assigning hostnames to different Docker containers, which is exactly what I need to do.

So this afternoon I made my way through Digital Ocean’s outstanding tutorial How To Use Traefik as a Reverse Proxy for Docker Containers on Ubuntu 18.04 and I think I’ve convinced myself that this is a smart way forward.

So, mostly a research week but I’ve got a solid plan for my next steps.

This week’s Niche Museums

I also finally got around to implementing a map.

Building a sitemap.xml with a one-off Datasette plugin 13 days ago

One of the fun things about launching a new website is re-learning what it takes to promote a website from scratch on the modern web. I’ve been thoroughly enjoying using Niche Museums as an excuse to explore 2020-era SEO.

I used to use Google Webmaster Tools for this, but apparently that got rebranded as Google Search Console back in May 2015. It’s really useful. It shows which search terms got impressions, which ones got clicks and lets you review which of your pages are indexed and which have returned errors.

Niche Museums has been live since October 24th, but it was a SPA for the first month. I switched it to server-side rendering (with separate pages for each museum) on November 25th. The Google Search Console shows it first appeared in search results on 2nd December.

So far, I’ve had 35 clicks! Not exactly earth-shattering, but every site has to start somewhere.

Screenshot of the Google Search Console.

In a bid to increase the number of indexed pages, I decided to build a sitemap.xml. This probably isn’t necessary—Google advises that you might not need one if your site is “small”, defined as 500 pages or fewer (Niche Museums lists 88 museums, though that’s still increasing by one every day). It’s nice to be able to view that sitemap and confirm that those pages have all been indexed inside the Search Console though.

Since Niche Museums is entirely powered by a customized Datasette instance, I needed to figure out how best to build that sitemap.

One-off plugins

Datasette’s most powerful customization options are provided by the plugins mechanism. Back in June I ported Datasette to ASGI, and the subsequent release of Datasette 0.29 introduced a new asgi_wrapper plugin hook. This hook makes it possible to intercept requests and implement an entirely custom response—ideal for serving up a /sitemap.xml page.

I considered building and releasing a generic datasette-sitemap plugin that could be used anywhere, but that felt like overkill for this particular problem. Instead, I decided to take advantage of Datasette’s --plugins-dir= option to build a one-off custom plugin for the site.

The Datasette instance that runs Niche Museums starts up like this:

$ datasette browse.db about.db \
    --template-dir=templates/ \
    --plugins-dir=plugins/ \
    --static css:static/ \
    -m metadata.json

This serves the two SQLite database files, loads custom templates from the templates/ directory, serves static files from the static/ directory under the /css/ path (which is how www.niche-museums.com/css/museums.css works) and loads metadata settings from metadata.json. All of these files are on GitHub.

It also tells Datasette to look for any Python files in the plugins/ directory and load those up as plugins.

I currently have four Python files in that directory—you can see them here. The sitemap.xml is implemented using the new sitemap.py plugin file.

Here’s the first part of that file, which wraps the Datasette ASGI app with middleware that checks for the URL /robots.txt or /sitemap.xml and returns custom content for either of them:

from datasette import hookimpl
from datasette.utils.asgi import asgi_send


@hookimpl
def asgi_wrapper(datasette):
    def wrap_with_robots_and_sitemap(app):
        async def robots_and_sitemap(scope, receive, send):
            if scope["path"] == "/robots.txt":
                await asgi_send(
                    send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
                )
            elif scope["path"] == "/sitemap.xml":
                await send_sitemap(send, datasette)
            else:
                await app(scope, receive, send)

        return robots_and_sitemap

    return wrap_with_robots_and_sitemap

The boilerplate here is a little convoluted, but this does the job. I’m considering adding alternative plugin hooks for custom pages that could simplify this in the future.

The asgi_wrapper(datasette) plugin function is expected to return a function which will be used to wrap the Datasette ASGI application. In this case that wrapper function is called wrap_with_robots_and_sitemap(app). Here’s the Datasette core code that builds the ASGI app and applies the wrappers:

asgi = AsgiLifespan(
    AsgiTracer(DatasetteRouter(self, routes)), on_startup=setup_db
)
for wrapper in pm.hook.asgi_wrapper(datasette=self):
    asgi = wrapper(asgi)

So this plugin will be executed as:

asgi = wrap_with_robots_and_sitemap(asgi)

The wrap_with_robots_and_sitemap(app) function then returns another, asynchronous function. This function follows the ASGI protocol specification, and has the following signature and body:

async def robots_and_sitemap(scope, receive, send):
    if scope["path"] == "/robots.txt":
        await asgi_send(
            send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
        )
    elif scope["path"] == "/sitemap.xml":
        await send_sitemap(send, datasette)
    else:
        await app(scope, receive, send)

If the incoming URL path is /robots.txt, the function directly returns a reference to the sitemap, as seen at www.niche-museums.com/robots.txt.

If the path is /sitemap.xml, it calls the send_sitemap(...) function.

For any other path, it proxies the call to the original ASGI app function that was passed to the wrapper function: await app(scope, receive, send).

The most interesting part of the implementation is that send_sitemap() function. This is the function which constructs the sitemap.xml returned by www.niche-museums.com/sitemap.xml.

Here’s what that function looks like:

async def send_sitemap(send, datasette):
    content = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for db in datasette.databases.values():
        hidden = await db.hidden_table_names()
        tables = await db.table_names()
        for table in tables:
            if table in hidden:
                continue
            for row in await db.execute("select id from [{}]".format(table)):
                content.append(
                    "<url><loc>https://www.niche-museums.com/browse/{}/{}</loc></url>".format(
                        table, row["id"]
                    )
                )
    content.append("</urlset>")
    await asgi_send(send, "\n".join(content), 200, content_type="application/xml")

The key trick here is to use the datasette instance object which was passed to the asgi_wrapper() plugin hook.

The code uses that instance to introspect the attached SQLite databases. It loops through them listing all of their tables, and filtering out any hidden tables (which in this case are tables used by the SQLite FTS indexing mechanism). Then for each of those tables it runs select id from [tablename] and uses the results to build the URLs that are listed in the sitemap.

Finally, the resulting XML is concatenated together and sent back to the client with an application/xml content type.

For the moment, Niche Museums only has one table that needs including in the sitemap—the museums table.

I have a longer-term goal to provide detailed documentation for the datasette object here: since it’s exposed to plugins it’s become part of the API interface for Datasette itself. I want to stabilize this before I release Datasette 1.0.

This week’s new museums

I had a lot of fun writing up the Griffith Observatory: it turns out founding donor Griffith J. Griffith was a truly terrible individual.

sqlite-utils 2.0: real upserts 21 days ago

I just released version 2.0 of my sqlite-utils library/CLI tool to PyPI.

It’s 2.0 because this is my first backwards compatibility breaking version of the library. I was hoping to avoid this for as long as possible: given semantic versioning, my opinion is that the highest quality libraries are the 1.x versions with the highest sub-version. Version 1.253 means a library managed to ship 253 new features without once breaking compatibility!

For sqlite-utils I needed the clean break, because I made a big mistake in my definition of “upsert”.

An upsert is a database operation which says “Update this record with these new values. If the record does not exist, create it.”

This is actually a documented feature of SQLite version 3.24.0 and higher. I’ve seen the term used for other databases such as PostgreSQL as well.

The problem is: when I first implemented the .upsert() method for sqlite-utils I incorrectly assumed that upsert was an alias for INSERT OR REPLACE.

So in sqlite-utils 1.x, the following code:

db["dogs"].upsert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id")

Executes the following SQL:

INSERT OR REPLACE INTO dogs
    ("id", "name", "age")
VALUES
    (1, "Cleo", 4);

It turns out this isn’t how upsert is expected to work. This query entirely replaces any existing record. But... an upsert should update the record in place. Crucially, it should leave any columns not referenced in the upsert alone. My implementation overwrote the entire row, wiping out any columns that weren’t included in the new record.

So if I run this Python code:

db["dogs"].upsert({
    "id": 1,
    "age": 5
}, pk="id")

The existing record for id=1 should have its age updated, but the name should stay the same.
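To make the difference concrete, here’s a quick sketch using Python’s sqlite3 module directly (this assumes the underlying SQLite library is version 3.24.0 or newer, which is what added the ON CONFLICT upsert clause):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table dogs (id integer primary key, name text, age integer)")
conn.execute("insert into dogs (id, name, age) values (1, 'Cleo', 4)")

# INSERT OR REPLACE rewrites the whole row: the name column is lost
conn.execute("insert or replace into dogs (id, age) values (1, 5)")
print(conn.execute("select name, age from dogs where id = 1").fetchone())  # (None, 5)

# A true upsert only touches the columns you supply: the name survives
conn.execute("update dogs set name = 'Cleo' where id = 1")
conn.execute(
    "insert into dogs (id, age) values (1, 6) "
    "on conflict (id) do update set age = excluded.age"
)
print(conn.execute("select name, age from dogs where id = 1").fetchone())  # ('Cleo', 6)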

I thought hard about whether I could keep the existing, incorrect .upsert() behavior and introduce a different method name or argument that did things right—maintaining backwards compatibility at the price of a less intuitive API—but this felt like too big a discrepancy. And this is what major version number changes are for! You can follow my deliberations about this in this issue.

While this is technically a breaking change, when I look at the code I have already written that uses sqlite-utils it turns out all of my uses of .upsert() are compatible with the new implementation anyway—they all send the full record object, so they will override every field whether they are running against 1.x or 2.x.

If I’m lucky, other users of sqlite-utils will be unaffected by this behavior change as well. So hopefully it will be a painless upgrade.

If you do need to keep the old behavior, I’ve made it available as a replace=True argument to .insert() and .insert_all(), like so:

db["dogs"].insert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id", replace=True)

I did think about introducing a new piece of terminology for this—.inplace(), for insert-or-replace—but after getting in trouble redefining upsert by mistake I decided inventing new language wasn’t worth the risk.

Using this for Niche Museums

The project that made me realize I needed this upsert change was Niche Museums.

Niche Museums works by building a SQLite database file from a museums.yaml file. The YAML file contains the data for all 81 museums on the site. To add or edit a listing, I edit that file—sometimes in Visual Studio Code on my Mac, sometimes via the GitHub web interface and often by using the Working Copy git client on my iPhone.

Creating the database is done using my yaml-to-sqlite tool:

$ yaml-to-sqlite browse.db museums museums.yaml --pk=id

The build process then runs a few extra scripts to annotate the records with extra data.

annotate_timestamps.py iterates through the full git history and uses it to figure out when each record was first created or last updated.

annotate_nominatim.py uses the OpenStreetMap Nominatim API to reverse geocode the latitude and longitude of each museum, adding extra columns for the city, state, country and all kinds of other interesting geographical details.

I’m not doing much with this OSM data yet—mainly using it to include the OSM country and city name in the full-text search index—but I hope to use it for faceting and other navigation improvements in the future (and maybe location-based breadcrumbs).

Here’s the problem though: every time I deploy the site, I run those scripts. Nominatim asks you not to hit their API more than once a second, so I have a sleep(1) in the code. This means the more museums I add, the longer that build step takes to reverse geocode them all.

It would be great to avoid geocoding museums that have already been geocoded. Properly implemented upsert lets me do exactly that.

I haven’t made these changes yet, but my plan is to update the build script so that instead of starting from a blank database each time, it downloads the previously built version of the database and runs an upsert against it with the data from the YAML. This will overwrite all of the data that I might have updated in that file, but leave the osm_* columns that were created by the annotate_nominatim.py script in place.

That script can then just run against records for which osm_country is null. As such, it should only geocode newly created records.
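Here’s a rough sketch of what that updated build step might look like using sqlite-utils 2.0 (the file names and the osm_country column mirror what’s described above, but the details are still just a plan):

import yaml
import sqlite_utils

# Assumes the previously built browse.db has already been downloaded
db = sqlite_utils.Database("browse.db")
museums = yaml.safe_load(open("museums.yaml"))

# Upsert by primary key: columns present in the YAML are overwritten,
# while the osm_* columns added by annotate_nominatim.py are left intact
db["museums"].upsert_all(museums, pk="id", alter=True)

# Only records that have never been geocoded still need the Nominatim call
to_geocode = list(db["museums"].rows_where("osm_country is null"))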

Other improvements this week

I added a few new features to Niche Museums this week.

First, I added simple photo galleries to the listings pages, using CSS columns. This means I can add more than one photo to a museum! I’ve done this to a few now, including going back to the listing for the Ilfracombe Museum to add a whole bunch of extra photos.

I added photos and press links to the Atom feed for the site. Since the Atom feed is defined using a custom SQL query and datasette-atom, achieving this involved building a truly diabolical SQL query (making extensive use of the SQLite json1 extension).

I fixed an issue with the search engine where unexpected punctuation could cause an FTS syntax error in SQLite. I’ve extracted that into a fix for a Datasette issue as well.

The site search engine also now attempts to geocode the user’s query and offers a link to the corresponding lat/lon radius search for that point. Try a search for new orleans to see what that looks like—the implementation (entirely client-side, calling the Nominatim API) is here.

Here are the museums I added this week:

Weeknotes: Datasette 0.33 27 days ago

I released Datasette 0.33 yesterday. The release represents an accumulation of small changes and features since Datasette 0.32 back in November. Duplicating the release notes:

  • rowid is now included in dropdown menus for filtering tables (#636)
  • Columns are now only suggested for faceting if they have at least one value with more than one record (#638)
  • Queries with no results now display “0 results” (#637)
  • Improved documentation for the --static option (#641)
  • asyncio task information is now included on the /-/threads debug page
  • Bumped Uvicorn dependency to 0.11
  • You can now use --port 0 to listen on an available port
  • New template_debug setting for debugging templates, e.g. https://latest.datasette.io/fixtures/roadsideattractions?context=1 (#654)

The last three items deserve extra explanation.

Port 0

I run a lot of Datasette instances on my laptop—I often have 5 or 6 running representing different projects or experiments.

Eventually this means I start to hit port conflicts. I’ll run datasette mydb.db -p 8005 and get an error because port 8005 is already in use by something else.

I asked on Twitter if there were any downsides to having Datasette automatically pick an unused port... and @davids and @dominicrodger pointed out that port 0 is a Unix convention for “pick a port for me”.

It turned out Datasette already supported this! If you ran datasette --port=0 the underlying Uvicorn server would indeed pick an unused port and start serving on it.
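The convention itself is easy to demonstrate in a few lines of Python:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))  # port 0 asks the operating system to pick a free port
print("Listening on port", s.getsockname()[1])  # the port that was actually assigned
s.close()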

There was just one catch: the URL output to the console would show port 0, so it wasn’t clear which port to actually visit. This was a bug in Uvicorn.

Open source to the rescue... I filed an issue with Uvicorn and then constructed a pull request. Tom Christie merged it five hours later and shipped a release. Datasette 0.33 depends on that updated version of Uvicorn, so --port=0 now works as expected!

template_debug

I’m always looking for new ways to add debugging and development tools to Datasette. I added support for ?_trace=1 to see executed SQL queries back in May for example.

While thinking through a refactoring of Datasette’s template rendering to better support plugins, I realized there was an opportunity for better debugging around templates.

Datasette supports custom templates, which means template authors need a clear way to see what variables are available to their template logic.

I tried adding a ?_context=1 parameter to dump out the template context, and it instantly helped me better understand what was going on.

There was just one problem: what about secret information? By default Datasette doesn’t include any secrets (signing keys, API tokens etc) in its template context—but Datasette plugins are allowed to add things to the context, and I can’t guarantee that a future plugin won’t add something that shouldn’t be exposed by the ?_context=1 debugging parameter.

The solution was to have context debugging only available if the new template_debug configuration setting is turned on:

$ datasette --memory --config template_debug:1

I’ve set up the latest.datasette.io demo to run with this setting, so you can now view the context for any page on that site.

Niche Museums

Natalie and I are in the UK this week. We spent a few days in London and managed to get in a few niche museums, including the first four on this week’s new listings on Niche Museums:

I also made a small fix to the site’s Atom feed: it now respects newlines in the entries, by rendering the description using Markdown.

Here’s the implementation. Since the Atom feed is defined by a SQL query (using datasette-atom) I took advantage of Datasette’s ability to load custom plugins from a per-site plugins/ directory and added a custom SQLite SQL function for rendering markdown, then added that function call to the query that defines the feed.
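As a sketch of the pattern (the real plugin file is linked above; the function and file names here are illustrative), a one-off plugin dropped into plugins/ can register a SQL function via Datasette’s prepare_connection hook:

import markdown
from datasette import hookimpl


@hookimpl
def prepare_connection(conn):
    # Register a one-argument SQL function so the feed's SQL query can call
    # render_markdown(description) directly
    conn.create_function(
        "render_markdown", 1, lambda text: markdown.markdown(text or "")
    )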

Logging to SQLite using ASGI middleware one month ago

I had some fun playing around with ASGI middleware and logging during our flight back to England for the holidays.

asgi-log-to-sqlite

I decided to experiment with SQLite as a logging mechanism. I wouldn’t use this on a high traffic site, but most of my Datasette-related projects are small enough that logging HTTP traffic directly to a SQLite database feels like it should work reasonably well.

Once your logs are in a SQLite database, you can use Datasette to analyze them. I think this could be a lot of fun.

asgi-log-to-sqlite is my first exploration of this idea. It’s a piece of ASGI middleware which wraps an ASGI application and then logs relevant information from the request and response to an attached SQLite database.

You use it like this:

from asgi_log_to_sqlite import AsgiLogToSqlite
from my_asgi_app import app

app = AsgiLogToSqlite(app, "/tmp/log.db")

Here’s a demo Datasette instance showing logs from my testing: asgi-log-demo-j7hipcg4aq-uc.a.run.app

As always with Datasette, the data is at its most interesting once you apply some facets.

Intercepting requests to and from the wrapped ASGI app

There are a couple of interesting parts of the implementation. The first is how the information is gathered from the request and response.

This is a classic pattern for ASGI middleware. The ASGI protocol has three key components: a scope dictionary describing the incoming request, and two async functions, receive and send, which are used to retrieve data from and send data to the connected client (usually a browser).

Most middleware works by wrapping those functions with custom replacements. That’s what I’m doing here:

class AsgiLogToSqlite:
    def __init__(self, app, file):
        self.app = app
        self.db = sqlite_utils.Database(file)
    # ...
    async def __call__(self, scope, receive, send):
        response_headers = []
        body_size = 0
        http_status = None

        async def wrapped_send(message):
            nonlocal body_size, response_headers, http_status
            if message["type"] == "http.response.start":
                response_headers = message["headers"]
                http_status = message["status"]

            if message["type"] == "http.response.body":
                body_size += len(message["body"])

            await send(message)

        start = time.time()
        await self.app(scope, receive, wrapped_send)
        end = time.time()

My wrapped_send() function replaces the original send() function with one that pulls out some of the data I want to log from the messages that are being sent to the client.

I record a start time, then await the original ASGI application, then record an end time when it finishes.

Logging to SQLite using sqlite-utils

I’m using my sqlite-utils library to implement the logging. My first version looked like this:

db["requests"].insert({
    "path": scope.get("path"),
    "response_headers": str(response_headers),
    "body_size": body_size,
    "http_status": http_status,
    "scope": str(scope),
}, alter=True)

sqlite-utils automatically creates a table with the correct schema the first time you try to insert a record into it. This makes it ideal for rapid prototyping. In this case I captured stringified versions of various data structures so I could look at them in my browser with Datasette.

The alter=True argument here means that if I attempt to insert a new shape of record into an existing table, any missing columns will be added automatically as well. Again, handy for prototyping.
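A quick way to see both behaviors in action (the table and column names here are just for demonstration):

import sqlite3
import sqlite_utils

db = sqlite_utils.Database(sqlite3.connect(":memory:"))

# The first insert creates the table, with a schema inferred from the record
db["requests"].insert({"path": "/", "http_status": 200})
print(db["requests"].schema)

# A record with an extra key plus alter=True adds the missing column
db["requests"].insert({"path": "/about", "http_status": 200, "body_size": 1024}, alter=True)
print(db["requests"].schema)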

Based on the above, I evolved the code into recording the values I wanted to see in my logs—the full URL path, the User-Agent, the HTTP referrer, the IP and so on.

This resulted in a LOT of duplicative data. Values like the path, user-agent and HTTP referrer are the same across many different requests.

Regular plain text logs can solve this with gzip compression, but you can’t gzip a SQLite database and still expect it to work.

Since we are logging to a relational database, we can solve for duplicate values using normalization. We can extract out those lengthy strings into separate lookup tables—that way we can store mostly integer foreign key references in the requests table itself.

After a few iterations, my database code ended up looking like this:

with db.conn:  # Use a transaction
    db["requests"].insert(
        {
            "start": start,
            "method": scope["method"],
            "path": lookup(db, "paths", path),
            "query_string": lookup(db, "query_strings", query_string),
            "user_agent": lookup(db, "user_agents", user_agent),
            "referer": lookup(db, "referers", referer),
            "accept_language": lookup(db, "accept_languages", accept_language),
            "http_status": http_status,
            "content_type": lookup(db, "content_types", content_type),
            "client_ip": scope.get("client", (None, None))[0],
            "duration": end - start,
            "body_size": body_size,
        },
        alter=True,
        foreign_keys=self.lookup_columns,
    )


def lookup(db, table, value):
    return db[table].lookup({
        "name": value
    }) if value else None

The table.lookup() method in sqlite-utils is designed for exactly this use-case. If you pass it a value (or multiple values) it will ensure the underlying table has those columns with a unique index on them, then get-or-insert your data and return you the primary key.
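For example (using a throwaway in-memory database, so the table name is arbitrary):

import sqlite3
import sqlite_utils

db = sqlite_utils.Database(sqlite3.connect(":memory:"))

# The first call creates the table, inserts the row and returns its primary key
first_id = db["user_agents"].lookup({"name": "Mozilla/5.0"})

# A second call with the same value returns the same primary key
assert db["user_agents"].lookup({"name": "Mozilla/5.0"}) == first_id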

Automatically creating tables is fine for an initial prototype, but it starts getting a little messy once you have foreign key relationships that you need to be able to rely on. I moved to explicit table creation in an ensure_tables() method that’s called once when the middleware class is used to wrap the underlying ASGI app:

    lookup_columns = (
        "path",
        "user_agent",
        "referer",
        "accept_language",
        "content_type",
        "query_string",
    )

    def ensure_tables(self):
        for column in self.lookup_columns:
            table = "{}s".format(column)
            if not self.db[table].exists:
                self.db[table].create({
                    "id": int,
                    "name": str
                }, pk="id")
        if not self.db["requests"].exists:
            self.db["requests"].create({
                "start": float,
                "method": str,
                "path": int,
                "query_string": int,
                "user_agent": int,
                "referer": int,
                "accept_language": int,
                "http_status": int,
                "content_type": int,
                "client_ip": str,
                "duration": float,
                "body_size": int,
            }, foreign_keys=self.lookup_columns)

I’m increasingly using this pattern in my sqlite-utils projects. It’s not a full-grown migrations system but it’s a pretty low-effort way of creating tables correctly provided they don’t yet exist.

Here’s the full implementation of the middleware.

Configuring the middleware for use with Datasette

Publishing standalone ASGI middleware for this kind of thing is neat because it can be used with any ASGI application, not just with Datasette.

To make it as usable as possible with Datasette I want to make it available as a plugin.

I’ve tried two different patterns for this in the past.

My first ASGI middleware was asgi-cors. I published that as two separate packages to PyPI: asgi-cors is the middleware itself, and datasette-cors is a very thin plugin wrapper around it that hooks into Datasette’s plugin configuration mechanism.

For datasette-auth-github I decided not to publish two packages. Instead I published a single plugin package and then described how to use it as standalone ASGI middleware in its documentation.

This lazier approach is confusing: it’s not at all clear that a package called datasette-auth-github can be used independently of Datasette. But I did get to avoid having to publish two packages.

datasette-configure-asgi

Since I want to do a lot more experiments with ASGI plugins in the future, I decided to try solving the ASGI configuration issue once and for all. I built a new experimental plugin, datasette-configure-asgi which can be used to configure ANY ASGI middleware that conforms to an expected protocol.

Here’s what that looks like at the configuration level, using a metadata.json settings file (which I should really rename since it’s more about configuration than metadata these days):

{
  "plugins": {
    "datasette-configure-asgi": [
      {
        "class": "asgi_log_to_sqlite.AsgiLogToSqlite",
        "args": {
          "file": "/tmp/log.db"
        }
      }
    ]
  }
}

The implementation of this plugin is very simple: here’s the entire thing:

from datasette import hookimpl
import importlib


@hookimpl
def asgi_wrapper(datasette):
    def wrap_with_classes(app):
        configs = datasette.plugin_config("datasette-configure-asgi") or []
        for config in configs:
            module_path, class_name = config["class"].rsplit(".", 1)
            mod = importlib.import_module(module_path)
            klass = getattr(mod, class_name)
            args = config.get("args") or {}
            app = klass(app, **args)
        return app

    return wrap_with_classes

It hooks into the asgi_wrapper plugin hook, reads its configuration from the datasette object (using plugin_config()), then loops through the list of configured plugins and dynamically loads each implementation using importlib. Then it wraps the ASGI app with each of them in turn.

Open questions

This is where I’ve got to with my experiments so far. Should you use this stuff in production? Almost certainly not! I wrote it on a plane just now. It definitely needs a bit more thought.

A couple of obvious open questions:

Python async functions shouldn’t make blocking calls, since doing so will block the entire event loop for everyone else.

Interacting with SQLite is a blocking call. Datasette works around this by running SQL queries in a thread pool; my logging plugin doesn’t bother with that.

Maybe it should? My hunch is that inserting into SQLite in this way is so fast it won’t actually cause any noticeable overhead. It would be nice to test that assumption thoroughly though.
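If it does turn out to matter, one option would be to push the write off the event loop and into a thread. A rough sketch, reusing the db object and record dictionary from the middleware above:

import asyncio
import functools

async def log_request(db, record):
    loop = asyncio.get_event_loop()
    # Run the blocking SQLite insert in the default thread pool executor,
    # so other requests aren't held up while the write completes
    await loop.run_in_executor(
        None, functools.partial(db["requests"].insert, record, alter=True)
    )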

Log rotation. This is an important detail for any well designed logging system, and I’ve punted on it entirely. Figuring out an elegant way to handle this with underlying SQLite databases files would be an interesting design challenge—relevant issue.

Would my SQLite logging middleware work with Django 3.0? I don’t see why not—the documentation covers how to wrap entire Django applications with ASGI middleware. I should try that out!
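Here’s roughly what that could look like, following the pattern from Django’s ASGI deployment documentation (the project name is a placeholder):

import os

from django.core.asgi import get_asgi_application
from asgi_log_to_sqlite import AsgiLogToSqlite

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

# Wrap the Django ASGI application just like any other ASGI app
application = AsgiLogToSqlite(get_asgi_application(), "/tmp/django_log.db")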

This week’s Niche Museums

These are technically my weeknotes, but logging experiments aside it’s been a quiet week for me.

I finally added paragraph breaks to Niche Museums (using datasette-render-markdown, implementation here). As a result my descriptions have been getting a whole lot longer. Added this week:

Better presentations through storytelling and STAR moments one month ago

Last week I completed GSBGEN 315: Strategic Communication at the Stanford Graduate School of Business.

The course has a stellar, well deserved reputation. It’s principally about public speaking, and I gained a huge amount from it despite having over fifteen years of experience speaking at conferences.

Some of the things that really stood out for me (partially in the form of catchy acronyms):

  • Every talk should start with an AIM: Audience, Intent, Message. Who are the audience for the talk? What do you intend to achieve by giving the presentation? With those two things in mind, you can construct the message—the actual content of the talk.
  • Try to include at least one STAR moment—Something They’ll Always Remember. This can be a gimmick, a repeated theme, a well-selected video or audio clip. Something to help the talk stand out.
  • The human brain is incredibly attuned to stories. If you can find an excuse to tell a story, no matter how thin that excuse is, take it.
  • Presentations are most interesting if they are structured with contrasts. These can be emotional high and low points, or content that illustrates what is compared to what could be. Sparklines are a tool that can be used to think about this structure.

That last point about stories is where things get really interesting. We reviewed the classic hero’s journey story structure... but with a twist.

When giving a talk, position your audience as the hero. They start in a position of comfort and safety. Your job is to call them to adventure—guide them towards a dangerous and unknown realm, encourage them to take on new challenges, learn new things and finish the adventure in a new, advanced state of mind.

You’re not the hero—you’re more the mentor who they meet along the way.

One of the course texts was Nancy Duarte’s Resonate, which explains this model of presenting in great detail. It’s a really clever and surprising way of thinking about a presentation.

My JSK backstory

The backstory is a core tradition of the JSK fellowship I’m participating in this year at Stanford. Each week, one of the 19 fellows tells the story of their career and how they came to journalism.

Last Wednesday was my turn. The timing couldn’t have been more fortunate, as I got to apply the lessons I’d learned from Strategic Communications in putting together my presentation.

I think it was one of the best pieces of public speaking I’d ever done. Backstories include details that aren’t necessarily intended for a public audience so I won’t be sharing much of it here, but mindfully constructing an emotional sparkline and seeking out STAR moments worked out really well for me.

Since GSBGEN 315 is only available to Stanford GSB students, I’ll throw in a strong recommendation for reading Resonate as an alternative if this has sparked your interest.

Also this week

Preparing my backstory took up much of my time this week. I ended up losing my streaks against both email checking and Datasette contributions, but I’m hoping to pick those back up again now that the presentation is out of the way.

I posted the following museums to Niche Museums—one of which, the Centennial Light, we got to see on Saturday:

I’m getting concerned about how many not-quite-finished Datasette features I have outstanding now (I started exploring another one just the other day). I’m going to try to resist the temptation to pick up any more until I’ve shipped at least some of the 47 currently open feature tickets.

Elsewhere

18th January 2020

  • Ray Bandar's Bone Palace (via) The hundredth museum I’ve posted to Niche Museums, and this one is my absolute favourite. Ray Bandar spent sixty years collecting over 7,000 animal skulls and putting them on display in his house and basement. Natalie managed to score us a tour a few weeks before the entire collection was donated to the California Academy of Sciences. It was the most amazing room I’ve ever been in. #
  • Code is made of pain, lies, and bad ideas, all so we can pretend that electrified sand will do what we tell it to

    Yoz Grahame #

17th January 2020

  • Your own hosted blog, the easy, free, open way (even if you're not a computer expert) (via) Jeremy Howard and the fast.ai team have released fast_template—a GitHub repository designed to be used as a template to create new repositories with a complete Jekyll blog configured for use with GitHub Pages. GitHub’s official documentation recommends you install Ruby on your machine to do this, but Jeremy points out that with the right repository setup you can run a blog entirely by editing files through the GitHub web interface. #

16th January 2020

  • How to do Zero Downtime Deployments of Docker Containers. I’m determined to get reliable zero-downtime deploys working for a new project, because I know from experience that even a few seconds of downtime during a deploy changes the project mentality from “deploy any time you want” to “don’t deploy too often”. I’m using Docker containers behind Traefik, which means new containers should have traffic automatically balanced to them by Traefik based on their labels. After much fiddling around the pattern described by this article worked best for me: it lets me start a new container, then stop the old one and have Traefik’s “retry” mechanism send any requests to the stopped container over to the new one instead. #
  • Demos, Prototypes, and MVPs (via) I really like how Jacob describes the difference between a demo and a prototype: a demo is externally facing and helps explain a concept to a customer; a prototype is internally facing and helps prove that something can be built. #

15th January 2020

  • There is enough wood; Green estimates that it takes about 13 minutes for 20 North American forests to collectively grow enough wood for a 20-story building

    David Roberts #

10th January 2020

  • Serving 100µs reads with 100% availability (via) Fascinating use-case for SQLite from Segment: they needed a massively replicated configuration database across all of their instances that process streaming data. They chose to make the configuration available as a ~50GB SQLite database file mirrored to every instance, meaning lookups against that data could complete in microseconds. Changes to the central MySQL configuration store are pulled every 2-3 seconds, resulting in a trade-off of consistency for availability which fits their use-case just fine. #
  • A visual introduction to machine learning. Beautiful interactive essay explaining how a decision tree machine learning model is constructed, and using that to illustrate the concept of overfitting. This is one of the best explanations of machine learning fundamentals I’ve seen anywhere. #
  • Snowpack (via) Really interesting new twist on build systems for JavaScript. Modern browsers (everything since IE11) support JavaScript modules, but actually working with them is tricky since so much of the JavaScript ecosystem expects you to be using a bundler like Webpack. Snowpack is a tool for converting npm dependencies into JavaScript modules which can then be loaded directly by the browser, taking advantage of HTTP/2 to efficiently load the resulting larger number of files. #
  • Portable Cloud Functions with the Python Functions Framework (via) The new functions-framework library on PyPI lets you run Google Cloud Functions written in Python in other environments—on your local developer machine or bundled in a Docker container for example. I have real trouble trusting serverless platforms that lock you into a single provider (AWS Lambda makes me very uncomfortable) so this is a breath of fresh air. #
  • Async Support - HTTPX (via) HTTPX is the new async-friendly HTTP library for Python spearheaded by Tom Christie. It works in both async and non-async mode with an API very similar to requests. The async support is particularly interesting—it’s a really clean API, and now that Jupyter supports top-level await you can run ’(await httpx.AsyncClient().get(url)).text’ directly in a cell and get back the response. Most excitingly the library lets you pass an ASGI app directly to the client and then perform requests against it—ideal for unit tests. #

8th January 2020

  • I’ve found, in my 20 years of running the site, that whenever you ban an ironic Nazi, suddenly they become actual Nazis

    Richard “Lowtax” Kyanka #

  • A Compiler Writing Journey (via) Warren Toomey has been writing a self-compiling compiler for a subset of C, and extensively documenting every step of the journey here on GitHub. The result is an extremely high quality free textbook on compiler construction. #

7th January 2020

  • Better Python Object Serialization. TIL about functools.singledispatch, a decorator which makes it easy to create Python functions with implementations that vary based on the type of their arguments and which can have additional implementations registered after the fact—great for things like custom JSON serialization. #

3rd January 2020

  • Come version 80, any cookie without a SameSite attribute will be treated as “Lax” by Chrome. This is really important to understand because put simply, it’ll very likely break a bunch of stuff. [...] The fix is easy, all it needs is for everyone responsible for maintaining any system that uses cookies that might be passed from an external origin to understand what’s going on. Can’t be that hard, right? Hello? Oh...

    Troy Hunt #

  • How the Digg team was acquihired. (via) Useful insight into how a talent acquisition can play out from Will Larson, who was an engineering leader at Digg when they negotiated their acquihire exit. #
  • What do you call the parts of a story? Or: why can’t journalists spell “lead”? (via) Carl M. Johnson’s analysis of what journalists call different elements of a story, useful for data modeling a CMS for a news organization. #

2nd January 2020

  • How we use "ship small" to rapidly build new features at GitHub (via) Useful insight into how GitHub develop new features. They make aggressive use of feature flags, shipping a rough skeleton of a new feature to production as early as possible and actively soliciting feedback from other employees as they iterate on the feature. They use static JSON mocks of APIs to unblock their frontend engineers and iterate on the necessary data structures while the real backend is being implemented. #

30th December 2019

  • Scaling React Server-Side Rendering (via) Outstanding, detailed essay from 2017 on challenges and solutions for scaling React server-side rendering at Kijiji, Canada’s largest classified site (owned by eBay). There’s a lot of great stuff in here, including a detailed discussion of different approaches to load balancing, load shedding, component caching, client-side rendering fallbacks and more. #
  • Guide To Using Reverse Image Search For Investigations (via) Detailed guide from Bellingcat’s Aric Toler on using reverse image search for investigative reporting. Surprisingly Google Image Search isn’t the state of the art: Russian search engine Yandex offers a much more powerful solution, mainly because it’s the largest public-facing image search engine to integrate scary levels of face recognition. #
  • Machine Learning on Mobile and at the Edge: 2019 industry year-in-review (via) This is a fantastic detailed overview of advances made in the field of machine learning on the edge (primarily on mobile devices) over 2019. I’m really excited about this trend: I love the improved privacy implications of running models on my phone without uploading data to a server, and it’s great to see techniques like Federated Learning (from Google Labs) which enable devices to privately train models in a distributed way without having to upload their training data. #