Simon Willison’s Weblog

Logging to SQLite using ASGI middleware

I had some fun playing around with ASGI middleware and logging during our flight back to England for the holidays.

asgi-log-to-sqlite

I decided to experiment with SQLite as a logging mechanism. I wouldn’t use this on a high traffic site, but most of my Datasette-related projects are small enough that logging HTTP traffic directly to a SQLite database feels like it should work reasonably well.

Once your logs are in a SQLite database, you can use Datasette to analyze them. I think this could be a lot of fun.

asgi-log-to-sqlite is my first exploration of this idea. It’s a piece of ASGI middleware which wraps an ASGI application and then logs relevant information from the request and response to an attached SQLite database.

You use it like this:

from asgi_log_to_sqlite import AsgiLogToSqlite
from my_asgi_app import app

app = AsgiLogToSqlite(app, "/tmp/log.db")

Here’s a demo Datasette instance showing logs from my testing: asgi-log-demo-j7hipcg4aq-uc.a.run.app

As always with Datasette, the data is at its most interesting once you apply some facets.

Intercepting requests to and from the wrapped ASGI app

There are a couple of interesting parts of the implementation. The first is how the information is gathered from the request and response.

This is a classic pattern for ASGI middleware. The ASGI protocol has three key components: a scope dictionary describing the incoming request, and two async functions called receive and send which are used to retrieve data from and send data to the connected client (usually a browser).
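To make those three components concrete, here’s a minimal ASGI application written from scratch—an illustrative sketch, not code from the middleware itself:

```python
# A minimal ASGI application. It receives the scope dictionary plus
# the receive and send async functions, and responds by sending two
# messages: a response start (status and headers) and a response body.
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({
        "type": "http.response.body",
        "body": b"Hello, world",
    })
```

Those "http.response.start" and "http.response.body" messages are exactly what the middleware below intercepts.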

Most middleware works by wrapping those functions with custom replacements. That’s what I’m doing here:

class AsgiLogToSqlite:
    def __init__(self, app, file):
        self.app = app
        self.db = sqlite_utils.Database(file)
    # ...
    async def __call__(self, scope, receive, send):
        response_headers = []
        body_size = 0
        http_status = None

        async def wrapped_send(message):
            nonlocal body_size, response_headers, http_status
            if message["type"] == "http.response.start":
                response_headers = message["headers"]
                http_status = message["status"]

            if message["type"] == "http.response.body":
                body_size += len(message["body"])

            await send(message)

        start = time.time()
        await self.app(scope, receive, wrapped_send)
        end = time.time()

My wrapped_send() function replaces the original send() function with one that pulls out some of the data I want to log from the messages that are being sent to the client.

I record a start time, then await the original ASGI application, then record an end time when it finishes.

Logging to SQLite using sqlite-utils

I’m using my sqlite-utils library to implement the logging. My first version looked like this:

db["requests"].insert({
    "path": scope.get("path"),
    "response_headers": str(response_headers),
    "body_size": body_size,
    "http_status": http_status,
    "scope": str(scope),
}, alter=True)

sqlite-utils automatically creates a table with the correct schema the first time you try to insert a record into it. This makes it ideal for rapid prototyping. In this case I captured stringified versions of various data structures so I could look at them in my browser with Datasette.

The alter=True argument here means that if I attempt to insert a new shape of record into an existing table, any missing columns will be added automatically as well. Again, handy for prototyping.

Based on the above, I evolved the code into recording the values I wanted to see in my logs—the full URL path, the User-Agent, the HTTP referrer, the IP and so on.

This resulted in a LOT of duplicative data. Values like the path, user-agent and HTTP referrer are the same across many different requests.

Regular plain text logs can solve this with gzip compression, but you can’t gzip a SQLite database and still expect it to work.

Since we are logging to a relational database, we can solve for duplicate values using normalization. We can extract out those lengthy strings into separate lookup tables—that way we can store mostly integer foreign key references in the requests table itself.

After a few iterations, my database code ended up looking like this:

with db.conn:  # Use a transaction
    db["requests"].insert(
        {
            "start": start,
            "method": scope["method"],
            "path": lookup(db, "paths", path),
            "query_string": lookup(db, "query_strings", query_string),
            "user_agent": lookup(db, "user_agents", user_agent),
            "referer": lookup(db, "referers", referer),
            "accept_language": lookup(db, "accept_languages", accept_language),
            "http_status": http_status,
            "content_type": lookup(db, "content_types", content_type),
            "client_ip": scope.get("client", (None, None))[0],
            "duration": end - start,
            "body_size": body_size,
        },
        alter=True,
        foreign_keys=self.lookup_columns,
    )


def lookup(db, table, value):
    return db[table].lookup({
        "name": value
    }) if value else None

The table.lookup() method in sqlite-utils is designed for exactly this use-case. If you pass it a value (or multiple values) it will ensure the underlying table has those columns with a unique index on them, then get-or-insert your data and return you the primary key.

Automatically creating tables is fine for an initial prototype, but it starts getting a little messy once you have foreign key relationships that you need to be able to rely on. I moved to explicit table creation in an ensure_tables() method that’s called once when the middleware class is used to wrap the underlying ASGI app:

    lookup_columns = (
        "path",
        "user_agent",
        "referer",
        "accept_language",
        "content_type",
        "query_string",
    )

    def ensure_tables(self):
        for column in self.lookup_columns:
            table = "{}s".format(column)
            if not self.db[table].exists:
                self.db[table].create({
                    "id": int,
                    "name": str
                }, pk="id")
        if not self.db["requests"].exists:
            self.db["requests"].create({
                "start": float,
                "method": str,
                "path": int,
                "query_string": int,
                "user_agent": int,
                "referer": int,
                "accept_language": int,
                "http_status": int,
                "content_type": int,
                "client_ip": str,
                "duration": float,
                "body_size": int,
            }, foreign_keys=self.lookup_columns)

I’m increasingly using this pattern in my sqlite-utils projects. It’s not a full-grown migrations system but it’s a pretty low-effort way of creating tables correctly provided they don’t yet exist.

Here’s the full implementation of the middleware.

Configuring the middleware for use with Datasette

Publishing standalone ASGI middleware for this kind of thing is neat because it can be used with any ASGI application, not just with Datasette.

To make it as usable as possible with Datasette, I wanted to make it available as a plugin.

I’ve tried two different patterns for this in the past.

My first ASGI middleware was asgi-cors. I published that as two separate packages to PyPI: asgi-cors is the middleware itself, and datasette-cors is a very thin plugin wrapper around it that hooks into Datasette’s plugin configuration mechanism.

For datasette-auth-github I decided not to publish two packages. Instead I published a single plugin package and then described how to use it as standalone ASGI middleware in its documentation.

This lazier approach is confusing: it’s not at all clear that a package called datasette-auth-github can be used independently of Datasette. But I did get to avoid having to publish two packages.

datasette-configure-asgi

Since I want to do a lot more experiments with ASGI plugins in the future, I decided to try solving the ASGI configuration issue once and for all. I built a new experimental plugin, datasette-configure-asgi which can be used to configure ANY ASGI middleware that conforms to an expected protocol.

Here’s what that looks like at the configuration level, using a metadata.json settings file (which I should really rename since it’s more about configuration than metadata these days):

{
  "plugins": {
    "datasette-configure-asgi": [
      {
        "class": "asgi_log_to_sqlite.AsgiLogToSqlite",
        "args": {
          "file": "/tmp/log.db"
        }
      }
    ]
  }
}

The implementation of this plugin is very simple: here’s the entire thing:

from datasette import hookimpl
import importlib


@hookimpl
def asgi_wrapper(datasette):
    def wrap_with_classes(app):
        configs = datasette.plugin_config("datasette-configure-asgi") or []
        for config in configs:
            module_path, class_name = config["class"].rsplit(".", 1)
            mod = importlib.import_module(module_path)
            klass = getattr(mod, class_name)
            args = config.get("args") or {}
            app = klass(app, **args)
        return app

    return wrap_with_classes

It hooks into the asgi_wrapper plugin hook, reads its configuration from the datasette object (using plugin_config()), then loops through the list of configured plugins and dynamically loads each implementation using importlib. Then it wraps the ASGI app with each of them in turn.
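The dynamic loading trick is worth pulling out on its own. Here’s a sketch of that pattern as a standalone helper—load_class() is a hypothetical name I’m using for illustration, not part of the plugin:

```python
import importlib

def load_class(dotted_path):
    # Split "package.module.ClassName" into module path and class name,
    # import the module, then fetch the class attribute from it
    module_path, class_name = dotted_path.rsplit(".", 1)
    mod = importlib.import_module(module_path)
    return getattr(mod, class_name)
```

Called with a string like "asgi_log_to_sqlite.AsgiLogToSqlite", this returns the middleware class itself, ready to be instantiated with the wrapped app and any configured arguments.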

Open questions

This is where I’ve got to with my experiments so far. Should you use this stuff in production? Almost certainly not! I wrote it on a plane just now. It definitely needs a bit more thought.

A couple of obvious open questions:

Python async functions shouldn’t make blocking calls, since doing so will block the entire event loop for everyone else.

Interacting with SQLite is a blocking call. Datasette works around this by running SQL queries in a thread pool; my logging plugin doesn’t bother with that.

Maybe it should? My hunch is that inserting into SQLite in this way is so fast it won’t actually cause any noticeable overhead. It would be nice to test that assumption thoroughly though.
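If it turns out to matter, one option would be to push the blocking insert onto a thread pool with run_in_executor(). This is a hedged sketch of that approach using the standard library’s sqlite3 module—it is not how the middleware currently works, and write_log() and log_request() are hypothetical names:

```python
import asyncio
import sqlite3

def write_log(db_path, record):
    # All the blocking SQLite work happens here, off the event loop
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS requests (path TEXT, status INTEGER)"
        )
        conn.execute(
            "INSERT INTO requests VALUES (?, ?)",
            (record["path"], record["status"]),
        )
    conn.close()

async def log_request(db_path, record):
    loop = asyncio.get_running_loop()
    # None selects the default ThreadPoolExecutor
    await loop.run_in_executor(None, write_log, db_path, record)
```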

Log rotation. This is an important detail for any well designed logging system, and I’ve punted on it entirely. Figuring out an elegant way to handle this with underlying SQLite database files would be an interesting design challenge—relevant issue.

Would my SQLite logging middleware work with Django 3.0? I don’t see why not—the documentation covers how to wrap entire Django applications with ASGI middleware. I should try that out!

This week’s Niche Museums

These are technically my weeknotes, but logging experiments aside it’s been a quiet week for me.

I finally added paragraph breaks to Niche Museums (using datasette-render-markdown, implementation here). As a result my descriptions have been getting a whole lot longer. Added this week:

This is Logging to SQLite using ASGI middleware by Simon Willison, posted on 16th December 2019.

