sqlite-history: tracking changes to SQLite tables using triggers (also weeknotes)
15th April 2023
In between blogging about ChatGPT rhetoric, micro-benchmarking with ChatGPT Code Interpreter and Why prompt injection is an even bigger problem now I managed to ship the beginnings of a new project: sqlite-history.
Recording changes made to a database table is a problem that has popped up consistently throughout my entire career. I’ve managed to mostly avoid it in Datasette so far because it mainly dealt with read-only data, but the new JSON write API has made me reconsider: if people are going to build mutable databases on top of Datasette, having a way to track those changes becomes a whole lot more desirable.
I’ve written before about how working with ChatGPT makes me more ambitious. A few weeks ago I started a random brainstorming session with GPT-4 around this topic, mainly to learn more about how SQLite triggers could be used to address this sort of problem.
Here’s the resulting transcript. It turns out ChatGPT makes for a really useful brainstorming partner.
Initially I had thought that I wanted a “snapshot” system, where a user could click a button to grab a snapshot of the current state of the table, and then restore it again later if they needed to.
I quickly realized that a system for full change tracking would be easier to build, and provide more value to users.
sqlite-history 0.1 is the first usable version of this system. It’s still very early and should be treated as unstable software, but initial testing results have been very positive so far.
The key idea is that for each table that is tracked, a separate _tablename_history table is created. For example:
```sql
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER,
    weight REAL
);
```
Gets a history table like this:
```sql
CREATE TABLE _people_history (
    _rowid INTEGER,
    id INTEGER,
    name TEXT,
    age INTEGER,
    weight REAL,
    _version INTEGER,
    _updated INTEGER,
    _mask INTEGER
);
CREATE INDEX idx_people_history_rowid ON _people_history (_rowid);
```
As you can see, the history table includes the columns from the original table, plus four extra integer columns for tracking different things:

- _rowid corresponds to the SQLite rowid of the parent table, which is automatically and invisibly created for all SQLite tables. This is how history records map back to their corresponding row.
- _version is an incrementing version number for each individual tracked row.
- _updated records a millisecond-precision timestamp for when the row was updated (see this TIL).
- _mask is an integer bitmap recording which columns in the row were updated in a specific change.
The _mask column is particularly important to this design.
The simplest way to implement history is to create a full copy of the previous state of a row every time it is updated.
This has a major downside: if the rows include large amounts of content (a content_html column on a blog, for example) you end up storing a full copy of that data every time you make an edit, even if it was just a tweak to a headline.
I didn’t want to duplicate that much data.
An alternative approach is to store null for any column that didn’t change since the previous version. This saves on space, but introduces a new challenge: what if the user updated a column and set the new value to null? That change would be indistinguishable from no change at all.
My solution is to use the _mask column. Every column in the table gets a power-of-two number (1, 2, 4, 8 and so on), and _mask records the sum of the numbers for the columns that changed, as a bitmask. In this way, the _history row need only store values for columns that have changed, with an overhead of just four extra integer columns to record the metadata about that change.
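Here is a minimal Python sketch of the bitmask idea, using the column numbering from the example people table. The mask_for and columns_for helpers are hypothetical illustrations, not part of the sqlite-history library:

```python
# Each column gets a power-of-two value; a change's mask is the sum of
# the values for the columns that actually changed.
COLUMN_BITS = {"id": 1, "name": 2, "age": 4, "weight": 8}

def mask_for(changed_columns):
    # Encode a set of changed columns as a single integer bitmask
    return sum(COLUMN_BITS[c] for c in changed_columns)

def columns_for(mask):
    # Decode a bitmask back into the list of columns it covers
    return [c for c, bit in COLUMN_BITS.items() if mask & bit]

# An update that touched only name and weight:
m = mask_for(["name", "weight"])
assert m == 10
assert columns_for(m) == ["name", "weight"]

# A mask of 15 (used by the insert trigger) means every column was recorded:
assert columns_for(15) == ["id", "name", "age", "weight"]
```

Because the mask records which null values are real changes, a stored null with the column’s bit set means “this column was updated to null”, while a null without the bit means “this column did not change”.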
Populating this history table can now be handled entirely using SQLite triggers. Here they are:
```sql
CREATE TRIGGER people_insert_history
AFTER INSERT ON people
BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    VALUES (
        new.rowid, new.id, new.name, new.age, new.weight,
        1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        15
    );
END;

CREATE TRIGGER people_update_history
AFTER UPDATE ON people
FOR EACH ROW
BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    SELECT
        old.rowid,
        CASE WHEN old.id != new.id THEN new.id ELSE null END,
        CASE WHEN old.name != new.name THEN new.name ELSE null END,
        CASE WHEN old.age != new.age THEN new.age ELSE null END,
        CASE WHEN old.weight != new.weight THEN new.weight ELSE null END,
        (SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        (CASE WHEN old.id != new.id THEN 1 ELSE 0 END) +
        (CASE WHEN old.name != new.name THEN 2 ELSE 0 END) +
        (CASE WHEN old.age != new.age THEN 4 ELSE 0 END) +
        (CASE WHEN old.weight != new.weight THEN 8 ELSE 0 END)
    WHERE
        old.id != new.id
        OR old.name != new.name
        OR old.age != new.age
        OR old.weight != new.weight;
END;

CREATE TRIGGER people_delete_history
AFTER DELETE ON people
BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    VALUES (
        old.rowid, old.id, old.name, old.age, old.weight,
        (SELECT COALESCE(MAX(_version), 0) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        -1
    );
END;
```
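To see the triggers in action, here’s a quick sketch using only Python’s standard library sqlite3 module. It sets up the people table, its history table, and the insert and update triggers described here, then makes one edit and inspects the resulting history rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, weight REAL);
CREATE TABLE _people_history (
    _rowid INTEGER, id INTEGER, name TEXT, age INTEGER, weight REAL,
    _version INTEGER, _updated INTEGER, _mask INTEGER
);
CREATE TRIGGER people_insert_history AFTER INSERT ON people BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    VALUES (new.rowid, new.id, new.name, new.age, new.weight, 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer), 15);
END;
CREATE TRIGGER people_update_history AFTER UPDATE ON people FOR EACH ROW BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    SELECT old.rowid,
        CASE WHEN old.id != new.id THEN new.id END,
        CASE WHEN old.name != new.name THEN new.name END,
        CASE WHEN old.age != new.age THEN new.age END,
        CASE WHEN old.weight != new.weight THEN new.weight END,
        (SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        (CASE WHEN old.id != new.id THEN 1 ELSE 0 END) +
        (CASE WHEN old.name != new.name THEN 2 ELSE 0 END) +
        (CASE WHEN old.age != new.age THEN 4 ELSE 0 END) +
        (CASE WHEN old.weight != new.weight THEN 8 ELSE 0 END)
    WHERE old.id != new.id OR old.name != new.name
        OR old.age != new.age OR old.weight != new.weight;
END;
""")

conn.execute("INSERT INTO people (id, name, age, weight) VALUES (1, 'Cleo', 5, 51.1)")
conn.execute("UPDATE people SET age = 6 WHERE id = 1")

rows = conn.execute(
    "SELECT _version, id, name, age, weight, _mask FROM _people_history ORDER BY _version"
).fetchall()
# Version 1 is a full copy of the row (mask 15); version 2 stores only the
# changed age column (mask 4), with nulls for everything that stayed the same.
assert rows == [(1, 1, "Cleo", 5, 51.1, 15), (2, None, None, 6, None, 4)]
```

Note that an UPDATE that changes nothing writes no history row at all, thanks to the WHERE clause on the update trigger.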
There are a couple of extra details here.

The insert trigger records a full copy of the row when it is first inserted, with a version number of 1.

The update trigger is the most complicated. It uses case statements to populate only the columns that changed, then a set of case statements at the end to add together the integers for the _mask bitmask column.

The delete trigger records the record that has just been deleted, and sets the _mask column to -1 as a way of marking it as a deletion. That idea was suggested by GPT-4!
Writing these triggers out by hand would be pretty arduous... so the sqlite-history repository contains a Python library and CLI tool that can create those triggers automatically, either for specific tables:
```
python -m sqlite_history data.db table1 table2 table3
```
Or for all tables at once (excluding things like FTS tables):
```
python -m sqlite_history data.db --all
```
There are still a bunch of problems I want to solve. Open issues right now are:
- Functions for restoring tables or individual rows. Recording history is a lot more interesting if you can easily restore from it! GPT-4 wrote a recursive CTE for this but I haven’t fully verified that it does the right thing yet.
- Try saving space by not creating a full duplicate history row until the first edit. Currently the insert trigger instantly creates a duplicate of the full row, doubling the amount of storage space needed. I’m contemplating a change where that first record would contain just null values, and then the first time a row was updated a record would be created containing the full original copy.
- Document how to handle alter table. Originally I had thought that altering a table would by necessity invalidate the history recorded so far, but I’ve realized that the _mask mechanism might actually be compatible with a subset of alterations: anything that adds a new column to the end of an existing table could work OK, since that column would get a new, incrementally larger mask value without disrupting previous records.
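The restore problem above can be illustrated with a small Python sketch: start from the full copy recorded at version 1, then replay each later history record, using the mask to distinguish real updates from “did not change” nulls. The reconstruct function here is a hypothetical illustration of the replay logic, not the library’s recursive-CTE implementation:

```python
# Column-to-bit mapping for the example people table
COLUMNS = [("id", 1), ("name", 2), ("age", 4), ("weight", 8)]

def reconstruct(history_rows, up_to_version):
    """Rebuild a row's state at a given version from its history records.

    history_rows: list of (version, id, name, age, weight, mask) tuples.
    Returns a dict of column values, or None if the row was deleted.
    """
    state = {}
    for version, *values, mask in sorted(history_rows):
        if version > up_to_version:
            break
        if mask == -1:  # the delete marker
            return None
        for (column, bit), value in zip(COLUMNS, values):
            if mask & bit:  # only apply columns whose bit is set
                state[column] = value
    return state

history = [
    (1, 1, "Cleo", 5, 51.1, 15),  # initial insert: full copy, all bits set
    (2, None, None, 6, None, 4),  # later update: only age changed
]
assert reconstruct(history, 1) == {"id": 1, "name": "Cleo", "age": 5, "weight": 51.1}
assert reconstruct(history, 2) == {"id": 1, "name": "Cleo", "age": 6, "weight": 51.1}
```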
I’m also thinking about building a Datasette plugin on top of this library, to make it really easy to start tracking history of tables in an existing Datasette application.
Entries this week
- Prompt injection: what’s the worst that can happen?
- Running Python micro-benchmarks using the ChatGPT Code Interpreter alpha
- Thoughts on AI safety in this era of increasingly powerful open source LLMs
- The Changelog podcast: LLMs break the internet
- Path to Citus Con: Working in public
- We need to tell people ChatGPT will lie to them, not debate linguistics
Museums this week
Releases this week
- Run async workflows using pytest-fixtures-style dependency injection
- Create a SQLite database containing your checkin history from Foursquare Swarm
- Track changes to SQLite tables using triggers
TIL this week
- Running Dolly 2.0 on Paperspace—2023-04-12
- Creating desktop backgrounds using Midjourney—2023-04-10
- Unix timestamp in milliseconds in SQLite—2023-04-09
- Saving an in-memory SQLite database to a file in Python—2023-04-09
- GPT-4 for API design research—2023-04-06