sqlite-history: tracking changes to SQLite tables using triggers (also weeknotes)
15th April 2023
In between blogging about ChatGPT rhetoric, micro-benchmarking with ChatGPT Code Interpreter and Why prompt injection is an even bigger problem now I managed to ship the beginnings of a new project: sqlite-history.
Recording changes made to a database table is a problem that has popped up consistently throughout my entire career. I’ve managed to mostly avoid it in Datasette so far because it mainly dealt with read-only data, but the new JSON write API has made me reconsider: if people are going to build mutable databases on top of Datasette, having a way to track those changes becomes a whole lot more desirable.
I’ve written before about how working with ChatGPT makes me more ambitious. A few weeks ago I started a random brainstorming session with GPT-4 around this topic, mainly to learn more about how SQLite triggers could be used to address this sort of problem.
Here’s the resulting transcript. It turns out ChatGPT makes for a really useful brainstorming partner.
Initially I had thought that I wanted a “snapshot” system, where a user could click a button to grab a snapshot of the current state of the table, and then restore it again later if they needed to.
I quickly realized that a system for full change tracking would be easier to build, and provide more value to users.
sqlite-history 0.1 is the first usable version of this system. It’s still very early and should be treated as unstable software, but initial testing results have been very positive so far.
The key idea is that for each table that is tracked, a separate _tablename_history table is created. For example:
```sql
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER,
    weight REAL
);
```
Gets a history table like this:
```sql
CREATE TABLE _people_history (
    _rowid INTEGER,
    id INTEGER,
    name TEXT,
    age INTEGER,
    weight REAL,
    _version INTEGER,
    _updated INTEGER,
    _mask INTEGER
);
CREATE INDEX idx_people_history_rowid ON _people_history (_rowid);
```
As you can see, the history table includes the columns from the original table, plus four extra integer columns for tracking different things:

- _rowid corresponds to the SQLite rowid of the parent table, which is automatically and invisibly created for all SQLite tables. This is how history records map back to their corresponding row.
- _version is an incrementing version number for each individual tracked row.
- _updated records a millisecond-precision timestamp for when the row was updated (see this TIL).
- _mask is an integer bitmap recording which columns in the row were updated in a specific change.
The _mask column is particularly important to this design.
The simplest way to implement history is to create a full copy of the previous state of a row every time it is updated.
This has a major downside: if the rows include large amounts of content (a content_html column on a blog, for example) you end up storing a full copy of that data every time you make an edit, even if it was just a tweak to a headline.
I didn’t want to duplicate that much data.
An alternative approach is to store null for any column that didn’t change since the previous version. This saves on space, but introduces a new challenge: what if the user updated a column and set the new value to null? That change would be indistinguishable from no change at all.
My solution is to use the _mask column. Every column in the table gets a power-of-two number (1, 2, 4, 8 and so on), and _mask records the sum of the numbers for the columns that changed, as a bitmask. In this way, the _history row need only store values for columns that have changed, with an overhead of just four extra integer columns to record the metadata about that change.
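Here is a minimal Python sketch of the bitmask idea, using the column numbering from the example people table. The mask_for and columns_for helpers are hypothetical illustrations, not part of the sqlite-history library:

```python
# Each column gets a power-of-two value; a change's mask is the sum of
# the values for the columns that actually changed.
COLUMN_BITS = {"id": 1, "name": 2, "age": 4, "weight": 8}

def mask_for(changed_columns):
    # Encode a set of changed columns as a single integer bitmask
    return sum(COLUMN_BITS[c] for c in changed_columns)

def columns_for(mask):
    # Decode a bitmask back into the list of columns it covers
    return [c for c, bit in COLUMN_BITS.items() if mask & bit]

# An update that touched only name and weight:
m = mask_for(["name", "weight"])
assert m == 10
assert columns_for(m) == ["name", "weight"]

# A mask of 15 (used by the insert trigger) means every column was recorded:
assert columns_for(15) == ["id", "name", "age", "weight"]
```

Because the mask records which null values are real changes, a stored null with the column’s bit set means “this column was updated to null”, while a null without the bit means “this column did not change”.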
Populating this history table can now be handled entirely using SQLite triggers. Here they are:
```sql
CREATE TRIGGER people_insert_history
AFTER INSERT ON people
BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    VALUES (
        new.rowid, new.id, new.name, new.age, new.weight,
        1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        15
    );
END;

CREATE TRIGGER people_update_history
AFTER UPDATE ON people
FOR EACH ROW
BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    SELECT
        old.rowid,
        CASE WHEN old.id != new.id THEN new.id ELSE null END,
        CASE WHEN old.name != new.name THEN new.name ELSE null END,
        CASE WHEN old.age != new.age THEN new.age ELSE null END,
        CASE WHEN old.weight != new.weight THEN new.weight ELSE null END,
        (SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        (CASE WHEN old.id != new.id THEN 1 ELSE 0 END) +
        (CASE WHEN old.name != new.name THEN 2 ELSE 0 END) +
        (CASE WHEN old.age != new.age THEN 4 ELSE 0 END) +
        (CASE WHEN old.weight != new.weight THEN 8 ELSE 0 END)
    WHERE
        old.id != new.id
        OR old.name != new.name
        OR old.age != new.age
        OR old.weight != new.weight;
END;

CREATE TRIGGER people_delete_history
AFTER DELETE ON people
BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    VALUES (
        old.rowid, old.id, old.name, old.age, old.weight,
        (SELECT COALESCE(MAX(_version), 0) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        -1
    );
END;
```
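To see the triggers in action, here’s a quick sketch using only Python’s standard library sqlite3 module. It sets up the people table, its history table, and the insert and update triggers described here, then makes one edit and inspects the resulting history rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, weight REAL);
CREATE TABLE _people_history (
    _rowid INTEGER, id INTEGER, name TEXT, age INTEGER, weight REAL,
    _version INTEGER, _updated INTEGER, _mask INTEGER
);
CREATE TRIGGER people_insert_history AFTER INSERT ON people BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    VALUES (new.rowid, new.id, new.name, new.age, new.weight, 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer), 15);
END;
CREATE TRIGGER people_update_history AFTER UPDATE ON people FOR EACH ROW BEGIN
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    SELECT old.rowid,
        CASE WHEN old.id != new.id THEN new.id END,
        CASE WHEN old.name != new.name THEN new.name END,
        CASE WHEN old.age != new.age THEN new.age END,
        CASE WHEN old.weight != new.weight THEN new.weight END,
        (SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        (CASE WHEN old.id != new.id THEN 1 ELSE 0 END) +
        (CASE WHEN old.name != new.name THEN 2 ELSE 0 END) +
        (CASE WHEN old.age != new.age THEN 4 ELSE 0 END) +
        (CASE WHEN old.weight != new.weight THEN 8 ELSE 0 END)
    WHERE old.id != new.id OR old.name != new.name
        OR old.age != new.age OR old.weight != new.weight;
END;
""")

conn.execute("INSERT INTO people (id, name, age, weight) VALUES (1, 'Cleo', 5, 51.1)")
conn.execute("UPDATE people SET age = 6 WHERE id = 1")

rows = conn.execute(
    "SELECT _version, id, name, age, weight, _mask FROM _people_history ORDER BY _version"
).fetchall()
# Version 1 is a full copy of the row (mask 15); version 2 stores only the
# changed age column (mask 4), with nulls for everything that stayed the same.
assert rows == [(1, 1, "Cleo", 5, 51.1, 15), (2, None, None, 6, None, 4)]
```

Note that an UPDATE that changes nothing writes no history row at all, thanks to the WHERE clause on the update trigger.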
There are a couple of extra details here.

The insert trigger records a full copy of the row when it is first inserted, with a version number of 1.

The update trigger is the most complicated. It uses case statements to populate only the columns that changed, then a set of case statements at the end to add together the integers for the _mask bitmask column.

The delete trigger records the record that has just been deleted, and sets the _mask column to -1 as a way of marking it as a deletion. That idea was suggested by GPT-4!
Writing these triggers out by hand would be pretty arduous... so the sqlite-history repository contains a Python library and CLI tool that can create those triggers automatically, either for specific tables:
```
python -m sqlite_history data.db table1 table2 table3
```
Or for all tables at once (excluding things like FTS tables):
```
python -m sqlite_history data.db --all
```
There are still a bunch of problems I want to solve. Open issues right now are:
- Functions for restoring tables or individual rows. Recording history is a lot more interesting if you can easily restore from it! GPT-4 wrote a recursive CTE for this but I haven’t fully verified that it does the right thing yet.
- Try saving space by not creating a full duplicate history row until the first edit. Currently the insert trigger instantly creates a duplicate of the full row, doubling the amount of storage space needed. I’m contemplating a change where that first record would contain just null values, and then the first time a row was updated a record would be created containing the full original copy.
- Document how to handle alter table. Originally I had thought that altering a table would by necessity invalidate the history recorded so far, but I’ve realized that the _mask mechanism might actually be compatible with a subset of alterations: anything that adds a new column to the end of an existing table could work OK, since that column would get a new, incrementally larger mask value without disrupting previous records.
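The restore problem above can be illustrated with a small Python sketch: start from the full copy recorded at version 1, then replay each later history record, using the mask to distinguish real updates from “did not change” nulls. The reconstruct function here is a hypothetical illustration of the replay logic, not the library’s recursive-CTE implementation:

```python
# Column-to-bit mapping for the example people table
COLUMNS = [("id", 1), ("name", 2), ("age", 4), ("weight", 8)]

def reconstruct(history_rows, up_to_version):
    """Rebuild a row's state at a given version from its history records.

    history_rows: list of (version, id, name, age, weight, mask) tuples.
    Returns a dict of column values, or None if the row was deleted.
    """
    state = {}
    for version, *values, mask in sorted(history_rows):
        if version > up_to_version:
            break
        if mask == -1:  # the delete marker
            return None
        for (column, bit), value in zip(COLUMNS, values):
            if mask & bit:  # only apply columns whose bit is set
                state[column] = value
    return state

history = [
    (1, 1, "Cleo", 5, 51.1, 15),  # initial insert: full copy, all bits set
    (2, None, None, 6, None, 4),  # later update: only age changed
]
assert reconstruct(history, 1) == {"id": 1, "name": "Cleo", "age": 5, "weight": 51.1}
assert reconstruct(history, 2) == {"id": 1, "name": "Cleo", "age": 6, "weight": 51.1}
```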
I’m also thinking about building a Datasette plugin on top of this library, to make it really easy to start tracking history of tables in an existing Datasette application.
Entries this week
- Prompt injection: what’s the worst that can happen?
- Running Python micro-benchmarks using the ChatGPT Code Interpreter alpha
- Thoughts on AI safety in this era of increasingly powerful open source LLMs
- The Changelog podcast: LLMs break the internet
- Path to Citus Con: Working in public
- We need to tell people ChatGPT will lie to them, not debate linguistics
Museums this week
Releases this week
- Run async workflows using pytest-fixtures-style dependency injection
- Create a SQLite database containing your checkin history from Foursquare Swarm
- Track changes to SQLite tables using triggers
TIL this week
- Running Dolly 2.0 on Paperspace—2023-04-12
- Creating desktop backgrounds using Midjourney—2023-04-10
- Unix timestamp in milliseconds in SQLite—2023-04-09
- Saving an in-memory SQLite database to a file in Python—2023-04-09
- GPT-4 for API design research—2023-04-06