sqlite-comprehend: run AWS entity extraction against content in a SQLite database

11th July 2022

I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.

I created this as a complement to my s3-ocr tool, which uses AWS Textract service to perform OCR against every PDF file in an S3 bucket.

Short version: given a database table full of text, run the following:

% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
  [###---------------------------------]    9%  00:01:02

This will churn through every piece of text in the body column of the blog_entry table in the myblog.db SQLite database, strip any HTML tags (the --strip-tags option), submit it to AWS Comprehend, and store the extracted entities in the following tables:

comprehend_entities—the extracted entities, classified by type
blog_entry_comprehend_entities—a table relating entities to the entries that they appear in
comprehend_entity_types—a small lookup table of entity types

The above table names link to a live demo produced by running the tool against all of the content in my blog.

Here are 225 mentions that Comprehend classified as the organization called “Mozilla”.

The tool tracks which rows have been processed already (in the blog_entry_comprehend_entities_done table), so you can run it multiple times and it will only process newly added rows.

AWS Comprehend pricing starts at $0.0001 per hundred characters. sqlite-comprehend only submits the first 5,000 characters of each row.

How the demo works

My live demo for this tool uses a new Datasette instance at datasette.simonwillison.net. It hosts a complete copy of the data from my blog—data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.

The demo runs out of my simonwillisonblog-backup GitHub repository.

A couple of years ago I realized that I’m no longer happy having any content I care about not stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a git pull.

So I created simonwillisonblog-backup using a couple of my other tools: db-to-sqlite, which converts a PostgreSQL database to a SQLite database, and sqlite-diffable, which dumps out a SQLite database as a “diffable” directory of newline-delimited JSON files.

Here’s the simplest version of that pattern:

db-to-sqlite \
    'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
    simonwillisonblog.db --all

This connects to PostgreSQL, loops through all of the database tables and converts them all to SQLite tables stored in simonwillisonblog.db.

sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all

This converts that SQLite database into a directory of JSON files. Each table gets two files: table.metadata.json containing the table’s name, columns and schema and table.ndjson containing a newline-separated list of JSON arrays representing every row in that table.

You can see these files for my blog’s database in the simonwillisonblog folder.

(My actual script is a little more complex, because I backup only selected tables and then run extra code to redact some of the fields.)

Since I set this up it’s captured over 600 changes I’ve applied to my blog’s database, all made the regular Django admin interface.

This morning I extended the script to run sqlite-comprehend against my blog entries and deploy the resulting data using Datasette.

The concise version of the new script looks like this:

wget -q https://datasette.simonwillison.net/simonwillisonblog.db

This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.

sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace

This creates the simonwillisonblog.db database by loading in the JSON from the simonwillisonblog/ folder. I do it this way mainly to exercise the new sqlite-diffable load command I just added to that tool.

The --replace option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing comprehend entity extraction data intact).

sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags

This runs sqlite-comprehend against the blog entries that have not yet been processed.

set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2>/dev/null
set -e

This configures SQLite full-text search against each of those tables, using this pattern to supress any errors that occur if the FTS tables already exist.

Setting up FTS in this way means I can use the datasette-search-all plugin to run searches like this one for aws across all of those tables at once.

datasette publish cloudrun simonwillisonblog.db \
-m metadata.yml \
--service simonwillisonblog \
--install datasette-block-robots \
--install datasette-graphql \
--install datasette-search-all

This uses the using datasette publish command to deploy the datasette.simonwillison.net site to Google Cloud Run.

I’m adding two more plugins here: datasette-block-robots to avoid search engine crawlers indexing a duplicate copy of my blog’s content, and datasette-graphql to enable GraphQL queries against my data.

Here’s an example GraphQL query that returns my most recent blog entries that are tagged with datasette.

Releases this week

sqlite-comprehend: 0.2.1—(4 releases total)—2022-07-11
Tools for running data in a SQLite database through AWS Comprehend
sqlite-diffable: 0.4—(5 releases total)—2022-07-11
Tools for dumping/loading a SQLite database to diffable directory structure
datasette-redirect-to-https: 0.2—(2 releases total)—2022-07-04
Datasette plugin that redirects all non-https requests to https
datasette-unsafe-expose-env: 0.1.1—(2 releases total)—2022-07-03
Datasette plugin to expose some environment variables at /-/env for debugging
datasette-expose-env: 0.1—2022-07-03
Datasette plugin to expose selected environment variables at /-/env for debugging
datasette-upload-csvs: 0.7.2—(10 releases total)—2022-07-03
Datasette plugin for uploading CSV files and converting them to database tables
datasette-packages: 0.2—(3 releases total)—2022-07-03
Show a list of currently installed Python packages
datasette-graphql: 2.1—(35 releases total)—2022-07-03
Datasette plugin providing an automatic GraphQL API for your SQLite databases
datasette-edit-schema: 0.5—(9 releases total)—2022-07-01
Datasette plugin for modifying table schemas

TIL this week

Posted 11th July 2022 at 9:31 pm · Follow me on Mastodon or Twitter or subscribe to my newsletter

Simon Willison’s Weblog