sqlite-comprehend: run AWS entity extraction against content in a SQLite database
I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.
I created this as a complement to my s3-ocr tool, which uses the AWS Textract service to perform OCR against every PDF file in an S3 bucket.
Short version: given a database table full of text, run the following:
```
% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
[###---------------------------------]   9%  00:01:02
```
This will churn through every piece of text in the body column of the blog_entry table in the myblog.db SQLite database, strip any HTML tags (the --strip-tags option), submit the text to AWS Comprehend, and store the extracted entities in the following tables:
- comprehend_entities—the extracted entities, classified by type
- blog_entry_comprehend_entities—a table relating entities to the entries that they appear in
- comprehend_entity_types—a small lookup table of entity types
The above table names link to a live demo produced by running the tool against all of the content in my blog.
Here are 225 mentions that Comprehend classified as the organization called “Mozilla”.
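The way those three tables relate to each other can be sketched with a small mock in Python's built-in sqlite3 module. The column names here are illustrative assumptions, not necessarily the exact schema sqlite-comprehend produces:

```python
import sqlite3

# In-memory mock of the three-table layout described above.
# Column names are illustrative; the real tool's schema may differ.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE comprehend_entity_types (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE comprehend_entities (
    id INTEGER PRIMARY KEY, name TEXT,
    type INTEGER REFERENCES comprehend_entity_types(id)
);
CREATE TABLE blog_entry_comprehend_entities (
    blog_entry INTEGER, entity INTEGER REFERENCES comprehend_entities(id)
);
""")
db.execute("INSERT INTO comprehend_entity_types VALUES (1, 'ORGANIZATION')")
db.execute("INSERT INTO comprehend_entities VALUES (1, 'Mozilla', 1)")
db.execute("INSERT INTO blog_entry_comprehend_entities VALUES (42, 1)")

# Which entries mention a given entity, and what type is it?
rows = db.execute("""
    SELECT bece.blog_entry, e.name, t.value
    FROM blog_entry_comprehend_entities bece
    JOIN comprehend_entities e ON e.id = bece.entity
    JOIN comprehend_entity_types t ON t.id = e.type
    WHERE e.name = 'Mozilla'
""").fetchall()
print(rows)  # [(42, 'Mozilla', 'ORGANIZATION')]
```

Queries like the "225 mentions of Mozilla" demo above are joins of exactly this shape.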
The tool tracks which rows have been processed already (in the blog_entry_comprehend_entities_done table), so you can run it multiple times and it will only process newly added rows.
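The incremental pattern is simple enough to sketch: keep a table of already-processed primary keys, then select only the rows that are not in it. This is a minimal illustration of the idea, not sqlite-comprehend's actual implementation:

```python
import sqlite3

# Sketch of the incremental pattern: a *_done table records processed
# primary keys, so re-runs only pick up rows not seen before.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE blog_entry (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE blog_entry_comprehend_entities_done (id INTEGER PRIMARY KEY);
""")
db.executemany("INSERT INTO blog_entry VALUES (?, ?)",
               [(1, "first"), (2, "second"), (3, "third")])
# Pretend row 1 was handled on a previous run
db.execute("INSERT INTO blog_entry_comprehend_entities_done VALUES (1)")

todo = db.execute("""
    SELECT id FROM blog_entry
    WHERE id NOT IN (SELECT id FROM blog_entry_comprehend_entities_done)
""").fetchall()
print([r[0] for r in todo])  # [2, 3]
```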
AWS Comprehend's entity recognition pricing starts at $0.0001 per unit of 100 characters.
sqlite-comprehend only submits the first 5,000 characters of each row.
How the demo works
My live demo for this tool uses a new Datasette instance at datasette.simonwillison.net. It hosts a complete copy of the data from my blog—data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.
The demo runs out of my simonwillisonblog-backup GitHub repository.
A couple of years ago I realized that I’m no longer happy having any content I care about not stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a git clone.
So I created simonwillisonblog-backup using a couple of my other tools: db-to-sqlite, which converts a PostgreSQL database to a SQLite database, and sqlite-diffable, which dumps out a SQLite database as a “diffable” directory of newline-delimited JSON files.
Here’s the simplest version of that pattern:
```
db-to-sqlite \
  'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
  simonwillisonblog.db --all
```
This connects to PostgreSQL, loops through all of the database tables and converts them all to SQLite tables stored in simonwillisonblog.db.
```
sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all
```
This converts that SQLite database into a directory of JSON files. Each table gets two files: table.metadata.json, containing the table’s name, columns and schema, and table.ndjson, containing a newline-separated list of JSON arrays representing every row in that table.
You can see these files for my blog’s database in the simonwillisonblog folder.
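The row format is easy to reproduce by hand. Here's a minimal sketch of the dump/load round trip using a hypothetical blog_tag table (sqlite-diffable itself handles the metadata file and more, this only shows the ndjson half):

```python
import json
import sqlite3

# Sketch of the sqlite-diffable idea: each table row becomes one line
# of JSON (an array of column values), which diffs cleanly in Git.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blog_tag (id INTEGER PRIMARY KEY, tag TEXT)")
db.executemany("INSERT INTO blog_tag VALUES (?, ?)",
               [(1, "python"), (2, "sqlite")])

# Dump: one JSON array per row, newline-delimited
ndjson = "\n".join(
    json.dumps(list(row))
    for row in db.execute("SELECT id, tag FROM blog_tag ORDER BY id")
)
print(ndjson)
# [1, "python"]
# [2, "sqlite"]

# Load: parse each line back into a row
rows = [json.loads(line) for line in ndjson.splitlines()]
assert rows == [[1, "python"], [2, "sqlite"]]
```

Because every row lives on its own line, editing or deleting a single record shows up as a one-line diff in the Git history.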
(My actual script is a little more complex, because I back up only selected tables and then run extra code to redact some of the fields.)
Since I set this up it has captured over 600 changes I’ve applied to my blog’s database, all made through the regular Django admin interface.
This morning I extended the script to run sqlite-comprehend against my blog entries and deploy the resulting data using Datasette.
The concise version of the new script looks like this:
```
wget -q https://datasette.simonwillison.net/simonwillisonblog.db
```
This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.
```
sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace
```
This creates the simonwillisonblog.db database by loading in the JSON from the simonwillisonblog/ folder. I do it this way mainly to exercise the new sqlite-diffable load command I just added to that tool. The --replace option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing Comprehend entity extraction data intact).
```
sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags
```
This runs sqlite-comprehend against the blog entries that have not yet been processed.
```
set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2>/dev/null
set -e
```
This configures SQLite full-text search against each of those tables, using this pattern to suppress any errors that occur if the FTS tables already exist.
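The same "create FTS, ignore the error if it's already configured" idea can be sketched directly in Python with SQLite's FTS5 extension. This is a minimal illustration of the pattern, not what sqlite-utils does internally:

```python
import sqlite3

# Minimal FTS5 sketch: configure full-text search with the porter
# tokenizer, swallowing the error if the FTS table already exists
# (the Python equivalent of the set +e / 2>/dev/null dance above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blog_entry (title TEXT, body TEXT)")
db.execute("INSERT INTO blog_entry VALUES ('About AWS', 'Notes on Comprehend')")

for _ in range(2):  # second attempt is a no-op instead of a crash
    try:
        db.execute("""
            CREATE VIRTUAL TABLE blog_entry_fts USING fts5(
                title, body, content='blog_entry', tokenize='porter'
            )
        """)
        # Populate the index from the content table
        db.execute("INSERT INTO blog_entry_fts(blog_entry_fts) VALUES ('rebuild')")
    except sqlite3.OperationalError:
        pass  # already configured

hits = db.execute(
    "SELECT title FROM blog_entry_fts WHERE blog_entry_fts MATCH 'aws'"
).fetchall()
print(hits)  # [('About AWS',)]
```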
Setting up FTS in this way means I can use the datasette-search-all plugin to run searches like this one for aws across all of those tables at once.
```
datasette publish cloudrun simonwillisonblog.db \
  -m metadata.yml \
  --service simonwillisonblog \
  --install datasette-block-robots \
  --install datasette-graphql \
  --install datasette-search-all
```
This uses the datasette publish command to deploy the datasette.simonwillison.net site to Google Cloud Run.
I’m adding two more plugins here: datasette-block-robots to avoid search engine crawlers indexing a duplicate copy of my blog’s content, and datasette-graphql to enable GraphQL queries against my data.
Here’s an example GraphQL query that returns my most recent blog entries with a specific tag.
Releases this week
sqlite-comprehend: 0.2.1—(4 releases total)—2022-07-11
Tools for running data in a SQLite database through AWS Comprehend
sqlite-diffable: 0.4—(5 releases total)—2022-07-11
Tools for dumping/loading a SQLite database to diffable directory structure
datasette-redirect-to-https: 0.2—(2 releases total)—2022-07-04
Datasette plugin that redirects all non-https requests to https
datasette-unsafe-expose-env: 0.1.1—(2 releases total)—2022-07-03
Datasette plugin to expose some environment variables at /-/env for debugging
datasette-upload-csvs: 0.7.2—(10 releases total)—2022-07-03
Datasette plugin for uploading CSV files and converting them to database tables
datasette-packages: 0.2—(3 releases total)—2022-07-03
Show a list of currently installed Python packages
datasette-graphql: 2.1—(35 releases total)—2022-07-03
Datasette plugin providing an automatic GraphQL API for your SQLite databases
datasette-edit-schema: 0.5—(9 releases total)—2022-07-01
Datasette plugin for modifying table schemas