Recent entries
Litestream backups for Datasette Cloud (and weeknotes) 25 minutes ago
My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.
Datasette Cloud is a SaaS service for Datasette. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using this form.
I’m building Datasette Cloud on Fly, specifically on Fly Machines.
Security is a big concern for Datasette Cloud. Teams should only be able to access their own data—bugs that could let users accidentally (or maliciously) access data belonging to another team need to be protected against as much as possible.
To help guarantee that, I’ve designed Datasette Cloud so that each team gets their own, dedicated instance, running in a Firecracker VM managed by Fly. Their data lives in a dedicated volume.
Fly volumes already implement snapshot backups, but I’m interested in defence in depth. This is where Litestream comes in (coincidentally now part of Fly, although it wasn’t when I first selected it as my backup strategy).
I’m using Litestream to constantly back up the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, so I can recover a previous version of the data within a configurable retention window.
Keeping backups isolated
Litestream works by writing a constant stream of pages from SQLite’s WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read from and write to S3.
This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.
Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.
Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.
I’ve learned an absolute ton about S3 and AWS permissions building my s3-credentials tool for creating credentials for accessing S3.
One of the tricks I’ve learned is that it’s possible to create temporary, time-limited credentials that only work for a prefix (effectively a folder) within an S3 bucket.
This means I can run Litestream with credentials that are specific to the team—that can read and write only from the team-ID/ prefix in the S3 bucket I am using to store the backups.
Obtaining temporary credentials
My s3-credentials tool can create credentials for a prefix within an S3 bucket like this:
s3-credentials create my-bucket-for-backups \
--duration 12h \
--prefix team-56/
This command uses the sts.assume_role() AWS method to create credentials that allow access to that bucket, attaching this generated JSON policy to it in order to restrict access to the provided prefix.
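For illustration, here’s roughly what that looks like in Python with boto3. This is a simplified sketch, not the exact code s3-credentials runs: the role ARN is a placeholder, and the real generated policy is more detailed than the one shown here.

import json

import boto3

sts = boto3.client("sts")

# Simplified policy limiting access to a single team prefix
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::my-bucket-for-backups/team-56/*"],
        }
    ],
}

response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/s3-prefix-role",  # placeholder ARN
    RoleSessionName="team-56-backups",
    Policy=json.dumps(policy),
    DurationSeconds=12 * 60 * 60,
)
credentials = response["Credentials"]
# credentials now includes AccessKeyId, SecretAccessKey and SessionToken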
I extracted the relevant Python code from s3-credentials and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.
With the endpoint in place, my code for launching a team container can do this:
- Create the volume and machine for that team (if they do not yet exist)
- Generate a signed secret token that the machine container can exchange for its S3 credentials
- Launch the machine container, passing it the secret token
- On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created
- Those credentials are used to populate the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables used by Litestream (a sketch of this startup step follows the list)
- Start Litestream, which then starts Datasette
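Here’s a minimal sketch of that startup step (referenced in the list above). The endpoint URL, environment variable name and Litestream invocation are illustrative assumptions, not the actual Datasette Cloud code.

import os
import subprocess

import requests

# Assumed private endpoint on the management server
CREDENTIALS_ENDPOINT = "https://management.example.com/s3-credentials"

# Exchange the long-lived signed secret token for 12 hour S3 credentials
response = requests.post(
    CREDENTIALS_ENDPOINT,
    headers={"Authorization": "Bearer " + os.environ["MACHINE_SECRET_TOKEN"]},
)
response.raise_for_status()
creds = response.json()

env = dict(
    os.environ,
    AWS_ACCESS_KEY_ID=creds["AccessKeyId"],
    AWS_SECRET_ACCESS_KEY=creds["SecretAccessKey"],
    AWS_SESSION_TOKEN=creds["SessionToken"],
)

# Litestream replicates the SQLite database and is configured to start Datasette
subprocess.run(["litestream", "replicate"], env=env, check=True)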
Restarting every 12 hours
You may be wondering why I bothered with that initial secret token—why not just pass the temporary AWS credentials to the container when I launch it?
The reason for this is that I need to be able to obtain fresh credentials every 12 hours.
A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.
All you need to do is call sys.exit(0) in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.
So far I’m mainly using this to avoid the cost of running containers when they aren’t actually in use. But there’s a neat benefit when it comes to Litestream too.
I’m using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.
After considering a few ways of doing this, I selected the simplest to implement: have Datasette call sys.exit(0) after ten hours, and let Fly restart the container, causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.
I implemented this by adding it as a new setting to my existing datasette-scale-to-zero plugin. You can now configure that with "max-age": "10h" and it will shut down Datasette once the server has been running for that long.
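The underlying pattern is very simple. This isn’t the plugin’s actual implementation, just a sketch of the idea as an ASGI wrapper around Datasette:

import sys
import time

START_TIME = time.monotonic()
MAX_AGE_SECONDS = 10 * 60 * 60  # the "10h" setting

def asgi_wrapper(app):
    # Wrap the ASGI application; once the process has been up for longer
    # than the configured max-age, exit cleanly after serving the request
    async def wrapped(scope, receive, send):
        await app(scope, receive, send)
        if time.monotonic() - START_TIME > MAX_AGE_SECONDS:
            sys.exit(0)

    return wrapped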
Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists secrets, exposed to the container as environment variables, between restarts, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.
I only just put the new backup system in place, so I’m exercising it a bit before I open things up to trial users—but so far it’s looking like a very robust solution to the problem.
s3-ocr improvements
I released a few new versions of s3-ocr this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.
The two main improvements are:
- A new --dry-run option to s3-ocr start which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. #22
- s3-ocr start used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. #21
The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!
Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use @cli.command as a decorator instead of @cli.command(). This meant that installing s3-ocr in an environment that already had a previous version of Click would result in silent errors.
The solution is simple: pin to click>=8.1.0 in the project dependencies if you plan to use this new syntax.
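To illustrate the difference (using a hypothetical cli group): the bare decorator form only works on Click 8.1.0 or later, which is why the pin matters.

import click

@click.group()
def cli():
    "Example command group"

@cli.command()  # parenthesized form: works on older Click releases too
def old_style():
    click.echo("hello")

@cli.command  # bare decorator: silently misbehaves before Click 8.1.0
def new_style():
    click.echo("hello")

# In setup.py, the matching dependency pin:
# install_requires=["click>=8.1.0"]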
If I’d read the Click changelog more closely I would have saved myself a whole lot of time.
Issues #25 and #26 detail the many false turns I took trying to figure this out.
More fun with GPT-3 and DALL-E
This tweet scored over a million impressions on Twitter:
New hobby: prototyping video games in 60 seconds using a combination of GPT-3 and DALL-E
Here’s “Raccoon Heist” pic.twitter.com/xQ3Vm8p2XW
— Simon Willison (@simonw) August 5, 2022
As this got retweeted outside of my usual circles it started confusing people who thought the “prototype” was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn’t kidding when I said I spent 60 seconds on this.
I also figured out how to use GPT-3 to write jq one-liners. I love jq but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: Using GPT-3 to figure out jq recipes
Releases this week
- s3-ocr: 0.6.3—(9 releases total)—2022-08-10
  Tools for running OCR against files stored in S3
- datasette-scale-to-zero: 0.2—(4 releases total)—2022-08-05
  Quit Datasette if it has not received traffic for a specified time period
- shot-scraper: 0.14.3—(18 releases total)—2022-08-02
  A command-line utility for taking automated screenshots of websites
- s3-credentials: 0.12.1—(13 releases total)—2022-08-01
  A tool for creating credentials for accessing S3 buckets
- datasette-sqlite-fts4: 0.3.2—(2 releases total)—2022-07-31
  Datasette plugin that adds custom SQL functions for working with SQLite FTS4
TIL this week
Weeknotes: Joining the board of the Python Software Foundation 11 days ago
A few weeks ago I was elected to the board of directors for the Python Software Foundation.
I put myself up for election partly because I’ve found myself saying “I wish the PSF would help with ...” a few times over the years, and realized that joining the board could be a more useful way to actively participate, rather than shouting from the sidelines.
I was quite surprised to win. I wrote up a short manifesto—you can see that here—but the voting system lets voters select as many candidates as they like, so it’s possible I got in more on broad name recognition among the voters than based on what I wrote. I don’t think there’s a way to tell one way or the other.
I had my first board meeting on Wednesday, where I formally joined the board and got to vote on my first resolutions. This is my first time as a board member for a non-profit and I have learned a bunch already, with a lot more to go!
Board terms last three years. I expect it will take me at least a few months to get fully up to speed on how everything works.
As a board member, my primary responsibilities are to show up to the meetings, vote on resolutions, act as an ambassador for the PSF to the Python community and beyond, and help both set the direction for the PSF and ensure that it meets its goals and holds true to its values.
I’m embarrassed to admit that I wrote my election manifesto without a deep understanding of how the PSF operates and how much it is possible for it to get done. Here’s the section I wrote about my goals should I be elected:
I believe there are problems facing the Python community that require dedicated resources beyond volunteer labour. I’d like the PSF to invest funding in the following areas in particular:
- Improve Python onboarding. In coaching new developers I’ve found that the initial steps to getting started with a Python development environment can be a difficult hurdle to cross. I’d like to help direct PSF resources to tackling this problem, with a goal of making the experience of starting to learn Python as smooth as possible, no matter what platform the learner is using.
- Make Python a great platform for distributing software. In building my own application, Datasette, in Python I’ve seen how difficult it can be to package up a Python application so that it can be installed by end-users, who aren’t ready to install Python and learn pip in order to try out a new piece of software. I’ve researched solutions for this for my own software using Homebrew, Docker, an Electron app and WASM/Pyodide. I’d like the PSF to invest in initiatives and documentation to make this as easy as possible, so that one of the reasons to build with Python is that distributing an application to end-users is already a solved problem.
I still think these are good ideas, and I hope to make progress on them during my term as a director—but I’m not going to start arguing for new initiatives until I’ve learned the ropes and fully understood the PSF’s abilities, current challenges and existing goals.
In figuring out how the board works, one of the most useful pages I stumbled across was this list of resolutions voted on by the board, dating back to 2001. There are over 1,600 of them! Browsing through them gave me a much better idea of the kind of things the board has the authority to do.
Scraping data into Datasette Lite
Because everything looks like a nail when you have a good hammer, I explored the board resolutions by loading them into Datasette. I tried a new trick this time: I scraped data from that page into a CSV file, then loaded up that CSV file in Datasette Lite via a GitHub Gist.
My scraper isn’t perfect—it misses about 150 resolutions because they don’t exactly fit the format it expects, but it was good enough for a proof of concept. I wrote that in a Jupyter Notebook which you can see here.
Here’s the CSV in a Gist. The great thing about Gists is that GitHub serve those files with the access-control-allow-origin: * HTTP header, which means you can load them cross-domain.
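You can verify that header from Python; a quick check with the requests library (the Gist raw URL here is illustrative):

import requests

# Any "raw" Gist URL will do - this one is a placeholder
url = "https://gist.githubusercontent.com/simonw/example/raw/psf-resolutions.csv"

response = requests.get(url)
print(response.headers.get("access-control-allow-origin"))
# Expected output: *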
Here’s what you get if you paste the URL to that CSV into Datasette Lite (using this new feature I added last month):
And here’s a SQL query that shows the sum total dollar amount from every resolution that mentions “Nigeria”:
with filtered as (
select * from
[psf-resolutions]
where
"dollars" is not null
and "text" like '%' || :search || '%'
)
select
'Total: $' || printf('%,d', sum(dollars)) as text,
null as date
from filtered
union all
select
text, date
from filtered;
I’m using a new-to-me trick here: I use a CTE to filter down to just the rows I am interested in, then create a first row that sums the dollar amounts as the text column and leaves the date column null, then union that against the rows from the query.
Important note: These numbers aren’t actually particularly useful. Just because the PSF board voted on a resolution does not mean that the money made it to the grantee—there are apparently situations where the approved grant may not be properly claimed and transferred. Also, my scraper logic isn’t perfect. Plus the PSF spends a whole lot of money in ways that don’t show up in these resolutions.
So this is a fun hack, and a neat way to start getting a general idea of how the PSF works, but any numbers it produces should not be taken as the absolute truth.
As a general pattern though, I really like this workflow of generating CSV files, saving them to a Gist and then opening them directly in Datasette Lite. It provides a way to use Datasette to share and explore data without needing to involve any server-side systems (other than GitHub Gists) at all!
Big-endian bugs in sqlite-fts4
sqlite-fts4 is a small Python library I wrote that adds SQLite functions for calculating relevance scoring for full-text search using the FTS4 module that comes bundled with SQLite. I described that project in detail in Exploring search relevance algorithms with SQLite.
It’s a dependency of sqlite-utils so it has a pretty big install base, despite being relatively obscure.
This week I had a fascinating bug report from Sarah Julia Kriesch: Test test_underlying_decode_matchinfo fails on PPC64 and s390x on openSUSE.
The s390x is an IBM mainframe architecture and it uses a big-endian byte order, unlike all of the machines I use, which are little-endian.
This is the first time I’ve encountered a big-endian vs. little-endian bug in my entire career! I was excited to dig in.
Here’s the relevant code:
def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("I" * (len(buf) // 4), buf)
SQLite FTS4 provides a matchinfo binary string which you need to decode in order to calculate the relevance score. This code uses the struct standard library module to unpack that binary string into a tuple of integers.
My initial attempt at fixing this turned out to be entirely incorrect.
I didn’t have a big-endian machine available for testing, and I assumed that the problem was caused by Python interpreting the bytes as the current architecture’s byte order. So I applied this fix:
return struct.unpack("<" + ("I" * (len(buf) // 4)), buf)
The < prefix there ensures that struct will always interpret the bytes as little-endian. I wrote up a TIL and shipped 1.0.2 with the fix.
Sarah promptly got back to me and reported some new failing tests.
It turns out my fix was entirely incorrect—in fact, I’d broken something that previously was working just fine.
The clue is in the SQLite documentation for matchinfo (which I really should have checked):
The matchinfo function returns a blob value. If it is used within a query that does not use the full-text index (a “query by rowid” or “linear scan”), then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order (emphasis mine).
Looking more closely at the original bug report, the test that failed was this one:
@pytest.mark.parametrize(
    "buf,expected",
    [
        (
            b"\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00",
            (1, 2, 2, 2),
        )
    ],
)
def test_underlying_decode_matchinfo(buf, expected):
    assert expected == decode_matchinfo(buf)
That test hard-codes a little-endian binary string and checks the output of my decode_matchinfo function. This is obviously going to fail on a big-endian system.
So my original behaviour was actually correct: I was parsing the string using the byte order of the architecture, and SQLite was providing the string in the byte order of the architecture. The only bug was in my test.
I reverted my previous fix and fixed the test instead:
@pytest.mark.parametrize(
    "buf,expected",
    [
        (
            b"\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00"
            if sys.byteorder == "little"
            else b"\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02",
            (1, 2, 2, 2),
        )
    ],
)
def test_underlying_decode_matchinfo(buf, expected):
    assert expected == decode_matchinfo(buf)
sys.byteorder reports the byte order of the host system, so this test now passes on both little-endian and big-endian systems.
There was one remaining challenge: how to test this? I wasn’t going to make the same mistake a second time by shipping a fix that hadn’t actually been exercised on the target architecture.
After quite a bit of research (mainly throwing the terms docker and s390x into the GitHub code search engine and seeing what I could find) I figured out a fix. It turns out you can use Docker and QEMU to run an emulated s390x system—both on a Mac laptop and in GitHub Actions.
Short version:
docker run --rm --privileged multiarch/qemu-user-static:register --reset
docker run -it multiarch/ubuntu-core:s390x-focal /bin/bash
For the longer version, check my TIL: Emulating a big-endian s390x with QEMU.
Releases this week
- sqlite-fts4: 1.0.3—(5 releases total)—2022-07-30
  Custom Python functions for working with SQLite FTS4
- shot-scraper: 0.14.2—(17 releases total)—2022-07-28
  A command-line utility for taking automated screenshots of websites
- datasette-sqlite-fts4: 0.3.1—2022-07-28
  Datasette plugin that adds custom SQL functions for working with SQLite FTS4
- datasette-publish-vercel: 0.14.1—(22 releases total)—2022-07-23
  Datasette plugin for publishing data using Vercel
- datasette-insert: 0.8—(8 releases total)—2022-07-22
TIL this week
Weeknotes: Datasette, sqlite-utils, Datasette Desktop 21 days ago
A flurry of releases this week, including a new Datasette alpha and a fixed Datasette Desktop.
datasette 0.62a1
Work on Datasette Cloud continues—the last 10% of the work needed for the beta launch is trending towards taking 90% of the time. It’s been driving all sorts of fixes to the wider Datasette ecosystem, which is nice.
I ran into a bug which would have been easier to investigate using Sentry. The datasette-sentry plugin wasn’t working right, and it turned out I needed a new handle_exception() plugin hook to fix it. This was the impetus I needed to push out a new Datasette alpha—I like to get new hooks into an alpha as quickly as possible so I can depend on that Datasette version from alpha releases of plugins.
Here are some other highlights from the alpha’s release notes:
- The render_cell() plugin hook is now also passed a row argument, representing the sqlite3.Row object that is being rendered. (#1300)
A neat thing about Pluggy is that you can add new arguments to existing plugin hooks without breaking plugins that already use them.
- New --nolock option for ignoring file locks when opening read-only databases. (#1744)
Since the very start of the project Datasette has suggested trying the following command to start exploring your Google Chrome history, which is stored using SQLite:
datasette ~/Library/Application\ Support/Google/Chrome/Default/History
I’m not sure when this changed, but I tried running the command recently and got the following error:
sqlite3.OperationalError: database is locked
Since Datasette opens databases in read-only mode I didn’t see why a lock like this should be respected. It turns out SQLite can be told to ignore locks like so:
sqlite3.connect("file:places.sqlite?mode=ro&nolock=1", uri=True)
So I added a --nolock option to Datasette which does exactly that:
datasette ~/Library/Application\ Support/Google/Chrome/Default/History --nolock
- Datasette now has a Discord community.
Inspired by 6 More Things I Learned Building Snowpack to 20,000 Stars (Part 2) by Fred K. Schott I finally set up a chat community for Datasette, using Discord.
It’s attracted 88 members already! You can join it here. I wrote detailed notes on how I configured it in this issue.
- Database file downloads now implement conditional GET using ETags. (#1739)
This is a change I made to support Datasette Lite—I noticed that the WASM version of Datasette was downloading a fresh database every time, so I added ETag support to encourage browsers to avoid a duplicate download and use a cached copy of the database file instead, provided it hasn’t changed.
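Here’s roughly how a client benefits from that, sketched with the requests library (the database URL is illustrative):

import requests

url = "https://latest.datasette.io/fixtures.db"  # illustrative database URL

# First download: the response includes an ETag header
first = requests.get(url)
etag = first.headers["ETag"]

# Later request: send If-None-Match; if the file is unchanged the server
# responds with 304 Not Modified and an empty body, so the cached copy is reused
second = requests.get(url, headers={"If-None-Match": etag})
print(second.status_code)  # 304 if the database has not changed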
Datasette Desktop
Datasette Desktop was hanging on launch. Paul Everitt figured out a fix, which it took me way too long to get around to applying.
I finally shipped that in Datasette Desktop 0.2.2, but I wanted to reduce the chances of this happening again as much as possible. Datasette Desktop’s Electron tests used the spectron test harness, but that’s marked as deprecated.
I’m a big fan of Playwright and I was optimistic to see that it has support for testing Electron apps. I figured out how to use that with Datasette Desktop and run the tests in GitHub Actions: I wrote up what I learned in a TIL, Testing Electron apps with Playwright and GitHub Actions.
sqlite-utils 3.28
Annotated release notes:
- New table.duplicate(new_name) method for creating a copy of a table with a matching schema and row contents. Thanks, David. (#449)
- New sqlite-utils duplicate data.db table_name new_name CLI command for duplicating tables. (#454)
davidleejy suggested the table.duplicate() method and contributed an implementation. This was the impetus for pushing out a fresh release. I added the CLI equivalent, sqlite-utils duplicate.
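The Python API version looks something like this (database and table names here are illustrative):

import sqlite_utils

db = sqlite_utils.Database("data.db")

# Create a copy of the table with the same schema and row contents
db["blog_entry"].duplicate("blog_entry_backup")

print(db["blog_entry_backup"].count)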
- sqlite_utils.utils.rows_from_file() is now a documented API. It can be used to read a sequence of dictionaries from a file-like object containing CSV, TSV, JSON or newline-delimited JSON. It can be passed an explicit format or can attempt to detect the format automatically. (#443)
- sqlite_utils.utils.TypeTracker is now a documented API for detecting the likely column types for a sequence of string rows, see Detecting column types using TypeTracker. (#445)
- sqlite_utils.utils.chunks() is now a documented API for splitting an iterator into chunks. (#451)
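Here’s a rough sketch of how rows_from_file() and TypeTracker fit together, reading a CSV file and then converting columns to the detected types (file and table names are illustrative):

from sqlite_utils import Database
from sqlite_utils.utils import TypeTracker, rows_from_file

db = Database("data.db")
tracker = TypeTracker()

with open("resolutions.csv", "rb") as fp:
    # rows_from_file() detects the format and yields dictionaries of strings
    rows, format_used = rows_from_file(fp)
    # tracker.wrap() observes the values as they stream past
    db["resolutions"].insert_all(tracker.wrap(rows))

# Convert the text columns to the types the tracker detected
db["resolutions"].transform(types=tracker.types)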
I have a policy that any time I need to use an undocumented method from sqlite-utils in some other project I file an issue to add that to the documented API surface. I had used rows_from_file() and TypeTracker in datasette-socrata.
This was inspired by my TIL Ignoring errors in a section of a Bash script—a trick I had to figure out because one of my scripts needed to add columns and enable FTS but only if those changes had not been previously applied.
In looking into that I spotted inconsistencies in the design of the sqlite-utils commands, so I fixed those as much as I could while still maintaining backwards compatibility with the 3.x releases.
Releases this week
- s3-ocr: 0.5—(5 releases total)—2022-07-19
  Tools for running OCR against files stored in S3
- datasette-graphql: 2.1.1—(36 releases total)—2022-07-18
  Datasette plugin providing an automatic GraphQL API for your SQLite databases
- datasette-sentry: 0.2a0—(3 releases total)—2022-07-18
  Datasette plugin for configuring Sentry
- datasette: 0.62a1—(112 releases total)—2022-07-18
  An open source multi-tool for exploring and publishing data
- sqlite-utils: 3.28—(102 releases total)—2022-07-15
  Python CLI utility and library for manipulating SQLite databases
- datasette-publish-vercel: 0.14—(21 releases total)—2022-07-13
  Datasette plugin for publishing data using Vercel
- datasette-app: 0.2.2—(4 releases total)—2022-07-13
  The Datasette macOS application
- datasette-app-support: 0.11.6—(19 releases total)—2022-07-12
  Part of https://github.com/simonw/datasette-app
TIL this week
sqlite-comprehend: run AWS entity extraction against content in a SQLite database one month ago
I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.
I created this as a complement to my s3-ocr tool, which uses AWS Textract service to perform OCR against every PDF file in an S3 bucket.
Short version: given a database table full of text, run the following:
% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
[###---------------------------------] 9% 00:01:02
This will churn through every piece of text in the body column of the blog_entry table in the myblog.db SQLite database, strip any HTML tags (the --strip-tags option), submit it to AWS Comprehend, and store the extracted entities in the following tables:
- comprehend_entities—the extracted entities, classified by type
- blog_entry_comprehend_entities—a table relating entities to the entries that they appear in
- comprehend_entity_types—a small lookup table of entity types
The above table names link to a live demo produced by running the tool against all of the content in my blog.
Here are 225 mentions that Comprehend classified as the organization called “Mozilla”.
The tool tracks which rows have been processed already (in the blog_entry_comprehend_entities_done table), so you can run it multiple times and it will only process newly added rows.
AWS Comprehend pricing starts at $0.0001 per hundred characters. sqlite-comprehend only submits the first 5,000 characters of each row.
How the demo works
My live demo for this tool uses a new Datasette instance at datasette.simonwillison.net. It hosts a complete copy of the data from my blog—data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.
The demo runs out of my simonwillisonblog-backup GitHub repository.
A couple of years ago I realized that I’m no longer happy having any content I care about not stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a git pull.
So I created simonwillisonblog-backup using a couple of my other tools: db-to-sqlite, which converts a PostgreSQL database to a SQLite database, and sqlite-diffable, which dumps out a SQLite database as a “diffable” directory of newline-delimited JSON files.
Here’s the simplest version of that pattern:
db-to-sqlite \
'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
simonwillisonblog.db --all
This connects to PostgreSQL, loops through all of the database tables and converts them all to SQLite tables stored in simonwillisonblog.db.
sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all
This converts that SQLite database into a directory of JSON files. Each table gets two files: table.metadata.json, containing the table’s name, columns and schema, and table.ndjson, containing a newline-separated list of JSON arrays representing every row in that table.
You can see these files for my blog’s database in the simonwillisonblog folder.
(My actual script is a little more complex, because I back up only selected tables and then run extra code to redact some of the fields.)
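To give a sense of that format, here’s a sketch of reading one table back into Python dictionaries. It assumes the metadata file exposes a "columns" key, matching the description above; the table name is just an example.

import json

# Load the column names for one table from its metadata file
with open("simonwillisonblog/blog_tag.metadata.json") as fp:
    columns = json.load(fp)["columns"]  # assumed key, per the description above

rows = []
with open("simonwillisonblog/blog_tag.ndjson") as fp:
    for line in fp:
        # Each line is a JSON array of values in column order
        rows.append(dict(zip(columns, json.loads(line))))

print(rows[:3])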
Since I set this up it’s captured over 600 changes I’ve applied to my blog’s database, all made through the regular Django admin interface.
This morning I extended the script to run sqlite-comprehend against my blog entries and deploy the resulting data using Datasette.
The concise version of the new script looks like this:
wget -q https://datasette.simonwillison.net/simonwillisonblog.db
This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.
sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace
This creates the simonwillisonblog.db database by loading in the JSON from the simonwillisonblog/ folder. I do it this way mainly to exercise the new sqlite-diffable load command I just added to that tool.
The --replace option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing Comprehend entity extraction data intact).
sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags
This runs sqlite-comprehend against the blog entries that have not yet been processed.
set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2>/dev/null
set -e
This configures SQLite full-text search against each of those tables, using this pattern to suppress any errors that occur if the FTS tables already exist.
Setting up FTS in this way means I can use the datasette-search-all plugin to run searches like this one for aws across all of those tables at once.
datasette publish cloudrun simonwillisonblog.db \
-m metadata.yml \
--service simonwillisonblog \
--install datasette-block-robots \
--install datasette-graphql \
--install datasette-search-all
This uses the datasette publish command to deploy the datasette.simonwillison.net site to Google Cloud Run.
I’m adding two more plugins here: datasette-block-robots to avoid search engine crawlers indexing a duplicate copy of my blog’s content, and datasette-graphql to enable GraphQL queries against my data.
Here’s an example GraphQL query that returns my most recent blog entries that are tagged with datasette.
Releases this week
- sqlite-comprehend: 0.2.1—(4 releases total)—2022-07-11
  Tools for running data in a SQLite database through AWS Comprehend
- sqlite-diffable: 0.4—(5 releases total)—2022-07-11
  Tools for dumping/loading a SQLite database to diffable directory structure
- datasette-redirect-to-https: 0.2—(2 releases total)—2022-07-04
  Datasette plugin that redirects all non-https requests to https
- datasette-unsafe-expose-env: 0.1.1—(2 releases total)—2022-07-03
  Datasette plugin to expose some environment variables at /-/env for debugging
- datasette-expose-env: 0.1—2022-07-03
  Datasette plugin to expose selected environment variables at /-/env for debugging
- datasette-upload-csvs: 0.7.2—(10 releases total)—2022-07-03
  Datasette plugin for uploading CSV files and converting them to database tables
- datasette-packages: 0.2—(3 releases total)—2022-07-03
  Show a list of currently installed Python packages
- datasette-graphql: 2.1—(35 releases total)—2022-07-03
  Datasette plugin providing an automatic GraphQL API for your SQLite databases
- datasette-edit-schema: 0.5—(9 releases total)—2022-07-01
  Datasette plugin for modifying table schemas
TIL this week
Using GPT-3 to explain how code works one month ago
One of my favourite uses for the GPT-3 AI language model is generating explanations of how code works. It’s shockingly effective at this: its training set clearly includes a vast amount of source code.
(I initially thought this was related to GitHub Copilot being built on GPT-3, but actually that’s built on a GPT-3 descendent called OpenAI Codex.)
Here are a few recent examples.
Explaining some Python
Ned Batchelder shared this brilliant Python code snippet on Twitter this morning:
import re

TOKEN_RX = r"""(?xm)
    (?P<string>     ".*?"|'.*?'            )|
    (?P<float>      \d*(\d\.|\.\d)\d*      )|
    (?P<int>        \d+                    )|
    (?P<id>         [_a-zA-Z][_a-zA-Z0-9]* )|
    (?P<punct>      [(){}:\[\]=.,+*/-]     )|
    (               \#.*$                  )|
    (               \s+                    )
    """

def tokens(text):
    for match in re.finditer(TOKEN_RX, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

TEXT = """
x = 123 + "hello #99"  # ignore me!
print(hello.bye[0] + 3.14, 'single')
"""

for kind, text in tokens(TEXT):
    print(f"{kind:7}: {text=}")
I pasted it into GPT-3 (here’s how to use the playground interface) and started asking questions.
This uses the feature of GPT-3 where you can get a response, then add more text and run it again to get further details.
Anything in bold is text that GPT-3 responded with. The rest are my prompts.
Explain this code:
import re

TOKEN_RX = r"""(?xm)
    (?P<string>     ".*?"|'.*?'            )|
    (?P<float>      \d*(\d\.|\.\d)\d*      )|
    (?P<int>        \d+                    )|
    (?P<id>         [_a-zA-Z][_a-zA-Z0-9]* )|
    (?P<punct>      [(){}:\[\]=.,+*/-]     )|
    (               \#.*$                  )|
    (               \s+                    )
    """

def tokens(text):
    for match in re.finditer(TOKEN_RX, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

TEXT = """
x = 123 + "hello #99"  # ignore me!
print(hello.bye[0] + 3.14, 'single')
"""

for kind, text in tokens(TEXT):
    print(f"{kind:7}: {text=}")
The code is defining a regular expression that will match different types of tokens in a string of text. The different groups defined in the regular expression will match different types of tokens, such as strings, floats, ints, ids, punctuation, and comments. The code will then iterate over the matches and yield the type of token and the text that was matched.
Explain the regular expression:
The regular expression is made up of a number of different groups, each of which match a different type of token. The first group, “string”, matches any text enclosed in double or single quotes. The second group, “float”, matches any text that looks like a floating point number. The third group, “int”, matches any text that looks like an integer. The fourth group, “id”, matches any text that looks like a variable name. The fifth group, “punct”, matches any text that looks like punctuation. The sixth group, “#.*$”, matches any text that looks like a comment. The seventh group, “\s+”, matches any text that looks like whitespace.
What does (?xm) do?
The (?xm) notation is a modifier that allows the regular expression to span multiple lines and to use comments.
Why does it use match.lastgroup?
The match.lastgroup variable contains the name of the last group that was matched by the regular expression. This is used to determine the type of token that was matched.
This is a really good, really clear explanation of what’s going on.
I already knew how that code works though. Let’s try something that I genuinely didn’t understand.
Explaining some JavaScript
Chris Haynes tweeted an ingenious demo showing how the new Web Component Declarative Shadow DOM <slot> mechanism (described here) can be used to create a page that streams content from a server, filling in different regions of the page as the content arrives—without using any JavaScript!
His demo is at enamel.pages.dev but only works for me in Chrome. For other browsers, he provides enamel.pages.dev/?js which loads a tiny polyfill.
The polyfill code lives in enamel-lite.js and I could not figure out how it works. So I asked GPT-3.
Explain this code:
class Enamel extends HTMLElement {
  attemptPolyfillDSD() {
    const dsd = this.querySelector('template[shadowroot]');
    if (dsd?.content) {
      const mode = dsd.getAttribute('shadowroot');
      this.attachShadow({ mode });
      this.shadowRoot.appendChild(dsd.content);
      dsd.remove();
      return true;
    }
    return false;
  }
  connectedCallback() {
    if (
      !HTMLTemplateElement.prototype.hasOwnProperty('shadowRoot') &&
      !this.attemptPolyfillDSD()
    ) {
      const _observer = new MutationObserver(() => {
        if (this.attemptPolyfillDSD()) {
          _observer.disconnect();
        }
      });
      _observer.observe(this, {
        childList: true,
      });
    }
  }
}
export default Enamel;
This code polyfills the Shadow DOM API for older browsers that do not support it.
OK, that sounds convincing, but it’s not enough information for me. So I added another prompt:
How does it do that?
The code uses a MutationObserver to watch for changes to the element’s children. When a new child is added, the code checks to see if that child is a <template> element with a ’shadowroot’ attribute. If so, the code creates a Shadow Root for the element and appends the template’s content to it. Finally, the code removes the template element from the DOM.
That is a really good explanation. After reading that, I went back to look at the code again and I now completely understand how it works.
Building some SQL
I wanted to build a SQL query that would return all of the rows in one table that were not represented in another table... but where the rows used a compound primary key covering two columns. I didn’t know the right syntax for this off the top of my head.
First, I needed to create a table with a compound primary key. Rather than look that up I asked GPT-3 directly:
SQLite create a pages table that has a compound primary key of type, id and a content column.
CREATE TABLE pages ( type TEXT, id TEXT, content TEXT, PRIMARY KEY (type, id) );
It seems to know about SQLite SQL syntax, which is handy. Turns out you create a compound primary key using PRIMARY KEY (col1, col2) at the end of the table definition.
I added a second line with my own CREATE TABLE, then asked it to build me a query:
CREATE TABLE pages_done ( type TEXT, id TEXT, PRIMARY KEY (type, id) );
SQL query returning rows in the pages table that do not have corresponding records in the pages_done table:
SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done)
This is useful! I thought that a WHERE ... NOT IN (SELECT pattern would be good here, but I wasn’t sure of the syntax for looking up two columns at once.
I’m pretty sure this could instead be solved with a join... so I asked that too:
That query but using a join:
SELECT p.type, p.id, p.content FROM pages p LEFT JOIN pages_done pd ON p.type = pd.type AND p.id = pd.id WHERE pd.type IS NULL;
But which is better? I posed two more questions:
Which of those two queries is more efficient and why?
The first query is more efficient because it uses a WHERE clause with a subquery. The second query is less efficient because it uses a LEFT JOIN.
Why is a subquery more efficient?
A subquery is more efficient because it only returns the rows that meet the conditions specified in the WHERE clause. A LEFT JOIN returns all rows from the left table, even if there is no match in the right table.
As with everything GPT-3, this answer sounds very convincing! But is this actually true? I would expect the SQL query optimizer to be smart enough to optimize the LEFT JOIN to the point where it would run efficiently. I think GPT-3’s answer to that question is actually very misleading.
And that’s not surprising, because GPT-3 doesn’t actually know anything about anything at all. It’s a huge pattern generator. You can’t trust anything it says, because all it does is group words together into convincing looking shapes based on text that it’s seen before.
Once again, I’m reminded that tools like GPT-3 should be classified in the “bicycles for the mind” category. You still have to know how to pedal!
They’re fantastic tools for thinking, but to actually use their output effectively requires VERY deep knowledge—both of the subject matter in question, and of the way that the AI tools themselves work.
Leading questions
Got this great tip from Riley Goodside:
When prompting like this, it’s important (and difficult) to avoid leading questions. If you ask which of two queries are more efficient and why, it’s extremely unlikely to say “It doesn’t matter,” no matter how true that is.
I hadn’t considered this at all. I’ll be keeping that in mind in my future experiments.
Bonus: how about mathematics?
I decided to try one more thing: could it explain mathematical equations?
GitHub recently added LaTeX rendering to their Markdown implementation. I grabbed a formula at random from this page of documentation:
Explain this:
$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$
This is the Cauchy-Schwarz inequality.
OK, so you know what it’s called. But how does it work?
I added “Explain the formula syntax step by step, using $$ wrapped around the formula steps:” (the $$ bit is explained in a moment):
$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is the Cauchy-Schwarz inequality.

$$\left( \sum_{k=1}^n a_k b_k \right)^2$$

This is the sum of all products of corresponding elements in the vectors a and b, squared.

$$\leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is saying that this sum is less than or equal to the product of the sums of the squares of each vector.
Here’s why I added that bit about $$ to the prompt: doing so caused it to produce its answer in valid GitHub Markdown syntax. I pasted that into a Gist and rendered it, and it came out like this:
Explaining configuration files
Here’s another prompt pattern I stumbled across that seems to work really well: “Explain every line of this config file:”. I tried it against the Lit example from the Vite 3.0 starter templates:
Explain every line of this config file:
import { defineConfig } from 'vite'

// https://vitejs.dev/config/
export default defineConfig({
  build: {
    lib: {
      entry: 'src/my-element.js',
      formats: ['es']
    },
    rollupOptions: {
      external: /^lit/
    }
  }
})
This config file is written in JavaScript, using the ViteJS library. The first line imports the “defineConfig” function from the ViteJS library. This function allows the user to define a configuration object for the build process.
The next line exports the “defineConfig” function as the default export. This means that when this file is imported, the “defineConfig” function will be the only thing that is imported.
The “defineConfig” function takes one argument, which is an object. This object contains the configuration for the build process. The “build” property contains the configuration for the build tool. The “lib” property contains the configuration for the library that will be built. The “entry” property contains the path to the entry file for the library. The “formats” property contains the formats that the library will be built in. In this case, the library will only be built in the “es” format.
The “rollupOptions” property contains the configuration for the Rollup tool. The “external” property contains a regular expression that matches the names of the dependencies that should be excluded from the build. In this case, all dependencies whose names start with “lit” will be excluded from the build.
s3-ocr: Extract text from PDF files stored in an S3 bucket one month ago
I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.
You can search through a demo of 697 pages of OCRd text at s3-ocr-demo.datasette.io/pages/pages.
Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It charges $1.50 per thousand pages processed.
Why I built this
My initial need for this is a collaboration I have running with the San Francisco Microscopy Society. They’ve been digitizing their archives—which stretch back to 1870!—and were looking for help turning the digital scans into something more useful.
The archives are full of hand-written and type-written notes, scanned and stored as PDFs.
I decided to wrap my work up as a tool because I’m sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.
Running Textract directly against large numbers of files is somewhat inconvenient (here’s my earlier TIL about it). s3-ocr is my attempt to make it easier.
Tutorial: How I built that demo
The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection on the Internet Archive:
- The unmasking of Robert-Houdin from 1908
- The practical magician and ventriloquist’s guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring & practising the art of ventriloquism from 1876
- Latest magic, being original conjuring tricks from 1918
I started by downloading PDFs of those three files.
Then I installed the two tools I needed:
pip install s3-ocr s3-credentials
I used my s3-credentials tool to create a new S3 bucket and credentials with the ability to write files to it, using the new --statement option (which I released today) to add textract permissions to the generated credentials:
s3-credentials create s3-ocr-demo --statement '{
"Effect": "Allow",
"Action": "textract:*",
"Resource": "*"
}' --create-bucket > ocr.json
(Note that you don’t need to use s3-credentials at all if you have AWS credentials configured on your machine with root access to your account—just leave off the -a ocr.json options in the following examples.)
s3-ocr-demo is now a bucket I can use for the demo. ocr.json contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.
I uploaded my three PDFs to the bucket:
s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
(I often use Transmit as a GUI for this kind of operation.)
Then I kicked off OCR jobs against every PDF file in the bucket:
% s3-ocr start s3-ocr-demo --all -a ocr.json
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
The --all option scans for any file with a .pdf extension. You can pass explicit file names instead if you just want to process one or two files at a time.
This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.
The job IDs can be used to inspect the progress of each task like so:
% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
"DocumentMetadata": {
"Pages": 244
},
"JobStatus": "SUCCEEDED",
"DetectDocumentTextModelVersion": "1.0"
}
Once the job completed, I could preview the text extracted from the PDF like so:
% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116
LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON & CHAMBERLAIN, 120 LIBERTY ST.
...
To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:
% s3-ocr index s3-ocr-demo pages.db -a ocr.json
Fetching job details [####################################] 100%
Populating pages table [####--------------------------------] 13% 00:00:34
I then published the resulting pages.db SQLite database using Datasette—you can explore it here.
How s3-ocr works
s3-ocr works by calling Amazon’s S3 and Textract APIs.
Textract only works against PDF files in asynchronous mode: you call an API endpoint to tell it “start running OCR against this PDF file in this S3 bucket”, then wait for it to finish—which can take several minutes.
It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket—I use that option in s3-ocr.
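Under the hood these are the two boto3 Textract calls involved. This is a simplified sketch (bucket, file and prefix names are illustrative, and s3-ocr does considerably more work around them):

import boto3

textract = boto3.client("textract")

# Kick off an asynchronous OCR job against a PDF already stored in S3,
# telling Textract to write its output back to our own bucket
response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {"Bucket": "s3-ocr-demo", "Name": "latestmagicbeing00hoff.pdf"}
    },
    OutputConfig={"S3Bucket": "s3-ocr-demo", "S3Prefix": "textract-output"},
)
job_id = response["JobId"]

# Later: poll for the job status; results are paginated once it succeeds
status = textract.get_document_text_detection(JobId=job_id)
print(status["JobStatus"])  # IN_PROGRESS, SUCCEEDED or FAILED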
A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.
I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called textract-output/JOB_ID/1-?—but there’s no indication as to which PDF file the results correspond to.
My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.
If you have a file called latestmagicbeing00hoff.pdf the start command will create a new file called latestmagicbeing00hoff.pdf.s3-ocr.json with the following content:
{
"job_id": "f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402",
"etag": "\"d79af487579dcbbef26c9b3be763eb5e-2\""
}
This associates the job ID with the PDF file. It also records the original ETag of the PDF file—this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.
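That future check could be as simple as comparing the recorded value against the object’s current ETag. A hypothetical sketch, not something s3-ocr does yet:

import json

import boto3

s3 = boto3.client("s3")
bucket = "s3-ocr-demo"
key = "latestmagicbeing00hoff.pdf"

# Load the .s3-ocr.json manifest written when the OCR job was started
manifest = json.loads(
    s3.get_object(Bucket=bucket, Key=key + ".s3-ocr.json")["Body"].read()
)

# Compare the recorded ETag against the object's current ETag
current_etag = s3.head_object(Bucket=bucket, Key=key)["ETag"]
if current_etag != manifest["etag"]:
    print("PDF has changed since OCR ran - re-submit it to Textract")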
The existence of these files lets me do two things:
- If you run s3-ocr start s3-ocr-demo --all it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the .s3-ocr.json file.
- When you later ask for the results of the OCR it can use these files to associate the PDF with the results.
Scattering .s3-ocr.json files all over the place feels a little messy, so I have an open issue considering moving them all to a s3-ocr/ prefix in the bucket instead.
Try it and let me know what you think
This is a brand new project, but I think it’s ready for other people to start trying it out.
I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopy Society archive and it seemed to work well!
If you try this out and it works (or it doesn’t work) please let me know via Twitter or GitHub.
A challenging example page
Here’s one of the more challenging pages I processed using Textract:
Here’s the result:
In. In J a ... the Joe 14 162 Volxv Lalpa spinosa, Eggt bud development. of 146 Farcomas spindle. cells in nested gowers 271 Fayigaga tridactylites, leaf glaur of ruce 33 staining & mounting Stiles 133 tilica films, a structure of Diatoins morehouse 38 thile new microscopic Broeck 22 / Smith reproduction in the huntroom tribe 6 Trakes, develop mouht succession of the porsion tango/229 Soirce President of the Roy: truc: Soo 285 forby, Presidents address 105 pongida, difficulties of classification 238 tage, american adjustable concentric 150 ttlese staining & mountring wood sections 133 Stodder, Frustulia Iasconica, havicula chomboides, & havi cula crassinervis 265 Vol XVI falicylic acid u movorcopy 160 falpar enctry ology of Brooke 9.97 Sanderson micros: characters If inflammation 43 tap, circulation of the 42 Jars, structure of the genus Brisinga 44 latter throvite connective substances 191- 241 Jehorey Cessification in birds, formation of ed blood corpuseles during the ossification process by
Releases this week
- s3-ocr: 0.4—(4 releases total)—2022-06-30
  Tools for running OCR against files stored in S3
- s3-credentials: 0.12—(12 releases total)—2022-06-30
  A tool for creating credentials for accessing S3 buckets
- datasette-scale-to-zero: 0.1.2—(3 releases total)—2022-06-23
  Quit Datasette if it has not received traffic for a specified time period