Simon Willison’s Weblog

What’s new in sqlite-utils 3.20 and 3.21: --lines, --text, --convert

11th January 2022

sqlite-utils is my combined CLI tool and Python library for manipulating SQLite databases. Consider this the annotated release notes for sqlite-utils 3.20 and 3.21, both released in the past week.

sqlite-utils insert --convert with --lines and --text

The sqlite-utils insert command inserts rows into a SQLite database from a JSON, CSV or TSV file, creating a table with the necessary columns if one does not exist already.
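
For example, you can pipe JSON to its standard input, passing - as the file argument (a minimal illustration, using a hypothetical pets.db database):

echo '[{"id": 1, "name": "Cleo"}]' | \
  sqlite-utils insert pets.db pets -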

It gained three new options in v3.20:

  • sqlite-utils insert ... --lines to insert the lines from a file into a table with a single line column, see Inserting unstructured data with --lines and --text.
  • sqlite-utils insert ... --text to insert the contents of the file into a table with a single text column and a single row.
  • sqlite-utils insert ... --convert allows a Python function to be provided that will be used to convert each row that is being inserted into the database. See Applying conversions while inserting data, including details on special behavior when combined with --lines and --text. (#356)

These features all evolved from an idea I had while re-reading my blog entry from last year, Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool. That blog entry introduced the sqlite-utils convert command, which can run a custom Python function against a column in a table to convert that data in some way.

Given a log file log.txt that looks something like this:

2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&page=2&tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";

I provided this example code to insert lines from a log file into a table with a single line column:

cat log.txt | \
    jq --raw-input '{line: .}' --compact-output | \
    sqlite-utils insert logs.db log - --nl

Since sqlite-utils insert expects JSON (or CSV/TSV) rather than raw text, this example first used jq to convert the lines into {"line": "..."} JSON objects.

My first idea was to improve this with the new --lines option, which lets you replace the above with this:

sqlite-utils insert logs.db log log.txt --lines

Using --lines will create a table with a single line column and import every line from the file as a row in that table.
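
Run against a fresh database, the resulting schema should look something like this:

% sqlite-utils schema logs.db
CREATE TABLE [log] (
   [line] TEXT
);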

In the article, I then demonstrated how --convert could be used to convert those imported lines into structured rows using a regular expression:

sqlite-utils convert logs.db log line --import re --multi "$(cat <<EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(value):
        pairs[key] = value1 or value2
    return pairs
EOD
)"

The new --convert option to sqlite-utils insert means you can now achieve the same thing using:

sqlite-utils insert logs.db log log.txt --lines \
  --import re --convert "$(cat <<EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(line):
        pairs[key] = value1 or value2
    return pairs
EOD
)"

Since the --lines option allows you to consume mostly unstructured files split by newlines, I decided to also add an option to consume an entire unstructured file as a single record. I originally called that --all but found the code got messy because it conflicted with Python’s all() built-in, so I renamed it to --text.

Used on its own, --text creates a table with a single column called text:

% sqlite-utils insert logs.db fulllog log.txt --text
% sqlite-utils schema logs.db
CREATE TABLE [fulllog] (
   [text] TEXT
);

But with --convert you can pass a snippet of Python code which can take that text value and convert it into a list of dictionaries, which will then be used to populate the table.

Here’s a fun example. The following one-liner uses the classic feedparser library to parse the Atom feed for my blog and load it into a database table:

curl 'https://simonwillison.net/atom/everything/' | \
  sqlite-utils insert feed.db entries --text --convert '
    feed = feedparser.parse(text)
    return feed.entries' - --import feedparser

The resulting database looks like this:

% sqlite-utils tables feed.db --counts -t
table      count
-------  -------
entries       30

% sqlite-utils schema feed.db
CREATE TABLE [entries] (
   [title] TEXT,
   [title_detail] TEXT,
   [links] TEXT,
   [link] TEXT,
   [published] TEXT,
   [published_parsed] TEXT,
   [updated] TEXT,
   [updated_parsed] TEXT,
   [id] TEXT,
   [guidislink] INTEGER,
   [summary] TEXT,
   [summary_detail] TEXT,
   [tags] TEXT
);

Not bad for a one-liner!

This example uses the --import option to import the feedparser library. This means you’ll need to have that library installed in the same virtual environment as sqlite-utils.

If you run into problems here (maybe because you installed sqlite-utils via Homebrew), one way to get everything into the same environment is the following:

python3 -m pip install feedparser sqlite-utils

Then use python3 -m sqlite_utils in place of sqlite-utils—this will ensure you are running the command from the same virtual environment where you installed the library.

Update 13th December 2022: sqlite-utils 3.30 introduced a new sqlite-utils install command for installing PyPI packages directly into the same virtual environment as sqlite-utils itself.

--convert for regular rows

The above examples combine --convert with the --lines and --text options to parse unstructured text into database tables.

But --convert works with the existing sqlite-utils insert options as well.

To review, those are the following:

  • sqlite-utils insert by default expects a JSON file that’s a list of objects, [{"id": 1, "text": "Like"}, {"id": 2, "text": "This"}].
  • sqlite-utils insert --nl accepts newline-delimited JSON, {"id": 1, "text": "Like"}\n{"id": 2, "text": "This"}.
  • sqlite-utils insert --csv and --tsv accept CSV/TSV, with --delimiter, --encoding, --quotechar and --no-headers options for customizing that import, and a --sniff option for automatically detecting those settings.

You can now use --convert to define a Python function that accepts a row dictionary representing each row from the import and modifies that dictionary or returns a fresh one with changes.

Here’s a simple example that produces just the capitalized name, the latitude and the longitude from the WRI’s global power plants CSV file:

curl https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv | \
  sqlite-utils insert plants.db plants - --csv --convert '
  return {
      "name": row["name"].upper(),
      "latitude": float(row["latitude"]),
      "longitude": float(row["longitude"]),
  }'

The resulting database looks like this:

% sqlite-utils schema plants.db
CREATE TABLE [plants] (
   [name] TEXT,
   [latitude] FLOAT,
   [longitude] FLOAT
);

% sqlite-utils rows plants.db plants | head -n 3
[{"name": "KAJAKI HYDROELECTRIC POWER PLANT AFGHANISTAN", "latitude": 32.322, "longitude": 65.119},
 {"name": "KANDAHAR DOG", "latitude": 31.67, "longitude": 65.795},
 {"name": "KANDAHAR JOL", "latitude": 31.623, "longitude": 65.792},

sqlite-utils bulk

  • New sqlite-utils bulk command which can import records in the same way as sqlite-utils insert (from JSON, CSV or TSV) and use them to bulk execute a parametrized SQL query. (#375)

With the addition of --lines, --text, --convert and --import, the sqlite-utils insert command is now a powerful tool for turning anything into a list of Python dictionaries, which can then in turn be inserted into a SQLite database table.

Which gave me an idea... what if you could use the same mechanisms to execute SQL statements in bulk instead?

Python’s SQLite library supports named parameters in SQL queries, which look like this:

insert into plants (id, name) values (:id, :name)

Those :id and :name parameters can be populated from a Python dictionary. And the .executemany() method can efficiently apply the same SQL query to a big list (or iterator or generator) of dictionaries in one go:

import sqlite3

db = sqlite3.connect("demo.db")  # assumes a plants table already exists
cursor = db.cursor()
cursor.executemany(
    "insert into plants (id, name) values (:id, :name)",
    [{"id": 1, "name": "One"}, {"id": 2, "name": "Two"}]
)
db.commit()

So I implemented the sqlite-utils bulk command, which takes the same import options as sqlite-utils insert but, instead of creating and populating the specified table, requires a SQL argument with a query that will be executed using the imported rows as arguments.

% sqlite-utils bulk demo.db \
  'insert into plants (id, name) values (:id, :name)' \
  plants.csv --csv

This feels like a powerful new feature, which was very simple to implement because the hard work of importing the data had already been done by the insert command.

Running ANALYZE

This idea came from Forest Gregg, who initially suggested running ANALYZE automatically as part of the sqlite-utils create-index command.

I have to confess: in all of my years of using SQLite, I’d never actually explored the ANALYZE command.

When run, it builds a new table called sqlite_stat1 containing statistics about each of the indexes in the database, indicating how “selective” each index is: effectively, how many rows on average you are likely to filter down to if you use that index.
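
Here’s a minimal sketch, using Python’s standard sqlite3 module against a hypothetical counties.db file, of running ANALYZE and then inspecting the statistics it gathers:

import sqlite3

conn = sqlite3.connect("counties.db")
# Build (or rebuild) the statistics
conn.execute("ANALYZE")
# sqlite_stat1 has one row per (table, index) pair; stat records the
# total number of rows followed by the average number of rows matching
# each distinct value in the index
for tbl, idx, stat in conn.execute("select tbl, idx, stat from sqlite_stat1"):
    print(tbl, idx, stat)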

The SQLite query planner can then use this to decide which index to consult. For example, given the following query:

select * from ny_times_us_counties
where state = 'Missouri' and county = 'Greene'

If there are indexes on both columns, should the query planner use the state column or the county column?

In this case the state column will filter down to 75,209 rows, while the county column filters to just 9,186, so the county index is clearly the better choice.

Impressively, SQLite seems to make this kind of decision perfectly well without the sqlite_stat1 table being populated: explain query plan select * from ny_times_us_counties where "county" = 'Greene' and "state" = 'Missouri' returns the following:

SEARCH TABLE ny_times_us_counties USING INDEX idx_ny_times_us_counties_county (county=?)

I’ve not actually found a good example of a query where the sqlite_stat1 table makes a difference yet, but I’m confident such queries exist!

Using SQL, you can run ANALYZE against an entire database by executing ANALYZE;, or against all of the indexes for a specific table with ANALYZE tablename;, or against a specific index by name using ANALYZE indexname;.
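
That is:

ANALYZE;            -- every index in the database
ANALYZE tablename;  -- every index on the tablename table
ANALYZE indexname;  -- just the index called indexname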

There’s one catch with ANALYZE: since running it populates a static sqlite_stat1 table, the data in that table can get out of date. If you insert another million rows into a table, for example, your ANALYZE statistics might no longer reflect ground truth, to the point that the query planner starts to make bad decisions.

For sqlite-utils I decided to make ANALYZE an explicit operation. In the Python library you can now run the following:

db.analyze() # Analyze every index in the database
db.analyze("indexname") # Analyze a specific index
db.analyze("tablename") # Analyze every index for that table
# Or the same thing using a table object:
db["tablename"].analyze()

I also added an optional analyze=True parameter to several methods, which you can use to trigger an ANALYZE once that operation completes:

db["tablename"].create_index(["column"], analyze=True)
db["tablename"].insert_rows(rows, analyze=True)
db["tablename"].delete_where(analyze=True)

The sqlite-utils CLI command has equivalent functionality:

# Analyze every index in a database:
% sqlite-utils analyze database.db
# Analyze a specific index:
% sqlite-utils analyze database.db indexname
# Analyze all indexes for a table:
% sqlite-utils analyze database.db tablename

And an --analyze option for various commands:

% sqlite-utils create-index ... --analyze
% sqlite-utils insert ... --analyze
% sqlite-utils upsert ... --analyze

Other smaller changes

  • New sqlite-utils create-database command for creating new empty database files. (#348)

Most sqlite-utils commands such as insert or create-table create the database file for you if it doesn’t already exist, but I decided it would be neat to have an explicit create-database command for deliberately creating an empty database.
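
For example:

% sqlite-utils create-database fresh.db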

Update 13th January 2022: I wrote a detailed description of my process building this command in How I build a feature.

  • The CLI tool can now also be run using python -m sqlite_utils. (#368)

I initially added this to help write a unit test that exercised the tool through a subprocess (see TIL Testing a Click app with streaming input), but it’s a neat pattern in general. Datasette gained this through a contribution from Abdussamet Koçak a few years ago.

  • Using --fmt now implies --table, so you don’t need to pass both options. (#374)

A nice tiny usability enhancement. You can now run sqlite-utils rows my.db mytable --fmt rst to get back a reStructuredText table—previously you also needed to add --table.

I sometimes re-read the documentation for older features to remind myself what they do, and occasionally an idea for a new feature jumps out at me. Implementing this one was a very small change.

  • The --nl import option now ignores blank lines in the input. (#376)
  • Fixed bug where streaming input to the insert command with --batch-size 1 would appear to only commit after several rows had been ingested, due to unnecessary input buffering. (#364)

That --nl improvement came from tinkering around trying to fix the bug.

The bug itself was interesting: I initially thought that my entire mechanism for committing on every --batch-size chunk was broken, but it turned out I was unnecessarily buffering data from standard input in order to support the --sniff option for detecting the shape of incoming CSV data.

  • db.supports_strict property showing if the database connection supports SQLite strict tables.
  • table.strict property (see .strict) indicating if the table uses strict mode. (#344)

See previous weeknotes: this is the first part of my ongoing work to support the new STRICT tables in SQLite.

I’m currently blocked on implementing more due to the need to get a robust mechanism up and running for executing sqlite-utils tests in CI against specific SQLite versions, see issue #346.
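
As a quick illustration, here’s a minimal sketch of those two properties; the demo.db file and t table are hypothetical, and this assumes a SQLite version new enough to support STRICT:

import sqlite_utils

db = sqlite_utils.Database("demo.db")
print(db.supports_strict)  # True if this SQLite supports STRICT tables
if db.supports_strict:
    db.execute("create table t (id integer) strict")
    print(db["t"].strict)  # True, because t was created as a STRICT table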
