What’s new in sqlite-utils 3.20 and 3.21: --lines, --text, --convert
11th January 2022
sqlite-utils is my combined CLI tool and Python library for manipulating SQLite databases. Consider this the annotated release notes for sqlite-utils 3.20 and 3.21, both released in the past week.
sqlite-utils insert --convert with --lines and --text
The sqlite-utils insert
command inserts rows into a SQLite database from a JSON, CSV or TSV file, creating a table with the necessary columns if one does not exist already.
It gained three new options in v3.20:
sqlite-utils insert ... --lines
to insert the lines from a file into a table with a singleline
column, see Inserting unstructured data with --lines and --text.sqlite-utils insert ... --text
to insert the contents of the file into a table with a singletext
column and a single row.sqlite-utils insert ... --convert
allows a Python function to be provided that will be used to convert each row that is being inserted into the database. See Applying conversions while inserting data, including details on special behavior when combined with--lines
and--text
. (#356)
These features all evolved from an idea I had while re-reading my blog entry from last year, Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool. That blog entry introduced the sqlite-utils convert
comand, which can run a custom Python function against a column in a table to convert that data in some way.
Given a log file log.txt
that looks something like this:
2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&page=2&tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";
I provided this example code to insert lines from a log file into a table with a single line
column:
cat log.txt | \
jq --raw-input '{line: .}' --compact-output | \
sqlite-utils insert logs.db log - --nl
Since sqlite-utils insert
requires JSON, this example first used jq
to convert the lines into {"line": "..."}
JSON objects.
My first idea was to improve this with the new --lines
option, which lets you replace the above with this:
sqlite-utils insert logs.db log log.txt --lines
Using --lines
will create a table with a single lines
column and import every line from the file as a row in that table.
In the article, I then demonstrated how --convert
could be used to convert those imported lines into structured rows using a regular expression:
sqlite-utils convert logs.db log line --import re --multi "$(cat <<EOD
r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
pairs = {}
for key, value1, value2 in r.findall(value):
pairs[key] = value1 or value2
return pairs
EOD
)"
The new --convert
option to sqlite-utils
means you can now achieve the same thing using:
sqlite-utils insert logs.db log log.txt --lines \
--import re --convert "$(cat <<EOD
r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
for key, value1, value2 in r.findall(line):
pairs[key] = value1 or value2
return pairs
EOD
)"
Since the --lines
option allows you to consume mostly unstructured files split by newlines, I decided to also add an option to consume an entire unstructured file as a single record. I originally called that --all
but found the code got messy because it conflicted with Python’s all()
built-in, so I renamed it to --text
.
Used on its own, --text
creates a table with a single column called text
:
% sqlite-utils insert logs.db fulllog log.txt --text
% sqlite-utils schema logs.db
CREATE TABLE [fulllog] (
[text] TEXT
);
But with --convert
you can pass a snippet of Python code which can take that text
value and convert it into a list of dictionaries, which will then be used to populate the table.
Here’s a fun example. The following one-liner uses the classic feedparser library to parse the Atom feed for my blog and load it into a database table:
curl 'https://simonwillison.net/atom/everything/' | \
sqlite-utils insert feed.db entries --text --convert '
feed = feedparser.parse(text)
return feed.entries' - --import feedparser
The resulting database looks like this:
% sqlite-utils tables feed.db --counts -t
table count
------- -------
feed 30
% sqlite-utils schema feed.db
CREATE TABLE [feed] (
[title] TEXT,
[title_detail] TEXT,
[links] TEXT,
[link] TEXT,
[published] TEXT,
[published_parsed] TEXT,
[updated] TEXT,
[updated_parsed] TEXT,
[id] TEXT,
[guidislink] INTEGER,
[summary] TEXT,
[summary_detail] TEXT,
[tags] TEXT
);
Not bad for a one-liner!
This example uses the --import
option to import that feedparser
library. This means you’ll need to have that library installed in the same virtual environment as sqlite-utils
.
If you run into problems here (maybe due to having installed sqlite-utils
via Homebrew) one way to do this is to use the following:
python3 -m pip install feedparser sqlite-utils
Then use python3 -m sqlite_utils
in place of sqlite-utils
—this will ensure you are running the command from the same virtual environment where you installed the library.
Update 13th December 2022: sqlite-utils 3.30 introduced a new sqlite-utils install command for installing PyPI packages directly into the same virtual environment as sqlite-utils
itself.
--convert for regular rows
The above examples combine --convert
with the --lines
and --text
options to parse unstructured text into database tables.
But --convert
works with the existing sqlite-utils insert
options as well.
To review, those are the following:
-
sqlite-utils insert
by default expects a JSON file that’s a list of objects,[{"id": 1, "text": "Like"}, {"id": 2, "text": "This"}]
. -
sqlite-utils insert --nl
accepts newline-delimited JSON,{"id": 1, "text": "Like"}\n{"id": 2, "text": "This"}
. -
sqlite-utils insert --csv
and--tsv
accepts CSV/TSV—with--delimiter
and--encoding
and--quotechar
and--no-headers
options for customizing that import, and a--sniff
option for automatically detecting those settings.
You can now use --convert
to define a Python function that accepts a row
dictionary representing each row from the import and modifies that dictionary or returns a fresh one with changes.
Here’s a simple example that produces just the capitalized name, the latitude and the longitude from the WRI’s global power plants CSV file:
curl https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv | \
sqlite-utils insert plants.db plants - --csv --convert '
return {
"name": row["name"].upper(),
"latitude": float(row["latitude"]),
"longitude": float(row["longitude"]),
}'
The resulting database looks like this:
% sqlite-utils schema plants.db
CREATE TABLE [plants] (
[name] TEXT,
[latitude] FLOAT,
[longitude] FLOAT
);
~ % sqlite-utils rows plants.db plants | head -n 3
[{"name": "KAJAKI HYDROELECTRIC POWER PLANT AFGHANISTAN", "latitude": 32.322, "longitude": 65.119},
{"name": "KANDAHAR DOG", "latitude": 31.67, "longitude": 65.795},
{"name": "KANDAHAR JOL", "latitude": 31.623, "longitude": 65.792},
sqlite-utils bulk
- New sqlite-utils bulk command which can import records in the same way as
sqlite-utils insert
(from JSON, CSV or TSV) and use them to bulk execute a parametrized SQL query. (#375)
With the addition of --lines
, --text
, --convert
and --import
the sqlite-utils insert
command is now a powerful tool for turning anything into a list of Python dictionaries, which can then in turn be inserted into a SQLite database table.
Which gave me an idea... what if you could use the same mechanisms to execute SQL statements in bulk instead?
Python’s SQLite library supports named parameters in SQL queries, which look like this:
insert into plants (id, name) values (:id, :name)
Those :id
and :name
parameters can be populated from a Python dictionary. And the .executemany()
method can efficiently apply the same SQL query to a big list (or iterator or generator) of dictionaries in one go:
cursor = db.cursor() cursor.executemany( "insert into plants (id, name) values (:id, :name)", [{"id": 1, "name": "One"}, {"id": 2, "name": "Two"}] )
So I implemented the sqlite-utils bulk
command, which takes the same import options as sqlite-utils
but instead of creating and populating the specified table requires a SQL
argument with a query that will be executed using the imported rows as arguments.
% sqlite-utils bulk demo.db \
'insert into plants (id, name) values (:id, :name)' \
plants.csv --csv
This feels like a powerful new feature, which was very simple to implement because the hard work of importing the data had already been done by the insert
command.
Running ANALYZE
- New Python methods for running
ANALYZE
against a database, table or index:db.analyze()
andtable.analyze()
, see Optimizing index usage with ANALYZE. (#366)- New sqlite-utils analyze command for running
ANALYZE
using the CLI. (#379)- The
create-index
,insert
andupsert
commands now have a new--analyze
option for runningANALYZE
after the command has completed. (#379)
This idea came from Forest Gregg, who initially suggested running ANALYZE
automatically as part of the sqlite-utils create-index
command.
I have to confess: in all of my years of using SQLite, I’d never actually explored the ANALYZE command.
When run, it builds a new table called sqlite_stats1
containing statistics about each of the indexes on the table—indicating how “selective” each index is—effectively how many rows on average you are likely to filter down to if you use the index.
The SQLite query planner can then use this to decide which index to consult. For example, given the following query:
select * from ny_times_us_counties
where state = 'Missouri' and county = 'Greene'
If there are indexes on both columns, should the query planner use the state
column or the county
column?
In this case the state
column will filter down to 75,209 rows, while the county
column filters to 9,186—so county
is clearly the better query plan.
Impressively, SQLite seems to make this kind of decision perfectly well without the sqlite_stat1
table being populated: explain query plan select * from ny_times_us_counties where “county” = ’Greene’ and “state” = ’Missouri’ returns the following:
SEARCH TABLE ny_times_us_counties USING INDEX idx_ny_times_us_counties_county (county=?)
I’ve not actually found a good example of a query where the sqlite_stat1
table makes a difference yet, but I’m confident such queries exist!
Using SQL, you can run ANALYZE
against an entire database by executing ANALYZE;
, or against all of the indexes for a specific table with ANALYZE tablename;
, or against a specific index by name using ANALYZE indexname;
.
There’s one catch with ANALYZE
: since running it populates a static sqlite_stat1
table, the data in that table can get out of date. If you insert another million rows into a table for example your analyzye statistics might no longer reflect ground truth to the point that the query planner starts to make bad decisions.
For sqlite-utils
I decided to make ANALYZE
an explicit operation. In the Python library you can now run the following:
db.analyze() # Analyze every index in the database db.analyze("indexname") # Analyze a specific index db.analyze("tablename") # Analyze every index for that table # Or the same thing using a table object: db["tablename"].analyze()
I also added an optional analyze=True
parameter to several methods, which you can use to trigger an ANALZYE
once that operation completes:
db["tablename"].create_index(["column"], analyze=True) db["tablename"].insert_rows(rows, analyze=True) db["tablename"].delete_where(analyze=True)
The sqlite-utils
CLI command has equivalent functionality:
# Analyze every index in a database:
% sqlite-utils analyze database.db
# Analyze a specific index:
% sqlite-utils analyze database.db indexname
# Analyze all indexes for a table:
% sqlite-utils analyze database.db tablename
And an --analyze
option for various commands:
% sqlite-utils create-index ... --analyze
% sqlite-utils insert ... --analyze
% sqlite-utils upsert ... --analyze
Other smaller changes
- New
sqlite-utils create-database
command for creating new empty database files. (#348)
Most sqlite-utils
commands such as insert
or create-table
create the database file for you if it doesn’t already exist, but I decided it would be neat to have an explicit create-database
command for deliberately creating an empty database.
Update 13th January 2022: I wrote a detailed description of my process building this command in How I build a feature.
- The CLI tool can now also be run using
python -m sqlite_utils
. (#368)
I initially added this to help write a unit test that exercised the tool through a subprocess (see TIL Testing a Click app with streaming input) but it’s a neat pattern in general. datasette
gained this through a contribution from Abdussamet Koçak a few years ago.
- Using
--fmt
now implies--table
, so you don’t need to pass both options. (#374)
A nice tiny usability enhancement. You can now run sqlite-utils rows my.db mytable --fmt rst
to get back a reStructuredText table—previously you also needed to add --table
.
- The insert-files command supports two new columns:
stem
andsuffix
. (#372)
I sometimes re-read the documentation for older features to remind me what they do, and occasionally an idea for a feature jumps out from that. Implementing these was a very small change.
That --nl
improvement came from tinkering around trying to fix the bug.
The bug itself was interesting: I initially thought that my entire mechanism for comitting on every --batch-size
chunk was broken, but it turned out I was unnecessarily buffering data from standard input in order to support the --sniff
option for detecting the shape of incoming CSV data.
db.supports_strict
property showing if the database connection supports SQLite strict tables.table.strict
property (see .strict) indicating if the table uses strict mode. (#344)
See previous weeknotes: this is the first part of my ongoing support for the new STRICT tables in SQLite.
I’m currently blocked on implementing more due to the need to get a robust mechanism up and running for executing sqlite-utils
tests in CI against specific SQLite versions, see issue #346.
Releases this week
-
sqlite-utils: 3.21—(92 releases total)—2022-01-11
Python CLI utility and library for manipulating SQLite databases -
sqlite-utils: 3.20—2022-01-05
-
stream-delay: 0.1—2022-01-08
Stream a file or stdin one line at a time with a delay
TILs this week
More recent articles
- Notes from Bing Chat—Our First Encounter With Manipulative AI - 19th November 2024
- Project: Civic Band - scraping and searching PDF meeting minutes from hundreds of municipalities - 16th November 2024
- Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac - 12th November 2024