<p><strong>Simon Willison's Weblog: New features in sqlite-utils</strong></p>
<p><strong>sqlite-utils now supports plugins</strong> - 24th July 2023 - <a href="https://simonwillison.net/2023/Jul/24/sqlite-utils-plugins/">simonwillison.net</a></p>
<p><a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-34">sqlite-utils 3.34</a> is out with a major new feature: support for <a href="https://sqlite-utils.datasette.io/en/stable/plugins.html">plugins</a>.</p>
<p><code>sqlite-utils</code> is my combination Python library and command-line tool for manipulating SQLite databases. It recently celebrated its fifth birthday, and has had over 100 releases since it first launched back in 2018.</p>
<p>The new plugin system is inspired by similar mechanisms <a href="https://docs.datasette.io/en/stable/plugins.html">in Datasette</a> and <a href="https://llm.datasette.io/en/stable/plugins/index.html">LLM</a>. It lets developers add new features to <code>sqlite-utils</code> without needing to get their changes accepted by the core project.</p>
<p>I love plugin systems. As an open source maintainer, I find they're by far the best way to encourage people to contribute to my projects - I can genuinely wake up in the morning and find my software has new features, without my even needing to review a pull request.</p>
<p>Plugins also offer a fantastic medium for exploration and experimentation. I can try out new ideas without committing to supporting them in core, and without needing to tie improvements to them to the core release cycle.</p>
<p>Version 3.34 adds <a href="https://sqlite-utils.datasette.io/en/stable/plugins.html#plugin-hooks">two initial plugin hooks</a>: <code>register_commands()</code> and <code>prepare_connection()</code>. These are both based on the equivalent hooks in Datasette.</p>
<p>I planned to just ship <code>register_commands()</code>, but Alex Garcia spotted my activity on the repo and submitted <a href="https://github.com/simonw/sqlite-utils/pull/573">a PR</a> adding <code>prepare_connection()</code> literally minutes before I had intended to ship the release!</p>
<h4>register_commands()</h4>
<p>The <code>register_commands()</code> hook lets you add new commands to the <code>sqlite-utils</code> command-line tool - so users can run <code>sqlite-utils your-new-command</code> to access your feature.</p>
<p>I've learned from past experience that you should never ship a plugin hook without also releasing at least one plugin that uses it. I've built two so far for <code>register_commands()</code>:</p>
<ul>
<li>
<a href="https://github.com/simonw/sqlite-utils-shell">sqlite-utils-shell</a> adds a simply interactive shell, accessed using <code>sqlite-utils shell</code> for an in-memory database or <code>sqlite-utils shell data.db</code> to run it against a specific database file.</li>
<li>
<a href="https://github.com/simonw/sqlite-migrate">sqlite-migrate</a> is my first draft of a database migrations system for SQLite, loosely inspired by Django migrations and previewed by the migration mechanism I <a href="https://github.com/simonw/llm/blob/0.6.1/llm/migrations.py">added to LLM</a>.</li>
</ul>
<p>Try out the shell plugin like this:</p>
<div class="highlight highlight-source-shell"><pre>sqlite-utils install sqlite-utils-shell
sqlite-utils shell</pre></div>
<p>The interface looks like this:</p>
<pre><code>In-memory database, content will be lost on exit
Type 'exit' to exit.
sqlite-utils> select 3 + 5;
3 + 5
-------
8
sqlite-utils>
</code></pre>
<h4>prepare_connection()</h4>
<p>This hook, contributed by Alex, lets you modify the connection object before it is used to execute any SQL. Most importantly, this lets you register custom SQLite functions.</p>
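<p>Under the hood this uses the standard <code>sqlite3</code> <code>create_function()</code> mechanism. Here's a minimal sketch of the kind of thing a <code>prepare_connection()</code> implementation does, using only the Python standard library - the <code>reverse_string</code> function is an invented example, not part of any released plugin:</p>

```python
import sqlite3


def reverse_string(s):
    # A toy custom function: reverses its string argument
    return s[::-1]


# This mirrors what a prepare_connection(conn) hook would do:
# register the Python function so SQL queries can call it
conn = sqlite3.connect(":memory:")
conn.create_function("reverse_string", 1, reverse_string)

result = conn.execute("select reverse_string('hello')").fetchone()[0]
print(result)  # olleh
```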
<p>I expect this to be the most common category of plugin. I've built one so far: <a href="https://github.com/simonw/sqlite-utils-dateutil">sqlite-utils-dateutil</a>, which adds functions for parsing dates and times using the <a href="https://dateutil.readthedocs.io/">dateutil</a> library.</p>
<p>It lets you do things like this:</p>
<div class="highlight highlight-source-shell"><pre>sqlite-utils install sqlite-utils-dateutil
sqlite-utils memory <span class="pl-s"><span class="pl-pds">"</span>select dateutil_parse('3rd october')<span class="pl-pds">"</span></span> -t</pre></div>
<p>Output:</p>
<pre><code>dateutil_parse('3rd october')
-------------------------------
2023-10-03T00:00:00
</code></pre>
<p>This works inside the <code>sqlite-utils shell</code> plugin too.</p>
<p>Plugins that you install also become available in the Python API interface to <code>sqlite-utils</code>:</p>
<div class="highlight highlight-text-python-console"><pre>>>> <span class="pl-k">import</span> sqlite_utils
>>> db <span class="pl-k">=</span> sqlite_utils.Database(<span class="pl-v">memory</span><span class="pl-k">=</span><span class="pl-c1">True</span>)
>>> <span class="pl-c1">list</span>(db.query(<span class="pl-s"><span class="pl-pds">"</span>select dateutil_parse('3rd october')<span class="pl-pds">"</span></span>))
[{"dateutil_parse('3rd october')": '2023-10-03T00:00:00'}]</pre></div>
<p>You can opt out of executing installed plugins by passing <code>execute_plugins=False</code> to the <code>Database()</code> constructor:</p>
<div class="highlight highlight-text-python-console"><pre>>>> db <span class="pl-k">=</span> sqlite_utils.Database(<span class="pl-v">memory</span><span class="pl-k">=</span><span class="pl-c1">True</span>, <span class="pl-v">execute_plugins</span><span class="pl-k">=</span><span class="pl-c1">False</span>)
>>> <span class="pl-c1">list</span>(db.query(<span class="pl-s"><span class="pl-pds">"</span>select dateutil_parse('3rd october')<span class="pl-pds">"</span></span>))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../site-packages/sqlite_utils/db.py", line 494, in query
    cursor = self.execute(sql, params or tuple())
  File ".../site-packages/sqlite_utils/db.py", line 512, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: no such function: dateutil_parse</pre></div>
<h4>sqlite-ml by Romain Clement</h4>
<p>I quietly released <code>sqlite-utils 3.34</code> on Saturday. The community has already released several plugins for it!</p>
<p>Romain Clement built <a href="https://github.com/rclement/sqlite-utils-ml">sqlite-utils-ml</a>, a plugin wrapper for his <a href="https://github.com/rclement/sqlite-ml">sqlite-ml</a> project.</p>
<p>This adds custom SQL functions for training machine learning models and running predictions, entirely within SQLite, using algorithms from <a href="https://scikit-learn.org">scikit-learn</a>.</p>
<p>Here's what that looks like running inside <code>sqlite-utils shell</code>:</p>
<div class="highlight highlight-source-shell"><pre>sqlite-utils install sqlite-utils-shell sqlite-utils-ml
sqlite-utils shell ml.db</pre></div>
<pre><code>Attached to ml.db
Type 'exit' to exit.
sqlite-utils> select sqml_load_dataset('iris') as dataset;
dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{"table": "dataset_iris", "feature_names": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"], "target_names": ["setosa", "versicolor", "virginica"], "size": 150}
sqlite-utils> select sqml_train(
...> 'Iris prediction',
...> 'classification',
...> 'logistic_regression',
...> 'dataset_iris',
...> 'target'
...> ) as training;
training
--------------------------------------------------------------------------------------------------------------------------------------------------------------
{"experiment_name": "Iris prediction", "prediction_type": "classification", "algorithm": "logistic_regression", "deployed": true, "score": 0.9736842105263158}
sqlite-utils> select
...> dataset_iris.*,
...> sqml_predict(
...> 'Iris prediction',
...> json_object(
...> 'sepal length (cm)', [sepal length (cm)],
...> 'sepal width (cm)', [sepal width (cm)],
...> 'petal length (cm)', [petal length (cm)],
...> 'petal width (cm)', [petal width (cm)]
...> )
...> ) as prediction
...> from dataset_iris
...> limit 1;
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target prediction
------------------- ------------------ ------------------- ------------------ -------- ------------
5.1 3.5 1.4 0.2 0 0
</code></pre>
<h4>SQLite extensions by Alex Garcia</h4>
<p>Alex Garcia has <a href="https://github.com/asg017/sqlite-ecosystem">a growing collection</a> of SQLite extensions, many of which are written in Rust but are packaged as wheels for ease of installation using Python.</p>
<p>Alex released five plugins for <code>sqlite-utils</code>, corresponding to five of his existing extensions:</p>
<ul>
<li><code>sqlite-utils-sqlite-regex</code></li>
<li><code>sqlite-utils-sqlite-path</code></li>
<li><code>sqlite-utils-sqlite-url</code></li>
<li><code>sqlite-utils-sqlite-ulid</code></li>
<li><code>sqlite-utils-sqlite-lines</code></li>
</ul>
<p>Here's an example of <code>sqlite-utils-sqlite-ulid</code> in action:</p>
<div class="highlight highlight-source-shell"><pre>sqlite-utils install sqlite-utils-sqlite-ulid
sqlite-utils memory <span class="pl-s"><span class="pl-pds">'</span>select ulid() u1, ulid() u2, ulid() u3<span class="pl-pds">'</span></span> <span class="pl-k">|</span> jq</pre></div>
<p>Output:</p>
<div class="highlight highlight-source-json"><pre>[
{
<span class="pl-ent">"u1"</span>: <span class="pl-s"><span class="pl-pds">"</span>01h64d1ysg1rx63z1gwy7nah4n<span class="pl-pds">"</span></span>,
<span class="pl-ent">"u2"</span>: <span class="pl-s"><span class="pl-pds">"</span>01h64d1ysgd7vx04sc9pncqh10<span class="pl-pds">"</span></span>,
<span class="pl-ent">"u3"</span>: <span class="pl-s"><span class="pl-pds">"</span>01h64d1ysgz1sy7njkqt86dkq9<span class="pl-pds">"</span></span>
}
]</pre></div>
<p>I've started a <a href="https://github.com/simonw/sqlite-utils-plugins">sqlite-utils plugin directory</a> with a list of all of the plugins so far.</p>
<h4>Building your own plugin</h4>
<p>If you want to try building your own plugin, the documentation includes a <a href="https://sqlite-utils.datasette.io/en/stable/plugins.html#building-a-plugin">simple step-by-step guide</a>. A plugin can be built with as few as two files: a Python module implementing the hooks, and a <code>pyproject.toml</code> file with metadata describing how it should be installed.</p>
<p>I've also released a new <a href="https://pypi.org/project/cookiecutter/">cookiecutter</a> template: <a href="https://github.com/simonw/sqlite-utils-plugin">simonw/sqlite-utils-plugin</a>. Here's how to use that to get started building a plugin:</p>
<div class="highlight highlight-source-shell"><pre>cookiecutter gh:simonw/sqlite-utils-plugin</pre></div>
<p>Answer the form fields like this:</p>
<pre><code>plugin_name []: rot13
description []: select rot13('text') as a sqlite-utils plugin
hyphenated [rot13]:
underscored [rot13]:
github_username []: your-username
author_name []: your-name
</code></pre>
<p>Change directory into the new folder and use <code>sqlite-utils install -e</code> to install an editable version of your plugin, so changes you make will be reflected when you run the tool:</p>
<div class="highlight highlight-source-shell"><pre><span class="pl-c1">cd</span> sqlite-utils-rot13
sqlite-utils install -e <span class="pl-c1">.</span></pre></div>
<p>Run this command to confirm the plugin has been installed:</p>
<div class="highlight highlight-source-shell"><pre>sqlite-utils plugins</pre></div>
<p>You should see this:</p>
<div class="highlight highlight-source-json"><pre>[
{
<span class="pl-ent">"name"</span>: <span class="pl-s"><span class="pl-pds">"</span>sqlite-utils-rot13<span class="pl-pds">"</span></span>,
<span class="pl-ent">"hooks"</span>: [
<span class="pl-s"><span class="pl-pds">"</span>prepare_connection<span class="pl-pds">"</span></span>
],
<span class="pl-ent">"version"</span>: <span class="pl-s"><span class="pl-pds">"</span>0.1<span class="pl-pds">"</span></span>
}
]</pre></div>
<p>Now drop this code into the <code>sqlite_utils_rot13.py</code> file:</p>
<pre><span class="pl-k">import</span> <span class="pl-s1">sqlite_utils</span>
<span class="pl-k">def</span> <span class="pl-en">rot13</span>(<span class="pl-s1">s</span>):
<span class="pl-s1">chars</span> <span class="pl-c1">=</span> []
<span class="pl-k">for</span> <span class="pl-s1">v</span> <span class="pl-c1">in</span> <span class="pl-s1">s</span>:
<span class="pl-s1">c</span> <span class="pl-c1">=</span> <span class="pl-en">ord</span>(<span class="pl-s1">v</span>)
<span class="pl-k">if</span> <span class="pl-s1">c</span> <span class="pl-c1">>=</span> <span class="pl-en">ord</span>(<span class="pl-s">"a"</span>) <span class="pl-c1">and</span> <span class="pl-s1">c</span> <span class="pl-c1"><=</span> <span class="pl-en">ord</span>(<span class="pl-s">"z"</span>):
<span class="pl-k">if</span> <span class="pl-s1">c</span> <span class="pl-c1">></span> <span class="pl-en">ord</span>(<span class="pl-s">"m"</span>):
<span class="pl-s1">c</span> <span class="pl-c1">-=</span> <span class="pl-c1">13</span>
<span class="pl-k">else</span>:
<span class="pl-s1">c</span> <span class="pl-c1">+=</span> <span class="pl-c1">13</span>
<span class="pl-k">elif</span> <span class="pl-s1">c</span> <span class="pl-c1">>=</span> <span class="pl-en">ord</span>(<span class="pl-s">"A"</span>) <span class="pl-c1">and</span> <span class="pl-s1">c</span> <span class="pl-c1"><=</span> <span class="pl-en">ord</span>(<span class="pl-s">"Z"</span>):
<span class="pl-k">if</span> <span class="pl-s1">c</span> <span class="pl-c1">></span> <span class="pl-en">ord</span>(<span class="pl-s">"M"</span>):
<span class="pl-s1">c</span> <span class="pl-c1">-=</span> <span class="pl-c1">13</span>
<span class="pl-k">else</span>:
<span class="pl-s1">c</span> <span class="pl-c1">+=</span> <span class="pl-c1">13</span>
<span class="pl-s1">chars</span>.<span class="pl-en">append</span>(<span class="pl-en">chr</span>(<span class="pl-s1">c</span>))
<span class="pl-k">return</span> <span class="pl-s">""</span>.<span class="pl-en">join</span>(<span class="pl-s1">chars</span>)
<span class="pl-en">@<span class="pl-s1">sqlite_utils</span>.<span class="pl-s1">hookimpl</span></span>
<span class="pl-k">def</span> <span class="pl-en">prepare_connection</span>(<span class="pl-s1">conn</span>):
<span class="pl-s1">conn</span>.<span class="pl-en">create_function</span>(<span class="pl-s">"rot13"</span>, <span class="pl-c1">1</span>, <span class="pl-s1">rot13</span>)</pre>
<p>And try it out like this:</p>
<div class="highlight highlight-source-shell"><pre>sqlite-utils memory <span class="pl-s"><span class="pl-pds">"</span>select rot13('hello world')<span class="pl-pds">"</span></span></pre></div>
<p>Output:</p>
<div class="highlight highlight-source-json"><pre>[{<span class="pl-ent">"rot13('hello world')"</span>: <span class="pl-s"><span class="pl-pds">"</span>uryyb jbeyq<span class="pl-pds">"</span></span>}]</pre></div>
<p>And to reverse that:</p>
<pre><code>sqlite-utils memory "select rot13('uryyb jbeyq')"
</code></pre>
<p>Output:</p>
<div class="highlight highlight-source-json"><pre>[{<span class="pl-ent">"rot13('uryyb jbeyq')"</span>: <span class="pl-s"><span class="pl-pds">"</span>hello world<span class="pl-pds">"</span></span>}]</pre></div>
<p>As you can see, building plugins can be done with very little code. I'm excited to see what else people build with this new capability!</p>
<p><strong>What's new in sqlite-utils 3.20 and 3.21: --lines, --text, --convert</strong> - 11th January 2022 - <a href="https://simonwillison.net/2022/Jan/11/sqlite-utils/">simonwillison.net</a></p>
<p><a href="https://sqlite-utils.datasette.io/">sqlite-utils</a> is my combined CLI tool and Python library for manipulating SQLite databases. Consider this the <a href="https://simonwillison.net/tags/annotatedreleasenotes/">annotated release notes</a> for sqlite-utils <a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-20">3.20</a> and <a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-21">3.21</a>, both released in the past week.</p>
<h4>sqlite-utils insert --convert with --lines and --text</h4>
<p>The <code>sqlite-utils insert</code> command inserts rows into a SQLite database from a JSON, CSV or TSV file, creating a table with the necessary columns if one does not exist already.</p>
<p>It gained three new options in v3.20:</p>
<blockquote>
<ul>
<li>
<code>sqlite-utils insert ... --lines</code> to insert the lines from a file into a table with a single <code>line</code> column, see <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-insert-unstructured">Inserting unstructured data with --lines and --text</a>.</li>
<li>
<code>sqlite-utils insert ... --text</code> to insert the contents of the file into a table with a single <code>text</code> column and a single row.</li>
<li>
<code>sqlite-utils insert ... --convert</code> allows a Python function to be provided that will be used to convert each row that is being inserted into the database. See <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-insert-convert">Applying conversions while inserting data</a>, including details on special behavior when combined with <code>--lines</code> and <code>--text</code>. (<a href="https://github.com/simonw/sqlite-utils/issues/356">#356</a>)</li>
</ul>
</blockquote>
<p>These features all evolved from an idea I had while re-reading my blog entry from last year, <a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/">Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool</a>. That blog entry introduced the <code>sqlite-utils convert</code> command, which can run a custom Python function against a column in a table to convert that data in some way.</p>
<p>Given a log file <code>log.txt</code> that looks something like this:</p>
<p><code>2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&page=2&tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";
</code></p>
<p>I provided this example code to insert lines from a log file into a table with a single <code>line</code> column:</p>
<pre><code>cat log.txt | \
jq --raw-input '{line: .}' --compact-output | \
sqlite-utils insert logs.db log - --nl
</code></pre>
<p>Since <code>sqlite-utils insert</code> requires JSON, this example first used <code>jq</code> to convert the lines into <code>{"line": "..."}</code> JSON objects.</p>
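<p>The <code>jq</code> step in that pipeline just wraps each line in a JSON object. Here's a rough Python equivalent of that transformation, to make the shape of the data explicit - the sample lines are invented:</p>

```python
import json

# What the jq step produces: one {"line": ...} JSON object per input line
lines = ["first log line", "second log line"]
ndjson = "\n".join(json.dumps({"line": line}) for line in lines)
print(ndjson)
# {"line": "first log line"}
# {"line": "second log line"}
```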
<p>My first idea was to improve this with the new <code>--lines</code> option, which lets you replace the above with this:</p>
<pre><code>sqlite-utils insert logs.db log log.txt --lines
</code></pre>
<p>Using <code>--lines</code> will create a table with a single <code>line</code> column and import every line from the file as a row in that table.</p>
<p>In the article, I then demonstrated how <code>--convert</code> could be used to convert those imported lines into structured rows using a regular expression:</p>
<pre><code>sqlite-utils convert logs.db log line --import re --multi "$(cat <<EOD
r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
pairs = {}
for key, value1, value2 in r.findall(value):
    pairs[key] = value1 or value2
return pairs
EOD
)"
</code></pre>
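<p>That conversion function is easy to test on its own in Python before wiring it into the command. Assuming a simplified sample line in the same <code>key=value</code> / <code>key="quoted value"</code> format (shorter than the real log line above):</p>

```python
import re

# A simplified, invented sample line in the key=value format
value = 'status_code=404 remote_addr="10.1.92.174" request_time="4.212"'

r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
pairs = {}
for key, value1, value2 in r.findall(value):
    # value1 is the quoted capture, value2 the unquoted one
    pairs[key] = value1 or value2

print(pairs)
# {'status_code': '404', 'remote_addr': '10.1.92.174', 'request_time': '4.212'}
```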
<p>The new <code>--convert</code> option to <code>sqlite-utils insert</code> means you can now achieve the same thing using:</p>
<pre><code>sqlite-utils insert logs.db log log.txt --lines \
  --import re --convert "$(cat <<EOD
r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
pairs = {}
for key, value1, value2 in r.findall(line):
    pairs[key] = value1 or value2
return pairs
EOD
)"
</code></pre>
<p>Since the <code>--lines</code> option allows you to consume mostly unstructured files split by newlines, I decided to also add an option to consume an entire unstructured file as a single record. I originally called that <code>--all</code> but found the code got messy because it conflicted with Python's <code>all()</code> built-in, so I renamed it to <code>--text</code>.</p>
<p>Used on its own, <code>--text</code> creates a table with a single column called <code>text</code>:</p>
<pre><code>% sqlite-utils insert logs.db fulllog log.txt --text
% sqlite-utils schema logs.db
CREATE TABLE [fulllog] (
[text] TEXT
);
</code></pre>
<p>But with <code>--convert</code> you can pass a snippet of Python code which can take that <code>text</code> value and convert it into a list of dictionaries, which will then be used to populate the table.</p>
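<p>The shape of that pattern, sketched in plain Python: the conversion code receives the whole file as <code>text</code> and returns a list of dictionaries, one per row to insert. The CSV-ish sample data and <code>convert</code> function here are invented for illustration:</p>

```python
# Invented sample "file contents" - in real usage this is the
# whole file passed in by --text
text = "id,name\n1,One\n2,Two"


def convert(text):
    # Split the text into rows: header line, then one dict per data line
    lines = text.splitlines()
    headers = lines[0].split(",")
    return [dict(zip(headers, line.split(","))) for line in lines[1:]]


rows = convert(text)
print(rows)  # [{'id': '1', 'name': 'One'}, {'id': '2', 'name': 'Two'}]
```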
<p>Here's a fun example. The following one-liner uses the classic <a href="https://feedparser.readthedocs.io/">feedparser</a> library to parse the Atom feed for my blog and load it into a database table:</p>
<pre><code>curl 'https://simonwillison.net/atom/everything/' | \
sqlite-utils insert feed.db entries --text --convert '
feed = feedparser.parse(text)
return feed.entries' - --import feedparser
</code></pre>
<p>The resulting database looks like this:</p>
<pre><code>% sqlite-utils tables feed.db --counts -t
table    count
-------  -------
entries  30
% sqlite-utils schema feed.db
CREATE TABLE [entries] (
[title] TEXT,
[title_detail] TEXT,
[links] TEXT,
[link] TEXT,
[published] TEXT,
[published_parsed] TEXT,
[updated] TEXT,
[updated_parsed] TEXT,
[id] TEXT,
[guidislink] INTEGER,
[summary] TEXT,
[summary_detail] TEXT,
[tags] TEXT
);
</code></pre>
<p>Not bad for a one-liner!</p>
<p>This example uses the <code>--import</code> option to import that <code>feedparser</code> library. This means you'll need to have that library installed in the same virtual environment as <code>sqlite-utils</code>.</p>
<p>If you run into problems here (maybe due to having installed <code>sqlite-utils</code> via Homebrew) one way to do this is to use the following:</p>
<pre><code>python3 -m pip install feedparser sqlite-utils
</code></pre>
<p>Then use <code>python3 -m sqlite_utils</code> in place of <code>sqlite-utils</code> - this will ensure you are running the command from the same virtual environment where you installed the library.</p>
<p><strong>Update 13th December 2022</strong>: <a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-30">sqlite-utils 3.30</a> introduced a new <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-install">sqlite-utils install</a> command for installing PyPI packages directly into the same virtual environment as <code>sqlite-utils</code> itself.</p>
<h4>--convert for regular rows</h4>
<p>The above examples combine <code>--convert</code> with the <code>--lines</code> and <code>--text</code> options to parse unstructured text into database tables.</p>
<p>But <code>--convert</code> works with the existing <code>sqlite-utils insert</code> options as well.</p>
<p>To review, those are the following:</p>
<ul>
<li>
<code>sqlite-utils insert</code> by default expects a JSON file that's a list of objects, <code>[{"id": 1, "text": "Like"}, {"id": 2, "text": "This"}]</code>.</li>
<li>
<code>sqlite-utils insert --nl</code> accepts newline-delimited JSON, <code>{"id": 1, "text": "Like"}\n{"id": 2, "text": "This"}</code>.</li>
<li>
<code>sqlite-utils insert --csv</code> and <code>--tsv</code> accept CSV/TSV - with <code>--delimiter</code> and <code>--encoding</code> and <code>--quotechar</code> and <code>--no-headers</code> options for customizing that import, and a <code>--sniff</code> option for automatically detecting those settings.</li>
</ul>
<p>You can now use <code>--convert</code> to define a Python function that accepts a <code>row</code> dictionary representing each row from the import and modifies that dictionary or returns a fresh one with changes.</p>
<p>Here's a simple example that produces just the capitalized name, the latitude and the longitude from the WRI's <a href="https://github.com/wri/global-power-plant-database/blob/master/output_database/global_power_plant_database.csv">global power plants</a> CSV file:</p>
<pre><code>curl https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv | \
  sqlite-utils insert plants.db plants - --csv --convert '
return {
    "name": row["name"].upper(),
    "latitude": float(row["latitude"]),
    "longitude": float(row["longitude"]),
}'
</code></pre>
<p>The resulting database looks like this:</p>
<pre><code>% sqlite-utils schema plants.db
CREATE TABLE [plants] (
[name] TEXT,
[latitude] FLOAT,
[longitude] FLOAT
);
% sqlite-utils rows plants.db plants | head -n 3
[{"name": "KAJAKI HYDROELECTRIC POWER PLANT AFGHANISTAN", "latitude": 32.322, "longitude": 65.119},
{"name": "KANDAHAR DOG", "latitude": 31.67, "longitude": 65.795},
{"name": "KANDAHAR JOL", "latitude": 31.623, "longitude": 65.792},
</code></pre>
<h4>sqlite-utils bulk</h4>
<blockquote>
<ul>
<li>New <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-bulk">sqlite-utils bulk command</a> which can import records in the same way as <code>sqlite-utils insert</code> (from JSON, CSV or TSV) and use them to bulk execute a parametrized SQL query. (<a href="https://github.com/simonw/sqlite-utils/issues/375">#375</a>)</li>
</ul>
</blockquote>
<p>With the addition of <code>--lines</code>, <code>--text</code>, <code>--convert</code> and <code>--import</code> the <code>sqlite-utils insert</code> command is now a powerful tool for turning anything into a list of Python dictionaries, which can then in turn be inserted into a SQLite database table.</p>
<p>Which gave me an idea... what if you could use the same mechanisms to execute SQL statements in bulk instead?</p>
<p>Python's SQLite library supports named parameters in SQL queries, which look like this:</p>
<div class="highlight highlight-source-sql"><pre><span class="pl-k">insert into</span> plants (id, name) <span class="pl-k">values</span> (:id, :name)</pre></div>
<p>Those <code>:id</code> and <code>:name</code> parameters can be populated from a Python dictionary. And the <code>.executemany()</code> method can efficiently apply the same SQL query to a big list (or iterator or generator) of dictionaries in one go:</p>
<pre><span class="pl-s1">cursor</span> <span class="pl-c1">=</span> <span class="pl-s1">db</span>.<span class="pl-en">cursor</span>()
<span class="pl-s1">cursor</span>.<span class="pl-en">executemany</span>(
<span class="pl-s">"insert into plants (id, name) values (:id, :name)"</span>,
[{<span class="pl-s">"id"</span>: <span class="pl-c1">1</span>, <span class="pl-s">"name"</span>: <span class="pl-s">"One"</span>}, {<span class="pl-s">"id"</span>: <span class="pl-c1">2</span>, <span class="pl-s">"name"</span>: <span class="pl-s">"Two"</span>}]
)</pre>
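<p>Here's the same pattern as a complete, runnable script using Python's built-in <code>sqlite3</code> module - the table and rows are invented for illustration:</p>

```python
import sqlite3

# In-memory database with a table matching the example query
db = sqlite3.connect(":memory:")
db.execute("create table plants (id integer primary key, name text)")

cursor = db.cursor()
cursor.executemany(
    "insert into plants (id, name) values (:id, :name)",
    [{"id": 1, "name": "One"}, {"id": 2, "name": "Two"}],
)
db.commit()

names = [row[0] for row in db.execute("select name from plants order by id")]
print(names)  # ['One', 'Two']
```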
<p>So I implemented the <code>sqlite-utils bulk</code> command, which takes the same import options as <code>sqlite-utils insert</code> but, instead of creating and populating the specified table, requires a SQL argument with a query that will be executed using the imported rows as arguments.</p>
<pre><code>% sqlite-utils bulk demo.db \
'insert into plants (id, name) values (:id, :name)' \
plants.csv --csv</code></pre>
<p>This feels like a powerful new feature, which was very simple to implement because the hard work of importing the data had already been done by the <code>insert</code> command.</p>
<h4 id="running-analyze">Running ANALYZE</h4>
<blockquote>
<ul>
<li>New Python methods for running <code>ANALYZE</code> against a database, table or index: <code>db.analyze()</code> and <code>table.analyze()</code>, see <a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-analyze">Optimizing index usage with ANALYZE</a>. (<a href="https://github.com/simonw/sqlite-utils/issues/366">#366</a>)</li>
<li>New <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-analyze">sqlite-utils analyze command</a> for running <code>ANALYZE</code> using the CLI. (<a href="https://github.com/simonw/sqlite-utils/issues/379">#379</a>)</li>
<li>The <code>create-index</code>, <code>insert</code> and <code>upsert</code> commands now have a new <code>--analyze</code> option for running <code>ANALYZE</code> after the command has completed. (<a href="https://github.com/simonw/sqlite-utils/issues/379">#379</a>)</li>
</ul>
</blockquote>
<p>This idea came from Forest Gregg, who <a href="https://github.com/simonw/sqlite-utils/issues/365">initially suggested</a> running <code>ANALYZE</code> automatically as part of the <code>sqlite-utils create-index</code> command.</p>
<p>I have to confess: in all of my years of using SQLite, I'd never actually explored <a href="https://www.sqlite.org/lang_analyze.html">the ANALYZE command</a>.</p>
<p>When run, it builds a new table called <code>sqlite_stat1</code> containing statistics about each of the indexes on the table - indicating how "selective" each index is - effectively how many rows on average you are likely to filter down to if you use the index.</p>
<p>The SQLite query planner can then use this to decide which index to consult. For example, given the following query:</p>
<div class="highlight highlight-source-sql"><pre><span class="pl-k">select</span> <span class="pl-k">*</span> <span class="pl-k">from</span> ny_times_us_counties
<span class="pl-k">where</span> state <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>Missouri<span class="pl-pds">'</span></span> <span class="pl-k">and</span> county <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>Greene<span class="pl-pds">'</span></span></pre></div>
<p>(<a href="https://covid-19.datasettes.com/covid?sql=select+*+from+ny_times_us_counties%0D%0Awhere+state+%3D+%27Missouri%27+and+county+%3D+%27Greene%27&p0=Greene&p1=Missouri">Try that here</a>.)</p>
<p>If there are indexes on both columns, should the query planner use the <code>state</code> column or the <code>county</code> column?</p>
<p>In this case the <code>state</code> column will filter down to 75,209 rows, while the <code>county</code> column filters to 9,186 - so <code>county</code> is clearly the better query plan.</p>
<p>Impressively, SQLite seems to make this kind of decision perfectly well without the <code>sqlite_stat1</code> table being populated: <a href="https://covid-19.datasettes.com/covid?sql=explain+query+plan+select+*+from+ny_times_us_counties+where+%22county%22+%3D+%3Ap0+and+%22state%22+%3D+%3Ap1&p0=Greene&p1=Missouri">explain query plan select * from ny_times_us_counties where "county" = 'Greene' and "state" = 'Missouri'</a> returns the following:</p>
<p><code>SEARCH TABLE ny_times_us_counties USING INDEX idx_ny_times_us_counties_county (county=?)</code></p>
<p>I've not actually found a good example of a query where the <code>sqlite_stat1</code> table makes a difference yet, but I'm confident such queries exist!</p>
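<p>You can watch <code>ANALYZE</code> populate the <code>sqlite_stat1</code> table using just Python's built-in <code>sqlite3</code> module - the table and index names here are invented for illustration:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table counties (state text, county text)")
db.execute("create index idx_counties_state on counties (state)")
db.executemany(
    "insert into counties values (?, ?)",
    [("Missouri", "County %d" % i) for i in range(100)],
)

# Before ANALYZE there are no statistics; running it creates and
# populates the sqlite_stat1 table with one row per analyzed index
db.execute("ANALYZE")
stats = db.execute("select tbl, idx, stat from sqlite_stat1").fetchall()
print(stats)
```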
<p>Using SQL, you can run <code>ANALYZE</code> against an entire database by executing <code>ANALYZE;</code>, or against all of the indexes for a specific table with <code>ANALYZE tablename;</code>, or against a specific index by name using <code>ANALYZE indexname;</code>.</p>
<p>There's one catch with <code>ANALYZE</code>: since running it populates a static <code>sqlite_stat1</code> table, the data in that table can get out of date. If you insert another million rows into a table, for example, your analyze statistics might no longer reflect ground truth to the point that the query planner starts to make bad decisions.</p>
<p>For <code>sqlite-utils</code> I decided to make <code>ANALYZE</code> an explicit operation. In the Python library you can now run the following:</p>
<pre><span class="pl-s1">db</span>.<span class="pl-en">analyze</span>() <span class="pl-c"># Analyze every index in the database</span>
<span class="pl-s1">db</span>.<span class="pl-en">analyze</span>(<span class="pl-s">"indexname"</span>) <span class="pl-c"># Analyze a specific index</span>
<span class="pl-s1">db</span>.<span class="pl-en">analyze</span>(<span class="pl-s">"tablename"</span>) <span class="pl-c"># Analyze every index for that table</span>
<span class="pl-c"># Or the same thing using a table object:</span>
<span class="pl-s1">db</span>[<span class="pl-s">"tablename"</span>].<span class="pl-en">analyze</span>()</pre>
<p>I also added an optional <code>analyze=True</code> parameter to several methods, which you can use to trigger an <code>ANALYZE</code> once that operation completes:</p>
<pre><span class="pl-s1">db</span>[<span class="pl-s">"tablename"</span>].<span class="pl-en">create_index</span>([<span class="pl-s">"column"</span>], <span class="pl-s1">analyze</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)
<span class="pl-s1">db</span>[<span class="pl-s">"tablename"</span>].<span class="pl-en">insert_all</span>(<span class="pl-s1">rows</span>, <span class="pl-s1">analyze</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)
<span class="pl-s1">db</span>[<span class="pl-s">"tablename"</span>].<span class="pl-en">delete_where</span>(<span class="pl-s1">analyze</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)</pre>
<p>The <code>sqlite-utils</code> CLI command has equivalent functionality:</p>
<pre><code># Analyze every index in a database:
% sqlite-utils analyze database.db
# Analyze a specific index:
% sqlite-utils analyze database.db indexname
# Analyze all indexes for a table:
% sqlite-utils analyze database.db tablename
</code></pre>
<p>And an <code>--analyze</code> option for various commands:</p>
<pre><code>% sqlite-utils create-index ... --analyze
% sqlite-utils insert ... --analyze
% sqlite-utils upsert ... --analyze
</code></pre>
<h4>Other smaller changes</h4>
<blockquote>
<ul>
<li>New <code>sqlite-utils create-database</code> command for creating new empty database files. (<a href="https://github.com/simonw/sqlite-utils/issues/348">#348</a>)</li>
</ul>
</blockquote>
<p>Most <code>sqlite-utils</code> commands such as <code>insert</code> or <code>create-table</code> create the database file for you if it doesn't already exist, but I decided it would be neat to have an explicit <code>create-database</code> command for deliberately creating an empty database.</p>
<p><strong>Update 13th January 2022</strong>: I wrote a detailed description of my process building this command in <a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/">How I build a feature</a>.</p>
<blockquote>
<ul>
<li>The CLI tool can now also be run using <code>python -m sqlite_utils</code>. (<a href="https://github.com/simonw/sqlite-utils/issues/368">#368</a>)</li>
</ul>
</blockquote>
<p>I initially added this to help write a unit test that exercised the tool through a subprocess (see TIL <a href="https://til.simonwillison.net/pytest/test-click-app-with-streaming-input">Testing a Click app with streaming input</a>) but it's a neat pattern in general. <code>datasette</code> gained this through <a href="https://github.com/simonw/datasette/pull/556">a contribution</a> from Abdussamet Koçak a few years ago.</p>
<blockquote>
<ul>
<li>Using <code>--fmt</code> now implies <code>--table</code>, so you don't need to pass both options. (<a href="https://github.com/simonw/sqlite-utils/issues/374">#374</a>)</li>
</ul>
</blockquote>
<p>A nice tiny usability enhancement. You can now run <code>sqlite-utils rows my.db mytable --fmt rst</code> to get back a reStructuredText table - previously you also needed to add <code>--table</code>.</p>
<blockquote>
<ul>
<li>The <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-insert-files">insert-files command</a> supports two new columns: <code>stem</code> and <code>suffix</code>. (<a href="https://github.com/simonw/sqlite-utils/issues/372">#372</a>)</li>
</ul>
</blockquote>
<p>I sometimes re-read the documentation for older features to remind me what they do, and occasionally an idea for a feature jumps out from that. Implementing these was <a href="https://github.com/simonw/sqlite-utils/commit/c9ecd0d6a32d4518c9b92bcc08183a10268d52d7#diff-76294b3d4afeb27e74e738daa01c26dd4dc9ccb6f4477451483a2ece1095902e">a very small change</a>.</p>
<blockquote>
<ul>
<li>The <code>--nl</code> import option now ignores blank lines in the input. (<a href="https://github.com/simonw/sqlite-utils/issues/376">#376</a>)</li>
<li>Fixed bug where streaming input to the <code>insert</code> command with <code>--batch-size 1</code> would appear to only commit after several rows had been ingested, due to unnecessary input buffering. (<a href="https://github.com/simonw/sqlite-utils/issues/364">#364</a>)</li>
</ul>
</blockquote>
<p>That <code>--nl</code> improvement came from tinkering around trying to fix the bug.</p>
<p>The bug itself was interesting: I initially thought that my entire mechanism for committing on every <code>--batch-size</code> chunk was broken, but it turned out I was unnecessarily buffering data from standard input in order to support the <code>--sniff</code> option for detecting the shape of incoming CSV data.</p>
<blockquote>
<ul>
<li>
<code>db.supports_strict</code> property showing if the database connection supports <a href="https://www.sqlite.org/stricttables.html">SQLite strict tables</a>.</li>
<li>
<code>table.strict</code> property (see <a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-introspection-strict">.strict</a>) indicating if the table uses strict mode. (<a href="https://github.com/simonw/sqlite-utils/issues/344">#344</a>)</li>
</ul>
</blockquote>
<p>See <a href="https://simonwillison.net/2021/Dec/1/beautiful-yaks/">previous weeknotes</a>: this is the first part of my ongoing support for the new STRICT tables in SQLite.</p>
<p>I'm currently blocked on implementing more due to the need to get a robust mechanism up and running for executing <code>sqlite-utils</code> tests in CI against specific SQLite versions, see <a href="https://github.com/simonw/sqlite-utils/issues/346">issue #346</a>.</p>
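<p>One way to probe for STRICT support from Python (my own check, not how <code>sqlite-utils</code> necessarily implements <code>db.supports_strict</code>) is to attempt to create a STRICT table and see whether SQLite accepts the keyword:</p>

```python
import sqlite3

def supports_strict(conn):
    # STRICT was added in SQLite 3.37.0 (November 2021); older versions
    # raise an OperationalError on the unrecognized keyword
    try:
        conn.execute("CREATE TABLE strict_probe (id INTEGER) STRICT")
        conn.execute("DROP TABLE strict_probe")
        return True
    except sqlite3.OperationalError:
        return False

conn = sqlite3.connect(":memory:")
result = supports_strict(conn)
print(result)
```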
<h4>Releases this week</h4>
<ul>
<li>
<p><strong><a href="https://github.com/simonw/sqlite-utils">sqlite-utils</a></strong>: <a href="https://github.com/simonw/sqlite-utils/releases/tag/3.21">3.21</a> - (<a href="https://github.com/simonw/sqlite-utils/releases">92 releases total</a>) - 2022-01-11
<br />Python CLI utility and library for manipulating SQLite databases</p>
</li>
<li>
<p><strong><a href="https://github.com/simonw/sqlite-utils">sqlite-utils</a></strong>: <a href="https://github.com/simonw/sqlite-utils/releases/tag/3.20">3.20</a> - 2022-01-05</p>
</li>
<li>
<p><strong><a href="https://github.com/simonw/stream-delay">stream-delay</a></strong>: <a href="https://github.com/simonw/stream-delay/releases/tag/0.1">0.1</a> - 2022-01-08
<br />Stream a file or stdin one line at a time with a delay</p>
</li>
</ul>
<h4>TILs this week</h4>
<ul>
<li><a href="https://til.simonwillison.net/pytest/pytest-argparse">Writing pytest tests against tools written with argparse</a></li>
<li><a href="https://til.simonwillison.net/pytest/test-click-app-with-streaming-input">Testing a Click app with streaming input</a></li>
</ul>
Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool2021-08-06T06:05:15+00:002021-08-06T06:05:15+00:00https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/#atom-series
<p>Earlier this week I released <a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-14">sqlite-utils 3.14</a> with a powerful new command-line tool: <code>sqlite-utils convert</code>, which applies a conversion function to data stored in a SQLite column.</p>
<p>Anyone who works with data will tell you that 90% of the work is cleaning it up. Running command-line conversions against data in a SQLite file turns out to be a really productive way to do that.</p>
<h4>Transforming a column</h4>
<p>Here's a simple example. Say someone gave you data with numbers that are formatted with commas - like <code>3,044,502</code> - in a <code>count</code> column in a <code>states</code> table.</p>
<p>You can strip those commas out like so:</p>
<pre><code>sqlite-utils convert states.db states count \
'value.replace(",", "")'
</code></pre>
<p>The <code>convert</code> command takes four arguments: the database file, the name of the table, the name of the column and a string containing a fragment of Python code that defines the conversion to be applied.</p>
<p><img alt="Animated demo using sqlite-utils convert to strip out commas" src="https://static.simonwillison.net/static/2021/sqlite-convert-demo.gif" style="max-width:100%;" /></p>
<p>The conversion function can be anything you can express with Python. If you want to import extra modules you can do so using <code>--import module</code> - here's an example that wraps text using the <a href="https://docs.python.org/3/library/textwrap.html">textwrap</a> module from the Python standard library:</p>
<pre><code>sqlite-utils convert content.db articles content \
'"\n".join(textwrap.wrap(value, 100))' \
--import=textwrap
</code></pre>
<p>You can consider this analogous to using <code>Array.map()</code> in JavaScript, or running a transformation using a list comprehension in Python.</p>
<h4>Custom functions in SQLite</h4>
<p>Under the hood, the tool takes advantage of a powerful SQLite feature: the ability to <a href="https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function">register custom functions</a> written in Python (or other languages) and call them from SQL.</p>
<p>The text wrapping example above works by executing the following SQL:</p>
<div class="highlight highlight-source-sql"><pre><span class="pl-k">update</span> articles <span class="pl-k">set</span> content <span class="pl-k">=</span> convert_value(content)</pre></div>
<p><code>convert_value(value)</code> is a custom SQL function, compiled as Python code and then made available to the database connection.</p>
<p>The equivalent code using just the Python standard library would look like this:</p>
<pre><span class="pl-k">import</span> <span class="pl-s1">sqlite3</span>
<span class="pl-k">import</span> <span class="pl-s1">textwrap</span>
<span class="pl-k">def</span> <span class="pl-en">convert_value</span>(<span class="pl-s1">value</span>):
<span class="pl-k">return</span> <span class="pl-s">"<span class="pl-cce">\n</span>"</span>.<span class="pl-en">join</span>(<span class="pl-s1">textwrap</span>.<span class="pl-en">wrap</span>(<span class="pl-s1">value</span>, <span class="pl-c1">100</span>))
<span class="pl-s1">conn</span> <span class="pl-c1">=</span> <span class="pl-s1">sqlite3</span>.<span class="pl-en">connect</span>(<span class="pl-s">"content.db"</span>)
<span class="pl-s1">conn</span>.<span class="pl-en">create_function</span>(<span class="pl-s">"convert_value"</span>, <span class="pl-c1">1</span>, <span class="pl-s1">convert_value</span>)
<span class="pl-s1">conn</span>.<span class="pl-en">execute</span>(<span class="pl-s">"update articles set content = convert_value(content)"</span>)
<span class="pl-s1">conn</span>.<span class="pl-en">commit</span>()  <span class="pl-c"># Without this, the update is rolled back when the connection closes</span></pre>
<p><code>sqlite-utils convert</code> works by <a href="https://github.com/simonw/sqlite-utils/blob/cc90745f4e8bb1ac57d8ee973863cfe00c2e4fe5/sqlite_utils/cli.py#L2019-L2028">compiling the code argument</a> to a Python function, registering it with the connection and executing the above SQL query.</p>
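<p>A rough sketch of that mechanism (my own simplification, not the actual <code>sqlite-utils</code> code, which also handles multi-line bodies with explicit <code>return</code> statements): wrap the code fragment in a function definition, <code>exec</code> it, and register the resulting function with the connection.</p>

```python
import sqlite3

def compile_code(fragment):
    # Treat a bare expression as the body of a one-argument function
    namespace = {}
    exec("def fn(value):\n    return %s" % fragment, namespace)
    return namespace["fn"]

fn = compile_code('value.replace(",", "")')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (name TEXT, count TEXT)")
conn.execute("INSERT INTO states VALUES ('California', '3,044,502')")

# Register the compiled function and apply it via SQL
conn.create_function("convert_value", 1, fn)
conn.execute("UPDATE states SET count = convert_value(count)")
result = conn.execute("SELECT count FROM states").fetchone()[0]
print(result)  # → 3044502
```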
<h4>Splitting columns into multiple other columns</h4>
<p>Sometimes when I'm working with a table I find myself wanting to split a column into multiple other columns.</p>
<p>A classic example is locations - if a <code>location</code> column contains <code>latitude,longitude</code> values I'll often want to split that into separate <code>latitude</code> and <code>longitude</code> columns, so I can visualize the data with <a href="https://datasette.io/plugins/datasette-cluster-map">datasette-cluster-map</a>.</p>
<p>The <code>--multi</code> option lets you do that using <code>sqlite-utils convert</code>:</p>
<pre><code>sqlite-utils convert data.db places location '
latitude, longitude = value.split(",")
return {
"latitude": float(latitude),
"longitude": float(longitude),
}' --multi
</code></pre>
<p><code>--multi</code> tells the command to expect the Python code to return dictionaries. It will then create new columns in the database corresponding to the keys in those dictionaries and populate them using the results of the transformation.</p>
<p>If the <code>places</code> table started with just a <code>location</code> column, after running the above command the new table schema will look like this:</p>
<div class="highlight highlight-source-sql"><pre>CREATE TABLE [places] (
[location] <span class="pl-k">TEXT</span>,
[latitude] FLOAT,
[longitude] FLOAT
);</pre></div>
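<p>A hand-rolled equivalent of that <code>--multi</code> conversion (my approximation, not the real implementation) can be built with the standard library: since custom SQL functions can only return scalar values, the conversion returns its dictionary serialized as JSON, and <code>json_extract()</code> pulls the individual keys back out.</p>

```python
import json
import sqlite3

def convert(value):
    # Return the dictionary as a JSON string, since SQL functions
    # can only return scalars
    latitude, longitude = value.split(",")
    return json.dumps({"latitude": float(latitude), "longitude": float(longitude)})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (location TEXT)")
conn.execute("INSERT INTO places VALUES ('37.7749,-122.4194')")
conn.create_function("convert_value", 1, convert)

# Add the new columns, then populate them from the converted JSON
conn.execute("ALTER TABLE places ADD COLUMN latitude FLOAT")
conn.execute("ALTER TABLE places ADD COLUMN longitude FLOAT")
conn.execute(
    "UPDATE places SET "
    "latitude = json_extract(convert_value(location), '$.latitude'), "
    "longitude = json_extract(convert_value(location), '$.longitude')"
)
row = conn.execute("SELECT latitude, longitude FROM places").fetchone()
print(row)  # → (37.7749, -122.4194)
```

This assumes a SQLite build with the JSON functions available, which is the case for modern Python distributions.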
<h4>Common recipes</h4>
<p>This new feature in <code>sqlite-utils</code> actually started life as a separate tool entirely, called <a href="https://github.com/simonw/sqlite-transform">sqlite-transform</a>.</p>
<p>Part of the rationale for adding it to <code>sqlite-utils</code> was to avoid confusion between what that tool did and the <a href="https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/">sqlite-utils transform</a> tool, which does something completely different (it applies table transformations that aren't possible using SQLite's default <code>ALTER TABLE</code> statement). Somewhere along the line I messed up the naming of the two tools!</p>
<p><code>sqlite-transform</code> bundles a number of useful <a href="https://github.com/simonw/sqlite-transform/blob/main/README.md#parsedate-and-parsedatetime">default transformation recipes</a>, in addition to allowing arbitrary Python code. I ended up making these available in <code>sqlite-utils convert</code> by exposing them as functions that can be called from the command-line code argument like so:</p>
<pre><code>sqlite-utils convert my.db articles created_at \
'r.parsedate(value)'
</code></pre>
<p>Implementing them as Python functions in this way meant I didn't need to invent a new command-line mechanism for passing in additional options to the individual recipes - instead, parameters are passed like this:</p>
<pre><code>sqlite-utils convert my.db articles created_at \
'r.parsedate(value, dayfirst=True)'
</code></pre>
<h4>Also available in the sqlite_utils Python library</h4>
<p>Almost every feature that is exposed by the <a href="https://sqlite-utils.datasette.io/en/stable/cli.html">sqlite-utils command-line tool</a> has a matching API in the <a href="https://sqlite-utils.datasette.io/en/stable/python-api.html">sqlite_utils Python library</a>. <code>convert</code> is no exception.</p>
<p>The Python API lets you perform operations like the following:</p>
<pre><span class="pl-s1">db</span> <span class="pl-c1">=</span> <span class="pl-s1">sqlite_utils</span>.<span class="pl-v">Database</span>(<span class="pl-s">"dogs.db"</span>)
<span class="pl-s1">db</span>[<span class="pl-s">"dogs"</span>].<span class="pl-en">convert</span>(<span class="pl-s">"name"</span>, <span class="pl-k">lambda</span> <span class="pl-s1">value</span>: <span class="pl-s1">value</span>.<span class="pl-en">upper</span>())</pre>
<p>Any Python callable can be passed to <code>convert</code>, and it will be applied to every value in the specified column - again, like using <code>map()</code> to apply a transformation to every item in an array.</p>
<p>You can also use the Python API to perform more complex operations like the following two examples:</p>
<pre><span class="pl-c"># Convert title to upper case only for rows with id > 20</span>
<span class="pl-s1">table</span>.<span class="pl-en">convert</span>(
<span class="pl-s">"title"</span>,
<span class="pl-k">lambda</span> <span class="pl-s1">v</span>: <span class="pl-s1">v</span>.<span class="pl-en">upper</span>(),
<span class="pl-s1">where</span><span class="pl-c1">=</span><span class="pl-s">"id > :id"</span>,
<span class="pl-s1">where_args</span><span class="pl-c1">=</span>{<span class="pl-s">"id"</span>: <span class="pl-c1">20</span>}
)
<span class="pl-c"># Create two new columns, "upper" and "lower",</span>
<span class="pl-c"># and populate them from the converted title</span>
<span class="pl-s1">table</span>.<span class="pl-en">convert</span>(
<span class="pl-s">"title"</span>,
<span class="pl-k">lambda</span> <span class="pl-s1">v</span>: {
<span class="pl-s">"upper"</span>: <span class="pl-s1">v</span>.<span class="pl-en">upper</span>(),
<span class="pl-s">"lower"</span>: <span class="pl-s1">v</span>.<span class="pl-en">lower</span>()
}, <span class="pl-s1">multi</span><span class="pl-c1">=</span><span class="pl-c1">True</span>
)</pre>
<p>See the <a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#converting-data-in-columns">full documentation for table.convert()</a> for more options.</p>
<h4 id="blog-performance">A more sophisticated example: analyzing log files</h4>
<p>I used the new <code>sqlite-utils convert</code> command earlier today, to debug a performance issue with my blog.</p>
<p>Most of my blog traffic is served via Cloudflare with a 15 minute cache timeout - but occasionally I'll hit an uncached page, and those had started to feel not quite as snappy as I would expect.</p>
<p>So I dipped into the Heroku dashboard, and saw this pretty sad looking graph:</p>
<p><img alt="Performance graph showing 95th percentile of 17s and max of 23s" src="https://static.simonwillison.net/static/2021/sad-performance.png" style="max-width:100%;" /></p>
<p>Somehow my 50th percentile was nearly 10 seconds, and my maximum page response time was 23 seconds! Something was clearly very wrong.</p>
<p>I use NGINX as part of my Heroku setup to buffer responses (see <a href="https://simonwillison.net/2017/Oct/2/nginx-heroku/">Running gunicorn behind nginx on Heroku for buffering and logging</a>), and I have custom NGINX configuration to write to the Heroku logs - mainly to work around a limitation in Heroku's default logging where it fails to record full user-agents or referrer headers.</p>
<p>I extended that configuration to record the NGINX <code>request_time</code>, <code>upstream_response_time</code>, <code>upstream_connect_time</code> and <code>upstream_header_time</code> variables, which I hoped would help me figure out what was going on.</p>
<p>After <a href="https://github.com/simonw/simonwillisonblog/commit/dd0faaa64c0e361ae1d760894e201cac7b0224a4">applying that change</a> I started seeing Heroku log lines that looked like this:</p>
<p><code>2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&page=2&tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";</code></p>
<p>Next step: analyze those log lines.</p>
<p>I ran this command for a few minutes to gather some logs:</p>
<p><code>heroku logs -a simonwillisonblog --tail | grep 'measure#nginx.service' > /tmp/log.txt</code></p>
<p>Having collected 488 log lines, the next step was to load them into SQLite.</p>
<p>The <code>sqlite-utils insert</code> command likes to work with JSON, but I just had raw log lines. I used <code>jq</code> to convert each line into a <code>{"line": "raw log line"}</code> JSON object, then piped that as newline-delimited JSON into <code>sqlite-utils insert</code>:</p>
<pre><code>cat /tmp/log.txt | \
jq --raw-input '{line: .}' --compact-output | \
sqlite-utils insert /tmp/logs.db log - --nl
</code></pre>
<p><code>jq --raw-input</code> accepts input that is just raw lines of text, not yet valid JSON. <code>'{line: .}'</code> is a tiny <code>jq</code> program that builds <code>{"line": "raw input"}</code> objects. <code>--compact-output</code> causes <code>jq</code> to output newline-delimited JSON.</p>
<p>Then <code>sqlite-utils insert /tmp/logs.db log - --nl</code> reads that newline-delimited JSON into a new SQLite <code>log</code> table in a <code>logs.db</code> database file (<a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-newline-delimited-json">full documentation here</a>).</p>
<p><em><strong>Update 6th January 2022:</strong> <a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-20">sqlite-utils 3.20</a> introduced a new <code>sqlite-utils insert ... --lines</code> option for importing raw lines, so you can now achieve this without using <code>jq</code> at all. See
<a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-unstructured-data-with-lines-and-text">Inserting unstructured data with --lines and --text</a> for details.</em></p>
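<p>For completeness, the <code>jq</code> wrapping step can also be written in a few lines of standard-library Python (an alternative sketch, not what the post used):</p>

```python
import json

# Wrap each raw line in a {"line": ...} object, one JSON object per line
raw_lines = [
    "first raw log line",
    "second raw log line",
]
ndjson = "\n".join(json.dumps({"line": line}) for line in raw_lines)
print(ndjson)
```

Piping that output into <code>sqlite-utils insert ... --nl</code> would have the same effect as the <code>jq</code> pipeline.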
<p>Now I had a SQLite table with a single column, <code>line</code>. Next step: parse that nasty log format.</p>
<p>To my surprise I couldn't find an existing Python library for parsing <code>key=value key2="quoted value"</code> log lines. Instead I had to figure out a regular expression:</p>
<pre><code>([^\s=]+)=(?:"(.*?)"|(\S+))
</code></pre>
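<p>Exercising that regular expression against a shortened, made-up log line shows how the two capture groups handle quoted and unquoted values:</p>

```python
import re

pattern = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')

line = 'measure#nginx.service=4.212 request="GET / HTTP/1.1" status_code=404'
pairs = {}
for key, quoted, unquoted in pattern.findall(line):
    # Each match fills in either the quoted group or the unquoted group
    pairs[key] = quoted or unquoted
print(pairs)
```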
<p>Here's that expression visualized using <a href="https://www.debuggex.com/">Debuggex</a>:</p>
<p><img alt="Screenshot of the regex visualized with debuggex" src="https://static.simonwillison.net/static/2021/debuggex-log-parser-regex.png" style="max-width:100%;" /></p>
<p>I used that regular expression as part of a custom function passed in to the <code>sqlite-utils convert</code> tool:</p>
<pre><code>sqlite-utils convert /tmp/logs.db log line --import re --multi "$(cat <<EOD
r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
pairs = {}
for key, value1, value2 in r.findall(value):
pairs[key] = value1 or value2
return pairs
EOD
)"
</code></pre>
<p>(This uses a <code>cat <<EOD</code> trick to avoid having to figure out how to escape the single and double quotes in the Python code for usage in a zsh shell command.)</p>
<p>Using <code>--multi</code> here created new columns for each of the key/value pairs seen in that log file.</p>
<p>One last step: convert the types. The new columns are all of type <code>text</code> but I want to do sorting and arithmetic on them so I need to convert them to integers and floats. I used <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#transforming-tables">sqlite-utils transform</a> for that:</p>
<pre><code>sqlite-utils transform /tmp/logs.db log \
--type 'measure#nginx.service' float \
--type 'status_code' integer \
--type 'body_bytes_sent' integer \
--type 'request_time' float \
--type 'upstream_response_time' float \
--type 'upstream_connect_time' float \
--type 'upstream_header_time' float
</code></pre>
<p>Here's the <a href="https://simonwillison-net-performance-logs.vercel.app/logs/log">resulting log table</a> (published using <a href="https://datasette.io/plugins/datasette-publish-vercel">datasette-publish-vercel</a>).</p>
<p><img alt="Datasette showing the log table" src="https://static.simonwillison.net/static/2021/performance-logs.png" style="max-width:100%;" /></p>
<p>Once the logs were in Datasette, the problem quickly became apparent when I <a href="https://simonwillison-net-performance-logs.vercel.app/logs/log?_sort_desc=request_time">sorted by request_time</a>: an army of search engine crawlers were hitting deep linked filters in <a href="https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/">my faceted search engine</a>, like <code>/search/?tag=geolocation&tag=offlineresources&tag=canvas&tag=javascript&tag=performance&tag=dragndrop&tag=crossdomain&tag=mozilla&tag=video&tag=tracemonkey&year=2009&type=blogmark</code>. These are expensive pages to generate! They're also very unlikely to be in my Cloudflare cache.</p>
<p>Could the answer be as simple as a <code>robots.txt</code> rule blocking access to <code>/search/</code>?</p>
<p>I <a href="https://github.com/simonw/simonwillisonblog/commit/4c0de5b9f01bb16fc89c587128a276055b0033bb">shipped that change</a> and waited a few hours to see what the impact would be:</p>
<p><img alt="Heroku metrics showing a dramatic improvement after the deploy, and especially about 8 hours later" src="https://static.simonwillison.net/static/2021/robots-txt-effect.png" style="max-width:100%;" /></p>
<p>It took a while for the crawlers to notice that my <code>robots.txt</code> had changed, but by 8 hours later my site performance was dramatically improved - I'm now seeing 99th percentile of around 450ms, compared to 25 seconds before I shipped the <code>robots.txt</code> change!</p>
<p>With this latest addition, <a href="https://sqlite-utils.datasette.io/">sqlite-utils</a> has evolved into a powerful tool for importing, cleaning and re-shaping data - especially when coupled with Datasette in order to explore, analyze and publish the results.</p>
<h4>TIL this week</h4>
<ul>
<li><a href="https://til.simonwillison.net/vscode/vs-code-regular-expressions">Search and replace with regular expressions in VS Code</a></li>
<li><a href="https://til.simonwillison.net/python/codespell">Check spelling using codespell</a></li>
<li><a href="https://til.simonwillison.net/imagemagick/set-a-gif-to-loop">Set a GIF to loop using ImageMagick</a></li>
<li><a href="https://til.simonwillison.net/sqlite/sqlite-aggregate-filter-clauses">SQLite aggregate filter clauses</a></li>
<li><a href="https://til.simonwillison.net/imagemagick/compress-animated-gif">Compressing an animated GIF with ImageMagick mogrify</a></li>
</ul>
<h4>Releases this week</h4>
<ul>
<li>
<strong><a href="https://github.com/simonw/sqlite-transform">sqlite-transform</a></strong>: <a href="https://github.com/simonw/sqlite-transform/releases/tag/1.2.1">1.2.1</a> - (<a href="https://github.com/simonw/sqlite-transform/releases">10 releases total</a>) - 2021-08-02
<br />Tool for running transformations on columns in a SQLite database</li>
<li>
<strong><a href="https://github.com/simonw/sqlite-utils">sqlite-utils</a></strong>: <a href="https://github.com/simonw/sqlite-utils/releases/tag/3.14">3.14</a> - (<a href="https://github.com/simonw/sqlite-utils/releases">82 releases total</a>) - 2021-08-02
<br />Python CLI utility and library for manipulating SQLite databases</li>
<li>
<strong><a href="https://github.com/simonw/datasette-json-html">datasette-json-html</a></strong>: <a href="https://github.com/simonw/datasette-json-html/releases/tag/1.0.1">1.0.1</a> - (<a href="https://github.com/simonw/datasette-json-html/releases">6 releases total</a>) - 2021-07-31
<br />Datasette plugin for rendering HTML based on JSON values</li>
<li>
<strong><a href="https://github.com/simonw/datasette-publish-fly">datasette-publish-fly</a></strong>: <a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.0.2">1.0.2</a> - (<a href="https://github.com/simonw/datasette-publish-fly/releases">5 releases total</a>) - 2021-07-30
<br />Datasette plugin for publishing data using Fly</li>
</ul>
Joining CSV and JSON data with an in-memory SQLite database2021-06-19T22:55:57+00:002021-06-19T22:55:57+00:00https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/#atom-series
<p>The new <code>sqlite-utils memory</code> command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.</p>
<h4>sqlite-utils memory</h4>
<p>The new feature is part of <a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-10">sqlite-utils 3.10</a>, which I released this morning. You can install it using <code>brew install sqlite-utils</code> or <code>pip install sqlite-utils</code>.</p>
<p>I've recorded <a href="https://www.youtube.com/watch?v=OUjd0rkc678">this video</a> demonstrating the new feature - with full accompanying notes below.</p>
<iframe style="max-width: 100%" width="560" height="315" src="https://www.youtube.com/embed/OUjd0rkc678" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"> </iframe>
<p><code>sqlite-utils</code> already offers a mechanism for importing CSV and JSON data into a SQLite database file, in the form of the <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-inserting-data">sqlite-utils insert</a> command. Processing data with this involves two steps: first import it into a <code>temp.db</code> file, then use <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#running-sql-queries">sqlite-utils query</a> to run queries and output the results.</p>
<p>Using SQL to re-shape data is really useful - since <code>sqlite-utils</code> can output in multiple different formats, I frequently find myself loading in a CSV file and exporting it back out as JSON, or vice-versa.</p>
<p>This week I realized that I had most of the pieces in place to reduce this to a single step. The new <code>sqlite-utils memory</code> command (<a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-memory">full documentation here</a>) operates against a temporary, in-memory SQLite database. It can import data, execute SQL and output the result in a one-liner, without needing any temporary database files along the way.</p>
<p>Here's an example. My <a href="https://github.com/dogsheep">Dogsheep</a> GitHub organization has a number of repositories. GitHub make those available via an authentication-optional API endpoint at <a href="https://api.github.com/users/dogsheep/repos">https://api.github.com/users/dogsheep/repos</a> - which returns JSON that looks like this (simplified):</p>
<div class="highlight highlight-source-json"><pre>[
{
<span class="pl-s"><span class="pl-pds">"</span>id<span class="pl-pds">"</span></span>: <span class="pl-c1">197431109</span>,
<span class="pl-s"><span class="pl-pds">"</span>name<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>dogsheep-beta<span class="pl-pds">"</span></span>,
<span class="pl-s"><span class="pl-pds">"</span>full_name<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>dogsheep/dogsheep-beta<span class="pl-pds">"</span></span>,
<span class="pl-s"><span class="pl-pds">"</span>size<span class="pl-pds">"</span></span>: <span class="pl-c1">61</span>,
<span class="pl-s"><span class="pl-pds">"</span>stargazers_count<span class="pl-pds">"</span></span>: <span class="pl-c1">79</span>,
<span class="pl-s"><span class="pl-pds">"</span>watchers_count<span class="pl-pds">"</span></span>: <span class="pl-c1">79</span>,
<span class="pl-s"><span class="pl-pds">"</span>forks<span class="pl-pds">"</span></span>: <span class="pl-c1">0</span>,
<span class="pl-s"><span class="pl-pds">"</span>open_issues<span class="pl-pds">"</span></span>: <span class="pl-c1">11</span>
},
{
<span class="pl-s"><span class="pl-pds">"</span>id<span class="pl-pds">"</span></span>: <span class="pl-c1">256834907</span>,
<span class="pl-s"><span class="pl-pds">"</span>name<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>dogsheep-photos<span class="pl-pds">"</span></span>,
<span class="pl-s"><span class="pl-pds">"</span>full_name<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>dogsheep/dogsheep-photos<span class="pl-pds">"</span></span>,
<span class="pl-s"><span class="pl-pds">"</span>size<span class="pl-pds">"</span></span>: <span class="pl-c1">64</span>,
<span class="pl-s"><span class="pl-pds">"</span>stargazers_count<span class="pl-pds">"</span></span>: <span class="pl-c1">116</span>,
<span class="pl-s"><span class="pl-pds">"</span>watchers_count<span class="pl-pds">"</span></span>: <span class="pl-c1">116</span>,
<span class="pl-s"><span class="pl-pds">"</span>forks<span class="pl-pds">"</span></span>: <span class="pl-c1">5</span>,
<span class="pl-s"><span class="pl-pds">"</span>open_issues<span class="pl-pds">"</span></span>: <span class="pl-c1">18</span>
}
]</pre></div>
<p>With <code>sqlite-utils memory</code> we can see the 3 most popular repos by number of stars like this:</p>
<pre><code>$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t
full_name forks_count stars
-------------------------- ------------- -------
dogsheep/twitter-to-sqlite 12 225
dogsheep/github-to-sqlite 14 139
dogsheep/dogsheep-photos 5 116
</code></pre>
<p>We're using <code>curl</code> to fetch the JSON and pipe it into <code>sqlite-utils memory</code> - the <code>-</code> means "read from standard input". Then we pass the following SQL query:</p>
<div class="highlight highlight-source-sql"><pre><span class="pl-k">select</span> full_name, forks_count, stargazers_count <span class="pl-k">as</span> stars
<span class="pl-k">from</span> stdin <span class="pl-k">order by</span> stars <span class="pl-k">desc</span> <span class="pl-k">limit</span> <span class="pl-c1">3</span></pre></div>
<p><code>stdin</code> is the temporary table created for the data piped into the tool. The query selects three of the JSON properties, renames <code>stargazers_count</code> to <code>stars</code>, sorts by stars and returns the first three.</p>
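<p>The mechanics can be approximated with Python's standard library: load the JSON records into an in-memory SQLite table called <code>stdin</code>, then run the query against it. This is a simplified sketch of the idea, not the actual <code>sqlite-utils memory</code> implementation, and the records here are trimmed stand-ins for the GitHub API response:</p>

```python
import sqlite3

# Trimmed stand-ins for the JSON records fetched with curl
records = [
    {"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stargazers_count": 225},
    {"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stargazers_count": 116},
    {"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stargazers_count": 139},
]

db = sqlite3.connect(":memory:")
db.execute(
    "create table stdin (full_name text, forks_count integer, stargazers_count integer)"
)
db.executemany(
    "insert into stdin values (:full_name, :forks_count, :stargazers_count)",
    records,
)

# The same query the CLI example runs against the stdin table
rows = db.execute("""
    select full_name, forks_count, stargazers_count as stars
    from stdin order by stars desc limit 3
""").fetchall()
# rows[0] is the most-starred repo
```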
<p>The <code>-t</code> option here means "output as a formatted table" - without that option we get JSON:</p>
<pre><code>$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
'
[{"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stars": 225},
{"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stars": 139},
{"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stars": 116}]
</code></pre>
<p>Or we can use <code>--csv</code> to get back CSV:</p>
<pre><code>$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' --csv
full_name,forks_count,stars
dogsheep/twitter-to-sqlite,12,225
dogsheep/github-to-sqlite,14,139
dogsheep/dogsheep-photos,5,116
</code></pre>
<p>The <code>-t</code> option supports a number of different formats, specified using <code>--fmt</code>. If I wanted to generate a LaTeX table of the top repos by stars I could do this:</p>
<pre><code>$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t --fmt=latex
\begin{tabular}{lrr}
\hline
full\_name & forks\_count & stars \\
\hline
dogsheep/twitter-to-sqlite & 12 & 225 \\
dogsheep/github-to-sqlite & 14 & 139 \\
dogsheep/dogsheep-photos & 5 & 116 \\
\hline
\end{tabular}
</code></pre>
<p>We can run aggregate queries too - let's add up the total size and total number of stars across all of those repositories:</p>
<pre><code>$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select sum(size), sum(stargazers_count) from stdin
' -t
sum(size) sum(stargazers_count)
----------- -----------------------
843 934
</code></pre>
<p>(I believe size here is measured in kilobytes: the GitHub API documentation isn't clear on this point.)</p>
<h4 id="joining-across-different-files">Joining across different files</h4>
<p>All of these examples have worked with JSON data piped into the tool - but you can also pass one or more files, of different formats, in a way that lets you execute joins against them.</p>
<p>As an example, let's combine two sources of data.</p>
<p>The New York Times publish a <a href="https://github.com/nytimes/covid-19-data/blob/master/us-states.csv">us-states.csv</a> file with Covid cases and deaths by state over time.</p>
<p>The CDC have an <a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data">undocumented JSON endpoint</a> (which I've been <a href="https://github.com/simonw/cdc-vaccination-history">archiving here</a>) tracking the progress of vaccination across different states.</p>
<p>We're going to run a join from that CSV data to that JSON data, and output a table of results.</p>
<p>First, we need to download the files. The <a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data">CDC JSON data</a> isn't quite in the right shape for our purposes:</p>
<div class="highlight highlight-source-json"><pre>{
<span class="pl-s"><span class="pl-pds">"</span>runid<span class="pl-pds">"</span></span>: <span class="pl-c1">2023</span>,
<span class="pl-s"><span class="pl-pds">"</span>vaccination_data<span class="pl-pds">"</span></span>: [
{
<span class="pl-s"><span class="pl-pds">"</span>Date<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>2021-06-19<span class="pl-pds">"</span></span>,
<span class="pl-s"><span class="pl-pds">"</span>Location<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>US<span class="pl-pds">"</span></span>,
<span class="pl-s"><span class="pl-pds">"</span>ShortName<span class="pl-pds">"</span></span>: <span class="pl-s"><span class="pl-pds">"</span>USA<span class="pl-pds">"</span></span>,
<span class="pl-s">...</span></pre></div>
<p><code>sqlite-utils</code> expects a flat JSON array of objects - we can use <a href="https://stedolan.github.io/jq/">jq</a> to re-shape the data like so:</p>
<pre><code>$ curl https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data \
| jq .vaccination_data > vaccination_data.json
</code></pre>
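<p>If <code>jq</code> isn't to hand, the same reshaping takes a few lines of Python. This is just an illustrative equivalent of the <code>jq .vaccination_data</code> one-liner, using a trimmed stand-in for the CDC response:</p>

```python
import json

# A trimmed stand-in for the CDC JSON shown above
raw = '{"runid": 2023, "vaccination_data": [{"Date": "2021-06-19", "Location": "US", "ShortName": "USA"}]}'

# Pull out the nested array so sqlite-utils sees a flat JSON list of objects
flat = json.loads(raw)["vaccination_data"]
print(json.dumps(flat))
```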
<p>The New York Times data is good as is:</p>
<pre><code>$ wget 'https://github.com/nytimes/covid-19-data/raw/master/us-states.csv'
</code></pre>
<p>Now that we have the data locally, we can run a join to combine it using the following command:</p>
<pre><code>$ sqlite-utils memory us-states.csv vaccination_data.json "
select
max(t1.date),
t1.state,
t1.cases,
t1.deaths,
t2.Census2019,
t2.Dist_Per_100K
from
t1
join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
group by
t1.state
order by
Dist_Per_100K desc
" -t
max(t1.date) state cases deaths Census2019 Dist_Per_100K
-------------- ------------------------ ------- -------- ------------ ---------------
2021-06-18 District of Columbia 49243 1141 705749 149248
2021-06-18 Vermont 24360 256 623989 146257
2021-06-18 Rhode Island 152383 2724 1059361 141291
2021-06-18 Massachusetts 709263 17960 6892503 139692
2021-06-18 Maryland 461852 9703 6045680 138193
2021-06-18 Maine 68753 854 1344212 136894
2021-06-18 Hawaii 35903 507 1415872 136024
...
</code></pre>
<p>I'm using automatically created numeric aliases <code>t1</code> and <code>t2</code> for the files here, but I can also use their full table names <code>"us-states"</code> (quotes needed due to the hyphen) and <code>vaccination_data</code> instead.</p>
<p>The <code>replace()</code> operation there is needed because the <code>vaccination_data.json</code> file calls New York "New York State" while the <code>us-states.csv</code> file just calls it "New York".</p>
<p>The combination of <code>max(t1.date)</code> and <code>group by t1.state</code> is <a href="http://www.sqlite.org/draft/lang_select.html#bareagg">a useful SQLite</a> trick: if you perform a <code>group by</code> and then ask for the <code>max()</code> of a value, the other columns returned from that table will come from the row that contains that maximum value.</p>
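<p>That bare-column behaviour is easy to demonstrate with a toy table. A sketch using Python's <code>sqlite3</code> module, with made-up numbers:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table readings (state text, date text, cases integer)")
db.executemany("insert into readings values (?, ?, ?)", [
    ("Vermont", "2021-06-17", 24000),
    ("Vermont", "2021-06-18", 24360),  # the latest row for Vermont
    ("Maine", "2021-06-18", 68753),
])

# With exactly one max() aggregate plus group by, SQLite guarantees the
# other bare columns come from the row holding that maximum value
rows = db.execute(
    "select max(date), state, cases from readings group by state"
).fetchall()
latest = {state: cases for _, state, cases in rows}
# latest maps each state to the case count from its most recent row
```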
<p>This demo is a bit of a stretch - once I reach this level of complexity I'm more likely to load the files into a SQLite database file on disk and open them up in <a href="https://datasette.io/">Datasette</a> - but it's a fun example of a more complex join in action.</p>
<h4>Also in sqlite-utils 3.10</h4>
<p>The <code>sqlite-utils memory</code> command has another new trick up its sleeve: it automatically detects which columns in a CSV or TSV file contain integer or float values and creates the corresponding in-memory SQLite table with the correct types. This ensures <code>max()</code> and <code>sum()</code> and <code>order by</code> work in a predictable manner, without accidentally sorting <code>9</code> as higher than <code>10</code>.</p>
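<p>The detection idea is conceptually simple: for each column, check whether every value parses as an integer, failing that as a float, and fall back to text. A rough sketch of that logic (not the actual sqlite-utils implementation):</p>

```python
def detect_type(values):
    # Try the narrowest type first - every value must parse for it to win
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except ValueError:
            continue
        return candidate
    return str

# CSV readers hand back every value as a string
column = ["1", "11", "2"]
kind = detect_type(column)
# Sorting with the detected type avoids "11" sorting before "2"
ordered = sorted(column, key=kind)
```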
<p>I didn't want to break backwards compatibility for existing users of the <code>sqlite-utils insert</code> command so I've added type detection there as a new option, <code>--detect-types</code> or <code>-d</code> for short:</p>
<pre><code>$ sqlite-utils insert my.db us_states us-states.csv --csv -d
[####################################] 100%
$ sqlite-utils schema my.db
CREATE TABLE "us_states" (
[date] TEXT,
[state] TEXT,
[fips] INTEGER,
[cases] INTEGER,
[deaths] INTEGER
);
</code></pre>
<p>There's more <a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-10">in the changelog</a>.</p>
<h4>Releases this week</h4>
<ul>
<li>
<strong><a href="https://github.com/simonw/sqlite-utils">sqlite-utils</a></strong>: <a href="https://github.com/simonw/sqlite-utils/releases/tag/3.10">3.10</a> - (<a href="https://github.com/simonw/sqlite-utils/releases">78 releases total</a>) - 2021-06-19
<br />Python CLI utility and library for manipulating SQLite databases</li>
<li>
<strong><a href="https://github.com/dogsheep/dogsheep-beta">dogsheep-beta</a></strong>: <a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.10.2">0.10.2</a> - (<a href="https://github.com/dogsheep/dogsheep-beta/releases">20 releases total</a>) - 2021-06-13
<br />Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette</li>
<li>
<strong><a href="https://github.com/simonw/yaml-to-sqlite">yaml-to-sqlite</a></strong>: <a href="https://github.com/simonw/yaml-to-sqlite/releases/tag/1.0">1.0</a> - (<a href="https://github.com/simonw/yaml-to-sqlite/releases">5 releases total</a>) - 2021-06-13
<br />Utility for converting YAML files to SQLite</li>
<li>
<strong><a href="https://github.com/simonw/markdown-to-sqlite">markdown-to-sqlite</a></strong>: <a href="https://github.com/simonw/markdown-to-sqlite/releases/tag/1.0">1.0</a> - (<a href="https://github.com/simonw/markdown-to-sqlite/releases">2 releases total</a>) - 2021-06-13
<br />CLI tool for loading markdown files into a SQLite database</li>
</ul>
<h4>TIL this week</h4>
<ul>
<li><a href="https://til.simonwillison.net/vim/mouse-support-in-vim">Mouse support in vim</a></li>
</ul>
Refactoring databases with sqlite-utils extract2020-09-23T16:02:08+00:002020-09-23T16:02:08+00:00https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/#atom-series
<p>Yesterday <a href="https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/">I described</a> the new <a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html#transforming-tables">sqlite-utils transform</a> mechanism for applying SQLite table transformations that go beyond those supported by <code>ALTER TABLE</code>. The other new feature in <a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-20">sqlite-utils 2.20</a> builds on that capability to allow you to refactor a database table by extracting columns into separate tables. I’ve called it <a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html#cli-extract">sqlite-utils extract</a>.</p>
<h4>The problem</h4>
<p>Much of the data I work with in Datasette starts off as a CSV file published by an organization or government. Since CSV files aren’t relational databases, they are often denormalized. It’s particularly common to see something like this:</p>
<table>
<thead>
<tr>
<th>Organization Group Code</th>
<th>Organization Group</th>
<th>Department Code</th>
<th>Department</th>
<th>Union Code</th>
<th>Union</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Public Protection</td>
<td>POL</td>
<td>Police</td>
<td>911</td>
<td>POA</td>
</tr>
<tr>
<td>4</td>
<td>Community Health</td>
<td>DPH</td>
<td>Public Health</td>
<td>250</td>
<td>SEIU, Local 1021, Misc</td>
</tr>
<tr>
<td>1</td>
<td>Public Protection</td>
<td>FIR</td>
<td>Fire Department</td>
<td>798</td>
<td>Firefighters,Local 798, Unit 1</td>
</tr>
<tr>
<td>1</td>
<td>Public Protection</td>
<td>POL</td>
<td>Police</td>
<td>911</td>
<td>POA</td>
</tr>
</tbody>
</table>
<p>This is an extract from the <a href="https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd">San Francisco Employee Compensation</a> dataset from DataSF.</p>
<p>The <code>sqlite-utils extract</code> command-line tool, and the <code>table.extract()</code> Python method that underlies it, can be used to extract these duplicated column pairs out into separate tables with foreign key relationships from the main table.</p>
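<p>Under the hood this is classic normalization: build a lookup table of the distinct <code>(code, name)</code> pairs, then point each row in the main table at its lookup row via a new foreign key column. A stripped-down sketch of the idea using plain <code>sqlite3</code> with invented column names - the real <code>table.extract()</code> additionally rebuilds the table to drop the original columns and add the <code>FOREIGN KEY</code> constraint:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table salaries (org_code text, org_name text, total real)")
db.executemany("insert into salaries values (?, ?, ?)", [
    ("1", "Public Protection", 100.0),
    ("4", "Community Health", 200.0),
    ("1", "Public Protection", 150.0),  # duplicated pair
])

# Lookup table with one row per distinct (code, name) pair
db.execute("""
    create table organization_groups (
        id integer primary key, code text, name text, unique (code, name)
    )
""")
db.execute("""
    insert into organization_groups (code, name)
    select distinct org_code, org_name from salaries
""")

# Add the foreign key column and populate it via a correlated subquery
db.execute("alter table salaries add column organization_group_id integer")
db.execute("""
    update salaries set organization_group_id = (
        select id from organization_groups
        where code = salaries.org_code and name = salaries.org_name
    )
""")
```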
<h4>How to refactor that data</h4>
<p>Here's how to use <code>sqlite-utils</code> to clean up and refactor that compensation data.</p>
<p>First, grab the data. It's a 150MB CSV file containing over 600,000 rows:</p>
<pre><code>curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'
</code></pre>
<p>Use <code>sqlite-utils insert</code> to load that into a SQLite database:</p>
<pre><code>sqlite-utils insert salaries.db salaries salaries.csv --csv
</code></pre>
<p>Fire up Datasette to check that the data looks right:</p>
<pre><code>datasette salaries.db
</code></pre>
<p>There's a catch here: the schema for the generated table (shown at the bottom of <code>http://localhost:8001/salaries/salaries</code>) reveals that because we imported from CSV every column is a text column. Since some of this data is numeric we should convert to numbers, so we can do things like sort the table by the highest salary.</p>
<p>We can do that using <code>sqlite-utils transform</code>:</p>
<pre><code>sqlite-utils transform salaries.db salaries \
--type 'Employee Identifier' integer \
--type Salaries float \
--type Overtime float \
--type 'Other Salaries' float \
--type 'Total Salary' float \
--type 'Retirement' float \
--type 'Health and Dental' float \
--type 'Other Benefits' float \
--type 'Total Benefits' float \
--type 'Total Compensation' float
</code></pre>
<p>[<strong>Update 13 August 2021</strong>: As of <a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-10">sqlite-utils 3.10</a> you can instead run <code>sqlite-utils insert salaries.db salaries salaries.csv --csv --detect-types</code>. The <code>--detect-types</code> (or <code>-d</code>) option will detect types for you during the initial import.]</p>
<p>Having run that command, here's the new database schema:</p>
<pre><code>$ sqlite3 salaries.db '.schema salaries'
CREATE TABLE IF NOT EXISTS "salaries" (
[rowid] INTEGER PRIMARY KEY,
[Year Type] TEXT,
[Year] TEXT,
[Organization Group Code] TEXT,
[Organization Group] TEXT,
[Department Code] TEXT,
[Department] TEXT,
[Union Code] TEXT,
[Union] TEXT,
[Job Family Code] TEXT,
[Job Family] TEXT,
[Job Code] TEXT,
[Job] TEXT,
[Employee Identifier] INTEGER,
[Salaries] FLOAT,
[Overtime] FLOAT,
[Other Salaries] FLOAT,
[Total Salary] FLOAT,
[Retirement] FLOAT,
[Health and Dental] FLOAT,
[Other Benefits] FLOAT,
[Total Benefits] FLOAT,
[Total Compensation] FLOAT
);
</code></pre>
<p>Now we can start extracting those columns. We do this using several rounds of the <code>sqlite-utils extract</code> command, one for each duplicated pair.</p>
<p>For <code>Organization Group Code</code> and <code>Organization Group</code>:</p>
<pre><code>sqlite-utils extract salaries.db salaries \
'Organization Group Code' 'Organization Group' \
--table 'organization_groups' \
--fk-column 'organization_group_id' \
--rename 'Organization Group Code' code \
--rename 'Organization Group' name
</code></pre>
<p>This took about 12 minutes on my laptop, displaying a progress bar as it ran. (UPDATE: <a href="https://github.com/simonw/sqlite-utils/issues/172">in issue #172</a> I improved the performance and knocked it down to just 4 seconds. I also removed the progress bar.)</p>
<p>Here's the refactored database schema:</p>
<pre><code>$ sqlite3 salaries.db .schema
CREATE TABLE [organization_groups] (
[id] INTEGER PRIMARY KEY,
[code] TEXT,
[name] TEXT
);
CREATE TABLE IF NOT EXISTS "salaries" (
[rowid] INTEGER PRIMARY KEY,
[Year Type] TEXT,
[Year] TEXT,
[organization_group_id] INTEGER,
[Department Code] TEXT,
[Department] TEXT,
[Union Code] TEXT,
[Union] TEXT,
[Job Family Code] TEXT,
[Job Family] TEXT,
[Job Code] TEXT,
[Job] TEXT,
[Employee Identifier] INTEGER,
[Salaries] FLOAT,
[Overtime] FLOAT,
[Other Salaries] FLOAT,
[Total Salary] FLOAT,
[Retirement] FLOAT,
[Health and Dental] FLOAT,
[Other Benefits] FLOAT,
[Total Benefits] FLOAT,
[Total Compensation] FLOAT,
FOREIGN KEY(organization_group_id) REFERENCES organization_groups(id)
);
CREATE UNIQUE INDEX [idx_organization_groups_code_name]
ON [organization_groups] ([code], [name]);
</code></pre>
<p>Now fire up Datasette to confirm it had the desired effect:</p>
<pre><code>datasette salaries.db
</code></pre>
<p>Here's what that looks like:</p>
<p><img src="https://static.simonwillison.net/static/2020/refactored-one-column.png" alt="Screenshot of the first few columns of the table, showing links displayed in the new organization_group_id column" data-canonical-src="https://static.simonwillison.net/static/2020/refactored-one-column.png" style="max-width:100%;" /></p>
<p>Note that the new <code>organization_group_id</code> column still shows the name of the organization group, because Datasette automatically de-references foreign key relationships when it displays a table and uses any column called <code>name</code> (or <code>title</code> or <code>value</code>) as the label for a link to the record.</p>
<p>Let's extract the other columns. This will take a while:</p>
<pre><code>sqlite-utils extract salaries.db salaries \
'Department Code' 'Department' \
--table 'departments' \
--fk-column 'department_id' \
--rename 'Department Code' code \
--rename 'Department' name
sqlite-utils extract salaries.db salaries \
'Union Code' 'Union' \
--table 'unions' \
--fk-column 'union_id' \
--rename 'Union Code' code \
--rename 'Union' name
sqlite-utils extract salaries.db salaries \
'Job Family Code' 'Job Family' \
--table 'job_families' \
--fk-column 'job_family_id' \
--rename 'Job Family Code' code \
--rename 'Job Family' name
sqlite-utils extract salaries.db salaries \
'Job Code' 'Job' \
--table 'jobs' \
--fk-column 'job_id' \
--rename 'Job Code' code \
--rename 'Job' name
</code></pre>
<p>Our finished schema looks like this:</p>
<pre><code>$ sqlite3 salaries.db .schema
CREATE TABLE [organization_groups] (
[id] INTEGER PRIMARY KEY,
[code] TEXT,
[name] TEXT
);
CREATE TABLE [departments] (
[id] INTEGER PRIMARY KEY,
[code] TEXT,
[name] TEXT
);
CREATE TABLE [unions] (
[id] INTEGER PRIMARY KEY,
[code] TEXT,
[name] TEXT
);
CREATE TABLE [job_families] (
[id] INTEGER PRIMARY KEY,
[code] TEXT,
[name] TEXT
);
CREATE TABLE [jobs] (
[id] INTEGER PRIMARY KEY,
[code] TEXT,
[name] TEXT
);
CREATE TABLE IF NOT EXISTS "salaries" (
[rowid] INTEGER PRIMARY KEY,
[Year Type] TEXT,
[Year] TEXT,
[organization_group_id] INTEGER REFERENCES [organization_groups]([id]),
[department_id] INTEGER REFERENCES [departments]([id]),
[union_id] INTEGER REFERENCES [unions]([id]),
[job_family_id] INTEGER REFERENCES [job_families]([id]),
[job_id] INTEGER,
[Employee Identifier] INTEGER,
[Salaries] FLOAT,
[Overtime] FLOAT,
[Other Salaries] FLOAT,
[Total Salary] FLOAT,
[Retirement] FLOAT,
[Health and Dental] FLOAT,
[Other Benefits] FLOAT,
[Total Benefits] FLOAT,
[Total Compensation] FLOAT,
FOREIGN KEY(job_id) REFERENCES jobs(id)
);
CREATE UNIQUE INDEX [idx_organization_groups_code_name]
ON [organization_groups] ([code], [name]);
CREATE UNIQUE INDEX [idx_departments_code_name]
ON [departments] ([code], [name]);
CREATE UNIQUE INDEX [idx_unions_code_name]
ON [unions] ([code], [name]);
CREATE UNIQUE INDEX [idx_job_families_code_name]
ON [job_families] ([code], [name]);
CREATE UNIQUE INDEX [idx_jobs_code_name]
ON [jobs] ([code], [name]);
</code></pre>
<p>We've also shrunk our database file quite a bit. Before the transformations <code>salaries.db</code> was 159MB. It's now just 70MB - that's less than half the size!</p>
<p>I used <code>datasette publish cloudrun</code> to publish a copy of my final database here:</p>
<p><a href="https://sf-employee-compensation.datasettes.com/salaries/salaries">https://sf-employee-compensation.datasettes.com/salaries/salaries</a></p>
<p>Here's the command I used to publish it:</p>
<pre><code>datasette publish cloudrun salaries.db \
--service sf-employee-compensation \
--title "San Francisco Employee Compensation (as-of 21 Sep 2020)" \
--source "DataSF" \
--source_url "https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd" \
--about "About this project" \
--about_url "https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/" \
--install datasette-block-robots \
--install datasette-vega \
--install datasette-copyable \
--install datasette-graphql
</code></pre>
<h4>Bonus: explore salaries with GraphQL</h4>
<p>You may have noticed my <code>datasette publish</code> line above finished with the line <code>--install datasette-graphql</code>. This installs the <a href="https://github.com/simonw/datasette-graphql">datasette-graphql</a> plugin as part of the deployment to Cloud Run. Which means we can query the salary data using GraphQL as an alternative to SQL!</p>
<p>Here's a GraphQL query that shows the ten highest paid employees, including their various expanded foreign key references:</p>
<div class="highlight highlight-source-graphql"><pre>{
<span class="pl-v">salaries</span>(<span class="pl-v">sort_desc</span>:<span class="pl-c1"> Total_Compensation</span>, <span class="pl-v">first</span>: <span class="pl-c1">10</span>) {
<span class="pl-v">nodes</span> {
<span class="pl-v">Year_Type</span>
<span class="pl-v">Year</span>
<span class="pl-v">union_id</span> {
<span class="pl-v">id</span>
<span class="pl-v">name</span>
}
<span class="pl-v">job_id</span> {
<span class="pl-v">id</span>
<span class="pl-v">name</span>
}
<span class="pl-v">job_family_id</span> {
<span class="pl-v">id</span>
<span class="pl-v">name</span>
}
<span class="pl-v">department_id</span> {
<span class="pl-v">id</span>
<span class="pl-v">name</span>
}
<span class="pl-v">organization_group_id</span> {
<span class="pl-v">id</span>
<span class="pl-v">name</span>
}
<span class="pl-v">Salaries</span>
<span class="pl-v">Overtime</span>
<span class="pl-v">Other_Salaries</span>
<span class="pl-v">Total_Salary</span>
<span class="pl-v">Retirement</span>
<span class="pl-v">Health_and_Dental</span>
<span class="pl-v">Other_Benefits</span>
<span class="pl-v">Total_Benefits</span>
<span class="pl-v">Total_Compensation</span>
<span class="pl-v">rowid</span>
<span class="pl-v">Employee_Identifier</span>
}
}
}</pre></div>
<p>You can <a href="https://sf-employee-compensation.datasettes.com/graphql?query=%7B%0A%20%20salaries(sort_desc%3A%20Total_Compensation%2C%20first%3A%2010)%20%7B%0A%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20Year_Type%0A%20%20%20%20%20%20Year%0A%20%20%20%20%20%20union_id%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20job_id%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20job_family_id%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20department_id%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20organization_group_id%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20Salaries%0A%20%20%20%20%20%20Overtime%0A%20%20%20%20%20%20Other_Salaries%0A%20%20%20%20%20%20Total_Salary%0A%20%20%20%20%20%20Retirement%0A%20%20%20%20%20%20Health_and_Dental%0A%20%20%20%20%20%20Other_Benefits%0A%20%20%20%20%20%20Total_Benefits%0A%20%20%20%20%20%20Total_Compensation%0A%20%20%20%20%20%20rowid%0A%20%20%20%20%20%20Employee_Identifier%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D%0A">try that query out here</a> in the GraphiQL API explorer.</p>
Executing advanced ALTER TABLE operations in SQLite2020-09-23T01:00:35+00:002020-09-23T01:00:35+00:00https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/#atom-series
<p>SQLite's ALTER TABLE has some significant limitations: it can't drop columns (UPDATE: that was fixed in <a href="https://www.sqlite.org/changes.html#version_3_35_0">SQLite 3.35.0</a> in March 2021), it can't alter NOT NULL status, it can't change column types. Since I spend a lot of time with SQLite these days I've written some code to fix this - both from Python and as a command-line utility.</p>
<p>To SQLite's credit, not only are these limitations <a href="https://www.sqlite.org/lang_altertable.html">well explained</a> in the documentation but the explanation is accompanied by <a href="https://www.sqlite.org/lang_altertable.html#making_other_kinds_of_table_schema_changes">a detailed description</a> of the recommended workaround. The short version looks something like this:</p>
<ul>
<li>Start a transaction</li>
<li>Create a new temporary table with the exact shape you would like</li>
<li>Copy all of your old data across using <code>INSERT INTO temp_table SELECT * FROM old_table</code></li>
<li>Drop the old table</li>
<li>Rename the temp table to the old table</li>
<li>Commit the transaction</li>
</ul>
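<p>Those steps translate directly into SQL. Here's a minimal sketch of the workaround using Python's <code>sqlite3</code> module, changing a column's type by rebuilding the table - the table and column names are invented for the example:</p>

```python
import sqlite3

# isolation_level=None gives autocommit mode, so we can manage the
# transaction explicitly with begin/commit
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("create table people (id integer primary key, name text, age text)")
db.execute("insert into people values (1, 'Cleo', '5')")

db.execute("begin")  # 1. start a transaction
# 2. create a new table with the exact shape we want (age is now integer)
db.execute("create table people_new (id integer primary key, name text, age integer)")
# 3. copy all of the old data across
db.execute("insert into people_new select id, name, age from people")
# 4. drop the old table
db.execute("drop table people")
# 5. rename the new table into place
db.execute("alter table people_new rename to people")
db.execute("commit")  # 6. commit the transaction
```

<p>SQLite's type affinity converts the text value <code>'5'</code> to the integer <code>5</code> as it is copied into the new column.</p>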
<p>My <a href="https://sqlite-utils.readthedocs.io/">sqlite-utils</a> tool and Python library aims to make working with SQLite as convenient as possible. So I set out to build a utility method for performing this kind of large scale table transformation. I've called it <code>table.transform(...)</code>.</p>
<p>Here are some simple examples of what it can do, lifted from <a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html#python-api-transform">the documentation</a>:</p>
<pre><span class="pl-c"># Convert the 'age' column to an integer, and 'weight' to a float</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">types</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>: <span class="pl-s1">int</span>, <span class="pl-s">"weight"</span>: <span class="pl-s1">float</span>})
<span class="pl-c"># Rename the 'age' column to 'initial_age':</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">rename</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>: <span class="pl-s">"initial_age"</span>})
<span class="pl-c"># Drop the 'age' column:</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">drop</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>})
<span class="pl-c"># Make `user_id` the new primary key</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">pk</span><span class="pl-c1">=</span><span class="pl-s">"user_id"</span>)
<span class="pl-c"># Make the 'age' and 'weight' columns NOT NULL</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">not_null</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>, <span class="pl-s">"weight"</span>})
<span class="pl-c"># Make age allow NULL and switch weight to being NOT NULL:</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">not_null</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>: <span class="pl-c1">False</span>, <span class="pl-s">"weight"</span>: <span class="pl-c1">True</span>})
<span class="pl-c"># Set default age to 1:</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">defaults</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>: <span class="pl-c1">1</span>})
<span class="pl-c"># Now remove the default from that column:</span>
<span class="pl-s1">table</span>.<span class="pl-en">transform</span>(<span class="pl-s1">defaults</span><span class="pl-c1">=</span>{<span class="pl-s">"age"</span>: <span class="pl-c1">None</span>})</pre>
<p>Each time the <code>table.transform(...)</code> method runs it will create a brand new table, copy the data across and then drop the old table. You can combine multiple operations together in a single call, avoiding copying the table multiple times.</p>
<p>The <code>table.transform_sql(...)</code> method returns the SQL that would be executed instead of executing it directly, useful if you want to handle even more complex requirements.</p>
<h4 id="sqlite-utils-transform-cli">The "sqlite-utils transform" command-line tool</h4>
<p>Almost every feature in <code>sqlite-utils</code> is available in both the Python library and as a command-line utility, and <code>.transform()</code> is no exception. The <code>sqlite-utils transform</code> command can be used to apply complex table transformations directly from the command-line.</p>
<p>Here's an example, starting with the <a href="https://latest.datasette.io/fixtures">fixtures.db</a> database that powers Datasette's unit tests:</p>
<pre><code>$ wget https://static.simonwillison.net/static/2020/fixtures.db
$ sqlite3 fixtures.db '.schema facetable'
CREATE TABLE facetable (
pk integer primary key,
created text,
planet_int integer,
on_earth integer,
state text,
city_id integer,
neighborhood text,
tags text,
complex_array text,
distinct_some_null,
FOREIGN KEY ("city_id") REFERENCES [facet_cities](id)
);
$ sqlite-utils transform fixtures.db facetable \
--type on_earth text \
--drop complex_array \
--drop state \
--rename tags the_tags
$ sqlite3 fixtures.db '.schema facetable'
CREATE TABLE IF NOT EXISTS "facetable" (
[pk] INTEGER PRIMARY KEY,
[created] TEXT,
[planet_int] INTEGER,
[on_earth] TEXT,
[city_id] INTEGER REFERENCES [facet_cities]([id]),
[neighborhood] TEXT,
[the_tags] TEXT,
[distinct_some_null] TEXT
);
</code></pre>
<p>You can use the <code>--sql</code> option to see the SQL that would be executed without actually running it:</p>
<pre><code>$ wget https://latest.datasette.io/fixtures.db
$ sqlite-utils transform fixtures.db facetable \
--type on_earth text \
--drop complex_array \
--drop state \
--rename tags the_tags \
--sql
CREATE TABLE [facetable_new_442f07e26eef] (
[pk] INTEGER PRIMARY KEY,
[created] TEXT,
[planet_int] INTEGER,
[on_earth] TEXT,
[city_id] INTEGER REFERENCES [facet_cities]([id]),
[neighborhood] TEXT,
[the_tags] TEXT,
[distinct_some_null] TEXT
);
INSERT INTO [facetable_new_442f07e26eef] ([pk], [created], [planet_int], [on_earth], [city_id], [neighborhood], [the_tags], [distinct_some_null])
SELECT [pk], [created], [planet_int], [on_earth], [city_id], [neighborhood], [tags], [distinct_some_null] FROM [facetable];
DROP TABLE [facetable];
ALTER TABLE [facetable_new_442f07e26eef] RENAME TO [facetable];
</code></pre>
<h4>Plenty more tricks</h4>
<p><code>sqlite-utils</code> has plenty more tricks up its sleeve. I suggest spending some time browsing the <a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html">Python library reference</a> and the <a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html">sqlite-utils CLI documentation</a>, or taking a look through the <a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html">release notes</a>.</p>
Fun with binary data and SQLite2020-07-30T23:22:17+00:002020-07-30T23:22:17+00:00https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/#atom-series
<p>This week I've been mainly experimenting with binary data storage in SQLite. <a href="https://sqlite-utils.datasette.io/">sqlite-utils</a> can now insert data from binary files, and <a href="https://datasette.io/plugins/datasette-media">datasette-media</a> can serve content over HTTP that originated as binary BLOBs in a database file.</p>
<p>Paul Ford piqued my interest in this when he tweeted about loading thousands of PDF documents into a SQLite database:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I made a shell script that loads thousands of PDF docs into SQLite databases. That means I can have a web server that produces infinite randomly chosen US Military PowerPoint slides that I can scroll on my phone when I'm around the house. <a href="https://t.co/n3zUlbCuBB">pic.twitter.com/n3zUlbCuBB</a></p>— Paul Ford (@ftrain) <a href="https://twitter.com/ftrain/status/1287183861785473027?ref_src=twsrc%5Etfw">July 26, 2020</a></blockquote>
<p>The SQLite documentation claims that serving smaller binary files from BLOB columns can be <a href="https://www.sqlite.org/fasterthanfs.html">35% faster than the filesystem</a>. I've done a little bit of work with binary files in SQLite - the <a href="https://github.com/simonw/datasette-render-binary">datasette-render-binary</a> and <a href="https://github.com/simonw/datasette-render-images">datasette-render-images</a> plugins both help display BLOB data - but I'd never really dug into it in much detail.</p>
<h4>sqlite-utils insert-files</h4>
<p>The first step was to make it easier to build database files that include binary data.</p>
<p><a href="https://sqlite-utils.readthedocs.io/">sqlite-utils</a> is my combination Python library and CLI tool for building SQLite databases. I've been steadily evolving it for a couple of years now, and it's the engine behind my <a href="https://dogsheep.github.io/">Dogsheep</a> collection of tools for personal analytics.</p>
<p>The new <a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html#inserting-binary-data-from-files">insert-files command</a> can be used to insert content from binary files into a SQLite database, along with file metadata.</p>
<p>The most basic usage looks like this:</p>
<pre><code>sqlite-utils insert-files gifs.db images *.gif</code></pre>
<p>By default, this creates a database table like so:</p>
<pre><code>CREATE TABLE [images] (
[path] TEXT PRIMARY KEY,
[content] BLOB,
[size] INTEGER
);</code></pre>
<p>You can customize this table to include other file metadata using the <code>-c</code> (short for <code>--column</code>) option:</p>
<pre><code>sqlite-utils insert-files gifs.db images *.gif \
-c path -c md5 -c last_modified:mtime -c size --pk=path</code></pre>
<p>This creates a table with the following schema:</p>
<pre><code>CREATE TABLE [images] (
[path] TEXT PRIMARY KEY,
[md5] TEXT,
[last_modified] FLOAT,
[size] INTEGER
);</code></pre>
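<p>Those metadata columns are straightforward to derive with the standard library. Here's an illustrative sketch of how a record like the ones above could be computed and inserted - this is my own simplification, not the actual <code>insert-files</code> implementation:</p>

```python
import hashlib
import os
import sqlite3
import tempfile

def file_record(path):
    """Compute the same metadata columns as above: path, md5, mtime, size."""
    with open(path, "rb") as f:
        content = f.read()
    info = os.stat(path)
    return {
        "path": path,
        "md5": hashlib.md5(content).hexdigest(),
        "last_modified": info.st_mtime,
        "size": info.st_size,
    }

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE [images] ([path] TEXT PRIMARY KEY, [md5] TEXT, "
    "[last_modified] FLOAT, [size] INTEGER)"
)
# Write a tiny GIF header to a temporary file, then record its metadata:
with tempfile.NamedTemporaryFile(suffix=".gif", delete=False) as f:
    f.write(b"GIF89a")
record = file_record(f.name)
conn.execute(
    "INSERT INTO images VALUES (:path, :md5, :last_modified, :size)", record
)
os.unlink(f.name)
```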
<p>If you pass a directory instead of a file name the command will recursively add every file in that directory.</p>
<p>I also improved <code>sqlite-utils</code> with respect to outputting binary data. The new <code>--raw</code> option dumps the binary contents of a column directly to standard out, so you can read an image back out of one of the above tables like so:</p>
<pre><code>sqlite-utils photos.db \
"select content from images where path=:path" \
-p path 'myphoto.jpg' \
--raw > myphoto.jpg</code></pre>
<p>This example also demonstrates the new support for <code>:parameters</code> passed using the new <code>-p</code> option, see <a href="https://github.com/simonw/sqlite-utils/issues/124">#124</a>.</p>
<p><code>sqlite-utils</code> usually communicates using JSON, but JSON doesn't have the ability to represent binary values. Datasette outputs binary values <a href="https://datasette-render-images-demo.datasette.io/favicons/favicons.json?_shape=array&_size=1">like so</a>:</p>
<pre><code>"data": {
"$base64": true,
"encoded": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAY..."
}</code></pre>
<p>I added support for the same format to <code>sqlite-utils</code> - so you can now query binary columns and get out that nested object, or pipe JSON with that nested structure into <code>sqlite-utils insert</code> and have it stored as a binary BLOB in the database.</p>
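<p>The <code>$base64</code> wrapper is easy to produce and consume with the standard library. A minimal sketch of the round trip - the helper names here are mine, not part of <code>sqlite-utils</code>:</p>

```python
import base64
import sqlite3

def encode_blob(value):
    """Wrap bytes in the {"$base64": true, "encoded": ...} JSON convention."""
    return {"$base64": True, "encoded": base64.b64encode(value).decode("ascii")}

def decode_blob(obj):
    """Recover the original bytes from the wrapper object."""
    if isinstance(obj, dict) and obj.get("$base64"):
        return base64.b64decode(obj["encoded"])
    raise ValueError("not a $base64 wrapper")

# Round-trip a binary value through SQLite and the JSON representation:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE [images] ([path] TEXT PRIMARY KEY, [content] BLOB)")
png_header = b"\x89PNG\r\n\x1a\n"
conn.execute("INSERT INTO images VALUES (?, ?)", ("logo.png", png_header))
(blob,) = conn.execute("SELECT content FROM images").fetchone()
wrapper = encode_blob(blob)
```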
<h4>datasette-media</h4>
<p><a href="https://github.com/simonw/datasette-media">datasette-media</a> is a plugin for serving binary content directly from Datasette on a special URL. I originally built it while working on <a href="https://github.com/dogsheep/dogsheep-photos">Dogsheep Photos</a> - given a SQLite file full of Apple Photos metadata I wanted to be able to serve thumbnails of the actual images via my Datasette web server.</p>
<p>Those photos were still stored on disk - the plugin lets you configure a SQL query like this one, which causes hits to <code>/-/media/photos/$UUID</code> to serve the matching file from disk:</p>
<pre><code>{
"plugins": {
"datasette-media": {
"photo": {
"sql": "select filepath from apple_photos where uuid=:key"
}
}
}
}</code></pre>
<p><a href="https://github.com/simonw/datasette-media/issues/14">Issue #14</a> added support for <code>BLOB</code> columns as well. You can now configure the plugin like this to serve binary content that was stored in the database:</p>
<pre><code>{
"plugins": {
"datasette-media": {
"thumb": {
"sql": "select content from thumbnails where uuid=:key"
}
}
}
}</code></pre>
<p>This would serve content from a BLOB column in a <code>thumbnails</code> table from the URL <code>/-/media/thumb/$UUID</code>.</p>
<p>I really like this pattern of configuring plugins using SQL queries, where the returned column names have special meaning that is interpreted by the plugin. <a href="https://github.com/simonw/datasette-atom">datasette-atom</a> and <a href="https://github.com/simonw/datasette-ics">datasette-ics</a> use a similar trick.</p>
<p>I expanded <code>datasette-media</code> with a few other related features:</p>
<ul><li>Return a <code>content_url</code> column and it will proxy content from that URL</li><li>Set <code>"enable_transform": true</code> for a media bucket to enable <code>?w=</code> and <code>?h=</code> and <code>?format=</code> parameters for transforming the image before it is served to the user</li><li>Return a <code>content_filename</code> column to <a href="https://github.com/simonw/datasette-media#setting-a-download-file-name">set a download file name</a> (in a <code>content-disposition</code> HTTP header) prompting the user's browser to download the file</li></ul>
<p>See the <a href="https://github.com/simonw/datasette-media#readme">README</a> or <a href="https://github.com/simonw/datasette-media/releases">release notes</a> for more details.</p>
<h4>Also this week</h4>
<p>I renamed <code>datasette-insert-api</code> to just <a href="https://github.com/simonw/datasette-insert">datasette-insert</a>, reflecting my plans to add non-API features to that plugin in the future.</p>
<p>In doing so I had to figure out how to rename a PyPI package such that dependent projects would continue to work. I ended up building a <a href="https://github.com/simonw/pypi-rename">pypi-rename cookiecutter template</a> encoding what I learned.</p>
<p>I enabled PostgreSQL full-text search for my blog's Django Admin interface, and wrote <a href="https://github.com/simonw/til/blob/master/django/postgresql-full-text-search-admin.md">a TIL on how I did it</a>.</p>
<p>I <a href="https://github.com/simonw/db-to-sqlite/issues/26">added compound primary key support</a> to <code>db-to-sqlite</code>, so now it can convert PostgreSQL or MySQL databases to SQLite if they use compound primary keys.</p>
<h4>TIL this week</h4>
<ul><li><a href="https://github.com/simonw/til/blob/master/javascript/copy-button.md">Implementing a "copy to clipboard" button</a></li><li><a href="https://github.com/simonw/til/blob/master/django/postgresql-full-text-search-admin.md">PostgreSQL full-text search in the Django Admin</a></li><li><a href="https://github.com/simonw/til/blob/master/sqlite/blob-literals.md">SQLite BLOB literals</a></li></ul>
<h4>Releases this week</h4>
<ul><li><a href="https://github.com/simonw/sqlite-utils/releases/tag/2.13">sqlite-utils 2.13</a> - 2020-07-30</li><li><a href="https://github.com/simonw/datasette-media/releases/tag/0.5">datasette-media 0.5</a> - 2020-07-29</li><li><a href="https://github.com/simonw/db-to-sqlite/releases/tag/1.3">db-to-sqlite 1.3</a> - 2020-07-27</li><li><a href="https://github.com/simonw/datasette-media/releases/tag/0.4">datasette-media 0.4</a> - 2020-07-27</li><li><a href="https://github.com/simonw/sqlite-utils/releases/tag/2.12">sqlite-utils 2.12</a> - 2020-07-27</li><li><a href="https://github.com/simonw/datasette-render-images/releases/tag/0.3.1">datasette-render-images 0.3.1</a> - 2020-07-27</li><li><a href="https://github.com/simonw/datasette-render-images/releases/tag/0.3">datasette-render-images 0.3</a> - 2020-07-27</li><li><a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/0.3.1">datasette-auth-passwords 0.3.1</a> - 2020-07-26</li><li><a href="https://github.com/simonw/datasette-insert/releases/tag/0.5">datasette-insert 0.5</a> - 2020-07-25</li></ul>
<h3><a href="https://simonwillison.net/2019/Feb/25/sqlite-utils/">sqlite-utils: a Python library and CLI tool for building SQLite databases</a> - 2019-02-25</h3>
<p><a href="https://github.com/simonw/sqlite-utils">sqlite-utils</a> is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.</p>
<p>It’s part of <a href="https://datasette.readthedocs.io/en/stable/ecosystem.html">the ecosystem of tools</a> I’m building around my <a href="https://datasette.readthedocs.io/">Datasette</a> project.</p>
<p>I spent the weekend adding all kinds of exciting command-line options to it, so I’m ready to describe it to the world.</p>
<h3><a id="A_Python_library_for_quickly_creating_databases_8"></a>A Python library for quickly creating databases</h3>
<p>A core idea behind Datasette is that <a href="https://www.sqlite.org/">SQLite</a> is the ideal format for publishing all kinds of interesting structured data. Datasette takes any SQLite database and adds a browsable web interface, <a href="https://datasette.readthedocs.io/en/stable/json_api.html">a JSON API</a> and the ability to <a href="https://datasette.readthedocs.io/en/stable/csv_export.html">export tables and queries as CSV</a>.</p>
<p>The other half of the equation then is tools to create SQLite databases. <a href="https://github.com/simonw/csvs-to-sqlite">csvs-to-sqlite</a> was my first CLI attempt at this. <code>sqlite-utils</code> takes a much more flexible and comprehensive approach.</p>
<p>I started working on <code>sqlite-utils</code> last year as part of my project to <a href="https://simonwillison.net/2018/Aug/6/russian-facebook-ads/">Analyze US Election Russian Facebook Ads</a>. The initial aim was to build a library that made constructing new SQLite databases inside of a <a href="https://jupyter.org/">Jupyter notebook</a> as productive as possible.</p>
<p>The core idea behind the library is that you can give it a list of Python dictionaries (equivalent to JSON objects) and it will automatically create a SQLite table with the correct schema, then insert those items into the new table.</p>
<p>To illustrate, let’s create a database using <a href="https://data.nasa.gov/resource/y77d-th95.json">this JSON file of meteorite landings</a> released by NASA (discovered via <a href="https://github.com/jdorfman/awesome-json-datasets">awesome-json-datasets</a> curated by Justin Dorfman).</p>
<p>Here’s the quickest way in code to turn that into a database:</p>
<pre><code>import requests
import sqlite_utils
db = sqlite_utils.Database("meteorites.db")
db["meteorites"].insert_all(
requests.get(
"https://data.nasa.gov/resource/y77d-th95.json"
).json(),
pk="id"
)
</code></pre>
<p>These few lines of code create a new SQLite database on disk called <code>meteorites.db</code>, create a table in that file called <code>meteorites</code>, detect the necessary columns based on the incoming data, insert all of the rows and set the <code>id</code> column up as the primary key.</p>
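<p>The schema detection boils down to mapping Python value types onto SQLite column types. Here's a heavily simplified sketch of the idea using only the standard library - the real implementation handles far more (missing keys, nested data, batching, and so on):</p>

```python
import sqlite3

# Map Python types to SQLite column types; anything else falls back to TEXT
TYPE_MAP = {int: "INTEGER", float: "FLOAT", str: "TEXT", bytes: "BLOB"}

def insert_all(conn, table, rows, pk=None):
    """Create `table` with columns inferred from the first row, then insert."""
    first = rows[0]
    cols = []
    for name, value in first.items():
        col_type = TYPE_MAP.get(type(value), "TEXT")
        suffix = " PRIMARY KEY" if name == pk else ""
        cols.append("[%s] %s%s" % (name, col_type, suffix))
    conn.execute("CREATE TABLE [%s] (%s)" % (table, ", ".join(cols)))
    placeholders = ", ".join(":%s" % name for name in first)
    conn.executemany(
        "INSERT INTO [%s] VALUES (%s)" % (table, placeholders), rows
    )

conn = sqlite3.connect(":memory:")
insert_all(conn, "meteorites", [
    {"id": 1, "name": "Aachen", "mass": 21.0},
    {"id": 2, "name": "Aarhus", "mass": 720.0},
], pk="id")
```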
<p>To see the resulting database, run <code>datasette meteorites.db</code> and browse to <code>http://127.0.0.1:8001/</code>.</p>
<p>You can do a <em>lot more</em> with the library. You can create tables, insert and upsert data in bulk, configure foreign key relationships, configure SQLite full-text search and much more. I encourage you to <a href="https://sqlite-utils.readthedocs.io/en/latest/python-api.html">consult the documentation</a> for all of the details.</p>
<h3><a id="The_sqliteutils_commandline_tool_39"></a>The sqlite-utils command-line tool</h3>
<p>This is the new stuff built over the past few days, and I think it’s really fun.</p>
<p>First install the tool <a href="https://pypi.org/project/sqlite-utils/">from PyPI</a>, using <code>pip3 install sqlite-utils</code>.</p>
<p>Let’s start by grabbing a copy of <a href="https://static.simonwillison.net/static/2019/russian-ads.db">the russian-ads.db database</a> I created in <a href="https://simonwillison.net/2018/Aug/6/russian-facebook-ads/">Analyzing US Election Russian Facebook Ads</a> (4MB):</p>
<pre><code>$ cd /tmp
$ wget https://static.simonwillison.net/static/2019/russian-ads.db
</code></pre>
<p>We can see a list of tables in the database and their counts using the <code>tables</code> command:</p>
<pre><code>$ sqlite-utils tables russian-ads.db --counts
[{"table": "ads", "count": 3498},
{"table": "targets", "count": 1665},
{"table": "ad_targets", "count": 36559},
{"table": "ads_fts", "count": 3498},
{"table": "ads_fts_segments", "count": 120},
{"table": "ads_fts_segdir", "count": 1},
{"table": "ads_fts_docsize", "count": 3498},
{"table": "ads_fts_stat", "count": 1}]
</code></pre>
<p>By default, <code>sqlite-utils</code> outputs data as neatly formatted JSON. You can get CSV instead using the <code>--csv</code> option:</p>
<pre><code>$ sqlite-utils tables russian-ads.db --counts --csv
table,count
ads,3498
targets,1665
ad_targets,36559
ads_fts,3498
ads_fts_segments,120
ads_fts_segdir,1
ads_fts_docsize,3498
ads_fts_stat,1
</code></pre>
<p>Or if you want a pretty ASCII-art table, use <code>--table</code> (or the shortcut, <code>-t</code>):</p>
<pre><code>$ sqlite-utils tables russian-ads.db --counts -t
table count
---------------- -------
ads 3498
targets 1665
ad_targets 36559
ads_fts 3498
ads_fts_segments 120
ads_fts_segdir 1
ads_fts_docsize 3498
ads_fts_stat 1
</code></pre>
<p>The table view is built on top of <a href="https://pypi.org/project/tabulate/">tabulate</a>, which offers dozens of table variations. Run <code>sqlite-utils tables --help</code> for the full list - try <code>--table --fmt=rst</code> for output that can be pasted directly into a reStructuredText document (handy for writing documentation).</p>
<p>So far we’ve just looked at a list of tables. Let’s run a SQL query:</p>
<pre><code>$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category"
[{"category": "accessing_facebook_on", "count(*)": 1},
{"category": "age", "count(*)": 82},
{"category": "and_must_also_match", "count(*)": 228},
{"category": "army_reserve_industry", "count(*)": 3},
{"category": "behaviors", "count(*)": 16},
...
</code></pre>
<p>Again, this can be output as CSV using <code>--csv</code>, or a table with <code>--table</code>.</p>
<p>The default JSON output is objects wrapped in an array. Use <code>--arrays</code> to get an array of arrays instead. More interestingly: <code>--nl</code> causes the data to be output as <a href="http://ndjson.org/">newline-delimited JSON</a>, like this:</p>
<pre><code>$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category" --nl
{"category": "accessing_facebook_on", "count(*)": 1}
{"category": "age", "count(*)": 82}
{"category": "and_must_also_match", "count(*)": 228}
{"category": "army_reserve_industry", "count(*)": 3}
{"category": "behaviors", "count(*)": 16}
...
</code></pre>
<p>This is a really interesting format for piping to other tools.</p>
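<p>What makes newline-delimited JSON pipe so well is that each line is a complete JSON document, so a consumer can stream results one row at a time instead of buffering a whole array. A quick sketch of such a consumer - the helper name is hypothetical:</p>

```python
import io
import json

def iter_ndjson(stream):
    """Yield one parsed object per non-blank line of newline-delimited JSON."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulate piped output from a sqlite-utils --nl query like the one above:
piped = io.StringIO(
    '{"category": "age", "count(*)": 82}\n'
    '{"category": "behaviors", "count(*)": 16}\n'
)
rows = list(iter_ndjson(piped))
```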
<h3><a id="Creating_databases_from_JSON_on_the_commandline_115"></a>Creating databases from JSON on the command-line</h3>
<p>The <code>sqlite-utils insert</code> command can be used to create new tables by piping JSON or CSV directly into the tool. It’s the command-line equivalent of the <code>.insert_all()</code> Python function I demonstrated earlier.</p>
<p>Here’s how to create that meteorite database directly from the command-line:</p>
<pre><code>$ curl "https://data.nasa.gov/resource/y77d-th95.json" | \
sqlite-utils insert meteorites.db meteorites - --pk=id
</code></pre>
<p>This will use a SQLite database file called <code>meteorites.db</code> (creating one if it does not yet exist), create or use a table called <code>meteorites</code> and read the data from standard in (hence the pipe). You can pass a filename instead of a <code>-</code> here to read data from a file on disk.</p>
<p>The <code>insert</code> command accepts multiple formats - it defaults to expecting a JSON array of objects, but you can use <code>--nl</code> to accept newline-delimited JSON and <code>--csv</code> to accept CSV.</p>
<p>This means you can combine the tools! Let’s create a brand new database by exporting data from the old one, using newline-delimited JSON as the intermediary format:</p>
<pre><code>$ sqlite-utils russian-ads.db \
"select * from ads where text like '%veterans%'" --nl | \
sqlite-utils insert veterans.db ads - --nl
</code></pre>
<p>This creates a new file called <code>veterans.db</code> containing an <code>ads</code> table with just the ads that mentioned veterans somewhere in their body text.</p>
<p>Since we’re working with JSON, we can introduce other command-line tools into the mix.</p>
<p><a href="https://stedolan.github.io/jq/">jq</a> is a neat little tool for extracting data from a JSON file using its own mini domain-specific language.</p>
<p>The Nobel Prize API offers <a href="https://api.nobelprize.org/v1/laureate.json">a JSON file</a> listing all of the Nobel laureates - but they are contained as an array in a top level <code>"laureates"</code> key. <code>sqlite-utils</code> needs a flat array - so we can use <code>jq</code> to get exactly that:</p>
<pre><code>$ curl "https://api.nobelprize.org/v1/laureate.json" | \
jq ".laureates" | \
sqlite-utils insert nobel.db laureates -
</code></pre>
<p>Now we have a file called <code>nobel.db</code> containing all of the Nobel laureates.</p>
<p>Since Datasette recently <a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-27">grew the ability to export newline-delimited JSON</a>, we can also use this ability to directly consume data from Datasette. Let’s grab <a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-aa93d24/bob-ross%2Felements-by-episode?BEACH=1">every episode of the Joy of Painting in which Bob Ross painted a beach</a>:</p>
<pre><code>$ curl "https://fivethirtyeight.datasettes.com/fivethirtyeight-aa93d24/bob-ross%2Felements-by-episode.json?_facet=BEACH&BEACH=1&_shape=array&_nl=on" \
| sqlite-utils insert bob.db beach_episodes - --nl
</code></pre>
<h3><a id="Plenty_more_features_153"></a>Plenty more features</h3>
<p>As with the Python API, the <code>sqlite-utils</code> CLI tool has dozens of other options and <a href="https://sqlite-utils.readthedocs.io/en/latest/cli.html">extensive documentation</a>.</p>
<p>I’ve been really enjoying growing an <a href="https://datasette.readthedocs.io/en/stable/ecosystem.html">ecosystem of tools around Datasette</a>. <code>sqlite-utils</code> is the keystone here: it’s fundamental to other tools I’m building, such as <a href="https://github.com/simonw/db-to-sqlite">db-to-sqlite</a> (which can export any SQLAlchemy-supported database directly to a SQLite file on disk).</p>
<p>I’ve found myself increasingly turning to SQLite first for all kinds of ad-hoc analysis, and I’m excited to try out these new command-line abilities of <code>sqlite-utils</code> for real-world data spelunking tasks.</p>