Ask questions of SQLite databases and CSV/JSON files in your terminal
25th November 2024
I built a new plugin for my sqlite-utils CLI tool that lets you ask human-language questions directly of SQLite databases and CSV/JSON files on your computer.
It’s called sqlite-utils-ask. Here’s how you install it:
sqlite-utils install sqlite-utils-ask
It picks up API keys from an OPENAI_API_KEY environment variable, or you can install LLM and use llm keys set openai to store a key in a configuration file.
Then you can use it like this:
curl -O https://datasette.io/content.db
sqlite-utils ask content.db "how many sqlite-utils pypi downloads in 2024?"
This command will extract the SQL schema for the provided database file, send that through an LLM along with your question, get back a SQL query and attempt to run it to derive a result.
If all goes well it spits out an answer something like this:
SELECT SUM(downloads)
FROM stats
WHERE package = 'sqlite-utils' AND date >= '2024-01-01' AND date < '2025-01-01';
[
{
"SUM(downloads)": 4300221
}
]
If the SQL query fails to execute (due to a syntax error of some kind) it passes that error back to the model for corrections and retries up to three times before giving up.
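The whole schema-to-SQL-to-retry loop is small enough to sketch. Here’s a rough, hypothetical version of the core logic using the LLM Python API and sqlite-utils; the real plugin differs in its details:

```python
import re

import llm
import sqlite_utils

SYSTEM_PROMPT = (
    "You will be given a SQLite schema followed by a question. Generate a "
    "single SQL query to answer that question. Return that query in a "
    "```sql ... ``` fenced code block."
)

def ask(db_path, question, max_retries=3):
    db = sqlite_utils.Database(db_path)
    model = llm.get_model("gpt-4o-mini")
    # A conversation keeps the schema in context across retries
    conversation = model.conversation()
    prompt = db.schema + "\n\n" + question
    for _ in range(max_retries):
        response = conversation.prompt(prompt, system=SYSTEM_PROMPT)
        # Pull the SQL out of the ```sql ... ``` fenced block, if present
        match = re.search(r"```sql\n(.*?)\n```", response.text(), re.DOTALL)
        sql = match.group(1) if match else response.text()
        try:
            return sql, list(db.query(sql))
        except Exception as ex:
            # Feed the error back to the model and ask it to try again
            prompt = f"That query failed with this error: {ex} - try again"
    raise RuntimeError(f"No working query after {max_retries} attempts")
```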
Add -v/--verbose to see the exact prompt it’s using:
System prompt:
You will be given a SQLite schema followed by a question. Generate a single SQL
query to answer that question. Return that query in a ```sql ... ```
fenced code block.
Example: How many repos are there?
Answer:
```sql
select count(*) from repos
```
Prompt:
...
CREATE TABLE [stats] (
[package] TEXT,
[date] TEXT,
[downloads] INTEGER,
PRIMARY KEY ([package], [date])
);
...
how many sqlite-utils pypi downloads in 2024?
I’ve truncated the above to just the relevant table—it actually includes the full schema of every table in that database.
By default, the tool sends just that database schema and your question to the LLM. If you add the -e/--examples option it will also include five common values for each text column in that schema with an average length of less than 32 characters. This can sometimes help get a better result: sending the values “CA”, “FL” and “TX” for a state column, for example, can tip the model off that it should use state abbreviations rather than full names in its queries.
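To illustrate how that could work, here’s a hypothetical helper that pulls the most common values for a column using two SQL queries; the function name and thresholds here are my own, not the plugin’s actual code:

```python
import sqlite_utils

def common_values(db, table, column, limit=5, max_average_length=32):
    # Skip columns whose values average more than max_average_length characters
    avg = next(db.query(
        f"select avg(length([{column}])) as avg_len from [{table}]"
    ))["avg_len"]
    if avg is None or avg >= max_average_length:
        return []
    # Return the most frequent values, most common first
    return [
        row[column] for row in db.query(
            f"select [{column}], count(*) as n from [{table}] "
            f"group by [{column}] order by n desc limit {limit}"
        )
    ]

# e.g. common_values(sqlite_utils.Database("content.db"), "stats", "package")
```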
Asking questions of CSV and JSON data
The core sqlite-utils CLI usually works against SQLite files directly, but three years ago I added the ability to run SQL queries against CSV and JSON files as well, with the sqlite-utils memory command. It works by loading that data into an in-memory SQLite database before executing a SQL query.
I decided to reuse that mechanism to enable LLM prompts against CSV and JSON data directly as well.
The sqlite-utils ask-files command looks like this:
sqlite-utils ask-files transactions.csv "total sales by year"
This command accepts one or more files, and you can provide a mix of CSV, TSV and JSON. Each provided file will be imported into a different table, allowing the model to construct join queries where necessary.
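Under the hood this is the same trick as sqlite-utils memory. A minimal sketch of the loading step using the sqlite-utils Python API, simplified to handle just CSV (the real command accepts TSV and JSON too):

```python
import csv
import pathlib

import sqlite_utils

def load_files(paths):
    # Import each file into its own table in an in-memory SQLite database
    db = sqlite_utils.Database(memory=True)
    for path in map(pathlib.Path, paths):
        with path.open(newline="") as fp:
            rows = list(csv.DictReader(fp))
        # Name the table after the file: transactions.csv -> transactions
        db[path.stem].insert_all(rows)
    return db
```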
Implementation notes
The core of the plugin is implemented as around 250 lines of Python, using the sqlite-utils register_commands() plugin hook to add the ask and ask-files commands.
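If you haven’t seen the sqlite-utils plugin system before, registering a new command looks roughly like this (the command body is elided):

```python
import click
import sqlite_utils

@sqlite_utils.hookimpl
def register_commands(cli):
    # cli is the sqlite-utils Click command group
    @cli.command()
    @click.argument("path")
    @click.argument("question")
    def ask(path, question):
        "Ask a human-language question of a SQLite database"
        ...  # build the prompt, call the model, execute the SQL
```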
It adds LLM as a dependency, and takes advantage of LLM’s Python API to abstract over the details of talking to the models. This means sqlite-utils-ask can use any of the models supported by LLM or its plugins. If you want to run your prompt through Claude 3.5 Sonnet you can do this:
sqlite-utils install llm-claude-3
sqlite-utils ask content.db "count rows in news table" -m claude-3.5-sonnet
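That model swap works because LLM’s Python API resolves models by name, regardless of which plugin provides them. A minimal illustration, with a placeholder system prompt:

```python
import llm

# Any model registered by an installed LLM plugin can be fetched by name
model = llm.get_model("claude-3.5-sonnet")
response = model.prompt(
    "count rows in news table",
    system="You will be given a SQLite schema followed by a question...",
)
print(response.text())
```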
I initially set the plugin’s default model to gpt-4o-mini to take advantage of that model’s automatic prompt caching: if you run multiple questions against the same schema you’ll end up sending the same lengthy prompt prefix multiple times, and OpenAI’s prompt caching should automatically kick in and provide a 50% discount on those input tokens.
Then I ran the actual numbers and found that gpt-4o-mini is cheap enough that, even without caching, a 4,000 token prompt (that’s a pretty large SQL schema) should cost less than a tenth of a cent. So those caching savings aren’t worth anything at all!
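The arithmetic behind that, assuming gpt-4o-mini’s late-2024 price of around $0.15 per million input tokens (an assumption worth re-checking against current pricing):

```python
# Assumes gpt-4o-mini pricing of ~$0.15 per million input tokens (late 2024)
prompt_tokens = 4_000
cost_dollars = prompt_tokens * 0.15 / 1_000_000
print(f"${cost_dollars:.4f} per prompt")  # $0.0006, about six hundredths of a cent
```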