Simon Willison’s Weblog

232 items tagged “sqlite”

2024

How does SQLite store data? Michal Pitr explores the design of the SQLite on-disk file format, as part of building an educational implementation of SQLite from scratch in Go. # 17th March 2024, 6:47 pm
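As a small taste of that file format, here is a minimal sketch that reads two fields from the 100-byte header at the start of a database file: the page size (offset 16) and the database size in pages (offset 28). The helper names and the `notes` table are made up for illustration and are not taken from Michal's implementation.

```python
import os
import sqlite3
import struct
import tempfile

MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 database file


def read_header(path):
    """Return (page_size, page_count) parsed from the 100-byte file header."""
    with open(path, "rb") as f:
        header = f.read(100)
    if header[:16] != MAGIC:
        raise ValueError("not a SQLite 3 database file")
    # Offset 16: page size as a big-endian 2-byte integer; the value 1 means 65536.
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:
        page_size = 65536
    # Offset 28: size of the database in pages, big-endian 4-byte integer.
    (page_count,) = struct.unpack(">I", header[28:32])
    return page_size, page_count


def make_example_db():
    """Create a small throwaway database (hypothetical schema) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute("create table notes (id integer primary key, body text)")
    conn.execute("insert into notes (body) values ('hello')")
    conn.commit()
    conn.close()
    return path


if __name__ == "__main__":
    db_path = make_example_db()
    print(read_header(db_path))
    os.remove(db_path)
```

For a freshly written database in the default rollback-journal mode, multiplying the two values should give the size of the file on disk.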

Astro DB. A new scale-to-zero hosted SQLite offering, described as “A fully-managed SQL database designed exclusively for Astro”. It’s built on top of LibSQL, the SQLite fork maintained by the Turso database team.

Astro DB encourages defining your tables with TypeScript, and querying them via the Drizzle ORM.

Running Astro locally uses a local SQLite database. Deploying to Astro Cloud switches to their DB product, where the free tier currently includes 1GB of storage, one billion row reads per month and one million row writes per month.

Astro itself is a “web framework for content-driven websites”—so hosted SQLite is a bit of an unexpected product from them, though it does broadly fit the ecosystem they are building.

This approach reminds me of how Deno K/V works—another local SQLite storage solution that offers a proprietary cloud hosted option for deployment. # 12th March 2024, 6:02 pm

Datasette 1.0a10. The only changes in this alpha release concern the way Datasette handles database transactions. The database.execute_write_fn() internal method used to leave functions to implement transactions on their own—it now defaults to wrapping them in a transaction unless they opt out with the new transaction=False parameter.

In implementing this I found several places inside Datasette—in particular parts of the JSON write API—which had not been handling transactions correctly. Those are all now fixed. # 18th February 2024, 5:10 am
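To illustrate the general pattern (this is not Datasette's code), here is a rough sketch using only the standard library. `run_write_fn` is a hypothetical helper that wraps a write function in a transaction by default and lets the caller opt out with `transaction=False`; the `docs` table is also invented.

```python
import sqlite3


def run_write_fn(conn, fn, transaction=True):
    """Call fn(conn); by default commit on success and roll back on any error."""
    if not transaction:
        # Opt-out: the function is responsible for its own transaction handling.
        return fn(conn)
    conn.execute("begin")
    try:
        result = fn(conn)
    except Exception:
        conn.execute("rollback")
        raise
    conn.execute("commit")
    return result


def demo():
    # isolation_level=None stops the sqlite3 module from opening implicit
    # transactions, so the explicit begin/commit/rollback above are in control.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("create table docs (id integer primary key, title text)")

    def good(c):
        c.execute("insert into docs (title) values ('one')")
        c.execute("insert into docs (title) values ('two')")

    def bad(c):
        c.execute("insert into docs (title) values ('three')")
        raise ValueError("simulated failure after a partial write")

    run_write_fn(conn, good)
    try:
        run_write_fn(conn, bad)
    except ValueError:
        pass
    # Only the two rows from good() should remain; the partial write was rolled back.
    return conn.execute("select count(*) from docs").fetchone()[0]
```

The point of defaulting to a transaction is that a function which fails halfway through leaves no partial writes behind.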

wddbfs – Mount a sqlite database as a filesystem. Ingenious hack from Adam Obeng. Install this Python tool and run it against a SQLite database:

wddbfs --anonymous --db-path path/to/content.db

Then in the macOS Finder choose Go -> Connect to Server and enter http://127.0.0.1:8080/ (connect as guest). This connects via WebDAV.

/Volumes/127.0.0.1/content.db will now be a folder full of CSV, TSV, JSON and JSONL files—one of each format for every table.

This means you can open data from SQLite directly in any application that supports one of those formats, and you can even run CLI commands such as grep, ripgrep or jq directly against the data!

Adam used WebDAV because “Despite how clunky it is, this seems to be the best way to implement a filesystem given that getting FUSE support is not straightforward”. What a neat trick. # 18th February 2024, 3:31 am
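The "one file per format for every table" idea can be sketched without any WebDAV machinery. The following is an assumption-laden illustration, not how wddbfs works internally: `export_tables` is a hypothetical function that renders each table as CSV and JSONL text, and the `posts` table is invented.

```python
import csv
import io
import json
import sqlite3


def export_tables(conn):
    """Return {filename: text} with a CSV and a JSONL rendering of every table."""
    files = {}
    table_names = [
        row[0]
        for row in conn.execute(
            "select name from sqlite_master "
            "where type = 'table' and name not like 'sqlite_%'"
        )
    ]
    for table in table_names:
        cursor = conn.execute(f'select * from "{table}"')
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()

        csv_buffer = io.StringIO()
        writer = csv.writer(csv_buffer)
        writer.writerow(columns)
        writer.writerows(rows)
        files[f"{table}.csv"] = csv_buffer.getvalue()

        # JSONL: one JSON object per line, keyed by column name.
        files[f"{table}.jsonl"] = "".join(
            json.dumps(dict(zip(columns, row))) + "\n" for row in rows
        )
    return files


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("create table posts (id integer primary key, title text)")
    conn.execute("insert into posts (title) values ('Hello')")
    files = export_tables(conn)
    first_record = json.loads(files["posts.jsonl"].splitlines()[0])
    return sorted(files), files["posts.csv"].splitlines()[0], first_record
```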

Announcing DuckDB 0.10.0. Somewhat buried in this announcement: DuckDB has Fixed-Length Arrays now, along with array_cross_product(a1, a2), array_cosine_similarity(a1, a2) and array_inner_product(a1, a2) functions.

This means you can now use DuckDB to find related content (and other tricks) using vector embeddings!

Also notable: “DuckDB can now attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner, as attached databases are fully functional, appear just as regular tables, and can be updated in a safe, transactional manner.” # 13th February 2024, 5:57 pm

SQL for Data Scientists in 100 Queries. New comprehensive SQLite SQL tutorial from Greg Wilson, author of Teaching Tech Together and founder of The Carpentries. # 6th February 2024, 11:08 pm

stanchion (via) Dan Gallagher’s new (under-development) SQLite extension that adds column-oriented tables to SQLite, using a virtual table implemented in Zig that stores records in row groups, where each row group has multiple segments (one for each column) and those segments are stored as SQLite BLOBs.

I’m surprised that this is possible using the virtual table mechanism. It has the potential to bring some of the analytical querying performance we’ve seen in engines like DuckDB to SQLite itself. # 31st January 2024, 10:32 pm

SQLite 3.45. Released today. The big new feature is JSONB support, a new, specific-to-SQLite binary internal representation of JSON which can provide up to a 3x performance improvement for JSON-heavy operations, plus a 5-10% saving in terms of bytes stored on disk. # 15th January 2024, 8:15 pm

2023

Weeknotes: datasette-enrichments, datasette-comments, sqlite-chronicle

I’ve mainly been working on Datasette Enrichments and continuing to explore the possibilities enabled by sqlite-chronicle.

[... 1123 words]

Tracking SQLite Database Changes in Git (via) A neat trick from Garrit Franke that I hadn’t seen before: you can teach “git diff” how to display human readable versions of the differences between binary files with a specific extension using the following:

git config diff.sqlite3.binary true
git config diff.sqlite3.textconv "echo .dump | sqlite3"

That way you can store binary files in your repo but still get back SQL diffs to compare them.
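To get a feel for what such a diff looks like, here is a rough Python sketch that dumps two versions of a database to SQL and compares them with `difflib`. It uses `Connection.iterdump()`, which produces output similar to (but not guaranteed identical to) the CLI's `.dump`; the `notes` table and the file labels are invented. Note also that Git only applies a named diff driver to paths mapped to it in a `.gitattributes` file, which the snippet above does not show.

```python
import difflib
import sqlite3


def dump_lines(conn):
    """Return the database as a list of SQL statements, roughly like `.dump`."""
    return list(conn.iterdump())


def sql_diff(before, after):
    """Unified diff between the SQL dumps of two database connections."""
    return list(
        difflib.unified_diff(
            dump_lines(before),
            dump_lines(after),
            fromfile="a/data.sqlite3",
            tofile="b/data.sqlite3",
            lineterm="",
        )
    )


def make_db(bodies):
    """Build an in-memory database with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table notes (id integer primary key, body text)")
    conn.executemany("insert into notes (body) values (?)", [(b,) for b in bodies])
    conn.commit()
    return conn


def demo():
    before = make_db(["first"])
    after = make_db(["first", "second"])
    return sql_diff(before, after)


if __name__ == "__main__":
    print("\n".join(demo()))
```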

I still worry about the efficiency of storing binary files in Git, since I expect multiple versions of a text file to compress together better. # 1st November 2023, 6:53 pm

SQLite 3.44: Interactive release notes. Anton Zhiyanov compiled interactive release notes for the new release of SQLite, demonstrating several of the new features. I’m most excited about order by in aggregates, for example group_concat(name order by name desc), which is something I’ve wanted in the past. Anton demonstrates how it works with JSON aggregate functions too. The new date formatting options also look useful. # 1st November 2023, 3:47 pm

New sqlite3 CLI tool in Python 3.12. The newly released Python 3.12 includes a SQLite shell, which you can open using “python -m sqlite3”—handy for when you’re using a machine that has Python installed but no sqlite3 binary.

I installed Python 3.12 for macOS using the official installer from Python.org and now “/usr/local/bin/python3 -m sqlite3” gives me a SQLite 3.41.1 shell—a pleasantly recent version from March 2023 (the latest SQLite is 3.43.1, released in September). # 3rd October 2023, 6:57 pm
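If you just want to know which SQLite version a given Python is linked against, you don't need to open the shell at all. A minimal sketch (the function name is made up):

```python
import sqlite3


def sqlite_versions():
    """Return the SQLite library version as seen by the module and by SQL."""
    conn = sqlite3.connect(":memory:")
    from_sql = conn.execute("select sqlite_version()").fetchone()[0]
    conn.close()
    # sqlite3.sqlite_version is the version of the SQLite library in use,
    # not the version of the Python sqlite3 module itself.
    return sqlite3.sqlite_version, from_sql


if __name__ == "__main__":
    print(sqlite_versions())
```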

Weeknotes: the Datasette Cloud API, a podcast appearance and more

Datasette Cloud now has a documented API, plus a podcast appearance, some LLM plugins work and some geospatial excitement.

[... 1243 words]

Batch size one billion: SQLite insert speedups, from the useful to the absurd (via) Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table. # 26th September 2023, 5:31 pm

Upsert in SQL (via) Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.

Useful tips in here on why “insert... on conflict” is usually a better option than “insert or replace into” because the latter can perform a delete and then an insert, firing triggers that you may not have wanted to be fired. # 25th September 2023, 8:34 pm
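Here is a small sketch of that difference using Python's sqlite3 module. The table and trigger names are invented. One SQLite-specific detail matters: delete triggers only fire for rows removed by REPLACE when recursive triggers are enabled, so the sketch turns that pragma on. Even without triggers, the delete-then-insert behaviour shows up as columns silently resetting to their defaults.

```python
import sqlite3


def compare_upserts():
    conn = sqlite3.connect(":memory:")
    # In SQLite, REPLACE only fires delete triggers if recursive triggers are on.
    conn.execute("pragma recursive_triggers = on")
    conn.executescript(
        """
        create table log (event text);
        create table replaced (id integer primary key, name text, visits integer default 0);
        create table upserted (id integer primary key, name text, visits integer default 0);
        create trigger replaced_delete after delete on replaced
            begin insert into log values ('delete from replaced'); end;
        create trigger upserted_delete after delete on upserted
            begin insert into log values ('delete from upserted'); end;
        insert into replaced values (1, 'old', 5);
        insert into upserted values (1, 'old', 5);
        """
    )
    # Deletes the conflicting row, then inserts a brand new one.
    conn.execute("insert or replace into replaced (id, name) values (1, 'new')")
    # Updates the existing row in place.
    conn.execute(
        "insert into upserted (id, name) values (1, 'new') "
        "on conflict (id) do update set name = excluded.name"
    )
    return {
        "replaced": conn.execute("select name, visits from replaced").fetchone(),
        "upserted": conn.execute("select name, visits from upserted").fetchone(),
        "log": [row[0] for row in conn.execute("select event from log")],
    }
```

With "insert or replace" the `visits` counter is lost and the delete trigger runs; with "on conflict ... do update" the row keeps its other columns and no delete trigger fires.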

Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. Alex Garcia built sqlite-tg—a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.

Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all-important tg_intersects() function for testing whether two geometries, such as a point and a polygon, intersect.

It’s pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table. # 25th September 2023, 7:45 pm
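For a sense of what "brute-force" means here, the sketch below tests a point against every polygon in a list using the classic ray-casting rule. This is a plain-Python illustration of the general technique, not TG's algorithm or the sqlite-tg API, and the zone names and coordinates are invented.

```python
def point_in_ring(x, y, ring):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zones(x, y, zones):
    """Brute force: check the point against every (name, ring) pair, no index."""
    return [name for name, ring in zones if point_in_ring(x, y, ring)]


# Hypothetical zones: two adjacent squares given as lists of (x, y) vertices.
ZONES = [
    ("west", [(0, 0), (10, 0), (10, 10), (0, 10)]),
    ("east", [(10, 0), (20, 0), (20, 10), (10, 10)]),
]

if __name__ == "__main__":
    print(find_zones(5, 5, ZONES))
```

Every query touches every polygon, which is why a spatial index becomes attractive as the data grows.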

TG: Polygon indexing (via) TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.

TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON—all with no additional dependencies.

The most interesting thing about it is the way it handles indexing. In this documentation Josh describes two approaches he uses to speed up point-in-polygon and intersection tests, both of which go beyond the usual RTree implementation.

I think this could make the basis of a really useful SQLite extension—a lighter-weight alternative to SpatiaLite. # 23rd September 2023, 4:32 am

Note that there have been no breaking changes since the [SQLite] file format was designed in 2004. The changes shows in the version history above have all be one of (1) typo fixes, (2) clarifications, or (3) filling in the “reserved for future extensions” bits with descriptions of those extensions as they occurred.

D. Richard Hipp # 18th September 2023, 6:02 pm

Introducing datasette-litestream: easy replication for SQLite databases in Datasette. We use Litestream on Datasette Cloud for streaming backups of user data to S3. Alex Garcia extracted out our implementation into a standalone Datasette plugin, which bundles the Litestream Go binary (for the relevant platform) in the package you get when you run “datasette install datasette-litestream”—so now Datasette has a very robust answer to questions about SQLite disaster recovery beyond just the Datasette Cloud platform. # 13th September 2023, 7:28 pm

LLM now provides tools for working with embeddings

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

[... 3466 words]

Queryable Logging with Blacklite (via) Will Sargent describes how he built Blacklite, a Java library for diagnostic logging that writes log events (as zstd compressed JSON objects) to a SQLite database and maintains 5,000 entries in a “live” database while entries beyond that range are cycled out to an archive.db file, which is cycled to archive.timestamp.db when it reaches 500,000 items.

Lots of interesting notes here on using SQLite for high performance logging.

“SQLite databases are also better log files in general. Queries are faster than parsing through flat files, with all the power of SQL. A vacuumed SQLite database is only barely larger than flat file logs. They are as easy to store and transport as flat file logs, but work much better when merging out of order or interleaved data between two logs.” # 21st August 2023, 6:13 pm
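The live/archive split can be sketched in a few lines of Python. This is only an illustration of the rollover idea: it is not Blacklite's code (which is Java), it skips the zstd compression and the rotation of the archive file, and the limit is shrunk from 5,000 to 5 so the behaviour is easy to see. All names are hypothetical.

```python
import sqlite3

LIVE_LIMIT = 5  # the post describes 5,000 live entries; kept tiny for the demo


def open_logs():
    """Create in-memory stand-ins for the live and archive databases."""
    live = sqlite3.connect(":memory:")
    archive = sqlite3.connect(":memory:")
    for conn in (live, archive):
        conn.execute("create table entries (ts integer, message text)")
    return live, archive


def write_entry(live, archive, ts, message, limit=LIVE_LIMIT):
    """Insert into the live database, moving the oldest rows out past the limit."""
    live.execute("insert into entries (ts, message) values (?, ?)", (ts, message))
    overflow = live.execute("select count(*) from entries").fetchone()[0] - limit
    if overflow > 0:
        oldest = live.execute(
            "select rowid, ts, message from entries order by rowid limit ?",
            (overflow,),
        ).fetchall()
        archive.executemany(
            "insert into entries (ts, message) values (?, ?)",
            [(old_ts, old_message) for _, old_ts, old_message in oldest],
        )
        live.executemany(
            "delete from entries where rowid = ?",
            [(rowid,) for rowid, _, _ in oldest],
        )
    live.commit()
    archive.commit()


def demo():
    live, archive = open_logs()
    for i in range(8):
        write_entry(live, archive, ts=1000 + i, message=f"event {i}")
    live_count = live.execute("select count(*) from entries").fetchone()[0]
    archived = [r[0] for r in archive.execute("select message from entries order by rowid")]
    return live_count, archived
```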

Dependency Management Data (via) This is a really neat CLI tool by Jamie Tanna, built using Go and SQLite but with a feature that embeds a Datasette instance (literally shelling out to start the process running from within the Go application) to provide an interface for browsing the resulting database.

It addresses the challenge of keeping track of the dependencies used across an organization, by gathering them into a SQLite database from a variety of different sources—currently Dependabot, Renovate and some custom AWS tooling.

The “Example” page links to a live Datasette instance and includes video demos of the tool in action. # 11th August 2023, 3:54 pm

sqlite-utils now supports plugins

sqlite-utils 3.34 is out with a major new feature: support for plugins.

[... 1327 words]

sqlite-vss v0.1.1 Annotated Release Notes (via) Alex Garcia’s sqlite-vss adds vector search directly to SQLite through a custom extension. It’s now easily installed for Python, Node.js, Deno, Elixir, Go, Rust and Ruby (“gem install sqlite-vss”), and is being used actively by enough people that Alex is getting actionable feedback, including fixes for memory leaks spotted in production. # 20th July 2023, 5:48 pm

Data analysis with SQLite and Python. I turned my 2hr45m workshop from PyCon into the latest official tutorial on the Datasette website. It includes an extensive handout which should be useful independently of the video itself. # 2nd July 2023, 4:48 pm

sqlean.py: Python’s sqlite3 with extensions. Anton Zhiyanov built a new Python package which bundles a fresh, compiled copy of SQLite with his SQLean family of C extensions built right in. Installing it gets you the latest SQLite—3.42.0—with nearly 200 additional functions, including things like define() and eval(), fileio_read() and fileio_write(), percentile_95() and uuid4() and many more. “import sqlean as sqlite3” works as a drop-in replacement for the module from the standard library. # 17th June 2023, 10:42 pm

Vector Search. Amjith Ramanujam provides a very thorough tutorial on implementing vector similarity search using SentenceTransformers embeddings (all-MiniLM-L6-v2) executed using sqlite-utils, then served via datasette-sqlite-vss and deployed using Fly. # 2nd June 2023, 5:02 am

SQLite 3.42.0. The latest SQLite has a tiny feature I requested on the SQLite Forum: SELECT unixepoch('subsec') now returns the current time since the Unix epoch with millisecond precision, a big improvement on the previous recipe of select cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer)!
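Here is a hedged sketch of both recipes side by side from Python. It assumes that with the subsec modifier unixepoch() returns fractional seconds, so the value is multiplied by 1000 to get integer milliseconds, and it falls back to the julianday() recipe on SQLite versions older than 3.42. The function names are made up.

```python
import sqlite3
import time

# Works on older SQLite versions: convert the Julian day number to milliseconds.
PORTABLE_SQL = "select cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer)"
# SQLite 3.42+: fractional seconds since the epoch, scaled to milliseconds.
SUBSEC_SQL = "select cast(unixepoch('now', 'subsec') * 1000 as integer)"


def epoch_ms_portable(conn):
    return conn.execute(PORTABLE_SQL).fetchone()[0]


def epoch_ms(conn):
    """Use the subsec modifier when the linked SQLite supports it."""
    if sqlite3.sqlite_version_info >= (3, 42, 0):
        return conn.execute(SUBSEC_SQL).fetchone()[0]
    return epoch_ms_portable(conn)


def drift_ms(conn):
    """Absolute difference between SQLite's clock reading and Python's, in ms."""
    return abs(epoch_ms(conn) - time.time() * 1000)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(epoch_ms(connection), epoch_ms_portable(connection))
```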

Also in the release: JSON5 support (JSON with multi-line strings and comments), a bunch of improvements to the query planner and CLI tool, plus various interesting internal changes. # 18th May 2023, 9:14 pm

Enriching data with GPT3.5 and SQLite SQL functions

I shipped openai-to-sqlite 0.3 yesterday with a fun new feature: you can now use the command-line tool to enrich data in a SQLite database by running values through an OpenAI model and saving the results, all in a single SQL query.

[... 1219 words]

Data analysis with SQLite and Python for PyCon 2023

I’m at PyCon 2023 in Salt Lake City this week.

[... 347 words]