Simon Willison’s Weblog


388 items tagged “datasette”


Extracting data from unstructured text and images with Datasette and GPT-4 Turbo. Datasette Extract is a new Datasette plugin that uses GPT-4 Turbo (released to general availability today) and GPT-4 Vision to extract structured data from unstructured text and images.

I put together a video demo of the plugin in action today, and posted it to the Datasette Cloud blog along with screenshots and a tutorial describing how to use it. # 9th April 2024, 11:03 pm

datasette-import. A new plugin for importing data into Datasette. This is a replacement for datasette-paste, duplicating and extending its functionality. datasette-paste had grown beyond just dealing with pasted CSV/TSV/JSON data—it handles file uploads as well now—which inspired the new name. # 6th April 2024, 10:40 pm

Add ETag header for static responses. I’ve been procrastinating on adding better caching headers for static assets (JavaScript and CSS) served by Datasette for several years, because I’ve been wanting to implement the perfect solution that sets far-future cache headers on every asset and ensures the URLs change when they are updated.

Agustin Bacigalup just submitted the best kind of pull request: he observed that adding ETag support for static assets would side-step the complexity while adding much of the benefit, and implemented it along with tests.

It’s a substantial performance improvement for any Datasette instance with a number of JavaScript plugins... like the ones we are building on Datasette Cloud. I’m just annoyed we didn’t ship something like this sooner! # 17th March 2024, 7:25 pm
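To make the ETag idea concrete, here is a minimal sketch of how a static asset handler could use one. It is not the code from the pull request: the function names, the use of SHA-256 and the plain dictionary of lowercase request headers are all assumptions chosen for illustration.

```python
import hashlib


def compute_etag(content: bytes) -> str:
    """Return a quoted ETag derived from the bytes of a static asset."""
    # Hashing the content means the ETag changes whenever the file changes.
    return '"' + hashlib.sha256(content).hexdigest() + '"'


def static_response(content: bytes, request_headers: dict) -> tuple:
    """Return (status, headers, body) for a static asset request.

    request_headers is assumed to be a dict with lowercase header names.
    """
    etag = compute_etag(content)
    headers = {"etag": etag}
    if request_headers.get("if-none-match") == etag:
        # The browser already holds this exact version: send no body.
        return 304, headers, b""
    return 200, headers, content
```

The first request gets the full asset plus its ETag. Later requests that send the same value back in If-None-Match get an empty 304 response, which is where the saving comes from when a page loads many plugin scripts.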

Weeknotes: the aftermath of NICAR

NICAR was fantastic this year. Alex and I ran a successful workshop on Datasette and Datasette Cloud, and I gave a lightning talk demonstrating two new GPT-4 powered Datasette plugins—datasette-enrichments-gpt and datasette-extract. I need to write more about the latter one: it enables populating tables from unstructured content (using a variant of this technique) and it’s really effective. I got it working just in time for the conference.

[... 1430 words]

datasette/studio. I’m trying a new way to make Datasette available for small personal data manipulation projects, using GitHub Codespaces.

This repository is designed to be opened directly in Codespaces—detailed instructions in the README.

When the container starts it installs the datasette-studio family of plugins (including CSV upload, some enrichments and a few other useful features), then starts the server running and provides a big green button to click to access the server via GitHub’s port forwarding mechanism. # 10th March 2024, 3:03 am

Datasette 1.0a12. Another alpha release, this time with a new query_actions() plugin hook, a new design for the table, database and query actions menus, a “does not contain” table filter and a fix for a minor bug with the JavaScript makeColumnActions() plugin mechanism. # 29th February 2024, 11:56 pm

Weeknotes: Getting ready for NICAR

Next week is NICAR 2024 in Baltimore—the annual data journalism conference hosted by Investigative Reporters and Editors. I’m running a workshop on Datasette, and I plan to spend most of my time in the hallway track talking to people about Datasette, Datasette Cloud and how the Datasette ecosystem can best help support their work.

[... 1390 words]

dclient 0.3. dclient is my CLI utility for working with remote Datasette instances—in particular for authenticating with them and then running both read-only SQL queries and inserting data using the new Datasette write JSON API. I just picked up work on the project again after a six month gap—the insert command can now be used to constantly stream data directly to hosted Datasette instances such as Datasette Cloud. # 25th February 2024, 8:06 pm
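Streaming inserts generally come down to grouping incoming rows into batches and sending each batch as one request. The sketch below shows only that batching step and is not taken from dclient: batch_rows is a hypothetical helper, the batch size of 100 is arbitrary, and the {"rows": [...]} payload shape is an assumption about what the write API expects, so check the JSON write API documentation before relying on it.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def batch_rows(lines: Iterable[str], batch_size: int = 100) -> Iterator[dict]:
    """Group newline-delimited JSON objects into {"rows": [...]} payloads."""
    # Parse lazily so an endless stream never has to fit in memory.
    rows = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield {"rows": batch}
```

Each yielded dictionary could then be serialized and POSTed by whatever HTTP client is in use. Because the input is consumed lazily, the same function works for a file, a pipe or a long-running stream.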

datasette-studio. I’ve been thinking for a while that it might be interesting to have a version of Datasette that comes bundled with a set of useful plugins, aimed at expanding Datasette’s default functionality to cover things like importing data and editing schemas.

This morning I built the very first experimental preview of what that could look like. Install it using pipx:

pipx install datasette-studio

I recommend pipx because it will ensure datasette-studio gets its own isolated environment, independent of any other Datasette installations you might have.

Now running “datasette-studio” instead of “datasette” will get you the version with the bundled plugins.

The implementation of this is fun—it’s a single pyproject.toml file defining the dependencies and setting up the datasette-studio CLI hook, which is enough to provide the full set of functionality.

Is this a good idea? I don’t know yet, but it’s certainly an interesting initial experiment. # 18th February 2024, 8:38 pm

Datasette 1.0a10. The only changes in this alpha release concern the way Datasette handles database transactions. The database.execute_write_fn() internal method used to leave functions to implement transactions on their own—it now defaults to wrapping them in a transaction unless they opt out with the new transaction=False parameter.

In implementing this I found several places inside Datasette—in particular parts of the JSON write API—which had not been handling transactions correctly. Those are all now fixed. # 18th February 2024, 5:10 am
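The change in default behaviour can be illustrated with the standard library alone. This is a sketch of the idea, not Datasette's implementation: run_write_fn is a hypothetical stand-in that borrows only the transaction=False parameter name from the release.

```python
import sqlite3


def run_write_fn(conn, fn, transaction=True):
    """Run fn(conn), wrapping it in a transaction unless transaction=False."""
    if transaction:
        # sqlite3's connection context manager commits if fn succeeds
        # and rolls back if fn raises an exception.
        with conn:
            return fn(conn)
    # Opted out: fn is responsible for its own transaction handling.
    return fn(conn)
```

With the default, a write function that fails halfway through leaves no partial rows behind. Passing transaction=False restores the older arrangement, where the function has to manage that itself.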

Datasette 1.0a9. A new Datasette alpha release today. This adds basic alter table API support, so you can request Datasette modify a table to add new columns needed for JSON objects submitted to the insert, upsert or update APIs.

It also makes some permission changes—fixing a minor bug with upsert permissions, and introducing a new rule where every permission plugin gets consulted for a permission check, with just one refusal vetoing that check. # 16th February 2024, 11:20 pm
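As a rough illustration of what alter table support means for an insert, the sketch below adds any columns the submitted objects need before writing them. It uses plain sqlite3 rather than Datasette itself: insert_with_alter and its alter flag are hypothetical names, and the identifier quoting is deliberately naive.

```python
import sqlite3


def insert_with_alter(conn, table, rows, alter=False):
    """Insert dict rows, optionally adding any columns the table lacks.

    Returns the list of column names that had to be added.
    """
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    needed = []
    for row in rows:
        for key in row:
            if key not in existing and key not in needed:
                needed.append(key)
    if needed and not alter:
        # Without the alter flag, unknown keys are an error.
        raise ValueError(f"Unknown columns: {needed}")
    with conn:
        for column in needed:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}"')
        for row in rows:
            cols = ", ".join(f'"{key}"' for key in row)
            marks = ", ".join("?" for _ in row)
            conn.execute(
                f'INSERT INTO "{table}" ({cols}) VALUES ({marks})',
                list(row.values()),
            )
    return needed
```

Calling it without alter=True rejects objects that carry unknown keys, while alter=True widens the table first and then inserts. Making the schema change opt-in keeps a typo in a key name from silently creating a new column.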

Weeknotes: a Datasette release, an LLM release and a bunch of new plugins

I wrote extensive annotated release notes for Datasette 1.0a8 and LLM 0.13 already. Here’s what else I’ve been up to these past three weeks.

[... 1074 words]

Datasette 1.0a8: JavaScript plugins, new plugin hooks and plugin configuration in datasette.yaml

I just released Datasette 1.0a8. These are the annotated release notes.

[... 1709 words]

Weeknotes: datasette-test, datasette-build, PSF board retreat

I wrote about Page caching and custom templates in my last weeknotes. This week I wrapped up that work, modifying datasette-edit-templates to be compatible with the jinja2_environment_from_request() plugin hook. This means you can edit templates directly in Datasette itself and have those served either for the full instance or just for the instance when served from a specific domain (the Datasette Cloud case).

[... 757 words]

Weeknotes: Page caching and custom templates for Datasette Cloud

My main development focus this week has been adding public page caching to Datasette Cloud, and exploring what custom template support might look like for that service.

[... 924 words]


datasette-plot—a new Datasette Plugin for building data visualizations. I forgot to link to this here last week: Alex Garcia released the first version of datasette-plot, a brand new Datasette visualization plugin built on top of the Observable Plot charting library. We plan to use this as the new, updated alternative to my older datasette-vega plugin. # 31st December 2023, 5:04 am

Weeknotes: datasette-enrichments, datasette-comments, sqlite-chronicle

I’ve mainly been working on Datasette Enrichments and continuing to explore the possibilities enabled by sqlite-chronicle.

[... 1123 words]

Simon Willison (Part Two): How Datasette Helps With Investigative Reporting. The second part of my Newsroom Robots podcast conversation with Nikita Roy. This episode includes my best audio answer yet to the “what is Datasette?” question, plus notes on how to use LLMs in journalism despite their propensity to make things up. # 5th December 2023, 8:27 pm

Datasette Enrichments: a new plugin framework for augmenting your data

Today I’m releasing datasette-enrichments, a new feature for Datasette which provides a framework for applying “enrichments” that can augment your data.

[... 1202 words]

Annotate and explore your data with datasette-comments. New plugin for Datasette and Datasette Cloud: datasette-comments, providing tools for collaborating on data exploration with a team through posting comments on individual rows of data.

Alex Garcia built this for Datasette Cloud but as with almost all of our work there it’s also available as an open source Python package. # 30th November 2023, 9:59 pm

Weeknotes: DevDay, GitHub Universe, OpenAI chaos

Three weeks of conferences and Datasette Cloud work, four days of chaos for OpenAI.

[... 766 words]

Financial sustainability for open source projects at GitHub Universe

I presented a ten minute segment at GitHub Universe on Wednesday, ambitiously titled Financial sustainability for open source projects.

[... 2485 words]

Weeknotes: PyBay, AI Engineer Summit, Datasette metadata and JavaScript plugins

I’ve had a bit of a slow two weeks in terms of building things and writing code, thanks mainly to a couple of conference appearances. I did review and land a couple of major contributions to Datasette though.

[... 564 words]

Weeknotes: the Datasette Cloud API, a podcast appearance and more

Datasette Cloud now has a documented API, plus a podcast appearance, some LLM plugins work and some geospatial excitement.

[... 1243 words]

Getting started with the Datasette Cloud API. I wrote an introduction to the Datasette Cloud API for the company blog, with a tutorial showing how to use Python and GitHub Actions to import data from the Federal Register into a table in Datasette Cloud, then configure full-text search against it. # 28th September 2023, 11:05 pm

Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. Alex Garcia built sqlite-tg—a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.

Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all-important tg_intersects() function for testing if a polygon and a point overlap each other.

It’s pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table. # 25th September 2023, 7:45 pm
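For a sense of what a brute-force point-in-polygon query involves, here is a small pure-Python sketch. It is not how TG or sqlite-tg work internally: it uses a simple ray-casting test, considers only the outer ring of GeoJSON Polygon features (holes and MultiPolygons are ignored), and the "tzid" property name is an assumption about how the timezone data is labelled.

```python
def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the closed ring of [x, y] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count edges that a horizontal ray from the point would cross.
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def find_containing(x, y, features):
    """Brute force: test the point against every GeoJSON Polygon feature."""
    matches = []
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # this sketch skips every other geometry type
        outer_ring = geometry["coordinates"][0]
        if point_in_ring(x, y, outer_ring):
            matches.append(feature["properties"].get("tzid"))
    return matches
```

Every feature is checked on every lookup, so the cost grows with the total number of polygon vertices. A spatial index would avoid most of that work by ruling out distant polygons before the detailed test runs.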

Weeknotes: Embeddings, more embeddings and Datasette Cloud

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

[... 2427 words]

Introducing datasette-litestream: easy replication for SQLite databases in Datasette. We use Litestream on Datasette Cloud for streaming backups of user data to S3. Alex Garcia extracted out our implementation into a standalone Datasette plugin, which bundles the Litestream Go binary (for the relevant platform) in the package you get when you run “datasette install datasette-litestream”—so now Datasette has a very robust answer to questions about SQLite disaster recovery beyond just the Datasette Cloud platform. # 13th September 2023, 7:28 pm

Datasette 1.0a4 and 1.0a5, plus weeknotes

Two new alpha releases of Datasette, plus a keynote at WordCamp, a new LLM release, two new LLM plugins and a flurry of TILs.

[... 2709 words]

Datasette Cloud and the Datasette 1.0 alphas. I sent out the Datasette Newsletter for the first time in quite a while, with updates on Datasette Cloud, the Datasette 1.0 alphas, a note about the security vulnerability in those alphas and a summary of some of my research into combining LLMs with Datasette. # 22nd August 2023, 7:56 pm