Simon Willison’s Weblog

Subscribe
Atom feed for datasette

418 items tagged “datasette”

Datasette is an open source tool for exploring and publishing data.

2024

Project: Civic Band—scraping and searching PDF meeting minutes from hundreds of municipalities

Visit Project: Civic Band - scraping and searching PDF meeting minutes from hundreds of municipalities

I interviewed Philip James about Civic Band, his “slowly growing collection of databases of the minutes from civic governments”. Philip demonstrated the site and talked through his pipeline for scraping and indexing meeting minutes from many different local government authorities around the USA.

[... 762 words]

Visualizing local election results with Datasette, Observable and MapLibre GL

Visit Visualizing local election results with Datasette, Observable and MapLibre GL

Alex Garcia and myself hosted the first Datasette Open Office Hours on Friday—a live-streamed video session where we hacked on a project together and took questions and tips from community members on Discord.

[... 3,390 words]

Datasette Public Office Hours, Friday Nov 8th at 2pm PT. Tomorrow afternoon (Friday 8th November) at 2pm PT we'll be hosting the first Datasette Public Office Hours - a livestream video session on Discord where Alex Garcia and myself will live code on some Datasette projects and hang out to chat about the project.

This is our first time trying this format. If it works out well I plan to turn it into a series.

Discord event card promoting Datasette Public Office Hours

# 7th November 2024, 7:10 pm / discord, open-source, datasette, alex-garcia

Datasette 0.65. Python 3.13 was released today, which broke compatibility with the Datasette 0.x series due to an issue with an underlying dependency. I've fixed that problem by vendoring and fixing the dependency and the new 0.65 release works on Python 3.13 (but drops support for Python 3.8, which is EOL this month). Datasette 1.0a16 added support for Python 3.13 last month.

# 7th October 2024, 6:07 pm / projects, datasette, python

django-plugin-datasette. I did some more work on my DJP plugin mechanism for Django at the DjangoCon US sprints today. I added a new plugin hook, asgi_wrapper(), released in DJP 0.3 and inspired by the similar hook in Datasette.

The hook only works for Django apps that are served using ASGI. It allows plugins to add their own wrapping ASGI middleware around the Django app itself, which means they can do things like attach entirely separate ASGI-compatible applications outside of the regular Django request/response cycle.

Datasette is one of those ASGI-compatible applications!

django-plugin-datasette uses that new hook to configure a new URL, /-/datasette/, which serves a full Datasette instance that scans through Django’s settings.DATABASES dictionary and serves an explore interface on top of any SQLite databases it finds there.

It doesn’t support authentication yet, so this will expose your entire database contents - probably best used as a local debugging tool only.

I did borrow some code from the datasette-mask-columns plugin to ensure that the password column in the auth_user column is reliably redacted. That column contains a heavily salted hashed password so exposing it isn’t necessarily a disaster, but I like to default to keeping hashes safe.

# 26th September 2024, 9:57 pm / projects, sqlite, plugins, djp, datasette, django

Datasette 1.0a16. This latest release focuses mainly on performance, as discussed here in Optimizing Datasette a couple of weeks ago.

It also includes some minor CSS changes that could affect plugins, and hence need to be included before the final 1.0 release. Those are outlined in detail in issues #2415 and #2420.

# 6th September 2024, 5:55 am / projects, datasette

Calling LLMs from client-side JavaScript, converting PDFs to HTML + weeknotes

Visit Calling LLMs from client-side JavaScript, converting PDFs to HTML + weeknotes

I’ve been having a bunch of fun taking advantage of CORS-enabled LLM APIs to build client-side JavaScript applications that access LLMs directly. I also span up a new Datasette plugin for advanced permission management.

[... 2,050 words]

Optimizing Datasette (and other weeknotes)

Visit Optimizing Datasette (and other weeknotes)

I’ve been working with Alex Garcia on an experiment involving using Datasette to explore FEC contributions. We currently have a 11GB SQLite database—trivial for SQLite to handle, but at the upper end of what I’ve comfortably explored with Datasette in the past.

[... 2,069 words]

datasette-checkbox. I built this fun little Datasette plugin today, inspired by a conversation I had in Datasette Office Hours.

If a user has the update-row permission and the table they are viewing has any integer columns with names that start with is_ or should_ or has_, the plugin adds interactive checkboxes to that table which can be toggled to update the underlying rows.

This makes it easy to quickly spin up an interface that allows users to review and update boolean flags in a table.

Animated demo showing checkboxes in columns for is_done, should_be_deleted and is_happy - checking the checkboxes shows an updated message next to each one which then fades away.

I have ambitions for a much more advanced version of this, where users can do things like add or remove tags from rows directly in that table interface - but for the moment this is a neat starting point, and it only took an hour to build (thanks to help from Claude to build an initial prototype, chat transcript here).

# 16th August 2024, 9:28 pm / projects, claude-3-5-sonnet, datasette, plugins

Datasette 1.0a15. Mainly bug fixes, but a couple of minor new features:

  • Datasette now defaults to hiding SQLite "shadow" tables, as seen in extensions such as SQLite FTS and sqlite-vec. Virtual tables that it makes sense to display, such as FTS core tables, are no longer hidden. Thanks, Alex Garcia. (#2296)
  • The Datasette homepage is now duplicated at /-/, using the default index.html template. This ensures that the information on that page is still accessible even if the Datasette homepage has been customized using a custom index.html template, for example on sites like datasette.io. (#2393)

Datasette also now serves more user-friendly CSRF pages, an improvement which required me to ship asgi-csrf 0.10.

# 16th August 2024, 5:06 am / releases, datasette, projects, csrf

Using sqlite-vec with embeddings in sqlite-utils and Datasette. My notes on trying out Alex Garcia's newly released sqlite-vec SQLite extension, including how to use it with OpenAI embeddings in both Datasette and sqlite-utils.

# 11th August 2024, 11:37 pm / embeddings, sqlite-utils, sqlite, datasette, openai, alex-garcia

Weeknotes: a staging environment, a Datasette alpha and a bunch of new LLMs

My big achievement for the last two weeks was finally wrapping up work on the Datasette Cloud staging environment. I also shipped a new Datasette 1.0 alpha and added support to the LLM ecosystem for a bunch of newly released models.

[... 1,465 words]

Datasette 1.0a14: The annotated release notes

Visit Datasette 1.0a14: The annotated release notes

Released today: Datasette 1.0a14. This alpha includes significant contributions from Alex Garcia, including some backwards-incompatible changes in the run-up to the 1.0 release.

[... 1,424 words]

Announcing our DjangoCon US 2024 Talks! I'm speaking at DjangoCon in Durham, NC in September.

My accepted talk title was How to design and implement extensible software with plugins. Here's my abstract:

Plugins offer a powerful way to extend software packages. Tools that support a plugin architecture include WordPress, Jupyter, VS Code and pytest - each of which benefits from an enormous array of plugins adding all kinds of new features and expanded capabilities.

Adding plugin support to an open source project can greatly reduce the friction involved in attracting new contributors. Users can work independently and even package and publish their work without needing to directly coordinate with the project's core maintainers. As a maintainer this means you can wake up one morning and your software grew new features without you even having to review a pull request!

There's one catch: information on how to design and implement plugin support for a project is scarce.

I now have three major open source projects that support plugins, with over 200 plugins published across those projects. I'll talk about everything I've learned along the way: when and how to use plugins, how to design plugin hooks and how to ensure your plugin authors have as good an experience as possible.

I'm going to be talking about what I've learned integrating Pluggy with Datasette, LLM and sqlite-utils. I've been looking for an excuse to turn this knowledge into a talk for ages, very excited to get to do it at DjangoCon!

# 17th July 2024, 3:20 am / djangocon, python, plugins, django, speaking, sqlite-utils, llm, datasette

datasette-python. I just released a small new plugin for Datasette to assist with debugging. It adds a python subcommand which runs a Python process in the same virtual environment as Datasette itself.

I built it initially to help debug some issues in Datasette installed via Homebrew. The Homebrew installation has its own virtual environment, and sometimes it can be useful to run commands like pip list in the same environment as Datasette itself.

Now you can do this:

brew install datasette
datasette install datasette-python
datasette python -m pip list

I built a similar plugin for LLM last year, called llm-python - it's proved useful enough that I duplicated the design for Datasette.

# 12th July 2024, 11:17 pm / projects, datasette, python, plugins

UK Parliament election results, now with Datasette. The House of Commons Library maintains a website of UK parliamentary election results data, currently listing 2010 through 2019 and with 2024 results coming soon.

The site itself is a Rails and PostgreSQL app, but I was delighted to learn today that they're also running a Datasette instance with the election results data, linked to from their homepage!

The data this website uses is available to query. as a Datasette endpoint. The database schema is published for reference. Mobile Safari screenshot on electionresults.parliament.uk

The raw data is also available as CSV files in their GitHub repository. Here's their Datasette configuration, which includes a copy of their SQLite database.

# 5th July 2024, 11:36 pm / elections, datasette, sqlite

Weeknotes: a livestream, a surprise keynote and progress on Datasette Cloud billing

Visit Weeknotes: a livestream, a surprise keynote and progress on Datasette Cloud billing

My first YouTube livestream with Val Town, a keynote at the AI Engineer World’s Fair and some work integrating Stripe with Datasette Cloud. Plus a bunch of upgrades to my blog.

[... 1,124 words]

Datasette 0.64.8. A very small Datasette release, fixing a minor potential security issue where the name of missing databases or tables was reflected on the 404 page in a way that could allow an attacker to present arbitrary text to a user who followed a link. Not an XSS attack (no code could be executed) but still a potential vector for confusing messages.

# 21st June 2024, 11:48 pm / security, releases, datasette, projects

Building search-based RAG using Claude, Datasette and Val Town

Visit Building search-based RAG using Claude, Datasette and Val Town

Retrieval Augmented Generation (RAG) is a technique for adding extra “knowledge” to systems built on LLMs, allowing them to answer questions against custom information not included in their training data. A common way to implement this is to take a question from a user, translate that into a set of search queries, run those against a search engine and then feed the results back into the LLM to generate an answer.

[... 3,372 words]

Civic Band. Exciting new civic tech project from Philip James: 30 (and counting) Datasette instances serving full-text search enabled collections of OCRd meeting minutes for different civic governments. Includes 20,000 pages for Alameda, 17,000 for Pittsburgh, 3,567 for Baltimore and an enormous 117,000 for Maui County.

Philip includes some notes on how they're doing it. They gather PDF minute notes from anywhere that provides API access to them, then run local Tesseract for OCR (the cost of cloud-based OCR proving prohibitive given the volume of data). The collection is then deployed to a single VPS running multiple instances of Datasette via Caddy, one instance for each of the covered regions.

# 19th June 2024, 9:30 pm / ocr, tesseract, data-journalism, datasette

Weeknotes: Datasette Studio and a whole lot of blogging

Visit Weeknotes: Datasette Studio and a whole lot of blogging

I’m still spinning back up after my trip back to the UK, so actual time spent building things has been less than I’d like. I presented an hour long workshop on command-line LLM usage, wrote five full blog entries (since my last weeknotes) and I’ve also been leaning more into short-form link blogging—a lot more prominent on this site now since my homepage redesign last week.

[... 736 words]

Language models on the command-line

Visit Language models on the command-line

I gave a talk about accessing Large Language Models from the command-line last week as part of the Mastering LLMs: A Conference For Developers & Data Scientists six week long online conference. The talk focused on my LLM Python command-line utility and ways you can use it (and its plugins) to explore LLMs and use them for useful tasks.

[... 4,992 words]

Datasette 0.64.7. A very minor dot-fix release for Datasette stable, addressing this bug where Datasette running against the latest version of SQLite - 3.46.0 - threw an error on canned queries that included :named parameters in their SQL.

The root cause was Datasette using a now invalid clever trick I came up with against the undocumented and unstable opcodes returned by a SQLite EXPLAIN query.

I asked on the SQLite forum and learned that the feature I was using was removed in this commit to SQLite. D. Richard Hipp explains:

The P4 parameter to OP_Variable was not being used for anything. By omitting it, we make the prepared statement slightly smaller, reduce the size of the SQLite library by a few bytes, and help sqlite3_prepare() and similar run slightly faster.

# 12th June 2024, 10:55 pm / projects, d-richard-hipp, datasette, sqlite, annotated-release-notes

Ham radio general exam question pool as JSON. I scraped a pass of my Ham radio general exam this morning. One of the tools I used to help me pass was a Datasette instance with all 429 questions from the official question pool. I've published that raw data as JSON on GitHub, which I converted from the official question pool document using an Observable notebook.

Relevant TIL: How I studied for my Ham radio general exam.

# 11th May 2024, 7:16 pm / datasette, radio, projects, observable, json, ham-radio

datasette-pins — a new Datasette plugin for pinning tables and queries. Alex Garcia built this plugin for Datasette Cloud, and as with almost every Datasette Cloud features we're releasing it as an open source package as well.

datasette-pins allows users with the right permission to "pin" tables, databases and queries to their homepage. It's a lightweight way to customize that homepage, especially useful as your Datasette instance grows to host dozens or even hundreds of tables.

# 9th May 2024, 6:29 pm / alex-garcia, datasette-cloud, datasette, plugins

Weeknotes: more datasette-secrets, plus a mystery video project

Visit Weeknotes: more datasette-secrets, plus a mystery video project

I introduced datasette-secrets two weeks ago. The core idea is to provide a way for end-users to store secrets such as API keys in Datasette, allowing other plugins to access them.

[... 982 words]

Food Delivery Leak Unmasks Russian Security Agents. This story is from April 2022 but I realize now I never linked to it.

Yandex Food, a popular food delivery service in Russia, suffered a major data leak.

The data included an order history with names, addresses and phone numbers of people who had placed food orders through that service.

Bellingcat were able to cross-reference this leak with addresses of Russian security service buildings—including those linked to the GRU and FSB.This allowed them to identify the names and phone numbers of people working for those organizations, and then combine that information with further leaked data as part of their other investigations.

If you look closely at the screenshots in this story they may look familiar: Bellingcat were using Datasette internally as a tool for exploring this data!

# 26th April 2024, 1:59 am / bellingcat, data-journalism, datasette

Weeknotes: Llama 3, AI for Data Journalism, llm-evals and datasette-secrets

Visit Weeknotes: Llama 3, AI for Data Journalism, llm-evals and datasette-secrets

Llama 3 landed on Thursday. I ended up updating a whole bunch of different plugins to work with it, described in Options for accessing Llama 3 from the terminal using LLM.

[... 1,030 words]

AI for Data Journalism: demonstrating what we can do with this stuff right now

Visit AI for Data Journalism: demonstrating what we can do with this stuff right now

I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference.

[... 6,081 words]

Extracting data from unstructured text and images with Datasette and GPT-4 Turbo. Datasette Extract is a new Datasette plugin that uses GPT-4 Turbo (released to general availability today) and GPT-4 Vision to extract structured data from unstructured text and images.

I put together a video demo of the plugin in action today, and posted it to the Datasette Cloud blog along with screenshots and a tutorial describing how to use it.

# 9th April 2024, 11:03 pm / datasette-cloud, openai, gpt-4, ai, llms, datasette, generative-ai, projects, vision-llms