Simon Willison’s Weblog


Weeknotes: a Datasette release, an LLM release and a bunch of new plugins

9th February 2024

I wrote extensive annotated release notes for Datasette 1.0a8 and LLM 0.13 already. Here’s what else I’ve been up to these past three weeks.

New plugins for Datasette

  • datasette-proxy-url is a very simple plugin that lets you configure a path within Datasette that serves content proxied from another URL.

    I built this one because I ran into a bug with Substack where Substack were denying requests to my newsletter’s RSS feed from code running in GitHub Actions! Frustrating, since the whole point of RSS is to be retrieved by bots.

    I solved it by deploying a quick proxy to a Datasette instance I already had up and running, effectively treating Datasette as a cheap deployment platform for random pieces of proxying infrastructure.

  • datasette-homepage-table lets you configure Datasette to display a specific table as the homepage of the instance. I’ve wanted this for a while myself, and when someone requested it on Datasette Discord it turned out to be pretty quick to build.

  • datasette-events-db hooks into the new events mechanism in Datasette 1.0a8 and logs any events (create-table, login etc) to a datasette_events table. I released this partly as a debugging tool and partly because I like to ensure every Datasette plugin hook has at least one released plugin that uses it.

  • datasette-enrichments-quickjs was this morning’s project. It’s a plugin for Datasette Enrichments that takes advantage of the quickjs Python package—a wrapper around the excellent QuickJS engine—to support running a custom JavaScript function against every row in a table to populate a new column.

    QuickJS appears to provide a robust sandbox, including both memory and time limits! I need to write more about this plugin, because it opens up some very exciting new possibilities for Datasette.
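
To make the "run a function against every row to populate a new column" idea concrete, here is a minimal sketch using only Python's standard library. It is not the plugin's code: a plain Python callable stands in for the sandboxed JavaScript function, there are no memory or time limits, and the names enrich_column, places and size (along with the sample rows) are made up for illustration.

```python
import sqlite3


def enrich_column(conn, table, new_column, fn):
    """Add new_column to table, filling it with fn(row_dict) for every row.

    Hypothetical helper for illustration only. fn is an ordinary Python
    callable standing in for the user-supplied JavaScript function.
    """
    conn.row_factory = sqlite3.Row
    # Read the existing rows first, keeping rowid so each one can be updated.
    rows = conn.execute(f'SELECT rowid AS _rowid, * FROM "{table}"').fetchall()
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{new_column}" TEXT')
    for row in rows:
        data = dict(row)
        rowid = data.pop("_rowid")
        conn.execute(
            f'UPDATE "{table}" SET "{new_column}" = ? WHERE rowid = ?',
            (fn(data), rowid),
        )
    conn.commit()
    return len(rows)


# Made-up example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?)", [("alpha", 5), ("beta", 500)]
)

count = enrich_column(
    conn,
    "places",
    "size",
    lambda row: "big" if row["population"] > 100 else "small",
)
result = [
    tuple(r) for r in conn.execute("SELECT name, size FROM places ORDER BY name")
]
```

The interesting part of the real plugin is what this sketch leaves out: because the function there is untrusted JavaScript, it has to run inside a sandbox with memory and time limits rather than as a direct call.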

I also published some significant updates to existing plugins:

  • datasette-upload-csvs got a long-overdue improvement allowing it to upload CSVs to a specified database, rather than just using the first available one. As part of this I completely re-engineered how it works in terms of threading strategies, as described in issue 38. Plus it’s now tested against the Datasette 1.0 alpha series in addition to 0.x stable.

Plugins for LLM

LLM is my command-line tool and Python library for interacting with Large Language Models. I released one new plugin for that:

I released updates for two LLM plugins as well:

I finally started hacking on a llm-rag plugin which will provide an implementation of Retrieval Augmented Generation for LLM, similar to the process I describe in Embedding paragraphs from my blog with E5-large-v2.

I’ll write more about that once it’s in an interesting state.

shot-scraper 1.4

shot-scraper is my CLI tool for taking screenshots of web pages and running scraping code against them using JavaScript, built on top of Playwright.

I dropped into the repo to add HTTP Basic authentication support and found several excellent PRs waiting to be merged, so I bundled those together into a new release.

Here are the full release notes for shot-scraper 1.4:

  • New --auth-username x --auth-password y options for each shot-scraper command, allowing a username and password to be set for HTTP Basic authentication. #140
  • shot-scraper URL --interactive mode now respects the -w and -h arguments setting the size of the browser viewport. Thanks, mhalle. #128
  • New --scale-factor option for setting scale factors other than 2 (for retina). Thanks, Niel Thiart. #136
  • New --browser-arg option for passing extra browser arguments (such as --browser-arg "--font-render-hinting=none") through to the underlying browser. Thanks, Niel Thiart. #137
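
For anyone unfamiliar with HTTP Basic authentication, here is a small sketch of what a username and password pair turns into on the wire. It only illustrates the standard header format and says nothing about how shot-scraper or Playwright handle it internally. The basic_auth_header function is a hypothetical helper, and x and y are the placeholder values from the release note above.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic authentication.

    Hypothetical helper for illustration: the credentials are joined with
    a colon and base64-encoded, which is an encoding, not encryption.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_header("x", "y")
# headers == {"Authorization": "Basic eDp5"}
```

Since base64 is trivially reversible, Basic authentication is only safe to use over HTTPS.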

Miscellaneous other projects

  • We had some pretty severe storms in the San Francisco Bay Area last week, which inspired me to revisit my old PG&E outage scraper. PG&E’s outage map changed and broke that a couple of years ago, but I got a new scraper up and running just in time to start capturing outages.
  • I’ve been wanting a way to quickly create additional labels for my GitHub repositories for a while. I finally put together a simple system for that based on GitHub Actions, described in this TIL: Creating GitHub repository labels with an Actions workflow.