Weeknotes: a Datasette release, an LLM release and a bunch of new plugins
9th February 2024
I wrote extensive annotated release notes for Datasette 1.0a8 and LLM 0.13 already. Here’s what else I’ve been up to these past three weeks.
New plugins for Datasette
- datasette-proxy-url is a very simple plugin that lets you configure a path within Datasette that serves content proxied from another URL.
I built this one because I ran into a bug with Substack where Substack were denying requests to my newsletter’s RSS feed from code running in GitHub Actions! Frustrating, since the whole point of RSS is to be retrieved by bots.
I solved it by deploying a quick proxy to a Datasette instance I already had up and running, effectively treating Datasette as a cheap deployment platform for random pieces of proxying infrastructure.
- datasette-homepage-table lets you configure Datasette to display a specific table as the homepage of the instance. I’ve wanted this for a while myself; someone requested it on the Datasette Discord and it turned out to be pretty quick to build.
- datasette-events-db hooks into the new events mechanism in Datasette 1.0a8 and logs any events (create-table, login etc) to a datasette_events table. I released this partly as a debugging tool and partly because I like to ensure every Datasette plugin hook has at least one released plugin that uses it.
- datasette-enrichments-quickjs was this morning’s project. It’s a plugin for Datasette Enrichments that takes advantage of the quickjs Python package (a wrapper around the excellent QuickJS engine) to support running a custom JavaScript function against every row in a table to populate a new column.
QuickJS appears to provide a robust sandbox, including both memory and time limits! I need to write more about this plugin: it opens up some very exciting new possibilities for Datasette.
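To make the "run a function against every row to populate a new column" idea concrete, here is a minimal sketch using only Python's standard library. It is not the plugin's implementation: a plain Python callable stands in for the sandboxed JavaScript function, there are no memory or time limits, and `enrich_table`, the `items` table and its columns are all hypothetical names chosen for illustration.

```python
import sqlite3


def enrich_table(conn, table, new_column, fn):
    """Populate new_column on every row of table with fn(row_as_dict).

    Hypothetical helper: table and column names are interpolated into SQL,
    so they must come from trusted input.
    """
    conn.row_factory = sqlite3.Row
    # Read the existing rows first, keeping rowid so each one can be updated.
    rows = conn.execute(f'SELECT rowid AS _rowid, * FROM "{table}"').fetchall()
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{new_column}"')
    for row in rows:
        data = {key: row[key] for key in row.keys() if key != "_rowid"}
        conn.execute(
            f'UPDATE "{table}" SET "{new_column}" = ? WHERE rowid = ?',
            (fn(data), row["_rowid"]),
        )
    conn.commit()


# Toy data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [("alpha", 10), ("beta", 250)])

# Stand-in for the user-supplied function that would run inside the sandbox.
enrich_table(
    conn, "items", "size", lambda row: "large" if row["quantity"] > 100 else "small"
)
```

A real enrichment would also need to handle batching and errors raised by the user's function, which this sketch leaves out.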
I also published some significant updates to existing plugins:
- datasette-upload-csvs got a long-overdue improvement allowing it to upload CSVs to a specified database, rather than just using the first available one. As part of this I completely re-engineered how it works in terms of threading strategies, as described in issue 38. Plus it’s now tested against the Datasette 1.0 alpha series in addition to 0.x stable.
Plugins for LLM
LLM is my command-line tool and Python library for interacting with Large Language Models. I released one new plugin for that:
- llm-embed-onnx is a thin wrapper on top of onnx_embedding_models by Benjamin Anderson, which itself wraps the powerful ONNX Runtime. It makes several new embedding models available for use with LLM, listed in the README.
I released updates for two LLM plugins as well:
- llm-gpt4all got a release with improvements from three contributors. I’ll quote the release notes in full:
  - Now provides access to model options such as -o max_tokens 3. Thanks, Mauve Signweaver. #3
  - Models now work without an internet connection. Thanks, Cameron Yick. #10
  - Documentation now includes the location of the model files. Thanks, Werner Robitza. #21
- llm-sentence-transformers now has a llm sentence-transformers register --trust-remote-code option, which was necessary to support the newly released nomic-embed-text-v1 embedding model.
I finally started hacking on a llm-rag plugin which will provide an implementation of Retrieval Augmented Generation for LLM, similar to the process I describe in Embedding paragraphs from my blog with E5-large-v2.
I’ll write more about that once it’s in an interesting state.
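For readers unfamiliar with the pattern, the retrieval half of Retrieval Augmented Generation can be sketched in a few lines of plain Python. This is an illustration only, not the llm-rag plugin: the vectors are hand-written toy values rather than output from a real embedding model, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(question_vector, documents, k=2):
    """Return the k documents whose vectors are most similar to the question."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(question_vector, doc["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Paste the retrieved passages into a prompt ahead of the question."""
    context = "\n\n".join(passage["text"] for passage in passages)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )


# Toy vectors, invented for this example; a real system would use an embedding model.
documents = [
    {"text": "shot-scraper takes screenshots of web pages.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Datasette publishes SQLite databases.", "vector": [0.1, 0.9, 0.1]},
    {"text": "LLM is a command-line tool for language models.", "vector": [0.0, 0.2, 0.9]},
]
question = "How do I take a screenshot?"
question_vector = [0.8, 0.2, 0.1]

top = retrieve(question_vector, documents, k=1)
prompt = build_prompt(question, top)
```

The resulting prompt would then be sent to a language model, which is the "generation" step that this sketch omits.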
shot-scraper 1.4
shot-scraper is my CLI tool for taking screenshots of web pages and running scraping code against them using JavaScript, built on top of Playwright.
I dropped into the repo to add HTTP Basic authentication support and found several excellent PRs waiting to be merged, so I bundled those together into a new release.
Here are the full release notes for shot-scraper 1.4:
- New --auth-username x --auth-password y options for each shot-scraper command, allowing a username and password to be set for HTTP Basic authentication. #140
- shot-scraper URL --interactive mode now respects the -w and -h arguments setting the size of the browser viewport. Thanks, mhalle. #128
- New --scale-factor option for setting scale factors other than 2 (for retina). Thanks, Niel Thiart. #136
- New --browser-arg option for passing extra browser arguments (such as --browser-arg "--font-render-hinting=none") through to the underlying browser. Thanks, Niel Thiart. #137
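As background on the first item, HTTP Basic authentication sends the username and password joined by a colon and base64-encoded in an Authorization header. The sketch below shows only that encoding step. It makes no claim about how shot-scraper or the underlying browser passes credentials internally, and `basic_auth_header` is a hypothetical helper name.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by the HTTP Basic scheme."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Same placeholder credentials as the x / y values in the release notes.
headers = basic_auth_header("x", "y")
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over HTTPS.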
Miscellaneous other projects
- We had some pretty severe storms in the San Francisco Bay Area last week, which inspired me to revisit my old PG&E outage scraper. PG&E’s outage map changed and broke that a couple of years ago, but I got a new scraper up and running just in time to start capturing outages.
- I’ve been wanting a way to quickly create additional labels for my GitHub repositories for a while. I finally put together a simple system for that based on GitHub Actions, described in this TIL: Creating GitHub repository labels with an Actions workflow.
Releases
- datasette-enrichments-quickjs 0.1a0 (2024-02-09): Enrich data with a custom JavaScript function
- datasette-events-db 0.1a0 (2024-02-08): Log Datasette events to a database table
- datasette 1.0a8 (2024-02-07): An open source multi-tool for exploring and publishing data
- shot-scraper 1.4 (2024-02-05): A command-line utility for taking automated screenshots of websites
- llm-sentence-transformers 0.2 (2024-02-04): LLM plugin for embeddings using sentence-transformers
- datasette-homepage-table 0.2 (2024-01-31): Show a specific Datasette table on the homepage
- datasette-upload-csvs 0.9 (2024-01-30): Datasette plugin for uploading CSV files and converting them to database tables
- llm-embed-onnx 0.1 (2024-01-28): Run embedding models using ONNX
- llm 0.13.1 (2024-01-27): Access large language models from the command-line
- llm-gpt4all 0.3 (2024-01-24): Plugin for LLM adding support for the GPT4All collection of models
- datasette-granian 0.1 (2024-01-23): Run Datasette using the Granian HTTP server
- datasette-proxy-url 0.1.1 (2024-01-23): Proxy a URL through a Datasette instance
TILs