Weeknotes: the Datasette Cloud API, a podcast appearance and more

1st October 2023

Datasette Cloud now has a documented API, plus a podcast appearance, some LLM plugins work and some geospatial excitement.

The Datasette Cloud API

My biggest achievement this week is that I documented and announced the API for Datasette Cloud.

I wrote about this at length in Getting started with the Datasette Cloud API on the Datasette Cloud blog. I also used this as an opportunity to start a documentation site for the service, now available at datasette.cloud/docs.

The API is effectively the Datasette 1.0 alpha write API, described here previously. You can use it to read data from and write data to a Datasette Cloud space, with fine-grained permissions (powered by the datasette-auth-tokens plugin) so you can create tokens that are restricted to specific actions against specified tables.

The blog entry about it doubles as a tutorial, describing how I wrote code to import the latest documents from the US Government Federal Register into a Datasette Cloud space, using a dependency-free Python script and GitHub Actions.
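Here's a minimal sketch of what a dependency-free write request can look like, using only the Python standard library. It is not the code from the tutorial: the space URL, table name, row contents and token are all hypothetical placeholders, and it assumes the /-/insert endpoint and Bearer token header from the Datasette 1.0 alpha write API. It builds the request without sending it.

```python
import json
import urllib.request


def build_insert_request(base_url, database, table, rows, token):
    """Build (but do not send) a POST request that inserts rows into a table."""
    # Assumed URL shape from the Datasette 1.0 alpha write API
    url = f"{base_url}/{database}/{table}/-/insert"
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical space URL, table and token, for illustration only
request = build_insert_request(
    "https://example-space.datasette.cloud",
    "data",
    "documents",
    [{"id": 1, "title": "Example document"}],
    "dstok_example",
)
# To actually send it: urllib.request.urlopen(request)
```

A token restricted to inserting rows into that one table should be enough for a script like this, which is the kind of scoping the permissions system is designed for.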

You can see that code in the new federal-register-to-datasette GitHub repository. It’s pretty small—just 70 lines of Python and 22 of YAML.

The more time I spend writing code against the Datasette API the more confident I get that it’s shaped in the right way. I’m happy to consider it stable for the 1.0 release now.

Talking Large Language Models with Rooftop Ruby

I recorded a podcast episode this week for Rooftop Ruby with Collin Donnell and Joel Drapper. It was a really high quality conversation—we went for about an hour and 20 minutes and covered a huge amount of ground.

After the podcast came out I took the MP3, ran it through MacWhisper and then spent several hours marking up speakers and editing the resulting text. I also added headings corresponding to the different topics we covered, along with inline links to other relevant material.

I’m really pleased with the resulting document, which you can find at Talking Large Language Models with Rooftop Ruby. It was quite a bit of work but I think it was worthwhile: I’ve since been able to answer some questions about LLMs on Mastodon and Twitter by linking directly to the part of the transcript that discussed them.

I also dropped in my own audio player, developed with GPT-4 assistance, and provided links from the different transcript sections that would jump the audio to that point in the conversation.

Also this week: while closing a bunch of VS Code tabs I stumbled across a partially written blog entry about Things I’ve learned about building CLI tools in Python, so I finished that up and published it.

I’m trying to leave fewer unfinished projects lying around on my computer, so if something is 90% finished I’ll try to wrap it up and put it out there to get it off my ever-expanding plate.

llm-llama-cpp

LLM has started to collect a small but healthy community on Discord, which is really exciting.

My absolute favourite community project so far is Drew Breunig’s Faucet Finder, which he described in Finding Bathroom Faucets with Embeddings. He used llm-clip to calculate embeddings for 20,000 pictures of faucets, then ran both similarity and text search against them to help renovate his bathroom. It’s really fun!
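To give a rough idea of what similarity search over embeddings involves, here's a toy sketch using cosine similarity. This is not Drew's code: the tiny two-dimensional vectors and file names are made up for illustration, where real CLIP embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, items):
    """Return item names ordered from most to least similar to the query."""
    return sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )


# Hypothetical embeddings keyed by image file name
embeddings = {
    "brass-faucet.jpg": [0.9, 0.1],
    "chrome-faucet.jpg": [0.2, 0.8],
}
ranked = most_similar([1.0, 0.0], embeddings)
```

The same ranking works for text search when the model embeds text and images into the same vector space, which is what makes CLIP useful here.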

I shipped a new version of the llm-llama-cpp plugin this week which was mostly written by other people: llm-llama-cpp 0.2b1. Alexis Métaireau and LoopControl submitted fixes to extend the default max token limit (fixing a frustrating issue with truncated responses) and to allow for increasing the number of GPU layers used to run the models.

I also shipped LLM 0.11, the main feature of which was support for the new OpenAI gpt-3.5-turbo-instruct model. I really need to split the OpenAI support out into a separate plugin so I can ship fixes to that without having to release the core LLM package.

And I put together an llm-plugin cookiecutter template, which I plan to use for all of my plugins going forward.

Getting excited about TG and sqlite-tg

TG is a brand new C library from Tile38 creator Josh Baker. It’s really exciting: it provides a set of fast geospatial operations—the exact subset I usually find myself needing, based around polygon intersections, GeoJSON, WKT, WKB and geospatial indexes—implemented with zero external dependencies. It’s shipped as a single C file, reminiscent of the SQLite amalgamation.

I noted in a few places that it could make a great SQLite extension... and Alex Garcia fell victim to my blatant nerd-sniping and built the first version of sqlite-tg within 24 hours!

I wrote about my own explorations of Alex’s work in Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. I’m thrilled at the idea of having a tiny, lightweight alternative to SpatiaLite as an addition to the Datasette ecosystem, and the SQLite world in general.

Two tiny Datasette releases

I released two dot-releases for Datasette.

Both of these feature the same fix, described in Issue 2189: Server hang on parallel execution of queries to named in-memory databases.

Short version: it turns out the experimental work I did a while ago to try running SQL queries in parallel was causing threading deadlock issues against in-memory named SQLite databases. No-one had noticed because those are only available within Datasette plugins, but I’d started to experience them as I started writing my own plugins that used that feature.
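For anyone unfamiliar with named in-memory databases, here's a minimal sketch of the underlying SQLite feature using Python's sqlite3 module. It does not reproduce the deadlock or show how Datasette manages its connections, and the database name is a hypothetical one. It just shows two connections sharing one in-memory database through a shared-cache URI.

```python
import sqlite3

# "example_shared" is a hypothetical database name
URI = "file:example_shared?mode=memory&cache=shared"

# Both connections attach to the same named in-memory database
writer = sqlite3.connect(URI, uri=True)
reader = sqlite3.connect(URI, uri=True)

writer.execute("create table notes (id integer primary key, body text)")
writer.execute("insert into notes (body) values (?)", ("hello",))
# Commit so the other connection can see the row
writer.commit()

rows = reader.execute("select id, body from notes").fetchall()
```

The database only exists while at least one connection to it remains open. Because several connections share the same cache, queries running on different threads can end up contending for the same locks.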

ChatGPT in the newsroom

I signed up for a MOOC (Massive Open Online Course) about journalism and ChatGPT!

How to use ChatGPT and other generative AI tools in your newsrooms is being taught by Aimee Rinehart and Sil Hamilton for the Knight Center.

I actually found out about it because people were being snarky about it on Twitter. That’s not a big surprise—there are many obvious problems with applying generative AI to journalism.

As you would hope, this course is not a hype-filled pitch for writing AI-generated news stories. It’s a conversation between literally thousands of journalists around the world about the ethical and practical implications of this technology.

I’m really enjoying it. I’m learning a huge amount about how people experience AI tools, the kinds of questions they have about them and the kinds of journalism problems that make sense for them to solve.

Releases this week

datasette-remote-actors 0.1a2—2023-09-28
Datasette plugin for fetching details of actors from a remote endpoint
llm-llama-cpp 0.2b1—2023-09-28
LLM plugin for running models using llama.cpp
datasette-auth-tokens 0.4a4—2023-09-26
Datasette plugin for authenticating access using API tokens
datasette 1.0a7—2023-09-21
An open source multi-tool for exploring and publishing data
datasette-upload-dbs 0.3.1—2023-09-20
Upload SQLite database files to Datasette
datasette-mask-columns 0.2.2—2023-09-20
Datasette plugin that masks specified database columns
llm 0.11—2023-09-19
Access large language models from the command-line

TIL this week

Understanding the CSS auto-resizing textarea trick—2023-09-30
Snapshot testing with Syrupy—2023-09-26
Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg—2023-09-25
Trying out the facebook/musicgen-small sound generation model—2023-09-23

Posted 1st October 2023 at 12:03 am

Simon Willison’s Weblog