Weeknotes: the Datasette Cloud API, a podcast appearance and more
1st October 2023
Datasette Cloud now has a documented API, plus a podcast appearance, some LLM plugins work and some geospatial excitement.
The Datasette Cloud API
My biggest achievement this week is that I documented and announced the API for Datasette Cloud.
I wrote about this at length in Getting started with the Datasette Cloud API on the Datasette Cloud blog. I also used this as an opportunity to start a documentation site for the service, now available at datasette.cloud/docs.
The API is effectively the Datasette 1.0 alpha write API, described here previously. You can use the API to both read and write data in a Datasette Cloud space, with fine-grained permissions (powered by the datasette-auth-tokens plugin) so you can create tokens that are restricted to actions against just the specified tables.
The blog entry about it doubles as a tutorial, describing how I wrote code to import the latest documents from the US Government Federal Register into a Datasette Cloud space, using a dependency-free Python script and GitHub Actions.
You can see that code in the new federal-register-to-datasette GitHub repository. It’s pretty small—just 70 lines of Python and 22 of YAML.
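To give a feel for what writing to the API looks like, here is a minimal sketch in dependency-free Python. It is not the script from the repository. The `/-/insert` endpoint and the `{"rows": [...]}` body come from the Datasette 1.0 alpha write API. The base URL, the `data` database name, the `documents` table name and the token value are all hypothetical placeholders.

```python
import json
import urllib.request


def build_insert_request(base_url, database, table, rows, token):
    """Build (but do not send) a POST request that inserts rows into a table."""
    url = f"{base_url.rstrip('/')}/{database}/{table}/-/insert"
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # An API token created for the space, sent as a bearer token
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def insert_rows(base_url, database, table, rows, token):
    """Send the insert request and return the decoded JSON response."""
    request = build_insert_request(base_url, database, table, rows, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical values: substitute your own space URL, table and token
    request = build_insert_request(
        "https://example.datasette.cloud",
        "data",
        "documents",
        [{"id": 1, "title": "Example document"}],
        "YOUR_API_TOKEN",
    )
    print(request.get_method(), request.full_url)
```

Splitting the request construction from the network call makes it easy to inspect exactly what would be sent before pointing the script at a real space.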
The more time I spend writing code against the Datasette API the more confident I get that it’s shaped in the right way. I’m happy to consider it stable for the 1.0 release now.
Talking Large Language Models with Rooftop Ruby
I recorded a podcast episode this week for Rooftop Ruby with Collin Donnell and Joel Drapper. It was a really high quality conversation—we went for about an hour and 20 minutes and covered a huge amount of ground.
After the podcast came out I took the MP3, ran it through MacWhisper and then spent several hours marking up speakers and editing the resulting text. I also added headings corresponding to the different topics we covered, along with inline links to other relevant material.
I’m really pleased with the resulting document, which you can find at Talking Large Language Models with Rooftop Ruby. It was quite a bit of work but I think it was worthwhile. I’ve since been able to answer some questions about LLMs on Mastodon and Twitter by linking directly to the part of the transcript that discussed them.
I also dropped in my own audio player, developed with GPT-4 assistance, and provided links from the different transcript sections that would jump the audio to that point in the conversation.
Also this week: while closing a bunch of VS Code tabs I stumbled across a partially written blog entry about Things I’ve learned about building CLI tools in Python, so I finished that up and published it.
I’m trying to leave fewer unfinished projects lying around on my computer, so if something is 90% finished I’ll try to wrap it up and put it out there to get it off my ever-expanding plate.
LLM has started to collect a small but healthy community on Discord, which is really exciting.
My absolute favourite community project so far is Drew Breunig’s Faucet Finder, which he described in Finding Bathroom Faucets with Embeddings. He used llm-clip to calculate embeddings for 20,000 pictures of faucets, then ran both similarity and text search against them to help renovate his bathroom. It’s really fun!
I shipped a new version of the llm-llama-cpp plugin this week which was mostly written by other people: llm-llama-cpp 0.2b1. Alexis Métaireau and LoopControl submitted fixes to extend the default max token limit (fixing a frustrating issue with truncated responses) and to allow for increasing the number of GPU layers used to run the models.
I also shipped LLM 0.11, the main feature of which was support for the new OpenAI gpt-3.5-turbo-instruct model. I really need to split the OpenAI support out into a separate plugin so I can ship fixes to that without having to release the core LLM package.
And I put together an llm-plugin cookiecutter template, which I plan to use for all of my plugins going forward.
Getting excited about TG and sqlite-tg
TG is a brand new C library from Tile38 creator Josh Baker. It’s really exciting: it provides a set of fast geospatial operations—the exact subset I usually find myself needing, based around polygon intersections, GeoJSON, WKT, WKB and geospatial indexes—implemented with zero external dependencies. It’s shipped as a single C file, reminiscent of the SQLite amalgamation.
I wrote about my own explorations of Alex Garcia’s sqlite-tg, a SQLite extension built on TG, in Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. I’m thrilled at the idea of having a tiny, lightweight alternative to SpatiaLite as an addition to the Datasette ecosystem, and the SQLite world in general.
Two tiny Datasette releases
I released two dot-releases for Datasette.
Both of these feature the same fix, described in Issue 2189: Server hang on parallel execution of queries to named in-memory databases.
Short version: it turns out the experimental work I did a while ago to try running SQL queries in parallel was causing threading deadlock issues against in-memory named SQLite databases. No-one had noticed because those are only available within Datasette plugins, but I’d started to experience them as I started writing my own plugins that used that feature.
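For anyone unfamiliar with named in-memory databases, here is a small sketch of the underlying SQLite feature using Python’s standard `sqlite3` module. This is not Datasette’s code and it does not reproduce the deadlock. It only shows how two separate connections can share one in-memory database by name. The name `demo_shared` and the `notes` table are made up for illustration.

```python
import sqlite3

# Hypothetical database name; any connection using this URI shares the data
URI = "file:demo_shared?mode=memory&cache=shared"


def shared_memory_demo():
    """Write through one connection, read through another."""
    first = sqlite3.connect(URI, uri=True)
    second = sqlite3.connect(URI, uri=True)
    try:
        first.execute("create table notes (id integer primary key, body text)")
        first.execute("insert into notes (body) values (?)", ("hello",))
        # Commit so the write is visible to (and not locked against) the reader
        first.commit()
        return second.execute("select body from notes").fetchall()
    finally:
        # The in-memory database disappears once the last connection closes
        first.close()
        second.close()


if __name__ == "__main__":
    print(shared_memory_demo())
```

Because every connection to the same name shares one database, several threads running queries against it at once have to coordinate access, which is the kind of situation where locking problems can appear.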
ChatGPT in the newsroom
I signed up for a MOOC (Massive Open Online Course) about journalism and ChatGPT!
How to use ChatGPT and other generative AI tools in your newsrooms is being taught by Aimee Rinehart and Sil Hamilton for the Knight Center.
I actually found out about it because people were being snarky about it on Twitter. That’s not a big surprise—there are many obvious problems with applying generative AI to journalism.
As you would hope, this course is not a hype-filled pitch for writing AI-generated news stories. It’s a conversation between literally thousands of journalists around the world about the ethical and practical implications of this technology.
I’m really enjoying it. I’m learning a huge amount about how people experience AI tools, the kinds of questions they have about them and the kinds of journalism problems that make sense for them to solve.
Releases this week
Datasette plugin for fetching details of actors from a remote endpoint
LLM plugin for running models using llama.cpp
Datasette plugin for authenticating access using API tokens
An open source multi-tool for exploring and publishing data
Upload SQLite database files to Datasette
Datasette plugin that masks specified database columns
Access large language models from the command-line
TIL this week