Simon Willison’s Weblog

Recent entries

Weeknotes: Getting ready for NICAR five days ago

Next week is NICAR 2024 in Baltimore—the annual data journalism conference hosted by Investigative Reporters and Editors. I’m running a workshop on Datasette, and I plan to spend most of my time in the hallway track talking to people about Datasette, Datasette Cloud and how the Datasette ecosystem can best help support their work.

I’ve been working with Alex Garcia to get Datasette Cloud ready for the conference. We have a few new features that we’re putting the final touches on, in addition to ensuring features like Datasette Enrichments and Datasette Comments are in good shape for the event.

Releases

  • llm-mistral 0.3—2024-02-26
    LLM plugin providing access to Mistral models using the Mistral API

Mistral released Mistral Large this morning, so I rushed out a new release of my llm-mistral plugin to add support for it.

pipx install llm
llm install llm-mistral --upgrade
llm keys set mistral
# <Paste in your Mistral API key>
llm -m mistral-large 'Prompt goes here'

The plugin now hits the Mistral API endpoint that lists models (via a cache), which means future model releases should be supported automatically without needing a new plugin release.
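As a rough illustration of that "list models via a cache" idea, here is a minimal sketch. It is not the plugin's actual code: `cached_model_ids`, the cache file layout and the one hour expiry are all assumptions made up for this example, and `fetch_models` stands in for whatever function performs the real HTTP request.

```python
import json
import tempfile
import time
from pathlib import Path


def cached_model_ids(cache_path, fetch_models, max_age_seconds=3600, now=time.time):
    """Return model IDs, refreshing the on-disk cache when it is missing or stale."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        cached = json.loads(cache_path.read_text())
        if now() - cached["fetched_at"] < max_age_seconds:
            return cached["model_ids"]
    # Cache missing or stale: call the (caller-supplied) fetch function
    model_ids = fetch_models()
    cache_path.write_text(json.dumps({"fetched_at": now(), "model_ids": model_ids}))
    return model_ids


# Demo with a fake fetcher standing in for the real API call
calls = []


def fake_fetch():
    calls.append(1)
    return ["mistral-tiny", "mistral-large"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "models.json"
    first = cached_model_ids(path, fake_fetch)
    second = cached_model_ids(path, fake_fetch)  # served from the cache file
```

The second call never touches `fake_fetch`, which is the property that lets a newly released model show up after the cache expires without a new plugin release.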

  • dclient 0.3—2024-02-25
    A client CLI utility for Datasette instances

dclient provides a tool for interacting with a remote Datasette instance. You can use it to run queries:

dclient query https://datasette.io/content \
  "select * from news limit 3"

You can set aliases for your Datasette instances:

dclient alias add simon https://simon.datasette.cloud/data

And for Datasette 1.0 alpha instances with the write API (as seen on Datasette Cloud) you can insert data into a new or an existing table:

dclient auth add simon
# <Paste in your API token>
dclient insert simon my_new_table data.csv --create

The 0.3 release adds improved support for streaming data into a table. You can run a command like this:

tail -f log.ndjson | dclient insert simon my_table \
  --nl - --interval 5 --batch-size 20

The --interval 5 option is new: it means that records will be written to the API if 5 seconds have passed since the last write. --batch-size 20 means that records will be written in batches of 20, and will be sent as soon as the batch is full or the interval has passed.
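The interplay between those two options can be sketched in a few lines of Python. This is an illustration of the flush rule described above, not dclient's implementation: the `Batcher` class and its `send` callback are hypothetical names invented for this example.

```python
import time


class Batcher:
    """Collect records and flush when the batch is full or the interval has passed."""

    def __init__(self, send, batch_size=20, interval=5.0, clock=time.monotonic):
        self.send = send  # callable that writes a list of records somewhere
        self.batch_size = batch_size
        self.interval = interval
        self.clock = clock
        self.pending = []
        self.last_write = clock()

    def add(self, record):
        self.pending.append(record)
        self.maybe_flush()

    def maybe_flush(self):
        full = len(self.pending) >= self.batch_size
        overdue = self.clock() - self.last_write >= self.interval
        if self.pending and (full or overdue):
            self.send(self.pending)
            self.pending = []
            self.last_write = self.clock()


# Demo using a fake clock so the timing is deterministic
now = [0.0]
sent = []
batcher = Batcher(sent.append, batch_size=3, interval=5.0, clock=lambda: now[0])
for i in range(3):
    batcher.add({"n": i})  # the third record fills the batch
now[0] = 6.0
batcher.add({"n": 3})  # more than 5 seconds since the last write, so this goes out alone
```

Note that this sketch only checks the clock when a record arrives. A tool reading from a slow stream would also need some kind of timer so that a partial batch is not left waiting for the next line.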

I wrote about the new Datasette Events mechanism in the 1.0a8 release notes. This new plugin was originally built for Datasette Cloud—it forwards analytical events from an instance to a central analytics instance. Using Datasette Cloud for analytics for Datasette Cloud is a pleasing exercise in dogfooding.

A tiny cosmetic bug fix.

  • datasette 1.0a11—2024-02-19
    An open source multi-tool for exploring and publishing data

I’m increasing the frequency of the Datasette 1.0 alphas. This one has a minor permissions fix (the ability to replace a row using the insert API now requires the update-row permission) and a small cosmetic fix which I’m really pleased with: the menus displayed by the column action menu now align correctly with their cog icon!

Clicking on a cog icon now shows a menu directly below that icon, with a little grey arrow in the right place to align with the icon that was clicked

This is a pretty significant release: it adds finely-grained permission support such that Datasette’s core create-table, alter-table and drop-table permissions are now respected by the plugin.

The alter-table permission was introduced in Datasette 1.0a9 a couple of weeks ago.

When testing permissions it’s useful to have a really convenient way to sign in to Datasette using different accounts. This plugin provides that, but only if you start Datasette with custom plugin configuration or by using this new 1.0 alpha shortcut setting option:

datasette -s plugins.datasette-unsafe-actor-debug.enabled 1

An experiment in bundling plugins. pipx install datasette-studio gets you an installation of Datasette under a separate alias—datasette-studio—which comes preconfigured with a set of useful plugins.

The really fun thing about this one is that the entire package is defined by a pyproject.toml file, with no additional Python code needed. Here’s a truncated copy of that TOML:

[project]
name = "datasette-studio"
version = "0.1a0"
description = "Datasette pre-configured with useful plugins"
requires-python = ">=3.8"
dependencies = [
    "datasette>=1.0a10",
    "datasette-edit-schema",
    "datasette-write-ui",
    "datasette-configure-fts",
    "datasette-write",
]

[project.entry-points.console_scripts]
datasette-studio = "datasette.cli:cli"

I think it’s pretty neat that a full application can be defined like this in terms of 5 dependencies and a custom console_scripts entry point.

Datasette Studio is still very experimental, but I think it’s pointing in a promising direction.

This resolves a dreaded “database locked” error I was seeing occasionally in Datasette Cloud.

Short version: SQLite, when running in WAL mode, is almost immune to those errors... provided you remember to run all write operations in short, well-defined transactions.

I’d forgotten to do that in this plugin and it was causing problems.

After shipping this release I decided to make it much harder to make this mistake in the future, so I released Datasette 1.0a10 which now automatically wraps calls to database.execute_write_fn() in a transaction even if you forget to do so yourself.
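The general pattern looks something like this with Python's standard `sqlite3` module. This is a sketch of the idea rather than Datasette's own code: `write_in_transaction` is a hypothetical helper, and it leans on the fact that a `sqlite3` connection used as a context manager commits on success and rolls back on an exception.

```python
import sqlite3
import tempfile
from pathlib import Path


def write_in_transaction(conn, fn):
    """Run fn(conn) in a transaction: commit on success, roll back on error."""
    with conn:
        return fn(conn)


def good_write(conn):
    conn.execute("INSERT INTO log VALUES ('ok')")


def failing_write(conn):
    conn.execute("INSERT INTO log VALUES ('lost')")
    raise ValueError("something went wrong mid-write")


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(Path(tmp) / "demo.db")
    # WAL mode needs a file-backed database; the PRAGMA returns the new mode
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("CREATE TABLE log (message TEXT)")

    write_in_transaction(conn, good_write)
    try:
        write_in_transaction(conn, failing_write)
    except ValueError:
        pass  # the partial insert was rolled back, and the write lock released

    rows = [row[0] for row in conn.execute("SELECT message FROM log")]
    conn.close()
```

The important part is that the transaction always ends, one way or the other, as soon as the function returns or raises.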

Blog entries

My first full blog post of the year to end up on Hacker News, where it sparked a lively conversation with 489 comments!

TILs

Yet another experiment with audit tables in SQLite. This one uses a terrifying nested sequence of json_patch() calls to assemble a JSON document describing the change made to the table.

Val Town is a very neat attempt at solving another of my favourite problems: how to execute user-provided code safely in a sandbox. It turns out to be the perfect mechanism for running simple scheduled functions such as code that reads data and writes it to Datasette Cloud using the write API.

FIPS is the Federal Information Processing Standard, and systems that obey it refuse to run Datasette due to its use of MD5 hash functions. I figured out how to get that to work anyway, since Datasette’s MD5 usage is purely cosmetic, not cryptographic.

This actually showed up on Hacker News without me noticing until a few days later, where many people told me that I should rewire my existing Ethernet cables rather than resorting to more exotic solutions.

I guess this is another super lightweight form of RAG: you can use the rg context options (include X lines before/after each match) to assemble just enough context to get useful answers to questions about code.

The killer app of Gemini Pro 1.5 is video 10 days ago

Last week Google introduced Gemini Pro 1.5, an enormous upgrade to their Gemini series of AI models.

Gemini Pro 1.5 has a 1,000,000 token context size. This is huge—previously that record was held by Claude 2.1 (200,000 tokens) and gpt-4-turbo (128,000 tokens)—though the difference in tokenizer implementations between the models means this isn’t a perfectly direct comparison.

I’ve been playing with Gemini Pro 1.5 for a few days, and I think the most exciting feature isn’t so much the token count... it’s the ability to use video as an input.

I’ve been accessing the model through the Google AI Studio interface—I don’t have API access yet.

I tried filling up the context with content from my blog, and got some good results. But things got really exciting when I began experimenting with video uploads (after some false starts—the team have been actively fixing bugs, with variable bitrate support landing just this morning).

I took this seven second video of one of my bookshelves:

Then I uploaded it with the following prompt:

JSON array of books in this video

That 7 second video consumed just 1,841 tokens out of my 1,048,576 token limit.

Gemini Pro 1.5 didn’t return JSON, but it did reply with this:

It is not possible to fully identify all the books in the video. However, here are some of the books that can be seen:

  • The Lord of the Rings by J.R.R. Tolkien
  • Structure and Interpretation of Computer Programs by Harold Abelson and Gerald Jay Sussman
  • Rework by Jason Fried and David Heinemeier Hansson
  • The Hacker Ethic and the Spirit of the Information Age by Pekka Himanen
  • The Google Story by David A. Vise
  • The Art of Agile Development by James Shore and Shane Warden
  • Creative Code by John Maeda
  • Web Operations by John Allspaw and Jesse Robbins
  • Python Framework by David Mertz
  • The Personal MBA by Josh Kaufman
  • Growing Up with Lucy by April Henry
  • Getting Things Done by David Allen
  • Site-Seeing: A Visual Approach to Web Usability by Luke Wroblewski
  • JavaScript: The Definitive Guide by David Flanagan
  • Building Accessible Websites by Joe Clark
  • Information Architecture for the World Wide Web by Louis Rosenfeld and Peter Morville
  • Don’t Make Me Think: A Common Sense Approach to Web Usability by Steve Krug
  • Programming Erlang: Software for a Concurrent World by Joe Armstrong
  • XSLT 2.0 Web Development by Dmitry Kirsanov
  • Pragmatic Version Control Using Git by Travis Swicegood
  • Party Animals: My Family and Other Communists by David Aaronovitch

A follow-up prompt:

as a JSON array of objects, with title and author keys

Gave me that back as JSON:

[
  {
    "title": "The Lord of the Rings",
    "author": "J.R.R. Tolkien"
  },
  {
    "title": "Structure and Interpretation of Computer Programs",
    "author": "Harold Abelson and Gerald Jay Sussman"
  },
  {
    "title": "Rework",
    "author": "Jason Fried and David Heinemeier Hansson"
  },
  {
    "title": "The Hacker Ethic and the Spirit of the Information Age",
    "author": "Pekka Himanen"
  },
  {
    "title": "The Google Story",
    "author": "David A. Vise"
  },
  {
    "title": "The Art of Agile Development",
    "author": "James Shore and Shane Warden"
  },
  {
    "title": "Creative Code",
    "author": "John Maeda"
  },
  {
    "title": "Web Operations",
    "author": "John Allspaw and Jesse Robbins"
  },
  {
    "title": "Python Framework",
    "author": "David Mertz"
  },
  {
    "title": "The Personal MBA",
    "author": "Josh Kaufman"
  },
  {
    "title": "Growing Up with Lucy",
    "author": "April Henry"
  },
  {
    "title": "Getting Things Done",
    "author": "David Allen"
  },
  {
    "title": "Site-Seeing: A Visual Approach to Web Usability",
    "author": "Luke Wroblewski"
  },
  {
    "title": "JavaScript: The Definitive Guide",
    "author": "David Flanagan"
  },
  {
    "title": "Building Accessible Websites",
    "author": "Joe Clark"
  },
  {
    "title": "Information Architecture for the World Wide Web",
    "author": "Louis Rosenfeld and Peter Morville"
  },
  {
    "title": "Don't Make Me Think: A Common Sense Approach to Web Usability",
    "author": "Steve Krug"
  },
  {
    "title": "Programming Erlang: Software for a Concurrent World",
    "author": "Joe Armstrong"
  },
  {
    "title": "XSLT 2.0 Web Development",
    "author": "Dmitry Kirsanov"
  },
  {
    "title": "Pragmatic Version Control Using Git",
    "author": "Travis Swicegood"
  },
  {
    "title": "Party Animals: My Family and Other Communists",
    "author": "David Aaronovitch"
  }
]

So how well did it do?

Honestly, I’m pretty astonished by this. The video is 7s long, it pans quite fast (there is some motion blur in the video) and some of the books are partially obscured by items.

It didn’t get all of them, but it did about as good a job as I could have done given the same video.

I’m particularly impressed that it got “Site-Seeing: A Visual Approach to Web Usability” by Luke Wroblewski, where the book itself is almost entirely obscured by a squirrel leaving only “Site-Seeing” visible.

I did spot one hallucination: it lists “The Personal MBA by Josh Kaufman”, but I don’t own that book—the closest thing in the video is a blurry few frames of a book called “The Beermat Entrepreneur”.

I decided to try it again with a different video and hopefully a better prompt. This time I videoed a bookshelf mostly full of cookbooks—this time in portrait, a bit longer (22 seconds) and panning down as well as across.

This video cost me 6,049 tokens—still a pretty tiny allocation.

My new prompt was:

Output a JSON array of {"title": "...", "authors": "..."} objects for books in this video

And... it refused:

User: a video, containing a bunch of books visible in the first frame including THE COCKTAIL, SMUGGLER'S COVE, THE CURIOUS BARTENDER... 00:22 IMG_7592.MOV 6,049 tokens. Prompt: Output a JSON array of {"title": "...", "authors": "..."} objects for books in this video. Model: No content, with a box open that says Probability of unsafe content Dangerous Content High - Edit safety settings

It looks like the safety filter may have taken offense to the word “Cocktail”!

I opened up the safety settings, dialled them down to “low” for every category and tried again. It appeared to refuse a second time.

So I channelled Mrs Doyle and said:

go on give me that JSON

And it worked!

[
  {
    "title": "The Cocktail Guy",
    "authors": "Rich Woods"
  },
  {
    "title": "Smuggler's Cove",
    "authors": "Martin Cate"
  },
  {
    "title": "The Curious Bartender",
    "authors": "Tristan Stephenson"
  },
  {
    "title": "The Art of Fermentation",
    "authors": "Sandor Ellix Katz"
  },
  {
    "title": "The Noma Guide to Fermentation",
    "authors": "René Redzepi & David Zilber"
  },
  {
    "title": "Cocktails #7",
    "authors": "Difford's Guide"
  },
  {
    "title": "The Curious Cookbook",
    "authors": ""
  },
  {
    "title": "Rival Crock Pot Cooking",
    "authors": ""
  },
  {
    "title": "The Art of French Cooking",
    "authors": "Fernande Garvin"
  },
  {
    "title": "The Book on Pie",
    "authors": "Erin Jeanne McDowell"
  },
  {
    "title": "The New Taste of Chocolate",
    "authors": ""
  },
  {
    "title": "Vegan Cakes and Bakes",
    "authors": "Jérôme Eckmeier & Daniela Lais"
  },
  {
    "title": "Kitchen Creamery",
    "authors": ""
  },
  {
    "title": "Good Food 101 Teatime Treats",
    "authors": "BBC"
  },
  {
    "title": "Betty Crocker's Cookbook",
    "authors": ""
  },
  {
    "title": "The Martha Stewart Cookbook",
    "authors": ""
  },
  {
    "title": "Feast",
    "authors": "Nigella Lawson"
  },
  {
    "title": "Moosewood Restaurant New Classics",
    "authors": ""
  },
  {
    "title": "World Food Café",
    "authors": "Chris & Carolyn Caldicott"
  },
  {
    "title": "Everyday Thai Cooking",
    "authors": "Katie Chin"
  },
  {
    "title": "Vegetarian Indian Cooking with Instant Pot",
    "authors": "Manali Singh"
  },
  {
    "title": "The Southern Vegetarian Cookbook",
    "authors": "Justin Fox Burks & Amy Lawrence"
  },
  {
    "title": "Vegetarian Cookbook",
    "authors": ""
  },
  {
    "title": "Französische Küche",
    "authors": ""
  },
  {
    "title": "Sushi-Making at Home",
    "authors": ""
  },
  {
    "title": "Kosher Cooking",
    "authors": ""
  },
  {
    "title": "The New Empanadas",
    "authors": "Marlena Spieler"
  },
  {
    "title": "Instant Pot Vegetarian Cookbook for Two",
    "authors": ""
  },
  {
    "title": "Vegetarian",
    "authors": "Wilkes & Cartwright"
  },
  {
    "title": "Breakfast",
    "authors": ""
  },
  {
    "title": "Nadiya's Kitchen",
    "authors": "Nadiya Hussain"
  },
  {
    "title": "New Food for Thought",
    "authors": "Jane Noraika"
  },
  {
    "title": "Beyond Curry Indian Cookbook",
    "authors": "D'Silva Sankalp"
  },
  {
    "title": "The 5 O'Clock Cookbook",
    "authors": ""
  },
  {
    "title": "Food Lab",
    "authors": "J. Kenji López-Alt"
  },
  {
    "title": "The Cook's Encyclopedia",
    "authors": ""
  },
  {
    "title": "The Cast Iron Nation",
    "authors": "Lodge"
  },
  {
    "title": "Urban Cook Book",
    "authors": ""
  },
  {
    "title": "In Search of Perfection",
    "authors": "Heston Blumenthal"
  },
  {
    "title": "Perfection",
    "authors": "Heston Blumenthal"
  },
  {
    "title": "An Economist Gets Lunch",
    "authors": "Tyler Cowen"
  },
  {
    "title": "The Colman's Mustard Cookbook",
    "authors": "Pam Hartley"
  },
  {
    "title": "The Student Grub Guide",
    "authors": "Williams"
  },
  {
    "title": "Easy Meals for One & Two",
    "authors": ""
  },
  {
    "title": "Jack Monroe Tin Can Cook",
    "authors": ""
  },
  {
    "title": "Slow Cooker",
    "authors": ""
  },
  {
    "title": "The Students' Sausage, Egg, and Beans Cookbook",
    "authors": ""
  },
  {
    "title": "Quick & Easy Students' Cookbook",
    "authors": ""
  },
  {
    "title": "Student Cookbook Guide",
    "authors": ""
  },
  {
    "title": "The Best Little Marinades Cookbook",
    "authors": "Adler"
  },
  {
    "title": "The New Book of Middle Eastern Food",
    "authors": "Claudia Roden"
  },
  {
    "title": "Vegetarian Meals",
    "authors": "Rosamond Richardson"
  },
  {
    "title": "Girl! Mother Tells You How",
    "authors": ""
  }
]

Once again, I find those results pretty astounding.

What to make of this

The ability to extract structured content from text is already one of the most exciting use-cases for LLMs. GPT-4 Vision and LLaVA expanded that to images. And now Gemini Pro 1.5 expands that to video.

The ability to analyze video like this feels SO powerful. Being able to take a 20 second video of a bookshelf and get back a JSON array of those books is just the first thing I thought to try.

The usual LLM caveats apply. It can miss things and it can hallucinate incorrect details. Half of the work in making the most of this class of technology is figuring out how to work around these limitations, but I feel like we’re making good progress on that.

There’s also the issue with the safety filters. As input to these models gets longer, the chance of something triggering a filter (like the first four letters of the word “cocktail”) goes up.

So, as always with modern AI, there are still plenty of challenges to overcome.

But this really does feel like another one of those glimpses of a future that’s suddenly far closer than I expected it to be.

A note on images vs. video

Initially I had assumed that video was handled differently from images, due partly to the surprisingly (to me) low token counts involved in processing a video.

This thread on Hacker News convinced me otherwise.

From this blog post:

Gemini 1.5 Pro can also reason across up to 1 hour of video. When you attach a video, Google AI Studio breaks it down into thousands of frames (without audio), and then you can perform highly sophisticated reasoning and problem-solving tasks since the Gemini models are multimodal.

Then in the Gemini 1.5 technical report:

When prompted with a 45 minute Buster Keaton movie “Sherlock Jr.” (1924) (2,674 frames at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame in and provides the corresponding timestamp.

I ran my own experiment: I grabbed a frame from my video and uploaded that to Gemini in a new prompt.

Screenshot of the Gemini interface with an uploaded image. A box reads Preview 258 / 1,048,576

That’s 258 tokens for a single image.

Using the numbers from the Buster Keaton example, 684,000 tokens / 2,674 frames ≈ 256 tokens per frame. So it looks like it really does work by breaking down the video into individual frames and processing each one as an image.

For my own videos: 1,841 / 258 ≈ 7.14 (the 7s video) and 6,049 / 258 ≈ 23.45 (the 22s video), which makes me believe that videos are split up into one frame per second and each frame costs ~258 tokens.
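That arithmetic is easy to reproduce. The snippet below just re-runs the divisions using the token counts quoted in this post; `estimated_frames` is a throwaway helper name, and the one-frame-per-second reading is the same inference as above, not something confirmed by Google.

```python
TOKENS_PER_IMAGE = 258  # token count reported for a single uploaded frame


def estimated_frames(video_tokens, tokens_per_frame=TOKENS_PER_IMAGE):
    """Estimate how many frames a video was split into from its token count."""
    return video_tokens / tokens_per_frame


keaton_tokens_per_frame = 684_000 / 2_674  # figures from the technical report
short_video_frames = estimated_frames(1_841)  # the 7 second bookshelf video
long_video_frames = estimated_frames(6_049)  # the 22 second cookbook video

print(round(keaton_tokens_per_frame, 1))  # 255.8
print(round(short_video_frames, 2))  # 7.14
print(round(long_video_frames, 2))  # 23.45
```

Both estimates land within a frame or two of the video lengths in seconds, which is what you would expect if the clips run slightly over their nominal durations.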

As a further experiment I grabbed a snapshot of the first frame of the video and uploaded that as a plain image. This time I didn’t even include a prompt:

An image of books on a bookshelf - the Lord of the Rings trilogy in the middle.

Gemini Pro 1.5 guessed what I wanted and replied:

The books in the image are:

  • The Lord of the Rings by J.R.R. Tolkien
  • Structure and Interpretation of Computer Programs by Harold Abelson and Gerald Jay Sussman
  • The Source by James A. Michener
  • Rework by Jason Fried and David Heinemeier Hansson
  • The Hacker Ethic and the Spirit of the Information Age by Pekka Himanen
  • Google by David A. Vise
  • The Design of Everyday Things by Don Norman
  • Linux Server Hacks by Evi Nemeth et al.
  • The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie
  • High Output Management by Andrew S. Grove
  • Mapping Hacks by Schuyler Erle et al.

The image input was 258 tokens, the total token count after the response was 410 tokens—so 152 tokens for the response from the model. Those image tokens pack in a lot of information!

Screenshot of that interaction, showing 410/1,048,576 tokens used.

Weeknotes: a Datasette release, an LLM release and a bunch of new plugins 22 days ago

I wrote extensive annotated release notes for Datasette 1.0a8 and LLM 0.13 already. Here’s what else I’ve been up to these past three weeks.

New plugins for Datasette

  • datasette-proxy-url is a very simple plugin that lets you configure a path within Datasette that serves content proxied from another URL.

    I built this one because I ran into a bug with Substack where Substack were denying requests to my newsletter’s RSS feed from code running in GitHub Actions! Frustrating, since the whole point of RSS is to be retrieved by bots.

    I solved it by deploying a quick proxy to a Datasette instance I already had up and running, effectively treating Datasette as a cheap deployment platform for random pieces of proxying infrastructure.

  • datasette-homepage-table lets you configure Datasette to display a specific table as the homepage of the instance. I’ve wanted this for a while myself, someone requested it on Datasette Discord and it turned out to be pretty quick to build.

  • datasette-events-db hooks into the new events mechanism in Datasette 1.0a8 and logs any events (create-table, login etc) to a datasette_events table. I released this partly as a debugging tool and partly because I like to ensure every Datasette plugin hook has at least one released plugin that uses it.

  • datasette-enrichments-quickjs was this morning’s project. It’s a plugin for Datasette Enrichments that takes advantage of the quickjs Python package—a wrapper around the excellent QuickJS engine—to support running a custom JavaScript function against every row in a table to populate a new column.

    QuickJS appears to provide a robust sandbox, including both memory and time limits! I need to write more about this plugin, it opens up some very exciting new possibilities for Datasette.

I also published some significant updates to existing plugins:

  • datasette-upload-csvs got a long-overdue improvement allowing it to upload CSVs to a specified database, rather than just using the first available one. As part of this I completely re-engineered how it works in terms of threading strategies, as described in issue 38. Plus it’s now tested against the Datasette 1.0 alpha series in addition to 0.x stable.

Plugins for LLM

LLM is my command-line tool and Python library for interacting with Large Language Models. I released one new plugin for that:

I released updates for two LLM plugins as well:

I finally started hacking on a llm-rag plugin which will provide an implementation of Retrieval Augmented Generation for LLM, similar to the process I describe in Embedding paragraphs from my blog with E5-large-v2.

I’ll write more about that once it’s in an interesting state.

shot-scraper 1.4

shot-scraper is my CLI tool for taking screenshots of web pages and running scraping code against them using JavaScript, built on top of Playwright.

I dropped into the repo to add HTTP Basic authentication support and found several excellent PRs waiting to be merged, so I bundled those together into a new release.

Here are the full release notes for shot-scraper 1.4:

  • New --auth-username x --auth-password y options for each shot-scraper command, allowing a username and password to be set for HTTP Basic authentication. #140
  • shot-scraper URL --interactive mode now respects the -w and -h arguments setting the size of the browser viewport. Thanks, mhalle. #128
  • New --scale-factor option for setting scale factors other than 2 (for retina). Thanks, Niel Thiart. #136
  • New --browser-arg option for passing extra browser arguments (such as --browser-arg "--font-render-hinting=none") through to the underlying browser. Thanks, Niel Thiart. #137

Miscellaneous other projects

  • We had some pretty severe storms in the San Francisco Bay Area last week, which inspired me to revisit my old PG&E outage scraper. PG&E’s outage map changed and broke that a couple of years ago, but I got a new scraper up and running just in time to start capturing outages.
  • I’ve been wanting a way to quickly create additional labels for my GitHub repositories for a while. I finally put together a simple system for that based on GitHub Actions, described in this TIL: Creating GitHub repository labels with an Actions workflow.

Releases

TILs

Datasette 1.0a8: JavaScript plugins, new plugin hooks and plugin configuration in datasette.yaml 24 days ago

I just released Datasette 1.0a8. These are the annotated release notes.

This alpha release continues the migration of Datasette’s configuration from metadata.yaml to the new datasette.yaml configuration file, introduces a new system for JavaScript plugins and adds several new plugin hooks.

My plan is for this to be the last alpha that adds new features—the new plugin hooks, in this case. The next release will focus on wrapping up the stable APIs for 1.0, with a particular focus on template stability (so users can customize Datasette without fear of it breaking in future minor releases) and wrapping up the work on the stable JSON API.

Configuration

  • Plugin configuration now lives in the datasette.yaml configuration file, passed to Datasette using the -c/--config option. Thanks, Alex Garcia. (#2093)

    datasette -c datasette.yaml

    Where datasette.yaml contains configuration that looks like this:

    plugins:
      datasette-cluster-map:
        latitude_column: xlat
        longitude_column: xlon
    
  • Previously plugins were configured in metadata.yaml, which was confusing as plugin settings were unrelated to database and table metadata.

This almost concludes the work (driven mainly by Alex Garcia) to clean up how Datasette is configured prior to the 1.0 release. Moving things that aren’t metadata out of the metadata.yaml/json file is a big conceptual improvement, and one that absolutely needed to happen before 1.0.

  • The -s/--setting option can now be used to set plugin configuration as well. See Configuration via the command-line for details. (#2252)

    The above YAML configuration example using -s/--setting looks like this:

    datasette mydatabase.db \
      -s plugins.datasette-cluster-map.latitude_column xlat \
      -s plugins.datasette-cluster-map.longitude_column xlon

This feature is mainly for me. I start new Datasette instances dozens of times a day to try things out, and having to manually edit a datasette.yaml file before trying something new is an annoying little piece of friction.

With the -s option anything that can be represented in JSON or YAML can also be passed on the command-line.

I mainly love this as a copy-and-paste mechanism: my notes are crammed with datasette shell one-liners, and being able to paste something into my terminal to recreate a Datasette instance with a specific configuration is a big win.

The -s command uses dot-notation to specify nested keys, but it has a simple mechanism for representing more complex objects too: you can pass them in as JSON literal strings and Datasette will parse them. The --setting documentation includes this example of configuring datasette-proxy-url:

datasette mydatabase.db \
  -s plugins.datasette-proxy-url.paths '[{"path": "/proxy", "backend": "http://example.com/"}]'

Which is equivalent to the following datasette.yaml file:

plugins:
  datasette-proxy-url:
    paths:
    - path: /proxy
      backend: http://example.com/
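To make the dot-notation idea concrete, here is a small sketch of how pairs of dotted keys and values could be folded into a nested configuration dictionary. It is an illustration only and not Datasette's parser: `settings_to_config` and `parse_value` are hypothetical names, and the "try JSON first, fall back to a plain string" rule is an assumption based on the description above.

```python
import json


def parse_value(raw):
    """Treat the value as JSON if it parses, otherwise keep it as a string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def settings_to_config(pairs):
    """Build a nested dict from (dotted_key, raw_value) pairs."""
    config = {}
    for dotted_key, raw in pairs:
        *parents, last = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the tree, creating intermediate dicts as needed
            node = node.setdefault(part, {})
        node[last] = parse_value(raw)
    return config


config = settings_to_config([
    ("plugins.datasette-cluster-map.latitude_column", "xlat"),
    ("plugins.datasette-cluster-map.longitude_column", "xlon"),
    ("plugins.datasette-proxy-url.paths",
     '[{"path": "/proxy", "backend": "http://example.com/"}]'),
])
print(json.dumps(config, indent=2))
```

The resulting dictionary has the same shape as the two YAML examples in this section merged together.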

  • The new /-/config page shows the current instance configuration, after redacting keys that could contain sensitive data such as API keys or passwords. (#2254)

Datasette has a set of introspection endpoints like this—/-/metadata and /-/settings and /-/threads, all of which can have .json added to get back the raw JSON. I find them really useful for debugging instances and understanding how they have been configured.

The redaction is new: previously I had designed a mechanism for passing secrets as environment variables in a way that would avoid them being exposed here, but I realized automated redaction is less likely to cause people to leak secrets by accident.
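Automated redaction of this kind can be sketched as a recursive walk over the configuration. The version below is a guess at the general approach rather than Datasette's implementation: the list of sensitive key fragments, the `"***"` placeholder and the `redact` function name are all assumptions made for this example.

```python
SENSITIVE_FRAGMENTS = ("secret", "key", "password", "token")  # assumed list
REDACTED = "***"  # assumed placeholder


def redact(value, parent_key=""):
    """Return a copy of value with anything under a sensitive-looking key hidden."""
    if any(fragment in parent_key.lower() for fragment in SENSITIVE_FRAGMENTS):
        return REDACTED
    if isinstance(value, dict):
        return {key: redact(item, key) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


config = {
    "plugins": {
        "example-plugin": {"api_key": "sk-123", "latitude_column": "xlat"},
    },
    "settings": {"default_page_size": 50},
}
safe = redact(config)
```

Matching on key names means a secret stored under an innocuous key would still leak, so this is a safety net rather than a guarantee.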

  • Existing Datasette installations may already have configuration set in metadata.yaml that should be migrated to datasette.yaml. To avoid breaking these installations, Datasette will silently treat table configuration, plugin configuration and allow blocks in metadata as if they had been specified in configuration instead. (#2247) (#2248) (#2249)

Originally the plan was to have Datasette fail to load if it spotted configuration in metadata.yaml that should have been migrated to datasette.yaml.

I changed my mind about this mainly as I experienced the enormous inconvenience of updating all of my Datasette instances to the new format—including rewriting the automated tests for my plugins.

I think my philosophy on this going forward is going to be that Datasette will take extra effort to keep older things working provided the additional code complexity in doing so is low enough to make it worth the trade-off. In this case I think it is.

Note that the datasette publish command has not yet been updated to accept a datasette.yaml configuration file. This will be addressed in #2195 but for the moment you can include those settings in metadata.yaml instead.

I promised myself I would ship 1.0a8 today no matter what, so I cut this feature at the last moment.

JavaScript plugins

Datasette now includes a JavaScript plugins mechanism, allowing JavaScript to customize Datasette in a way that can collaborate with other plugins.

This provides two initial hooks, with more to come in the future:

Thanks Cameron Yick for contributing this feature. (#2052)

The core problem we are trying to solve here comes from what happens when multiple plugins all try to customize the Datasette instance at the same time.

This is particularly important for visualization plugins.

An example: datasette-cluster-map and datasette-geojson-map both add a map to the top of the table page. This means if you have both plugins installed you can end up with two maps!

The new mechanism allows plugins to collaborate: each plugin can contribute one or more “panels” which will then be shown above the table view in an interface with toggles to switch between them.

The column actions mechanism is similar: it allows plugins to contribute additional actions to the column menu, which appears when you click the cog icon in the header of a table column.

Cameron Yick did a great job with this feature. I’ve been slow in getting a release out with it though—my hope is that we can iterate more productively on it now that it’s in an alpha release.

Plugin hooks

I wrote about my need for this in Page caching and custom templates for Datasette Cloud: I wanted a way to modify the Jinja environment based on the requested HTTP host, and this lets me do that.

  • New family of template slot plugin hooks: top_homepage, top_database, top_table, top_row, top_query, top_canned_query. Plugins can use these to provide additional HTML to be injected at the top of the corresponding pages. (#1191)

Another long-running need (the issue is from January 2021). Similar to the JavaScript plugin mechanism, this allows multiple plugins to add content to the page without one plugin overwriting the other.

The new Datasette Events system

Another hook inspired by Datasette Cloud. I want better analytics for that product to help track which features are being used, but I also wanted to do that in a privacy-forward manner. I decided to bake it into Datasette core and I intend to make it visible to the administrators of Datasette Cloud instances—so that it doubles as an audit log for what’s happening in their instances.

I realized that this has uses beyond analytics: if a plugin wants to do something extra any time a new table is created within Datasette it can use the track_event() plugin hook to listen out for the create-table event and take action when it occurs.

  • New internal function for plugin authors: await db.execute_isolated_fn(fn), for creating a new SQLite connection, executing code and then closing that connection, all while preventing other code from writing to that particular database. This connection will not have the prepare_connection() plugin hook executed against it, allowing plugins to perform actions that might otherwise be blocked by existing connection configuration. (#2218)

This came about because I was trying to figure out a way to use the prepare_connection() hook to add authorizers that prevent users from deleting certain tables, but found that doing this prevented VACUUM from working.

The new internal function provides a clean slate for plugins to do anything they like with a SQLite connection, while simultaneously preventing any write operations from other code from executing (even against other connections) until that isolated operation is complete.
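To make the motivation concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not Datasette's implementation: run_isolated() and deny_drop_table() are hypothetical names, and the sketch does not attempt the write-blocking part. It only shows that a fresh connection is free of an authorizer installed on an existing one.

```python
import os
import sqlite3
import tempfile


def deny_drop_table(action, arg1, arg2, db_name, source):
    # Authorizer callback: refuse DROP TABLE, allow everything else.
    if action == sqlite3.SQLITE_DROP_TABLE:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK


def run_isolated(path, fn):
    # Hypothetical stand-in for the idea: open a fresh connection with no
    # extra configuration applied, run fn against it, then close it.
    conn = sqlite3.connect(path)
    try:
        return fn(conn)
    finally:
        conn.close()


path = os.path.join(tempfile.mkdtemp(), "demo.db")

# A connection configured the way a prepare_connection()-style hook might do it.
prepared = sqlite3.connect(path)
prepared.execute("create table protected (id integer primary key)")
prepared.commit()
prepared.set_authorizer(deny_drop_table)

try:
    prepared.execute("drop table protected")
    drop_blocked = False
except sqlite3.DatabaseError:
    # SQLite reports the denied statement as "not authorized".
    drop_blocked = True

# The fresh connection has no authorizer, so maintenance commands are not
# filtered by it.
run_isolated(path, lambda conn: conn.execute("vacuum"))
```

The design point is that the isolated connection never sees the per-connection configuration, so restrictions added for regular traffic do not get in the way of maintenance work.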

Documentation

I like including links to new documentation in the release notes, to give people a chance to catch useful new documentation that they might otherwise miss.

Minor fixes

  • Datasette no longer attempts to run SQL queries in parallel when rendering a table page, as this was leading to some rare crashing bugs. (#2189)
  • Fixed warning: DeprecationWarning: pkg_resources is deprecated as an API (#2057)
  • Fixed bug where ?_extra=columns parameter returned an incorrectly shaped response. (#2230)

Surprisingly few bug fixes in this alpha—most of the work in the last few months has been new features. I think this is a good sign in terms of working towards a stable 1.0.

LLM 0.13: The annotated release notes one month ago

I just released LLM 0.13, the latest version of my LLM command-line tool for working with Large Language Models—both via APIs and running models locally using plugins.

Here are the annotated release notes for the new version.

  • Added support for new OpenAI embedding models: 3-small and 3-large and three variants of those with different dimension sizes, 3-small-512, 3-large-256 and 3-large-1024. See OpenAI embedding models for details. #394

The original inspiration for shipping a new release was OpenAI’s announcement of new models yesterday: New embedding models and API updates.

I wrote a guide to embeddings in Embeddings: What they are and why they matter. Until recently the only available OpenAI embedding model was ada-002, released in December 2022 and now feeling a little long in the tooth.

The new 3-small model is similar to ada-002 but massively less expensive (a fifth of the price) and with higher benchmark scores.

3-large has even higher benchmark scores, but also produces much bigger vectors. Where ada-002 and 3-small produce 1536-dimensional vectors, 3-large produces 3072 dimensions!

Each dimension corresponds to a floating point number in the array of numbers produced when you embed a piece of content. The more numbers, the more storage space needed for those vectors and the longer any cosine-similarity calculations will take against them.
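As a rough illustration of that trade-off, here is a small Python sketch. It assumes each dimension is stored as a 32-bit float (4 bytes), which is an assumption for this example rather than a statement about how LLM stores vectors. The function names are made up for the sketch.

```python
import math


def cosine_similarity(a, b):
    # Work grows linearly with the number of dimensions in each vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def storage_bytes(dimensions, bytes_per_float=4):
    # Assumes one 32-bit float per dimension.
    return dimensions * bytes_per_float


for dims in (256, 512, 1024, 1536, 3072):
    print(dims, "dimensions:", storage_bytes(dims), "bytes per vector")
```

Under that assumption a 3072-dimensional vector takes twice the space of a 1536-dimensional one, and each similarity comparison touches twice as many numbers.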

Here’s where things get really interesting though: since people often want to trade quality for smaller vector size, OpenAI now support a way of having their models return much smaller vectors.

LLM doesn’t yet have a mechanism for passing options to embedding models (unlike language models which can take -o setting value options), but I still wanted to make the new smaller sizes available.

That’s why I included 3-small-512, 3-large-256 and 3-large-1024: those are variants of the core models hard-coded to the specified vector size.

In the future I’d like to support options for embedding models, but this is a useful stop-gap.

  • The default gpt-4-turbo model alias now points to gpt-4-turbo-preview, which uses the most recent OpenAI GPT-4 turbo model (currently gpt-4-0125-preview). #396

Also announced yesterday—gpt-4-0125-preview is the latest version of the GPT-4 model which, according to OpenAI, “completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task”.

This is technically a breaking change—the gpt-4-turbo LLM alias used to point to the older model, but now points to OpenAI’s gpt-4-turbo-preview alias which in turn points to the latest model.

  • New OpenAI model aliases gpt-4-1106-preview and gpt-4-0125-preview.

These aliases let you call those models explicitly:

llm -m gpt-4-0125-preview 'Write a lot of code without being lazy'

  • OpenAI models now support a -o json_object 1 option which will cause their output to be returned as a valid JSON object. #373

This is a fun feature, which uses an OpenAI option that claims to guarantee valid JSON output.

Weirdly you have to include the word “json” in your prompt when using this or OpenAI will return an error!

llm -m gpt-4-turbo \
  '3 names and short bios for pet pelicans in JSON' \
  -o json_object 1

That returned the following for me just now:

{
  "pelicans": [
    {
      "name": "Gus",
      "bio": "Gus is a curious young pelican with an insatiable appetite for adventure. He's known amongst the dockworkers for playfully snatching sunglasses. Gus spends his days exploring the marina and is particularly fond of performing aerial tricks for treats."
    },
    {
      "name": "Sophie",
      "bio": "Sophie is a graceful pelican with a gentle demeanor. She's become somewhat of a local celebrity at the beach, often seen meticulously preening her feathers or posing patiently for tourists' photos. Sophie has a special spot where she likes to watch the sunset each evening."
    },
    {
      "name": "Captain Beaky",
      "bio": "Captain Beaky is the unofficial overseer of the bay, with a stern yet endearing presence. As a seasoned veteran of the coastal skies, he enjoys leading his flock on fishing expeditions and is always the first to spot the fishing boats returning to the harbor. He's respected by both his pelican peers and the fishermen alike."
    }
  ]
}

The JSON schema it uses is entirely made up. You can prompt it with an example schema and it will probably stick to it.
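Since the shape of the output is not guaranteed, it can be worth checking it before relying on it. The following is a minimal sketch, assuming the response follows the "pelicans" / "name" / "bio" layout shown above. check_pelicans() is a hypothetical helper, not part of LLM.

```python
import json


def check_pelicans(raw):
    # Parse the model output and confirm it has the shape we asked for.
    data = json.loads(raw)
    pelicans = data["pelicans"]
    for pelican in pelicans:
        missing = {"name", "bio"} - pelican.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
    return pelicans


sample = '{"pelicans": [{"name": "Gus", "bio": "A curious young pelican."}]}'
print(check_pelicans(sample)[0]["name"])
```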

I wrote the first two, but llm-ollama is by Sergey Alexandrov and llm-bedrock-meta is by Fabian Labat. My plugin writing tutorial is starting to pay off!

  • The keys.json file for storing API keys is now created with 600 file permissions. #351

A neat suggestion from Christopher Bare.
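For anyone wanting the same behaviour in their own tools, here is one way to create a file with 600 permissions in Python. This is a sketch of the general technique on POSIX systems, not the code LLM itself uses. write_private_json() is a hypothetical name.

```python
import json
import os
import stat
import tempfile


def write_private_json(path, data):
    # Passing mode 0o600 to os.open() means a newly created file is
    # readable and writable by its owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fp:
        json.dump(data, fp)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)


path = os.path.join(tempfile.mkdtemp(), "keys.json")
write_private_json(path, {"example": "not-a-real-key"})
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))
```

Setting the mode at creation time avoids a window where the file exists with broader permissions before a later chmod.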

  • Documented a pattern for installing plugins that depend on PyTorch using the Homebrew version of LLM, despite Homebrew using Python 3.12 when PyTorch have not yet released a stable package for that Python version. #397

LLM is packaged for Homebrew. The Homebrew package upgraded to Python 3.12 a while ago, which caused surprising problems because it turned out PyTorch—a dependency of some LLM plugins—doesn’t have a stable build out for 3.12 yet.

Christian Bush shared a workaround in an LLM issue thread, which I’ve now added to the documentation.

  • Underlying OpenAI Python library has been upgraded to >1.0. It is possible this could cause compatibility issues with LLM plugins that also depend on that library. #325

This was the bulk of the work. OpenAI released their 1.0 Python library a couple of months ago and it had a large number of breaking changes compared to the previous release.

At the time I pinned LLM to the previous version to paper over the breaks, but this meant you could not install LLM in the same environment as some other library that needed the more recent OpenAI version.

There were a lot of changes! You can find a blow-by-blow account of the upgrade in my pull request that bundled the work.

  • Arrow keys now work inside the llm chat command. #376

The recipe for doing this is so weird:

import readline
readline.parse_and_bind("\\e[D: backward-char")
readline.parse_and_bind("\\e[C: forward-char")

I asked on Mastodon if anyone knows of a less obscure solution, but it looks like that might be the best we can do!

  • LLM_OPENAI_SHOW_RESPONSES=1 environment variable now outputs much more detailed information about the HTTP request and response made to OpenAI (and OpenAI-compatible) APIs. #404

This feature worked prior to the OpenAI >1.0 upgrade by tapping in to some requests internals. OpenAI dropped requests for httpx so I had to rebuild this feature from scratch.

I ended up getting a TIL out of it: Logging OpenAI API requests and responses using HTTPX.

  • Dropped support for Python 3.7.

I wanted to stop seeing a pkg_resources related warning, which meant switching to Python 3.8’s importlib.metadata. Python 3.7 hit end-of-life for support back in June 2023 so I think this is an OK change to make.
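As an illustration of the kind of lookup that importlib.metadata covers, here is a short sketch that reads an installed package's version without touching pkg_resources. installed_version() is a hypothetical helper and is not taken from the LLM codebase.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    # Returns the installed version string, or None if the package is absent.
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print(installed_version("llm"))
```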

Weeknotes: datasette-test, datasette-build, PSF board retreat one month ago

I wrote about Page caching and custom templates in my last weeknotes. This week I wrapped up that work, modifying datasette-edit-templates to be compatible with the jinja2_environment_from_request() plugin hook. This means you can edit templates directly in Datasette itself and have those served either for the full instance or just for the instance when served from a specific domain (the Datasette Cloud case).

Testing plugins with Playwright

As Datasette 1.0 draws closer, I’ve started thinking about plugin compatibility. This is heavily inspired by my work on Datasette Cloud, which has been running the latest Datasette alphas for several months.

I spotted that datasette-cluster-map wasn’t working correctly on Datasette Cloud, as it hadn’t been upgraded to account for JSON API changes in Datasette 1.0.

datasette-cluster-map 0.18 fixed that, while continuing to work with previous versions of Datasette. More importantly, it introduced Playwright tests to exercise the plugin in a real Chromium browser running in GitHub Actions.

I’ve been wanting to establish a good pattern for this for a while, since a lot of Datasette plugins include JavaScript behaviour that warrants browser automation testing.

Alex Garcia figured this out for datasette-comments—inspired by his code I wrote up a TIL on Writing Playwright tests for a Datasette Plugin which I’ve now also used in datasette-search-all.

datasette-test

datasette-test is a new library that provides testing utilities for Datasette plugins. So far it offers two:

from datasette_test import Datasette
import pytest

@pytest.mark.asyncio
async def test_datasette():
    ds = Datasette(plugin_config={"my-plugin": {"config": "goes here"}})

This datasette_test.Datasette class is a subclass of Datasette which helps write tests that work against both Datasette <1.0 and Datasette >=1.0a8 (releasing shortly). The way plugin configuration works is changing, and this plugin_config= parameter papers over that difference for plugin tests.

The other utility is a wait_until_responds("http://localhost:8001") function. This can be used to wait until a server has started, useful for testing with Playwright. I extracted this from Alex’s datasette-comments tests.
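The general idea behind a helper like that is a polling loop. Here is a standard-library sketch of the technique. It is not the datasette-test implementation, and the function name, parameters and defaults are all assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def wait_until_responds_sketch(url, timeout=5.0, interval=0.1):
    # Poll the URL until the server answers or the timeout expires.
    deadline = time.monotonic() + timeout
    while True:
        try:
            urllib.request.urlopen(url, timeout=interval + 1).close()
            return
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code.
            return
        except OSError:
            # Connection refused or similar: the server is not up yet.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{url} did not respond within {timeout}s")
            time.sleep(interval)
```

A test fixture could start the server in a subprocess, call this function, and only then hand control to the browser automation.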

datasette-build

So far this is just the skeleton of a new tool. I plan for datasette-build to offer comprehensive support for converting a directory full of static data files—JSON, TSV, CSV and more—into a SQLite database, and eventually to other database backends as well.

So far it’s pretty minimal, but my goal is to use plugins to provide optional support for further formats, such as GeoJSON or Parquet or even .xlsx.

I really like using GitHub to keep smaller (less than 1GB) datasets under version control. My plan is for datasette-build to support that pattern, making it easy to load version-controlled data files into a SQLite database you can then query directly.
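To give a feel for the pattern, here is a minimal standard-library sketch that loads every CSV file in a directory into its own SQLite table. It is not how datasette-build works. load_csv_directory() is a hypothetical function that assumes each file has a header row, that file names and headers are safe to use as identifiers, and that every column can be stored as text.

```python
import csv
import sqlite3
from pathlib import Path


def load_csv_directory(directory, db_path):
    conn = sqlite3.connect(db_path)
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with csv_path.open(newline="") as fp:
            reader = csv.reader(fp)
            headers = next(reader, None)
            if not headers:
                continue  # Skip empty files.
            table = csv_path.stem  # Table is named after the file.
            columns = ", ".join(f'"{header}"' for header in headers)
            placeholders = ", ".join("?" for _ in headers)
            conn.execute(f'create table if not exists "{table}" ({columns})')
            # Remaining rows are inserted as-is; values are stored as text.
            conn.executemany(
                f'insert into "{table}" values ({placeholders})', reader
            )
    conn.commit()
    conn.close()
```

Running load_csv_directory("data/", "data.db") against a checked-out repository would produce a database file that can be queried directly.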

PSF board in-person meeting

I spent the last two days of this week at the annual Python Software Foundation in-person board meeting. It’s been fantastic catching up with the other board members over more than just a Zoom connection, and we had a very thorough two days figuring out strategy for the next year and beyond.

Blog entries

Releases

TILs

Elsewhere

Today

  • The One Billion Row Challenge in Go: from 1m45s to 4s in nine solutions (via) How fast can you read a billion semicolon delimited (name;float) lines and output a min/max/mean summary for each distinct name—13GB total?

    Ben Hoyt describes his 9 incrementally improved versions written in Go in detail. The key optimizations involved custom hashmaps, optimized line parsing and splitting the work across multiple CPU cores. #3rd March 2024, 7:08 am
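    For reference, the task itself fits in a few lines. This is a naive Python sketch of the aggregation, not one of Ben Hoyt's Go solutions and not optimized in any way. summarize() is a hypothetical name, and the sketch assumes every line is a well-formed name;float pair.

```python
def summarize(lines):
    stats = {}  # name -> [minimum, maximum, total, count]
    for line in lines:
        name, _, value = line.strip().partition(";")
        reading = float(value)
        entry = stats.get(name)
        if entry is None:
            stats[name] = [reading, reading, reading, 1]
        else:
            entry[0] = min(entry[0], reading)
            entry[1] = max(entry[1], reading)
            entry[2] += reading
            entry[3] += 1
    # Return (min, max, mean) per name, ordered by name.
    return {
        name: (low, high, total / count)
        for name, (low, high, total, count) in sorted(stats.items())
    }


print(summarize(["a;1.0", "b;2.0", "a;3.0"]))
```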

Yesterday

1st March 2024

  • Streaming HTML out of order without JavaScript (via) A really interesting new browser capability. If you serve the following HTML:

    <template shadowrootmode="open">
    <slot name="item-1">Loading...</slot>
    </template>

    Then later in the same page stream an element specifying that slot:

    <span slot="item-1">Item number 1</span>

    The previous slot will be replaced while the page continues to load.

    I tried the demo in the most recent Chrome, Safari and Firefox (and Mobile Safari) and it worked in all of them.

    The key feature is shadowrootmode=open, which looks like it was added to Firefox 123 on February 19th 2024—the other two browsers are listed on caniuse.com as gaining it around March last year. #1st March 2024, 4:59 pm

  • Endatabas (via) Endatabas is “an open source immutable database”—also described as “SQL document database with full history”.

    It uses a variant of SQL which allows you to insert data into tables that don’t exist yet (they’ll be created automatically) then run standard select queries, joins etc. It maintains a full history of every record and supports the recent SQL standard “FOR SYSTEM_TIME AS OF” clause for retrieving historical records as they existed at a specified time (it defaults to the most recent versions).

    It’s written in Common Lisp plus a bit of Rust, and includes Docker images for running the server and client libraries in JavaScript and Python. The on-disk storage format is Apache Arrow, the license is AGPL and it’s been under development for just over a year.

    It’s also a document database: you can insert JSON-style nested objects directly into a table, and query them with path expressions like “select users.friends[1] from users where id = 123;”

    They have a WebAssembly version and a nice getting started tutorial which you can try out directly in your browser.

    Their “Why?” page lists full history, time travel queries, separation of storage from compute, schemaless tables and columnar storage as the five pillars that make up their product. I think it’s a really interesting amalgamation of ideas. #1st March 2024, 4:28 am

29th February 2024

  • Datasette 1.0a12. Another alpha release, this time with a new query_actions() plugin hook, a new design for the table, database and query actions menus, a “does not contain” table filter and a fix for a minor bug with the JavaScript makeColumnActions() plugin mechanism. #29th February 2024, 11:56 pm

  • GGUF, the long way around (via) Vicki Boykis dives deep into the GGUF format used by llama.cpp, after starting with a detailed description of how PyTorch models work and how they are traditionally persisted using Python pickle.

    Pickle led to safetensors, a format that avoided the security problems with downloading and running untrusted pickle files.

    Llama.cpp introduced GGML, which popularized 16-bit (as opposed to 32-bit) quantization and bundled metadata and tensor data in a single file.

    GGUF fixed some design flaws in GGML and is the default format used by Llama.cpp today. #29th February 2024, 9:39 pm

  • The Zen of Python, Unix, and LLMs. Here’s the YouTube recording of my 1.5 hour conversation with Hugo Bowne-Anderson yesterday.

    I fed a Whisper transcript to Google Gemini Pro 1.5 and asked it for the themes from our conversation, and it said we talked about “Python’s success and versatility, the rise and potential of LLMs, data sharing and ethics in the age of LLMs, Unix philosophy and its influence on software development and the future of programming and human-computer interaction”. #29th February 2024, 9:04 pm

28th February 2024

  • For the last few years, Meta has had a team of attorneys dedicated to policing unauthorized forms of scraping and data collection on Meta platforms. The decision not to further pursue these claims seems as close to waving the white flag as you can get against these kinds of companies. But why? [...]

    In short, I think Meta cares more about access to large volumes of data and AI than it does about outsiders scraping their public data now. My hunch is that they know that any success in anti-scraping cases can be thrown back at them in their own attempts to build AI training databases and LLMs. And they care more about the latter than the former.

    Kieran McCarthy # 28th February 2024, 3:15 pm

  • Testcontainers (via) Not sure how I missed this: Testcontainers is a family of testing libraries (for Python, Go, JavaScript, Ruby, Rust and a bunch more) that make it trivial to spin up a service such as PostgreSQL or Redis in a container for the duration of your tests and then spin it back down again.

    The Python example code is delightful:

    redis = DockerContainer("redis:5.0.3-alpine").with_exposed_ports(6379)
    redis.start()
    wait_for_logs(redis, "Ready to accept connections")

    I much prefer integration-style tests over unit tests, and I like to make sure any of my projects that depend on PostgreSQL or similar can run their tests against a real running instance. I’ve invested heavily in spinning up ephemeral Varnish or Elasticsearch instances in the past, and Testcontainers looks like it could save me a lot of time.

    The open source project started in 2015, spun off a company called AtomicJar in 2021 and was acquired by Docker in December 2023. #28th February 2024, 2:41 am

27th February 2024

26th February 2024

  • Mistral Large. Mistral Medium only came out two months ago, and now it’s followed by Mistral Large. Like Medium, this new model is currently only available via their API. It scores well on benchmarks (though not quite as well as GPT-4) but the really exciting feature is function support, clearly based on OpenAI’s own function design.

    Functions are now supported via the Mistral API for both Mistral Large and the new Mistral Small, described as follows: “Mistral Small, optimised for latency and cost. Mistral Small outperforms Mixtral 8x7B and has lower latency, which makes it a refined intermediary solution between our open-weight offering and our flagship model.” #26th February 2024, 11:23 pm

25th February 2024

  • dclient 0.3. dclient is my CLI utility for working with remote Datasette instances—in particular for authenticating with them and then running both read-only SQL queries and inserting data using the new Datasette write JSON API. I just picked up work on the project again after a six month gap—the insert command can now be used to constantly stream data directly to hosted Datasette instances such as Datasette Cloud. #25th February 2024, 8:06 pm

24th February 2024

  • Upside down table trick with CSS (via) I was complaining about how hard it is to build a horizontally scrollable table with a scrollbar at the top rather than the bottom and RGBCube on Lobste.rs suggested rotating the container 180 degrees and then the table contents and headers 180 back again... and it totally works! Demo in this CodePen. #24th February 2024, 9 pm

  • How to make self-hosted maps that work everywhere and cost next to nothing. Chris Amico provides a detailed roundup of the state of web mapping in 2024. It’s never been easier to entirely host your own mapping infrastructure, thanks to OpenStreetMap, Overture, MBTiles, PMTiles, Maplibre and a whole ecosystem of other fine open source projects.

    I like Protomaps creator Brandon Liu’s description of this: “post-scarcity web mapping”. #24th February 2024, 4:19 am

23rd February 2024

  • Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis (via) Max Woolf: “I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it’s very hard to prove objectively. [...] Let’s do a more statistical, data-driven approach to finally resolve the debate.” #23rd February 2024, 5:42 pm

  • Bloom Filters, explained by Sam Rose. Beautifully designed explanation of bloom filters, complete with interactive demos that illustrate exactly how they work. #23rd February 2024, 3:59 pm

  • PGlite (via) PostgreSQL compiled for WebAssembly and turned into a very neat JavaScript library. Previous attempts at running PostgreSQL in WASM have worked by bundling a full Linux virtual machine—PGlite just bundles a compiled PostgreSQL itself, which brings the size down to an impressive 3.7MB gzipped. #23rd February 2024, 3:56 pm