Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking
4th June 2023
I’ve fallen a bit behind on my weeknotes. Here’s a catchup for the last few weeks.
Parquet in Datasette Lite
Datasette Lite is my build of Datasette (a server-side Python web application) which runs entirely in the browser using WebAssembly and Pyodide. I recently added the ability to directly load Parquet files over HTTP.
This required an upgrade to the underlying version of Pyodide, in order to use the WebAssembly compiled version of the fastparquet library. That upgrade was blocked by an AttributeError: module 'os' has no attribute 'link' error, but Roman Yurchak showed me a workaround which unblocked me.
So now the following works:
This will work with any URL to a Parquet file that is served with open CORS headers—files on GitHub (or in a GitHub Gist) get these headers automatically.
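As a rough sketch of how such a link could be put together in Python: the snippet below URL-encodes a Parquet file address and appends it to a Datasette Lite URL. The base URL, the `parquet` query string parameter name and the example file URL are all assumptions made for illustration, not details confirmed above.

```python
from urllib.parse import urlencode

# Assumed address of the hosted Datasette Lite instance
DATASETTE_LITE = "https://lite.datasette.io/"


def datasette_lite_parquet_url(parquet_url):
    """Build a Datasette Lite link that loads a remote Parquet file.

    The "parquet" parameter name is an assumption for this sketch.
    """
    # urlencode percent-encodes the nested URL so it survives as one value
    return DATASETTE_LITE + "?" + urlencode({"parquet": parquet_url})


# Hypothetical Parquet file URL; it would need open CORS headers to load
print(datasette_lite_parquet_url("https://example.com/data.parquet"))
```

Encoding the nested URL matters mostly when the Parquet address has its own query string, which would otherwise be mistaken for extra Datasette Lite options.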
Also new in Datasette Lite: the ?memory=1 query string option, which starts Datasette Lite without loading any default demo databases. I added this to help me construct this demo for my new datasette-sqlite-url-lite plugin:
datasette-sqlite-url-lite—mostly written by GPT-4
datasette-sqlite-url is a really neat plugin by Alex Garcia which adds custom SQL functions to SQLite that allow you to parse URLs and extract their components.
There’s just one catch: the extension itself is written in C, and there isn’t yet a version of it compiled for WebAssembly to work in Datasette Lite.
I wanted to use some of the functions in it, so I decided to see if I could get a pure Python alternative to it working. But this was a very low-stakes project, so I decided to see if I could get GPT-4 to do essentially all of the work for me.
I prompted it like this—copying and pasting the examples directly from Alex’s documentation:
Write Python code to register the following SQLite custom functions:
select url_valid('https://sqlite.org'); -- 1
select url_scheme('https://www.sqlite.org/vtab.html#usage'); -- 'https'
select url_host('https://www.sqlite.org/vtab.html#usage'); -- 'www.sqlite.org'
select url_path('https://www.sqlite.org/vtab.html#usage'); -- '/vtab.html'
select url_fragment('https://www.sqlite.org/vtab.html#usage'); -- 'usage'
The code it produced was almost exactly what I needed.
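For a sense of what that kind of code looks like, here is a minimal sketch of my own (not GPT-4's actual output, and not the plugin's source) that registers those five functions using the standard library's sqlite3 and urllib.parse modules. The `_parse` and `register_url_functions` helpers are hypothetical names, and treating "valid" as "has a scheme and a host" is an assumption that may not match how the C extension behaves.

```python
import sqlite3
from urllib.parse import urlparse


def _parse(url):
    # Hypothetical helper: return a ParseResult, or None for unusable input
    if not isinstance(url, str):
        return None
    try:
        return urlparse(url)
    except ValueError:
        return None


def url_valid(url):
    # Assumption: a URL counts as valid if it has both a scheme and a host
    parsed = _parse(url)
    return int(bool(parsed and parsed.scheme and parsed.netloc))


def url_scheme(url):
    parsed = _parse(url)
    return parsed.scheme if parsed else None


def url_host(url):
    # netloc keeps any port or credentials present in the URL
    parsed = _parse(url)
    return parsed.netloc if parsed else None


def url_path(url):
    parsed = _parse(url)
    return parsed.path if parsed else None


def url_fragment(url):
    parsed = _parse(url)
    return parsed.fragment if parsed else None


def register_url_functions(conn):
    # Register each function under its own name, taking one argument
    for fn in (url_valid, url_scheme, url_host, url_path, url_fragment):
        conn.create_function(fn.__name__, 1, fn)


conn = sqlite3.connect(":memory:")
register_url_functions(conn)
row = conn.execute(
    "select url_host(?), url_path(?)",
    ["https://www.sqlite.org/vtab.html#usage"] * 2,
).fetchone()
print(row)
```

Each function returns NULL (Python None) for input it cannot parse, which keeps a single bad row from aborting a whole query.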
I wanted some tests too, so I prompted:
Write a suite of pytest tests for this
This gave me the tests I needed—with one error in the way they called SQLite, but still doing 90% of the work for me.
Videos for three of my recent talks are now available on YouTube:
- Big Opportunities in Small Data is the keynote I gave at Citus Con: An Event for Postgres 2023—talking about Datasette, SQLite and some tricks I would love to see the PostgreSQL community adopt from the explorations I’ve been doing around small data.
- The Data Enthusiast’s Toolkit is an hour-long interview with Rizel Scarlett about both Datasette and my career to date. Frustratingly, I had about 10 minutes of terrible microphone audio in the middle, but the conversation itself was really great.
- Data analysis with SQLite and Python is a video from PyCon of the full 2hr45m tutorial I gave there last month. The handout notes for that are available online too.
Entries this week
- It’s infuriatingly hard to understand how closed models train on their input
- ChatGPT should include inline tips
- Lawyer cites fake cases invented by ChatGPT, judge is not amused
- llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs
- Delimiters won’t save you from prompt injection
Releases this week
- A pure Python alternative to sqlite-url ready to be used in Datasette Lite
- Python CLI utility and library for manipulating SQLite databases
- CLI tool for stripping tags from HTML
- Count and truncate text based on tokens
- Access large language models from the command-line
TIL this week
- Testing the Access-Control-Max-Age CORS header—2023-05-25
- Comparing two training datasets using sqlite-utils—2023-05-23
- mlc-chat—RedPajama-INCITE-Chat-3B on macOS—2023-05-22
- hexdump and hexdump -C—2023-05-22
- Exploring Baseline with Datasette Lite—2023-05-12