398 items tagged “projects”
Posts about projects I have worked on.
2024
Three major LLM releases in 24 hours (plus weeknotes)
I’m a bit behind on my weeknotes, so there’s a lot to cover here. But first... a review of the last 24 hours of Large Language Model news. All times are in US Pacific on April 9th 2024.
[... 1,401 words]Extracting data from unstructured text and images with Datasette and GPT-4 Turbo. Datasette Extract is a new Datasette plugin that uses GPT-4 Turbo (released to general availability today) and GPT-4 Vision to extract structured data from unstructured text and images.
I put together a video demo of the plugin in action today, and posted it to the Datasette Cloud blog along with screenshots and a tutorial describing how to use it.
Building files-to-prompt entirely using Claude 3 Opus
files-to-prompt is a new tool I built to help me pipe several files at once into prompts to LLMs such as Claude and GPT-4.
[... 3,235 words]datasette-import. A new plugin for importing data into Datasette. This is a replacement for datasette-paste, duplicating and extending its functionality. datasette-paste had grown beyond just dealing with pasted CSV/TSV/JSON data—it handles file uploads as well now—which inspired the new name.
s3-credentials 0.16.
I spent entirely too long this evening trying to figure out why files in my new supposedly public S3 bucket were unavailable to view. It turns out these days you need to set a PublicAccessBlockConfiguration
of {"BlockPublicAcls": false, "IgnorePublicAcls": false, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}
.
The s3-credentials --create-bucket --public
option now does that for you. I also added a s3-credentials debug-bucket name-of-bucket
command to help figure out why a bucket isn't working as expected.
llm-command-r. Cohere released Command R Plus today—an open weights (non commercial/research only) 104 billion parameter LLM, a big step up from their previous 35 billion Command R model.
Both models are fine-tuned for both tool use and RAG. The commercial API has features to expose this functionality, including a web-search connector which lets the model run web searches as part of answering the prompt and return documents and citations as part of the JSON response.
I released a new plugin for my LLM command line tool this morning adding support for the Command R models.
In addition to the two models it also adds a custom command for running prompts with web search enabled and listing the referenced documents.
llm-nomic-api-embed. My new plugin for LLM which adds API access to the Nomic series of embedding models. Nomic models can be run locally too, which makes them a great long-term commitment as there’s no risk of the models being retired in a way that damages the value of your previously calculated embedding vectors.
textract-cli. This is my other OCR project from yesterday: I built the thinnest possible CLI wrapper around Amazon Textract, out of frustration at how hard that tool is to use on an ad-hoc basis.
It only works with JPEGs and PNGs (not PDFs) up to 5MB in size, reflecting limitations in Textract’s synchronous API: it can handle PDFs amazingly well but you have to upload them to an S3 bucket yet and I decided to keep the scope tight for the first version of this tool.
Assuming you’ve configured AWS credentials already, this is all you need to know:
pipx install textract-cli
textract-cli image.jpeg > output.txt
Running OCR against PDFs and images directly in your browser
I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?
[... 2,263 words]Wrap text at specified width. New Observable notebook. I built this with the help of Claude 3 Opus—it’s a text wrapping tool which lets you set the width and also lets you optionally add a four space indent.
The four space indent is handy for posting on forums such as Hacker News that treat a four space indent as a code block.
llm-gemini 0.1a1. I upgraded my llm-gemini plugin to add support for the new Google Gemini Pro 1.5 model, which is beginning to roll out in early access.
The 1.5 model supports 1,048,576 input tokens and generates up to 8,192 output tokens—a big step up from Gemini 1.0 Pro which handled 30,720 and 2,048 respectively.
The big missing feature from my LLM tool at the moment is image input—a fantastic way to take advantage of that huge context window. I have a branch for this which I really need to get into a useful state.
llm cmd undo last git commit—a new plugin for LLM
I just released a neat new plugin for my LLM command-line tool: llm-cmd. It lets you run a command to to generate a further terminal command, review and edit that command, then hit <enter>
to execute it or <ctrl-c>
to cancel.
Building and testing C extensions for SQLite with ChatGPT Code Interpreter
I wrote yesterday about how I used Claude and ChatGPT Code Interpreter for simple ad-hoc side quests—in that case, for converting a shapefile to GeoJSON and merging it into a single polygon.
[... 4,612 words]GitHub Public repo history tool (via) I built this Observable Notebook to run queries against the GH Archive (via ClickHouse) to try to answer questions about repository history—in particular, were they ever made public as opposed to private in the past.
It works by combining together PublicEvent event (moments when a private repo was made public) with the most recent PushEvent event for each of a user’s repositories.
Weeknotes: the aftermath of NICAR
NICAR was fantastic this year. Alex and I ran a successful workshop on Datasette and Datasette Cloud, and I gave a lightning talk demonstrating two new GPT-4 powered Datasette plugins—datasette-enrichments-gpt and datasette-extract. I need to write more about the latter one: it enables populating tables from unstructured content (using a variant of this technique) and it’s really effective. I got it working just in time for the conference.
[... 1,430 words]llm-claude-3 0.3. Anthropic released Claude 3 Haiku today, their least expensive model: $0.25/million tokens of input, $1.25/million of output (GPT-3.5 Turbo is $0.50/$1.50). Unlike GPT-3.5 Haiku also supports image inputs.
I just released a minor update to my llm-claude-3 LLM plugin adding support for the new model.
datasette/studio. I’m trying a new way to make Datasette available for small personal data manipulation projects, using GitHub Codespaces.
This repository is designed to be opened directly in Codespaces—detailed instructions in the README.
When the container starts it installs the datasette-studio family of plugins—including CSV upload, some enrichments and a few other useful feature—then starts the server running and provides a big green button to click to access the server via GitHub’s port forwarding mechanism.
llm-claude-3. I built a new plugin for LLM—my command-line tool and Python library for interacting with Large Language Models—which adds support for the new Claude 3 models from Anthropic.
Datasette 1.0a12. Another alpha release, this time with a new query_actions() plugin hook, a new design for the table, database and query actions menus, a “does not contain” table filter and a fix for a minor bug with the JavaScript makeColumnActions() plugin mechanism.
Weeknotes: Getting ready for NICAR
Next week is NICAR 2024 in Baltimore—the annual data journalism conference hosted by Investigative Reporters and Editors. I’m running a workshop on Datasette, and I plan to spend most of my time in the hallway track talking to people about Datasette, Datasette Cloud and how the Datasette ecosystem can best help support their work.
[... 1,390 words]dclient 0.3. dclient is my CLI utility for working with remote Datasette instances—in particular for authenticating with them and then running both read-only SQL queries and inserting data using the new Datasette write JSON API. I just picked up work on the project again after a six month gap—the insert command can now be used to constantly stream data directly to hosted Datasette instances such as Datasette Cloud.
datasette-studio. I've been thinking for a while that it might be interesting to have a version of Datasette that comes bundled with a set of useful plugins, aimed at expanding Datasette's default functionality to cover things like importing data and editing schemas.
This morning I built the very first experimental preview of what that could look like. Install it using pipx
:
pipx install datasette-studio
I recommend pipx because it will ensure datasette-studio
gets its own isolated environment, independent of any other Datasette installations you might have.
Now running datasette-studio
instead of datasette
will get you the version with the bundled plugins.
The implementation of this is fun - it's a single pyproject.toml file defining the dependencies and setting up the datasette-studio
CLI hook, which is enough to provide the full set of functionality.
Is this a good idea? I don't know yet, but it's certainly an interesting initial experiment.
Datasette 1.0a10. The only changes in this alpha release concern the way Datasette handles database transactions. The database.execute_write_fn() internal method used to leave functions to implement transactions on their own—it now defaults to wrapping them in a transaction unless they opt out with the new transaction=False parameter.
In implementing this I found several places inside Datasette—in particular parts of the JSON write API—which had not been handling transactions correctly. Those are all now fixed.
Datasette 1.0a9. A new Datasette alpha release today. This adds basic alter table support API support, so you can request Datasette modify a table to add new columns needed for JSON objects submitted to the insert, upsert or update APIs.
It also makes some permission changes—fixing a minor bug with upsert permissions, and introducing a new rule where every permission plugin gets consulted for a permission check, with just one refusal vetoing that check.
Weeknotes: a Datasette release, an LLM release and a bunch of new plugins
I wrote extensive annotated release notes for Datasette 1.0a8 and LLM 0.13 already. Here’s what else I’ve been up to this past three weeks.
[... 1,074 words]Datasette 1.0a8: JavaScript plugins, new plugin hooks and plugin configuration in datasette.yaml
I just released Datasette 1.0a8. These are the annotated release notes.
[... 1,709 words]shot-scraper 1.4. I decided to add HTTP Basic authentication support to shot-scraper today and found several excellent pull requests waiting to be merged, by Niel Thiart and mhalle.
1.4 adds support for HTTP Basic auth, custom --scale-factor shots, additional --browser-arg arguments and a fix for --interactive mode.
llm-sentence-transformers 0.2. I added a new --trust-remote-code option when registering an embedding model, which means LLM can now run embeddings through the new Nomic AI nomic-embed-text-v1 model.
llm-embed-onnx. I wrote a new plugin for LLM that acts as a thin wrapper around onnx_embedding_models by Benjamin Anderson, providing access to seven embedding models that can run on the ONNX model framework.
The actual plugin is around 50 lines of code, which makes for a nice example of how thin a plugin wrapper can be that adds new models to my LLM tool.
LLM 0.13: The annotated release notes
I just released LLM 0.13, the latest version of my LLM command-line tool for working with Large Language Models—both via APIs and running models locally using plugins.
[... 1,278 words]