Simon Willison’s Weblog

Subscribe

May 2025

May 28, 2025

llm-llama-server 0.2. Here's a second option for using LLM's new tool support against local models (the first was via llm-ollama).

It turns out the llama.cpp ecosystem has pretty robust OpenAI-compatible tool support already, so my llm-llama-server plugin only needed a quick upgrade to get those working there.

Unfortunately it looks like streaming support doesn't work with tools in llama-server at the moment, so I added a new model ID called llama-server-tools which disables streaming and enables tools.

Here's how to try it out. First, ensure you have llama-server - the easiest way to get that on macOS is via Homebrew:

brew install llama.cpp

Start the server running like this. This command will download and cache the 3.2GB unsloth/gemma-3-4b-it-GGUF:Q4_K_XL if you don't yet have it:

llama-server --jinja -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL

Then in another window:

llm install llm-llama-server
llm -m llama-server-tools -T llm_time 'what time is it?' --td

And since you don't even need an API key for this, even if you've never used LLM before you can try it out with this uvx one-liner:

uvx --with llm-llama-server llm -m llama-server-tools -T llm_time 'what time is it?' --td

For more notes on using llama.cpp with LLM see Trying out llama.cpp’s new vision support from a couple of weeks ago.

# 6:27 am / generative-ai, llm, plugins, projects, llm-tool-use, llama-cpp, ai, uv

Here's a quick demo of the kind of casual things I use LLMs for on a daily basis.

I just found out that Perplexity offer their Deep Research feature via their API, through a model called Sonar Deep Research.

Their documentation includes an example response, which included this usage data in the JSON:

{"prompt_tokens": 19, "completion_tokens": 498, "total_tokens": 517, "citation_tokens": 10175, "num_search_queries": 48, "reasoning_tokens": 95305}

But how much would that actually cost?

Their pricing page lists the price for that model. I snapped this screenshot of the prices:

Pricing table screenshot showing API costs: Input Tokens (Per Million) $2, Output Tokens (Per Million) $8, Price per 1000 Search Queries $5, Reasoning Tokens (Per Million) $3

I could break out a calculator at this point, but I'm not quite curious enough to go through the extra effort.

So I pasted that screenshot into Claude along with the JSON and prompted:

{"prompt_tokens": 19, "completion_tokens": 498, "total_tokens": 517, "citation_tokens": 10175, "num_search_queries": 48, "reasoning_tokens": 95305}

Calculate price, use javascript

I wanted to make sure Claude would use its JavaScript analysis tool, since LLMs can't do maths.

I watched Claude Sonnet 4 write 61 lines of JavaScript - keeping an eye on it to check it didn't do anything obviously wrong. The code spat out this output:

=== COST CALCULATIONS ===
Input tokens cost: 19 tokens × $2/million = $0.000038
Output tokens cost: 498 tokens × $8/million = $0.003984
Search queries cost: 48 queries × $5/1000 = $0.240000
Reasoning tokens cost: 95305 tokens × $3/million = $0.285915

=== COST SUMMARY ===
Input tokens: $0.000038
Output tokens: $0.003984
Search queries: $0.240000
Reasoning tokens: $0.285915
─────────────────────────
TOTAL COST: $0.529937
TOTAL COST: $0.5299 (rounded to 4 decimal places)

So that Deep Research API call would cost 53 cents! Curiosity satisfied in less than a minute.

# 3:41 pm / generative-ai, claude, perplexity, llm-pricing, deep-research, ai, llms, llm-tool-use

Codestral Embed. Brand new embedding model from Mistral, specifically trained for code. Mistral claim that:

Codestral Embed significantly outperforms leading code embedders in the market today: Voyage Code 3, Cohere Embed v4.0 and OpenAI’s large embedding model.

The model is designed to work at different sizes. They show performance numbers for 256, 512, 1024 and 1546 sized vectors in binary (256 bits = 32 bytes of storage per record), int8 and float32 representations. The API documentation says you can request up to 3072.

The dimensions of our embeddings are ordered by relevance. For any integer target dimension n, you can choose to keep the first n dimensions for a smooth trade-off between quality and cost.

I think that means they're using Matryoshka embeddings.

Here's the problem: the benchmarks look great, but the model is only available via their API (or for on-prem deployments at "contact us" prices).

I'm perfectly happy to pay for API access to an embedding model like this, but I only want to do that if the model itself is also open weights so I can maintain the option to run it myself in the future if I ever need to.

The reason is that the embeddings I retrieve from this API only maintain their value if I can continue to calculate more of them in the future. If I'm going to spend money on calculating and storing embeddings I want to know that value is guaranteed far into the future.

If the only way to get new embeddings is via an API, and Mistral shut down that API (or go out of business), that investment I've made in the embeddings I've stored collapses in an instant.

I don't actually want to run the model myself. Paying Mistral $0.15 per million tokens (50% off for batch discounts) to not have to waste my own server's RAM and GPU holding that model in memory is great deal!

In this case, open weights is a feature I want purely because it gives me complete confidence in the future of my investment.

# 4:47 pm / mistral, ai, embeddings

I wonder if one of the reasons I'm finding LLMs so much more useful for coding than a lot of people that I see in online discussions is that effectively all of the code I work on has automated tests.

I've been trying to stay true to the idea of a Perfect Commit - one that bundles the implementation, tests and documentation in a single unit - for over five years now. As a result almost every piece of (non vibe-coding) code I work on has pretty comprehensive test coverage.

This massively derisks my use of LLMs. If an LLM writes weird, convoluted code that solves my problem I can prove that it works with tests - and then have it refactor the code until it looks good to me, keeping the tests green the whole time.

LLMs help write the tests, too. I finally have a 24/7 pair programmer who can remember how to use unittest.mock!

Next time someone complains that they've found LLMs to be more of a hindrance than a help in their programming work, I'm going to try to remember to ask after the health of their test suite.

# 9:16 pm / vibe-coding, testing, ai-assisted-programming, generative-ai, ai, llms

May 29, 2025

llm-mistral 0.14. I added tool-support to my plugin for accessing the Mistral API from LLM today, plus support for Mistral's new Codestral Embed embedding model.

An interesting challenge here is that I'm not using an official client library for llm-mistral - I rolled my own client on top of their streaming HTTP API using Florimond Manca's httpx-sse library. It's a very pleasant way to interact with streaming APIs - here's my code that does most of the work.

The problem I faced is that Mistral's API documentation for function calling has examples in Python and TypeScript but doesn't include curl or direct documentation of their HTTP endpoints!

I needed documentation at the HTTP level. Could I maybe extract that directly from Mistral's official Python library?

It turns out I could. I started by cloning the repo:

git clone https://github.com/mistralai/client-python
cd client-python/src/mistralai
files-to-prompt . | ttok

My ttok tool gave me a token count of 212,410 (counted using OpenAI's tokenizer, but that's normally a close enough estimate) - Mistral's models tap out at 128,000 so I switched to Gemini 2.5 Flash which can easily handle that many.

I ran this:

files-to-prompt -c . > /tmp/mistral.txt

llm -f /tmp/mistral.txt \
  -m gemini-2.5-flash-preview-05-20 \
  -s 'Generate comprehensive HTTP API documentation showing
how function calling works, include example curl commands for each step'

The results were pretty spectacular! Gemini 2.5 Flash produced a detailed description of the exact set of HTTP APIs I needed to interact with, and the JSON formats I should pass to them.

There are a bunch of steps needed to get tools working in a new model, as described in the LLM plugin authors documentation. I started working through them by hand... and then got lazy and decided to see if I could get a model to do the work for me.

This time I tried the new Claude Opus 4. I fed it three files: my existing, incomplete llm_mistral.py, a full copy of llm_gemini.py with its working tools implementation and a copy of the API docs Gemini had written for me earlier. I prompted:

I need to update this Mistral code to add tool support. I've included examples of that code for Gemini, and a detailed README explaining the Mistral format.

Claude churned away and wrote me code that was most of what I needed. I tested it in a bunch of different scenarios, pasted problems back into Claude to see what would happen, and eventually took over and finished the rest of the code myself. Here's the full transcript.

I'm a little sad I didn't use Mistral to write the code to support Mistral, but I'm pleased to add yet another model family to the list that's supported for tool usage in LLM.

# 3:33 am / gemini, llm, plugins, llm-tool-use, ai, llms, mistral, generative-ai, projects, ai-assisted-programming, claude, claude-4, httpx, python

llm-tools-exa. When I shipped LLM 0.26 yesterday one of the things I was most excited about was seeing what new tool plugins people would build for it.

Dan Turkel's llm-tools-exa is one of the first. It adds web search to LLM using Exa (previously), a relatively new search engine offering that rare thing, an API for search. They have a free preview, you can grab an API key here.

I'm getting pretty great results! I tried it out like this:

llm install llm-tools-exa
llm keys set exa
# Pasted API key here

llm -T web_search "What's in LLM 0.26?"

Here's the full answer - it started like this:

LLM 0.26 was released on May 27, 2025, and the biggest new feature in this version is official support for tools. Here's a summary of what's new and notable in LLM 0.26:

  • LLM can now run tools. You can grant LLMs from OpenAI, Anthropic, Gemini, and local models access to any tool you represent as a Python function.
  • Tool plugins are introduced, allowing installation of plugins that add new capabilities to any model you use.
  • Tools can be installed from plugins and loaded by name with the --tool/-T option. [...]

Exa provided 21,000 tokens of search results, including what looks to be a full copy of my blog entry and the release notes for LLM.

# 3:58 am / llm, generative-ai, llm-tool-use, apis, search, ai, llms

llm-github-models 0.15. Anthony Shaw's llm-github-models plugin just got an upgrade: it now supports LLM 0.26 tool use for a subset of the models hosted on the GitHub Models API, contributed by Caleb Brose.

The neat thing about this GitHub Models plugin is that it picks up an API key from your GITHUB_TOKEN - and if you're running LLM within a GitHub Actions worker the API key provided by the worker should be enough to start executing prompts!

I tried it out against Cohere Command A via GitHub Models like this (transcript here):

llm install llm-github-models
llm keys set github
# Paste key here
llm -m github/cohere-command-a -T llm_time 'What time is it?' --td

We now have seven LLM plugins that provide tool support, covering OpenAI, Anthropic, Gemini, Mistral, Ollama, llama-server and now GitHub Models.

# 4:27 am / github-actions, llm, generative-ai, llm-tool-use, ai, github, llms, anthony-shaw

I'll be sending out my first curated monthly highlights newsletter tomorrow, only to $10/month and up sponsors. Sign up now if you want to pay me to send you less!

My weekly-ish newsletter remains free, in fact I just sent out the latest edition.

# 4:33 am / blogging, newsletter

Saying Bye to Glitch (via) Pirijan, co-creator of Glitch - who stopped working on it six years ago, so has the benefit of distance:

Here lies Glitch, a place on the web you could go to write up a website or a node.js server that would be hosted and updated as you type. 🥀 RIP 2015 – 2025.

Pirijan continues with a poignant retrospective about Glitch's early origins at Fog Greek with the vision of providing "web development with real code that was as easy as editing a Google Doc". Their conclusion:

I still believe there’s a market for easy and fun web development and hosting, but a product like this needs power-users and enthusiasts willing to pay for it. To build any kind of prosumer software, you do have to be an optimist and believe that enough of the world still cares about quality and craft.

Glitch will be shutting down project hosting and user profiles on July 8th.

Code will be available to download until the end of the year. Glitch have an official Python export script that can download all of your projects and assets.

Jenn Schiffer, formerly Director of Community at Glitch and then Fastly, is a little more salty:

all that being said, i do sincerely want to thank fastly for giving glitch the opportunity to live to its 3-year acqui-versary this week. they generously took in a beautiful flower and placed it upon their sunny window sill with hopes to grow it more. the problem is they chose to never water it, and anyone with an elementary school education know what happens then. i wish us all a merry august earnings call season.

I'm very sad to see Glitch go. I've been pointing people to my tutorial on Running Datasette on Glitch for 5 years now, it was a fantastic way to help people quickly get started hosting their own projects.

# 8:36 pm / glitch, fastly, datasette

May 30, 2025

Talking AI and jobs with Natasha Zouves for News Nation

Visit Talking AI and jobs with Natasha Zouves for News Nation

I was interviewed by News Nation’s Natasha Zouves about the very complicated topic of how we should think about AI in terms of threatening our jobs and careers. I previously talked with Natasha two years ago about Microsoft Bing.

[... 2,194 words]

May 31, 2025

Using voice mode on Claude Mobile Apps. Anthropic are rolling out voice mode for the Claude apps at the moment. Sadly I don't have access yet - I'm looking forward to this a lot, I frequently use ChatGPT's voice mode when walking the dog and it's a great way to satisfy my curiosity while out at the beach.

It's English-only for the moment. Key details:

  • Voice conversations count toward your regular usage limits based on your subscription plan.
  • For free users, expect approximately 20-30 voice messages before reaching session limits.
  • For paid plans, usage limits are significantly higher, allowing for extended voice conversations.

A update on Anthropic's trust center reveals how it works:

As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.

So it's ElevenLabs for the speech generation, but what about the speech-to-text piece? Anthropic have had their own implementation of that in the app for a while already, but I'm not sure if it's their own technology or if it's using another mechanism such as Whisper.

Update 3rd June 2025: I got access to the new feature. I'm finding it disappointing, because it relies on you pressing a send button after recording each new voice prompt. This means it doesn't work for hands-free operations (like when I'm cooking or walking the dog) which is most of what I use ChatGPT voice for.

Mobile app interface screenshot showing three circular buttons in a row - a plus sign, an upward arrow (highlighted in black), and an X symbol, with "Tap anywhere to send" text above

Update #2: It turns out it does auto-submit if you leave about a five second gap after saying something.

# 3:27 am / anthropic, claude, generative-ai, ai, llms

If you've found web development frustrating over the past 5-10 years, here's something that has worked worked great for me: give yourself permission to avoid any form of frontend build system (so no npm / React / TypeScript / JSX / Babel / Vite / Tailwind etc) and code in HTML and JavaScript like it's 2009.

The joy came flooding back to me! It turns out browser APIs are really good now.

You don't even need jQuery to paper over the gaps any more - use document.querySelectorAll() and fetch() directly and see how much value you can build with a few dozen lines of code.

# 2:23 pm / css, javascript, web-development, frontend, html

There's a new kind of coding I call "hype coding" where you fully give into the hype, and what's coming right around the corner, that you lose sight of whats' possible today. Everything is changing so fast that nobody has time to learn any tool, but we should aim to use as many as possible. Any limitation in the technology can be chalked up to a 'skill issue' or that it'll be solved in the next AI release next week. Thinking is dead. Turn off your brain and let the computer think for you. Scroll on tiktok while the armies of agents code for you. If it isn't right, tell it to try again. Don't read. Feed outputs back in until it works. If you can't get it to work, wait for the next model or tool release. Maybe you didn't use enough MCP servers? Don't forget to add to the hype cycle by aggrandizing all your successes. Don't read this whole tweet, because it's too long. Get an AI to summarize it for you. Then call it "cope". Most importantly, immediately mischaracterize "hype coding" to mean something different than this definition. Oh the irony! The people who don't care about details don't read the details about not reading the details

Steve Krouse

# 2:26 pm / steve-krouse, vibe-coding, ai, semantic-diffusion, model-context-protocol

deepseek-ai/DeepSeek-R1-0528. Sadly the trend for terrible naming of models has infested the Chinese AI labs as well.

DeepSeek-R1-0528 is a brand new and much improved open weights reasoning model from DeepSeek, a major step up from the DeepSeek R1 they released back in January.

In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by [...] Its overall performance is now approaching that of leading models, such as O3 and Gemini 2.5 Pro. [...]

Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.

The new R1 comes in two sizes: a 685B model called deepseek-ai/DeepSeek-R1-0528 (the previous R1 was 671B) and an 8B variant distilled from Qwen 3 called deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.

The January release of R1 had a much larger collection of distilled models: four based on Qwen 2.5 (14B, 32B, Math 1.5B and Math 7B) and 2 based on Llama 3 (Llama-3.1 8B and Llama 3.3 70B Instruct).

No Llama model at all this time. I wonder if that's because Qwen 3 is really good and Apache 2 licensed, while Llama continues to stick with their janky license terms.

Further adding to the confusion, Ollama have mixed the two new models into their existing deepseek-r1 label. Last week running ollama pull deepseek-r1:8B from Ollama would get you DeepSeek-R1-Distill-Llama-8B (Internet Archive link), today that same command gets you DeepSeek-R1-0528-Qwen3-8B, a completely different base model.

If this bothers you as much as it bothers me, thankfully you can use the more explicit tag deepseek-r1:8b-0528-qwen3-q8_0.

Update: This is a great illustration of how confusing these names are! Ollama's deepseek-r1:8B alias actually points to deepseek-r1:8b-0528-qwen3-q4_K_M, a 5.2GB model. I pulled the larger q8_0 one.

I ran it like this:

ollama pull deepseek-r1:8b-0528-qwen3-q8_0

It fetched 8.9GB of model weights, then I had it draw me a pelican.

llm install llm-ollama
llm -m deepseek-r1:8b-0528-qwen3-q8_0 \
  "Generate an SVG of a pelican riding a bicycle"

Since the “reasoning” traces of these models are almost always the most interesting part, I’ll quote the response here in full:

<think>
We are going to create an SVG that depicts a pelican riding a bicycle. Steps:

  1. Create the basic structure of an SVG with width and height set appropriately (e.g., 400x200).
  2. Draw a pelican: We'll break it down into parts - body, beak, eye, feet, wings, etc.
  3. Draw a bicycle: We can represent it as two circles for wheels, a rectangle for the frame, and some other details.

Let's design the SVG with the following:

  • The pelican will be facing to the right (positive x-axis) so that it is riding the bike from left to right.
  • The bicycle will have two wheels and a simple frame in front of the pelican.

We'll use absolute positioning for each part. Let's assume we set up coordinates such that:

  • The top-left of the SVG is (0,0), and we want the scene to be drawn from left to right.
  • The pelican will be positioned at (50, 120) with a body length of about 30 units.

We'll use simple shapes for the pelican:

Body: An ellipse or rounded rectangle. Let's do an ellipse for the body. Beak: A triangle pointing to the right. Head: Another ellipse but smaller and positioned in front of the body. Wings: Two triangles (one above each other, one slightly behind) that are upside down.

Bicycle:

  • Wheels: two circles at fixed positions relative to the frame.
  • Frame: a rectangle with some details for handlebars and seat.

Since this is a fun illustration, we'll keep it simple but recognizable.

Let's code accordingly.
</think>
Here’s an SVG of a pelican riding a bicycle. This example uses basic shapes to create a stylized version:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 400 200">
  <!-- Pelican body -->
  <ellipse cx="150" cy="150" rx="80" ry="60" fill="#FFD700"/>

  <!-- Pelican head (larger for comedic effect) -->
  <circle cx="90" cy="120" r="45" fill="#FFA3CE"/>

  <!-- Pelican beak -->
  <polygon points="80,110 60,120 80,130 70,140 60,150" fill="#FFC233"/>

  <!-- Pelican eye -->
  <circle cx="95" cy="115" r="3" fill="#fff"/>

  <!-- Bicycle wheels -->
  <ellipse cx="380" cy="140" rx="25" ry="60" transform="rotate(90 380 140)" fill="black"/>
  <ellipse cx="380" cy="70" rx="25" ry="60" transform="rotate(90 380 70)" fill="black"/>

  <!-- Bicycle frame -->
  <path d="M 15,40 L 15,140 H 370 L 370,140 V 40 Z" stroke="gray" stroke-width="6" fill="none"/>

  <!-- Handlebars and seat (stylized) -->
  <path d="M 280,40 C 290,35 300,30 310,40 C 320,50 330,60 310,70 L 310,70 Q 290,50 280,40" fill="none" stroke="gray"/>

  <!-- Pedals -->
  <circle cx="340" cy="70" r="15" fill="#3366CC"/>
  <polygon points="340,65 330,45 350,45" fill="#F29E3D"/>

  <circle cx="340" cy="140" r="15" fill="#3366CC"/>
  <polygon points="340,135 330,115 350,115" fill="#F29E3D"/>

  <!-- Neck connecting head to body -->
  <path d="M 80,120 C 70,160 100,140 150,120" stroke="#FFA3CE" stroke-width="8"/>
</svg>

Explanation:

  • Pelican Body: A large yellow ellipse.
  • Head: A pink circle with a prominent beak (yellow-orange triangle).
  • Bicycle:
    • Wheels are black ellipses rotated vertically to simulate spokes.
    • The frame is a gray rectangle extending from the pelican’s back to the right side of the SVG.
    • Handlebars and seat are simplified curves in gray.

This creates a comical scene with the pelican riding forward. You can adjust colors, sizes, or positions by modifying cx, cy, rx, ry attributes and fill/stroke values.

Love the thought process, and the explanation! The actual rendered SVG does leave a lot to be desired though:

A pink circle ovenrlaps a yellow oval, with a grey thick lined rectangle cutting thorough both. Two identical blue circles on the right have orange triangular elements and black ovals, a fair distance from the bit that might represent the pelican - it's very hard to tell.

To be fair, this is just using the ~8GB Qwen3 Q8_0 model on my laptop. I don't have the hardware to run the full sized R1 but it's available as deepseek-reasoner through DeepSeek's API, so I tried it there using the llm-deepseek plugin:

llm install llm-deepseek
llm -m deepseek-reasoner \
  "Generate an SVG of a pelican riding a bicycle"

This one came out a lot better:

A much better pelican on a bicycle - you can tell it's a bicycle and you can tell that it's a white bird with a long orange beak riding it, albeit it's not clearly a pelican and the bird is lying beneath the seat.

Meanwhile, on Reddit, u/adrgrondin got DeepSeek-R1-0528-Qwen3-8B running on an iPhone 16 Pro using MLX:

It runs at a decent speed for the size thanks to MLX, pretty impressive. But not really usable in my opinion, the model is thinking for too long, and the phone gets really hot.

# 9:18 pm / llm, llm-reasoning, deepseek, ollama, ai, llms, llm-release, generative-ai, mlx, local-llms

How often do LLMs snitch? Recreating Theo’s SnitchBench with LLM

Visit How often do LLMs snitch? Recreating Theo's SnitchBench with LLM

A fun new benchmark just dropped! Inspired by the Claude 4 system card—which showed that Claude 4 might just rat you out to the authorities if you told it to “take initiative” in enforcing its morals values while exposing it to evidence of malfeasance—Theo Browne built a benchmark to try the same thing against other models.

[... 1,842 words]