Example dashboard

Various statistics from my blog.

Owned by simonw, visibility: Public

Entries

3192

SQL query
select 'Entries' as label, count(*) as big_number from blog_entry

Blogmarks

7946

SQL query
select 'Blogmarks' as label, count(*) as big_number from blog_blogmark

Quotations

1192

SQL query
select 'Quotations' as label, count(*) as big_number from blog_quotation
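
The three big-number widgets above follow the same pattern: each query returns a label column and a big_number column. For reference, the same counts could be gathered in a single query; this is only a sketch assuming the same PostgreSQL tables, not one of the dashboard's widgets:

    -- sketch only: combines the three counts above into one result set
    -- (each dashboard widget above still runs its own single-row query)
    select 'Entries' as label, count(*) as big_number from blog_entry
    union all
    select 'Blogmarks', count(*) from blog_blogmark
    union all
    select 'Quotations', count(*) from blog_quotation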

Chart of number of entries per month over time

SQL query
select '<h2>Chart of number of entries per month over time</h2>' as html
SQL query
select to_char(date_trunc('month', created), 'YYYY-MM') as bar_label,
count(*) as bar_quantity from blog_entry group by bar_label order by count(*) desc
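
The published chart query sorts the bars by count, so the busiest months come first. If a strictly chronological x-axis were wanted instead, a variant like the following would do it; this is a sketch assuming the same blog_entry table, not a query from the dashboard:

    -- sketch only: same bar_label/bar_quantity convention, ordered by month
    select to_char(date_trunc('month', created), 'YYYY-MM') as bar_label,
           count(*) as bar_quantity
    from blog_entry
    group by bar_label
    order by bar_label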

Ten most recent blogmarks (of 7946 total)

SQL query
select '## Ten most recent blogmarks (of ' || count(*) || ' total)' as markdown from blog_blogmark
SQL query
select link_title, link_url, commentary, created from blog_blogmark order by created desc limit 10

10 rows

Columns: link_title, link_url, commentary, created
Awesome Continuous AI https://github.com/githubnext/awesome-continuous-ai

GitHub Next have [coined the term](https://githubnext.com/projects/continuous-ai/) "Continuous AI" to describe "all uses of automated AI to support software collaboration on any platform". It's intended as an echo of Continuous Integration and Continuous Deployment:

> We've chosen the term "Continuous AI" to align with the established concept of Continuous Integration/Continuous Deployment (CI/CD). Just as CI/CD transformed software development by automating integration and deployment, Continuous AI covers the ways in which AI can be used to automate and enhance collaboration workflows.

2025-06-27 01:44:07+00:00
Introducing Gemma 3n: The developer guide https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/

Extremely consequential new open weights model release from Google today:

> - **Multimodal by design:** Gemma 3n natively supports image, audio, video, and text inputs and text outputs.
>
> - **Optimized for on-device:** Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on [**effective**](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#per-layer-embeddings-(ple):-unlocking-more-memory-efficiency) parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.

This is **very** exciting: a 2B and 4B model optimized for end-user devices which accepts text, images *and* audio as inputs!

Gemma 3n is also the most comprehensive day one launch I've seen for any model: Google partnered with "AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM" so there are dozens of ways to try this out right now.

So far I've run two variants on my Mac laptop. Ollama offer [a 7.5GB version](https://ollama.com/library/gemma3n) (full tag `gemma3n:e4b-it-q4_K_M0`) of the 4B model, which I ran like this:

    ollama pull gemma3n
    llm install llm-ollama
    llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"

It drew me this:

![The pelican looks a bit like a grey pig. It is floating above a bicycle that looks more like a rail cart.](https://static.simonwillison.net/static/2025/gemma3n-ollama.jpg)

The Ollama version doesn't appear to support image or audio input yet... but the [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) version does!

First I tried that on [this WAV file](https://static.simonwillison.net/static/2025/pelican-joke-request.wav) like so (using a recipe adapted from [Prince Canuma's video](https://www.youtube.com/watch?v=8-8R2UvUBrc)):

    uv run --with mlx-vlm mlx_vlm.generate \
      --model gg-hf-gm/gemma-3n-E4B-it \
      --max-tokens 100 \
      --temperature 0.7 \
      --prompt "Transcribe the following speech segment in English:" \
      --audio pelican-joke-request.wav

That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:

> Tell me a joke about a pelican.

Then I had it draw me a pelican for good measure:

    uv run --with mlx-vlm mlx_vlm.generate \
      --model gg-hf-gm/gemma-3n-E4B-it \
      --max-tokens 100 \
      --temperature 0.7 \
      --prompt "Generate an SVG of a pelican riding a bicycle"

I quite like this one:

![The bicycle is a sin wave, but you can make out which bits are the frame and the wheels. The pelican is white with a black triangular bill. It's a much better attempt than the Ollama one.](https://static.simonwillison.net/static/2025/gemma3n-mlx-vlm.jpg)

It's interesting to see such a striking visual difference between those 7.5GB and 15GB model quantizations.

Finally, I had it describe the image it had just created:

    uv run --with mlx-vlm mlx_vlm.generate \
      --model gg-hf-gm/gemma-3n-E4B-it \
      --max-tokens 1000 \
      --prompt "Describe image" \
      --image gemma3n-mlx-vlm.jpg

Here's what I got back - it thought this was a chemical diagram!

> The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure is composed of several differently colored and shaped elements connected by curved black lines.
>
> Here's a breakdown of the elements:
>
> * **Large Black Circles:** There are two large, solid black circles, likely representing larger molecules or components.
> * **Cream-Colored Oval:** A prominent cream-colored oval shape is centrally located and connected to other elements. It has a smaller, darker cream-colored sphere attached to its left side.
> * **Red Oval:** A smaller, solid red oval is positioned near one of the black circles and connected to the cream-colored oval.
> * **Black Lines:** These lines act as bonds, connecting the various molecular components. They are curved and dynamic, suggesting movement or interaction.
> * **Triangular Shape:** A small black triangle is attached to the smaller cream-colored sphere.
> * **Letter "I":** The letter "I" appears twice, likely labeling specific parts of the molecule.
>
> The overall impression is of a simplified representation of a biological molecule, possibly a protein or a complex organic compound. The use of different colors helps to distinguish the various components within the structure.

2025-06-26 21:08:36+00:00
Build and share AI-powered apps with Claude https://www.anthropic.com/news/claude-powered-artifacts

Anthropic have added one of the most important missing features to [Claude Artifacts](https://simonwillison.net/tags/claude-artifacts/): apps built as artifacts now have the ability to run their own prompts against Claude via a new API.

Claude Artifacts are web apps that run in a strictly controlled browser sandbox: their access to features like localStorage or the ability to access external APIs via `fetch()` calls is restricted by CSP headers and the `<iframe sandbox="..."` mechanism.

The new `window.claude.complete()` method opens a hole that allows prompts composed by the JavaScript artifact application to be run against Claude.

As before, you can publish apps built using artifacts such that anyone can see them. The moment your app tries to execute a prompt the current user will be required to sign into their own Anthropic account so that the prompt can be billed against them, and not against you.

I'm amused that Anthropic turned "we added a window.claude.complete() function to Artifacts" into what looks like a major new product launch, but I can't say it's bad marketing for them to do that!

As always, the crucial details about how this all works are tucked away in tool descriptions in the system prompt. Thankfully this one was [easy to leak](https://claude.ai/share/42b70567-8534-4080-9227-b834e8c13d6e). Here's [the full set of instructions](https://gist.github.com/simonw/31957633864d1b7dd60012b2205fd747), which start like this:

> When using artifacts and the analysis tool, you have access to window.claude.complete. This lets you send completion requests to a Claude API. This is a powerful capability that lets you orchestrate Claude completion requests via code. You can use this capability to do sub-Claude orchestration via the analysis tool, and to build Claude-powered applications via artifacts.
>
> This capability may be referred to by the user as "Claude in Claude" or "Claudeception".
>
> [...]
>
> The API accepts a single parameter -- the prompt you would like to complete. You can call it like so: `const response = await window.claude.complete('prompt you would like to complete')`

I haven't seen "Claudeception" in any of their official documentation yet!

That `window.claude.complete(prompt)` method is also available to the Claude analysis tool. It takes a string and returns a string - the new function only handles strings.

The tool instructions provide tips to Claude about prompt engineering a JSON response that will look frustratingly familiar:

> 3. Use strict language: Emphasize that the response must be in JSON format only. For example: “Your entire response must be a single, valid JSON object. Do not include any text outside of the JSON structure, including backticks ```.”
> 4. Be emphatic about the importance of having only JSON. If you really want Claude to care, you can put things in all caps – e.g., saying “DO NOT OUTPUT ANYTHING OTHER THAN VALID JSON. DON’T INCLUDE LEADING BACKTICKS LIKE ```json.”.

Talk about Claudeception... now even Claude itself knows that you have to YELL AT CLAUDE to get it to output JSON sometimes.
The API doesn't provide a mechanism for handling previous conversations, but Anthropic works round that by telling the artifact builder how to represent a prior conversation as a JSON encoded array:

> Structure your prompt like this:
>
>     const conversationHistory = [
>       { role: "user", content: "Hello, Claude!" },
>       { role: "assistant", content: "Hello! How can I assist you today?" },
>       { role: "user", content: "I'd like to know about AI." },
>       { role: "assistant", content: "Certainly! AI, or Artificial Intelligence, refers to..." },
>       // ... ALL previous messages should be included here
>     ];
>
>     const prompt = `
>     The following is the COMPLETE conversation history. You MUST consider ALL of these messages when formulating your response:
>     ${JSON.stringify(conversationHistory)}
>
>     IMPORTANT: Your response should take into account the ENTIRE conversation history provided above, not just the last message.
>
>     Respond with a JSON object in this format:
>     {
>       "response": "Your response, considering the full conversation history",
>       "sentiment": "brief description of the conversation's current sentiment"
>     }
>
>     Your entire response MUST be a single, valid JSON object.
>     `;
>
>     const response = await window.claude.complete(prompt);

There's another example in there showing how the state of play for a role playing game should be serialized as JSON and sent with every prompt as well.
The tool instructions acknowledge another limitation of the current Claude Artifacts environment: code that executes there is effectively invisible to the main LLM - error messages are not automatically round-tripped to the model. As a result it makes the following recommendation:

> Using `window.claude.complete` may involve complex orchestration across many different completion requests. Once you create an Artifact, you are not able to see whether or not your completion requests are orchestrated correctly. Therefore, you SHOULD ALWAYS test your completion requests first in the analysis tool before building an artifact.

I've already seen it do this in my own experiments: it will fire up the "analysis" tool (which allows it to run JavaScript directly and see the results) to perform a quick prototype before it builds the full artifact.

Here's my first attempt at an AI-enabled artifact: a translation app. I built it using the following single prompt:

> `Let’s build an AI app that uses Claude to translate from one language to another`

Here's [the transcript](https://claude.ai/share/e26be9a8-739c-45de-8aee-86dafed4aa87). You can [try out the resulting app here](https://claude.ai/public/artifacts/1aeb7042-2004-4549-a97d-ca740d0f1bf0) - the app it built me looks like this:

![Screenshot of Claude AI Translator interface showing: Claude AI Translator logo with blue circular icon containing "文A", "Powered by Claude AI for accurate, context-aware translations", language selection dropdowns showing "From English" and "To Spanish" with blue swap arrows button between them, text input area labeled "Enter text to translate" containing "Tell me some fun facts about pelicans", "Tip: Press Ctrl+Enter to translate", Translation section with "high confidence" indicator in green and Spanish translation "Cuéntame algunos datos curiosos sobre los pelícanos" with copy button icon.](https://static.simonwillison.net/static/2025/ai-translator.jpg)

If you want to use this feature yourself you'll need to turn on "Create AI-powered artifacts" in the "Feature preview" section at the bottom of your "Settings -> Profile" section. I had to do that in the Claude web app as I couldn't find the feature toggle in the Claude iOS application. This [claude.ai/settings/profile](https://claude.ai/settings/profile) page should have it for your account.

2025-06-25 21:47:35+00:00
Gemini CLI https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/

First there was [Claude Code](https://simonwillison.net/2025/Feb/24/claude-37-sonnet-and-claude-code/) in February, then [OpenAI Codex (CLI)](https://simonwillison.net/2025/Apr/16/) in April, and now Gemini CLI in June. All three of the largest AI labs now have their own version of what I am calling a "terminal agent" - a CLI tool that can read and write files and execute commands on your behalf in the terminal.

I'm honestly a little surprised at how significant this category has become: I had assumed that terminal tools like this would always be something of a niche interest, but given the number of people I've heard from spending hundreds of dollars a month on Claude Code this niche is clearly larger and more important than I had thought!

I had a few days of early access to the Gemini one. It's very good - it takes advantage of Gemini's million token context and has good taste in things like when to read a file and when to run a command.

Like OpenAI Codex and unlike Claude Code it's open source (Apache 2) - the full source code can be found in [google-gemini/gemini-cli](https://github.com/google-gemini/gemini-cli) on GitHub. The core system prompt [lives in core/src/core/prompts.ts](https://github.com/google-gemini/gemini-cli/blob/0915bf7d677504c28b079693a0fe1c853adc456e/packages/core/src/core/prompts.ts#L40-L109) - I've extracted that out as [a rendered Markdown Gist](https://gist.github.com/simonw/9e5f13665b3112cea00035df7da696c6).

As usual, the system prompt doubles as extremely accurate and concise documentation of what the tool can do! Here's what it has to say about comments, for example:

> - **Comments:** Add code comments sparingly. Focus on *why* something is done, especially for complex logic, rather than *what* is done. Only add high-value comments if necessary for clarity or if requested by the user. Do not edit comments that are seperate from the code you are changing. *NEVER* talk to the user or describe your changes through comments.

The list of preferred technologies is interesting too:

> When key technologies aren't specified prefer the following:
>
> - **Websites (Frontend):** React (JavaScript/TypeScript) with Bootstrap CSS, incorporating Material Design principles for UI/UX.
> - **Back-End APIs:** Node.js with Express.js (JavaScript/TypeScript) or Python with FastAPI.
> - **Full-stack:** Next.js (React/Node.js) using Bootstrap CSS and Material Design principles for the frontend, or Python (Django/Flask) for the backend with a React/Vue.js frontend styled with Bootstrap CSS and Material Design principles.
> - **CLIs:** Python or Go.
> - **Mobile App:** Compose Multiplatform (Kotlin Multiplatform) or Flutter (Dart) using Material Design libraries and principles, when sharing code between Android and iOS. Jetpack Compose (Kotlin JVM) with Material Design principles or SwiftUI (Swift) for native apps targeted at either Android or iOS, respectively.
> - **3d Games:** HTML/CSS/JavaScript with Three.js.
> - **2d Games:** HTML/CSS/JavaScript.

As far as I can tell Gemini CLI only defines a small selection of tools:

- `edit`: To modify files programmatically.
- `glob`: To find files by pattern.
- `grep`: To search for content within files.
- `ls`: To list directory contents.
- `shell`: To execute a command in the shell.
- `memoryTool`: To remember user-specific facts.
- `read-file`: To read a single file.
- `write-file`: To write a single file.
- `read-many-files`: To read multiple files at once.
- `web-fetch`: To get content from URLs.
- `web-search`: To perform a web search (using [Grounding with Google Search](https://ai.google.dev/gemini-api/docs/google-search) via the Gemini API).

I found most of those by having Gemini CLI inspect its own code for me! Here's [that full transcript](https://gist.github.com/simonw/12c7b072e8e21ef1e040fb3b69c1da28), which used just over 300,000 tokens total.

How much does it cost? The announcement describes a generous free tier:

> To use Gemini CLI free-of-charge, simply login with a personal Google account to get a free Gemini Code Assist license. That free license gets you access to Gemini 2.5 Pro and its massive 1 million token context window. To ensure you rarely, if ever, hit a limit during this preview, we offer the industry’s largest allowance: 60 model requests per minute and 1,000 requests per day at no charge.

It's not yet clear to me if your inputs can be used to improve Google's models if you are using the free tier - that's been the situation with free prompt inference they have offered in the past.

You can also drop in your own paid API key, at which point your data will not be used for model improvements and you'll be billed based on your token usage.

2025-06-25 17:54:15+00:00
Anthropic wins a major fair use victory for AI — but it’s still in trouble for stealing books https://www.theverge.com/news/692015/anthropic-wins-a-major-fair-use-victory-for-ai-but-its-still-in-trouble-for-stealing-books

Major USA legal news for the AI industry today. Judge William Alsup released a "summary judgement" (a legal decision that results in some parts of a case skipping a trial) in a lawsuit between five authors and Anthropic concerning the use of their books in training data.

The [judgement itself](https://www.documentcloud.org/documents/25982181-authors-v-anthropic-ruling/) is a very readable 32 page PDF, and contains all sorts of interesting behind-the-scenes details about how Anthropic trained their models.

The facts of the complaint go back to the very beginning of the company. Anthropic was founded by a group of ex-OpenAI researchers in February 2021. According to the judgement:

> So, in January or February 2021, another Anthropic cofounder, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated. Anthropic's next pirated acquisitions involved downloading distributed, reshared copies of other pirate libraries. In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.

Books3 was also listed as [part of the training data](https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-they-are-trained) for Meta's first LLaMA model!

Anthropic apparently used these sources of data to help build an internal "research library" of content that they then filtered and annotated and used in training runs.

Books turned out to be a very valuable component of the "data mix" to train strong models. By 2024 Anthropic had a new approach to collecting them: purchase and scan millions of print books!

> To find a new way to get books, in February 2024, Anthropic hired the former head of partnerships for Google's book-scanning project, Tom Turvey. He was tasked with obtaining "all the books in the world" while still avoiding as much "legal/practice/business slog" as possible (Opp. Exhs. 21, 27). [...] Turvey and his team emailed major book distributors and retailers about bulk-purchasing their print copies for the AI firm's "research library" (Opp. Exh. 22 at 145; Opp. Exh. 31 at -035589). Anthropic spent many millions of dollars to purchase millions of print books, often in used condition. Then, its service providers stripped the books from their bindings, cut their pages to size, and scanned the books into digital form — discarding the paper originals. Each print book resulted in a PDF copy containing images of the scanned pages with machine-readable text (including front and back cover scans for softcover books).

The summary judgement found that these scanned books *did* fall under fair use, since they were transformative versions of the works and were not shared outside of the company. The downloaded ebooks did *not* count as fair use, and it looks like those will be the subject of a forthcoming jury trial.

Here's that section of the decision:

> Before buying books for its central library, Anthropic downloaded over seven million pirated copies of books, paid nothing, and kept these pirated copies in its library even after deciding it would not use them to train its AI (at all or ever again). Authors argue Anthropic should have paid for these pirated library copies (e.g, Tr. 24–25, 65; Opp. 7, 12–13). This order agrees.

The most important aspect of this case is the question of whether training an LLM on unlicensed data counts as "fair use". The judge found that it did. The argument for why takes up several pages of the document but this seems like a key point:

> Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.

The judge who signed this summary judgement is an interesting character: [William Haskell Alsup](https://en.wikipedia.org/wiki/William_Alsup) (yes, his middle name really is Haskell) presided over jury trials for Oracle America, Inc. v. Google, Inc in 2012 and 2016 where he famously used his hobbyist BASIC programming experience to challenge claims made by lawyers in the case.

2025-06-24 22:01:05+00:00
Phoenix.new – The Remote AI Runtime for Phoenix https://fly.io/blog/phoenix-new-the-remote-ai-runtime/

Fascinating new entrant into the AI-assisted-programming / coding-agents space by [Fly.io](https://fly.io/).

[Phoenix](https://www.phoenixframework.org/) is an open source web framework for Elixir, the Ruby-like language that compiles to Erlang's BEAM bytecode and runs on top of the highly concurrent Erlang runtime. The signature feature is [Phoenix LiveView](https://github.com/phoenixframework/phoenix_live_view/blob/main/README.md#feature-highlights), a toolkit for building realtime interfaces through streaming diffs to server-side HTML over a WebSocket connection.

Phoenix was created by Chris McCord 11 years ago, and Chris joined hosting company Fly nearly four years ago. [Phoenix.new](http://phoenix.new/) is his latest project.

Phoenix LiveView is a really great fit for Fly's distributed infrastructure. Fly co-founder Kurt Mackey [wrote about that](https://fly.io/blog/low-latency-liveview/) in April 2021, before they had hired Chris, describing how LiveView benefits from low latency by "moving app processes close to users" - something Fly has been designed to help with from the start.

There's one major challenge though: Elixir is still a very niche programming language, which means the number of people out there who are ready to spin up a new Phoenix app has always been artificially limited.

Fly's solution? Get LLMs to shave that learning curve down to *almost nothing*.

Phoenix.new is an example of a prompt-driven application development platform. You describe what you want to build, then watch as an LLM-powered coding agent writes, tests and iterates on code to help achieve that goal.

One of the most important problems to solve with coding agents is to give them a robust sandbox where they can run code without breaking things outside of that space. Fly, at their heart, are a sandboxing company - their [Fly Machines](https://fly.io/docs/machines/) product makes it trivial to spin up a new sandboxed VM in just a few seconds. I'm building [Datasette Cloud](https://www.datasette.cloud/) on Fly for exactly that reason.

I tried out Phoenix.new with the following starter prompt:

> `A notebook application. Notes are rich text, using a nice visual markdown editors. The first line of a note becomes the title. Notes have a URL that is a 6 character random text string - thise can be used to link notes to each other by dropping in an @ sign and that text string. A hash and then text becomes a tag, clicking a tag goes to a page with that tag on. Notes are all in SQLite FTS so they can be searched with a search tool.`

Watching it work was *fascinating*. It provides a full IDE-style environment in the browser, running a dedicated container and with a forwarded port to allow you to interact with the application *while it is being built*.

![Browser IDE interface showing a notebook app development project with a left sidebar containing a chat session that starts "What would you like to build?" with options like "todo list", "chat app", "product landing page", then a main editor displaying plan.md file with detailed development steps including "Generate a Phoenix LiveView project called 'notebook'" and "Start the server so you can view the app as we build it". The bottom has a terminal showing "All dependencies are up to date", "Generated notebook app", "Done in 241ms".](https://static.simonwillison.net/static/2025/phoenix-new.jpg)

It spun up a `plan.md`, planned out the approach and started working through the code. Most interestingly, it constantly tested the application it was building using some kind of testing tool that looked similar to my own [shot-scraper](https://shot-scraper.datasette.io/). Here's an example of the output from that:

> I can see the issue! The page is still showing the "Welcome to your Notebook" message instead of the form. Let me wait a moment for the JavaScript to execute and then check if the form appeared.
>
> `$ web http://localhost:4000 --js "setTimeout(() => { document.querySelector('button[phx-click=\"start_creating\"]').click(); }, 500); setTimeout(() => { console.log('Form visible?', !!document.querySelector('textarea')); console.log('Creating note state:', document.body.innerHTML.includes('Create New Note')); }, 2000);"`

I'm used to coding agents that write and execute their own tests, but this is the most impressive example I've seen of a coding agent that actively interacts with and tests the live application while it is iterating on the code.

Most interestingly of all, since LiveView is a realtime framework, I could interact directly with the application at the same time as the machine was writing and testing it, seeing any changes it made instantly in my browser!

One of my favorite features is how Git integration works. Phoenix.new commits constantly as it works, and a menu option for "Clone X to your local computer" gives you a command that looks like this:

    git clone "https://phoenix.new/git/WENQLj...big-random-string...VHFW/$RANDOM/notebook"

Run that locally to get a full copy of the repo! I ran the following to push it all to GitHub:

    git remote add github https://github.com/simonw/phoenix-new-notebook.git
    git push -u github main

You can see the code (and the [commit history](https://github.com/simonw/phoenix-new-notebook/commits)) in my [simonw/phoenix-new-notebook](https://github.com/simonw/phoenix-new-notebook) repo.

<small>*Fly sponsor some of our work on Datasette Cloud, but this article is not sponsored content.*</small>

2025-06-23 18:17:46+00:00
My First Open Source AI Generated Library https://lucumr.pocoo.org/2025/6/21/my-first-ai-library/

Armin Ronacher had Claude and Claude Code do almost *all of the work* in building, testing, packaging and publishing a new Python library based on his design:

> * It wrote ~1100 lines of code for the parser
> * It wrote ~1000 lines of tests
> * It configured the entire Python package, CI, PyPI publishing
> * Generated a README, drafted a changelog, designed a logo, made it theme-aware
> * Did multiple refactorings to make me happier

The project? [sloppy-xml-py](https://github.com/mitsuhiko/sloppy-xml-py), a lax XML parser (and violation of everything the XML Working Group hold sacred) which ironically is necessary because LLMs themselves frequently output "XML" that includes validation errors.

Claude's SVG logo design is actually pretty decent, turns out it can draw [more than just bad pelicans](https://simonwillison.net/2025/May/22/code-with-claude-live-blog/#live-update-357)!

<center>
![Hand drawn style, orange rough rectangle containing < { s } > - then the text Sloppy XML below in black](https://static.simonwillison.net/static/2025/sloppy-xml.jpg)
</center>

I think experiments like this are a really valuable way to explore the capabilities of these models. Armin's conclusion:

> This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.
>
> Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.

I'd like to present a slightly different conclusion here. The most interesting thing about this project is that **the code is good**.

My criteria for good code these days are the following:

1. Solves a defined problem, well enough that I'm not tempted to solve it in a different way
2. Uses minimal dependencies
3. Clear and easy to understand
4. Well tested, with tests that prove that the code does what it's meant to do
5. Comprehensive documentation
6. Packaged and published in a way that makes it convenient for me to use
7. Designed to be easy to maintain and make changes in the future

`sloppy-xml-py` fits all of those criteria. It's useful, well defined, [the code is readable](https://github.com/mitsuhiko/sloppy-xml-py/blob/main/sloppy_xml.py) with just about the right level of comments, everything is tested, the documentation explains everything I need to know, and it's been shipped to PyPI.

I'd be proud to have written this myself.

This example is *not* an argument for replacing programmers with LLMs. The code is good because Armin is an expert programmer who stayed in full control throughout the process. As I wrote the other day, it's the result of [a skilled individual with both deep domain understanding and deep understanding of the capabilities of the agent](https://simonwillison.net/2025/Jun/18/coding-agents/).

2025-06-21 23:22:45+00:00
Edit is now open source https://devblogs.microsoft.com/commandline/edit-is-now-open-source/

Microsoft released a new text editor! Edit is a terminal editor - similar to Vim or nano - that's designed to ship with Windows 11 but is open source, written in Rust and supported across other platforms as well.

> Edit is a small, lightweight text editor. It is less than 250kB, which allows it to keep a small footprint in the Windows 11 image.

![Screenshot of alpine-edit text editor interface with File menu open showing: New File Ctrl+N, Open File... Ctrl+O, Save Ctrl+S, Save As..., Close File Ctrl+W, Exit Ctrl+Q. Window title shows "alpine-edit — Untitled-1.txt - edit — com.docker.cli docker run --platform linux/arm...". Editor contains text "le terminal text editor." Status bar shows "LF UTF-8 Spaces:4 3:44 * Untitled-1.txt".](https://static.simonwillison.net/static/2025/microsoft-edit.jpg)

The [microsoft/edit GitHub releases page](https://github.com/microsoft/edit/releases) currently has pre-compiled binaries for Windows and Linux, but they didn't have one for macOS. (They do have [build instructions using Cargo](https://github.com/microsoft/edit/blob/main/README.md#build-instructions) if you want to compile from source.)

I decided to try and get their released binary working on my Mac using Docker. One thing led to another, and I've now built and shipped a container to the GitHub Container Registry that anyone with Docker on Apple silicon can try out like this:

    docker run --platform linux/arm64 \
      -it --rm \
      -v $(pwd):/workspace \
      ghcr.io/simonw/alpine-edit

Running that command will download a 9.59MB container image and start Edit running against the files in your current directory. Hit Ctrl+Q or use File -> Exit (the mouse works too) to quit the editor and terminate the container.

Claude 4 has a training cut-off date of March 2025, so it was able to [guide me through almost everything](https://claude.ai/share/5f0e6547-a3e9-4252-98d0-56f3141c3694) even down to which page I should go to in GitHub to create an access token with permission to publish to the registry!

I wrote up a new TIL on [Publishing a Docker container for Microsoft Edit to the GitHub Container Registry](https://til.simonwillison.net/github/container-registry) with a revised and condensed version of everything I learned today.

2025-06-21 18:31:56+00:00
model.yaml https://modelyaml.org/

From their [GitHub repo](https://github.com/modelyaml/modelyaml) it looks like this effort quietly launched a couple of months ago, driven by the [LM Studio](https://lmstudio.ai/) team. Their goal is to specify an "open standard for defining crossplatform, composable AI models".

A model can be defined using a YAML file that [looks like this](https://lmstudio.ai/models/mistralai/mistral-small-3.2):

    model: mistralai/mistral-small-3.2
    base:
      - key: lmstudio-community/mistral-small-3.2-24b-instruct-2506-gguf
        sources:
          - type: huggingface
            user: lmstudio-community
            repo: Mistral-Small-3.2-24B-Instruct-2506-GGUF
    metadataOverrides:
      domain: llm
      architectures:
        - mistral
      compatibilityTypes:
        - gguf
      paramsStrings:
        - 24B
      minMemoryUsageBytes: 14300000000
      contextLengths:
        - 4096
      vision: true

This should be enough information for an LLM serving engine - such as LM Studio - to understand where to get the model weights (here that's [lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF](https://huggingface.co/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-GGUF) on Hugging Face, but it leaves space for alternative providers) plus various other configuration options and important metadata about the capabilities of the model.

I like this concept a lot. I've actually been considering something similar for my LLM tool - my idea was to use Markdown with a YAML frontmatter block - but now that there's an early-stage standard for it I may well build on top of this work instead.

I couldn't find any evidence that anyone outside of LM Studio is using this yet, so it's effectively a one-vendor standard for the moment. All of the models in their [Model Catalog](https://lmstudio.ai/models) are defined using model.yaml.

2025-06-21 17:15:21+00:00
AbsenceBench: Language Models Can't Tell What's Missing https://arxiv.org/abs/2506.11440

Here's another interesting result to file under the "jagged frontier" of LLMs, where their strengths and weaknesses are often unintuitive.

Long context models have been getting increasingly good at passing "Needle in a Haystack" tests recently, but what about a problem in the opposite direction?

This paper explores what happens when you give a model some content and then a copy with a portion removed, then ask what changed.

Here's a truncated table of results from the paper:

| Models | Poetry | Sequences | GitHub PRs | Average |
| --- | --- | --- | --- | --- |
| Gemini-2.5-flash`*` | 87.3 | 95.4 | 30.9 | **71.2** |
| Claude-3.7-Sonnet`*` | 72.7 | **96.0** | **40.0** | 69.6 |
| Claude-3.7-Sonnet | 73.5 | 91.4 | 35.7 | 66.9 |
| Gemini-2.5-flash | 79.3 | 85.2 | 26.2 | 63.6 |
| o3-mini`*` | 65.0 | 78.1 | 38.9 | 60.7 |
| GPT-4.1 | 54.3 | 57.5 | 36.2 | 49.3 |
| ... | ... | ... | ... | ... |
| DeepSeek-R1`*` | 38.7 | 29.5 | 23.1 | 30.4 |
| Qwen3-235B`*` | 26.1 | 18.5 | 24.6 | 23.1 |
| Mixtral-8x7B-Instruct | 4.9 | 21.9 | 17.3 | 14.7 |

`*` indicates a reasoning model. Sequences are lists of numbers like `117,121,125,129,133,137`, Poetry consists of 100-1000 line portions from the Gutenberg Poetry Corpus and PRs are diffs with 10 to 200 updated lines.

The strongest models do well at numeric sequences, adequately at the poetry challenge and really poorly with those PR diffs. Reasoning models do slightly better at the cost of burning through a _lot_ of reasoning tokens - often more than the length of the original document.

The paper authors - Harvey Yiyun Fu and Aryan Shrivastava and Jared Moore and Peter West and Chenhao Tan and Ari Holtzman - have a hypothesis as to what's going on here:

> We propose an initial hypothesis explaining this behavior: identifying presence is simpler than absence with the attention mechanisms underlying Transformers (Vaswani et al., 2017). Information included in a document can be directly attended to, while the absence of information cannot.

2025-06-20 23:15:04+00:00
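
One closing note on the dashboard queries themselves: the `markdown` column convention used for the "Ten most recent blogmarks" heading can also render query results directly as Markdown. A minimal sketch (not one of the published dashboard's queries) that would list the same ten links, assuming the same blog_blogmark table:

    -- sketch only: renders the ten most recent blogmarks as a Markdown list
    select '* [' || link_title || '](' || link_url || ') - '
           || to_char(created, 'YYYY-MM-DD') as markdown
    from blog_blogmark
    order by created desc
    limit 10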