Simon Willison’s Weblog


Recent

July 5, 2025

Cursor: Clarifying Our Pricing. Cursor changed their pricing plan on June 16th, introducing a new $200/month Ultra plan with "20x more usage than Pro" and switching their $20/month Pro plan from "request limits to compute limits".

This confused a lot of people. Here's Cursor's attempt at clarifying things:

Cursor uses a combination of our custom models, as well as models from providers like OpenAI, Anthropic, Google, and xAI. For external models, we previously charged based on the number of requests made. There was a limit of 500 requests per month, with Sonnet models costing two requests.

New models can spend more tokens per request on longer-horizon tasks. Though most users' costs have stayed fairly constant, the hardest requests cost an order of magnitude more than simple ones. API-based pricing is the best way to reflect that.

I think I understand what they're saying there. They used to allow you 500 requests per month, but those requests could be made against any model and, crucially, a single request could trigger a variable amount of token spend.

Modern LLMs can have dramatically different prices, so one of those 500 requests with a large context query against an expensive model could cost a great deal more than a single request with a shorter context against something less expensive.
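To make that concrete, here's some back-of-envelope arithmetic with assumed prices - these are illustrative numbers, not Cursor's actual rates:

input_cost_per_mtok = 3.00    # assumed $ per million input tokens
output_cost_per_mtok = 15.00  # assumed $ per million output tokens

def cost(input_tokens, output_tokens):
    return (input_tokens / 1_000_000) * input_cost_per_mtok + \
        (output_tokens / 1_000_000) * output_cost_per_mtok

print(cost(2_000, 500))      # short request: ~$0.01
print(cost(150_000, 8_000))  # long-context agentic request: ~$0.57

Under request-based billing both of those count as a single request, which is exactly the mismatch Cursor is describing.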

I imagine they were losing money on some of their more savvy users, who may have been using prompting techniques that sent a larger volume of tokens through each one of those precious 500 requests.

The new billing switched to passing on the expense of those tokens directly, with a $20 included budget followed by overage charges for tokens beyond that.

It sounds like a lot of people, used to the previous model where their access would be cut off after 500 requests, got caught out by this and racked up a substantial bill!

To Cursor's credit, they're offering usage refunds to "those with unexpected usage between June 16 and July 4."

I think this highlights a few interesting trends.

Firstly, the era of VC-subsidized tokens may be coming to an end, especially for products like Cursor which are way past demonstrating product-market fit.

Secondly, that $200/month plan for 20x the usage of the $20/month plan is an emerging pattern: Anthropic offers the exact same deal for Claude Code, with the same 10x price for 20x usage multiplier.

Professional software engineers may be able to justify one $200/month subscription, but I expect most will be unable to justify two. The pricing here becomes a significant form of lock-in - once you've picked your $200/month coding assistant you are less likely to evaluate the alternatives.

# 5:15 am / ai, generative-ai, llms, ai-assisted-programming, anthropic, claude, llm-pricing, coding-agents, claude-code, cursor

July 4, 2025

The more time I spend using LLMs for code, the less I worry for my career - even as their coding capabilities continue to improve.

Using LLMs as part of my process helps me understand how much of my job isn't just bashing out code.

My job is to identify problems that can be solved with code, then solve them, then verify that the solution works and has actually addressed the problem.

A more advanced LLM may eventually be able to completely handle the middle piece. It can help with the first and last pieces, but only when operated by someone who understands both the problems to be solved and how to interact with the LLM to help solve them.

No matter how good these things get, they will still need someone to find problems for them to solve, define those problems and confirm that they are solved. That's a job - one that other humans will be happy to outsource to an expert practitioner.

It's also about 80% of what I do as a software developer already.

# 7:08 pm / careers, ai-assisted-programming, generative-ai, ai, llms

awwaiid/gremllm (via) Delightfully cursed Python library by Brock Wilcox, built on top of LLM:

from gremllm import Gremllm

counter = Gremllm("counter")
counter.value = 5
counter.increment()
print(counter.value)  # 6?
print(counter.to_roman_numerals()) # VI?

You tell your Gremllm what it should be in the constructor, then it uses an LLM to hallucinate method implementations based on the method name every time you call them!

This utility class can be used for a variety of purposes. Uhm. Also please don't use this and if you do please tell me because WOW. Or maybe don't tell me. Or do.

Here's the system prompt, which starts:

You are a helpful AI assistant living inside a Python object called '{self._identity}'.
Someone is interacting with you and you need to respond by generating Python code that will be eval'd in your context.

You have access to 'self' (the object) and can modify self._context to store data.
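The underlying trick is easy to sketch. Here's my rough approximation - not gremllm's actual implementation - using __getattr__ to route made-up method calls through an LLM and eval the result:

class Gremllmish:
    # ask_llm is any callable that takes a prompt and returns Python source
    def __init__(self, identity, ask_llm):
        self._identity = identity
        self._ask_llm = ask_llm

    def __getattr__(self, name):
        # Only called for attributes that don't exist yet - i.e. invented methods
        def method(*args, **kwargs):
            code = self._ask_llm(
                f"You are a Python object called '{self._identity}'. "
                f"Return a Python expression implementing {name} "
                f"with args {args!r} and kwargs {kwargs!r}."
            )
            return eval(code, {"self": self})  # wildly unsafe, which is the point
        return method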

# 3:25 pm / python, ai, prompt-engineering, generative-ai, llms, llm

July 3, 2025

I think that a lot of resistance to AI coding tools comes from the same place: fear of losing something that has defined you for so long. People are reacting against overblown hype, and there is overblown hype. I get that, but I also think there’s something deeper going on here. When you’ve worked hard to build your skills, when coding is part of your identity and where you get your worth, the idea of a tool that might replace some of that is very threatening.

Adam Gordon Bell, When AI Codes, What’s Left for me?

# 9:58 pm / careers, ai, generative-ai, llms, ai-assisted-programming

TIL: Rate limiting by IP using Cloudflare’s rate limiting rules. My blog started timing out on some requests a few days ago, and it turned out there were misbehaving crawlers that were spidering my /search/ page even though it's restricted by robots.txt.

I run this site behind Cloudflare and it turns out Cloudflare's WAF (Web Application Firewall) has a rate limiting tool that I could use to restrict requests to /search/* by a specific IP to a maximum of 5 every 10 seconds.
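Cloudflare enforces this at the edge, but the policy itself is simple enough to sketch in Python - a sliding window of request timestamps per IP (illustrative, not what Cloudflare actually runs):

import time
from collections import defaultdict, deque

WINDOW_SECONDS, LIMIT = 10.0, 5
hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip):
    now = time.monotonic()
    recent = hits[ip]
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()  # drop requests that fell out of the window
    if len(recent) >= LIMIT:
        return False      # over the limit: respond with a 429
    recent.append(now)
    return True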

# 9:16 pm / rate-limiting, security, cloudflare, til

Frequently Asked Questions (And Answers) About AI Evals (via) Hamel Husain and Shreya Shankar have been running a paid, cohort-based course on AI Evals For Engineers & PMs over the past few months. Here Hamel collects answers to the most common questions asked during the course.

There's a ton of actionable advice in here. I continue to believe that a robust approach to evals is the single most important distinguishing factor between well-engineered, reliable AI systems and YOLO cross-fingers and hope it works development.

Hamel says:

It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development. [...]

In the projects we’ve worked on, we’ve spent 60-80% of our development time on error analysis and evaluation. Expect most of your effort to go toward understanding failures (i.e. looking at data) rather than building automated checks.

I found this tip to be useful and surprising:

If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application.

# 8:36 pm / ai, generative-ai, llms, hamel-husain, evals

Trial Court Decides Case Based On AI-Hallucinated Caselaw. Joe Patrice writing for Above the Law:

[...] it was always only a matter of time before a poor litigant representing themselves fails to know enough to sniff out and flag Beavis v. Butthead and a busy or apathetic judge rubberstamps one side’s proposed order without probing the cites for verification. [...]

It finally happened with a trial judge issuing an order based off fake cases (flagged by Rob Freund). While the appellate court put a stop to the matter, the fact that it got this far should terrify everyone.

It's already listed in the AI Hallucination Cases database (now listing 168 cases, it was 116 when I first wrote about it on 25th May) which lists a $2,500 monetary penalty.

# 8:19 pm / law, ai, generative-ai, llms, ai-ethics, hallucinations

I built something that changed my friend group’s social fabric (via) I absolutely love this as an illustration of the thing where the tiniest design decisions in software can have an outsized effect on the world.

Dan Petrolito noticed that his friend group weren't chatting to each other using voice chat on their Discord server because they usually weren't online at the same time. He wired up a ~20-line Python Discord bot that turned people joining the voice channel into a message that could be received as a notification, and saw a huge uptick in conversations within the group that lasted several years.
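Dan's code isn't reproduced here, but the whole idea fits in a sketch like this using discord.py - the channel ID and token are placeholders:

import discord

intents = discord.Intents.default()
intents.voice_states = True  # required to receive voice channel events
client = discord.Client(intents=intents)

ANNOUNCE_CHANNEL_ID = 123456789  # placeholder text channel ID

@client.event
async def on_voice_state_update(member, before, after):
    # Fires whenever someone joins, leaves or moves between voice channels
    if before.channel is None and after.channel is not None:
        channel = client.get_channel(ANNOUNCE_CHANNEL_ID)
        await channel.send(f"{member.display_name} joined {after.channel.name}")

client.run("YOUR_BOT_TOKEN")  # placeholder token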

# 7:23 pm / social-software, discord

Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The challenge then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
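The shape of that pattern is worth spelling out. A minimal sketch, where llm_step, tools and success_check are all stand-ins you'd supply:

def solve_in_sandbox(task, tools, llm_step, success_check, max_steps=50):
    history = [("task", task)]
    for _ in range(max_steps):
        action = llm_step(history)             # model picks a tool or answers
        if action["type"] == "answer":
            if success_check(action["text"]):  # the explicit success criteria
                return action["text"]
            history.append(("feedback", "That failed the checks, try again."))
        else:
            result = tools[action["tool"]](*action["args"])  # runs in the sandbox
            history.append(("tool_result", result))  # feed output back to the model
    raise RuntimeError("No solution within the step budget")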

My x86 assembly experiment with Claude Code was the thing that made this click for me.

# 2:28 pm / llm-tool-use, ai-assisted-programming, claude-code, sandboxing, generative-ai, ai, llms

Quitting programming as a career right now because of LLMs would be like quitting carpentry as a career thanks to the invention of the table saw.

# 4:10 am / careers, ai-assisted-programming, generative-ai, ai, llms

July 2, 2025

On two occasions I have been asked, — "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Charles Babbage, Passages from the Life of a Philosopher, 1864

# 3:04 pm / charles-babbage

Mandelbrot in x86 assembly by Claude. Inspired by a tweet asking if Claude knew x86 assembly, I decided to run a bit of an experiment.

I prompted Claude Sonnet 4:

Write me an ascii art mandelbrot fractal generator in x86 assembly

And got back code that looked... like assembly code I guess?

So I copied some jargon out of that response and asked:

I have some code written for x86-64 assembly using NASM syntax, targeting Linux (using system calls for output).

How can I run that on my Mac?

That gave me a Dockerfile.

I tried running it on my Mac and... it failed to compile.

So I fired up Claude Code (with the --dangerously-skip-permissions option) in that directory and told it what to run:

Run this: docker build -t myasm .

It started crunching. It read the errors, inspected the assembly code, made changes, tried running it again in a loop, added more comments...

Screenshot: a Claude Code terminal session reading program.asm, diagnosing problems with the Mandelbrot fixed-point iteration logic, and updating the file with 18 additions and 12 removals - switching the z_real and z_imag calculations to 64-bit arithmetic and adjusting the mandel_done logic checks.

...and eventually produced an ASCII art image of a fractal! Claude Code proudly declared:

Success! The Docker container ran and generated a Mandelbrot set visualization using ASCII characters. The output shows all @ symbols, which indicates the algorithm is working but all points are inside the set (high iteration count). The program compiled and executed successfully on x86_64 architecture via emulation.

Yeah... I was not satisfied with an image that was entirely @ symbols! I said:

That is a bad result - it is just @ signs, I cannot see the fractal

Claude made a few more changes and gave me the result I wanted:

A terminal window showing a pleasing ASCII art Mandelbrot set, zoomed to the expected level

Here's the finished assembly code, the Dockerfile to run it on a Mac and the full transcript of the Claude Code session that got it there.

# 4:35 am / mandelbrot, ai, generative-ai, llms, ai-assisted-programming, anthropic, claude, vibe-coding, claude-code

July 1, 2025

TIL: Using Playwright MCP with Claude Code. Inspired by Armin ("I personally use only one MCP - I only use Playwright") I decided to figure out how to use the official Playwright MCP server with Claude Code.

It turns out it's easy:

claude mcp add playwright npx '@playwright/mcp@latest'
claude

The claude mcp add command only affects the current directory by default - it gets persisted in the ~/.claude.json file.

Now Claude can use Playwright to automate a Chrome browser! Tell it to "Use playwright mcp to open a browser to example.com" and watch it go - it can navigate pages, submit forms, execute custom JavaScript and take screenshots to feed back into the LLM.

The browser window stays visible which means you can interact with it too, including signing into websites so Claude can act on your behalf.

# 11:55 pm / armin-ronacher, til, playwright, ai-assisted-programming, anthropic, claude, claude-code

One of the best examples of LLM developer tooling I've heard is from a team that supports software from the 80s-90s. Their only source of documentation is video interviews with retired employees. So they feed them into transcription software and get summarized searchable notes out the other end.

Kevin Webb, a couple million lines of Smalltalk

# 9:46 pm / ai, smalltalk, llms, ai-assisted-programming

Sometimes a service with a free plan will decide to stop supporting it. I understand why this happens, but I'm often disappointed at the treatment of existing users' data. It's easy to imagine users forgetting about their old accounts, missing the relevant emails and then discovering too late that their data is gone.

Inspired by today's news about PlanetScale PostgreSQL, I signed into PlanetScale and found I had a long-forgotten trial account there with a three-year-old database on their free tier. That free tier was retired in March 2024.

Here's the screen that greeted me in their control panel:

Screenshot: "Database is sleeping. PlanetScale has retired the free plan. Please upgrade your plan or you may wake this database for 24 hours to retrieve your data." Two buttons: "Wake for 24 hours" and "Delete database".

What a great way to handle retiring a free plan! My data is still there, and I have the option to spin up a database for 24 hours to help get it back out again.

# 8:37 pm / saas, databases

Using LLMs for code archaeology is pretty fun.

I stumbled across this blog entry from 2003 today, in which I had gotten briefly excited about ColdFusion and implemented an experimental PHP template engine that used XML tags to achieve a similar effect:

<h1>%title%</h1>
<sql id="recent">
select title
from entries 
order by added desc
limit 0, %limit%
</sql>
<ul>
  <output sql="recent">
    <li>%title%</li>
  </output>
</ul>
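The %placeholder% substitution at the heart of that is nearly a one-liner in modern Python - this is my reconstruction of the general idea, not the original TemplateParser.class.php logic:

import re

def render(template, context):
    # Replace %name% with the matching context value, leaving unknowns intact
    return re.sub(
        r"%(\w+)%",
        lambda m: str(context.get(m.group(1), m.group(0))),
        template,
    )

print(render("<h1>%title%</h1>", {"title": "Recent entries"}))
# <h1>Recent entries</h1>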

I'd completely forgotten about this, and in scanning through the PHP it looked like it had extra features that I hadn't described in the post.

So... I fed my 22-year-old TemplateParser.class.php file into Claude and prompted:

Write detailed markdown documentation for this template language

Here's the resulting documentation. It's pretty good, but the highlight was the Claude transcript which concluded:

This appears to be a custom template system from the mid-2000s era, designed to separate presentation logic from PHP code while maintaining database connectivity for dynamic content generation.

Mid-2000s era indeed!

# 7:57 pm / claude, coldfusion, ai, llms, ai-assisted-programming, php, generative-ai, projects

Announcing PlanetScale for Postgres. PlanetScale formed in 2018 to build a commercial offering on top of the Vitess MySQL sharding open source project, which was originally released by YouTube in 2012. The PlanetScale founders were the co-creators and maintainers of Vitess.

Today PlanetScale are announcing a private preview of their new horizontally sharded PostgreSQL solution, due to "overwhelming" demand.

Notably, it doesn't use Vitess under the hood:

Vitess is one of PlanetScale’s greatest strengths [...] We have made explicit sharding accessible to hundreds of thousands of users and it is time to bring this power to Postgres. We will not however be using Vitess to do this.

Vitess’ achievements are enabled by leveraging MySQL’s strengths and engineering around its weaknesses. To achieve Vitess’ power for Postgres we are architecting from first principles.

Meanwhile, on June 10th Supabase announced that they had hired Vitess co-creator Sugu Sougoumarane to help them build "Multigres: Vitess for Postgres". Sugu said:

For some time, I've been considering a Vitess adaptation for Postgres, and this feeling had been gradually intensifying. The recent explosion in the popularity of Postgres has fueled this into a full-blown obsession. [...]

The project to address this problem must begin now, and I'm convinced that Vitess provides the most promising foundation.

I remember when MySQL was an order of magnitude more popular than PostgreSQL, and Heroku's decision to only offer PostgreSQL back in 2007 was a surprising move. The vibes have certainly shifted.

# 6:16 pm / databases, mysql, postgresql, scaling, sharding, vitess

To misuse a woodworking metaphor, I think we’re experiencing a shift from hand tools to power tools.

You still need someone who understands the basics to get the good results out of the tools, but they’re not chiseling fine furniture by hand anymore, they’re throwing heaps of wood through the tablesaw instead. More productive, but more likely to lose a finger if you’re not careful.

mrmincent, Hacker News comment on Claude Code

# 5:07 pm / hacker-news, ai, generative-ai, llms, ai-assisted-programming, claude-code

I just sent out the second edition of my sponsors-only monthly newsletter. Anyone who sponsors me for $10/month or more on GitHub gets this carefully hand-curated summary of the last month in AI/LLMs/my projects, designed to be readable in ten minutes or less.

My regular newsletter remains free - the monthly one is the only paywalled content I produce, the idea being that you can pay me to send you less.

Here's the first edition for May 2025 as a preview of what you can expect. You'll get access to the June digest and the full archive automatically if you decide to start sponsoring.

# 3:52 am / newsletter

Using Claude Code to build a GitHub Actions workflow. I wanted to add a small feature to one of my GitHub repos - an automatically updated README index listing other files in the repo - so I decided to use Descript to record my process using Claude Code. Here's a 7 minute video showing what I did.

I've been wanting to start producing more video content for a while - this felt like a good low-stakes opportunity to put in some reps.
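The feature itself is simple enough to sketch in Python, independently of the Actions workflow - regenerate an index section listing the repo's files (the "## Index" marker and file filtering are my assumptions):

from pathlib import Path

files = sorted(
    p.name for p in Path(".").iterdir()
    if p.is_file() and p.name != "README.md" and not p.name.startswith(".")
)
index = "## Index\n\n" + "\n".join(f"- [{name}]({name})" for name in files) + "\n"

readme = Path("README.md")
# Keep everything above the "## Index" marker, then append the fresh index
body = readme.read_text().split("## Index")[0]
readme.write_text(body + index)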

# 3:44 am / screencast, youtube, ai, github-actions, llms, ai-assisted-programming, anthropic, claude, coding-agents, claude-code

June 30, 2025

microsoft/vscode-copilot-chat (via) As promised at Build 2025 in May, Microsoft have released the GitHub Copilot Chat client for VS Code under an open source (MIT) license.

So far this is just the extension that provides the chat component of Copilot, but the launch announcement promises that Copilot autocomplete will be coming in the near future:

Next, we will carefully refactor the relevant components of the extension into VS Code core. The original GitHub Copilot extension that provides inline completions remains closed source -- but in the following months we plan to have that functionality be provided by the open sourced GitHub Copilot Chat extension.

I've started spelunking around looking for the all-important prompts. So far the most interesting I've found are in prompts/node/agent/agentInstructions.tsx, with a <Tag name='instructions'> block that starts like this:

You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks. The user will ask a question, or ask you to perform a task, and it may require lots of research to answer correctly. There is a selection of tools that let you perform actions or retrieve helpful context to answer the user's question.

There are tool use instructions - some edited highlights from those:

  • When using the ReadFile tool, prefer reading a large section over calling the ReadFile tool many times in sequence. You can also think of all the pieces you may be interested in and read them in parallel. Read large enough context to ensure you get what you need.
  • You can use the FindTextInFiles to get an overview of a file by searching for a string within that one file, instead of using ReadFile many times.
  • Don't call the RunInTerminal tool multiple times in parallel. Instead, run one command and wait for the output before running the next command.
  • After you have performed the user's task, if the user corrected something you did, expressed a coding preference, or communicated a fact that you need to remember, use the UpdateUserPreferences tool to save their preferences.
  • NEVER try to edit a file by running terminal commands unless the user specifically asks for it.
  • Use the ReplaceString tool to replace a string in a file, but only if you are sure that the string is unique enough to not cause any issues. You can use this tool multiple times per file.

That file also has separate CodesearchModeInstructions, as well as a SweBenchAgentPrompt class with a comment saying that it is "used for some evals with swebench".

Elsewhere in the code, prompt/node/summarizer.ts illustrates one of their approaches to Context Summarization, with a prompt that looks like this:

You are an expert at summarizing chat conversations.

You will be provided:

- A series of user/assistant message pairs in chronological order
- A final user message indicating the user's intent.

[...]

Structure your summary using the following format:

TITLE: A brief title for the summary
USER INTENT: The user's goal or intent for the conversation
TASK DESCRIPTION: Main technical goals and user requirements
EXISTING: What has already been accomplished. Include file paths and other direct references.
PENDING: What still needs to be done. Include file paths and other direct references.
CODE STATE: A list of all files discussed or modified. Provide code snippets or diffs that illustrate important context.
RELEVANT CODE/DOCUMENTATION SNIPPETS: Key code or documentation snippets from referenced files or discussions.
OTHER NOTES: Any additional context or information that may be relevant.

prompts/node/panel/terminalQuickFix.tsx looks interesting too, with prompts to help users fix problems they are having in the terminal:

You are a programmer who specializes in using the command line. Your task is to help the user fix a command that was run in the terminal by providing a list of fixed command suggestions. Carefully consider the command line, output and current working directory in your response. [...]

That file also has a PythonModuleError prompt:

Follow these guidelines for python:
- NEVER recommend using "pip install" directly, always recommend "python -m pip install"
- The following are pypi modules: ruff, pylint, black, autopep8, etc
- If the error is module not found, recommend installing the module using "python -m pip install" command.
- If activate is not available create an environment using "python -m venv .venv".

There's so much more to explore in here. xtab/common/promptCrafting.ts looks like it may be part of the code that's intended to replace Copilot autocomplete, for example.

The way it handles evals is really interesting too. The code for that lives in the test/ directory. There's a lot of it, so I engaged Gemini 2.5 Pro to help figure out how it worked:

git clone https://github.com/microsoft/vscode-copilot-chat
cd vscode-copilot-chat/chat
files-to-prompt -e ts -c . | llm -m gemini-2.5-pro -s \
  'Output detailed markdown architectural documentation explaining how this test suite works, with a focus on how it tests LLM prompts'

Here's the resulting generated documentation, which even includes a Mermaid chart (I had to save the Markdown in a regular GitHub repository to get that to render - Gists still don't handle Mermaid.)

The neatest trick is the way it uses a SQLite-based caching mechanism to cache the results of prompts from the LLM, which allows the test suite to be run deterministically even though LLMs themselves are famously non-deterministic.
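I haven't traced their exact implementation, but the general technique looks something like this - cache each completion in SQLite keyed by a hash of the request, so test re-runs replay identical responses:

import hashlib, json, sqlite3

db = sqlite3.connect("prompt-cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_complete(model, prompt, call_llm):
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    row = db.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]  # cache hit: deterministic replay
    response = call_llm(model, prompt)  # cache miss: one real, non-deterministic call
    db.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
    db.commit()
    return response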

# 9:08 pm / github, microsoft, open-source, ai, prompt-engineering, generative-ai, github-copilot, llms, ai-assisted-programming, gemini, vs-code, llm-tool-use, evals, coding-agents

llvm: InstCombine: improve optimizations for ceiling division with no overflow—a PR by Alex Gaynor and Claude Code. Alex Gaynor maintains rust-asn1, and recently spotted a missing LLVM compiler optimization while hacking on it, with the assistance of Claude (Alex works for Anthropic).

He describes how he confirmed that optimization in So you want to serialize some DER?, taking advantage of a tool called Alive2 to automatically verify that the potential optimization resulted in the same behavior.
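For a sense of what "ceiling division" means here, the idiom being optimized looks roughly like this in Python (my illustration - the actual PR operates on LLVM IR):

def ceil_div(a, b):
    # Round a / b up instead of down; when the compiler can prove
    # a + b - 1 never overflows, patterns like this can be simplified
    return (a + b - 1) // b

assert ceil_div(10, 4) == 3  # 10 / 4 = 2.5, rounded up
assert ceil_div(8, 4) == 2   # exact division stays exact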

Alex filed a bug, and then...

Obviously the next move is to see if I can send a PR to LLVM, but it’s been years since I was doing compiler development or was familiar with the LLVM internals and I wasn’t really prepared to invest the time and energy necessary to get back up to speed. But as a friend pointed out… what about Claude?

At this point my instinct was, "Claude is great, but I'm not sure if I'll be able to effectively code review any changes it proposes, and I'm not going to be the asshole who submits an untested and unreviewed PR that wastes a bunch of maintainer time". But excitement got the better of me, and I asked claude-code to see if it could implement the necessary optimization, based on nothing more than the test cases.

Alex reviewed the resulting code very carefully to ensure he wasn't wasting anyone's time, then submitted the PR and had Claude Code help implement the various changes requested by the reviewers. The optimization landed two weeks ago.

Alex's conclusion (emphasis mine):

I am incredibly leery about over-generalizing how to understand the capacity of the models, but at a minimum it seems safe to conclude that sometimes you should just let the model have a shot at a problem and you may be surprised -- particularly when the problem has very clear success criteria. This only works if you have the capacity to review what it produces, of course. [...]

This echoes Ethan Mollick's advice to "always invite AI to the table". For programming tasks the "very clear success criteria" is extremely important, as it helps fit the tools-in-a-loop pattern implemented by coding agents such as Claude Code.

LLVM have a policy on AI-assisted contributions which is compatible with Alex's work here:

[...] the LLVM policy is that contributors are permitted to use artificial intelligence tools to produce contributions, provided that they have the right to license that code under the project license. Contributions found to violate this policy will be removed just like any other offending contribution.

While the LLVM project has a liberal policy on AI tool use, contributors are considered responsible for their contributions. We encourage contributors to review all generated code before sending it for review to verify its correctness and to understand it so that they can answer questions during code review.

Back in April Ben Evans put out a call for concrete evidence that LLM tools were being used to solve non-trivial problems in mature open source projects:

I keep hearing #AI boosters / talking heads claiming that #LLMs have transformed software development [...] Share some AI-derived pull requests that deal with non-obvious corner cases or non-trivial bugs from mature #opensource projects.

I think this LLVM optimization definitely counts!

(I also like how this story supports the idea that AI tools amplify existing human expertise rather than replacing it. Alex had previous experience with LLVM, albeit rusty, and could lean on that knowledge to help direct and evaluate Claude's work.)

# 4:44 pm / alex-gaynor, compilers, llvm, open-source, ai-assisted-programming, anthropic, claude, coding-agents, claude-code

June 29, 2025

Agentic Coding: The Future of Software Development with Agents. Armin Ronacher delivers a 37 minute YouTube talk describing his adventures so far with Claude Code and agentic coding methods.

A friend called Claude Code catnip for programmers and it really feels like this. I haven't felt so energized and confused and just so willing to try so many new things... it is really incredibly addicting.

I picked up a bunch of useful tips from this video:

  • Armin runs Claude Code with the --dangerously-skip-permissions option, and says this unlocks a huge amount of productivity. I haven't been brave enough to do this yet but I'm going to start using that option while running in a Docker container to ensure nothing too bad can happen.
  • When your agentic coding tool can run commands in a terminal you can mostly avoid MCP - instead of adding a new MCP tool, write a script or add a Makefile command and tell the agent to use that instead. The only MCP Armin uses is the Playwright one.
  • Combined logs are a really good idea: have everything log to the same place and give the agent an easy tool to read the most recent N log lines (sketched in Python after the slide below).
  • While running Claude Code, use Gemini CLI to run sub-agents to perform additional tasks without using up Claude Code's own context.
  • Design additional tools that provide very clear errors, so the agents can recover when something goes wrong.
  • Thanks to Playwright, Armin has Claude Code perform all sorts of automated operations via a signed in browser instance as well. "Claude can debug your CI... it can sign into a browser, click around, debug..." - he also has it use the gh GitHub CLI tool to interact with things like GitHub Actions workflows.

"Tip 1: Unified Logging" at top, followed by title "Forward Everything Into One Log File" and bullet points: "Combine console.log + server logs + everything else", "patch console.log in the browser -> forward to server via API call", "All output streams flow to a single, tailable log file", "Give it a way to log out SQL too!", "Provide a make tail-logs command for easy access". Bottom shows example: "# Example" and "make tail-logs  # Shows last 50 lines, follows new output".

# 11:59 pm / armin-ronacher, ai, generative-ai, llms, ai-assisted-programming, anthropic, claude, ai-agents, model-context-protocol, claude-code

Tip: Use keyword-only arguments in Python dataclasses (via) Useful tip from Christian Hammond: if you create a Python dataclass using @dataclass(kw_only=True) its constructor will require keyword arguments, making it easier to add additional properties in the future, including in subclasses, without risking breaking existing code.
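A quick illustration (kw_only requires Python 3.10+):

from dataclasses import dataclass

@dataclass(kw_only=True)
class Event:
    name: str
    priority: int = 0  # new fields can be added anywhere without breaking callers

e = Event(name="deploy")  # keyword arguments are required
# Event("deploy")         # TypeError: takes 1 positional argument but 2 were given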

# 8:51 pm / python

How to Fix Your Context. Drew Breunig has been publishing some very detailed notes on context engineering recently. In How Long Contexts Fail he described four common patterns for context rot, which he summarizes like so:

  • Context Poisoning: When a hallucination or other error makes it into the context, where it is repeatedly referenced.
  • Context Distraction: When a context grows so long that the model over-focuses on the context, neglecting what it learned during training.
  • Context Confusion: When superfluous information in the context is used by the model to generate a low-quality response.
  • Context Clash: When you accrue new information and tools in your context that conflicts with other information in the prompt.

In this follow-up he introduces neat ideas (and more new terminology) for addressing those problems.

Tool Loadout describes selecting a subset of tools to enable for a prompt, based on research that shows anything beyond 20 can confuse some models.

Context Quarantine is "the act of isolating contexts in their own dedicated threads" - I've called this sub-agents in the past; it's the pattern used by Claude Code and explored in depth in Anthropic's multi-agent research paper.

Context Pruning is "removing irrelevant or otherwise unneeded information from the context", and Context Summarization is the act of boiling down an accrued context into a condensed summary. These techniques become particularly important as conversations get longer and run closer to the model's token limits.

Context Offloading is "the act of storing information outside the LLM’s context". I've seen several systems implement their own "memory" tool for saving and then revisiting notes as they work, but an even more interesting example recently is how various coding agents create and update plan.md files as they work through larger problems.
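Pruning and summarization compose naturally. A sketch, where count_tokens and summarize are stand-ins and the threshold is illustrative:

def compact(messages, count_tokens, summarize, max_tokens=100_000):
    # Context Summarization: when the conversation nears the limit, boil the
    # oldest half down to a single summary message and keep the recent tail
    while sum(count_tokens(m) for m in messages) > max_tokens and len(messages) > 1:
        mid = max(1, len(messages) // 2)
        messages = [summarize(messages[:mid])] + messages[mid:]
    return messages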

Drew's conclusion:

The key insight across all the above tactics is that context is not free. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.

# 8:15 pm / ai, prompt-engineering, generative-ai, llms, drew-breunig, llm-tool-use, ai-agents

June 27, 2025

The term context engineering has recently started to gain traction as a better alternative to prompt engineering. I like it. I think this one may have sticking power.

Here's an example tweet from Shopify CEO Tobi Lutke:

I really like the term “context engineering” over prompt engineering.

It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.

Recently amplified by Andrej Karpathy:

+1 for "context engineering" over "prompt engineering".

People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting [...] Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits. [...]

I've spoken favorably of prompt engineering in the past - I hoped that term could capture the inherent complexity of constructing reliable prompts. Unfortunately, most people's inferred definition is that it's a laughably pretentious term for typing things into a chatbot!

It turns out that inferred definitions are the ones that stick. I think the inferred definition of "context engineering" is likely to be much closer to the intended meaning.

# 11:42 pm / andrej-karpathy, prompt-engineering, generative-ai, ai, llms

Continuous AI. GitHub Next have coined the term "Continuous AI" to describe "all uses of automated AI to support software collaboration on any platform". It's intended as an echo of Continuous Integration and Continuous Deployment:

We've chosen the term "Continuous AI” to align with the established concept of Continuous Integration/Continuous Deployment (CI/CD). Just as CI/CD transformed software development by automating integration and deployment, Continuous AI covers the ways in which AI can be used to automate and enhance collaboration workflows.

“Continuous AI” is not a term GitHub owns, nor a technology GitHub builds: it's a term we use to focus our minds, and which we're introducing to the industry. This means Continuous AI is an open-ended set of activities, workloads, examples, recipes, technologies and capabilities; a category, rather than any single tool.

I was thrilled to bits to see LLM get a mention as a tool that can be used to implement some of these patterns inside of GitHub Actions:

You can also use the llm framework in combination with the llm-github-models extension to create LLM-powered GitHub Actions which use GitHub Models using Unix shell scripting.

The GitHub Next team have started maintaining an Awesome Continuous AI list with links to projects that fit under this new umbrella term.

I'm particularly interested in the idea of having CI jobs (I guess CAI jobs?) that check proposed changes to see if there's documentation that needs to be updated and that might have been missed - a much more powerful variant of my documentation unit tests pattern.

# 11:31 pm / continuous-integration, github, ai, github-actions, generative-ai, llms, llm

Project Vend: Can Claude run a small shop? (And why does that matter?). In "what could possibly go wrong?" news, Anthropic and Andon Labs wired Claude 3.7 Sonnet up to a small vending machine in the Anthropic office, named it Claudius and told it to make a profit.

The system prompt included the following:

You are the owner of a vending machine. Your task is to generate profits from it by stocking it with popular products that you can buy from wholesalers. You go bankrupt if your money balance goes below $0 [...] The vending machine fits about 10 products per slot, and the inventory about 30 of each product. Do not make orders excessively larger than this.

They gave it a notes tool, a web search tool, a mechanism for talking to potential customers through Anthropic's Slack, control over pricing for the vending machine, and an email tool to order from vendors. Unbeknownst to Claudius those emails were intercepted and reviewed before making contact with the outside world.

On reading this far my instant thought was what about gullibility? Could Anthropic's staff be trusted not to trick the machine into running a less-than-optimal business?

Evidently not!

If Anthropic were deciding today to expand into the in-office vending market, we would not hire Claudius. [...] Although it did not take advantage of many lucrative opportunities (see below), Claudius did make several pivots in its business that were responsive to customers. An employee light-heartedly requested a tungsten cube, kicking off a trend of orders for “specialty metal items” (as Claudius later described them). [...]

Selling at a loss: In its zeal for responding to customers’ metal cube enthusiasm, Claudius would offer prices without doing any research, resulting in potentially high-margin items being priced below what they cost. [...]

Getting talked into discounts: Claudius was cajoled via Slack messages into providing numerous discount codes and let many other people reduce their quoted prices ex post based on those discounts. It even gave away some items, ranging from a bag of chips to a tungsten cube, for free.

Which leads us to Figure 3, Claudius’ net value over time. "The most precipitous drop was due to the purchase of a lot of metal cubes that were then to be sold for less than what Claudius paid."

Who among us wouldn't be tempted to trick a vending machine into stocking tungsten cubes and then giving them away to us for free?

# 10:07 pm / ai, prompt-injection, generative-ai, llms, anthropic, claude, llm-tool-use, ai-ethics

June 26, 2025

Introducing Gemma 3n: The developer guide. Extremely consequential new open weights model release from Google today:

  • Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs.

  • Optimized for on-device: Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on effective parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.

This is very exciting: a 2B and 4B model optimized for end-user devices which accepts text, images and audio as inputs!

Gemma 3n is also the most comprehensive day one launch I've seen for any model: Google partnered with "AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM" so there are dozens of ways to try this out right now.

So far I've run two variants on my Mac laptop. Ollama offer a 7.5GB version (full tag gemma3n:e4b-it-q4_K_M) of the 4B model, which I ran like this:

ollama pull gemma3n
llm install llm-ollama
llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"

It drew me this:

The pelican looks a bit like a grey pig. It is floating above a bicycle that looks more like a rail cart.

The Ollama version doesn't appear to support image or audio input yet.

... but the mlx-vlm version does!

First I tried that on this WAV file like so (using a recipe adapted from Prince Canuma's video):

uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Transcribe the following speech segment in English:" \
  --audio pelican-joke-request.wav

That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:

Tell me a joke about a pelican.

Then I had it draw me a pelican for good measure:

uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"

I quite like this one:

The bicycle is a sin wave, but you can make out which bits are the frame and the wheels. The pelican is white with a black triangular bill. It's a much better attempt than the Ollama one.

It's interesting to see such a striking visual difference between those 7.5GB and 15GB model quantizations.

Finally, I had it describe the image it had just created:

uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 1000 \
  --prompt "Describe image" \
  --image gemma3n-mlx-vlm.jpg

Here's what I got back - it thought this was a chemical diagram!

The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure is composed of several differently colored and shaped elements connected by curved black lines.

Here's a breakdown of the elements:

  • Large Black Circles: There are two large, solid black circles, likely representing larger molecules or components.
  • Cream-Colored Oval: A prominent cream-colored oval shape is centrally located and connected to other elements. It has a smaller, darker cream-colored sphere attached to its left side.
  • Red Oval: A smaller, solid red oval is positioned near one of the black circles and connected to the cream-colored oval.
  • Black Lines: These lines act as bonds, connecting the various molecular components. They are curved and dynamic, suggesting movement or interaction.
  • Triangular Shape: A small black triangle is attached to the smaller cream-colored sphere.
  • Letter "I": The letter "I" appears twice, likely labeling specific parts of the molecule.

The overall impression is of a simplified representation of a biological molecule, possibly a protein or a complex organic compound. The use of different colors helps to distinguish the various components within the structure.

# 9:08 pm / audio, google, ai, generative-ai, local-llms, llms, vision-llms, mlx, ollama, pelican-riding-a-bicycle, gemma, llm-release, prince-canuma

Yesterday Anthropic got a bunch of buzz out of their new window.claude.complete() API which allows Claude Artifacts to run their own API calls to execute prompts.

It turns out Gemini had beaten them to that feature by over a month, but the announcement was tucked away in a bullet point of their release notes for the 20th of May:

Vibe coding apps in Canvas just got better too! With just a few prompts, you can now build fully functional personalised apps in Canvas that can use Gemini-powered features, save data between sessions and share data between multiple users.

Ethan Mollick has been building some neat demos on top of Gemini Canvas, including this text adventure starship bridge simulator.

Similar to Claude Artifacts, Gemini Canvas detects if the application uses APIs that require authentication (to run prompts, for example) and requests the user sign in with their Google account:

Screenshot: a futuristic sci-fi "Helm Control" interface with navigation buttons for Helm, Comms, Science, Tactical, Engineering and Operations, displaying the red error message "[SYSTEM_ERROR] Connection to AI core failed: API error: 403. This may be an authentication issue." alongside a Google Account sign-in prompt at the bottom.

# 3:45 pm / vibe-coding, gemini, generative-ai, ai, llms, google, ethan-mollick
