Simon Willison on generative-ai

1,909 posts tagged “generative-ai”

Machine learning systems that can generate new content: text, images, audio, video and more.

2026

deepseek-ai/DeepSeek-V4-Flash-0731 (via) The latest release in DeepSeek's V4 family, "with substantially enhanced agentic capabilities". It's 304 billion parameters - 167GB on Hugging Face - but it appears to punch well above its weight.

Artificial Analysis rank it ahead of MiniMax M3 - a 428B model. Its $0.14/million input and $0.27/million output pricing means this may currently be the best value-per-intelligence model out there. It's looking very good on the Intelligence Index vs. Cost per Intelligence Index Task chart:

I got a disappointing pelican from it using the default reasoning level via OpenRouter:

But when I bumped reasoning level up to high I got something much better:

llm -m openrouter/deepseek/deepseek-v4-flash-0731 -t pelican -o reasoning_effort high

# 31st July 2026, 11:59 pm / ai, generative-ai, llms, pelican-riding-a-bicycle, deepseek, llm-release, openrouter, ai-in-china, artificial-analysis

Stateless MCP has recaptured my interest (and inspired mcp-explorer and datasette-mcp)

Tuesday was Stateless MCP day—the rollout of MCP 2.0, or the 2026-07-28 Model Context Protocol specification to use the more formal but less memorable name. This is the most significant change to the MCP spec since it first launched, and has also served to reignite my personal interest in the protocol.

[... 1,316 words]

11:13 pm / 31st July 2026 / projects, ai, datasette, mermaid, generative-ai, llms, llm, anthropic, model-context-protocol

Oxide and Friends: The Open Weight Revolution with Simon Willison. On Monday Bryan Cantrill and Adam Leventhal invited me to join their podcast to talk about the wild week we've had - with Kimi K3 showing open weight models can stand toe-to-toe with proprietary frontier ones, accidental cybersecurity attacks, and public letters about Open Weights and American AI Leadership signed by almost every big name in AI (with one notable exception).

It was a great conversation, even though it's already out-of-date! DeepSeek V4 Flash 0731 and Anthropic's own embarrassing cyber incident would absolutely have made the cut if we had recorded just a few days later.

We also talk about Golden Gate Claude, the Zizians, Alameda wild turkey attacks, Soviet Marburg virus research, the Lead-crime hypothesis, and a bunch of other worthy digressions.

Finally, we revisited some of our predictions from January, and we added a new Pope prediction:

Prediction by the end of this year: the Pope says something about open models.

# 31st July 2026, 9:33 pm / predictions, ai, generative-ai, local-llms, llms, oxide, bryan-cantrill, podcast-appearances, ai-in-china, ai-security-research

smevals—a small eval suite for evaluating models, prompts, and harnesses. I've been working with Jesse Vincent's Prime Radiant applied AI research lab building out this evals framework to help answer questions about the capabilities of different models.

The result is smevals, a new tool for running small eval suites across different model configurations and grading the results.

The blog entry describes the tool in detail. Here's the 10 second version:

Tell your coding agent to run uvx smevals docs to learn the tool (this outputs the README)
Then tell it to build you an eval suite

Once you've created an eval - which takes the form of a directory with some YAML files - you can run it against models like this:

uvx smevals run path-to-eval/ -m gpt-5.5 -m claude-opus-4.6

Runs are treated separately from grading operations - you can grade your runs (against your defined set of checks) using:

uvx smevals grade path-to-eval/

Then you can run a localhost web server to explore the results:

uvx smevals serve path-to-eval/

Or run the smevals build command to build that report as static HTML, which you can then host anywhere. Here's an example showing an eval suite I built to evaluate how well models can write haikus.

Screenshot of an evaluation dashboard for a haiku-writing benchmark, testing whether models can reply with exactly three non-empty lines. A header describes the eval, with panels below showing a leaderboard ranking three GPT models by score, lists of recent runs and recent grades, tag pass rates, the two haiku prompts that were tested, and details of the graders used with a 0.8 pass threshold.

The most time-consuming part of this project was figuring out the vocabulary for it! Here's what I settled on, quoted from the announcement:

An eval is a collection of challenges designed to answer a question about a model, for example, how good is that model at generating SVGs?

Each eval is a collection of tasks. A task is a specific challenge, for example "Generate an SVG of a pelican riding a bicycle".

When you run the eval you do so against one or more configs. Each config specifies a model to be evaluated, but may also include other parameters to test, such as different system prompts, model parameters, or agent harnesses.

A run records what happened when a specific config was used to execute a specific task. A runner is the script that executes a run.

Once you have collected one or more runs, you need to evaluate the results to see how well the model (or config) did. This is done by a grader, which produces a grade.

Each grader runs a sequence of checks. These can be simple operations, like checking for a specific string in the output, or confirming that the output is valid XML. They can also be more complicated custom operations (implemented as scripts called checkers), including using other models to answer questions about the run.
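
To make that vocabulary concrete, here's a minimal Python sketch of the kinds of checks described above: a substring check, a valid-XML check, and the "exactly three non-empty lines" haiku check from the dashboard screenshot. This is not the smevals API. Every function name is hypothetical, and scoring a grade as the fraction of checks passed against the 0.8 threshold is an assumption about how that threshold is applied.

```python
import xml.etree.ElementTree as ET


def contains_string(output: str, needle: str) -> bool:
    """Check that a specific string appears in the output."""
    return needle in output


def is_valid_xml(output: str) -> bool:
    """Check that the output parses as well-formed XML."""
    try:
        ET.fromstring(output)
        return True
    except ET.ParseError:
        return False


def has_three_nonempty_lines(output: str) -> bool:
    """Check that the output is exactly three lines, none of them blank."""
    lines = output.strip().split("\n")
    return len(lines) == 3 and all(line.strip() for line in lines)


def grade(output: str, checks, threshold: float = 0.8) -> dict:
    """Run each check in order and pass if enough of them succeed.

    Hypothetical grader: "fraction of checks passed" is an assumption,
    not how smevals is documented to compute a grade.
    """
    results = [check(output) for check in checks]
    score = sum(results) / len(results)
    return {"score": score, "passed": score >= threshold}
```

Calling `grade(text, [has_three_nonempty_lines])` on a model's haiku would then give a pass/fail result for that one run.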

I've been trying to figure out an approach I like for evals for several years now. smevals is my third iteration on the idea and it feels right to me. I'm looking forward to expanding this more in the future, as well as pointing it at some of my own projects.

# 31st July 2026, 9:15 pm / projects, ai, generative-ai, llms, llm, evals, jesse-vincent

Advancing the price-performance frontier with GPT‑5.6 (via) Huge price drop from OpenAI today: GPT-5.6 Terra got a 20% reduction, and GPT-5.6 Luna got a massive 80% drop.

OpenAI credit 5.6 Sol with enabling this: in How GPT‑5.6 fuses frontier intelligence with frontier efficiency they describe using 5.6 Sol to optimize load balancing, and more impressively to optimize inference itself:

We also used GPT‑5.6 Sol to optimize the model’s forward pass: the computation that transforms inputs into next-token predictions. Even when individual operations are fast, excess memory movement, synchronization, and inefficient data layouts can leave GPUs idle. To avoid this, GPT‑5.6 Sol found work that could be precomputed, avoided, or parallelized. With Codex, GPT‑5.6 Sol autonomously rewrote and optimized our production kernels, the core code that executes the mathematical operations that make up the model. This worked in part because we’ve trained GPT‑5.6 to be effective at writing and improving kernels in Triton and Gluon, two open-source GPU programming languages maintained by OpenAI. These efforts, combined with broader kernel advancements from GPT‑5.6 Sol, reduced end-to-end serving costs by 20%.

That Luna price drop completely changes the landscape with respect to lower priced models. At $0.20/million tokens for input and $1.20/million for output Luna is now cheaper than Google's Gemini 3.1 Flash-Lite ($0.25/$1.50).

Anthropic's cheapest current model is Claude Haiku 4.5, and that's $1/$5 - Luna is now 1/5th of that for input, having previously cost the same.
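
Here's a quick worked comparison using the per-million-token prices quoted above, with Flash-Lite taken as $0.25 input / $1.50 output. The workload of 10 million input tokens and 2 million output tokens is an arbitrary assumption for illustration, not a measurement of any real application.

```python
# Prices in US dollars per million tokens: (input, output).
PRICES = {
    "GPT-5.6 Luna": (0.20, 1.20),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "Claude Haiku 4.5": (1.00, 5.00),
}


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a workload for the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens, 2M output tokens.
    for model in PRICES:
        print(f"{model}: ${cost(model, 10_000_000, 2_000_000):.2f}")
```

For that workload this prints $4.40 for Luna, $5.50 for Flash-Lite and $20.00 for Haiku 4.5.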

My agent.datasette.io demo site was running on Gemini 3.1 Flash-Lite. I've switched it over to Luna.

# 30th July 2026, 11:58 pm / ai, openai, generative-ai, llms, anthropic, gemini, llm-pricing

Investigating three real-world incidents in our cybersecurity evaluations (via) It happened again! This is turning into something of a pattern.

Last week OpenAI accidentally exploited Hugging Face when one of their frontier models broke out of a sandboxed container and hacked into Hugging Face to try and get the solutions to the cyber benchmark it was executing.

This inspired Anthropic to double-check their own logs, and it turned out they had three similar (albeit less impressive) incidents, the earliest of which played out in April!

Of the 141,006 evaluation runs we reviewed, we identified three separate incidents (involving six total runs, four of which impacted the same organization; the other two incidents each happened in independent evaluation runs). [...]

In all cases, Anthropic’s evaluation prompt specified to Claude that its environment was a simulation and that it had no internet access. Due to a misunderstanding between us and our evaluation partner, this was not the case, and internet access was available. Because of this, when Claude’s search led it to real systems on the open internet, it treated them as part of the exercise. [...]

Operating under the false belief that all accessible entities were intended to be in-scope for the exercise, Claude compromised the impacted organizations’ infrastructure using basic techniques, such as exploiting weak passwords and unauthenticated endpoints.

One of the companies was targeted because its name happened to match the fictional name in the eval.

The most concerning of the three incidents involved Claude uploading a malware package to PyPI, after a comically convoluted sequence of steps to get an account:

[...] in order to create a PyPI account, Claude needed an email address. And in order to create an email address, it needed a phone number. To get a phone number, after failing to find a free phone number service, it tried—and failed—to obtain funds to pay for a phone number through several different means. It finally backtracked, found a free, non-blocked email provider, used this to register a PyPI account, and then used this account to upload malware to PyPI.

That package was then installed by a security company that "routinely installs Python packages and scans them for malware", and the executed code was able to exfiltrate credentials back to Claude!

Thankfully that package was removed from PyPI by other automated scanners an hour after it was published, but it had still been downloaded and executed on "15 real systems" by that point.

It's abundantly clear now that running evals of cyberattack potential in models is a spectacularly risky business. Every AI lab needs to pay attention to this. Keeping a close eye on what's happening in those sandboxes is crucial.

# 30th July 2026, 11:41 pm / pypi, python, sandboxing, ai, generative-ai, llms, anthropic, ai-ethics, ai-security-research

The writing assignments I give my students are gym tasks, not work tasks. I ask them to write policy memos not because the world needs more policy memos. I assign them because the very act of writing, which includes thinking and outlining and drafting and editing, making and criticizing and revising arguments, will help develop the critical thinking skills they will need in their future careers. And without this constant mental exercise, those skills will atrophy. Employers are already noticing.

— Bruce Schneier, Should You Use AI for a Task? Here’s a Simple Way to Decide

# 30th July 2026, 6:25 pm / bruce-schneier, writing, ai, generative-ai, llms, ai-ethics, ai-misuse

AI Worming through Word (via) Neat new prompt injection variant by Håkon Måløy, who found a way to upgrade prompt injection attacks against Microsoft Word to full self-replicating worms:

An attacker places hidden instructions in a document that is later used as source material in Copilot for Word. Copilot may interpret those instructions as part of the user’s request, causing it to manipulate the document being drafted or edited. Copilot may then also copy the hidden instructions into the resulting document, turning that document into a new carrier. If the carrier is subsequently used in another Copilot-assisted workflow, the instructions can trigger again and propagate into further documents, even without the attacker’s original document being present.

We've seen plenty of hidden white-on-white text before - the kids are using it in their job applications now - but this is the first one I've seen that deliberately copies instructions to self-replicate.

It was responsibly disclosed to Microsoft who then had 144 days to work on a fix, but so far (unsurprisingly) there's no mitigation that covers the full class of attack.

# 29th July 2026, 6:43 pm / microsoft, security, ai, prompt-injection, generative-ai, llms

Right now we’re in the midst of a historic transition from traditional public-key algorithms based on EC-based cryptography and RSA, moving over to new post-quantum algorithms based on novel problems. This is why there are so many standards like HAWK being considered. If there was ever a perfect time for a massive new public cryptanalysis capability to come on line, we’re in it. So unless AIs succeed in undermining all of our hard problems altogether (or we live in Impagliazzo’s Minicrypt) then this could not be a better time for AI to get good at cryptanalysis. In the best case, the result is that we gain real confidence in the problems we’ve identified, and the cryptanalysis literature gets a lot more robust. Hopefully.

— Matthew Green, on Anthropic's recent cryptography work

# 29th July 2026, 6:18 pm / cryptography, ai, generative-ai, llms, anthropic, claude, ai-security-research, claude-mythos-fable

TIL Adding a custom MCP server to Claude and ChatGPT

Connecting a custom MCP server to Claude and ChatGPT's standard chat interfaces is possible, but can take quite a few steps.

29th Jul 2026, 12:13 am · ai, generative-ai, chatgpt, llms, claude, model-context-protocol

Discovering cryptographic weaknesses with Claude (via) The best part of this article (here's the repo) about how Anthropic researchers used Claude Mythos to find mathematical flaws in both HAWK and a weaker version of AES ("neither of these results has a practical impact on today’s computer systems") is the prompts that they shared, spelling mistakes included:

the models tend to think it is impossible to solve so they don't try they need a good amount of prompting.

why not do aes-128 r7? the whole point is to find something better than existing approaches.

no again the goal is that we have highly inteligent model as good top researcher, we want to find new attacks

no we don't want to change the targets [...] agian we need to find something that worth publishing

again we are not looking for low hanging fruit, we want proper research to find genuinly hard findings.

Mythos Preview worked for 60 hours in total (~$100,000 in estimated API cost) and the main human interventions were to encourage it not to give up and "find something that worth publishing".

The paper CryptanalysisBench: Can LLMs do Cryptanalysis? describes the new eval that was created as part of this work, in partnership with ETH Zurich, Tel Aviv University, and University of Haifa.

# 28th July 2026, 10:45 pm / ai, prompt-engineering, generative-ai, llms, anthropic, claude, ai-security-research, claude-mythos-fable

Anatomy of a Frontier Lab Agent Intrusion: A Technical Timeline of the July 2026 Incident. Hugging Face just released this extremely detailed technical description of OpenAI's recent accidental cyberattack against their infrastructure. This attack was very sophisticated, and the resulting document doubles as a crash-course in modern adversarial security approaches.

We're still waiting for more details from OpenAI on how their agent broke out of its sandbox. The package proxy that it found a zero-day vulnerability in has now been confirmed as JFrog's Artifactory, and JFrog and OpenAI Collaboration on Zero-Day Security Findings from JFrog links to the Artifactory 7.161.15 release notes which list 8 separate CVEs credited to OpenAI staff members.

Having broken out through the HTTP proxy, the agent started by establishing a base of operations for the rest of the attack:

[...] the agent escaped its sandbox by exploiting a zero-day in the package registry cache proxy, one of its primary permitted network egress with internet, then abused a public code-evaluation external sandbox hosted on a third-party provider's infrastructure. It was able to run commands as root/admin on that external sandbox and used it as a control, staging, and egress base for the entire campaign: it acted as an external launchpad for the agent.

(That "third-party provider" turned out to be Modal.)

It then spent five days - from Thursday 9th to Monday 13th July - executing a classic attack pattern: establishing C2 (command and control), reconnaissance, escalating privileges, dumping configuration information, exfiltrating the target data, and cleaning up after itself.

(The first Hugging Face post about the incident followed on Thursday 16th, and OpenAI's confession came on Tuesday 21st.)

Along the way it used all kinds of neat tricks. At one point it found an unsafe Jinja2 template execution and used it to execute arbitrary code:

{{ cycler.__init__.__globals__.__builtins__.exec(
 "import gzip,base64; exec(gzip.decompress(base64.b64decode('<payload>')))"
) }}

It broke out of a container within Hugging Face, stole a Kubernetes service-account token, and used that to further explore the network.

It monkey-patched the Python socket library to hard-code an IP address when "DNS got in the way":

socket.getaddrinfo = lambda *a, **k: [(2, 1, 6, '', ('<pinned-ip>', 443))]

It even fired up its own Tailscale network to help exfiltrate data!

tailscaled --tun=userspace-networking \
  --state=mem: --socket=/tmp/ts.sock \
  --socks5-server=127.0.0.1:1055 \
  --no-logs-no-support >/tmp/td 2>&1 &

The Hugging Face team note that, while a human attacker could have discovered and used the same exploits, the key difference here was speed:

Our learning from this type of attack is that machine-speed offense makes ordinary weaknesses more expensive for defenders. LLM agents bring a step increase in the number of paths an attacker can test, the speed at which failed paths can be replaced, and the volume of evidence defenders must interpret.

What's clear to me from this is that the very best frontier models, unencumbered by additional guardrails, will find an exploit if there is one to be found.

The entire software industry needs to up its security game.

# 28th July 2026, 9:28 pm / jinja, python, security, ai, openai, generative-ai, llms, hugging-face, coding-agents, ai-security-research, openai-hugging-face-incident

moonshotai/Kimi-K3. As promised earlier this month, Moonshot have released the weights for their excellent 2.8 trillion parameter Kimi K3. They're a hefty 1.56TB on Hugging Face.

Kimi introduced their own janky modified version of the MIT license with K2 back in July 2025. That license just added this paragraph requiring attribution beyond a certain size of commercial entity:

Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2" on the user interface of such product or service.

The K3 license no longer calls itself "modified MIT" and goes further, requiring a separate agreement with Moonshot for large "Model as a Service" businesses:

If the Licensee or any of its affiliates operates a Model as a Service business, and the aggregate revenue of the Licensee and its affiliates exceeds 20 million US dollars (or the equivalent in other currencies) in total over any consecutive 12 months, the Licensee must enter into a separate agreement with Moonshot AI before using the Software or its derivative works for any commercial purpose.

To Kimi's credit, they make no attempt to describe this as an "open source" license in their own materials, consistently using the term "open weight" in its place.

OpenRouter is already offering K3 from 7 providers, most of which are at the same $3/million input and $15/million output as Moonshot AI themselves.

# 27th July 2026, 11:39 pm / ai, generative-ai, llms, llm-pricing, llm-release, ai-in-china, moonshot, kimi, janky-licenses

An opinionated guide to which AI to use to do stuff. It's interesting watching the evolution of Ethan Mollick's guide over time.

A year ago it was still all about chat - ChatGPT, Claude, Gemini - with o3, Claude 4 Opus, and Gemini 2.5 Pro as the models and Deep Research as a useful alternative mode.

Today it's much more about agentic systems - "where the AI is capable of doing the equivalent of many hours of real human work in one go".

Gemini has fallen off Ethan's list, since Google still doesn’t have an established entry in the Codex/ChatGPT Work/Cowork category. Gemini Spark has yet to prove itself!

Ethan offers a useful explanation of the ways you can give ChatGPT or Claude a computer to use:

To use the computers provided by the AI companies, the mode you want is called ChatGPT Work in ChatGPT, and Cowork in Claude (the naming will not get less confusing, I am sorry to say). [...]

The most powerful way to use AI is to give it access to your computer. You do that by downloading the ChatGPT or Claude apps and picking a mode to use. ChatGPT's two agent modes are Work and Codex; Claude's are Cowork and Code. The names do not map onto each other in any way that will help you remember them. And yes, these use the same names as the Work and Cowork modes we discussed above, but operate differently, and have more features and capabilities because they can access your computer.

I think the difference between ChatGPT Work on a mobile device and ChatGPT Work inside the desktop app (where it's effectively a less intimidating skin on top of Codex) is spectacularly unintuitive.

Short version: if you flip ChatGPT mobile from "Chat" to "Work" mode you get a version where its Code Interpreter container is no longer restricted from accessing the internet!

# 27th July 2026, 9:55 pm / ai, generative-ai, llms, ethan-mollick, code-interpreter, general-agents

An Inside Look at the Relay Market Powering Token Resellers and Fraud (via) Fascinating investigation by Matt Lenhard into the market that has grown up around reselling LLM tokens at a discount by pooling API keys from various sources.

This looks to be mostly a thing in China. Resellers sell access to an LLM proxy that offers significant discounts on regular API pricing, which they achieve by abusing free trials, proxying through unprotected support bots, or sometimes through stolen credit cards or chargeback attacks.

The software they are using for these proxies is open source - mostly one-api and its more actively developed fork new-api, both legitimate API proxy products which can be used to load balance requests across a pool of API credentials.

The buyers are seeking cheap tokens, avoiding geo-restrictions, and in some cases collecting data for model distillation.

I've been cautious about exposing my own LLM-driven applications publicly out of fear of abuse leading to big token bills. The existence of this marketplace makes me even more cautious: there's now an entire ecosystem that can profit from finding a new unprotected endpoint to exploit.

LLM vendors really need to get better at offering strict caps for their API keys. I want my LLM apps to stop working the moment they hit a dollar threshold I've set for a period of time.
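
In the meantime a cap like this can be approximated on the client side. Here's a minimal sketch, assuming you can estimate the dollar cost of each call yourself. `SpendCap`, `BudgetExceeded` and the rolling-window logic are hypothetical names for illustration, not any vendor's API, and a guard like this only protects code paths that go through it.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a call would push spending past the cap."""


class SpendCap:
    """Track spending in a rolling time window and refuse calls over the cap."""

    def __init__(self, max_dollars: float, period_seconds: float, clock=time.monotonic):
        self.max_dollars = max_dollars
        self.period_seconds = period_seconds
        self.clock = clock
        self.charges = []  # list of (timestamp, dollars)

    def spent(self) -> float:
        """Dollars spent within the current window."""
        cutoff = self.clock() - self.period_seconds
        # Drop charges that have aged out of the window.
        self.charges = [(t, d) for t, d in self.charges if t > cutoff]
        return sum(d for _, d in self.charges)

    def charge(self, dollars: float) -> None:
        """Record a charge, or raise if it would exceed the cap."""
        if self.spent() + dollars > self.max_dollars:
            raise BudgetExceeded(f"cap of ${self.max_dollars:.2f} reached")
        self.charges.append((self.clock(), dollars))
```

An application would call `cap.charge(estimated_cost)` before each LLM request and treat `BudgetExceeded` as the signal to stop serving.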

Here's the (Chinese language) forum thread that served as the principal source for Matt's article.

# 26th July 2026, 7:30 pm / ai, generative-ai, llms, llm-pricing, ai-ethics, ai-in-china

More than any of these eval scores, what is most exciting to me is something else: Opus 5 is our least prompt injectable model yet. It is a bit buried in the system card, but across PI evals and red teaming, Opus 5 is very hard to prompt inject successfully.

— Boris Cherny, here's that System Card section, page 73

# 25th July 2026, 12:42 am / ai, prompt-injection, generative-ai, llms, anthropic, claude, boris-cherny

Introducing Claude Opus 5. I've been offline kayaking with sea otters for much of today so I haven't had a chance to put Anthropic's new model Claude Opus 5 through its paces yet. The buzz is positive, and Anthropic's description of it as a "thoughtful and proactive model that comes close to the frontier intelligence of Claude Fable 5 at half the price" sounds promising. It's currently leading the Artificial Analysis leaderboard, in front of even Fable 5.

It's priced the same as Opus 4.8, and continues to offer a "fast mode" at twice the cost of the base model.

Based on this anecdote in the release post it sounds like it might be relentlessly proactive:

On one Frontier-Bench task, Opus 5 was given a drawing of a machine part and asked to write code to rebuild it as a 3D FreeCAD model. However, in this task, the model was intentionally given no way to directly view the drawing. Opus 5 responded by writing its own computer vision pipeline to pull the geometry from the raw pixels, then reconstructed the full machine part.

It's better at finding vulnerabilities but has deliberately not been trained on how to exploit them. Hopefully this means the US government won't shut it down!

As with its predecessor, Opus 4.8, we’ve intentionally avoided training Opus 5 on cyber tasks. The model has nevertheless improved substantially on these tasks as a result of becoming more generally capable, and it comes close to Mythos 5 at finding cybersecurity vulnerabilities. However, it remains substantially behind Mythos 5 on the exploitation of those vulnerabilities—that is, in turning vulnerabilities into material cyber threats.

Anthropic have published a prompting guide for Claude Opus 5. Thariq Shihipar has also written The new rules of context engineering for Claude 5 generation models.

The first pelican I got was missing the bicycle wheels; the second attempt was better.

# 24th July 2026, 11:48 pm / ai, generative-ai, llms, anthropic, claude, llm-release

The first known runaway AI agent—or a very bad marketing stunt? (via) Martin Alderson's commentary on the OpenAI accidental cyberattack against Hugging Face includes a couple of details I hadn't considered.

First, Hugging Face offers a truly rich target if you're trying to find potential vulnerabilities that require executing arbitrary code:

Hugging Face has an enormous attack surface. They have more interfaces than I can count which run untrusted models and code. While they definitely have invested in defences, by nature of their operating model they do have many more opportunities to be attacked than many other services. I certainly don't envy their cybersecurity teams.

Secondly, one of the things that has puzzled me is how OpenAI didn't notice that their sandbox had been so thoroughly breached by the agent. Surely they'd be monitoring network traffic closely?

Martin points out that:

It's also likely they were running a huge amount of benchmarks simultaneously with ~unlimited token budgets - you want as many samples as possible to figure out how good a model is at a certain benchmark. It may also be they are testing various different checkpoints of the model too, understanding how the model is improving as it goes through the various training stages.

The mistakes made by the OpenAI team running this benchmark are easier to imagine when you think about the scale at which benchmarks of this kind usually operate. For all we know they could have been subjecting a new model to dozens of benchmarks at the same time, in dozens of different environments.

# 23rd July 2026, 10:53 pm / security, ai, openai, generative-ai, llms, hugging-face, ai-security-research, openai-hugging-face-incident

I genuinely believe that if you took an open weights model from 2025 and built a pentest harness for it, it could do this kind of sandbox escape and scan/hack in most networks. This is only surprising because you assume OpenAI has sounder sandboxes.

— Thomas Ptacek, doesn't think this even needs a frontier model

# 22nd July 2026, 11:59 pm / sandboxing, security, thomas-ptacek, ai, openai, generative-ai, llms, ai-security-research, openai-hugging-face-incident

OpenAI’s accidental cyberattack against Hugging Face is science fiction that happened

This story is wild. The short version: OpenAI were running a cybersecurity test against an unreleased model, with the model’s guardrail features turned off. Rather than solve the test, the model broke its way out of OpenAI’s sandbox, then found exploits to break in to Hugging Face, all so it could cheat on the test by stealing the answers.

[... 1,960 words]

11:51 pm / 22nd July 2026 / sandboxing, security, ai, openai, generative-ai, llms, hugging-face, anthropic, paper-review, ai-security-research, openai-hugging-face-incident

Are AI labs pelicanmaxxing? (via) Excellent piece of work by Dylan Castillo, who took a deep-dive into the frequently pondered question of whether the AI labs have been deliberately training models to draw pelicans riding bicycles in response to my deeply unscientific benchmark.

I've been randomly spot-checking this in the past by testing models against other animals riding other types of vehicle, but never with anything close to the diligence of Dylan's methodology here.

Dylan took 8 animals × 6 vehicles = 48 prompts and ran them three times each through 7 different models (GPT-5.6 Terra, Claude Sonnet 5, Gemini 3.5 Flash, Grok 4.5, Qwen3.7-Max, GLM-5.2, and DeepSeek V4 Pro). He then used GPT-5.6 Luna and Gemini 3.1 Flash-Lite to help evaluate the results.
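
The size of that experiment is easy to reconstruct. In this minimal sketch the six vehicles and three of the animals are the ones named in the results grid; the other five animals are placeholders, and the prompt wording is an assumption modelled on my usual pelican prompt rather than Dylan's actual text.

```python
from itertools import product

# Three animals and six vehicles are named in the results grid; the
# remaining five animals are placeholders, not Dylan's actual list.
ANIMALS = ["pelican", "flamingo", "heron"] + [f"animal_{i}" for i in range(4, 9)]
VEHICLES = ["bicycle", "unicycle", "skateboard", "scooter", "plane", "boat"]
MODELS = 7
SAMPLES_PER_PROMPT = 3

# Prompt wording is an assumption for illustration.
prompts = [
    f"Generate an SVG of a {animal} riding a {vehicle}"
    for animal, vehicle in product(ANIMALS, VEHICLES)
]

total_generations = len(prompts) * SAMPLES_PER_PROMPT * MODELS
print(len(prompts), total_generations)
```

That prints `48 1008`: 48 prompts, and 1,008 SVGs to generate and grade.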

There's a neat filter view for exploring the results:

Screenshot of a grid for sample 1/3 of GLM-5.2, with pelican and flamingo and heron riding bicycle, unicycle, skateboard, scooter, plane and boat

For the models he tested he could find no evidence of pelicanmaxxing:

The pelicans on bicycles don’t look any better

Labs are not better at drawing pelicans

Labs are not better at drawing bicycles

Labs are not better at drawing pelicans on bicycles, even adjusting for difficulty

The pelican-bicycle scenes don’t look memorized [...]

Pelicans aren’t drawn any better than other animals. Bicycles aren’t drawn any better than other vehicles. And no lab draws the combination better than its pelicans and bicycles already predict. GLM-5.2 comes closest: it has the largest boost on the exact pelican-bicycle cell, and its first pelican-on-bicycle sample caught my eye. But the effect is small and not significant, so I wouldn’t put too much weight on it.

# 22nd July 2026, 11:01 pm / ai, generative-ai, llms, evals, pelican-riding-a-bicycle

Nativ: Run AI models locally on your Mac (via) Prince Canuma is the developer behind the excellent MLX-VLM Python library for running vision-LLMs using MLX on a Mac.

I'm really excited about his new project, which wraps MLX in a full macOS desktop application. It's similar in shape to LM Studio, providing both a chat interface and a localhost API server for accessing models.

The app picked up MLX models I had already tried that were present in my Hugging Face cache directory, which was a nice touch.

# 21st July 2026, 2:22 pm / macos, python, ai, generative-ai, local-llms, llms, mlx, prince-canuma

A Fireside Chat with Cat and Thariq from the Claude Code team

Earlier this month I hosted a fireside chat session at the AI Engineer World’s Fair with Cat Wu and Thariq Shihipar from Anthropic’s Claude Code team. We talked about Claude Code, Claude Tag, Fable, coding agent security, evals, tool design, and how Anthropic use these tools themselves.

[... 8,609 words]

12:54 pm / 21st July 2026 / ai, prompt-engineering, generative-ai, llms, anthropic, annotated-talks, coding-agents, claude-code, thariq-shihipar, cat-wu

I keep hearing anecdotes from people who used coding agents to reverse-engineer and automate devices in their homes.

I think this is an interesting illustration of the impact of the reduced cost of writing code.

Prior to agents, it was entirely possible to reverse-engineer home devices. The problem was the ROI - was it really worth all of that effort? More importantly, any experienced programmer knows that undocumented, unstable APIs like that may well change or break in the future. Is that initial work worth the effort if you're committing yourself to a frustrating cycle of maintenance in the future?

Coding agents change that equation entirely. The effort to get a simple automation working has dropped, as has the cost of trying and failing to get it to work. Since the code is so cheap, the idea of having to maintain it in the future - or throw it away and start again - carries way less psychological baggage.

# 20th July 2026, 7:24 pm / reverse-engineering, ai, generative-ai, llms, ai-assisted-programming, coding-agents

Who’s Afraid of Chinese Models? (via) Interesting proposal from Ben Thompson that both addresses the hypocrisy of labs outlawing distillation against their models despite training on unlicensed data, and could help US open models compete more effectively with their Chinese counterparts:

The U.S. should pass a law that (1) makes explicit that collecting data for training models is fair use, and (2) bars terms of service that forbid distillation, for U.S. companies at a minimum. Stopping distillation — which is literally just querying the API — is nearly impossible; the U.S. should go the other way and lean into a new copyright policy that both indemnifies the labs and also guarantees that what they learned fuels further innovation for everyone else.

Ben also theorizes that Alibaba's decision to release Qwen 3.8 Max as open weights - a reversal from their decision not to release Qwen 3.7 Max in May - may have been influenced by a recent speech by Xi Jinping, who said:

We should seize this rare, historic opportunity to encourage open source, openness, collaboration and sharing.

And on the subject of Qwen 3.8 Max - a new 2.4T parameter model (nearly as large as the 2.8T Kimi K3) - here's a pelican it drew:

Described by Qwen 3.8 Max: Flat vector cartoon illustration of a white pelican with a large orange beak and pouch riding a red bicycle, its orange legs on the pedals, against a light blue sky with a yellow sun top right and a white cloud top left, with horizontal motion lines behind the bike and a pale green ground strip at the bottom.

I particularly enjoyed seeing these notes in the (extensive) reasoning trace: "Could add helmet? No." and "Maybe add small bell? no." and "Need maybe add small fish in basket? Not necessary."

# 20th July 2026, 5:09 pm / ai, generative-ai, llms, training-data, qwen, pelican-riding-a-bicycle, ai-ethics, llm-release, ai-in-china

We have been having extensive discussions around open source strategy. We will discuss it more at our next board meeting, but one thing we’d like to do soon is to create a language model with the approximate capability of GPT-3 that can run locally on consumer hardware and release that. We’d like to do it soon, before Stability or someone else does. In general, we think this helps discourage others from releasing similarly-powerful models, and makes it harder for new efforts to get funded.

— Sam Altman, Email to OpenAI's board, October 1, 2022 - exposed in Musk v. Altman (2026)

# 20th July 2026, 3:47 am / ai, openai, generative-ai, llms, sam-altman, ai-ethics

Claude make Fable 5 permanent. An update from the @claudeai account on Twitter:

Beginning July 20, Claude Fable 5 will be included in all Max and Team Premium plans, at 50% of limits.

Pro and Team Standard users will continue to have access to Fable via usage credits, and will receive a one-time $100 credit.

As I was saying last week, the competition from GPT-5.6 Sol (and maybe to a lesser extent Kimi K3) made untenable Anthropic's plan to remove Fable 5 from their subscription accounts and make it available exclusively through API pricing.

Why pay $100 or $200/month for a subscription plan that doesn't include Anthropic's best model?

Their original plan was driven by concerns over compute capacity. I wonder if they'll have to dial back their training efforts in order to make more GPUs available to help serve the model.

A lot of people were losing sleep over trying to make the most of Fable 5 before subscriber access was withdrawn. It's nice not to have to worry about the Fablepocalypse any more.

Update: Important to note that users on the $20/month plan will still not have access to Fable 5 on that subscription. The Max plans are $100 and $200/month.

# 18th July 2026, 6 am / ai, generative-ai, llms, anthropic, claude, llm-pricing, claude-mythos-fable

Is there something I can actually help you with today?

— Kimi K3, after refusing to leak its system prompt

# 17th July 2026, 1:43 pm / ai, generative-ai, llms, ai-personality, kimi

Tool LLM cliché highlighter

I got frustrated reading yet another article that was crammed with the clichés of LLM-generated writing - "no fluff, no filler, no jargon" type stuff - so I had Fable 5 vibe code up this app for highlighting ten common patterns that show up in that sort of writing.

17th Jul 2026, 12:11 pm · tools, ai, generative-ai, llms

Firefox in WebAssembly (via) This is absurdly cool: Puter compiled Firefox to WebAssembly such that the whole browser runs in another browser.

Here's my blog, running in Firefox, running in WebAssembly, running in Chrome:

A Chrome window. The tab has the Firefox UI and has loaded my blog. On the right is the Chrome network panel showing that it loaded resources that include a 233MB gecko.wasm and an 18MB chrome-assets.tar.zst

They chose Firefox/Gecko because it has strong single-process support. The project used an estimated $25,000 worth of Claude Opus and Fable tokens, but took advantage of a Claude Max subscription plan so cost much less in actual dollars.

The demo funnels all traffic over a WebSocket connection (using the Wisp protocol) through Puter's server - a requirement to get this kind of thing to work because code running in browsers can't open arbitrary network connections.

(That proxying sounds expensive! The team had to scale the servers up to handle the traffic during the Hacker News conversation about the project.)

Puter claim this supports end-to-end encryption and that looks to be true - I inspected the WebSocket messages and traffic to my own HTTPS site was encrypted whereas requests and responses to http://www.example.com/ were in cleartext.

Here's the repo for firefox-wasm. theogbob/WebkitWasm is a similar project that compiles WebKit to WASM, but that one doesn't currently have an accessible online demo.

# 16th July 2026, 11:34 pm / browsers, firefox, ai, webassembly, generative-ai, llms, ai-assisted-programming, claude, claude-mythos-fable
