Jules, our asynchronous coding agent, is now available for everyone |
https://blog.google/technology/google-labs/jules-now-available/ |
I wrote about the Jules beta [back in May](https://simonwillison.net/2025/May/19/jules/). Google's version of the OpenAI Codex PR-submitting hosted coding tool graduated from beta today.
I'm mainly linking to this now because I like the new term they are using in this blog entry: **Asynchronous coding agent**. I like it so much I [gave it a tag](https://simonwillison.net/tags/asynchronous-coding-agents/).
I continue to avoid the term "agent" as infuriatingly vague, but I can grudgingly accept it when accompanied by a prefix that clarifies the type of agent we are talking about. "Asynchronous coding agent" feels just about obvious enough to me to be useful.
... I just ran a Google search for `"asynchronous coding agent" -jules` and came up with a few more notable examples of this name being used elsewhere:
- [Introducing Open SWE: An Open-Source Asynchronous Coding Agent](https://blog.langchain.com/introducing-open-swe-an-open-source-asynchronous-coding-agent/) is an announcement from LangChain just this morning of their take on this pattern. They provide a hosted version (bring your own API keys) or you can run it yourself with [their MIT licensed code](https://github.com/langchain-ai/open-swe).
- The press release for GitHub's own version of this, [GitHub Introduces Coding Agent For GitHub Copilot](https://github.com/newsroom/press-releases/coding-agent-for-github-copilot), states that "GitHub Copilot now includes an asynchronous coding agent". |
2025-08-06 19:36:24+00:00 |
Qwen3-4B Instruct and Thinking |
https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507 |
Yet another interesting model from Qwen - these are tiny compared to their other recent releases (just 4B parameters, 7.5GB on Hugging Face and even smaller when quantized) but with a 262,144 token context length, which Qwen suggest is essential for all of those thinking tokens.
The new model somehow beats the significantly larger Qwen3-30B-A3B Thinking on the AIME25 and HMMT25 benchmarks, according to Qwen's self-reported scores.
The easiest way to try it on a Mac is via LM Studio, who already have their own MLX quantized versions out. |
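If you'd rather script it than use the LM Studio GUI, the `mlx-lm` Python library can run the same MLX weights. Here's a minimal sketch - the quantized repo name is my assumption, so check the mlx-community organization on Hugging Face for the actual uploads:

```python
# Minimal sketch using mlx-lm (pip install mlx-lm) on Apple Silicon.
# The repo name is an assumption - look for the real quantized uploads
# under the mlx-community organization on Hugging Face.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-4B-Thinking-2507-4bit")

messages = [{"role": "user", "content": "Briefly explain what a mutex is."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Thinking models burn a lot of tokens on reasoning before the final
# answer, so leave plenty of headroom in max_tokens.
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```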
2025-08-06 16:40:11+00:00 |
Tom MacWright: Observable Notebooks 2.0 |
https://macwright.com/2025/07/31/observable-notebooks-2 |
Observable announced [Observable Notebooks 2.0](https://observablehq.com/notebook-kit/) last week - the latest take on their JavaScript notebook technology, this time with an [open file format](https://observablehq.com/notebook-kit/kit) and a brand new [macOS desktop app](https://observablehq.com/notebook-kit/desktop).
Tom MacWright worked at Observable during their first iteration and here provides thoughtful commentary from an insider-to-outsider perspective on how their platform has evolved over time.
I particularly appreciated this aside on the downsides of evolving your own not-quite-standard language syntax:
> Notebook Kit and Desktop [support vanilla JavaScript](https://observablehq.com/notebook-kit/#vanilla-java-script), which is excellent and cool. The Observable changes to JavaScript were always tricky and meant that we struggled to use off-the-shelf parsers, and users couldn't use standard JavaScript tooling like eslint. This is stuff like the `viewof` operator which meant that [Observable was not JavaScript](https://observablehq.com/@observablehq/observable-javascript). [...] *Sidenote*: I now work on [Val Town](https://www.val.town/), which is also a platform based on writing JavaScript, and when I joined it *also* had a tweaked version of JavaScript. We used the `@` character to let you 'mention' other vals and implicitly import them. This was, like it was in Observable, not worth it and we switched to standard syntax: don't mess with language standards folks! |
2025-08-06 16:37:13+00:00 |
No, AI is not Making Engineers 10x as Productive |
https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/ |
Colton Voege on "curing your AI 10x engineer imposter syndrome".
There's a lot of rhetoric out there suggesting that if you can't 10x your productivity through tricks like running a dozen Claude Code instances at once you're falling behind. Colton's piece here is a pretty thoughtful exploration of why that likely isn't true. I found myself agreeing with quite a lot of this article.
I'm a pretty huge proponent of AI-assisted development, but I've never found those 10x claims convincing. I've estimated that LLMs make me 2-5x more productive on the parts of my job which involve typing code into a computer, which is itself a small portion of what I do as a software engineer.
That's not too far from this article's assumptions. From the article:
> I wouldn't be surprised to learn AI helps many engineers do certain tasks 20-50% faster, but the nature of software bottlenecks mean this doesn't translate to a 20% productivity increase and certainly not a 10x increase.
I think that's an under-estimation - I suspect engineers that really know how to use this stuff effectively will get more than a 0.2x increase - but I do think all of the *other stuff* involved in building software makes the 10x thing unrealistic in most cases. |
2025-08-06 00:11:56+00:00 |
Claude Opus 4.1 |
https://www.anthropic.com/news/claude-opus-4-1 |
Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4".
My favorite thing about this model is the version number - treating this as a .1 version increment looks like it's an accurate depiction of the model's capabilities.
Anthropic's own benchmarks show very small incremental gains.
Comparing Opus 4 and Opus 4.1 (I [got 4.1 to extract this information from a screenshot](https://claude.ai/share/c7366629-784a-4088-9fc4-15613aa41a7f) of Anthropic's own benchmark scores, then asked it to look up the links, then verified the links myself and fixed a few):
- **Agentic coding** ([SWE-bench Verified](https://github.com/SWE-bench/SWE-bench)): From 72.5% to 74.5%
- **Agentic terminal coding** ([Terminal-Bench](https://github.com/laude-institute/terminal-bench)): From 39.2% to 43.3%
- **Graduate-level reasoning** ([GPQA Diamond](https://github.com/idavidrein/gpqa)): From 79.6% to 80.9%
- **Agentic tool use** ([TAU-bench](https://github.com/sierra-research/tau-bench)):
    - Retail: From 81.4% to 82.4%
    - **Airline: From 59.6% to 56.0%** *(decreased)*
- **Multilingual Q&A** ([MMMLU](https://huggingface.co/datasets/openai/MMMLU)): From 88.8% to 89.5%
- **Visual reasoning** ([MMMU validation](https://mmmu-benchmark.github.io/)): From 76.5% to 77.1%
- **High school math competition** ([AIME 2025](https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions)): From 75.5% to 78.0%
Likewise, the [model card](https://assets.anthropic.com/m/4c024b86c698d3d4/original/Claude-4-1-System-Card.pdf) shows only tiny changes to the various safety metrics that Anthropic track.
It's priced the same as Opus 4 - $15/million for input and $75/million for output, making it one of [the most expensive models](https://www.llm-prices.com/#sb=input&sd=descending) on the market today.
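As a quick back-of-the-envelope illustration of what that pricing means in practice (my own arithmetic, not from Anthropic's announcement):

```python
# Opus 4.1 pricing: $15 per million input tokens, $75 per million output tokens.
INPUT_PER_MILLION = 15.0
OUTPUT_PER_MILLION = 75.0

def cost(input_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens / 1_000_000 * INPUT_PER_MILLION
        + output_tokens / 1_000_000 * OUTPUT_PER_MILLION
    )

# A single prompt with 10,000 input tokens and 2,000 output tokens:
print(f"${cost(10_000, 2_000):.2f}")  # $0.30 - $0.15 of input plus $0.15 of output
```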
I had it [draw me this pelican](https://gist.github.com/simonw/7fead138d31d751d65c7253a1c18751b) riding a bicycle:

For comparison I got a fresh new pelican [out of Opus 4](https://gist.github.com/simonw/96a958e39aaed10e1e47c1aab2d05e20) which I actually like a little more:

I shipped [llm-anthropic 0.18](https://github.com/simonw/llm-anthropic/releases/tag/0.18) with support for the new model. |
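If you want to call it from LLM's Python API, something like this should work once the plugin is installed - note that the model ID below is my guess, so run `llm models` to see exactly what this release registers:

```python
# Assumes: pip install llm llm-anthropic, plus an API key configured via
# `llm keys set anthropic`. The model ID is a guess - run `llm models`
# to confirm the name that llm-anthropic 0.18 actually registers.
import llm

model = llm.get_model("claude-opus-4.1")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
```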
2025-08-05 17:17:37+00:00 |
A Friendly Introduction to SVG |
https://www.joshwcomeau.com/svg/friendly-introduction-to-svg/ |
This SVG tutorial by Josh Comeau is fantastic. It's filled with neat interactive illustrations - with a pleasingly subtle "click" audio effect as you adjust their sliders - and provides a useful introduction to a bunch of well-chosen SVG fundamentals.
I finally understand what all four numbers in the `viewBox="..."` attribute are for! |
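For anyone else who needed that penny to drop: the four numbers are min-x, min-y, width and height, defining the internal coordinate system that gets scaled to fit the rendered size. A tiny example of my own (not from the tutorial):

```python
# viewBox="min-x min-y width height" - the drawing's internal coordinate
# system, which is scaled to fit the rendered width/height of the element.
svg = """<svg xmlns="http://www.w3.org/2000/svg"
     width="200" height="200"
     viewBox="0 0 100 100">
  <!-- (50, 50) is the centre of the 100x100 viewBox, so this circle lands
       in the middle of the rendered 200x200 pixel image. -->
  <circle cx="50" cy="50" r="40" fill="teal" />
</svg>"""

with open("circle.svg", "w") as f:
    f.write(svg)
```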
2025-08-05 05:20:18+00:00 |
Usage charts for my LLM tool against OpenRouter |
https://openrouter.ai/apps?url=https%3A%2F%2Fllm.datasette.io%2F |
OpenRouter proxies requests to a large number of different LLMs and provides high level statistics of which models are the most popular among their users.
Tools that call OpenRouter can include `HTTP-Referer` and `X-Title` headers to credit that tool with the token usage. My [llm-openrouter](https://github.com/simonw/llm-openrouter/) plugin [does that here](https://github.com/simonw/llm-openrouter/blob/8e4be78e60337154b063faaa7161dddd91462730/llm_openrouter.py#L99C13-L99C20).
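Here's roughly what that looks like if you call OpenRouter directly rather than via the plugin - a minimal sketch based on OpenRouter's documented attribution headers, not the plugin's actual code:

```python
# OpenRouter's chat completions endpoint accepts optional HTTP-Referer and
# X-Title headers, which it uses to attribute token usage to your app.
import os
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
        "HTTP-Referer": "https://llm.datasette.io/",  # URL your app is credited under
        "X-Title": "LLM",  # display name shown for your app
    },
    json={
        "model": "qwen/qwen3-14b",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```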
... which means [this page](https://openrouter.ai/apps?url=https%3A%2F%2Fllm.datasette.io%2F) displays aggregate stats across users of that plugin! Looks like someone has been running a lot of traffic through [Qwen 3 14B](https://openrouter.ai/qwen/qwen3-14b) recently.
 |
2025-08-04 20:00:47+00:00 |
Qwen-Image: Crafting with Native Text Rendering |
https://qwenlm.github.io/blog/qwen-image/ |
Not content with releasing [six excellent open weights LLMs in July](https://simonwillison.net/2025/Jul/30/chinese-models/), Qwen are kicking off August with their first ever image generation model.
Qwen-Image is a 20 billion parameter MMDiT (Multimodal Diffusion Transformer, originally proposed for Stable Diffusion 3) model under an Apache 2.0 license. The [Hugging Face repo](https://huggingface.co/Qwen/Qwen-Image) is 53.97GB.
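The repo should work with diffusers directly if you have the hardware - a minimal sketch, assuming a diffusers release recent enough to support the model (check the model card for the canonical example):

```python
# Minimal sketch - assumes a recent diffusers build with Qwen-Image support
# and a GPU with enough memory for a 20 billion parameter model.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt='A raccoon holding a sign that says "I love trash"',
    num_inference_steps=50,
).images[0]
image.save("raccoon.png")
```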
Qwen released a [detailed technical report](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf) (PDF) to accompany the model. The model builds on their Qwen-2.5-VL vision LLM, and they also made extensive use of that model to help create some of their training data:
> In our data annotation pipeline, we utilize a capable image captioner (e.g., Qwen2.5-VL) to generate not only comprehensive image descriptions, but also structured metadata that captures essential image properties and quality attributes.
>
> Instead of treating captioning and metadata extraction as independent tasks, we designed an annotation framework in which the captioner concurrently describes visual content and generates detailed information in a structured format, such as JSON. Critical details such as object attributes, spatial relationships, environmental context, and verbatim transcriptions of visible text are captured in the caption, while key image properties like type, style, presence of watermarks, and abnormal elements (e.g., QR codes or facial mosaics) are reported in a structured format.
They put a *lot* of effort into the model's ability to render text in a useful way. 5% of the training data (described as "billions of image-text pairs") was data "synthesized through controlled text rendering techniques", ranging from simple text through text on an image background up to much more complex layout examples:
> To improve the model’s capacity to follow complex, structured prompts involving layout-sensitive content, we propose a synthesis strategy based on programmatic editing of pre-defined templates, such as PowerPoint slides or User Interface Mockups. A comprehensive rule-based system is designed to automate the substitution of placeholder text while maintaining the integrity of layout structure, alignment, and formatting.
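As a toy illustration of that general pattern - mine, not Qwen's actual pipeline - the idea is to take a layout template with placeholder slots, substitute new text programmatically, and pair the rendered image with its known text as a training example:

```python
# Toy illustration of rule-based placeholder substitution - not Qwen's
# synthesis system, just the shape of the idea: a fixed layout template
# whose text slots are swapped programmatically, preserving structure.
from string import Template

SLIDE_TEMPLATE = Template("""<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 450">
  <rect width="800" height="450" fill="white"/>
  <text x="40" y="80" font-size="36" font-weight="bold">$title</text>
  <text x="40" y="140" font-size="20">$bullet_1</text>
  <text x="40" y="180" font-size="20">$bullet_2</text>
</svg>""")

rendered = SLIDE_TEMPLATE.substitute(
    title="Quarterly pelican report",
    bullet_1="Pelicans observed: 27",
    bullet_2="Bicycles ridden: 0",
)
# Render `rendered` to an image and keep the substituted text alongside it
# to produce a synthetic image-text training pair.
print(rendered)
```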
I tried the model out using the [ModelScope demo](https://modelscope.cn/aigc/imageGeneration?tab=advanced) - I signed in with GitHub and verified my account via a text message to a phone number. Here's what I got for "A raccoon holding a sign that says "I love trash" that was written by that raccoon":

The raccoon has very neat handwriting!
**Update**: A version of the model exists that can edit existing images but it's [not yet been released](https://github.com/QwenLM/Qwen-Image/issues/3#issuecomment-3151573614):
> Currently, we have only open-sourced the text-to-image foundation model, but the editing model is also on our roadmap and planned for future release. |
2025-08-04 19:11:36+00:00 |
I Saved a PNG Image To A Bird |
https://www.youtube.com/watch?v=hCQCP-5g5bo |
Benn Jordan provides one of the all time great YouTube video titles, and it's justified. He drew an image in an audio spectrogram, played that sound to a talented starling (internet celebrity ["The Mouth"](https://www.tiktok.com/@farijuana_bird/video/7452882774991572254)) and recorded the starling imitating it back to him almost perfectly.
> Hypothetically, if this were an audible file transfer protocol that used a 10:1 data compression ratio, that's nearly 2 megabytes of information per second. While there are a lot of caveats and limitations there, the fact that you could set up a speaker in your yard and conceivably store any amount of data in songbirds is crazy.
This video is full of so much more than just that. Fast forward to [5m58s](https://www.youtube.com/watch?v=hCQCP-5g5bo&t=358s) for footage of a nest full of brown pelicans and the sounds made by their chicks! |
2025-08-04 16:32:51+00:00 |
XBai o4 |
https://huggingface.co/MetaStoneTec/XBai-o4 |
Yet *another* open source (Apache 2.0) LLM from a Chinese AI lab. This model card claims:
> **XBai o4** excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in Medium mode.
This is a 32.8 billion parameter model released by MetaStone AI, a new-to-me lab who released their first model in March - [MetaStone-L1-7B](https://huggingface.co/MetaStoneTec/MetaStone-L1-7B), then followed that with MetaStone-S1 [1.5B](https://huggingface.co/MetaStoneTec/MetaStone-S1-1.5B), [7B](https://huggingface.co/MetaStoneTec/MetaStone-S1-7B) and [32B](https://huggingface.co/MetaStoneTec/MetaStone-S1-32B) in July and now XBai o4 in August.
The MetaStone-S1 models were accompanied by a paper, [Test-Time Scaling with Reflective Generative Model](https://arxiv.org/abs/2507.01951).
There is *very* little information available on the English-language web about MetaStone AI. Their paper shows a relationship with USTC, [University of Science and Technology of China](https://en.wikipedia.org/wiki/University_of_Science_and_Technology_of_China) in Hefei. One of their researchers [confirmed on Twitter](https://x.com/WangMagic_/status/1951690465222217872) that their CEO is from [KWAI](https://en.wikipedia.org/wiki/Kuaishou), which led me to [this Chinese language article](https://www.qbitai.com/2024/07/168071.html) from July last year about Li Yan, formerly of KWAI and now the founder of Wen Xiaobai and [evidently](https://x.com/simonw/status/1951694450369208361) [now](https://x.com/WangMagic_/status/1951694611191324929) the CEO of MetaStone. [www.wenxiaobai.com](https://www.wenxiaobai.com) is listed as the "official website" linked to from [the XBai-o4 README](https://github.com/MetaStone-AI/XBai-o4) on GitHub.
Ivan Fioravanti [got it working under MLX](https://huggingface.co/mlx-community/models?search=xbai-o4) in 4bit, 5bit, 6bit, 8bit and 4bit-DWQ sizes. I tried his [6bit one](https://huggingface.co/mlx-community/XBai-o4-6bit) (a 24.81GB download) in LM Studio and had it "Generate an SVG of a pelican riding a bicycle". It thought for 39.49 seconds ([thinking trace here](https://gist.github.com/simonw/78182fc3409e36f8d22217992967b9d6#response)) and gave me this:

I also had it "Write an HTML and JavaScript page implementing space invaders", which you can [try out here](https://tools.simonwillison.net/space-invaders-xbai-o4-6bit-mlx). It's not great - the invaders don't fire bullets (it listed that as a "potential improvement") and I don't think it's winnable due to timing issues, but it's an interesting addition to [my collection](https://simonwillison.net/tags/space-invaders/).
LM Studio reported a constant memory usage of 25.54GB and according to Activity Monitor used 87.5% of my GPU and around 39% of a CPU core.
It did a *lot* of work in the thinking trace - which took 10m35s! - including writing the first draft of much of the code that ended up in the finished result. Here's an extract:
> Let me think of variables:
>
> `let enemyDirection = 1; // 1 for right, -1 for left`
>
> In the game loop:
> ```
> enemies.forEach(enemy => {
>   enemy.x += enemyDirection * enemy.dx;
> });
> ```
> Then check if any enemy is out of bounds. If so, reverse direction and move all enemies down.
>
> But how to check if any enemy is at the edge?
>
> Maybe find the max x of all enemies and see if it exceeds `canvas.width - enemy.width`.
>
> Alternatively, track the leftmost and rightmost positions of enemies. Maybe a bit complex.
>
> For simplicity, perhaps check if any enemy's `x + width > canvas.width or x < 0`.
Here's [the full transcript](https://gist.github.com/simonw/457d55a5c426e0bbea9d47aee225c391).
My initial impressions are that I'm not as impressed with this model for running on my own laptop as I was with [Qwen3-Coder-30B-A3B-Instruct](https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/) or [GLM-4.5 Air](https://simonwillison.net/2025/Jul/29/space-invaders/).
But... how extraordinary is it that *another* Chinese AI lab has been able to produce a competitive model, this time with far less fanfare than we've seen from Qwen and Moonshot AI and Z.ai. |
2025-08-03 22:21:17+00:00 |