<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: nicholas-carlini</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/nicholas-carlini.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-07T20:52:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me</title><link href="https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-tag" rel="alternate"/><published>2026-04-07T20:52:54+00:00</published><updated>2026-04-07T20:52:54+00:00</updated><id>https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;em&gt;didn't&lt;/em&gt; release their latest model, Claude Mythos (&lt;a href="https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf"&gt;system card PDF&lt;/a&gt;), today. They have instead made it available to a very restricted set of preview partners under their newly announced &lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The model is a general purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cyber-security research abilities are strong enough that they need to give the software industry as a whole time to prepare.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mythos Preview has already found thousands of high-severity vulnerabilities, including some in &lt;em&gt;every major operating system and web browser&lt;/em&gt;. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's a great deal more technical detail in &lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt; Assessing Claude Mythos Preview’s cybersecurity capabilities&lt;/a&gt; on the Anthropic Red Team blog:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex &lt;a href="https://en.wikipedia.org/wiki/JIT_spraying "&gt;JIT heap spray&lt;/a&gt; that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Plus this comparison with Claude 4.6 Opus:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted.&lt;/p&gt;
&lt;p&gt;Just a few days (&lt;a href="https://simonwillison.net/2026/Apr/3/"&gt;last Friday&lt;/a&gt;) ago I started a new &lt;a href="https://simonwillison.net/tags/ai-security-research/"&gt;ai-security-research&lt;/a&gt; tag on this blog to acknowledge an uptick in credible security professionals pulling the alarm on how good modern LLMs have got at vulnerability research.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/"&gt;Greg Kroah-Hartman&lt;/a&gt; of the Linux kernel:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.&lt;/p&gt;
&lt;p&gt;Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://mastodon.social/@bagder/116336957584445742"&gt;Daniel Stenberg&lt;/a&gt; of &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.&lt;/p&gt;
&lt;p&gt;I'm spending hours per day on this now. It's intense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Thomas Ptacek published &lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;, a post inspired by his &lt;a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/"&gt;podcast conversation&lt;/a&gt; with Anthropic's Nicholas Carlini.&lt;/p&gt;
&lt;p&gt;Anthropic have a 5 minute &lt;a href="https://www.youtube.com/watch?v=INGOC6-LLv0"&gt;talking heads video&lt;/a&gt; describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, where he said (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I've found more bugs in the last couple of weeks than I found in the rest of my life combined&lt;/strong&gt;. We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. &lt;strong&gt;For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it&lt;/strong&gt;. On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches  patches so that anyone who runs the software is no longer vulnerable to these attacks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this on the &lt;a href="https://www.openbsd.org/errata78.html"&gt;OpenBSD 7.8 errata page&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;025: RELIABILITY FIX: March 25, 2026&lt;/strong&gt;  &lt;em&gt;All architectures&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;TCP packets with invalid SACK options could crash the kernel.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ftp.openbsd.org/pub/OpenBSD/patches/7.8/common/025_sack.patch.sig"&gt;A source code patch exists which remedies this problem.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked that change down in the &lt;a href="https://github.com/openbsd/src"&gt;GitHub mirror&lt;/a&gt; of the OpenBSD CVS repo (apparently they still use CVS!) and found it &lt;a href="https://github.com/openbsd/src/blame/master/sys/netinet/tcp_input.c#L2461"&gt;using git blame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/openbsd-27-years.jpg" alt="Screenshot of a Git blame view of C source code around line 2455 showing TCP SACK hole validation logic. Code includes checks using SEQ_GT, SEQ_LT macros on fields like th-&amp;gt;th_ack, tp-&amp;gt;snd_una, sack.start, sack.end, tp-&amp;gt;snd_max, and tp-&amp;gt;snd_holes. Most commits are from 25–27 years ago with messages like &amp;quot;more SACK hole validity testin...&amp;quot; and &amp;quot;knf&amp;quot;, while one recent commit from 3 weeks ago (&amp;quot;Ignore TCP SACK packets wit...&amp;quot;) is highlighted with an orange left border, adding a new guard &amp;quot;if (SEQ_LT(sack.start, tp-&amp;gt;snd_una)) continue;&amp;quot;" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, the surrounding code is from 27 years ago.&lt;/p&gt;
&lt;p&gt;I'm not sure which Linux vulnerability Nicholas was describing, but it may have been &lt;a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5133b61aaf437e5f25b1b396b14242a6bb0508e2"&gt;this NFS one&lt;/a&gt; recently covered &lt;a href="https://mtlynch.io/claude-code-found-linux-vulnerability/"&gt;by Michael Lynch
&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that they're mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.&lt;/p&gt;
&lt;p&gt;I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.&lt;/p&gt;
&lt;p&gt;The bad news for those of us who are &lt;em&gt;not&lt;/em&gt; trusted partners is this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/thomas-ptacek"&gt;thomas-ptacek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="security"/><category term="thomas-ptacek"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="nicholas-carlini"/><category term="ai-ethics"/><category term="llm-release"/><category term="ai-security-research"/></entry><entry><title>Vulnerability Research Is Cooked</title><link href="https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/#atom-tag" rel="alternate"/><published>2026-04-03T23:59:08+00:00</published><updated>2026-04-03T23:59:08+00:00</updated><id>https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Thomas Ptacek's take on the sudden and enormous impact the latest frontier models are having on the field of vulnerability research.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Within the next few months, coding agents will drastically alter both the practice and the economics of exploit development. Frontier model improvement won’t be a slow burn, but rather a step function. Substantial amounts of high-impact vulnerability research (maybe even most of it) will happen simply by pointing an agent at a source tree and typing “find me zero days”.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why are agents so good at this? A combination of baked-in knowledge, pattern matching ability and brute force:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can't design a better problem for an LLM agent than exploitation research.&lt;/p&gt;
&lt;p&gt;Before you feed it a single token of context, a frontier LLM already encodes supernatural amounts of correlation across vast bodies of source code. Is the Linux KVM hypervisor connected to the &lt;code&gt;hrtimer&lt;/code&gt; subsystem, &lt;code&gt;workqueue&lt;/code&gt;, or &lt;code&gt;perf_event&lt;/code&gt;? The model knows.&lt;/p&gt;
&lt;p&gt;Also baked into those model weights: the complete library of documented "bug classes" on which all exploit development builds: stale pointers, integer mishandling, type confusion, allocator grooming, and all the known ways of promoting a wild write to a controlled 64-bit read/write in Firefox.&lt;/p&gt;
&lt;p&gt;Vulnerabilities are found by pattern-matching bug classes and constraint-solving for reachability and exploitability. Precisely the implicit search problems that LLMs are most gifted at solving. Exploit outcomes are straightforwardly testable success/failure trials. An agent never gets bored and will search forever if you tell it to.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The article was partly inspired by &lt;a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/"&gt;this episode of the Security Cryptography Whatever podcast&lt;/a&gt;, where David Adrian, Deirdre Connolly, and Thomas interviewed Anthropic's Nicholas Carlini for 1 hour 16 minutes.&lt;/p&gt;
&lt;p&gt;I just started a new tag here for &lt;a href="https://simonwillison.net/tags/ai-security-research/"&gt;ai-security-research&lt;/a&gt; - it's up to 11 posts already.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/thomas-ptacek"&gt;thomas-ptacek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/careers"&gt;careers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="thomas-ptacek"/><category term="careers"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nicholas-carlini"/><category term="ai-ethics"/><category term="ai-security-research"/></entry><entry><title>The Claude C Compiler: What It Reveals About the Future of Software</title><link href="https://simonwillison.net/2026/Feb/22/ccc/#atom-tag" rel="alternate"/><published>2026-02-22T23:58:43+00:00</published><updated>2026-02-22T23:58:43+00:00</updated><id>https://simonwillison.net/2026/Feb/22/ccc/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.modular.com/blog/the-claude-c-compiler-what-it-reveals-about-the-future-of-software"&gt;The Claude C Compiler: What It Reveals About the Future of Software&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
On February 5th Anthropic's Nicholas Carlini wrote about a project to use &lt;a href="https://www.anthropic.com/engineering/building-c-compiler"&gt;parallel Claudes to build a C compiler&lt;/a&gt; on top of the brand new Opus 4.6&lt;/p&gt;
&lt;p&gt;Chris Lattner (Swift, LLVM, Clang, Mojo) knows more about C compilers than most. He just published this review of the code.&lt;/p&gt;
&lt;p&gt;Some points that stood out to me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Good software depends on judgment, communication, and clear abstraction. AI has amplified this.&lt;/li&gt;
&lt;li&gt;AI coding is automation of implementation, so design and stewardship become more important.&lt;/li&gt;
&lt;li&gt;Manual rewrites and translation work are becoming AI-native tasks, automating a large category of engineering effort.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Chris is generally impressed with CCC (the Claude C Compiler):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Taken together, CCC looks less like an experimental research compiler and more like a competent textbook implementation, the sort of system a strong undergraduate team might build early in a project before years of refinement. That alone is remarkable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's a long way from being a production-ready compiler though:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Several design choices suggest optimization toward passing tests rather than building general abstractions like a human would. [...] These flaws are informative rather than surprising, suggesting that current AI systems excel at assembling known techniques and optimizing toward measurable success criteria, while struggling with the open-ended generalization required for production-quality systems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The project also leads to deep open questions about how agentic engineering interacts with licensing and IP for both open source and proprietary code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If AI systems trained on decades of publicly available code can reproduce familiar structures, patterns, and even specific implementations, where exactly is the boundary between learning and copying?&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/compilers"&gt;compilers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="compilers"/><category term="open-source"/><category term="ai"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="nicholas-carlini"/><category term="coding-agents"/><category term="agentic-engineering"/></entry><entry><title>Opus 4.6 and Codex 5.3</title><link href="https://simonwillison.net/2026/Feb/5/two-new-models/#atom-tag" rel="alternate"/><published>2026-02-05T20:29:20+00:00</published><updated>2026-02-05T20:29:20+00:00</updated><id>https://simonwillison.net/2026/Feb/5/two-new-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Two major new model releases today, within about 15 minutes of each other.&lt;/p&gt;
&lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-6"&gt;released Opus 4.6&lt;/a&gt;. Here's &lt;a href="https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe216"&gt;its pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;OpenAI &lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/"&gt;release GPT-5.3-Codex&lt;/a&gt;, albeit only via their Codex app, not yet in their API. Here's &lt;a href="https://gist.github.com/simonw/bfc4a83f588ac762c773679c0d1e034b"&gt;its pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Not nearly as good - the bicycle is a bit mangled, the pelican not nearly as well rendered - it's more of a line drawing." src="https://static.simonwillison.net/static/2026/codex-5.3-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I've had a bit of preview access to both of these models and to be honest I'm finding it hard to find a good angle to write about them - they're both &lt;em&gt;really good&lt;/em&gt;, but so were their predecessors Codex 5.2 and Opus 4.5. I've been having trouble finding tasks that those previous models couldn't handle but the new ones are able to ace.&lt;/p&gt;
&lt;p&gt;The most convincing story about capabilities of the new model so far is Nicholas Carlini from Anthropic talking about Opus 4.6 and &lt;a href="https://www.anthropic.com/engineering/building-c-compiler"&gt;Building a C compiler with a team of parallel Claudes&lt;/a&gt; - Anthropic's version of Cursor's &lt;a href="https://simonwillison.net/2026/Jan/23/fastrender/"&gt;FastRender project&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="nicholas-carlini"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="codex"/><category term="parallel-agents"/></entry><entry><title>Quoting Nicholas Carlini</title><link href="https://simonwillison.net/2025/Nov/20/nicholas-carlini/#atom-tag" rel="alternate"/><published>2025-11-20T01:01:44+00:00</published><updated>2025-11-20T01:01:44+00:00</updated><id>https://simonwillison.net/2025/Nov/20/nicholas-carlini/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://nicholas.carlini.com/writing/2025/are-llms-worth-it.html"&gt;&lt;p&gt;Previously, when malware developers wanted to go and monetize their exploits, they would do exactly one thing: encrypt every file on a person's computer and request a ransome to decrypt the files. In the future I think this will change.&lt;/p&gt;
&lt;p&gt;LLMs allow attackers to instead process every file on the victim's computer, and tailor a blackmail letter specifically towards that person. One person may be having an affair on their spouse. Another may have lied on their resume. A third may have cheated on an exam at school. It is unlikely that any one person has done any of these specific things, but it is very likely that there exists something that is blackmailable for every person. Malware + LLMs, given access to a person's computer, can find that and monetize it.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://nicholas.carlini.com/writing/2025/are-llms-worth-it.html"&gt;Nicholas Carlini&lt;/a&gt;, Are large language models worth it? Misuse: malware at scale&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-misuse"&gt;ai-misuse&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nicholas-carlini"/><category term="ai-ethics"/><category term="ai-misuse"/></entry><entry><title>New prompt injection papers: Agents Rule of Two and The Attacker Moves Second</title><link href="https://simonwillison.net/2025/Nov/2/new-prompt-injection-papers/#atom-tag" rel="alternate"/><published>2025-11-02T23:09:33+00:00</published><updated>2025-11-02T23:09:33+00:00</updated><id>https://simonwillison.net/2025/Nov/2/new-prompt-injection-papers/#atom-tag</id><summary type="html">
    &lt;p&gt;Two interesting new papers regarding LLM security and prompt injection came to my attention this weekend.&lt;/p&gt;
&lt;h4 id="agents-rule-of-two-a-practical-approach-to-ai-agent-security"&gt;Agents Rule of Two: A Practical Approach to AI Agent Security&lt;/h4&gt;
&lt;p&gt;The first is &lt;a href="https://ai.meta.com/blog/practical-ai-agent-security/"&gt;Agents Rule of Two: A Practical Approach to AI Agent Security&lt;/a&gt;, published on October 31st on the Meta AI blog. It doesn't list authors but it was &lt;a href="https://x.com/MickAyzenberg/status/1984355145917088235"&gt;shared on Twitter&lt;/a&gt; by Meta AI security researcher Mick Ayzenberg.&lt;/p&gt;
&lt;p&gt;It proposes a "Rule of Two" that's inspired by both my own &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;lethal trifecta&lt;/a&gt; concept and the Google Chrome team's &lt;a href="https://chromium.googlesource.com/chromium/src/+/main/docs/security/rule-of-2.md"&gt;Rule Of 2&lt;/a&gt; for writing code that works with untrustworthy inputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At a high level, the Agents Rule of Two states that until robustness research allows us to reliably detect and refuse prompt injection, agents &lt;strong&gt;must satisfy no more than two&lt;/strong&gt; of the following three properties within a session to avoid the highest impact consequences of prompt injection.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;[A]&lt;/strong&gt; An agent can process untrustworthy inputs&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;[B]&lt;/strong&gt; An agent can have access to sensitive systems or private data&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;[C]&lt;/strong&gt; An agent can change state or communicate externally&lt;/p&gt;
&lt;p&gt;It's still possible that all three properties are necessary to carry out a request. If an agent requires all three without starting a new session (i.e., with a fresh context window), then the agent should not be permitted to operate autonomously and at a minimum requires supervision --- via human-in-the-loop approval or another reliable means of validation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's accompanied by this handy diagram:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/agents-rule-of-two-updated.jpg" alt="Venn diagram titled &amp;quot;Choose Two&amp;quot; showing three overlapping circles labeled A, B, and C. Circle A (top): &amp;quot;Process untrustworthy inputs&amp;quot; with description &amp;quot;Externally authored data may contain prompt injection attacks that turn an agent malicious.&amp;quot; Circle B (bottom left): &amp;quot;Access to sensitive systems or private data&amp;quot; with description &amp;quot;This includes private user data, company secrets, production settings and configs, source code, and other sensitive data.&amp;quot; Circle C (bottom right): &amp;quot;Change state or communicate externally&amp;quot; with description &amp;quot;Overwrite or change state through write actions, or transmitting data to a threat actor through web requests or tool calls.&amp;quot; The two-way overlaps between circles are labeled &amp;quot;Lower risk&amp;quot; while the center where all three circles overlap is labeled &amp;quot;Danger&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I like this &lt;em&gt;a lot&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I've spent several years now trying to find clear ways to explain the risks of prompt injection attacks to developers who are building on top of LLMs. It's frustratingly difficult.&lt;/p&gt;
&lt;p&gt;I've had the most success with the lethal trifecta, which boils one particular class of prompt injection attack down to a simple-enough model: if your system has access to private data, exposure to untrusted content and a way to communicate externally then it's vulnerable to private data being stolen.&lt;/p&gt;
&lt;p&gt;The one problem with the lethal trifecta is that it only covers the risk of data exfiltration: there are plenty of other, even nastier risks that arise from prompt injection attacks against LLM-powered agents with access to tools which the lethal trifecta doesn't cover.&lt;/p&gt;
&lt;p&gt;The Agents Rule of Two neatly solves this, through the addition of "changing state" as a property to consider. This brings other forms of tool usage into the picture: anything that can change state triggered by untrustworthy inputs is something to be very cautious about.&lt;/p&gt;
&lt;p&gt;It's also refreshing to see another major research lab concluding that prompt injection remains an unsolved problem, and attempts to block or filter them have not proven reliable enough to depend on. The current solution is to design systems with this in mind, and the Rule of Two is a solid way to think about that.&lt;/p&gt;
&lt;p id="exception"&gt;&lt;strong&gt;Update&lt;/strong&gt;: On thinking about this further there's one aspect of the Rule of Two model that doesn't work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as "safe", but that's not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the "Rule of Two" framing!&lt;/p&gt;
&lt;p id="update-2"&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: Mick Ayzenberg responded to this note in &lt;a href="https://news.ycombinator.com/item?id=45794245#45802448"&gt;a comment on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Thanks for the feedback! One small bit of clarification, the framework would describe access to any sensitive system as part of the [B] circle, not only private systems or private data.&lt;/p&gt;
&lt;p&gt;The intention is that an agent that has removed [B] can write state and communicate freely, but not with any systems that matter (wrt critical security outcomes for its user). An example of an agent in this state would be one that can take actions in a tight sandbox or is isolated from production.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Meta team also &lt;a href="https://news.ycombinator.com/item?id=45794245#45802046"&gt;updated their post&lt;/a&gt; to replace "safe" with "lower risk" as the label on the intersections between the different circles. I've updated my screenshots of their diagrams in this post, &lt;a href="https://static.simonwillison.net/static/2025/agents-rule-of-two.jpg"&gt;here's the original&lt;/a&gt; for comparison.&lt;/p&gt;
&lt;p&gt;Which brings me to the second paper...&lt;/p&gt;
&lt;h4 id="the-attacker-moves-second-stronger-adaptive-attacks-bypass-defenses-against-llm-jailbreaks-and-prompt-injections"&gt;The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections&lt;/h4&gt;
&lt;p&gt;This paper is dated 10th October 2025 &lt;a href="https://arxiv.org/abs/2510.09023"&gt;on Arxiv&lt;/a&gt; and comes from a heavy-hitting team of 14 authors - Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr - including representatives from OpenAI, Anthropic, and Google DeepMind.&lt;/p&gt;
&lt;p&gt;The paper looks at 12 published defenses against prompt injection and jailbreaking and subjects them to a range of "adaptive attacks" - attacks that are allowed to expend considerable effort iterating multiple times to try and find a way through.&lt;/p&gt;
&lt;p&gt;The defenses did not fare well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By systematically tuning and scaling general optimization techniques—gradient descent, reinforcement learning, random search, and human-guided exploration—we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Notably the "Human red-teaming setting" scored 100%, defeating all defenses. That red-team consisted of 500 participants in an online competition they ran with a $20,000 prize fund.&lt;/p&gt;
&lt;p&gt;The key point of the paper is that static example attacks - single string prompts designed to bypass systems - are an almost useless way to evaluate these defenses. Adaptive attacks are far more powerful, as shown by this chart:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/attack-success-rate.jpg" alt="Bar chart showing Attack Success Rate (%) for various security systems across four categories: Prompting, Training, Filtering Model, and Secret Knowledge. The chart compares three attack types shown in the legend: Static / weak attack (green hatched bars), Automated attack (ours) (orange bars), and Human red-teaming (ours) (purple dotted bars). Systems and their success rates are: Spotlighting (28% static, 99% automated), Prompt Sandwich (21% static, 95% automated), RPO (0% static, 99% automated), Circuit Breaker (8% static, 100% automated), StruQ (62% static, 100% automated), SeqAlign (5% static, 96% automated), ProtectAI (15% static, 90% automated), PromptGuard (26% static, 94% automated), PIGuard (0% static, 71% automated), Model Armor (0% static, 90% automated), Data Sentinel (0% static, 80% automated), MELON (0% static, 89% automated), and Human red-teaming setting (0% static, 100% human red-teaming)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The three automated adaptive attack techniques used by the paper are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradient-based methods&lt;/strong&gt; - these were the least effective, using the technique described in the legendary &lt;a href="https://arxiv.org/abs/2307.15043"&gt;Universal and Transferable Adversarial Attacks on Aligned Language Models&lt;/a&gt; paper &lt;a href="https://simonwillison.net/2023/Jul/27/universal-and-transferable-attacks-on-aligned-language-models/"&gt;from 2023&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reinforcement learning methods&lt;/strong&gt; - particularly effective against black-box models: "we allowed the attacker model to interact directly with the defended system and observe its outputs", using 32 sessions of 5 rounds each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search-based methods&lt;/strong&gt; - generate candidates with an LLM, then evaluate and further modify them using LLM-as-judge and other classifiers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The paper concludes somewhat optimistically:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] Adaptive evaluations are therefore more challenging to perform,
making it all the more important that they are performed. We again urge defense authors to release simple, easy-to-prompt defenses that are amenable to human analysis. [...] Finally, we hope that our analysis here will increase the standard for defense evaluations, and in so doing, increase the likelihood that reliable jailbreak and prompt injection defenses will be developed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given how totally the defenses were defeated, I do not share their optimism that reliable defenses will be developed any time soon.&lt;/p&gt;
&lt;p&gt;As a review of how far we still have to go this paper packs a powerful punch. I think it makes a strong case for Meta's Agents Rule of Two as the best practical advice for building secure LLM-powered agent systems today in the absence of prompt injection defenses we can rely on.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paper-review"&gt;paper-review&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lethal-trifecta"&gt;lethal-trifecta&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="definitions"/><category term="security"/><category term="openai"/><category term="prompt-injection"/><category term="anthropic"/><category term="nicholas-carlini"/><category term="paper-review"/><category term="lethal-trifecta"/></entry><entry><title>My Thoughts on the Future of "AI"</title><link href="https://simonwillison.net/2025/Mar/19/my-thoughts-on-the-future-of-ai/#atom-tag" rel="alternate"/><published>2025-03-19T04:55:45+00:00</published><updated>2025-03-19T04:55:45+00:00</updated><id>https://simonwillison.net/2025/Mar/19/my-thoughts-on-the-future-of-ai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nicholas.carlini.com/writing/2025/thoughts-on-future-ai.html"&gt;My Thoughts on the Future of &amp;quot;AI&amp;quot;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Nicholas Carlini, previously deeply skeptical about the utility of LLMs, discusses at length his thoughts on where the technology might go.&lt;/p&gt;
&lt;p&gt;He presents compelling, detailed arguments for both ends of the spectrum - his key message is that it's best to maintain very wide error bars for what might happen next:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I wouldn't be surprised if, in three to five years, language models are capable of performing most (all?) cognitive economically-useful tasks beyond the level of human experts. And I also wouldn't be surprised if, in five years, the best models we have are better than the ones we have today, but only in “normal” ways where costs continue to decrease considerably and capabilities continue to get better but there's no fundamental paradigm shift that upends the world order. To deny the &lt;em&gt;potential&lt;/em&gt; for either of these possibilities seems to me to be a mistake.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If LLMs do hit a wall, it's not at all clear what that wall might be:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I still believe there is something fundamental that will get in the way of our ability to build LLMs that grow exponentially in capability. But I will freely admit to you now that I have no earthly idea what that limitation will be. I have no evidence that this line exists, other than to make some form of vague argument that when you try and scale something across many orders of magnitude, you'll probably run into problems you didn't see coming.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's lots of great stuff in here. I particularly liked this explanation of how you get R1:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You take DeepSeek v3, and ask it to solve a bunch of hard problems, and when it gets the answers right, you train it to do more of that and less of whatever it did when it got the answers wrong. The idea here is actually really simple, and it works surprisingly well.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nicholas-carlini"/><category term="deepseek"/><category term="ai-in-china"/></entry><entry><title>Career Update: Google DeepMind -&gt; Anthropic</title><link href="https://simonwillison.net/2025/Mar/5/google-deepmind-anthropic/#atom-tag" rel="alternate"/><published>2025-03-05T22:24:02+00:00</published><updated>2025-03-05T22:24:02+00:00</updated><id>https://simonwillison.net/2025/Mar/5/google-deepmind-anthropic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nicholas.carlini.com/writing/2025/career-update.html"&gt;Career Update: Google DeepMind -&amp;gt; Anthropic&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Nicholas Carlini (&lt;a href="https://simonwillison.net/tags/nicholas-carlini/"&gt;previously&lt;/a&gt;) on joining Anthropic, driven partly by his frustration at friction he encountered publishing his research at Google DeepMind after their merge with Google Brain. His area of expertise is adversarial machine learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The recent advances in machine learning and language modeling are going to be transformative &lt;span style="font-size: 0.75em; line-height: 0; position: relative; vertical-align: baseline; top: -0.5em;"&gt;[&lt;a href="https://nicholas.carlini.com/writing/2025/career-update.html#footnote4"&gt;d&lt;/a&gt;]&lt;/span&gt; But in order to realize this potential future in a way that doesn't put everyone's safety and security at risk, we're going to need to make a &lt;em&gt;lot&lt;/em&gt; of progress---and soon. We need to make so much progress that no one organization will be able to figure everything out by themselves; we need to work together, we need to talk about what we're doing, and we need to start doing this now.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="machine-learning"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="nicholas-carlini"/></entry><entry><title>yet-another-applied-llm-benchmark</title><link href="https://simonwillison.net/2024/Nov/6/yet-another-applied-llm-benchmark/#atom-tag" rel="alternate"/><published>2024-11-06T20:00:23+00:00</published><updated>2024-11-06T20:00:23+00:00</updated><id>https://simonwillison.net/2024/Nov/6/yet-another-applied-llm-benchmark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/carlini/yet-another-applied-llm-benchmark"&gt;yet-another-applied-llm-benchmark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Nicholas Carlini introduced this personal LLM benchmark suite &lt;a href="https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html"&gt;back in February&lt;/a&gt; as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks &lt;a href="https://nicholas.carlini.com/writing/2024/how-i-use-ai.html"&gt;he uses them for&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are two defining features of this benchmark that make it interesting. Most importantly, I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities. This DSL allows for specifying both how the question should be asked and also how the answer should be evaluated. [...]  And then, directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The DSL he's using is &lt;em&gt;fascinating&lt;/em&gt;. Here's an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;"Write a C program that draws an american flag to stdout." &amp;gt;&amp;gt; LLMRun() &amp;gt;&amp;gt; CRun() &amp;gt;&amp;gt; \
    VisionLLMRun("What flag is shown in this image?") &amp;gt;&amp;gt; \
    (SubstringEvaluator("United States") | SubstringEvaluator("USA")))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This triggers an LLM to execute the prompt asking for a C program that renders an American Flag, runs that through a C compiler and interpreter (executed in a Docker container), then passes the output of that to a vision model to guess the flag and checks that it returns a string containing "United States" or "USA".&lt;/p&gt;
&lt;p&gt;The DSL itself is implemented &lt;a href="https://github.com/carlini/yet-another-applied-llm-benchmark/blob/main/evaluator.py"&gt;entirely in Python&lt;/a&gt;, using the &lt;code&gt;__rshift__&lt;/code&gt; magic method for &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; and &lt;code&gt;__rrshift__&lt;/code&gt; to enable strings to be piped into a custom object using &lt;code&gt;"command to run" &amp;gt;&amp;gt; LLMRunNode&lt;/code&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/dsl"&gt;dsl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;&lt;/p&gt;



</summary><category term="dsl"/><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="nicholas-carlini"/></entry><entry><title>Quoting Nicholas Carlini</title><link href="https://simonwillison.net/2024/Sep/18/nicholas-carlini/#atom-tag" rel="alternate"/><published>2024-09-18T18:52:56+00:00</published><updated>2024-09-18T18:52:56+00:00</updated><id>https://simonwillison.net/2024/Sep/18/nicholas-carlini/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.youtube.com/watch?v=umfeF0Dx-r4"&gt;&lt;p&gt;The problem that you face is that it's relatively easy to take a model and make it look like it's aligned. You ask GPT-4, “how do I end all of humans?” And the model says, “I can't possibly help you with that”. But there are a million and one ways to take the exact same question - pick your favorite - and you can make the model still answer the question even though initially it would have refused. And the question this reminds me a lot of coming from adversarial machine learning. We have a very simple objective: Classify the image correctly according to the original label. And yet, despite the fact that it was essentially trivial to find all of the bugs in principle, the community had a very hard time coming up with actually effective defenses. We wrote like over 9,000 papers in ten years, and have made very very very limited progress on this one small problem. You all have a harder problem and maybe less time.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.youtube.com/watch?v=umfeF0Dx-r4"&gt;Nicholas Carlini&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jailbreaking"&gt;jailbreaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;&lt;/p&gt;



</summary><category term="jailbreaking"/><category term="machine-learning"/><category term="security"/><category term="ai"/><category term="nicholas-carlini"/></entry><entry><title>How I Use "AI" by Nicholas Carlini</title><link href="https://simonwillison.net/2024/Aug/4/how-i-use-ai-by-nicholas-carlini/#atom-tag" rel="alternate"/><published>2024-08-04T16:55:33+00:00</published><updated>2024-08-04T16:55:33+00:00</updated><id>https://simonwillison.net/2024/Aug/4/how-i-use-ai-by-nicholas-carlini/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nicholas.carlini.com/writing/2024/how-i-use-ai.html"&gt;How I Use &amp;quot;AI&amp;quot; by Nicholas Carlini&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Nicholas is an author on &lt;a href="https://arxiv.org/abs/2307.15043"&gt;Universal and Transferable Adversarial Attacks on Aligned Language Models&lt;/a&gt;, one of my favorite LLM security papers from last year. He understands the flaws in this class of technology at a deeper level than most people.&lt;/p&gt;
&lt;p&gt;Despite that, this article describes several of the many ways he still finds utility in these models in his own work:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But the reason I think that the recent advances we've made aren't just hype is that, over the past year, I have spent at least a few hours every week interacting with various large language models, and have been consistently impressed by their ability to solve increasingly difficult tasks I give them. And as a result of this, I would say I'm at least 50% faster at writing code for both my research projects and my side projects as a result of these models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The way Nicholas is using these models closely matches my own experience - things like “Automating nearly every monotonous task or one-off script” and “Teaching me how to use various frameworks having never previously used them”.&lt;/p&gt;
&lt;p&gt;I feel that this piece inadvertently captures the frustration felt by those of us who get value out of these tools on a daily basis and still constantly encounter people who are adamant that they offer no real value. Saying “this stuff is genuine useful” remains a surprisingly controversial statement, almost two years after the ChatGPT launch opened up LLMs to a giant audience.&lt;/p&gt;
&lt;p&gt;I also enjoyed this footnote explaining why he put “AI” in scare quotes in the title:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I hate this word. It's not AI. But I want people who use this word, and also people who hate this word, to find this post. And so I guess I'm stuck with it for marketing, SEO, and clickbait.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41150317"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nicholas-carlini"/></entry></feed>