| Skills in OpenAI API |
https://developers.openai.com/cookbook/examples/skills_in_api |
OpenAI's adoption of Skills continues to gain ground: you can now use Skills directly in the OpenAI API via their [shell tool](https://developers.openai.com/api/docs/guides/tools-shell/). You can zip up skills and upload them ahead of time, but I think an even neater interface is the ability to send them inline with the JSON request as base64-encoded zip data, as seen [in this script](https://github.com/simonw/research/blob/main/openai-api-skills/openai_inline_skills.py):
<pre><span class="pl-s1">r</span> <span class="pl-c1">=</span> <span class="pl-en">OpenAI</span>().<span class="pl-c1">responses</span>.<span class="pl-c1">create</span>(
<span class="pl-s1">model</span><span class="pl-c1">=</span><span class="pl-s">"gpt-5.2"</span>,
<span class="pl-s1">tools</span><span class="pl-c1">=</span>[
{
<span class="pl-s">"type"</span>: <span class="pl-s">"shell"</span>,
<span class="pl-s">"environment"</span>: {
<span class="pl-s">"type"</span>: <span class="pl-s">"container_auto"</span>,
<span class="pl-s">"skills"</span>: [
{
<span class="pl-s">"type"</span>: <span class="pl-s">"inline"</span>,
<span class="pl-s">"name"</span>: <span class="pl-s">"wc"</span>,
<span class="pl-s">"description"</span>: <span class="pl-s">"Count words in a file."</span>,
<span class="pl-s">"source"</span>: {
<span class="pl-s">"type"</span>: <span class="pl-s">"base64"</span>,
<span class="pl-s">"media_type"</span>: <span class="pl-s">"application/zip"</span>,
<span class="pl-s">"data"</span>: <span class="pl-s1">b64_encoded_zip_file</span>,
},
}
],
},
}
],
<span class="pl-s1">input</span><span class="pl-c1">=</span><span class="pl-s">"Use the wc skill to count words in its own SKILL.md file."</span>,
)
<span class="pl-en">print</span>(<span class="pl-s1">r</span>.<span class="pl-c1">output_text</span>)</pre>
I built that example script after first having Claude Code for web use [Showboat](https://simonwillison.net/2026/Feb/10/showboat-and-rodney/) to explore the API for me and create [this report](https://github.com/simonw/research/blob/main/openai-api-skills/README.md). My opening prompt for the research project was:
> `Run uvx showboat --help - you will use this tool later`
>
> `Fetch https://developers.openai.com/cookbook/examples/skills_in_api.md to /tmp with curl, then read it`
>
> `Use the OpenAI API key you have in your environment variables`
>
> `Use showboat to build up a detailed demo of this, replaying the examples from the documents and then trying some experiments of your own` |
2026-02-11 19:19:22+00:00 |
| GLM-5: From Vibe Coding to Agentic Engineering |
https://z.ai/blog/glm-5 |
This is a *huge* new MIT-licensed model: 754B parameters and [1.51TB on Hugging Face](https://huggingface.co/zai-org/GLM-5) - twice the size of [GLM-4.7](https://huggingface.co/zai-org/GLM-4.7), which was 368B parameters and 717GB (4.5 and 4.6 were around that size too).
It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen "Agentic Engineering" show up in a few other places recently, most notably [from Andrej Karpathy](https://twitter.com/karpathy/status/2019137879310836075) and [Addy Osmani](https://addyosmani.com/blog/agentic-engineering/).
I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via [OpenRouter](https://openrouter.ai/) and got back [a very good pelican on a disappointing bicycle frame](https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d):
 |
2026-02-11 18:56:14+00:00 |
| cysqlite - a new sqlite driver |
https://charlesleifer.com/blog/cysqlite---a-new-sqlite-driver/ |
Charles Leifer has been maintaining [pysqlite3](https://github.com/coleifer/pysqlite3) - a fork of the Python standard library's `sqlite3` module that makes it much easier to run upgraded SQLite versions - since 2018.
He's been working on a ground-up [Cython](https://cython.org/) rewrite called [cysqlite](https://github.com/coleifer/cysqlite) for almost as long, but it's finally at a stage where it's ready for people to try out.
The biggest change from the `sqlite3` module involves transactions. Charles explains his discomfort with the `sqlite3` implementation at length - that library provides two different variants, neither of which exactly matches the autocommit mechanism in SQLite itself.
I'm particularly excited about the support for [custom virtual tables](https://cysqlite.readthedocs.io/en/latest/api.html#tablefunction), a feature I'd love to see in `sqlite3` itself.
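I haven't dug into that API yet - judging from the documentation it follows the same shape as Charles's earlier [sqlite-vtfunc](https://github.com/coleifer/sqlite-vtfunc) project, so a table-valued function would look roughly like this untested sketch (the class and method names here are my assumptions from those docs):
<pre>import cysqlite
# Assumption: TableFunction is exposed at the top level of the package
from cysqlite import TableFunction

# A table-valued function usable as: SELECT value FROM series(1, 10, 2)
class Series(TableFunction):
    name = "series"
    params = ["start", "stop", "step"]
    columns = ["value"]

    def initialize(self, start=0, stop=None, step=1):
        self.current = start
        self.stop = stop
        self.step = step

    def iterate(self, idx):
        if self.stop is not None and self.current > self.stop:
            raise StopIteration
        value = self.current
        self.current += self.step
        return (value,)

conn = cysqlite.connect(":memory:")
Series.register(conn)
print(conn.execute("SELECT value FROM series(1, 10, 2)").fetchall())</pre>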
`cysqlite` provides a Python extension compiled from C, which means it normally wouldn't be available in Pyodide. I [set Claude Code on it](https://github.com/simonw/research/tree/main/cysqlite-wasm-wheel) and it built me [cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.whl](https://github.com/simonw/research/blob/main/cysqlite-wasm-wheel/cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.whl), a 688KB wheel file with a WASM build of the library that can be loaded into Pyodide like this:
<pre><span class="pl-k">import</span> <span class="pl-s1">micropip</span>
<span class="pl-k">await</span> <span class="pl-s1">micropip</span>.<span class="pl-c1">install</span>(
<span class="pl-s">"https://simonw.github.io/research/cysqlite-wasm-wheel/cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.whl"</span>
)
<span class="pl-k">import</span> <span class="pl-s1">cysqlite</span>
<span class="pl-en">print</span>(<span class="pl-s1">cysqlite</span>.<span class="pl-c1">connect</span>(<span class="pl-s">":memory:"</span>).<span class="pl-c1">execute</span>(
<span class="pl-s">"select sqlite_version()"</span>
).<span class="pl-c1">fetchone</span>())</pre>
(I also learned that wheels like this have to be built for the emscripten version used by that edition of Pyodide - my experimental wheel loads in Pyodide 0.25.1 but fails in 0.27.5 with a `Wheel was built with Emscripten v3.1.46 but Pyodide was built with Emscripten v3.1.58` error.)
You can try my wheel in [this new Pyodide REPL](https://7ebbff98.tools-b1q.pages.dev/pyodide-repl) I had Claude build as a mobile-friendly alternative to Pyodide's [own hosted console](https://pyodide.org/en/stable/console.html).
I also had Claude build [this demo page](https://simonw.github.io/research/cysqlite-wasm-wheel/demo.html) that executes the original test suite in the browser and displays the results:
 |
2026-02-11 17:34:40+00:00 |
| Structured Context Engineering for File-Native Agentic Systems |
https://arxiv.org/abs/2602.05447 |
New paper by Damon McMillan exploring challenging LLM context tasks involving large SQL schemas (up to 10,000 tables) across different models and file formats:
> Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables.
Unsurprisingly, the biggest factor was the models themselves - with frontier models (Opus 4.5, GPT-5.2, Gemini 2.5 Pro) beating the leading open source models (DeepSeek V3.2, Kimi K2, Llama 4).
Those frontier models benefited from filesystem-based context retrieval, but the open source models had much less convincing results with it, which reinforces my feeling that filesystem-driven coding agent loops aren't handled as well by open weight models just yet. The [Terminal Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0) leaderboard is still dominated by Anthropic, OpenAI and Gemini.
The "grep tax" result against [TOON](https://github.com/toon-format/toon) was an interesting detail. TOON is meant to represent structured data in as few tokens as possible, but it turns out the model's unfamiliarity with that format led to them spending significantly more tokens over multiple iterations trying to figure it out:
 |
2026-02-09 23:56:51+00:00 |
| AI Doesn’t Reduce Work—It Intensifies It |
https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies-it |
Aruna Ranganathan and Xingqi Maggie Ye from Berkeley Haas School of Business report initial findings in the HBR from their April to December 2025 study of 200 employees at a "U.S.-based technology company".
This captures an effect I've been observing in my own work with LLMs: the productivity boost these things can provide is *exhausting*.
> AI introduced a new rhythm in which workers managed several active threads at once: manually writing code while AI generated an alternative version, running multiple agents in parallel, or reviving long-deferred tasks because AI could “handle them” in the background. They did this, in part, because they felt they had a “partner” that could help them move through their workload.
>
> While this sense of having a “partner” enabled a feeling of momentum, the reality was a continual switching of attention, frequent checking of AI outputs, and a growing number of open tasks. This created cognitive load and a sense of always juggling, even as the work felt productive.
I'm frequently finding myself with work on two or three projects running in parallel. I can get *so much done*, but after just an hour or two my mental energy for the day feels almost entirely depleted.
I've had conversations with people recently who are losing sleep because they find building yet another feature with "just one more prompt" irresistible.
The HBR piece calls for organizations to build an "AI practice" that structures how AI is used to help avoid burnout and counter effects that "make it harder for organizations to distinguish genuine productivity gains from unsustainable intensity".
I think we've just disrupted decades of existing intuition about sustainable working practices. It's going to take a while and some discipline to find a good new balance. |
2026-02-09 16:43:07+00:00 |
| Vouch |
https://github.com/mitchellh/vouch |
Mitchell Hashimoto's new system to help address the deluge of worthless AI-generated PRs faced by open source projects now that the friction involved in contributing has dropped so low.
[He says](https://twitter.com/mitchellh/status/2020252149117313349):
> The idea is simple: Unvouched users can't contribute to your projects. Very bad users can be explicitly "denounced", effectively blocked. Users are vouched or denounced by contributors via GitHub issue or discussion comments or via the CLI.
>
> Integration into GitHub is as simple as adopting the published GitHub actions. Done. Additionally, the system itself is generic to forges and not tied to GitHub in any way.
>
> Who and how someone is vouched or denounced is up to the project. I'm not the value police for the world. Decide for yourself what works for your project and your community. |
2026-02-07 23:57:57+00:00 |
| Claude: Speed up responses with fast mode |
https://code.claude.com/docs/en/fast-mode |
New "research preview" from Anthropic today: you can now access a faster version of their frontier model Claude Opus 4.6 by typing `/fast` in Claude Code... but at a cost that's 6x the normal price.
Opus is usually $5/million input and $25/million output. The new fast mode is $30/million input and $150/million output!
There's a 50% discount until the end of February 16th, so only a 3x multiple (!) before then.
How much faster is it? The linked documentation doesn't say, but [on Twitter](https://x.com/claudeai/status/2020207322124132504) Claude say:
> Our teams have been building with a 2.5x-faster version of Claude Opus 4.6.
>
> We’re now making it available as an early experiment via Claude Code and our API.
Claude Opus 4.5 had a context limit of 200,000 tokens. 4.6 has an option to increase that to 1,000,000 at 2x the input price ($10/m) and 1.5x the output price ($37.50/m) once your input exceeds 200,000 tokens. These multiples hold for fast mode too, so after Feb 16th you'll be able to pay a hefty $60/m input and $225/m output for Anthropic's fastest and best model.
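Put another way (just re-deriving the numbers above):
<pre># Claude Opus 4.6 pricing, $ per million tokens
base_input, base_output = 5, 25

# Fast mode is a 6x multiple...
fast_input, fast_output = base_input * 6, base_output * 6  # 30, 150
# ...with a 50% discount until the end of February 16th
print(fast_input * 0.5, fast_output * 0.5)                  # 15.0 75.0

# Long context (input over 200,000 tokens): 2x input, 1.5x output
print(fast_input * 2, fast_output * 1.5)                    # 60 225.0</pre> |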
2026-02-07 23:10:33+00:00 |
| pydantic/monty |
https://github.com/pydantic/monty |
Everyone's [building sandboxes](https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-we-re-finally-going-to-solve-sandboxing) for running untrusted code right now. Here's Pydantic's latest attempt at the problem - they've implemented a custom Python-like language (a subset of Python) in Rust and made it available as both a Rust library and a Python package.
> Monty avoids the cost, latency, complexity and general faff of using full container based sandbox for running LLM generated code.
>
> Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.
>
> What Monty **can** do:
>
> - Run a reasonable subset of Python code - enough for your agent to express what it wants to do
> - Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control
> - Call functions on the host - only functions you give it access to [...]
A quick way to try it out is via [uv](https://github.com/astral-sh/uv):
uv run --with pydantic-monty python -m asyncio
Then try this in the Python interactive prompt (the `-m asyncio` enables top-level await):
<pre><span class="pl-k">import</span> <span class="pl-s1">pydantic_monty</span>
<span class="pl-s1">code</span> <span class="pl-c1">=</span> <span class="pl-s1">pydantic_monty</span>.<span class="pl-c1">Monty</span>(<span class="pl-s">'print("hello " + str(4 * 5))'</span>)
<span class="pl-k">await</span> <span class="pl-s1">pydantic_monty</span>.<span class="pl-c1">run_monty_async</span>(<span class="pl-s1">code</span>)</pre>
It's a *very* small subset of Python - it doesn't even support class declarations yet! But... that's not actually a problem. The neat thing about providing tools like this for LLMs is that they're really good at iterating against error messages - an agent can run some Python code, get an error message telling it that classes aren't supported and then try again.
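As an illustration, here's a hypothetical sketch of that loop (I haven't checked exactly where Monty raises the error or what the message says):
<pre>import pydantic_monty

# First attempt: the agent tries a class, which Monty doesn't support yet
try:
    pydantic_monty.Monty("class Point:\n    x = 1").run()
except Exception as exc:
    print(exc)  # this error text is what the agent iterates against

# Second attempt: the same idea expressed without a class
pydantic_monty.Monty('point = {"x": 1}\nprint(point["x"])').run()</pre>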
I wanted to try this in a browser - so I fired up [a code research task](https://simonwillison.net/2025/Nov/6/async-code-research/) in Claude Code for web and kicked it off with the following:
> Clone https://github.com/pydantic/monty to /tmp and figure out how to compile it into a python WebAssembly wheel that can then be loaded in Pyodide. The wheel file itself should be checked into the repo along with build scripts and passing pytest Playwright test scripts that load Pyodide from a CDN and the wheel from a “python -m http.server” localhost and demonstrate it working
Then a little later:
> I want an additional WASM file that works independently of Pyodide, which is also usable in a web browser - build that too along with playwright tests that show it working. Also build two HTML files - one called demo.html and one called pyodide-demo.html - these should work similar to https://tools.simonwillison.net/micropython (download that code with curl to inspect it) - one should load the WASM build, the other should load Pyodide and have it use the WASM wheel. These will be served by GitHub Pages so they can load the
Here's [the transcript](https://gisthost.github.io/?22d88e6367d7e002c4fb383c213c2df2/page-001.html), and the [final research report](https://github.com/simonw/research/tree/main/monty-wasm-pyodide).
The end result is I now have the Monty Rust code compiled to WebAssembly in two different shapes - as a `.wasm` bundle you can load and call from JavaScript, and as a `monty-wasm-pyodide/pydantic_monty-0.0.3-cp313-cp313-emscripten_4_0_9_wasm32.whl` wheel file which can be loaded into [Pyodide](https://pyodide.org/) and then called from Python in Pyodide in WebAssembly in a browser.
![Screenshot of a web app titled "Monty via Pyodide" with description "Run Monty (a sandboxed Python interpreter by Pydantic) inside Pyodide (CPython compiled to WebAssembly). This loads the pydantic-monty wheel and uses its full Python API. Code is saved in the URL for sharing." A green banner reads "Code executed successfully!" Below are example buttons labeled "Basic", "Inputs", "Reuse", "Error Handling", "Fibonacci", and "Classes". A code editor labeled "Python Code (runs inside Monty sandbox via Pyodide):" contains: "import pydantic_monty\n\n# Create interpreter with input variables\nm = pydantic_monty.Monty('x + y', inputs=['x', 'y'])\n\n# Run with different inputs\nresult1 = m.run(inputs={\"x\": 10, \"y\": 20})\nprint(f\"10 + 20 = {result1}\")\n\nresult2 = m.run(inputs={\"x\": 100, \"y\": 200})" with "Run Code" and "Clear" buttons. The Output section shows "10 + 20 = 30" and "100 + 200 = 300" with a "Copy" button. Footer reads "Executed in 4.0ms".](https://static.simonwillison.net/static/2026/monty-pyodide.jpg) |
2026-02-06 21:44:38+00:00 |
| An Update on Heroku |
https://www.heroku.com/blog/an-update-on-heroku/ |
An ominous headline to see on the official Heroku blog and yes, it's bad news.
> Today, Heroku is transitioning to a sustaining engineering model focused on stability, security, reliability, and support. Heroku remains an actively supported, production-ready platform, with an emphasis on maintaining quality and operational excellence rather than introducing new features. We know changes like this can raise questions, and we want to be clear about what this means for customers.
Based on context I'm guessing a "sustaining engineering model" (this definitely isn't a widely used industry term) means that they'll keep the lights on and that's it.
This is a very frustrating piece of corporate communication. "We want to be clear about what this means for customers" - then proceeds to *not be clear* about what this means for customers.
Why are they doing this? Here's their explanation:
> We’re focusing our product and engineering investments on areas where we can deliver the greatest long-term customer value, including helping organizations build and deploy enterprise-grade AI in a secure and trusted way.
My blog is the only project I have left running on Heroku. I guess I'd better migrate it away (probably to Fly) before Salesforce lose interest completely. |
2026-02-06 18:44:21+00:00 |
| Mitchell Hashimoto: My AI Adoption Journey |
https://mitchellh.com/writing/my-ai-adoption-journey |
Some really good and unconventional tips in here for getting to a place with coding agents where they demonstrably improve your workflow and productivity. I particularly liked:
- [Reproduce your own work](https://mitchellh.com/writing/my-ai-adoption-journey#step-2-reproduce-your-own-work) - when learning to use coding agents Mitchell went through a period of doing the work manually, then recreating the same solution using agents as an exercise:
> I literally did the work twice. I'd do the work manually, and then I'd fight an agent to produce identical results in terms of quality and function (without it being able to see my manual solution, of course).
- [End-of-day agents](https://mitchellh.com/writing/my-ai-adoption-journey#step-3-end-of-day-agents) - letting agents step in when your energy runs out:
> To try to find some efficiency, I next started up a new pattern: **block out the last 30 minutes of every day to kick off one or more agents.** My hypothesis was that *perhaps* I could gain some efficiency if the agent can make some *positive progress* in the times I can't work anyways.
- [Outsource the Slam Dunks](https://mitchellh.com/writing/my-ai-adoption-journey#step-4-outsource-the-slam-dunks) - once you know an agent can likely handle a task, have it do that task while you work on something more interesting yourself. |
2026-02-05 23:39:07+00:00 |