Simon Willison’s Weblog


Sunday, 2nd February 2025

Hacker News conversation on feature flags. I posted the following comment in a thread on Hacker News about feature flags, in response to the article “It’s OK to hardcode feature flags”. This kicked off a very high-quality conversation on build-vs-buy and running feature flags at scale, involving a bunch of very experienced and knowledgeable people. I recommend reading the comments.

The single biggest value add of feature flags is that they de-risk deployment. They make it less frightening and difficult to turn features on and off, which means you'll do it more often. This means you can build more confidently and learn faster from what you build. That's worth a lot.

I think there's a reasonable middle ground between having feature flags in a JSON file that you have to redeploy to change and using an (often expensive) feature-flags-as-a-service platform: roll your own simple system.

A relational database lookup against primary keys in a table with a dozen records is effectively free. Heck, load the entire collection at the start of each request - through a short lived cache if your profiling says that would help.

Once you start getting more complicated (flags enabled for specific users etc) you should consider build-vs-buy more seriously, but for the most basic version you really can have no-deploy-changes at minimal cost with minimal effort.

There are probably good open source libraries you can use here too, though I haven't gone looking for any in the last five years.
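
As a rough illustration of the "roll your own" idea in the comment above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `feature_flags`, the `FeatureFlags` class and the five-second TTL are all assumptions made for this example, not something from the original thread:

```python
import sqlite3
import time

CACHE_TTL_SECONDS = 5.0  # assumed value; adjust if profiling says it matters


class FeatureFlags:
    """Loads every flag from one small table, behind a short-lived cache."""

    def __init__(self, conn, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self.conn = conn
        self.ttl = ttl
        self.clock = clock
        self._flags = {}
        self._loaded_at = None

    def _load(self):
        # A dozen rows keyed by primary key: cheap enough to load all at once.
        rows = self.conn.execute("SELECT name, enabled FROM feature_flags")
        self._flags = {name: bool(enabled) for name, enabled in rows}
        self._loaded_at = self.clock()

    def is_enabled(self, name, default=False):
        expired = (
            self._loaded_at is None
            or self.clock() - self._loaded_at >= self.ttl
        )
        if expired:
            self._load()
        return self._flags.get(name, default)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_flags (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL)"
)
conn.execute("INSERT INTO feature_flags VALUES ('new_checkout', 1), ('dark_mode', 0)")

flags = FeatureFlags(conn, ttl=0)  # ttl=0 means reload on every check
print(flags.is_enabled("new_checkout"))  # True
conn.execute("UPDATE feature_flags SET enabled = 0 WHERE name = 'new_checkout'")
print(flags.is_enabled("new_checkout"))  # False, with no redeploy
```

In a web application you would probably create one `FeatureFlags` object per process and call `is_enabled()` wherever a feature is gated; per-user targeting is deliberately left out, since that is the point where build-vs-buy deserves a harder look.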

# 1:18 am / hacker-news, feature-flags

A professional workflow for translation using LLMs. Tom Gally is a professional translator who has been exploring the use of LLMs since the release of GPT-4. In this Hacker News comment he shares a detailed workflow for how he uses them to assist in that process.

Tom starts with the source text and custom instructions, including context for how the translation will be used. Here's an imaginary example prompt, which starts:

The text below in Japanese is a product launch presentation for Sony's new gaming console, to be delivered by the CEO at Tokyo Game Show 2025. Please translate it into English. Your translation will be used in the official press kit and live interpretation feed. When translating this presentation, please follow these guidelines to create an accurate and engaging English version that preserves both the meaning and energy of the original: [...]

It then lists some tone, style and content guidelines custom to that text.

Tom runs that prompt through several different LLMs and starts by picking sentences and paragraphs from those that form a good basis for the translation.

As he works on the full translation he uses Claude to help brainstorm alternatives for tricky sentences:

When I am unable to think of a good English version for a particular sentence, I give the Japanese and English versions of the paragraph it is contained in to an LLM (usually, these days, Claude) and ask for ten suggestions for translations of the problematic sentence. Usually one or two of the suggestions work fine; if not, I ask for ten more. (Using an LLM as a sentence-level thesaurus on steroids is particularly wonderful.)

He uses another LLM and prompt to check his translation against the original and provide further suggestions, which he occasionally acts on. Then as a final step he runs the finished document through a text-to-speech engine to try and catch any "minor awkwardnesses" in the result.
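
The "ten suggestions" step could be scripted along these lines. This is only a sketch: the function name and the prompt wording are invented for illustration and are not Tom's actual prompt.

```python
def build_suggestion_prompt(source_paragraph, draft_paragraph, problem_sentence, count=10):
    """Build a prompt asking for alternative translations of one sentence.

    The surrounding paragraph is included in both languages so the model
    has context for tone and terminology. The wording is hypothetical.
    """
    return "\n\n".join(
        [
            "Below is a Japanese paragraph and my draft English translation of it.",
            f"Japanese paragraph:\n{source_paragraph}",
            f"Draft English translation:\n{draft_paragraph}",
            f"I am not happy with this sentence in the draft:\n{problem_sentence}",
            f"Please suggest {count} alternative English translations of that "
            "sentence that fit the rest of the paragraph.",
        ]
    )


prompt = build_suggestion_prompt(
    source_paragraph="(Japanese paragraph goes here)",
    draft_paragraph="(draft English paragraph goes here)",
    problem_sentence="(the sentence that is not working yet)",
)
print(prompt)
```

If none of the suggestions work, the same prompt can simply be sent again to get ten more.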

I love this as an example of an expert using LLMs as tools to help further elevate their work. I'd love to read more examples like this one from experts in other fields.

# 4:23 am / translation, generative-ai, hacker-news, ai, llms

llm-anthropic. I've renamed my llm-claude-3 plugin to llm-anthropic, on the basis that Claude 4 will probably happen at some point so this is a better name for the plugin.

If you're a previous user of llm-claude-3 you can upgrade to the new plugin like this:

llm install -U llm-claude-3

This should remove the old plugin and install the new one, because the latest llm-claude-3 depends on llm-anthropic. Just installing llm-anthropic may leave you with both plugins installed at once.

There is one extra manual step you'll need to take during this upgrade: creating a new anthropic stored key with the same API token you previously stored under claude. You can do that like so:

llm keys set anthropic --value "$(llm keys get claude)"

I released llm-anthropic 0.12 yesterday with new features not previously included in llm-claude-3:

  • Support for Claude's prefill feature, using the new -o prefill '{' option and the accompanying -o hide_prefill 1 option to prevent the prefill from being included in the output text. #2
  • New -o stop_sequences '```' option for specifying one or more stop sequences. To specify multiple stop sequences pass a JSON array of strings: -o stop_sequences '["end", "stop"]'.
  • Model options are now documented in the README.
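
If you drive the llm command from a script, the options above can be assembled with a small helper like this sketch. The `anthropic_options` function is hypothetical, and the model ID used in the example is an assumption; substitute whichever model ID your installation lists. Only the -o prefill, -o hide_prefill and -o stop_sequences options come from the release notes above.

```python
import json
import shlex


def anthropic_options(prefill=None, hide_prefill=False, stop_sequences=None):
    """Return the -o arguments described above as a list of strings."""
    args = []
    if prefill is not None:
        args += ["-o", "prefill", prefill]
        if hide_prefill:
            # hide_prefill only makes sense when a prefill is set
            args += ["-o", "hide_prefill", "1"]
    if stop_sequences:
        if len(stop_sequences) == 1:
            value = stop_sequences[0]  # a single sequence is passed as-is
        else:
            value = json.dumps(stop_sequences)  # several: JSON array of strings
        args += ["-o", "stop_sequences", value]
    return args


MODEL = "claude-3.5-sonnet"  # assumed model ID, for illustration only

command = ["llm", "-m", MODEL, "Describe a pelican as JSON"] + anthropic_options(
    prefill="{", hide_prefill=True, stop_sequences=["end", "stop"]
)
print(shlex.join(command))  # shell-quoted command line, ready to paste
```

Building the argument list in Python and quoting it with shlex.join avoids getting the nested quotes around the JSON array wrong by hand.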

If you install or upgrade llm-claude-3 you will now get llm-anthropic instead, thanks to a tiny package on PyPI which depends on the new plugin name. I created that with my pypi-rename cookiecutter template.

Here's the issue for the rename. I archived the llm-claude-3 repository on GitHub, and got to use the brand new PyPI archiving feature to archive the llm-claude-3 project on PyPI as well.

# 6:17 am / llm, anthropic, claude, plugins, ai, pypi, llms, python, generative-ai

[In response to a question about releasing model weights]

Yes, we are discussing. I personally think we have been on the wrong side of history here and need to figure out a different open source strategy; not everyone at OpenAI shares this view, and it's also not our current highest priority.

Sam Altman, in a Reddit AMA

# 8:11 am / openai, llms, ai, generative-ai, open-source, sam-altman

Part of the concept of ‘Disruption’ is that important new technologies tend to be bad at the things that matter to the previous generation of technology, but they do something else important instead. Asking if an LLM can do very specific and precise information retrieval might be like asking if an Apple II can match the uptime of a mainframe, or asking if you can build Photoshop inside Netscape. No, they can’t really do that, but that’s not the point and doesn’t mean they’re useless. They do something else, and that ‘something else’ matters more and pulls in all of the investment, innovation and company creation. Maybe, 20 years later, they can do the old thing too - maybe you can run a bank on PCs and build graphics software in a browser, eventually - but that’s not what matters at the beginning. They unlock something else.

What is that ‘something else’ for generative AI, though? How do you think conceptually about places where that error rate is a feature, not a bug?

Benedict Evans, Are better models better?

# 2:37 pm / benedict-evans, llms, ai, generative-ai

OpenAI reasoning models: Advice on prompting (via) OpenAI's documentation for their o1 and o3 "reasoning models" includes some interesting tips on how to best prompt them:

The first tip concerns the new developer message type, which takes the place of system messages for these models. This appears to be a purely aesthetic change made for consistency with their instruction hierarchy concept. As far as I can tell the old system prompts continue to work exactly as before - you're encouraged to use the new developer message type but it has no impact on what actually happens.

Since my LLM tool already bakes in a llm --system "system prompt" option which works across multiple different models from different providers I'm not going to rush to adopt this new language!

  • Use delimiters for clarity: Use delimiters like markdown, XML tags, and section titles to clearly indicate distinct parts of the input, helping the model interpret different sections appropriately.

Anthropic have been encouraging XML-ish delimiters for a while (I say -ish because there's no requirement that the resulting prompt is valid XML). My files-to-prompt tool has a -c option which outputs Claude-style XML, and in my experiments this same option works great with o1 and o3 too:

git clone
cd limbo/bindings/python

files-to-prompt . -c | llm -m o3-mini \
  -o reasoning_effort high \
  --system 'Write a detailed README with extensive usage examples'
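
To show what XML-ish delimiters look like in practice, here is a minimal sketch that wraps a few documents in tags. The tag names are my assumption of a Claude-style layout and are not guaranteed to match the exact output of files-to-prompt -c; the `wrap_documents` function is hypothetical.

```python
def wrap_documents(documents):
    """Wrap (source, content) pairs in XML-ish delimiters.

    Content is inserted verbatim with no escaping, so the result is
    "XML-ish" rather than guaranteed-valid XML.
    """
    parts = ["<documents>"]
    for index, (source, content) in enumerate(documents, start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{source}</source>")
        parts.append("<document_content>")
        parts.append(content)
        parts.append("</document_content>")
        parts.append("</document>")
    parts.append("</documents>")
    return "\n".join(parts)


print(
    wrap_documents(
        [
            ("README.md", "# Example project"),
            ("setup.py", "from setuptools import setup"),
        ]
    )
)
```

The delimiters give the model an unambiguous boundary between one file and the next, which is the point of the tip.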

  • Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.

This makes me think that o1/o3 are not good models to implement RAG on at all - with RAG I like to be able to dump as much extra context into the prompt as possible and leave it to the models to figure out what's relevant.

  • Try zero shot first, then few shot if needed: Reasoning models often don't need few-shot examples to produce good results, so try to write prompts without examples first. If you have more complex requirements for your desired output, it may help to include a few examples of inputs and desired outputs in your prompt. Just ensure that the examples align very closely with your prompt instructions, as discrepancies between the two may produce poor results.

Providing examples remains the single most powerful prompting tip I know, so it's interesting to see advice here to only switch to examples if zero-shot doesn't work out.

  • Be very specific about your end goal: In your instructions, try to give very specific parameters for a successful response, and encourage the model to keep reasoning and iterating until it matches your success criteria.

This makes sense: reasoning models "think" until they reach a conclusion, so making the goal as unambiguous as possible leads to better results.

  • Markdown formatting: Starting with o1-2024-12-17, reasoning models in the API will avoid generating responses with markdown formatting. To signal to the model when you do want markdown formatting in the response, include the string Formatting re-enabled on the first line of your developer message.

This one was a real shock to me! I noticed that o3-mini was outputting • characters instead of Markdown * bullets and initially thought that was a bug.

I first saw this while running this prompt against limbo/bindings/python using files-to-prompt:

git clone
cd limbo/bindings/python

files-to-prompt . -c | llm -m o3-mini \
  -o reasoning_effort high \
  --system 'Write a detailed README with extensive usage examples'

Here's the full result, which includes text like this (note the weird bullets):

• High‑performance, in‑process database engine written in Rust  
• SQLite‑compatible SQL interface  
• Standard Python DB‑API 2.0–style connection and cursor objects

I ran it again with this modified prompt:

Formatting re-enabled. Write a detailed README with extensive usage examples.

And this time got back proper Markdown, rendered in this Gist. That did a really good job, and included bulleted lists using this valid Markdown syntax instead:

- **`make test`**: Run tests using pytest.
- **`make lint`**: Run linters (via [ruff](
- **`make check-requirements`**: Validate that the `requirements.txt` files are in sync with `pyproject.toml`.
- **`make compile-requirements`**: Compile the `requirements.txt` files using pip-tools.

Py-Limbo. Py-Limbo is a lightweight, in-process, OLTP (Online Transaction Processing) database management system built as a Python extension module on top of Rust. It is designed to be compatible with SQLite in both usage and API, while offering an opportunity to experiment with Rust-backed database functionality. Note: Py-Limbo is a work-in-progress (Alpha stage) project. Some features (e.g. transactions, executemany, fetchmany) are not yet supported. Table of Contents - then a hierarchical nested table of contents.

(Using LLMs like this to get me off the ground with under-documented libraries is a trick I use several times a month.)
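
If you build prompts in code, the "Formatting re-enabled" trick described above is easy to wrap in a helper. This is a sketch with a hypothetical function name; it puts the string on its own first line, which is one reading of OpenAI's instruction (my prompt above put it at the start of the same line instead).

```python
MARKDOWN_FLAG = "Formatting re-enabled"


def enable_markdown(developer_message):
    """Prepend the Markdown opt-in string unless it is already on line one."""
    first_line = developer_message.split("\n", 1)[0]
    if first_line.strip().startswith(MARKDOWN_FLAG):
        return developer_message  # already opted in, leave untouched
    return f"{MARKDOWN_FLAG}\n{developer_message}"


print(enable_markdown("Write a detailed README with extensive usage examples"))
```

The check on the first line makes the helper safe to call more than once on the same prompt.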

Update: OpenAI's Nikunj Handa:

we agree this is weird! fwiw, it’s a temporary thing we had to do for the existing o-series models. we’ll fix this in future releases so that you can go back to naturally prompting for markdown or no-markdown.

# 8:56 pm / o1, openai, o3, markdown, ai, llms, prompt-engineering, generative-ai, inference-scaling, rag, ai-assisted-programming, documentation, limbo, llm

While we encourage people to use AI systems during their role to help them work faster and more effectively, please do not use AI assistants during the application process. We want to understand your personal interest in Anthropic without mediation through an AI system, and we also want to evaluate your non-AI-assisted communication skills. Please indicate 'Yes' if you have read and agree.

Why do you want to work at Anthropic? (We value this response highly - great answers are often 200-400 words.)

Anthropic, online job application form

# 9:38 pm / anthropic, ethics, generative-ai, ai, llms