Notes on OpenAI’s new o1 chain-of-thought models

12th September 2024

OpenAI released two major new preview models today: o1-preview and o1-mini (that mini one is not a preview)—previously rumored as having the codename “strawberry”. There’s a lot to understand about these models—they’re not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved “reasoning” capabilities.

Trained for chain of thought
Low-level details from the API documentation
Hidden reasoning tokens
Examples
What’s new in all of this

Trained for chain of thought

OpenAI’s elevator pitch is a good starting point:

We’ve developed a new series of AI models designed to spend more time thinking before they respond.

One way to think about these new models is as a specialized extension of the chain of thought prompting pattern—the “think step by step” trick that we’ve been exploring as a a community for a couple of years now, first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022.

OpenAI’s article Learning to Reason with LLMs explains how the new models were trained:

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

[...]

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.

Effectively, this means the models can better handle significantly more complicated prompts where a good result requires backtracking and “thinking” beyond just next token prediction.

I don’t really like the term “reasoning” because I don’t think it has a robust definition in the context of LLMs, but OpenAI have committed to using it here and I think it does an adequate job of conveying the problem these new models are trying to solve.

Low-level details from the API documentation

Some of the most interesting details about the new models and their trade-offs can be found in their API documentation:

For applications that need image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice. However, if you’re aiming to develop applications that demand deep reasoning and can accommodate longer response times, the o1 models could be an excellent choice.

Some key points I picked up from the docs:

API access to the new o1-preview and o1-mini models is currently reserved for tier 5 accounts—you’ll need to have spent at least $1,000 on API credits.
No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages.
No streaming support, tool usage, batch calls or image inputs either.
“Depending on the amount of reasoning required by the model to solve the problem, these requests can take anywhere from a few seconds to several minutes.”

Most interestingly is the introduction of “reasoning tokens”—tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens.

Thanks to the importance of reasoning tokens—OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models—the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini! These are an increase from the gpt-4o and gpt-4o-mini models which both currently have a 16,384 output token limit.

One last interesting tip from that API documentation:

Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.

This is a big change from how RAG is usually implemented, where the advice is often to cram as many potentially relevant documents as possible into the prompt.

Hidden reasoning tokens

A frustrating detail is that those reasoning tokens remain invisible in the API—you get billed for them, but you don’t get to see what they were. OpenAI explain why in Hiding the Chains of Thought:

Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

So two key reasons here: one is around safety and policy compliance: they want the model to be able to reason about how it’s obeying those policy rules without exposing intermediary steps that might include information that violates those policies. The second is what they call competitive advantage—which I interpret as wanting to avoid other models being able to train against the reasoning work that they have invested in.

I’m not at all happy about this policy decision. As someone who develops against LLMs interpretability and transparency are everything to me—the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

Examples

OpenAI provide some initial examples in the Chain of Thought section of their announcement, covering things like generating Bash scripts, solving crossword puzzles and calculating the pH of a moderately complex solution of chemicals.

These examples show that the ChatGPT UI version of these models does expose details of the chain of thought... but it doesn’t show the raw reasoning tokens, instead using a separate mechanism to summarize the steps into a more human-readable form.

OpenAI also have two new cookbooks with more sophisticated examples, which I found a little hard to follow:

Using reasoning for data validation shows a multiple step process for generating example data in an 11 column CSV and then validating that in various different ways.
Using reasoning for routine generation showing o1-preview code to transform knowledge base articles into a set of routines that an LLM can comprehend and follow.

I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview. A couple of my favourites:

How many words are in your response to this prompt? by Matthew Berman—the model thinks for ten seconds across five visible turns before answering “There are seven words in this sentence.”
Explain this joke: “Two cows are standing in a field, one cow asks the other: “what do you think about the mad cow disease that’s going around?”. The other one says: “who cares, I’m a helicopter!” by Fabian Stelzer—the explanation makes sense, apparently other models have failed here.

Great examples are still a bit thin on the ground though. Here’s a relevant note from OpenAI researcher Jason Wei, who worked on creating these new models:

Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Ethan Mollick has been previewing the models for a few weeks, and published his initial impressions. His crossword example is particularly interesting for the visible reasoning steps, which include notes like:

I noticed a mismatch between the first letters of 1 Across and 1 Down. Considering “CONS” instead of “LIES” for 1 Across to ensure alignment.

What’s new in all of this

It’s going to take a while for the community to shake out the best practices for when and where these models should be applied. I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet), but it’s going to be really interesting to see us collectively expand our mental model of what kind of tasks can be solved using LLMs given this new class of model.

I expect we’ll see other AI labs, including the open model weights community, start to replicate some of these results with their own versions of models that are specifically trained to apply this style of chain-of-thought reasoning.

Posted 12th September 2024 at 10:36 pm · Follow me on Mastodon, Bluesky, Twitter or subscribe to my newsletter

Simon Willison’s Weblog