OpenAI o3-mini, now available in LLM
31st January 2025
OpenAI’s o3-mini is out today. As with other o-series models it’s a slightly difficult one to evaluate—we now need to decide if a prompt is best run using GPT-4o, o1, o3-mini or (if we have access) o1 Pro.
Confusing matters further, the benchmarks in the o3-mini system card (PDF) aren’t a universal win for o3-mini across all categories. It generally benchmarks higher than GPT-4o and o1, but not in every category.
The biggest win for o3-mini is on the Codeforces ELO competitive programming benchmark, which I think is described by this 2nd January 2025 paper, with the following scores:
- o3-mini (high) 2130
- o3-mini (medium) 2036
- o1 1891
- o3-mini (low) 1831
- o1-mini 1650
- o1-preview 1258
- GPT-4o 900
Weirdly, that GPT-4o score was in an older copy of the System Card PDF which has been replaced by an updated document that doesn’t mention Codeforces ELO scores at all.
One note from the System Card that stood out for me concerning intended applications of o3-mini for OpenAI themselves:
We also plan to allow users to use o3-mini to search the internet and summarize the results in ChatGPT. We expect o3-mini to be a useful and safe model for doing this, especially given its performance on the jailbreak and instruction hierarchy evals detailed in Section 4 below.
This is notable because the existing o1 models in ChatGPT have not yet been given access to its web search tool—despite the clear benefits of combining search with “reasoning” models.
I released LLM 0.21 with support for the new model, plus its `-o reasoning_effort high` (or `medium` or `low`) option for tweaking the reasoning effort—details in this issue.
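Assuming LLM 0.21’s standard `-o option value` syntax for model options, trying the three effort levels from the command line looks something like this (the prompt here is just an illustration):

```shell
# Run o3-mini at the default, low and high reasoning effort levels.
# The -o reasoning_effort value is passed through to the OpenAI API.
llm -m o3-mini 'Write a haiku about compilers'
llm -m o3-mini -o reasoning_effort low 'Write a haiku about compilers'
llm -m o3-mini -o reasoning_effort high 'Write a haiku about compilers'
```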
Note that the new model is currently only available for Tier 3 and higher users, which requires you to have spent at least $100 on the API.
o3-mini is priced at $1.10/million input tokens, $4.40/million output tokens—less than half the price of GPT-4o (currently $2.50/$10) and massively cheaper than o1 ($15/$60).
I tried using it to summarize this conversation about o3-mini on Hacker News, using my hn-summary.sh script.
```
hn-summary.sh 42890627 -o o3-mini
```
Here’s the result—it used 18,936 input tokens and 2,905 output tokens for a total cost of 3.3612 cents.
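That figure checks out against the prices listed above—a quick sanity check, multiplying the token counts by o3-mini’s per-million-token rates:

```shell
# Verify the reported cost: token counts times o3-mini's
# $1.10/M input and $4.40/M output prices, expressed in cents.
awk 'BEGIN {
  input  = 18936 * 1.10   # input tokens  * $/million
  output = 2905  * 4.40   # output tokens * $/million
  printf "%.4f cents\n", (input + output) / 1000000 * 100
}'
# prints "3.3612 cents"
```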
Another characteristic worth noting is o3-mini’s token output limit—the measure of how much text it can output in one go. That’s 100,000 tokens, compared to 16,000 for GPT-4o and just 8,000 for both DeepSeek R1 and Claude 3.5 Sonnet.
Invisible “reasoning tokens” come out of the same budget, so it’s likely not possible to have it output the full 100,000.
The model accepts up to 200,000 tokens of input, an improvement on GPT-4o’s 128,000.
An application where output limits really matter is translation between human languages, where the output can realistically be expected to have a similar length to the input. It will be interesting seeing how well o3-mini works for that, especially given its low price.