o3-mini-system-card.pdf. The o3-mini system card is out - the model itself is likely to be available shortly.
While o3-mini scores higher than o1 and GPT-4o on many of the included benchmarks, especially around coding, it isn't universally better than them across every benchmark.
The biggest win was on Codeforces Elo, a competitive programming benchmark where o3-mini scored 2036 against 1841 for o1, 1250 for o1-preview and 900 for GPT-4o. This fits my intuition that inference-scaling models (like R1) are really good at complex code challenges.