31st January 2025 - Link Blog
o3-mini-system-card.pdf. The o3-mini system card is out - the model itself is likely to be available shortly.
While o3-mini scores higher than o1 and GPT-4o on many of the included benchmarks, especially around coding, it isn't universally better than them across every benchmark.
The biggest win was on Codeforces Elo, a competitive programming benchmark where o3-mini scored 2036 against 1841 for o1, 1250 for o1-preview and 900 for GPT-4o. This fits my intuition that inference-scaling models (like R1) are really good at complex code challenges.