o3-mini-system-card.pdf. The o3-mini system card is out - the model itself is likely to be available shortly.
While o3-mini scores higher than o1 and GPT-4o on many of the included benchmarks, especially around coding, it isn't universally better than them across every benchmark.
The biggest win was on Codeforces Elo, a competitive programming benchmark where o3-mini scored 2036 against 1841 for o1, 1250 for o1-preview and 900 for GPT-4o. This fits my intuition that inference-scaling models (like R1) are really good at complex code challenges.