Simon Willison’s Weblog


o3-mini-system-card.pdf. The o3-mini system card is out - the model itself is likely to be available shortly.

While o3-mini scores higher than o1 and GPT-4o on many of the included benchmarks, especially around coding, it isn't universally better than them across every benchmark.

The biggest win was on Codeforces Elo, a competitive programming benchmark where o3-mini scored 2036 against 1841 for o1, 1250 for o1-preview and 900 for GPT-4o. This fits my intuition that inference-scaling models (like R1) are really good at complex code challenges.
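
To get a feel for how big those gaps are, here's a rough sketch that computes the point differences and plugs them into the textbook Elo expected-score formula. The four ratings are the ones quoted above; everything else is my own assumption. In particular, the function names are made up for illustration, and treating these benchmark numbers as classic Elo ratings with a 400-point scale is an approximation, not something the system card states.

```python
# Codeforces scores as quoted in the paragraph above.
SCORES = {"o3-mini": 2036, "o1": 1841, "o1-preview": 1250, "GPT-4o": 900}


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score for A against B.

    Assumption: the benchmark numbers behave like classic Elo ratings
    on a 400-point logistic scale. This is only a way to read the gap.
    """
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def gaps_to_leader(scores: dict, leader: str = "o3-mini") -> dict:
    """Point difference between the leader and every other model."""
    top = scores[leader]
    return {name: top - value for name, value in scores.items() if name != leader}


if __name__ == "__main__":
    for name, gap in gaps_to_leader(SCORES).items():
        expected = elo_expected_score(SCORES["o3-mini"], SCORES[name])
        print(f"o3-mini vs {name}: +{gap} points, expected score {expected:.2f}")
```

An expected score of 0.5 means evenly matched, so anything well above that suggests o3-mini would usually come out ahead in a head-to-head under this (simplified) reading.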