To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B:
- If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
- If A and B have similar performance, their eval scores should be similar.
Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly.
Recent articles
- Trying out the new Gemini 2.5 model family - 17th June 2025
- The lethal trifecta for AI agents: private data, untrusted content, and external communication - 16th June 2025
- An Introduction to Google’s Approach to AI Agent Security - 15th June 2025