Simon Willison’s Weblog

Frequently Asked Questions (And Answers) About AI Evals (via) Hamel Husain and Shreya Shankar have been running a paid, cohort-based course on AI Evals For Engineers & PMs over the past few months. Here Hamel collects answers to the most common questions asked during the course.

There's a ton of actionable advice in here. I continue to believe that a robust approach to evals is the single most important distinguishing factor between well-engineered, reliable AI systems and YOLO cross-fingers-and-hope-it-works development.

Hamel says:

It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development. [...]

In the projects we’ve worked on, we’ve spent 60-80% of our development time on error analysis and evaluation. Expect most of your effort to go toward understanding failures (i.e. looking at data) rather than building automated checks.

I found this tip to be useful and surprising:

If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application.
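That 70% figure only makes sense once you have automated checks producing a pass rate in the first place. As a minimal sketch (mine, not from the course), here's the shape such a harness can take in Python: a hypothetical run_model() stands in for the real LLM call, each case pairs a prompt with a programmatic check, and failures are kept aside for the error analysis Hamel emphasizes.

```python
def run_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call (swap in your API client).
    return "The meeting is on 2025-03-14."

EVAL_CASES = [
    # Each case pairs an input prompt with a programmatic check on the output.
    {"prompt": "Extract the date from: 'Meeting moved to 2025-03-14.'",
     "check": lambda out: "2025-03-14" in out},
    {"prompt": "Reply in one short sentence: when is the meeting?",
     "check": lambda out: len(out) <= 80},
]

def run_evals(cases):
    passed = 0
    failures = []
    for case in cases:
        try:
            output = run_model(case["prompt"])
            ok = case["check"](output)
        except Exception as exc:
            output, ok = repr(exc), False
        if not ok:
            # Keep failing transcripts around: this is the raw material
            # for the "looking at data" part of the work.
            failures.append((case["prompt"], output))
        passed += ok
    rate = passed / len(cases)
    print(f"pass rate: {rate:.0%} ({passed}/{len(cases)})")
    return rate, failures

if __name__ == "__main__":
    run_evals(EVAL_CASES)
```

The interesting output here isn't really the percentage, it's the failures list: per the 60-80% figure above, reading those transcripts is where most of the time goes.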
