Pydantic Evals (via) Brand new package from David Montague and the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.

The feature is described as "in beta" and comes with this very realistic warning:

Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.

This code example from their documentation illustrates the relationship between the two key nouns - Cases and Datasets:

from pydantic_evals import Case, Dataset

case1 = Case(
    name="simple_case",
    inputs="What is the capital of France?",
    expected_output="Paris",
    metadata={"difficulty": "easy"},
)

dataset = Dataset(cases=[case1])
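
To actually run an eval, the dataset is executed against the function under test. A minimal sketch continuing the example above, using the evaluate_sync() and report.print() helpers shown in the documentation, with a hypothetical answer_question() function standing in for a real LLM call:

async def answer_question(question: str) -> str:
    # Hypothetical stand-in for the real LLM-backed function being evaluated
    return "Paris"

# Run every case in the dataset against the function and print a summary table
report = dataset.evaluate_sync(answer_question)
report.print()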

The library also supports custom evaluators, including LLM-as-a-judge:

from pydantic_evals import Case
from pydantic_evals.evaluators import LLMJudge

# CustomerOrder is a user-defined Pydantic model from the docs example
Case(
    name="vegetarian_recipe",
    inputs=CustomerOrder(
        dish_name="Spaghetti Bolognese", dietary_restriction="vegetarian"
    ),
    expected_output=None,
    metadata={"focus": "vegetarian"},
    evaluators=(
        LLMJudge(
            rubric="Recipe should not contain meat or animal products",
        ),
    ),
)
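
Beyond the built-in LLMJudge, custom evaluators appear to be ordinary classes that subclass Evaluator and implement an evaluate() method against an EvaluatorContext. A rough sketch based on my reading of the docs, with ExactMatch as an illustrative name of my own:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ExactMatch(Evaluator):
    # Score 1.0 when the output matches the expected output exactly, else 0.0
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        return 1.0 if ctx.output == ctx.expected_output else 0.0

Evaluators can be attached per-case via the evaluators= argument shown above, and the docs also show attaching them to a whole dataset.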

Cases and datasets can also be serialized to YAML.
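
A quick sketch of what that looks like, assuming the to_file() / from_file() methods from the documentation (the filename here is made up):

# Write the dataset out as YAML, then load it back later (e.g. in a test suite)
dataset.to_file("capital_questions.yaml")
dataset = Dataset.from_file("capital_questions.yaml")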

My first impressions are that this looks like a solid implementation of a sensible design. I'm looking forward to trying it out against a real project.