A quote from Benj Edwards

23rd July 2024

As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.

— Benj Edwards

Posted 23rd July 2024 at 9:14 pm

Simon Willison’s Weblog
