As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.
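To make the A/B leaderboard idea concrete, here is a minimal sketch of how pairwise votes can be turned into a ranking using an Elo-style update. This is an illustration only, not Chatbot Arena's actual method (their published leaderboard fits a Bradley-Terry model to the vote data); the model names, `K` factor, and votes below are all hypothetical.

```python
import math
from collections import defaultdict

# Elo-style rating from pairwise A/B votes (illustrative sketch).
K = 32          # update step size (hypothetical choice)
BASE = 1000.0   # starting rating for every model

ratings: dict[str, float] = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(winner: str, loser: str) -> None:
    """Update both ratings after a single A/B vote."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

# Hypothetical votes: each tuple is (winner, loser).
for winner, loser in [("model-a", "model-b"), ("model-a", "model-c"),
                      ("model-b", "model-c"), ("model-a", "model-b")]:
    record_vote(winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

The appeal of this approach is that it never needs a ground-truth answer key: the ranking emerges entirely from which response human voters preferred, which is exactly the "vibemarking" quality the quoted passage describes.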