23rd July 2024
As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.