As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.
Recent articles
- LLM predictions for 2026, shared with Oxide and Friends - 8th January 2026
- Introducing gisthost.github.io - 1st January 2026
- 2025: The year in LLMs - 31st December 2025