23rd July 2024
As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.
Recent articles
- DeepSeek V4 - almost on the frontier, a fraction of the price - 24th April 2026
- Extract PDF text in your browser with LiteParse for the web - 23rd April 2026
- A pelican for GPT-5.5 via the semi-official Codex backdoor API - 23rd April 2026