As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.