Vision language models are blind (via) A new paper exploring vision LLMs, comparing GPT-4o, Gemini 1.5 Pro, Claude 3 Sonnet and Claude 3.5 Sonnet (I'm surprised they didn't include Claude 3 Opus and Haiku, which are more interesting than Claude 3 Sonnet in my opinion).
I don't like the title and framing of this paper. They describe seven tasks that vision models have trouble with - mainly geometric analysis like identifying intersecting shapes or counting things - and use those to support the following statement:
The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like of a person with myopia seeing fine details as blurry, and at worst, like an intelligent person that is blind making educated guesses.
While the failures they describe are certainly interesting, I don't think they justify that conclusion.
I've felt starved for information about the strengths and weaknesses of these vision LLMs since the good ones started becoming available last November (GPT-4 Vision at OpenAI DevDay) so identifying tasks like this that they fail at is useful. But just like pointing out an LLM can't count letters doesn't mean that LLMs are useless, these limitations of vision models shouldn't be used to declare them "blind" as a sweeping statement.
Recent articles
- Gemini 2.0 Flash: An outstanding multi-modal LLM with a sci-fi streaming mode - 11th December 2024
- ChatGPT Canvas can make API requests now, but it's complicated - 10th December 2024
- I can now run a GPT-4 class model on my laptop - 9th December 2024