Recent years have seen astonishing progress in AI systems that can recognize and analyze the content of complex images. But a new paper highlights how state-of-the-art “vision language models” (VLMs) often fail at simple, low-level visual analysis tasks that are trivially easy for a human.
In the provocatively titled pre-print article “Vision language models are Blind” (which has a PDF version that includes a dark sunglasses emoji in the title), researchers from Auburn University and the University of Alberta create eight simple visual acuity tests with objectively correct answers. These range from identifying how many times two colored lines intersect to identifying which letter in a long word is circled to counting how many nested shapes appear in an image (representative examples and results can be viewed on the research team’s webpage).
Crucially, these tests are generated by custom code and do not rely on pre-existing images or tests found on the public internet, thereby minimizing the likelihood that the tasks can be solved by memorization, the researchers said. The tests also “require minimal to no world knowledge” beyond basic 2D shapes, making it difficult to infer the answer from “textual questions and choices alone” (a problem that has been identified in some other visual AI benchmarks).
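The article doesn’t include the researchers’ generation code, but to give a sense of how a procedurally generated test with an objectively correct answer might be built, here is a minimal Python sketch for a line-intersection task; the file name, colors, and parameters are illustrative assumptions, not details from the paper:

```python
# Illustrative only: a minimal sketch of a procedurally generated test image
# with a known ground-truth answer. Not the researchers' actual code; the
# file name and parameters are made up for illustration.
import random
import matplotlib.pyplot as plt

def segments_intersect(p1, p2, p3, p4):
    """Return True if segment p1-p2 crosses segment p3-p4."""
    def ccw(a, b, c):
        return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])
    return (ccw(p1, p3, p4) != ccw(p2, p3, p4)) and (ccw(p1, p2, p3) != ccw(p1, p2, p4))

def make_line_intersection_test(seed=0):
    """Draw two random polylines and return how many times they cross."""
    rng = random.Random(seed)
    xs = [0.0, 0.5, 1.0]
    red = list(zip(xs, [rng.uniform(0, 1) for _ in xs]))
    blue = list(zip(xs, [rng.uniform(0, 1) for _ in xs]))

    # Ground truth: count crossings segment by segment (both polylines share
    # the same x breakpoints, so only same-interval segments can cross).
    crossings = 0
    for i in range(len(xs) - 1):
        if segments_intersect(red[i], red[i + 1], blue[i], blue[i + 1]):
            crossings += 1

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot(*zip(*red), color="red", linewidth=2)
    ax.plot(*zip(*blue), color="blue", linewidth=2)
    ax.axis("off")
    fig.savefig(f"intersection_test_{seed}.png", dpi=150)
    plt.close(fig)
    return crossings  # the objectively correct answer for this image

print(make_line_intersection_test(seed=42))
```

Because the image and its answer come from the same code, grading a model’s response is a simple string or number comparison rather than a judgment call.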
Are you smarter than a 5th grader?
After running multiple tests on four different VLMs (GPT-4o, Gemini 1.5 Pro, Claude 3 Sonnet, and Claude 3.5 Sonnet), the researchers found that all four fell well short of the 100 percent accuracy you’d expect for such simple visual analysis tasks (and which most sighted people would have no trouble achieving). But the magnitude of the AI’s underperformance varied widely depending on the specific task. For example, when asked to count the number of rows and columns in an empty grid, the best-performing model produced an accurate answer less than 60 percent of the time. On the other hand, Gemini 1.5 Pro achieved nearly 93 percent accuracy when identifying circled letters, approaching human-level performance.
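The article doesn’t reproduce the researchers’ evaluation harness, but as a rough sketch of what a single benchmark query against one of these models might look like, here is the grid-counting question posed to GPT-4o via OpenAI’s Python client; the prompt wording, image file, and answer format are assumptions made for illustration:

```python
# Illustrative only: posing the grid-counting question to GPT-4o with the
# OpenAI Python client. The prompt text and image file name are assumptions,
# not details taken from the paper.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("blank_grid_4x5.png", "rb") as f:  # hypothetical test image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Count the rows and columns in this grid. "
                     "Answer in the form 'rows, columns'."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

answer = response.choices[0].message.content
print(answer)  # compare against the known ground truth, e.g. "4, 5"
```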
Even small changes in the tasks can lead to dramatic changes in the results. While all four models tested were able to correctly identify five overlapping hollow circles, the accuracy of all models dropped to well below 50 percent when six to nine circles were involved. The researchers hypothesize that this “suggests that VLMs are biased toward the familiar Olympic logo, which has 5 circles.” In other cases, models occasionally hallucinated nonsensical answers, such as guessing “9,” “n,” or “©” as the circled letter in the word “Subdermatoglyphic.”
Overall, the results highlight how AI models that perform well at high-level visual reasoning have some significant “blind spots” (sorry) when it comes to low-level abstract images. It’s all somewhat reminiscent of the similar capability gaps we often see in advanced large language models, which can produce extremely coherent summaries of long texts yet fail to answer extremely basic math and spelling questions.
These gaps in VLM capabilities could stem from the inability of these systems to generalize beyond the types of content they’re explicitly trained on. But when the researchers tried to fine-tune a model using specific images pulled from one of their tasks (the “are two circles touching?” test), that model showed only modest improvement, from 17 percent accuracy to about 37 percent. “The loss values for all of these experiments were very close to zero, indicating that the model overfits the training set, but does not generalize,” the researchers write.
The researchers propose that the VLM performance gap may be related to the so-called “late fusion” of vision encoders onto pre-trained large language models. An “early fusion” training approach that integrates visual encoding with language training could lead to better results on these low-level tasks, the researchers suggest (without providing any analysis of this question).
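To make the distinction concrete: a late-fusion model grafts a separately pre-trained vision encoder onto a text-only LLM through a small learned projection, while an early-fusion model feeds image patches and text tokens through one jointly trained network. The PyTorch-style sketch below is purely schematic; the module names and dimensions are placeholders and do not describe any particular model’s architecture:

```python
# Illustrative only: a schematic contrast between "late fusion" and
# "early fusion". All modules and dimensions are placeholders.
import torch
import torch.nn as nn

class LateFusionVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # pre-trained separately, often frozen
        self.llm = llm                        # pre-trained on text only
        # Only this small adapter learns to map image features into the
        # LLM's token-embedding space.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image, text_embeddings):
        image_tokens = self.projection(self.vision_encoder(image))
        # Image features are "fused" late: simply prepended to the text tokens.
        return self.llm(torch.cat([image_tokens, text_embeddings], dim=1))

class EarlyFusionVLM(nn.Module):
    def __init__(self, patch_embed, transformer):
        super().__init__()
        # Image patches and text tokens share one embedding space and one
        # transformer, trained together from the start.
        self.patch_embed = patch_embed
        self.transformer = transformer

    def forward(self, image, text_embeddings):
        tokens = torch.cat([self.patch_embed(image), text_embeddings], dim=1)
        return self.transformer(tokens)
```

The intuition behind the researchers’ suggestion is that in the late-fusion setup the language model never learns to “see” during its own training, relying instead on a thin adapter over features that may discard exactly the low-level spatial details these tests probe.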