Abstract
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans average 95.70% accuracy, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of only 51.26% and 45.72%, just 13.17% and 7.63% above random guessing (38.09%), indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
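To make the evaluation setup concrete, the sketch below shows how a Blink-style multiple-choice item might be represented and scored, including the chance-baseline arithmetic the abstract cites (e.g., 51.26% reported as 13.17% above random guessing implies a 38.09% chance baseline, which arises from mixing 2-, 3-, and 4-way questions). The `BlinkItem` schema and field names are illustrative assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one Blink-style item; field names are
# illustrative assumptions, not the benchmark's released format.
@dataclass
class BlinkItem:
    task: str               # e.g. "Relative_Depth", "Visual_Correspondence"
    question: str           # text prompt, may refer to visual marks on the image
    image_paths: list[str]  # single- or multi-image questions
    choices: list[str]      # e.g. ["(A)", "(B)", "(C)", "(D)"]
    answer: str             # gold choice label

def accuracy(predictions: list[str], items: list[BlinkItem]) -> float:
    """Percent of questions where the predicted label matches the gold label."""
    correct = sum(pred == item.answer for pred, item in zip(predictions, items))
    return 100.0 * correct / len(items)

def expected_random_accuracy(items: list[BlinkItem]) -> float:
    """Expected accuracy of uniform guessing: the mean of 1/num_choices.

    With a mix of 2-, 3-, and 4-way questions this lands between 25% and
    50%; the abstract's margins over chance (13.17% and 7.63%) are model
    accuracy minus this baseline (38.09% on Blink overall).
    """
    return 100.0 * sum(1.0 / len(item.choices) for item in items) / len(items)

if __name__ == "__main__":
    # Toy examples only; the real benchmark has 3,807 such questions.
    items = [
        BlinkItem("Relative_Depth", "Which marked point is closer to the camera?",
                  ["img_0.png"], ["(A)", "(B)"], "(A)"),
        BlinkItem("Multi-view_Reasoning", "Did the camera move left or right?",
                  ["img_1a.png", "img_1b.png"], ["(A)", "(B)"], "(B)"),
    ]
    preds = ["(A)", "(A)"]
    print(f"model accuracy:  {accuracy(preds, items):.2f}%")
    print(f"random baseline: {expected_random_accuracy(items):.2f}%")
```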
URL
https://arxiv.org/abs/2404.12390