Abstract
Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in the actions we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about these eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks, while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.
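To make the described mechanism concrete, below is a minimal Python/NumPy sketch of one way such a glimpsing loop could work: greedily selecting salient locations with inhibition of return, cropping a high-resolution patch at each, and pairing each patch with its low-dimensional location as input for a downstream relational module. All names here (select_glimpses, extract_glimpse, the saliency stand-in) are hypothetical illustrations under these assumptions, not the authors' implementation.

import numpy as np

def select_glimpses(saliency, num_glimpses, patch=16, suppress=2.0):
    """Greedily pick the most salient locations, suppressing each
    chosen neighborhood so successive glimpses cover new regions."""
    s = saliency.astype(float).copy()
    locs = []
    for _ in range(num_glimpses):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        locs.append((int(y), int(x)))
        # Inhibition of return: mask a window around the chosen point.
        r = int(patch * suppress) // 2
        s[max(0, y - r):y + r, max(0, x - r):x + r] = -np.inf
    return locs

def extract_glimpse(image, loc, patch=16):
    """Crop a high-resolution patch centered at the glimpse location,
    clamped to the image borders."""
    y, x = loc
    h, w = image.shape[:2]
    y0 = int(np.clip(y - patch // 2, 0, h - patch))
    x0 = int(np.clip(x - patch // 2, 0, w - patch))
    return image[y0:y0 + patch, x0:x0 + patch]

# Toy usage: each (patch, normalized location) pair is the kind of
# content-plus-position input from which relations between image
# parts could be represented.
rng = np.random.default_rng(0)
img = rng.random((128, 128))
sal = img  # stand-in saliency map; a real saliency model is assumed separately
for loc in select_glimpses(sal, num_glimpses=4):
    patch = extract_glimpse(img, loc)
    rel_input = (patch, (loc[0] / 128, loc[1] / 128))  # visual content + low-dim location

The point mirrored from the abstract is the last line: the relational input combines the visual content of each glimpse with the spatial coordinates of the glimpsing action, rather than visual content alone.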
URL
https://arxiv.org/abs/2409.20213