Paper Reading AI Learner

Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning

2024-09-30 11:48:11
Oleh Kolner, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi

Abstract

Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.
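
To make the glimpse loop described in the abstract concrete, here is a minimal sketch of glimpse-based active perception: compute a saliency map, repeatedly fixate the most salient location, crop a high-resolution patch there, record the normalized fixation location (the low-dimensional "action" signal the paper argues helps represent relations), and suppress the visited region so the next glimpse moves on. The gradient-magnitude saliency heuristic, the inhibition-of-return rule, and all function names and parameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a GAP-style glimpse loop; heuristics and names
# are assumptions for demonstration, not the paper's actual method.
import numpy as np

def saliency_map(image):
    # Gradient-magnitude saliency as a simple stand-in for whatever
    # saliency mechanism the full system uses.
    gy, gx = np.gradient(image.astype(float))
    return np.hypot(gx, gy)

def extract_glimpse(image, center, size):
    # Crop a high-resolution size x size patch around `center` (odd size),
    # padding at the image borders.
    r, c = center
    h = size // 2
    padded = np.pad(image, h, mode="edge")
    return padded[r:r + size, c:c + size]

def glimpse_sequence(image, n_glimpses=4, size=9):
    # Sequentially fixate the most salient locations, suppressing each
    # visited region (inhibition of return) so fixations spread out.
    sal = saliency_map(image)
    glimpses, locations = [], []
    for _ in range(n_glimpses):
        r, c = np.unravel_index(np.argmax(sal), sal.shape)
        glimpses.append(extract_glimpse(image, (r, c), size))
        # Normalized (x, y) fixation location: the low-dimensional spatial
        # signal paired with the visual content of each glimpse.
        locations.append((c / sal.shape[1], r / sal.shape[0]))
        sal[max(r - size, 0):r + size, max(c - size, 0):c + size] = 0.0
    return glimpses, locations

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))
    img[10:20, 10:20] += 2.0   # two bright "objects"
    img[40:50, 45:55] += 2.0
    g, locs = glimpse_sequence(img)
    print("fixation locations:", [(round(x, 2), round(y, 2)) for x, y in locs])
```

In a full system, the sequence of (glimpse content, location) pairs would be fed to a downstream relational model; the key design point from the abstract is that the locations themselves, not just the pixels, carry the relational information.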

URL

https://arxiv.org/abs/2409.20213

PDF

https://arxiv.org/pdf/2409.20213.pdf
