Paper Reading AI Learner

The role of object-centric representations, guided attention, and external memory on generalizing visual relations

2023-04-14 12:22:52
Guillermo Puebla, Jeffrey S. Bowers

Abstract

Visual reasoning is a long-term goal of vision research. In the last decade, several works have attempted to apply deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of the generalization of the relations learned. In recent years, several innovations in DNNs have been developed in order to enable learning abstract relations from images. In this work, we systematically evaluate a series of DNNs that integrate mechanisms such as slot attention, recurrently guided attention, and external memory, in the simplest possible visual reasoning task: deciding whether two objects are the same or different. We found that, although some models performed better than others in generalizing the same-different relation to specific types of images, no model was able to generalize this relation across the board. We conclude that abstract visual reasoning remains largely an unresolved challenge for DNNs.

URL

https://arxiv.org/abs/2304.07091

PDF

https://arxiv.org/pdf/2304.07091.pdf

