Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach

2024-02-20 02:48:14
Guillermo Puebla, Jeffrey S. Bowers

Abstract

Achieving visual reasoning is a long-term goal of artificial intelligence. In the last decade, several studies have applied deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of generalization of the relations learned. However, in recent years, object-centric representation learning has been put forward as a way to achieve visual reasoning within the deep learning framework. Object-centric models attempt to model input scenes as compositions of objects and relations between them. To this end, these models use several kinds of attention mechanisms to segregate the individual objects in a scene from the background and from other objects. In this work we tested relation learning and generalization in several object-centric models, as well as a ResNet-50 baseline. In contrast to previous research, which has focused heavily on the same-different task in order to assess relational reasoning in DNNs, we use a set of tasks, with varying degrees of difficulty, derived from the comparative cognition literature. Our results show that object-centric models are able to segregate the different objects in a scene, even in many out-of-distribution cases. In our simpler tasks, this improves their capacity to learn and generalize visual relations in comparison to the ResNet-50 baseline. However, object-centric models still struggle in our more difficult tasks and conditions. We conclude that abstract visual reasoning remains an open challenge for DNNs, including object-centric models.

URL

https://arxiv.org/abs/2402.12675

PDF

https://arxiv.org/pdf/2402.12675.pdf

