RAVEN: A Dataset for Relational and Analogical Visual rEasoNing

2019-03-07 06:28:44
Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, Song-Chun Zhu

Abstract

Dramatic progress has been witnessed in basic vision tasks involving low-level perception, such as object recognition, detection, and tracking. Unfortunately, there is still an enormous performance gap between artificial vision systems and human intelligence on higher-level vision problems, especially those involving reasoning. Earlier attempts at equipping machines with high-level reasoning have centered on Visual Question Answering (VQA), a typical task that associates vision with language understanding. In this work, we propose a new dataset, built in the context of Raven's Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning in a hierarchical representation. Unlike previous work on measuring abstract reasoning using RPM, we establish a semantic link between vision and reasoning by providing a structure representation. This addition enables a new type of abstract reasoning that jointly operates on the structure representation. We evaluate the reasoning ability of modern computer vision models on the newly proposed dataset and provide human performance as a reference. Finally, we show consistent improvement across all models by incorporating a simple neural module that combines visual understanding and structure reasoning.

URL

https://arxiv.org/abs/1903.02741

PDF

https://arxiv.org/pdf/1903.02741.pdf

