Paper Reading AI Learner

Cross-Modal Relationship Inference for Grounding Referring Expressions

2019-06-11 09:47:26
Sibei Yang, Guanbin Li, Yizhou Yu

Abstract

Grounding referring expressions is a fundamental yet challenging task that facilitates human-machine communication in the physical world. It locates the target object in an image by comprehending the relationships between a natural language referring expression and the image. A feasible solution for grounding referring expressions needs not only to extract all the necessary information in both the image and the referring expression (i.e., the objects and the relationships among them), but also to compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot accurately extract multi-order relationships from referring expressions, and the contexts it obtains deviate from the contexts described by the expressions. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) that adaptively highlights objects and relationships related to a given expression via a cross-modal attention mechanism, and represents the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) that computes multimodal semantic contexts by fusing information from different modalities and propagating multimodal information through the structured relation graph. Experiments on common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of the CMRE and the GGCN, outperforms all existing state-of-the-art methods.
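To make the CMRE idea concrete, below is a minimal sketch of cross-modal gating: each candidate object and each pairwise relationship feature is scored against a pooled expression embedding, and the resulting soft weights define the nodes and edges of a language-guided relation graph. All module names, tensor shapes, and the exact fusion (tanh of projected sums followed by a sigmoid score) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal gating in the spirit of CMRE.
import torch
import torch.nn as nn


class CrossModalGate(nn.Module):
    def __init__(self, visual_dim: int, lang_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, vis_feats: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feats: (num_items, visual_dim) object or relationship features
        # lang_feat: (lang_dim,) pooled expression embedding
        fused = torch.tanh(self.vis_proj(vis_feats) + self.lang_proj(lang_feat))
        # Sigmoid gate in [0, 1]: relevance of each item to the expression.
        return torch.sigmoid(self.score(fused)).squeeze(-1)


# Toy usage: 5 detected objects and their 5*4 directed pairwise relationships.
num_objs, vdim, ldim = 5, 2048, 512
obj_feats = torch.randn(num_objs, vdim)
rel_feats = torch.randn(num_objs * (num_objs - 1), vdim)
expr_feat = torch.randn(ldim)

gate = CrossModalGate(vdim, ldim)
obj_weights = gate(obj_feats, expr_feat)   # node weights of the relation graph
rel_weights = gate(rel_feats, expr_feat)   # edge weights of the relation graph
print(obj_weights.shape, rel_weights.shape)  # torch.Size([5]) torch.Size([20])
```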

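The GGCN can then be read as gated message passing over this graph: edge gates from the cross-modal step modulate how much each neighbor contributes, and stacking layers accumulates multi-order context. The update rule below is a generic gated graph-convolution layer assumed for illustration; the paper's exact fusion and propagation equations may differ.

```python
# Hypothetical sketch of gated message passing in the spirit of GGCN.
import torch
import torch.nn as nn


class GatedGraphConvLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats: torch.Tensor, edge_gates: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, dim); edge_gates: (N, N) soft adjacency in [0, 1]
        # with edge_gates[i, j] = relevance of the edge j -> i.
        messages = self.msg(node_feats)
        # Normalize incoming gates so densely connected nodes don't dominate.
        norm = edge_gates / edge_gates.sum(dim=1, keepdim=True).clamp(min=1e-6)
        agg = norm @ messages  # gated sum over each node's neighborhood
        return torch.relu(self.update(torch.cat([node_feats, agg], dim=-1)))


# Toy usage: propagate twice over 5 nodes to build multi-order context.
N, dim = 5, 256
feats = torch.randn(N, dim)
gates = torch.rand(N, N)  # e.g., edge weights from the cross-modal gate
layer = GatedGraphConvLayer(dim)
for _ in range(2):  # stacked layers capture multi-order relationships
    feats = layer(feats, gates)
print(feats.shape)  # torch.Size([5, 256])
```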

URL

https://arxiv.org/abs/1906.04464

PDF

https://arxiv.org/pdf/1906.04464.pdf
