Abstract
Grounding referring expressions is a fundamental yet challenging task that facilitates human-machine communication in the physical world. It locates the target object in an image by comprehending the relationships between natural language referring expressions and the image. A feasible solution for grounding referring expressions must not only extract all the necessary information (i.e., objects and the relationships among them) from both the image and the referring expressions, but also compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot accurately extract multi-order relationships from referring expressions, and the contexts it obtains deviate from the contexts described by the referring expressions. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) that uses a cross-modal attention mechanism to adaptively highlight the objects and relationships connected to a given expression, and represents the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) that computes multimodal semantic contexts by fusing information from different modalities and propagating multimodal information through the structured relation graph. Experiments on common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of the CMRE and the GGCN, outperforms all existing state-of-the-art methods.
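To make the GGCN idea concrete, below is a minimal PyTorch sketch of one language-gated propagation step over an object relation graph, with the edge gates computed from the expression embedding standing in for the cross-modal attention of the CMRE. All module names, dimensions, and the exact gating formulation are illustrative assumptions based only on the abstract, not the authors' implementation.

```python
# Minimal sketch of language-gated message passing over a visual relation
# graph. Assumed/hypothetical: module names, feature dimensions, and the
# sigmoid edge-gating form; not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGatedGraphConv(nn.Module):
    """One propagation step: predict a scalar gate per edge from the
    (subject, object, expression) triple, then pass gated messages along
    the relation graph and update node features residually."""
    def __init__(self, node_dim, lang_dim):
        super().__init__()
        self.edge_gate = nn.Linear(2 * node_dim + lang_dim, 1)  # one gate per edge
        self.message = nn.Linear(node_dim, node_dim)            # transform neighbor features

    def forward(self, nodes, edges, lang):
        # nodes: (N, node_dim) visual object features
        # edges: (E, 2) (src, dst) index pairs of the relation graph
        # lang:  (lang_dim,) pooled referring-expression embedding
        src, dst = edges[:, 0], edges[:, 1]
        lang_e = lang.unsqueeze(0).expand(edges.size(0), -1)
        # Gate each edge by how well the (src, dst) pair matches the expression.
        gate = torch.sigmoid(self.edge_gate(
            torch.cat([nodes[src], nodes[dst], lang_e], dim=-1)))  # (E, 1)
        msg = gate * self.message(nodes[src])                      # gated messages
        out = torch.zeros_like(nodes)
        out.index_add_(0, dst, msg)                                # aggregate at destinations
        return F.relu(nodes + out)                                 # residual update

# Toy usage: 4 detected objects, a small relation graph, one expression embedding.
layer = LanguageGatedGraphConv(node_dim=256, lang_dim=512)
nodes = torch.randn(4, 256)
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0]])
lang = torch.randn(512)
updated = layer(nodes, edges, lang)  # (4, 256) context-enhanced object features
```

Stacking several such steps would let context flow along multi-order relationship chains (e.g., "the cup on the table next to the chair"), which matches the abstract's emphasis on multi-order relationships; the target object could then be scored by matching updated node features against the expression embedding.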
URL
https://arxiv.org/abs/1906.04464