Abstract
Visual relationship detection can bridge the gap between computer vision and natural language for scene understanding of images. Unlike pure object recognition, relation triplets of the form subject-predicate-object span an extremely diverse space, such as \textit{person-behind-person} and \textit{car-behind-building}, and suffer from combinatorial explosion. In this paper, we propose a context-dependent diffusion network (CDDN) framework for visual relationship detection. To capture the interactions among object instances, we construct two types of graphs, a word semantic graph and a visual scene graph, to encode global context interdependency. The semantic graph is built from language priors to model semantic correlations across objects, while the visual scene graph defines the connections among scene objects so as to exploit the surrounding scene information. For this graph-structured data, we design a diffusion network that adaptively aggregates information from contexts; it can effectively learn latent representations of visual relationships and is well suited to visual relationship detection because it is invariant to graph isomorphism. Experiments on two widely used datasets demonstrate that the proposed method is effective and achieves state-of-the-art performance.
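The idea of aggregating context over a graph by diffusion can be illustrated with a small sketch. The abstract does not give the CDDN formulation, so everything here is an assumption made for illustration: the function names (`row_normalize`, `matmul`, `diffuse`), the use of a row-normalized adjacency matrix as the transition matrix, and the fixed `hop_weights`, which a trained network would presumably learn rather than receive as constants.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(adj: Matrix) -> Matrix:
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    out = []
    for row in adj:
        total = sum(row)
        # Isolated nodes keep an all-zero row (nothing to diffuse from).
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def diffuse(adj: Matrix, feats: Matrix, hop_weights: List[float]) -> Matrix:
    """Return sum_k hop_weights[k] * P^k @ feats, with P = row_normalize(adj).

    hop_weights[0] weights a node's own features, hop_weights[1] its
    one-hop context, and so on. The weights are fixed here (hypothetical);
    in a learned model they would be trainable parameters.
    """
    p = row_normalize(adj)
    current = feats  # P^0 @ feats
    out = [[0.0] * len(feats[0]) for _ in feats]
    for w in hop_weights:
        for i, row in enumerate(current):
            for j, v in enumerate(row):
                out[i][j] += w * v
        current = matmul(p, current)  # advance one more hop
    return out


# Toy graph: a path of three object nodes, 0 - 1 - 2.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0], [0.0], [0.0]]
result = diffuse(adjacency, features, [0.5, 0.5])
print(result)
```

Relabeling the nodes of the graph only reorders the rows of the output, which is a simple instance of the invariance to graph isomorphism that the abstract mentions.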
URL
https://arxiv.org/abs/1809.06213