Abstract
Structured scene descriptions of images are useful for the automatic processing and querying of large image databases. We show how combining a semantic model with a visual statistical model improves the task of mapping images to their associated scene descriptions. In this paper, we consider scene descriptions represented as sets of triples (subject, predicate, object), where each triple consists of a pair of visual objects that appear in the image and the relationship between them (e.g., man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can generalize to triples that have never been observed in the training data. Our experimental results on the recently published Stanford Visual Relationship dataset, a challenging real-world dataset, show that integrating a semantic model via link prediction methods can significantly improve visual relationship detection. Our combined approach achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.
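The combination the abstract describes, a visual detector's confidence blended with a semantic link-prediction score over (subject, predicate, object) triples, can be sketched as follows. This is a minimal illustration, not the paper's exact model: it assumes DistMult-style embeddings (one common latent variable link-prediction method), random example vectors, and a simple weighted combination rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary of detected objects and relationships.
entities = ["man", "elephant", "hat"]
predicates = ["riding", "wearing"]

# Random embeddings stand in for learned latent representations.
dim = 8
E = {e: rng.normal(size=dim) for e in entities}      # entity embeddings
R = {p: rng.normal(size=dim) for p in predicates}    # relation embeddings

def distmult_score(s, p, o):
    """DistMult (bilinear-diagonal) triple score: higher = more plausible.
    One of several link-prediction scoring functions that could be used."""
    return float(np.sum(E[s] * R[p] * E[o]))

def combined_score(visual_score, s, p, o, alpha=0.5):
    """Blend the CNN detector's confidence with the semantic score.
    The linear weighting `alpha` is an illustrative assumption."""
    return alpha * visual_score + (1 - alpha) * distmult_score(s, p, o)

# Rank candidate relationships for a detected (man, elephant) object pair.
candidates = [("man", p, "elephant") for p in predicates]
ranked = sorted(candidates, key=lambda t: combined_score(0.9, *t), reverse=True)
print(ranked)
```

Because the semantic score factorizes over entity and relation embeddings, it can assign a plausibility to a triple such as (man, riding, elephant) even if that exact combination never occurred in training, which is the generalization property the abstract highlights.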
URL
https://arxiv.org/abs/1809.00204