Improving Visual Relationship Detection using Semantic Modeling of Scene Descriptions

2018-09-01 15:11:12
Stephan Baier, Yunpu Ma, Volker Tresp

Abstract

Structured scene descriptions of images are useful for the automatic processing and querying of large image databases. We show how the combination of a semantic and a visual statistical model can improve on the task of mapping images to their associated scene descriptions. In this paper we consider scene descriptions that are represented as sets of triples (subject, predicate, object), where each triple consists of a pair of visual objects appearing in the image and the relationship between them (e.g., man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can generalize to triples that have never been observed in the training data. Our experimental results on the recently published Stanford Visual Relationship dataset, a challenging real-world dataset, show that integrating a semantic model via link prediction methods can significantly improve the results for visual relationship detection. Our combined approach achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.
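To make the described pipeline concrete, below is a minimal sketch of how a latent-variable link-prediction score (here DistMult, one representative of the family of methods the paper compares) could be fused with a visual model's detection confidence to rank candidate (subject, predicate, object) triples. The toy vocabularies, the sigmoid calibration, and the multiplicative fusion rule are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

# Toy vocabularies; the actual model is trained on the object and
# predicate sets of the Stanford Visual Relationship dataset.
entities = ["man", "elephant", "hat"]
predicates = ["riding", "wearing"]

rng = np.random.default_rng(0)
dim = 16  # latent embedding dimension (illustrative choice)

# Latent embeddings for entities and predicates, as in DistMult-style
# link prediction (one of several methods the paper evaluates).
E = {e: rng.normal(size=dim) for e in entities}
W = {p: rng.normal(size=dim) for p in predicates}

def semantic_score(s, p, o):
    """DistMult trilinear score: sum_k E[s][k] * W[p][k] * E[o][k]."""
    return float(np.sum(E[s] * W[p] * E[o]))

def combined_score(s, p, o, visual_conf):
    """Fuse semantic plausibility with the CNN detector's confidence.
    The multiplicative fusion here is an assumption for illustration."""
    sem_prob = 1.0 / (1.0 + np.exp(-semantic_score(s, p, o)))  # sigmoid
    return visual_conf * sem_prob

# Rank candidate triples for a detected object pair; the visual
# confidences are placeholders standing in for CNN outputs.
candidates = [("man", "riding", "elephant"), ("man", "wearing", "hat")]
visual_confs = {candidates[0]: 0.7, candidates[1]: 0.6}
ranked = sorted(candidates, key=lambda c: -combined_score(*c, visual_confs[c]))
print(ranked)
```

Because the semantic score depends only on the embeddings, such a model can assign a plausibility to a triple even if that exact triple never occurred during training, which is the generalization property highlighted in the abstract.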

URL

https://arxiv.org/abs/1809.00204

PDF

https://arxiv.org/pdf/1809.00204.pdf

