Context-Dependent Diffusion Network for Visual Relationship Detection

2018-09-11 02:13:45
Zhen Cui, Chunyan Xu, Wenming Zheng, Jian Yang

Abstract

Visual relationship detection can bridge the gap between computer vision and natural language for scene understanding of images. Unlike pure object recognition tasks, subject-predicate-object relation triplets, such as person-behind-person and car-behind-building, lie in an extremely diverse space and suffer from combinatorial explosion. In this paper, we propose a context-dependent diffusion network (CDDN) framework for visual relationship detection. To capture the interactions among different object instances, two types of graphs, a word semantic graph and a visual scene graph, are constructed to encode global context interdependency. The semantic graph is built from language priors to model semantic correlations across objects, while the visual scene graph defines the connections between scene objects so as to exploit surrounding scene information. For this graph-structured data, we design a diffusion network that adaptively aggregates information from context; it effectively learns latent representations of visual relationships and, owing to its invariance under graph isomorphism, is well suited to visual relationship detection. Experiments on two widely used datasets demonstrate that our proposed method is effective and achieves state-of-the-art performance.
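The key operation described above, adaptively diffusing context over a graph, can be illustrated with a short sketch. The following PyTorch snippet is a minimal, hypothetical example, not the authors' released code: the class name `DiffusionLayer`, its gating mechanism, and all hyperparameters are assumptions. The adjacency matrix `adj` would come from word-embedding similarities for the semantic graph, or from spatial connections between detected objects for the visual scene graph.

```python
# Minimal sketch of a context-diffusion layer (PyTorch). The layer names,
# the gating mechanism, and the hyperparameters are illustrative assumptions;
# they are not taken from the paper.
import torch
import torch.nn as nn


class DiffusionLayer(nn.Module):
    """One diffusion step: each node mixes its own features with context
    aggregated from its graph neighbors, weighted by a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # transform the aggregated context
        self.gate = nn.Linear(2 * dim, dim)    # decide how much context to absorb

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, dim) node features, e.g. word embeddings or ROI features
        # adj: (num_nodes, num_nodes) row-normalized adjacency of the
        #      semantic graph or the visual scene graph
        context = self.transform(adj @ x)                        # aggregate neighbors
        g = torch.sigmoid(self.gate(torch.cat([x, context], dim=-1)))
        return g * context + (1.0 - g) * x                       # gated update


# Toy usage: 5 object nodes with 64-d features on a random graph.
if __name__ == "__main__":
    n, d = 5, 64
    x = torch.randn(n, d)
    a = torch.rand(n, n)
    adj = a / a.sum(dim=1, keepdim=True)  # row-normalize so each row sums to 1
    layer = DiffusionLayer(d)
    out = layer(x, adj)
    print(out.shape)  # torch.Size([5, 64])
```

Stacking a few such layers over both graphs and fusing the resulting node representations approximates the kind of global context aggregation the abstract describes.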

URL

https://arxiv.org/abs/1809.06213

PDF

https://arxiv.org/pdf/1809.06213.pdf

