Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition

2018-07-13 09:20:39
Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, Jing Shao, Chen Change Loy

Abstract

Recognizing visual relationships <subject-predicate-object> among any pair of localized objects is pivotal for image understanding. Previous studies have shown remarkable progress in exploiting linguistic priors or external textual information to improve the performance. In this work, we investigate an orthogonal perspective based on feature interactions. We show that by encouraging deep message propagation and interactions between local object features and global predicate features, one can achieve compelling performance in recognizing complex relationships without using any linguistic priors. To this end, we present two new pooling cells to encourage feature interactions: (i) Contrastive ROI Pooling Cell, which has a unique deROI pooling that inversely pools local object features to the corresponding area of global predicate features. (ii) Pyramid ROI Pooling Cell, which broadcasts global predicate features to reinforce local object features. The two cells constitute a Spatiality-Context-Appearance Module (SCA-M), which can be further stacked consecutively to form our final Zoom-Net. We further shed light on how one could resolve ambiguous and noisy object and predicate annotations by Intra-Hierarchical trees (IH-tree). Extensive experiments conducted on the Visual Genome dataset demonstrate the effectiveness of our feature-oriented approach compared to state-of-the-art methods (Acc@1 11.42% from 8.16%) that depend on explicit modeling of linguistic interactions. We further show that SCA-M can be incorporated seamlessly into existing approaches to improve the performance by a large margin. The source code will be released on https://github.com/gjyin91/ZoomNet.
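
To make the deROI pooling idea more concrete, here is a minimal pure-Python sketch of one possible reading of "inversely pools local object features to the corresponding area of global predicate features". It is an illustration under stated assumptions, not the authors' implementation: the function name `de_roi_pool`, the single-channel grid representation, the zero-filled canvas, and the nearest-neighbour resizing are all hypothetical choices made for brevity.

```python
from typing import List, Tuple

Grid = List[List[float]]


def de_roi_pool(obj_feat: Grid, roi: Tuple[int, int, int, int],
                out_h: int, out_w: int) -> Grid:
    """Paste a pooled object feature grid back into its ROI on a zero canvas.

    Assumption: roi is (x0, y0, x1, y1) in canvas cells, with x1 and y1
    exclusive, expressed in the coordinates of the predicate feature map.
    """
    x0, y0, x1, y1 = roi
    if not (0 <= x0 < x1 <= out_w and 0 <= y0 < y1 <= out_h):
        raise ValueError("ROI must be non-empty and lie inside the canvas")

    src_h, src_w = len(obj_feat), len(obj_feat[0])
    roi_h, roi_w = y1 - y0, x1 - x0

    # Everything outside the ROI stays zero (an assumption of this sketch).
    canvas = [[0.0] * out_w for _ in range(out_h)]

    for dy in range(roi_h):
        sy = dy * src_h // roi_h  # nearest-neighbour source row
        for dx in range(roi_w):
            sx = dx * src_w // roi_w  # nearest-neighbour source column
            canvas[y0 + dy][x0 + dx] = obj_feat[sy][sx]
    return canvas


if __name__ == "__main__":
    pooled = [[1.0, 2.0],
              [3.0, 4.0]]
    for row in de_roi_pool(pooled, roi=(1, 1, 5, 3), out_h=4, out_w=6):
        print(row)
```

In this toy setting, a 2x2 pooled object feature is stretched over a 4-wide, 2-tall region of a 4x6 canvas, so the object feature becomes spatially aligned with a predicate-sized map and could then be combined with it element-wise. How the actual cells combine the aligned features is described in the paper, not in this sketch.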

URL

https://arxiv.org/abs/1807.04979

PDF

https://arxiv.org/pdf/1807.04979.pdf

