Generating Triples with Adversarial Networks for Scene Graph Construction


Abstract

Driven by successes in deep learning, computer vision research has begun to move beyond object detection and image classification to more sophisticated tasks like image captioning or visual question answering. Motivating such endeavors is the desire for models to capture not only objects present in an image, but more fine-grained aspects of a scene such as relationships between objects and their attributes. Scene graphs provide a formal construct for capturing these aspects of an image. Despite this, there have been only a few recent efforts to generate scene graphs from imagery. Previous works limit themselves to settings where bounding box information is available at train time and do not attempt to generate scene graphs with attributes. In this paper we propose a method, based on recent advancements in Generative Adversarial Networks, to overcome these deficiencies. We take the approach of first generating small subgraphs, each describing a single statement about a scene from a specific region of the input image chosen using an attention mechanism. By doing so, our method is able to produce portions of the scene graphs with attribute information without the need for bounding box labels. Then, the complete scene graph is constructed from these subgraphs. We show that our model improves upon prior work in scene graph generation on state-of-the-art data sets and accepted metrics. Further, we demonstrate that our model is capable of handling a larger vocabulary size than prior work has attempted.
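The abstract describes a two-stage pipeline: an attention mechanism selects image regions, each region yields a small subgraph (a subject-predicate-object triple, optionally with attributes), and the full scene graph is assembled from these subgraphs. The following is a minimal, non-learned Python sketch of just the composition stage, to make the data flow concrete; the triple representation and all names (Triple, merge_subgraphs) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: each attended region is assumed to produce one
# (subject, predicate, object) triple plus optional attribute lists, and the
# scene graph is built by merging triples that mention the same entity.

from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Triple:
    """One statement about the scene, e.g. ('man', 'riding', 'horse')."""
    subject: str
    predicate: str
    obj: str
    subject_attrs: List[str] = field(default_factory=list)
    obj_attrs: List[str] = field(default_factory=list)


def merge_subgraphs(triples: List[Triple]):
    """Compose per-region triples into a single scene graph.

    Nodes are entities annotated with the union of their attributes;
    edges are predicate-labelled relationships between entities.
    """
    attrs = defaultdict(set)   # entity -> set of attributes
    edges = set()              # (subject, predicate, object) tuples
    for t in triples:
        attrs[t.subject].update(t.subject_attrs)
        attrs[t.obj].update(t.obj_attrs)
        edges.add((t.subject, t.predicate, t.obj))
    return dict(attrs), edges


if __name__ == "__main__":
    # Two hypothetical per-region outputs; the shared 'horse' node is merged.
    region_outputs = [
        Triple("man", "riding", "horse", subject_attrs=["young"]),
        Triple("horse", "standing on", "grass", subject_attrs=["brown"]),
    ]
    nodes, edges = merge_subgraphs(region_outputs)
    print(nodes)  # {'man': {'young'}, 'horse': {'brown'}, 'grass': set()}
    print(edges)  # {('man', 'riding', 'horse'), ('horse', 'standing on', 'grass')}
```

Because subgraphs are generated independently per region, merging on entity identity is what turns a bag of local statements into one coherent graph; the learned part of the paper's method decides which triples to emit, not how they are stitched together.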

URL

https://arxiv.org/abs/1802.02598

PDF

https://arxiv.org/pdf/1802.02598.pdf

