Relational Reasoning using Prior Knowledge for Visual Captioning

2019-06-04 09:15:54
Jingyi Hou, Xinxiao Wu, Yayun Qi, Wentian Zhao, Jiebo Luo, Yunde Jia

Abstract

Exploiting relationships among objects has achieved remarkable progress in interpreting images or videos with natural language. Most existing methods first detect objects and their relationships and then generate textual descriptions, which depends heavily on pre-trained detectors and leads to performance drops under heavy occlusion, tiny objects, and long-tail distributions in object detection. In addition, the separation of detection and captioning results in semantic inconsistency between the pre-defined object/relation categories and the target lexical words. We exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors and to achieve semantic coherence within an image or video during captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. Specifically, we present a joint reasoning method that incorporates 1) commonsense reasoning for embedding image or video regions into a semantic space to build a semantic graph, and 2) relational reasoning for encoding the semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.
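The second step of the joint reasoning described above, encoding a semantic graph whose edges come from prior knowledge, can be illustrated with a toy sketch. Everything here (the concept names, features, adjacency, and the mean-aggregation rule) is an illustrative assumption, not the authors' actual model:

```python
# Toy sketch of graph-based relational reasoning (illustrative only,
# not the paper's implementation). Each node holds a feature vector for
# a semantic concept; the adjacency matrix, assumed to come from a prior
# knowledge graph, says which concepts are related; one round of
# mean-aggregation message passing mixes each node's feature with its
# neighbors' features.

def message_pass(features, adjacency):
    """One round of mean-aggregation message passing.

    features:  list of feature vectors (list of lists of floats)
    adjacency: adjacency[i][j] = 1 if concepts i and j are related
    """
    n = len(features)
    dim = len(features[0])
    out = []
    for i in range(n):
        # Neighbors include the node itself (self-loop).
        nbrs = [j for j in range(n) if adjacency[i][j] or j == i]
        agg = [sum(features[j][d] for j in nbrs) / len(nbrs)
               for d in range(dim)]
        out.append(agg)
    return out

# Three toy "semantic" nodes (e.g. person / horse / saddle), with
# assumed knowledge-graph edges person-horse and horse-saddle.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
print(message_pass(feats, adj))
```

After one round, each node's representation reflects its knowledge-graph neighbors, which is the intuition behind encoding the semantic graph before decoding it into a sentence.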

URL

https://arxiv.org/abs/1906.01290

PDF

https://arxiv.org/pdf/1906.01290.pdf
