Abstract
Exploiting relationships among objects has achieved remarkable progress in interpreting images or videos with natural language. Most existing methods first detect objects and their relationships, and then generate textual descriptions; this pipeline depends heavily on pre-trained detectors and suffers performance drops under heavy occlusion, tiny objects, and long-tailed distributions in object detection. In addition, separating detection from captioning causes semantic inconsistency between the pre-defined object/relation categories and the target lexical words. We instead exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors, achieving semantic coherency within an image or video during captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. In particular, we present a joint reasoning method that incorporates 1) commonsense reasoning, which embeds image or video regions into a semantic space to build the semantic graph, and 2) relational reasoning, which encodes the semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.
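To make the second component concrete, below is a minimal, hypothetical sketch of relational reasoning over a semantic graph: node features (region embeddings in semantic space) are propagated along edges supplied by a commonsense knowledge graph using a single graph-convolution step, then pooled into a context vector for a sentence decoder. All shapes, the adjacency matrix, and the random weights are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, dim = 5, 8                        # assumed: 5 semantic-graph nodes
X = rng.standard_normal((num_nodes, dim))    # region embeddings in semantic space

# Adjacency from prior commonsense knowledge: edge (i, j) = 1 if the
# knowledge graph asserts a plausible relation between objects i and j.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

A_hat = A + np.eye(num_nodes)                           # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # degree normalization
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt                # symmetric normalization

W = rng.standard_normal((dim, dim))          # weight matrix (random stand-in
                                             # for a learned parameter)
H = np.maximum(A_norm @ X @ W, 0.0)          # one graph-convolution layer + ReLU

context = H.mean(axis=0)                     # pooled graph context for the decoder
print(context.shape)                         # (8,)
```

In a full model the pooled `context` (or the per-node features `H`) would condition an attention-based language decoder; the single layer here only illustrates how encoding the semantic graph mixes information between commonsense-related objects before sentence generation.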
URL
https://arxiv.org/abs/1906.01290