Abstract
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. Distinctiveness is closely related to caption quality, as distinctive captions are more likely to describe images by their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method encourages distinctiveness while maintaining the overall quality of the generated captions. We tested our method on two challenging datasets, where it improves the baseline model by significant margins. Our studies also show that the proposed method is generic and can be applied to models with various structures.
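The abstract describes constraints formulated on top of a frozen reference model. As a rough illustration of this idea, the sketch below contrasts the log-probabilities that a target model and a reference model assign to matched versus mismatched (image, caption) pairs; the function names, the sigmoid squashing, and all numbers are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def contrastive_loss(log_p_target_pos, log_p_ref_pos,
                     log_p_target_neg, log_p_ref_neg):
    """Illustrative contrastive objective: push the target model to score
    matched (image, caption) pairs higher than the reference model does,
    and mismatched pairs lower, via sigmoid-squashed log-prob differences.
    This is a sketch of the general idea, not the paper's loss."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    # Matched pairs should gain probability relative to the reference.
    pos_gain = sigmoid(log_p_target_pos - log_p_ref_pos)
    # Mismatched pairs should lose probability relative to the reference.
    neg_drop = sigmoid(log_p_ref_neg - log_p_target_neg)
    # Maximize both terms -> minimize the negative log-likelihood.
    return -(np.log(pos_gain).mean() + np.log(neg_drop).mean())

# Toy example with made-up log-probabilities for two caption pairs.
lp_t_pos = np.array([-4.0, -3.5])   # target model, matched pairs
lp_r_pos = np.array([-4.5, -4.0])   # reference model, matched pairs
lp_t_neg = np.array([-9.0, -8.0])   # target model, mismatched pairs
lp_r_neg = np.array([-7.0, -7.5])   # reference model, mismatched pairs
loss = contrastive_loss(lp_t_pos, lp_r_pos, lp_t_neg, lp_r_neg)
print(loss)
```

Because the reference model anchors both terms, the target model cannot trivially lower its scores everywhere: it must redistribute probability toward captions that are distinctive for the matched image.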
URL
https://arxiv.org/abs/1710.02534