Abstract
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by the use of hard negatives in structured prediction, and by ranking loss functions used in retrieval, we introduce a simple change to common loss functions used to learn multi-modal embeddings. This change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, dubbed VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (based on R@1).
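The "simple change" the abstract refers to is VSE++'s max-of-hinges (MH) ranking loss: rather than summing hinge costs over all in-batch negatives, it penalizes only the hardest negative for each query. Below is a minimal PyTorch-style sketch of that loss, not the authors' released code; the function name vsepp_loss, the margin default, and the assumption of L2-normalized embeddings are illustrative choices.

```python
import torch
import torch.nn.functional as F


def vsepp_loss(im, cap, margin=0.2):
    """im, cap: (batch, dim) L2-normalized image and caption embeddings,
    where im[i] and cap[i] form a positive pair."""
    # Cosine similarity matrix: scores[i, j] = sim(image i, caption j).
    scores = im @ cap.t()
    pos = scores.diag().view(-1, 1)

    # Hinge costs against every in-batch negative.
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> caption
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # caption -> image

    # Zero out the positive pairs on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    # The VSE++ change: keep only the hardest negative per query
    # (max over negatives) instead of summing over all of them.
    return cost_cap.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()


# Toy usage with random embeddings (batch of 128, 1024-dim).
im = F.normalize(torch.randn(128, 1024), dim=1)
cap = F.normalize(torch.randn(128, 1024), dim=1)
print(vsepp_loss(im, cap))
```

Taking the max rather than the sum focuses the gradient on the single most violating negative, which the paper argues drives the reported R@1 gains.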
URL
https://arxiv.org/abs/1707.05612