Abstract
An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned with the task of visual question answering by forcing both to use the same word-region embeddings. We show that this leads to greater inductive transfer from recognition to VQA than standard multi-task learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
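To make the core mechanism concrete, below is a minimal sketch of what a shared word-region embedding might look like: region features and word embeddings are projected into one space, and both a recognition head and a VQA-style attention head score words against regions in that space. This is an illustration under assumed shapes and names (e.g., 2048-dim detector region features, a 300-dim embedding, the SharedWordRegionEmbedding class), not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedWordRegionEmbedding(nn.Module):
    """Projects image-region features and words into one shared space."""
    def __init__(self, region_dim=2048, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # word side
        self.region_proj = nn.Linear(region_dim, embed_dim)     # region side

    def score(self, region_feats, word_ids):
        # region_feats: (batch, n_regions, region_dim); word_ids: (batch, n_words)
        regions = F.normalize(self.region_proj(region_feats), dim=-1)
        words = F.normalize(self.word_embed(word_ids), dim=-1)
        # cosine similarity between every word and every region
        return torch.einsum('bwd,brd->bwr', words, regions)

shared = SharedWordRegionEmbedding()
regions = torch.randn(2, 36, 2048)               # e.g. 36 region features per image

# Recognition: score a category-name word against regions, pool over regions.
category_ids = torch.randint(0, 10000, (2, 1))
recog_logits = shared.score(regions, category_ids).max(dim=-1).values

# VQA-style grounding: the SAME embedding attends question words to regions.
question_ids = torch.randint(0, 10000, (2, 8))   # an 8-token question
attention = shared.score(regions, question_ids).softmax(dim=-1)
attended = torch.einsum('bwr,brd->bwd', attention, regions)  # (2, 8, 2048)

Because both heads read from the same word-region similarity table, gradients from either task update a single shared space, which is the kind of coupling that could let recognition labels inform VQA and vice versa.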
URL
https://arxiv.org/abs/1704.00260