Abstract
We address the personalization of image captioning, which has not yet been discussed in previous research. For a query image, we aim to generate a descriptive sentence that accounts for prior knowledge such as the user's active vocabulary in previous documents. As applications of personalized image captioning, we tackle two post automation tasks, hashtag prediction and post generation, on our newly collected Instagram dataset consisting of 1.1M posts from 6.3K users. We propose a novel captioning model named Context Sequence Memory Network (CSMN). Its unique updates over previous memory network models include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into memory to capture long-term information without suffering from the vanishing gradient problem, and (iii) adopting a CNN memory structure to jointly represent nearby ordered memory slots for better context understanding. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show the effectiveness of the three novel features of CSMN and its performance gains for personalized image captioning over state-of-the-art captioning models.
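The three ideas above can be summarized in a toy sketch. This is not the authors' implementation; it is a minimal NumPy illustration under assumed shapes and a stand-in mean-filter kernel: the memory holds several context types (image feature, user context words), generated words are appended back into it, and a simple 1-D convolution over adjacent slots precedes the attention read.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative choice, not from the paper)

class ContextMemory:
    """Toy context sequence memory: a repository of multiple context types."""

    def __init__(self, image_feat, user_words):
        # (i) memory stores several kinds of context side by side:
        # one image feature vector plus the user's active-vocabulary embeddings
        self.slots = [image_feat] + list(user_words)

    def append(self, word_vec):
        # (ii) each generated word is written back into memory, so long-range
        # history is read by attention rather than carried through a recurrence
        self.slots.append(word_vec)

    def conv_read(self, query, width=3):
        # (iii) a 1-D convolution jointly represents nearby ordered slots
        M = np.stack(self.slots)                        # (T, D)
        k = np.ones((width, 1)) / width                 # stand-in mean kernel
        pad = width // 2
        Mp = np.pad(M, ((pad, pad), (0, 0)), mode="edge")
        conv = np.stack([(Mp[t:t + width] * k).sum(0)
                         for t in range(len(self.slots))])
        att = np.exp(conv @ query)                      # soft attention weights
        att /= att.sum()
        return att @ conv                               # attended memory readout

# Usage: one image feature, three user-context words, one generated word
mem = ContextMemory(rng.normal(size=D), rng.normal(size=(3, D)))
mem.append(rng.normal(size=D))
readout = mem.conv_read(rng.normal(size=D))
```

The readout vector would feed the word decoder at each step; here it only demonstrates how appended words immediately become attendable context.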
URL
https://arxiv.org/abs/1704.06485