Paper Reading AI Learner

Attend to You: Personalized Image Captioning with Context Sequence Memory Networks

2017-04-25 23:30:43
Cesc Chunseong Park, Byeongchang Kim, Gunhee Kim

Abstract

We address the personalization of image captioning, which has not yet been discussed in previous research. For a query image, we aim to generate a descriptive sentence that accounts for prior knowledge, such as the user's active vocabulary in previous documents. As applications of personalized image captioning, we tackle two post-automation tasks, hashtag prediction and post generation, on our newly collected Instagram dataset of 1.1M posts from 6.3K users. We propose a novel captioning model named the Context Sequence Memory Network (CSMN). Its unique updates over previous memory network models include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words to memory to capture long-term information without suffering from the vanishing-gradient problem, and (iii) adopting a CNN memory structure that jointly represents nearby ordered memory slots for better context understanding. Through quantitative evaluation and user studies on Amazon Mechanical Turk, we show the effectiveness of these three novel features of CSMN and its performance gains over state-of-the-art captioning models on personalized image captioning.
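The three memory updates named in the abstract can be illustrated with a toy sketch: an ordered memory that stores multiple context types, appends each newly generated word, and reads by convolving adjacent slots. Everything below (class name, dimensions, the averaging filter) is a made-up illustration of the idea, not the authors' trained model.

```python
import numpy as np

class ContextSequenceMemory:
    """Toy sketch of the CSMN memory ideas (illustrative only):
    - slots hold multiple context types (image features, user's active
      vocabulary, previously generated words),
    - generated words are appended to memory at each decoding step,
    - a 1-D convolution over adjacent slots jointly represents nearby
      ordered memories. The real model learns these filters end-to-end."""

    def __init__(self, dim=8, kernel_size=3):
        self.dim = dim
        self.kernel_size = kernel_size
        # hypothetical fixed filter that averages nearby slots
        self.filter = np.full(kernel_size, 1.0 / kernel_size)
        self.slots = []  # ordered memory slots

    def write(self, vector):
        """Append one context vector (image, user-vocabulary, or generated word)."""
        self.slots.append(np.asarray(vector, dtype=float))

    def read(self):
        """Convolve each feature dimension over adjacent slots (zero padding),
        so each output slot summarizes its neighborhood."""
        m = np.stack(self.slots)                      # (n_slots, dim)
        pad = self.kernel_size // 2
        padded = np.pad(m, ((pad, pad), (0, 0)))
        out = np.empty_like(m)
        for i in range(m.shape[0]):
            window = padded[i:i + self.kernel_size]   # (kernel, dim)
            out[i] = self.filter @ window
        return out

rng = np.random.default_rng(0)
mem = ContextSequenceMemory(dim=4)
mem.write(rng.normal(size=4))          # image context
mem.write(rng.normal(size=4))          # user active-vocabulary context
for _ in range(3):                     # append generated-word vectors
    mem.write(rng.normal(size=4))
summary = mem.read()
print(summary.shape)                   # (5, 4): one joint vector per slot
```

Because generated words are written back into memory rather than carried through a recurrent hidden state, long-range context is accessed by direct lookup, which is what lets the model sidestep the vanishing-gradient problem mentioned in point (ii).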


URL

https://arxiv.org/abs/1704.06485

PDF

https://arxiv.org/pdf/1704.06485.pdf

