Paper Reading AI Learner

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

2023-12-08 02:08:00
Xuan Wang, Guanhong Wang, Wenhao Chai, Jiayu Zhou, Gaoang Wang

Abstract

Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning addresses this by incorporating user prior knowledge, such as writing styles and preferred vocabularies, into the model. Most existing methods emphasize the user-context fusion process via memory networks or transformers. However, these methods ignore the distinct domains of each dataset, and must therefore update all caption-model parameters when encountering new samples, which is time-consuming and computation-intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to account for personality factors. Our framework also adopts the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract visual features from an image and align the semantic spaces with a query-guided mapping network. A transformer layer then fuses the visual features with the user's contextual prior knowledge to generate informative prefixes. We use GPT-2 as the frozen large language model, so only a small number of parameters need to be trained, making the model efficient and effective. Our model outperforms existing baselines on the Instagram and YFCC100M datasets across five evaluation metrics, including roughly twofold improvements in BLEU-4 and CIDEr.
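The pipeline the abstract describes — a CLIP feature mapped into prefix embeddings, fused with user-context embeddings, and prepended to the input of a frozen GPT-2 — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, the toy `linear`/`mapping_network` functions, and the random inputs are all hypothetical stand-ins (a real system would use CLIP/GPT-2 embedding sizes and a learned transformer-based mapping network).

```python
import random

random.seed(0)

# Hypothetical toy dimensions (real models: e.g. 512-d CLIP, 768-d GPT-2).
D_CLIP, D_LM = 8, 6
PREFIX_LEN, USER_LEN = 4, 2

def linear(x, W):
    """Multiply vector x (length m) by matrix W (m x n) -> vector of length n."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def mapping_network(clip_feat, W):
    """Toy stand-in for the query-guided mapping network: projects one
    CLIP image feature into PREFIX_LEN embeddings in the LM's space."""
    flat = linear(clip_feat, W)  # D_CLIP -> PREFIX_LEN * D_LM
    return [flat[i * D_LM:(i + 1) * D_LM] for i in range(PREFIX_LEN)]

# Hypothetical inputs: one CLIP feature, a random projection matrix,
# and pre-embedded user-context tokens (writing style, vocabulary priors).
clip_feat = [random.random() for _ in range(D_CLIP)]
W = [[random.random() for _ in range(PREFIX_LEN * D_LM)] for _ in range(D_CLIP)]
user_context = [[random.random() for _ in range(D_LM)] for _ in range(USER_LEN)]

visual_prefix = mapping_network(clip_feat, W)
# The fused prefix is what a frozen GPT-2 would consume ahead of the
# caption tokens; only the mapping/fusion parameters are trained.
prefix = user_context + visual_prefix
print(len(prefix), len(prefix[0]))  # (USER_LEN + PREFIX_LEN) tokens of dim D_LM
```

The key design point prefix-tuning exploits is that the language model itself stays frozen: gradients flow only into the small mapping and fusion modules, which is why adapting to a new dataset avoids updating the full caption model.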


URL

https://arxiv.org/abs/2312.04793

PDF

https://arxiv.org/pdf/2312.04793.pdf

