Abstract
Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning addresses this by incorporating user prior knowledge, such as writing styles and preferred vocabularies, into the model. Most existing methods focus on the user-context fusion process, using memory networks or transformers. However, these methods ignore the distinct domains of each dataset, so they must update all caption-model parameters when encountering new samples, which is time-consuming and computation-intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to account for personality factors. Our framework also adopts the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract the visual features of an image and align the semantic spaces with a query-guided mapping network. Through a transformer layer, we fuse the visual features with the user's contextual prior knowledge to generate informative prefixes. We use GPT-2 as the frozen large language model. With only a small number of trainable parameters, our model is both efficient and effective. It outperforms existing baseline models on the Instagram and YFCC100M datasets across five evaluation metrics, including roughly twofold improvements in BLEU-4 and CIDEr.
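The pipeline the abstract describes can be sketched as follows: learned query tokens attend jointly over projected CLIP image features and user-context embeddings in a transformer, and the resulting query slots become the prefix fed to a frozen GPT-2. This is a minimal PyTorch sketch under assumed dimensions and layer counts (the paper's actual hyperparameters and module names are not given here); `PrefixGenerator` and its arguments are illustrative.

```python
import torch
import torch.nn as nn

class PrefixGenerator(nn.Module):
    """Hypothetical sketch of the prefix pipeline: CLIP visual features
    plus user-context embeddings are fused by a transformer, and the
    learned query slots are returned as the prefix for a frozen LM.
    All dimensions (512 CLIP, 768 GPT-2, 10 prefix tokens) are assumptions."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        # Learned queries that will be "guided" toward the visual/user tokens
        # via self-attention (a stand-in for the query-guided mapping network).
        self.queries = nn.Parameter(torch.randn(prefix_len, lm_dim))
        self.visual_proj = nn.Linear(clip_dim, lm_dim)
        self.user_proj = nn.Linear(clip_dim, lm_dim)  # user context dim assumed
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, clip_feat, user_ctx):
        # clip_feat: (B, clip_dim); user_ctx: (B, T, clip_dim)
        B = clip_feat.size(0)
        vis = self.visual_proj(clip_feat).unsqueeze(1)        # (B, 1, lm_dim)
        usr = self.user_proj(user_ctx)                        # (B, T, lm_dim)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, P, lm_dim)
        # Joint self-attention fuses queries, image, and user context.
        fused = self.fusion(torch.cat([q, vis, usr], dim=1))
        # Keep only the query slots: these are the prefix embeddings that
        # would be prepended to the frozen GPT-2's input embeddings.
        return fused[:, : self.queries.size(0)]

prefix = PrefixGenerator()(torch.randn(2, 512), torch.randn(2, 4, 512))
print(prefix.shape)  # torch.Size([2, 10, 768])
```

Because the language model stays frozen, only the mapping network, projections, and fusion layers above would be trained, which is what keeps the parameter count small.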
URL
https://arxiv.org/abs/2312.04793