Abstract
This report presents our solution to Topic 1, Zero-shot Image Captioning, of the 2024 NICE challenge: New frontiers for zero-shot Image Captioning Evaluation. In contrast to the NICE 2023 datasets, this challenge involves new human-written annotations that differ significantly in caption style and content. We therefore enhance image captions through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to bridge the gap in text styles. At the model level, we employ OFA, a large-scale vision-language pre-training model based on handcrafted templates, to perform the image captioning task. We then propose a caption-level grading strategy for the high-quality captions generated by the image captioning models and integrate it, together with a retrieval augmentation strategy, into the template, compelling the model to generate higher-quality, better-matched, and semantically richer captions conditioned on the retrieval-augmented prompts. Our approach ranks first on the leaderboard with a CIDEr score of 234.11, and places first on all other metrics as well.
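The abstract describes grading retrieved candidate captions and folding the best ones into a handcrafted prompt template that conditions the captioning model. A minimal sketch of that idea is shown below; the grading heuristic, function names, and template wording are all illustrative assumptions, not the paper's actual implementation or OFA's API.

```python
# Hypothetical sketch of retrieval-augmented prompt construction: retrieved
# captions for similar images are graded, and the top-ranked ones are embedded
# into a handcrafted template. The grading function here is a toy stand-in
# for the paper's caption-grading method.

def grade_caption(caption: str) -> float:
    """Toy grading heuristic: prefer longer, more descriptive captions."""
    return float(len(caption.split()))

def build_prompt(retrieved_captions: list[str], top_k: int = 2) -> str:
    """Keep the top-k graded captions and weave them into a template
    (assumed wording) that prompts the model for a richer caption."""
    ranked = sorted(retrieved_captions, key=grade_caption, reverse=True)
    hints = "; ".join(ranked[:top_k])
    return f"Similar images are described as: {hints}. What does the image describe?"

# Example: the short, low-information captions are graded out of the prompt.
retrieved = [
    "a dog on grass",
    "a golden retriever running across a sunlit park lawn",
    "dog outside",
]
print(build_prompt(retrieved))
```

In this toy run, the two most descriptive retrieved captions survive the grading step and appear in the prompt, while the least informative one is dropped.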
URL
https://arxiv.org/abs/2404.12739