Paper Reading AI Learner

The Solution for the CVPR2024 NICE Image Captioning Challenge

2024-04-19 09:32:16
Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li, Qingguo Chen, Yang Yang

Abstract

This report presents a solution to Topic 1, Zero-shot Image Captioning, of the 2024 NICE challenge (New frontiers for zero-shot Image Captioning Evaluation). In contrast to the NICE 2023 datasets, this challenge provides new human annotations that differ significantly in caption style and content. We therefore enhance image captions through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to close the gap in text style. At the model level, we employ OFA, a large-scale vision-language pre-trained model based on handcrafted templates, to perform the image captioning task. We then propose a caption-level grading strategy for the high-quality captions generated by the image captioning models and integrate it, together with the retrieval augmentation strategy, into the template, compelling the model to generate higher-quality, better-matched, and semantically richer captions conditioned on the retrieval-augmented prompts. Our approach ranks first on the leaderboard with a CIDEr score of 234.11 and is also first in all other metrics.
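The abstract describes folding retrieved reference captions into a handcrafted prompt template so the model can condition on them. The sketch below illustrates that general idea only; the function names, the template wording, and the token-overlap similarity (standing in for a learned retriever) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of retrieval-augmented prompting: rank reference
# captions by similarity to a query caption, keep the top k, and fold
# them into a handcrafted template. All names here are illustrative.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity; a stand-in for a learned encoder."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank corpus captions by similarity to the query; keep the top k."""
    return sorted(corpus, key=lambda c: jaccard(query, c), reverse=True)[:k]

def build_prompt(retrieved: list[str]) -> str:
    """Insert the retrieved captions into a handcrafted prompt template."""
    hints = "; ".join(retrieved)
    return f"Similar captions: {hints}. What does the image describe?"

corpus = [
    "a brown dog running on the beach",
    "two cats sleeping on a sofa",
    "a dog playing with a ball in the sand",
]
prompt = build_prompt(retrieve_top_k("a dog on a sandy beach", corpus, k=2))
print(prompt)
```

In the actual system, the retriever would operate over image/text embeddings rather than token overlap, and the template would match OFA's handcrafted prompt formats.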

URL

https://arxiv.org/abs/2404.12739

PDF

https://arxiv.org/pdf/2404.12739.pdf

