Paper Reading AI Learner

Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data

2023-01-26 15:25:43
Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon

Abstract

We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data upon standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in challenging scenarios including out-of-task data (i.e., relational captioning, where the target task is different from the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a favorable global optimal property. Our extensive and comprehensive empirical results both on (1) image-based and (2) dense region-based captioning datasets, followed by comprehensive analysis on the scarcely-paired COCO dataset, demonstrate the consistent effectiveness of our semi-supervised learning method with unpaired data compared to competing methods.
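As a rough illustration of the pseudo-labeling step the abstract describes, the toy sketch below pairs unlabeled images with their best-matching captions and keeps only pairs a "discriminator" accepts. This is a hypothetical simplification, not the paper's actual model: features are plain 2-D vectors, the captioner is nearest-caption retrieval, and a cosine-similarity threshold stands in for the learned discriminator over the joint image-caption distribution.

```python
# Toy sketch of pseudo-labeling unpaired data (hypothetical simplification:
# a similarity threshold plays the role of the adversarially trained
# discriminator that judges whether an (image, caption) pair looks real).

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def pseudo_label(unpaired_images, unpaired_captions, threshold=0.9):
    """Pair each unlabeled image with its best-matching caption,
    keeping only pairs the 'discriminator' (similarity test) accepts."""
    pairs = []
    for img in unpaired_images:
        best = max(unpaired_captions, key=lambda c: cosine(img, c))
        if cosine(img, best) >= threshold:  # discriminator accepts the pair
            pairs.append((img, best))
    return pairs

# Toy unpaired image features and caption features.
images = [(1.0, 0.1), (0.1, 1.0), (0.7, 0.7)]
captions = [(1.0, 0.0), (0.0, 1.0)]

accepted = pseudo_label(images, captions)
print(len(accepted))  # the ambiguous (0.7, 0.7) image is rejected
```

In the actual method, accepted pseudo-pairs would be added to the paired training set so the captioner progressively associates the unpaired data, as the abstract states.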

URL

https://arxiv.org/abs/2301.11174

PDF

https://arxiv.org/pdf/2301.11174.pdf

