Paper Reading AI Learner

Joint Image Captioning and Question Answering

2018-05-22 04:41:37
Jialin Wu, Zeyuan Hu, Raymond J. Mooney

Abstract

Answering visual questions requires acquiring everyday commonsense knowledge and modeling the semantic connections among the different parts of an image, which is too difficult for VQA systems to learn from images with answers as the only supervision. Meanwhile, image captioning systems with a beam search strategy tend to generate similar captions and fail to describe images diversely. To address these issues, we present a system in which the two tasks complement each other, jointly producing image captions and answering visual questions. In particular, we use question and image features to generate question-related captions, and feed the generated captions back as additional features that provide new knowledge to the VQA system. For image captioning, our system attains more informative results, measured by the relative improvements on VQA tasks, as well as competitive results on automated metrics. Applied to VQA, our results on the VQA v2 dataset reach 65.8% with generated captions and 69.1% with annotated captions on the validation set, and 68.4% on the test-standard set. Further, an ensemble of 10 models reaches 69.7% on the test-standard split.
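
To make the two-way pipeline concrete, here is a minimal PyTorch sketch of the idea the abstract describes: question and image features condition a caption generator, and the generated caption is re-encoded as an additional feature for the answer classifier. All module choices, dimensions, the greedy decoder, and the elementwise-product fusion are illustrative assumptions, not the paper's actual architecture (the attention modules and the authors' decoding strategy are not reproduced here).

```python
import torch
import torch.nn as nn

class JointCaptionVQA(nn.Module):
    """Caption generator conditioned on question + image, whose output
    caption is fed back as an extra feature for answer prediction.
    Illustrative sketch only, not the paper's exact model."""

    def __init__(self, vocab_size, num_answers,
                 embed_dim=300, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)    # question encoder
        self.img_proj = nn.Linear(img_dim, hidden_dim)                   # image projection
        self.cap_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # caption decoder
        self.cap_out = nn.Linear(hidden_dim, vocab_size)
        self.cap_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # caption re-encoder
        self.vqa_head = nn.Linear(hidden_dim, num_answers)              # answer classifier

    def forward(self, img_feat, question, max_cap_len=16):
        # img_feat: (B, img_dim); question: (B, Lq) token ids.
        _, q_h = self.q_rnn(self.embed(question))       # (1, B, H)
        v = self.img_proj(img_feat).unsqueeze(0)        # (1, B, H)
        h = q_h * v                                     # fused decoder init state

        # Greedily decode a question-conditioned caption (a placeholder
        # for whatever decoding strategy the paper actually uses).
        tok = question.new_zeros(question.size(0), 1)   # assume <bos> id 0
        cap_tokens = []
        for _ in range(max_cap_len):
            out, h = self.cap_rnn(self.embed(tok), h)
            tok = self.cap_out(out).argmax(-1)          # (B, 1)
            cap_tokens.append(tok)
        caption = torch.cat(cap_tokens, dim=1)          # (B, max_cap_len)

        # Re-encode the generated caption as an additional VQA feature.
        _, c_h = self.cap_enc(self.embed(caption))
        answer_logits = self.vqa_head((q_h * v * c_h).squeeze(0))
        return caption, answer_logits

# Toy usage with random inputs (3129 is the usual VQA v2 answer vocabulary).
model = JointCaptionVQA(vocab_size=1000, num_answers=3129)
caption, logits = model(torch.randn(2, 2048), torch.randint(0, 1000, (2, 10)))
print(caption.shape, logits.shape)  # torch.Size([2, 16]) torch.Size([2, 3129])
```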

URL

https://arxiv.org/abs/1805.08389

PDF

https://arxiv.org/pdf/1805.08389.pdf

