Paper Reading AI Learner

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

2023-12-21 00:44:45
Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

Abstract

In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich, informative answers in each round, including answers that draw on external knowledge related to the visual content. Unlike existing datasets, where answers are compact and short, InfoVisDial contains long, free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3): GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. We then ask human annotators to rate the generated dialogues and filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene text, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique, with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Finally, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. We hope our work can motivate more effort in this direction.
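The bridging idea described in the abstract — feeding a captioning model's image description into a language model prompt to elicit dialogue rounds — can be sketched roughly as follows. The prompt template, function name, and example caption here are illustrative assumptions, not the paper's actual prompt format.

```python
# Illustrative sketch of the GIT + GPT-3 bridging pipeline described in the
# abstract. The prompt template below is a hypothetical format, not the
# paper's actual template.

def build_dialogue_prompt(caption, history):
    """Turn an image caption (e.g., produced by GIT) and prior Q/A rounds
    into a text prompt for a language model such as GPT-3, which would then
    continue the dialogue with the next question and answer."""
    lines = [
        "Image description: " + caption,
        "Generate an informative question and a detailed, free-form answer "
        "about the image, using external knowledge when helpful.",
    ]
    for i, (q, a) in enumerate(history, start=1):
        lines.append(f"Q{i}: {q}")
        lines.append(f"A{i}: {a}")
    lines.append(f"Q{len(history) + 1}:")  # the model continues from here
    return "\n".join(lines)

# Example: one prior round of dialogue about a storefront image whose
# caption includes scene text (the kind of content GIT can read).
caption = "a red storefront with a sign reading 'Joe's Coffee'"
history = [("What does the sign say?",
            "The sign reads 'Joe's Coffee', suggesting this is a coffee shop.")]
prompt = build_dialogue_prompt(caption, history)
```

In the automatic pipeline, the returned prompt would be sent to the language model, the generated round appended to `history`, and the loop repeated; human annotators then rate the resulting dialogues to filter out low-quality ones.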

Abstract (translated)

In this paper, we build a visual dialogue dataset named InfoVisDial, which provides rich, informative answers in each round, including answers that draw on external knowledge related to the visual content. Unlike existing datasets, InfoVisDial contains long, free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3). GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. We then ask human annotators to rate the generated dialogues to filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: 54.4% of the dialogue rounds are related to image scene text, and 36.7% require external knowledge. Each round's answer is also long and open-ended: 87.3% of answers are unique, with an average length of 8.9, compared with 27.37% and 2.9 in VisDial. Finally, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. We hope our work can motivate more effort in this direction.

URL

https://arxiv.org/abs/2312.13503

PDF

https://arxiv.org/pdf/2312.13503.pdf
