Paper Reading AI Learner

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

2023-12-21 00:44:45
Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

Abstract

In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich, informative answers in each round, including answers that draw on external knowledge related to the visual content. Unlike existing datasets, where answers are compact and short, InfoVisDial contains long, free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3): GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. We then ask human annotators to rate the generated dialogues and filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene text, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique, with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Finally, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. We hope our work can motivate more effort in this direction.
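The bridging idea described in the abstract — feeding a captioning model's image description into a language model prompt to elicit dialogue rounds — can be sketched roughly as follows. The prompt template, function name, and example caption here are illustrative assumptions, not the paper's actual prompt format.

```python
# Illustrative sketch of the GIT + GPT-3 bridging pipeline described in the
# abstract. The prompt template below is a hypothetical format, not the
# paper's actual template.

def build_dialogue_prompt(caption, history):
    """Turn an image caption (e.g., produced by GIT) and prior Q/A rounds
    into a text prompt for a language model such as GPT-3, which would then
    continue the dialogue with the next question and answer."""
    lines = [
        "Image description: " + caption,
        "Generate an informative question and a detailed, free-form answer "
        "about the image, using external knowledge when helpful.",
    ]
    for i, (q, a) in enumerate(history, start=1):
        lines.append(f"Q{i}: {q}")
        lines.append(f"A{i}: {a}")
    lines.append(f"Q{len(history) + 1}:")  # the model continues from here
    return "\n".join(lines)

# Example: one prior round of dialogue about a storefront image whose
# caption includes scene text (the kind of content GIT can read).
caption = "a red storefront with a sign reading 'Joe's Coffee'"
history = [("What does the sign say?",
            "The sign reads 'Joe's Coffee', suggesting this is a coffee shop.")]
prompt = build_dialogue_prompt(caption, history)
```

In the automatic pipeline, the returned prompt would be sent to the language model, the generated round appended to `history`, and the loop repeated; human annotators then rate the resulting dialogues to filter out low-quality ones.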

Abstract (translated)

In this paper, we build a visual dialogue dataset named InfoVisDial, which provides rich, informative answers in each round, including answers that draw on external knowledge related to the visual content. Unlike existing datasets, InfoVisDial contains long, free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3). GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. We then ask human annotators to rate the generated dialogues to filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: 54.4% of the dialogue rounds are related to image scene text, and 36.7% require external knowledge. Each round's answer is also long and open-ended: 87.3% of answers are unique, with an average length of 8.9, compared with 27.37% and 2.9 in VisDial. Finally, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. We hope our work can motivate more effort in this direction.

URL

https://arxiv.org/abs/2312.13503

PDF

https://arxiv.org/pdf/2312.13503.pdf
