Abstract
Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained on dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, which generate descriptions or captions from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation consisting of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-Aware Dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code, and pretrained models will be made publicly available for a new Video Scene-Aware Dialog challenge.
URL
https://arxiv.org/abs/1806.08409