Abstract
Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, however, such systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities coupled with their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, in which the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide benchmark results to establish baseline performance on ProMQA. Our experiments reveal a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.
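The human-LLM collaborative annotation is only named in the abstract; as a rough illustration, the sketch below shows one way such a loop could be wired up. It assumes an OpenAI-style chat API, hypothetical helpers `draft_qa_pairs` and `queue_for_human_verification`, and illustrative prompts, model name, and file formats; none of these are taken from the paper.

```python
# Minimal sketch of a human-LLM collaborative QA annotation loop.
# Prompts, the model name, and file formats are illustrative assumptions,
# not the authors' actual pipeline.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_qa_pairs(recipe_text: str, activity_annotation: str, n: int = 3) -> list[dict]:
    """Ask an LLM to draft QA pairs grounded in a recipe and a recording's annotation."""
    prompt = (
        "You are given a recipe and an annotation of how a user actually performed it.\n\n"
        f"Recipe:\n{recipe_text}\n\n"
        f"User activity annotation:\n{activity_annotation}\n\n"
        f"Write {n} question-answer pairs about the user's procedural activity, "
        "as a JSON list of objects with 'question' and 'answer' fields."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any capable chat model would do
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch assumption: the model returns valid JSON; a real pipeline would validate.
    return json.loads(response.choices[0].message.content)

def queue_for_human_verification(qa_pairs: list[dict], path: str = "qa_to_verify.jsonl") -> None:
    """Append drafted pairs to a file; annotators later mark each one accepted or rejected."""
    with open(path, "a", encoding="utf-8") as f:
        for pair in qa_pairs:
            f.write(json.dumps({**pair, "verified": None}, ensure_ascii=False) + "\n")
```

Only pairs that annotators accept would enter the dataset, mirroring the "LLM-generated, human-verified" split described above.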
URL
https://arxiv.org/abs/2410.22211