Learning and Verification of Task Structure in Instructional Videos

Abstract

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Unlike prior work, which learns step representations locally, our approach learns them globally, using video of the entire surrounding task as context. From these learned representations, we can verify whether an unseen video correctly executes a given task, and forecast which steps are likely to follow a given step. We introduce two new benchmarks for detecting mistakes in instructional videos: one verifies whether a video contains an anomalous step, and the other verifies whether the steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on three existing benchmarks (procedural activity recognition, step classification, and step forecasting) and demonstrate that our method outperforms existing baselines on each, achieving new state-of-the-art performance.
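
To make the masked step modeling objective more concrete, the following is a minimal sketch of how training examples could be built for it. It is not the authors' implementation: the function name `mask_steps`, the `MASK` placeholder, the `mask_prob` value, and the example step labels are all assumptions made for illustration. In a real model, the masked positions would presumably be filled with a learned embedding and the remaining steps would be encoded as video features; here plain strings stand in for both.

```python
import random

# Hypothetical placeholder for a masked step. A real model would more
# likely substitute a learned mask embedding (assumption, not from the paper).
MASK = None


def mask_steps(step_features, step_labels, mask_prob=0.25, rng=None):
    """Randomly mask steps of one video and return (masked_inputs, targets).

    step_features: one feature (any object) per step in the video.
    step_labels:   one weakly supervised text label per step.
    targets maps each masked position to the label the model must predict
    from the surrounding, unmasked steps.
    """
    if len(step_features) != len(step_labels):
        raise ValueError("need exactly one label per step")
    rng = rng or random.Random()

    masked = list(step_features)
    targets = {}
    for i in range(len(masked)):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = step_labels[i]

    # Guarantee at least one masked step so every video gives a training signal.
    if masked and not targets:
        i = rng.randrange(len(masked))
        masked[i] = MASK
        targets[i] = step_labels[i]
    return masked, targets


# Usage with made-up steps; strings stand in for per-step video features.
labels = ["crack eggs", "whisk eggs", "heat pan", "pour mixture", "fold omelette"]
features = [f"clip_feature_{i}" for i in range(len(labels))]
masked_inputs, targets = mask_steps(features, labels, rng=random.Random(0))
```

A model trained on such pairs sees every unmasked step of the video when predicting a masked one, which is the sense in which the representations are learned globally rather than from each step in isolation.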

URL

https://arxiv.org/abs/2303.13519

PDF

https://arxiv.org/pdf/2303.13519.pdf