Learning and Verification of Task Structure in Instructional Videos

2023-03-23 17:59:54
Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell

Abstract

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
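
The abstract describes masked step modeling only at a high level, so the following is a toy sketch of that kind of objective, not the authors' implementation. It assumes each video is already split into steps, that every step carries a weakly supervised pseudo-label index, and that some model (not shown) has produced per-step logits from the surrounding video context. The function names, the masking probability, and the use of a plain cross-entropy loss are all illustrative assumptions.

```python
import math
import random


def mask_steps(num_steps, mask_prob, rng):
    """Choose step positions to hide (hypothetical helper).

    Each position is masked independently with probability `mask_prob`;
    if none is chosen, one position is masked at random so the loss is defined.
    """
    masked = [i for i in range(num_steps) if rng.random() < mask_prob]
    if not masked:
        masked = [rng.randrange(num_steps)]
    return masked


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_step_loss(step_logits, pseudo_labels, masked):
    """Average cross-entropy over the masked step positions only.

    step_logits[i] is assumed to be the model's prediction for step i,
    computed with step i hidden and the rest of the video visible.
    """
    total = sum(cross_entropy(step_logits[i], pseudo_labels[i]) for i in masked)
    return total / len(masked)


if __name__ == "__main__":
    rng = random.Random(0)
    pseudo_labels = [0, 1, 2, 0]            # assumed weak labels, one per step
    step_logits = [[0.0, 0.0, 0.0]] * 4     # an untrained model: uniform guesses
    masked = mask_steps(len(pseudo_labels), mask_prob=0.25, rng=rng)
    print(masked, masked_step_loss(step_logits, pseudo_labels, masked))
```

With uniform logits over three labels, the loss equals log(3) regardless of which steps are masked, which gives a simple sanity check before plugging in a real model.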

URL

https://arxiv.org/abs/2303.13519

PDF

https://arxiv.org/pdf/2303.13519.pdf

