Paper Reading AI Learner

Towards Automatic Learning of Procedures from Web Instructional Videos

2017-11-21 20:37:43
Luowei Zhou, Chenliang Xu, Jason J. Corso

Abstract

The potential for agents, whether embodied or software, to learn by observing other agents perform procedures involving objects and actions is rich. Current research on automatic procedure learning relies heavily on action labels or video subtitles, even during the evaluation phase, which makes these approaches infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation: segmenting a video of a procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset in which procedure segments are temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network that generates procedure segments by modeling the dependencies across segments. The generated segments can serve as pre-processing for other tasks, such as dense video captioning and event parsing. Our experiments show that the proposed model outperforms competitive baselines in procedure segmentation.
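The abstract does not specify the model's architecture, so the following is only an illustrative sketch of the general idea behind a segment-level recurrent network: each candidate segment is scored conditioned on a recurrent state carried over from previously selected segments, so that earlier choices influence later ones (the "dependencies across segments"). All names, dimensions, and parameters below are hypothetical, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: D-dim segment features, H-dim recurrent state.
D, H = 16, 8

# Toy parameters (random here; a real model would learn these).
W_in = rng.normal(scale=0.1, size=(H, D))   # segment feature -> state
W_rec = rng.normal(scale=0.1, size=(H, H))  # previous state -> state
w_out = rng.normal(scale=0.1, size=H)       # state -> segment score

def select_segments(candidates, num_segments):
    """Greedily pick `num_segments` candidate segments, re-scoring the
    remaining candidates after each pick so that previously chosen
    segments influence which segment is chosen next."""
    h = np.zeros(H)                          # recurrent state
    remaining = list(range(len(candidates)))
    chosen = []
    for _ in range(num_segments):
        # Score every remaining candidate conditioned on the current state.
        scores = [w_out @ np.tanh(W_in @ candidates[i] + W_rec @ h)
                  for i in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
        # Fold the chosen segment's feature into the recurrent state.
        h = np.tanh(W_in @ candidates[best] + W_rec @ h)
    return chosen

# Usage: 10 candidate segments, select 3 in order.
feats = rng.normal(size=(10, D))
print(select_segments(feats, 3))
```

The key design point the sketch illustrates is that, unlike independent per-segment scoring, the recurrent state makes each selection conditional on the segments already chosen.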


URL

https://arxiv.org/abs/1703.09788

PDF

https://arxiv.org/pdf/1703.09788.pdf

