Abstract
Agents, whether embodied or software, stand to learn a great deal by observing other agents perform procedures involving objects and actions. However, current research on automatic procedure learning relies heavily on action labels or video subtitles, even during evaluation, which makes these methods infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) using only visual evidence? To answer this question, we introduce the problem of procedure segmentation: segmenting a procedural video into category-independent procedure segments. Since no large-scale dataset exists for this problem, we collect a large-scale procedure segmentation dataset in which procedure segments are temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network that generates procedure segments by modeling the dependencies across segments. The generated segments can serve as pre-processing for other tasks, such as dense video captioning and event parsing. Our experiments show that the proposed model outperforms competitive baselines in procedure segmentation.
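To make the task's output format concrete, here is a minimal sketch of how procedure segments can be represented and compared: segments as (start, end) timestamps in seconds, matched to ground truth by temporal intersection-over-union. The function names and the greedy matching scheme are illustrative assumptions for exposition, not the paper's evaluation protocol.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(predicted, ground_truth, threshold=0.5):
    """Greedily match each predicted segment to an unused ground-truth
    segment whose temporal IoU meets the threshold (illustrative only)."""
    matched = []
    used = set()
    for p in predicted:
        best, best_iou = None, threshold
        for i, g in enumerate(ground_truth):
            if i in used:
                continue
            iou = temporal_iou(p, g)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            used.add(best)
            matched.append((p, ground_truth[best]))
    return matched
```

For example, a predicted segment (10, 20) against a ground-truth segment (12, 22) overlaps for 8 seconds over a 12-second union, an IoU of about 0.67, and so would count as a match at a 0.5 threshold.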
URL
https://arxiv.org/abs/1703.09788