Abstract
Learning skills in open-world environments is essential for developing agents that can handle a variety of tasks by combining basic skills. Online demonstration videos are typically long and unsegmented, which makes it difficult to split them into individual skills and label them with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we develop a self-supervised approach that segments these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. The approach rests on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluate our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. SBD-generated segments improve the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and that of their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage diverse YouTube videos to train instruction-following agents. The project page can be found at this https URL.
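As a rough illustration of the boundary assumption stated above, the sketch below marks a boundary wherever a frame's prediction error jumps well above the running mean error of the current segment. It is not the paper's implementation: the per-frame `errors` are assumed to have been computed already by some action-prediction model, and the names `detect_skill_boundaries` and `split_segments`, the `gap` threshold, and the `min_len` guard are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def detect_skill_boundaries(
    errors: Sequence[float],
    gap: float = 1.0,
    min_len: int = 2,
) -> List[int]:
    """Return frame indices at which a new segment is assumed to start.

    A frame t is marked as a boundary when its prediction error exceeds the
    mean error of the current segment by more than `gap` (an assumed,
    hand-picked threshold) and the segment already has `min_len` frames.
    """
    boundaries: List[int] = []
    start = 0          # first frame of the current segment
    running_sum = 0.0  # sum of errors accumulated in the current segment
    for t, err in enumerate(errors):
        length = t - start
        if length >= min_len and err - running_sum / length > gap:
            boundaries.append(t)
            start = t
            running_sum = 0.0
        # The frame's error always joins the segment it now belongs to,
        # so a spike frame counts toward the mean of the new segment.
        running_sum += err
    return boundaries


def split_segments(num_frames: int, boundaries: Sequence[int]) -> List[range]:
    """Turn boundary indices into contiguous frame ranges."""
    cuts = [0, *boundaries, num_frames]
    return [range(a, b) for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    # Toy error trace with two obvious spikes (made-up numbers).
    toy_errors = [0.1, 0.2, 0.1, 2.0, 0.2, 0.1, 0.2, 3.0, 0.1]
    found = detect_skill_boundaries(toy_errors)
    print(found)
    print(split_segments(len(toy_errors), found))
```

Comparing against the mean of the current segment rather than a global threshold is one simple way to read "significant increase"; a real system would need to tune or replace this rule.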
URL
https://arxiv.org/abs/2503.10684