Paper Reading AI Learner

Open-World Skill Discovery from Unsegmented Demonstrations

2025-03-11 18:51:40
Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, Yitao Liang

Abstract

Learning skills in open-world environments is essential for developing agents that can handle a variety of tasks by composing basic skills. Online demonstration videos are typically long and unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we develop a self-supervised approach that segments these long videos into a series of semantic-aware, skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging the prediction errors of a pretrained unconditional action-prediction model, under the assumption that a significant increase in prediction error signals a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Segments generated by SBD improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and of the corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can thus leverage diverse YouTube videos to train instruction-following agents. The project page can be found at this https URL.
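The boundary-detection idea described above can be sketched in a few lines. This is only an illustrative reconstruction from the abstract: the paper's actual model, error metric, smoothing, and threshold are not specified here, so the per-frame `errors` input, the rolling-window baseline, and the `threshold_sigma` parameter are assumptions.

```python
import statistics

def detect_skill_boundaries(errors, window=5, threshold_sigma=2.0):
    """Return frame indices where the action-prediction error spikes
    relative to a rolling baseline of the previous `window` frames.

    `errors` is a per-frame prediction-error sequence from some
    pretrained unconditional action-prediction model (hypothetical
    input here); a large jump over the recent baseline is taken, as
    in the abstract, to signal a switch to a different skill.
    """
    boundaries = []
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        mean = statistics.fmean(recent)
        stdev = statistics.pstdev(recent)
        # Flag a boundary when the current error exceeds the recent
        # mean by more than `threshold_sigma` standard deviations.
        if errors[t] > mean + threshold_sigma * max(stdev, 1e-8):
            boundaries.append(t)
    return boundaries

# Toy usage: a flat error signal with one sharp spike at frame 10.
errors = [0.1] * 10 + [1.0] + [0.1] * 10
print(detect_skill_boundaries(errors))
```

On this toy input, only the spike frame is flagged; in practice the segmenter would cut the demonstration video at each flagged frame to produce skill-consistent clips.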


URL

https://arxiv.org/abs/2503.10684

PDF

https://arxiv.org/pdf/2503.10684.pdf

