Abstract
This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We open with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalization beyond the available robot data) and commentary on key LfV challenges (e.g., challenges related to missing information in video and LfV distribution shifts). Our literature review begins with an analysis of video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically leverage video data for robot learning. Here, we categorise work according to which RL knowledge modality benefits from the use of video data. We additionally highlight techniques for mitigating LfV challenges, including reviewing action representations that address the issue of missing action labels in video. Finally, we examine LfV datasets and benchmarks, before concluding the survey by discussing challenges and opportunities in LfV. Here, we advocate for scalable approaches that can leverage the full range of available data and that target the key benefits of LfV. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalysing further research in the area, and ultimately facilitating progress towards obtaining general-purpose robots.
Abstract (translated)
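This survey gives an overview of methods for learning from video (LfV) in reinforcement learning (RL) and robotics. We focus on methods that can scale to large internet video datasets and, in the process, extract foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We begin with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalization beyond the available robot data) and commentary on key LfV challenges (e.g., missing information in video and LfV distribution shifts). Our literature review starts by analysing video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically use video data for robot learning. Here, we categorise work according to which RL knowledge modality benefits from video data. We also highlight techniques for mitigating LfV challenges, including action representations that address missing action labels in video. Finally, we examine LfV datasets and benchmarks, then conclude the survey by discussing challenges and opportunities in LfV. Here, we advocate scalable approaches that can use the full range of available data and that target the key benefits of LfV. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalyse further research in the area, and ultimately advance progress towards general-purpose robots.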
URL
https://arxiv.org/abs/2404.19664