Abstract
Teaching robots novel skills with demonstrations via human-in-the-loop data collection techniques like kinesthetic teaching or teleoperation puts a heavy burden on human supervisors. In contrast to this paradigm, it is often significantly easier to provide raw, action-free visual data of tasks being performed. Moreover, this data can even be mined from video datasets or the web. Ideally, this data can serve to guide robot learning for new tasks in novel environments, informing both "what" to do and "how" to do it. A powerful way to encode both the "what" and the "how" is to infer a well-shaped reward function for reinforcement learning. The challenge is determining how to ground visual demonstration inputs into a well-shaped and informative reward function. We propose Rank2Reward, a technique for learning behaviors from videos of tasks being performed, without access to any low-level states or actions. We do so by leveraging the videos to learn a reward function that measures incremental "progress" through a task by learning how to temporally rank the video frames in a demonstration. By inferring an appropriate ranking, the reward function is able to guide reinforcement learning by indicating when task progress is being made. This ranking function can be integrated into an adversarial imitation learning scheme, resulting in an algorithm that can learn behaviors without exploiting the learned reward function. We demonstrate the effectiveness of Rank2Reward at learning behaviors from raw video on a number of tabletop manipulation tasks, both in simulation and on a real-world robotic arm. We also demonstrate how Rank2Reward can be extended to web-scale video datasets.
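The core idea described above, learning a reward that measures task "progress" by temporally ranking demonstration frames, can be sketched with a pairwise ranking loss. The sketch below is a minimal illustration, not the paper's implementation: it assumes frames are already summarized as feature vectors (here synthetic features that drift with time), fits a hypothetical linear utility with a Bradley-Terry-style logistic loss so that later frames score higher than earlier ones, and uses the learned utility as a reward signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstration: T frames, each summarized by a D-dim
# feature vector whose mean drifts linearly with time, mimicking
# visual task progress (in practice these would come from an encoder).
T, D = 50, 8
direction = rng.normal(size=D)
frames = np.linspace(0, 1, T)[:, None] * direction + 0.05 * rng.normal(size=(T, D))

# Linear utility u(x) = w @ x, trained with a pairwise logistic
# (Bradley-Terry) ranking loss: for a sampled pair (i, j) with i < j,
# push u(frames[j]) above u(frames[i]).
w = np.zeros(D)
lr = 0.5
for _ in range(2000):
    i, j = sorted(rng.choice(T, size=2, replace=False))
    diff = frames[j] - frames[i]
    p = 1.0 / (1.0 + np.exp(-w @ diff))  # P(frame j ranked later than frame i)
    w += lr * (1.0 - p) * diff           # gradient ascent on the log-likelihood

def progress_reward(x):
    """Reward for a frame feature: higher utility = more task progress."""
    return float(w @ x)

# The learned reward should increase along the demonstration.
scores = frames @ w
```

A reinforcement-learning agent could then be rewarded with `progress_reward` evaluated on its own observations; the adversarial component described in the abstract would additionally discount states that lie off the demonstrated distribution, which this sketch omits.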
URL
https://arxiv.org/abs/2404.14735