Paper Reading AI Learner

Learning Video-Conditioned Policies for Unseen Manipulation Tasks

2023-05-10 16:25:42
Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Abstract

The ability of a non-expert user to specify robot commands is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is by a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challenging setup with demonstrations recorded in natural and diverse human environments. We propose Video-conditioned Policy learning (ViP), a data-driven approach that maps human demonstrations of previously unseen tasks to robot manipulation skills. To this end, we train our policy to generate appropriate actions given current scene observations and a video of the target task. To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos. Both robot and human videos in our framework are represented by video embeddings pre-trained for human action recognition. At test time we first translate human videos to robot videos in the common video embedding space, and then use the resulting embeddings to condition our policies. Notably, our approach enables robot control by human demonstrations in a zero-shot manner, i.e., without using robot trajectories paired with human instructions during training. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art. Our method also demonstrates excellent performance in a new, challenging zero-shot setup where no paired data is used during training.
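The abstract describes two key steps: translating a human video embedding into the common robot-video embedding space, and conditioning a policy on the resulting embedding plus the current observation. The paper's actual models are not reproduced here; the sketch below is a hypothetical toy version, assuming the translation step can be approximated by nearest-neighbour retrieval over a pre-computed bank of robot video embeddings, and standing in a simple linear map for the learned policy (all array shapes and names are illustrative, not from the paper).

```python
import numpy as np

def translate_human_to_robot(human_emb, robot_embs):
    """Toy "translation": return the robot video embedding most similar
    (by cosine similarity) to the given human video embedding.
    robot_embs has shape (num_robot_videos, emb_dim)."""
    sims = robot_embs @ human_emb / (
        np.linalg.norm(robot_embs, axis=1) * np.linalg.norm(human_emb) + 1e-8
    )
    return robot_embs[np.argmax(sims)]

def video_conditioned_policy(obs, video_emb, W):
    """Stand-in for the learned policy: a linear map from the
    concatenated scene observation and task (video) embedding to an action."""
    return W @ np.concatenate([obs, video_emb])

# Illustrative usage with made-up embeddings:
robot_embs = np.eye(3)                       # 3 robot videos, 3-dim embeddings
human_emb = np.array([0.9, 0.1, 0.0])        # embedding of a human demo video
task_emb = translate_human_to_robot(human_emb, robot_embs)
action = video_conditioned_policy(np.array([1.0, 1.0]), task_emb,
                                  W=np.ones((2, 5)))
```

In the paper's setting the policy is trained only on unlabelled robot trajectories, so this retrieval-style translation is what allows human videos of unseen tasks to drive the robot at test time without any paired human-robot data.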


URL

https://arxiv.org/abs/2305.06289

PDF

https://arxiv.org/pdf/2305.06289.pdf

