Abstract
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all of the cameras used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that achieves zero-shot generalisation to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks, whereas our approach, which benefits from multiple cameras during training, solves the task using only that same single third-person camera.
URL
https://arxiv.org/abs/2404.14064