Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras

2024-04-22 10:21:41
Mhairi Dunion, Stefano V. Albrecht

Abstract

The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that achieves zero-shot generalisation to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks, but our approach, benefiting from multiple cameras during training, is able to solve the task using only that same third-person camera.
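The shared/private split described in the abstract lends itself to a compact sketch. Below is a minimal PyTorch illustration, not the paper's implementation: the names (MVDEncoder, mvd_loss), the network sizes, and the cosine-similarity objectives are assumptions made for exposition; MVD's actual auxiliary losses are specified in the paper (arXiv:2404.14064).

```python
# Hypothetical sketch of a shared/private disentangled encoder for two cameras.
# All names and loss choices here are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVDEncoder(nn.Module):
    """Encodes one camera view into a shared and a private embedding."""
    def __init__(self, obs_dim: int = 3 * 84 * 84,
                 shared_dim: int = 64, private_dim: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.shared_head = nn.Linear(256, shared_dim)    # aligned across cameras
        self.private_head = nn.Linear(256, private_dim)  # camera-specific

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs.flatten(1))
        return self.shared_head(h), self.private_head(h)

def mvd_loss(shared_a, private_a, shared_b, private_b):
    """Auxiliary objective (assumed form): pull the shared embeddings of the
    two cameras together, and push each camera's shared and private embeddings
    apart so the private part stays camera-specific."""
    align = 1 - F.cosine_similarity(shared_a, shared_b).mean()
    separate = (F.cosine_similarity(shared_a, private_a).abs().mean()
                + F.cosine_similarity(shared_b, private_b).abs().mean())
    return align + separate

# Usage: one encoder per camera during training. The policy would consume only
# the shared embedding, so at test time any single training camera suffices.
enc_ego, enc_third = MVDEncoder(), MVDEncoder()
obs_ego = torch.randn(8, 3, 84, 84)    # egocentric camera batch
obs_third = torch.randn(8, 3, 84, 84)  # third-person camera batch
s_e, p_e = enc_ego(obs_ego)
s_t, p_t = enc_third(obs_third)
loss = mvd_loss(s_e, p_e, s_t, p_t)
loss.backward()
```

Under this assumed design, the key property is that the policy conditions only on the shared embedding, which the auxiliary loss aligns across cameras; that is what would let a policy trained on several cameras run zero-shot from any single one of them.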

URL

https://arxiv.org/abs/2404.14064

PDF

https://arxiv.org/pdf/2404.14064.pdf

