Towards Generalist Robot Learning from Internet Video: A Survey

2024-04-30 15:57:41
Robert McCarthy, Daniel C.H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li

Abstract

This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We open with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalization beyond the available robot data) and commentary on key LfV challenges (e.g., challenges related to missing information in video and LfV distribution shifts). Our literature review begins with an analysis of video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically leverage video data for robot learning. Here, we categorise work according to which RL knowledge modality benefits from the use of video data. We additionally highlight techniques for mitigating LfV challenges, including reviewing action representations that address the issue of missing action labels in video. Finally, we examine LfV datasets and benchmarks, before concluding the survey by discussing challenges and opportunities in LfV. Here, we advocate for scalable approaches that can leverage the full range of available data and that target the key benefits of LfV. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalysing further research in the area, and ultimately facilitating progress towards obtaining general-purpose robots.

URL

https://arxiv.org/abs/2404.19664

PDF

https://arxiv.org/pdf/2404.19664.pdf

