Paper Reading AI Learner

AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

2025-06-17 05:31:42
Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg

Abstract

Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.
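The modular decomposition the abstract describes — a forward dynamics model that predicts keypoint motion as discrete tokens (trainable on action-free video) and an inverse dynamics model that maps predicted motion to robot actions (trainable on a small action-labeled set) — can be sketched as follows. All class names, the toy uniform-bin quantizer, and the proportional motion-to-action mapping are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of AMPLIFY's two-stage decomposition. The quantization
# scheme below is a stand-in for the paper's learned motion-token codebook.

class ForwardDynamics:
    """Encodes keypoint motion as discrete tokens.
    Trained on abundant action-free video (no robot actions needed)."""

    def __init__(self, num_bins=8):
        self.num_bins = num_bins  # size of the discrete motion vocabulary
        self.lo, self.hi = -1.0, 1.0  # assumed displacement range

    def tokenize(self, keypoint_deltas):
        # Quantize continuous per-keypoint displacements into token indices.
        tokens = []
        for dx in keypoint_deltas:
            clipped = min(max(dx, self.lo), self.hi)
            frac = (clipped - self.lo) / (self.hi - self.lo)
            tokens.append(int(frac * (self.num_bins - 1)))
        return tokens

    def detokenize(self, tokens):
        # Map token indices back to representative displacements.
        step = (self.hi - self.lo) / (self.num_bins - 1)
        return [self.lo + t * step for t in tokens]


class InverseDynamics:
    """Maps predicted motion back to robot actions.
    Trained on the limited action-labeled dataset."""

    def __init__(self, gain=0.5):
        self.gain = gain  # toy proportional motion -> action mapping

    def act(self, motion):
        return [self.gain * m for m in motion]


# The two modules scale independently: the forward model sees only video,
# while the inverse model needs only a few action-labeled examples.
fwd = ForwardDynamics(num_bins=8)
inv = InverseDynamics(gain=0.5)

deltas = [0.2, -0.6, 1.5]        # observed keypoint displacements
tokens = fwd.tokenize(deltas)    # discrete motion tokens, e.g. [4, 1, 7]
motion = fwd.detokenize(tokens)  # reconstructed motion plan
action = inv.act(motion)         # action inferred from motion alone
print(tokens, action)
```

Decoupling "what motion defines the task" (tokenizer plus forward model) from "how the robot performs it" (inverse model) is what lets each half be trained on a different data source.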


URL

https://arxiv.org/abs/2506.14198

PDF

https://arxiv.org/pdf/2506.14198.pdf

