Abstract
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.
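The core design described above is a two-stage decoupling: a forward dynamics model predicts discrete motion tokens from keypoint observations (trainable on abundant action-free video), while a separate inverse dynamics model maps those tokens to robot actions (trainable on a small action-labeled set). The toy sketch below illustrates only this modular split; it is not the authors' implementation, and all class names, shapes, and the nearest-centroid quantization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class ForwardDynamics:
    """Toy stand-in: quantizes a keypoint observation to a discrete motion token
    via nearest codebook entry. In AMPLIFY this stage would be trained on
    large-scale action-free video."""
    def __init__(self, num_tokens: int = 8, dim: int = 4):
        # Codebook of motion tokens (randomly initialized here for illustration)
        self.codebook = rng.normal(size=(num_tokens, dim))

    def predict_token(self, keypoints: np.ndarray) -> int:
        distances = np.linalg.norm(self.codebook - keypoints, axis=1)
        return int(np.argmin(distances))

class InverseDynamics:
    """Toy stand-in: maps a motion token to a robot action via a lookup table.
    In AMPLIFY this stage would be trained on a limited action-labeled set."""
    def __init__(self, num_tokens: int = 8, action_dim: int = 2):
        self.action_table = rng.normal(size=(num_tokens, action_dim))

    def act(self, token: int) -> np.ndarray:
        return self.action_table[token]

# The two modules scale independently: only the interface (the motion token)
# couples them, so each can be retrained on its own data source.
fwd = ForwardDynamics()
inv = InverseDynamics()
token = fwd.predict_token(rng.normal(size=4))
action = inv.act(token)
```

The point of the sketch is the interface: because the forward model emits only a compact discrete token, the inverse model never needs to see video-scale data, which is the independence-of-scaling property the abstract highlights.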
URL
https://arxiv.org/abs/2506.14198