Abstract
Manipulation has long been a challenging task for robots, whereas humans effortlessly perform complex interactions with objects, such as hanging a cup on a mug rack. A key reason is the lack of a large, unified dataset for teaching robots manipulation skills. Current robot datasets often record robot actions in different action spaces within simple scenes, which hinders robots from learning a unified and robust action representation across embodiments and diverse scenes. Observing how humans understand a manipulation task, we find that understanding how objects should move in 3D space is a critical clue for guiding actions. This clue is embodiment-agnostic and applies to both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. The model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through a moving-object auto-detection pipeline. A video-diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow-guided rendering mechanism, which renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description; this equips the robot with closed-loop planning ability. Finally, we treat the predicted 3D optical flow as constraints in an optimization policy that determines a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
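To make the last step more concrete, below is a minimal sketch (not the paper's implementation) of how a predicted 3D object flow could be turned into a chunk of end-effector targets. It assumes the gripper holds the object rigidly, so each future frame's object points are related to frame 0 by a rigid transform recovered in closed form (Kabsch/SVD); the paper instead formulates this as an optimization with the flow as constraints. All names such as `flow_to_action_chunk` are illustrative and not from the paper.

```python
# Hypothetical sketch: convert a predicted 3D object flow into end-effector poses,
# assuming a rigid grasp so the gripper moves with the object.
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst via SVD (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # correct an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def flow_to_action_chunk(flow, ee_pose0):
    """
    flow:     (T, N, 3) predicted 3D positions of N object points over T frames.
    ee_pose0: (4, 4) current end-effector pose in the same frame as the flow.
    Returns a (T, 4, 4) chunk of target end-effector poses.
    """
    chunk = []
    for k in range(flow.shape[0]):
        R, t = rigid_transform(flow[0], flow[k])   # object motion, frame 0 -> k
        T_obj = np.eye(4)
        T_obj[:3, :3], T_obj[:3, 3] = R, t
        chunk.append(T_obj @ ee_pose0)             # carry the gripper along with the object
    return np.stack(chunk)
```

In the closed-loop setting described above, such a chunk would only be executed after the flow-guided rendering of the predicted final state passes the GPT-4o consistency check against the language instruction.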
URL
https://arxiv.org/abs/2506.06199