Abstract
Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and leaves them brittle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning with Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
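The abstract does not give the exact reward formulas. As a minimal sketch, assuming the Affordance Perception Reward is an IoU score between predicted and reference interaction boxes, and the Trajectory Match Reward decays with the mean waypoint distance to a reference path, rule-based verifiable rewards might look like the following (all function names, box/array conventions, and the `tol` threshold are hypothetical, not taken from the paper):

```python
import numpy as np

def affordance_reward(pred_box, gt_box):
    """Hypothetical Affordance Perception Reward: IoU between the
    predicted interaction region and the reference region, where boxes
    are [x1, y1, x2, y2]. Returns a verifiable scalar in [0, 1]."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def trajectory_reward(pred_traj, ref_traj, tol=0.05):
    """Hypothetical Trajectory Match Reward: mean per-waypoint distance
    between predicted and reference paths (assumed same shape, N x 3),
    mapped into [0, 1] so closer paths score higher."""
    pred, ref = np.asarray(pred_traj), np.asarray(ref_traj)
    dist = np.linalg.norm(pred - ref, axis=-1).mean()
    return float(np.exp(-dist / tol))  # smooth decay with distance
```

Both quantities are computed directly from the model's output and the reference annotation, with no learned reward model in the loop, which is what makes such rewards "verifiable" in the RLVR sense.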
URL
https://arxiv.org/abs/2505.16517