Abstract
Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC utilizes a conditional action sampling module that takes a goal image or language instruction as input and leverages a VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at this https URL.
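To make the described pipeline concrete, below is a minimal, hypothetical Python sketch of the receding-horizon loop the abstract outlines: sample candidate action sequences with a VLM, roll them forward with a predictive model, score them with a hierarchical cost, and execute only the first action of the best sequence. All component functions (sample_actions_with_vlm, predict_future, hierarchical_cost) are placeholder stubs standing in for the paper's VLM-based sampler, video-prediction model, and cost function; none of these names, signatures, or dimensions come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_actions_with_vlm(obs, task, n, horizon, action_dim=7):
    # Stub: the real module conditions a VLM on a goal image or
    # language instruction to propose candidate action sequences.
    return rng.normal(size=(n, horizon, action_dim))

def predict_future(obs, actions):
    # Stub: the real model rolls out predicted video frames (VLMPC)
    # or motion trajectories (Traj-VLMPC) for the candidate actions.
    return obs + 0.01 * actions.sum()

def hierarchical_cost(predicted, task):
    # Stub: the real cost scores pixel-level and knowledge-level
    # consistency between the prediction and the task input.
    return float(np.abs(predicted - task).mean())

def vlmpc_step(obs, task, n=8, horizon=10):
    # One MPC step: sample, predict, score, pick the best sequence.
    candidates = sample_actions_with_vlm(obs, task, n, horizon)
    costs = [hierarchical_cost(predict_future(obs, a), task)
             for a in candidates]
    best = candidates[int(np.argmin(costs))]
    # Receding horizon: execute only the first action, then replan.
    return best[0]

obs, task = np.zeros((64, 64)), np.ones((64, 64))
print(vlmpc_step(obs, task).shape)  # (7,): one action per control step
```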
URL
https://arxiv.org/abs/2504.05225