Abstract
There is growing interest in applying large language models (LLMs) to robotic tasks, owing to their remarkable reasoning ability and the extensive knowledge they acquire from vast training corpora. Grounding LLMs in the physical world, however, remains an open challenge, as they can only process textual input. Recent advances in large vision-language models (LVLMs) enable a more comprehensive understanding of the physical world by incorporating visual input, which provides richer contextual information than language alone. In this work, we propose a novel paradigm that leverages GPT-4V(ision), the state-of-the-art LVLM from OpenAI, to enable embodied agents to perceive liquid objects via image-based environmental feedback. Specifically, we exploit the physical understanding of GPT-4V to interpret visual representations (e.g., time-series plots) of non-visual feedback (e.g., F/T sensor data), indirectly enabling multimodal perception beyond vision and language by using images as proxies. We evaluate our method on 10 common household liquids in containers of various geometries and materials. Without any training or fine-tuning, we show that our method enables the robot to indirectly perceive the physical response of liquids and estimate their viscosity. We also show that, by jointly reasoning over visual attributes and the physical attributes learned through interaction, our method can recognize liquid objects in the absence of strong visual cues (e.g., container labels with legible text or symbols), increasing accuracy from 69.0%, achieved by the best-performing vision-only variant, to 86.0%.
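The key idea, using an image as a proxy for non-visual sensing, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a hypothetical `render_ft_plot` helper that turns a force/torque (F/T) time series into a PNG image, which could then be passed to an LVLM such as GPT-4V alongside a textual prompt.

```python
import io
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt


def render_ft_plot(timestamps, forces_z):
    """Render a force time series as a PNG (bytes).

    The resulting image serves as a visual proxy for the non-visual
    F/T signal, so a vision-language model can 'read' the physical
    response without ingesting raw sensor data. (Hypothetical helper,
    for illustration only.)
    """
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(timestamps, forces_z)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("F_z (N)")
    ax.set_title("F/T sensor response during interaction")
    fig.tight_layout()
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


# Synthetic example: a decaying oscillation, loosely resembling how a
# viscous liquid damps the force signal after the container is shaken.
ts = [i * 0.01 for i in range(300)]
fz = [2.0 * (0.97 ** i) * math.cos(0.3 * i) for i in range(300)]
png_bytes = render_ft_plot(ts, fz)
```

The PNG bytes would then be attached to an LVLM query (e.g., "How viscous does the liquid appear, given this force response?"), letting the model reason over physical feedback through an image channel.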
URL
https://arxiv.org/abs/2404.06904