Abstract
Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is constrained by their limited ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assess their performance on these memory-intensive tasks, and highlight areas for improvement.
Abstract (translated)
近期,大型视觉-语言模型在规划和控制任务中表现出令人印象深刻的性能,这激发了人们将其应用于真实世界机器人技术的兴趣。然而,在具身环境中应用这些模型进行推理时,它们的局限性在于难以整合跨越多天收集的大量图像所代表的长期经验。当前的视觉语言模型(VLMs)通常只能同时处理几百张图片以内的情况,凸显出在具身场景中更有效地管理长期记忆的需求。为了有效评估这些模型在长周期控制任务中的表现,基准测试必须特别针对那些成功依赖于良好记忆能力的情境。现有的长时间视频问答基准忽略了像物体操作和导航这样的具身挑战,这些问题需要低级技能以及对过去互动的细致推理。 此外,在具身代理中有效地整合记忆不仅包括回忆相关的历史信息,还包括根据这些信息执行动作,这意味着在研究这些方面时应将它们作为一个整体而非孤立地看待。在这项工作中,我们引入了一个新的基准测试,用于评估Habitat模拟器中的长距离具身任务的记忆能力。该基准测试涵盖60个需要环境内持续互动和情境意识的任务,并且可以扩展到更长时间和更具挑战性的版本中去,以实现对记忆和推理的可伸缩性评估。我们还提出了基线方法,这些方法将最先进的VLM与低级导航策略相结合,用以评估它们在这些依赖于强大记忆能力任务上的表现,并指出了改进的方向。
URL
https://arxiv.org/abs/2506.15635