Paper Reading AI Learner

FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

2025-06-18 17:06:28
Karmesh Yadav, Yusuf Ali, Gunshi Gupta, Yarin Gal, Zsolt Kira

Abstract

Large vision-language models (VLMs) have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended into longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.
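The baseline architecture the abstract describes — a VLM that recalls relevant frames from a long image history and a separate low-level policy that executes navigation — can be sketched as a simple loop. This is a minimal illustrative sketch, not the paper's actual API: the class and function names (`Memory`, `vlm_select_goal`, `low_level_policy`) are hypothetical, frames are stand-in strings rather than images, and the VLM and policy are replaced by trivial stubs.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical long-term memory: in the setting described above, this would be
# a large collection of timestamped images; here frames are plain strings.
@dataclass
class Memory:
    frames: List[str] = field(default_factory=list)

    def add(self, frame: str) -> None:
        self.frames.append(frame)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Stand-in for VLM-based recall: return the k most recent frames
        # mentioning the query. A real system would score frames with the VLM.
        hits = [f for f in self.frames if query in f]
        return hits[-k:]

def vlm_select_goal(query: str, recalled: List[str]) -> str:
    # Stand-in for the high-level VLM: treat the most recent relevant
    # frame as the navigation goal, or fall back to exploration.
    return recalled[-1] if recalled else "explore"

def low_level_policy(goal: str) -> List[str]:
    # Stand-in for a learned navigation policy: emit discrete actions
    # toward the goal (a fixed action sequence in this toy version).
    return ["turn_left", "forward", "forward"] if goal != "explore" else ["forward"]

# Usage: the agent recalls where it last saw the mug, then navigates there.
memory = Memory()
for frame in ["day1: mug on kitchen counter",
              "day1: sofa in living room",
              "day2: mug moved to office desk"]:
    memory.add(frame)

recalled = memory.retrieve("mug")
goal = vlm_select_goal("mug", recalled)
actions = low_level_policy(goal)
```

The point of the sketch is the division of labor the abstract emphasizes: recall (selecting the relevant past observation) and acting (executing low-level skills toward it) are separate components that must be evaluated together.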


URL

https://arxiv.org/abs/2506.15635

PDF

https://arxiv.org/pdf/2506.15635.pdf

