Abstract
Embodied Question Answering (EQA) requires agents to autonomously explore and understand an environment in order to answer context-dependent questions. Existing frameworks are typically centered on the planner, which guides the stopping, memory, and answering modules during reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models, in which the memory module cannot fully interact with the other modules, MemoryEQA flexibly feeds memory information into all modules, thereby improving efficiency and accuracy on complex tasks such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism divided into a global memory, which stores language-enhanced scene maps, and a local memory, which retains historical observations and state information. When performing EQA tasks, a multi-modal large language model converts memory information into the input format each module requires before it is injected into that module. To evaluate the memory capabilities of EQA models, we construct the MT-HM3D dataset based on HM3D. It comprises 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to remember target information acquired during exploration. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, and a 19.8% performance gain over the baseline model on MT-HM3D further underscores the pivotal role of memory in resolving complex tasks.
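The two-level memory described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names, the field layout, the word-overlap retrieval, and the text rendering are all assumptions made for illustration, standing in for the language-enhanced scene map and the multi-modal large language model that the paper uses.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One local-memory entry (hypothetical schema): step, agent position, caption."""
    step: int
    position: tuple[float, float]
    caption: str


@dataclass
class HierarchicalMemory:
    # Global memory: region name -> language descriptions attached to the scene map.
    global_map: dict[str, list[str]] = field(default_factory=dict)
    # Local memory: chronological observations together with agent state.
    local_history: list[Observation] = field(default_factory=list)

    def add(self, region: str, obs: Observation) -> None:
        """Record an observation in both memory levels."""
        self.global_map.setdefault(region, []).append(obs.caption)
        self.local_history.append(obs)

    def retrieve(self, query: str, k: int = 2) -> list[Observation]:
        """Rank observations by word overlap with the query.

        Word overlap is an assumed placeholder for a learned retriever.
        """
        words = set(query.lower().split())
        scored = [
            (len(words & set(o.caption.lower().split())), o)
            for o in self.local_history
        ]
        scored = [(s, o) for s, o in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1].step))
        return [o for _, o in scored[:k]]

    def as_prompt(self, query: str, module: str) -> str:
        """Render memory as text for one module (e.g. planner, stopping, answering)."""
        lines = [f"[{module}] question: {query}"]
        for region, captions in self.global_map.items():
            lines.append(f"region {region}: " + "; ".join(captions))
        for o in self.retrieve(query):
            lines.append(f"step {o.step} at {o.position}: {o.caption}")
        return "\n".join(lines)


# Usage: the same memory is rendered for whichever module asks for it.
mem = HierarchicalMemory()
mem.add("kitchen", Observation(0, (0.0, 0.0), "a red kettle on the counter"))
mem.add("bedroom", Observation(1, (3.0, 1.0), "a blue lamp beside the bed"))
print(mem.as_prompt("is the kettle red", module="answering"))
```

The point of the sketch is the data flow rather than the retrieval method: every module reads from the same store through a module-specific rendering step, instead of receiving memory only through the planner.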
URL
https://arxiv.org/abs/2505.13948