Abstract
Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) and draw on prior interaction history to interpret dynamic, real-world instructions. Yet how effectively embodied agents use memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, an evaluation framework for personalized embodied agents that comprehensively assesses how well they use memory to provide personalized assistance. The framework uses a two-stage memory evaluation process that quantifies the impact of memory utilization on task performance. This process evaluates agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization: even frontier models like GPT-4o experience a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research on more effective personalized embodied agents. A project website is linked from the arXiv abstract page.
URL
https://arxiv.org/abs/2505.16348