Abstract
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, which handles simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging: the agent must not only identify the target objects relevant to the query, but also reach a consensus on their states before the query becomes answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. The user study establishes the authenticity of S-EQA: a high 97.26% of the generated queries were deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between LLM-predicted answers and human-evaluated ones, indicating the LLM's poor capability at directly answering situational queries, while establishing S-EQA's usefulness in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also appear visually different upon modification, enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
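The abstract states that PGE keeps the generated queries unique using multiple forms of semantic similarity. As a minimal illustration of this kind of filtering (not the paper's actual implementation), the sketch below greedily rejects any generated query that is too similar to one already accepted, using a simple token-overlap (Jaccard) measure as a stand-in for a learned semantic-similarity model; the 0.8 threshold and all function names are hypothetical.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity between two queries; a crude
    stand-in for the semantic-similarity measures used in PGE."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def keep_unique(candidates, threshold=0.8):
    """Greedily accept generated queries that are sufficiently
    dissimilar from every query accepted so far (threshold is a
    hypothetical choice)."""
    accepted = []
    for query in candidates:
        if all(jaccard_similarity(query, prev) < threshold
               for prev in accepted):
            accepted.append(query)
    return accepted


queries = [
    "Is the bathroom clean and dry?",
    "Is the bathroom dry and clean?",    # near-duplicate, filtered out
    "Is the kitchen ready for cooking?",
]
unique = keep_unique(queries)
```

In practice a system like PGE would replace the Jaccard measure with embedding-based similarity (e.g., cosine similarity over sentence embeddings), but the greedy accept/reject structure of the deduplication loop stays the same.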
URL
https://arxiv.org/abs/2405.04732