S-EQA: Tackling Situational Queries in Embodied Question Answering

2024-05-08 00:45:20
Vishnu Sashank Dorbala, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha, Reza Ghanadhan

Abstract

We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging: the agent must not only identify the target objects pertaining to the query, but also needs a consensus on their states for the query to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries being deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between the LLM-predicted answers and the human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also visually appear different upon modification, enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
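
The abstract says PGE keeps generated queries unique by checking semantic similarity, but gives no detail on the measures or thresholds used. The following is only a minimal sketch of that kind of uniqueness filter, not the authors' method: the token-overlap (Jaccard) score stands in for the unspecified semantic similarity measures, and the function names, the 0.6 threshold, and the example queries other than the bathroom one are all illustrative assumptions.

```python
import re
from typing import Callable, Iterable, List, Set


def tokens(text: str) -> Set[str]:
    """Return the set of lowercase word tokens in a query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    Assumption: a simple stand-in for the semantic similarity
    measures mentioned in the abstract, which are not specified there.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def keep_unique(
    candidates: Iterable[str],
    threshold: float = 0.6,  # hypothetical cutoff, not from the paper
    similarity: Callable[[str, str], float] = jaccard,
) -> List[str]:
    """Keep a candidate only if it scores below the threshold
    against every query that has already been kept."""
    kept: List[str] = []
    for query in candidates:
        if all(similarity(query, prev) < threshold for prev in kept):
            kept.append(query)
    return kept


# Illustrative candidates, as if produced by an LLM generation step.
candidates = [
    "Is the bathroom clean and dry?",
    "Is the bathroom dry and clean?",   # same tokens, reordered
    "Is the kitchen ready for cooking?",
]
unique_queries = keep_unique(candidates)
```

In this sketch the reordered bathroom query is rejected because its token set matches the first query exactly, while the kitchen query is kept. Swapping in an embedding-based similarity only requires passing a different `similarity` callable.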

URL

https://arxiv.org/abs/2405.04732

PDF

https://arxiv.org/pdf/2405.04732.pdf

