Abstract
Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA makes the fundamental assumption that every question, e.g., "what color is the car?", has exactly one target ("car") being inquired about. This assumption directly limits the abilities of the agent. We present a generalization of EQA: Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as "Is the dresser in the bedroom bigger than the oven in the kitchen?", where the agent has to navigate to multiple locations ("dresser in bedroom", "oven in kitchen") and perform comparative reasoning ("dresser" bigger than "oven") before it can answer the question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to the multiple locations named by the navigation-related sub-programs; and the controller learns to select relevant observations along the agent's path. These observations are then fed to the VQA module to predict the answer. We perform a detailed analysis of each model component and show that our joint model outperforms previous methods and strong baselines by a significant margin.
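To make the program generator's role concrete, the following is a minimal sketch of how the example question above might be decomposed into sequential sub-programs. It is a toy, rule-based stand-in and not the paper's method. The operation names (`nav_room`, `nav_object`, `query_size`, `compare_bigger`), the `SubProgram` type, and the `generate_program` function are all hypothetical, chosen only for illustration, and the pattern handles just this one question template.

```python
import re
from typing import List, NamedTuple


class SubProgram(NamedTuple):
    op: str   # hypothetical operation name, not taken from the paper
    arg: str  # target room, object, or pair of objects


# Toy pattern for a single template:
# "Is the <obj> in the <room> bigger than the <obj> in the <room>?"
_PATTERN = re.compile(
    r"is the (?P<obj1>[\w ]+?) in the (?P<room1>[\w ]+?) bigger than "
    r"the (?P<obj2>[\w ]+?) in the (?P<room2>[\w ]+?)\?",
    re.IGNORECASE,
)


def generate_program(question: str) -> List[SubProgram]:
    """Split a size-comparison question into ordered sub-programs."""
    match = _PATTERN.fullmatch(question.strip())
    if match is None:
        raise ValueError("question does not fit the toy template")
    obj1, room1 = match["obj1"], match["room1"]
    obj2, room2 = match["obj2"], match["room2"]
    return [
        SubProgram("nav_room", room1),      # navigation-related
        SubProgram("nav_object", obj1),     # navigation-related
        SubProgram("query_size", obj1),     # observation to collect
        SubProgram("nav_room", room2),
        SubProgram("nav_object", obj2),
        SubProgram("query_size", obj2),
        SubProgram("compare_bigger", f"{obj1},{obj2}"),  # final reasoning step
    ]


if __name__ == "__main__":
    question = "Is the dresser in the bedroom bigger than the oven in the kitchen?"
    for step in generate_program(question):
        print(step.op, step.arg)
```

In this sketch, the steps whose names start with `nav_` correspond to what a navigator would consume, while the remaining steps mark where observations would be selected and compared. A learned generator, as described in the abstract, would replace the hand-written pattern.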
URL
https://arxiv.org/abs/1904.04686