Paper Reading AI Learner

Multi-Target Embodied Question Answering

2019-04-09 14:10:40
Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

Abstract

Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA makes the fundamental assumption that every question, e.g., "what color is the car?", has exactly one target ("car") being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA: Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as "Is the dresser in the bedroom bigger than the oven in the kitchen?", where the agent has to navigate to multiple locations ("dresser in bedroom", "oven in kitchen") and perform comparative reasoning ("dresser" bigger than "oven") before it can answer the question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to the multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along the agent's path. These observations are then fed to the VQA module to predict the answer. We perform a detailed analysis of each model component and show that our joint model outperforms previous methods and strong baselines by a significant margin.
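The abstract's pipeline (question → sub-programs → navigation → observation selection → comparative VQA) can be sketched in a few lines. This is a hypothetical illustration only: the function names, the sub-program vocabulary, and the toy observation records are assumptions for readability, not the paper's actual implementation.

```python
# Hypothetical sketch of the MT-EQA modular pipeline described above.
# All names and data structures here are illustrative assumptions.

def program_generator(question):
    """Convert a comparative MT-EQA question into sequential sub-programs.
    A real model learns this mapping; here we hard-code one example."""
    return [
        ("nav_room", "bedroom"),
        ("nav_object", "dresser"),
        ("nav_room", "kitchen"),
        ("nav_object", "oven"),
        ("query_compare", ("dresser", "oven", "bigger")),
    ]

def controller_select(observations):
    """Controller stub: keep only the frames relevant to the targets."""
    return [obs for obs in observations if obs["relevant"]]

def vqa_compare(selected, subject, reference):
    """VQA stub: answer the size comparison from selected observations."""
    sizes = {obs["target"]: obs["size"] for obs in selected}
    return sizes[subject] > sizes[reference]

# Toy run: pretend the navigator has already gathered these egocentric
# observations while executing the navigation sub-programs.
observations = [
    {"target": "dresser", "size": 1.8, "relevant": True},
    {"target": "wall",    "size": 9.0, "relevant": False},
    {"target": "oven",    "size": 0.9, "relevant": True},
]
programs = program_generator(
    "Is the dresser in the bedroom bigger than the oven in the kitchen?")
answer = vqa_compare(controller_select(observations), "dresser", "oven")
print(answer)  # → True
```

The point of the sketch is the division of labor: decomposition, navigation, observation filtering, and answering are separate modules, which is what allows the agent to handle questions with more than one target.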


URL

https://arxiv.org/abs/1904.04686

PDF

https://arxiv.org/pdf/1904.04686.pdf

