Paper Reading AI Learner

Active Learning for Visual Question Answering: An Empirical Study

2017-11-06 05:28:38
Xiao Lin, Devi Parikh

Abstract

We present an empirical study of active learning for Visual Question Answering, where a deep VQA model selects informative question-image pairs from a pool and queries an oracle for answers to maximally improve its performance under a limited query budget. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a fast and effective goal-driven active learning scoring function to pick question-image pairs for deep VQA models under the Bayesian Neural Network framework. We find that deep VQA models need large amounts of training data before they can start asking informative questions. But once they do, all three approaches outperform the random selection baseline and achieve significant query savings. For the scenario where the model is allowed to ask generic questions about images but is evaluated only on specific questions (e.g., questions whose answer is either yes or no), our proposed goal-driven scoring function performs the best.
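The cramming (entropy) criterion mentioned above scores each pool item by the model's predictive uncertainty and queries the oracle for the most uncertain question-image pairs under the budget. A minimal sketch with NumPy, assuming the VQA model exposes per-example answer probabilities (the function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per pool example; probs has shape (n_pool, n_answers)."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_queries(probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` most uncertain question-image pairs to send to the oracle."""
    scores = entropy_scores(probs)
    return np.argsort(-scores)[:budget]  # highest entropy first

# Toy pool: 4 question-image pairs, 3 candidate answers each.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform -> high entropy
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
])
picked = select_queries(probs, budget=2)  # -> indices [1, 2]
```

The expected-model-change and expected-error-reduction criteria replace `entropy_scores` with costlier scoring functions (e.g. gradient magnitudes or anticipated validation loss); the paper's contribution is a fast goal-driven variant of the latter under a Bayesian Neural Network framework.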

URL

https://arxiv.org/abs/1711.01732

PDF

https://arxiv.org/pdf/1711.01732.pdf

