Paper Reading AI Learner

Information Maximizing Visual Question Generation

2019-03-27 00:57:25
Ranjay Krishna, Michael Bernstein, Li Fei-Fei

Abstract

Though image-to-sequence generation models have become overwhelmingly popular in human-computer communications, they suffer from strongly favoring safe, generic questions ("What is in this picture?"). Generating relevant but uninformative questions is neither sufficient nor useful. We argue that a good question is one with a tightly focused purpose: one aimed at eliciting a specific type of response. We build a model that maximizes mutual information between the image, the expected answer, and the generated question. To overcome the non-differentiability of discrete natural language tokens, we introduce a variational continuous latent space onto which the expected answers project. We regularize this latent space with a second latent space that ensures similar answers cluster together. Even when we don't know the expected answer, this second latent space can generate goal-driven questions specifically aimed at extracting objects ("What is the person throwing?"), attributes ("What kind of shirt is the person wearing?"), colors ("What color is the frisbee?"), materials ("What material is the frisbee?"), etc. We quantitatively show that our model is able to retain information about an expected answer category, resulting in more diverse, goal-driven questions. We launch our model on a set of real-world images and extract previously unseen visual concepts.
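The abstract's workaround for non-differentiable discrete tokens is a variational continuous latent space for expected answers. A minimal sketch of the standard reparameterization trick such a space relies on, in pure Python (all names and the scalar toy setup are illustrative assumptions, not the paper's implementation):

```python
import math
import random

def reparameterize(mu: float, log_var: float) -> float:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    All randomness lives in eps, so z is a differentiable function of
    the learned parameters mu and log_var -- the property a variational
    continuous latent space uses to sidestep non-differentiable
    discrete-token sampling.
    """
    eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

# Toy example: project an "expected answer" embedding (here a scalar)
# into the continuous latent space; samples concentrate around mu.
random.seed(0)
mu, log_var = 1.5, math.log(0.04)  # sigma = 0.2
z = reparameterize(mu, log_var)
```

In a full model, `mu` and `log_var` would be produced by an encoder over the answer, and gradients from the question decoder would flow back through `z` into those encoder outputs.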

URL

https://arxiv.org/abs/1903.11207

PDF

https://arxiv.org/pdf/1903.11207.pdf
