Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

2017-10-16 05:34:24
Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

Abstract

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show this leads to greater inductive transfer from recognition to VQA than standard multitask learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.

URL

https://arxiv.org/abs/1704.00260

PDF

https://arxiv.org/pdf/1704.00260.pdf

