Paper Reading AI Learner

Harvesting Information from Captions for Weakly Supervised Semantic Segmentation

2019-05-16 14:35:09
Johann Sawatzky, Debayan Banerjee, Juergen Gall

Abstract

Since acquiring pixel-wise annotations for training convolutional neural networks for semantic image segmentation is time-consuming, weakly supervised approaches that only require class tags have been proposed. In this work, we propose another form of supervision, namely image captions as they can be found on the Internet. These captions have two advantages: they do not require additional curation, as is the case for the clean class tags used by current weakly supervised approaches, and they provide textual context for the classes present in an image. To leverage such textual context, we deploy a multi-modal network that learns a joint embedding of the visual representation of the image and the textual representation of the caption. The network estimates text activation maps (TAMs) for class names as well as compound concepts, i.e. combinations of nouns and their attributes. The TAMs of compound concepts describing classes of interest substantially improve the quality of the estimated class activation maps, which are then used to train a network for semantic segmentation. We evaluate our method on the COCO dataset, where it achieves state-of-the-art results for weakly supervised image segmentation.
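The paper itself does not specify its scoring function here, but the core idea — comparing each spatial visual feature against a text embedding in a joint space to obtain a per-pixel activation map — can be sketched roughly as follows. This is a minimal illustration with hypothetical names and random stand-in features, not the authors' implementation; it assumes cosine similarity as the compatibility score:

```python
import numpy as np

def text_activation_map(visual_feats, text_embedding):
    """Score a text (word or compound-concept) embedding against each
    spatial location of a visual feature map via cosine similarity,
    yielding an (H, W) text activation map (TAM)-style heatmap."""
    # Normalize visual features per location and the text embedding,
    # so the dot product below is a cosine similarity in [-1, 1].
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return v @ t  # shape (H, W)

# Stand-ins for a conv-backbone feature map and a caption-word embedding
# projected into the joint space (both hypothetical, randomly generated).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 64))
word = rng.normal(size=64)

tam = text_activation_map(feats, word)
print(tam.shape)  # (8, 8)
```

In the paper's setting, maps like this would be computed for class names and for attribute-noun compounds, and the latter are used to refine the class activation maps that supervise the segmentation network.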

URL

https://arxiv.org/abs/1905.06784

PDF

https://arxiv.org/pdf/1905.06784.pdf

