Paper Reading AI Learner

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

2023-03-21 08:43:15
Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen

Abstract

Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be produced freely with a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks for synthetic images generated by an off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the cross-attention maps between text and image, extending text-driven image synthesis to semantic mask generation naturally and seamlessly. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which is combined with practical techniques to produce novel high-resolution, class-discriminative pixel-wise masks. This substantially reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data achieve performance competitive with counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask shows promising performance, close to the state-of-the-art result obtained with real data (within a 3% mIoU gap). Moreover, in the open-vocabulary (zero-shot) segmentation setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at this https URL.
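The core mechanism the abstract describes is aggregating text-to-image cross-attention maps for a class word into a pixel-wise mask. Below is a minimal numpy sketch of that idea, not the paper's actual pipeline: the function name `attention_to_mask`, the array layout `(heads, H*W, tokens)`, and the simple min-max normalization plus fixed threshold are all illustrative assumptions; DiffuMask additionally uses refinement techniques not shown here.

```python
import numpy as np

def attention_to_mask(attn_maps, token_idx, image_hw=(64, 64), threshold=0.5):
    """Aggregate cross-attention maps for one text token into a binary mask.

    attn_maps: list of arrays shaped (heads, H*W, tokens), one per attention
    layer (different layers may attend at different spatial resolutions).
    token_idx: position of the class word in the prompt's token sequence.
    All names/shapes here are illustrative, not the paper's API.
    """
    acc = np.zeros(image_hw, dtype=np.float64)
    for a in attn_maps:
        heads, hw, _ = a.shape
        side = int(np.sqrt(hw))
        # Average over heads, then pick the column for the class word.
        m = a[:, :, token_idx].mean(axis=0).reshape(side, side)
        # Nearest-neighbour upsample to a common resolution and accumulate.
        scale = image_hw[0] // side
        acc += np.kron(m, np.ones((scale, scale)))
    acc /= len(attn_maps)
    # Min-max normalize, then threshold into a binary class mask.
    acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)
    return acc > threshold
```

In practice the attention maps would be captured by hooking the cross-attention layers of the Stable Diffusion U-Net during sampling; averaging across layers, heads, and denoising steps yields the coarse localization that is then refined into a high-resolution mask.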

URL

https://arxiv.org/abs/2303.11681

PDF

https://arxiv.org/pdf/2303.11681

