ZIM: Zero-Shot Image Matting for Anything

2024-11-01 14:34:33
Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, Joonsang Yu

Abstract

The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained, precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions. First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism that improves performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at this https URL.
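
The idea of attention that focuses on a prompted region can be pictured with a minimal sketch. This is not the ZIM implementation: it assumes a box prompt, a single query token, and a hard binary mask, and the names `box_to_attention_mask` and `masked_attention` are hypothetical helpers written only for illustration. It shows generic masked attention, where keys outside the prompted region receive zero attention weight.

```python
import math


def box_to_attention_mask(box, height, width):
    """Return a flat list of booleans, True where a pixel lies inside the box.

    Assumption: box is (x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    return [x0 <= x < x1 and y0 <= y < y1
            for y in range(height) for x in range(width)]


def masked_attention(query, keys, values, allowed):
    """Single-query scaled dot-product attention restricted to allowed keys."""
    if not any(allowed):
        raise ValueError("the mask must allow at least one key")
    scale = 1.0 / math.sqrt(len(query))
    # Disallowed keys get a score of -inf, so their softmax weight is exactly 0.
    scores = [scale * sum(q * k for q, k in zip(query, key)) if ok else -math.inf
              for key, ok in zip(keys, allowed)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy 4x4 feature map whose 2-D "features" are just the pixel coordinates
# (hypothetical values chosen so the result is easy to inspect).
height, width = 4, 4
keys = [[float(x), float(y)] for y in range(height) for x in range(width)]
values = keys
allowed = box_to_attention_mask((1, 1, 3, 3), height, width)
output, weights = masked_attention([1.0, 0.5], keys, values, allowed)
```

Because the output is a weighted average of the values inside the box only, it cannot be influenced by features outside the prompted region. A soft mask (for example, one that decays with distance from a point prompt) could be sketched the same way by adding a finite penalty to the scores instead of `-inf`; that variant is likewise an assumption, not something the abstract specifies.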

URL

https://arxiv.org/abs/2411.00626

PDF

https://arxiv.org/pdf/2411.00626.pdf

