Matting Anything

2023-06-08 17:51:58
Jiachen Li, Jitesh Jain, Humphrey Shi

Abstract

In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image, guided by flexible, interactive visual or linguistic user prompts. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM is capable of handling various types of image matting, including semantic, instance, and referring image matting, with only a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module, with only 2.7 million trainable parameters, to predict the alpha matte through iterative refinement; (iii) by incorporating SAM, MAM simplifies the user intervention required for interactive image matting from a trimap to a box, point, or text prompt. We evaluate the performance of MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves performance comparable to state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at this https URL.
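
For intuition, the following is a minimal PyTorch sketch of how a Mask-to-Matte-style head could fuse SAM's image embedding with a coarse segmentation mask and iteratively upsample it toward a full-resolution alpha matte. This is not the authors' implementation: the module name, channel sizes, and number of refinement steps are all illustrative assumptions.

```python
# Illustrative sketch only (not the MAM authors' code): a hypothetical
# Mask-to-Matte-style head that fuses SAM image features with a coarse
# mask and iteratively refines/upsamples them into an alpha matte.
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2MSketch(nn.Module):
    """Hypothetical M2M head: fuse SAM features with a coarse binary mask,
    then refine through a few 2x upsampling steps to predict an alpha matte."""

    def __init__(self, feat_channels=256, hidden=64, num_refine_steps=3):
        super().__init__()
        # Project SAM features + downsampled coarse mask into a hidden space.
        self.fuse = nn.Conv2d(feat_channels + 1, hidden, kernel_size=3, padding=1)
        # Lightweight refinement block reused at every upsampling step;
        # the coarse mask is re-injected at each scale.
        self.refine = nn.Sequential(
            nn.Conv2d(hidden + 1, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_alpha = nn.Conv2d(hidden, 1, kernel_size=1)
        self.num_refine_steps = num_refine_steps

    def forward(self, sam_feats, coarse_mask):
        # sam_feats:    (B, C, H/16, W/16) image embedding from SAM's encoder.
        # coarse_mask:  (B, 1, H, W) coarse mask from SAM's prompt decoder.
        mask_lr = F.interpolate(coarse_mask, size=sam_feats.shape[-2:],
                                mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([sam_feats, mask_lr], dim=1))
        for _ in range(self.num_refine_steps):
            # Upsample 2x and re-condition on the mask at the current scale.
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            m = F.interpolate(coarse_mask, size=x.shape[-2:],
                              mode="bilinear", align_corners=False)
            x = self.refine(torch.cat([x, m], dim=1))
        return torch.sigmoid(self.to_alpha(x))  # alpha values in [0, 1]

if __name__ == "__main__":
    head = M2MSketch()
    feats = torch.randn(1, 256, 64, 64)   # stand-in for SAM features of a 1024x1024 image
    mask = torch.rand(1, 1, 1024, 1024)   # stand-in for a coarse SAM mask
    print(head(feats, mask).shape)        # torch.Size([1, 1, 512, 512])
    print(sum(p.numel() for p in head.parameters()))  # a few hundred thousand params
```

This toy head is much smaller than the 2.7M-parameter module the paper reports; it is only meant to illustrate the feature-plus-mask iterative refinement idea, with user interaction (box, point, or text prompt) handled upstream by SAM.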

URL

https://arxiv.org/abs/2306.05399

PDF

https://arxiv.org/pdf/2306.05399.pdf

