
MaGGIe: Masked Guided Gradual Human Instance Matting

2024-04-24 17:59:53
Chuong Huynh, Seoung Wug Oh, Abhinav Shrivastava, Joon-Young Lee

Abstract

Human matting is a foundational task in image and video processing, in which human foreground pixels are extracted from the input. Prior works either improve accuracy through additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework, MaGGIe (Masked Guided Gradual Human Instance Matting), which predicts alpha mattes progressively for each human instance while maintaining computational cost, precision, and consistency. Our method leverages modern architectures, including transformer attention and sparse convolution, to output all instance mattes simultaneously without exploding memory or latency. While keeping inference costs constant in the multiple-instance scenario, our framework achieves robust and versatile performance on our proposed synthesized benchmarks. Alongside higher-quality image and video matting benchmarks, we introduce a novel multi-instance synthesis approach built from publicly available sources to improve the generalization of models in real-world scenarios.
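
The abstract refers to per-instance alpha mattes without defining them. As a minimal sketch, assuming the standard compositing model used throughout the matting literature (each pixel is a blend of foreground and background weighted by alpha), the relationship between instance mattes and the observed image could look like the following. The function name `composite`, the array shapes, and the constraint that the alphas at a pixel sum to at most 1 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def composite(alphas, foregrounds, background):
    """Blend N instance foregrounds over a background (hypothetical helper).

    alphas:      (N, H, W) per-instance alpha mattes with values in [0, 1]
    foregrounds: (N, H, W, 3) per-instance foreground colors
    background:  (H, W, 3) background colors
    """
    alphas = np.asarray(alphas, dtype=np.float64)
    foregrounds = np.asarray(foregrounds, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)

    # Assumption: the total opacity at any pixel cannot exceed 1.
    total = alphas.sum(axis=0)
    if np.any(total > 1.0 + 1e-6):
        raise ValueError("instance alphas must sum to at most 1 per pixel")

    # Weighted sum of instance foregrounds, then the remaining weight
    # (1 - total alpha) goes to the background.
    blended_fg = (alphas[..., None] * foregrounds).sum(axis=0)
    return blended_fg + (1.0 - total)[..., None] * background
```

Instance matting is the inverse problem: given only the composited image (plus guidance such as coarse instance masks), estimate each instance's alpha matte.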

URL

https://arxiv.org/abs/2404.16035

PDF

https://arxiv.org/pdf/2404.16035.pdf

