Abstract
Human matting is a foundational task in image and video processing, in which human foreground pixels are extracted from the input. Prior works either improve accuracy through additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework, MaGGIe (Masked Guided Gradual Human Instance Matting), which predicts alpha mattes progressively for each human instance while maintaining computational cost, precision, and consistency. Our method leverages modern architectures, including transformer attention and sparse convolution, to output all instance mattes simultaneously without exploding memory or latency. While keeping inference cost constant in the multiple-instance scenario, our framework achieves robust and versatile performance on our proposed synthesized benchmarks. Alongside higher-quality image and video matting benchmarks, a novel multi-instance synthesis approach based on publicly available sources is introduced to improve the generalization of models to real-world scenarios.
URL
https://arxiv.org/abs/2404.16035