Abstract
Natural image matting aims to estimate the alpha matte of the foreground in a given image. Various approaches have been explored to address this problem, such as interactive matting methods that rely on guidance like clicks or trimaps, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for specific objects or forms of guidance, neglecting the common requirement of aggregating global and local contexts in image matting. As a result, these methods often struggle to accurately identify the foreground and generate precise boundaries, which limits their effectiveness in unforeseen scenarios. In this paper, we propose a simple and universal matting framework, named Dual-Context Aggregation Matting (DCAM), which enables robust image matting with arbitrary guidance or without guidance. Specifically, DCAM first adopts a semantic backbone network to extract low-level features and context features from the input image and guidance. Then, we introduce a dual-context aggregation network that incorporates global object aggregators and local appearance aggregators to iteratively refine the extracted context features. By performing both global contour segmentation and local boundary refinement, DCAM exhibits robustness to diverse types of guidance and objects. Finally, we adopt a matting decoder network to fuse the low-level features and the refined context features for alpha matte estimation. Experimental results on five matting datasets demonstrate that the proposed DCAM outperforms state-of-the-art matting methods on both automatic matting and interactive matting tasks, which highlights the strong universality and high performance of DCAM. The source code is available at \url{this https URL}.
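The abstract describes a four-stage pipeline: backbone feature extraction, iterative dual-context aggregation (global object aggregators followed by local appearance aggregators), and a matting decoder. The sketch below illustrates only this dataflow; every function body is a hypothetical placeholder (the real components are learned networks from the paper), and all names besides those in the abstract are assumptions.

```python
def backbone(image, guidance=None):
    """Semantic backbone: extract low-level and context features (placeholder)."""
    low_level = {"scale": "fine"}
    context = {"scale": "coarse", "refine_steps": 0}
    return low_level, context

def global_object_aggregator(ctx):
    """Global contour segmentation over the whole object (placeholder)."""
    return {**ctx, "global": True}

def local_appearance_aggregator(ctx):
    """Local boundary refinement around object edges (placeholder)."""
    return {**ctx, "local": True}

def dual_context_aggregation(ctx, iters=3):
    """Iteratively refine context features with both aggregators.

    The iteration count is illustrative, not taken from the paper.
    """
    for _ in range(iters):
        ctx = local_appearance_aggregator(global_object_aggregator(ctx))
        ctx["refine_steps"] += 1
    return ctx

def matting_decoder(low_level, ctx):
    """Fuse low-level and refined context features into an alpha matte (placeholder)."""
    return {"alpha_matte": True,
            "fused_scales": (low_level["scale"], ctx["scale"])}

# guidance may be a trimap, clicks, or None (automatic matting)
low_level, context = backbone("image", guidance="trimap")
refined = dual_context_aggregation(context)
alpha = matting_decoder(low_level, refined)
```

The point of the sketch is the ordering: both aggregators run inside the refinement loop, so global contour segmentation and local boundary refinement alternate rather than running once each, and the decoder sees both the fine low-level features and the coarse refined context.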
URL
https://arxiv.org/abs/2402.18109