Abstract
For natural image matting, context information plays a crucial role in estimating alpha mattes, especially when it is challenging to distinguish the foreground from the background. Existing deep learning-based methods exploit specifically designed context aggregation modules to refine encoder features. However, the effectiveness of these modules has not been thoroughly explored. In this paper, we conduct extensive experiments to reveal that the context aggregation modules are actually not as effective as expected. We also demonstrate that, when trained on large image patches, basic encoder-decoder networks with a larger receptive field can effectively aggregate context to achieve better performance. Based on the above findings, we propose a simple yet effective matting network, named AEMatter, which enlarges the receptive field by incorporating an appearance-enhanced axis-wise learning block into the encoder and adopting a hybrid-transformer decoder. Experimental results on four datasets demonstrate that our AEMatter significantly outperforms state-of-the-art matting methods (e.g., on the Adobe Composition-1K dataset, \textbf{25\%} and \textbf{40\%} reduction in terms of SAD and MSE, respectively, compared against MatteFormer). The code and model are available at \url{this https URL}.
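The axis-wise learning idea mentioned above can be illustrated with a minimal sketch: self-attention restricted to one spatial axis (rows or columns) propagates context across the full image extent along that axis at linear cost in the other axis, and stacking the two directions covers the whole feature map. This is a hedged NumPy illustration of the general axis-wise attention technique, not the authors' AEMatter implementation; the function name `axis_attention` and the untrained identity projections are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(feat, axis):
    """Self-attention along a single spatial axis of an (H, W, C) feature map.

    Illustrative sketch only: queries, keys, and values use identity
    projections instead of learned weights.
    """
    x = np.moveaxis(feat, axis, 0)       # (L, N, C), L = length of attended axis
    L, N, C = x.shape
    x = x.transpose(1, 0, 2)             # (N, L, C): one sequence per row/column
    q, k, v = x, x, x                    # untrained projections for the sketch
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)   # (N, L, L)
    out = softmax(scores, axis=-1) @ v   # (N, L, C)
    out = out.transpose(1, 0, 2)         # (L, N, C)
    return np.moveaxis(out, 0, axis)     # back to (H, W, C)

feat = np.random.rand(8, 16, 4)
ctx = axis_attention(axis_attention(feat, 0), 1)  # attend along rows, then columns
print(ctx.shape)
```

Applying the two axis passes in sequence lets every output location aggregate context from the entire map, which is the kind of receptive-field enlargement the abstract attributes to the axis-wise block.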
URL
https://arxiv.org/abs/2304.01171