Abstract
Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing the remaining tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM uses efficient modules to approximate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on the image content and the stage of the attention block within the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images with an average resolution of $4872\times6017$. The dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotations. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, while significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
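To make the routing idea concrete, the following PyTorch sketch illustrates a router placed before a global attention block that sends a subset of "informative" tokens through full self-attention and the rest through a cheap refinement path. This is a minimal illustration under stated assumptions: the class names, the hard top-k selection, the `keep_ratio` budget, and the MLP-based refinement are all placeholders, not the paper's actual router, LTRM, or BATR (the abstract indicates routing is probability-based and batch-constrained, which this sketch does not model).

```python
# Illustrative sketch only: module names, top-k routing, and the MLP refinement
# are assumptions standing in for MEMatte's router / LTRM / BATR, whose details
# are not specified in the abstract.
import torch
import torch.nn as nn

class LightweightTokenRefinement(nn.Module):
    """Stand-in for the LTRM: a cheap per-token MLP approximating global attention."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):              # x: (num_light_tokens, dim)
        return x + self.mlp(x)         # residual refinement, O(N) in token count

class RoutedAttentionBlock(nn.Module):
    """Routes each token to either full self-attention or the lightweight path."""
    def __init__(self, dim, num_heads=8, keep_ratio=0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)      # per-token routing score
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.light = LightweightTokenRefinement(dim)
        self.keep_ratio = keep_ratio         # assumed fixed budget; BATR is adaptive

    def forward(self, x):                    # x: (B, N, dim)
        B, N, _ = x.shape
        scores = self.router(x).squeeze(-1)  # (B, N) routing scores
        k = max(1, int(N * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices  # indices of "informative" tokens
        out = x.clone()
        for b in range(B):                   # gather/scatter per sample
            heavy = x[b, idx[b]].unsqueeze(0)
            h = self.norm(heavy)
            heavy = heavy + self.attn(h, h, h, need_weights=False)[0]
            out[b, idx[b]] = heavy.squeeze(0)
            mask = torch.ones(N, dtype=torch.bool, device=x.device)
            mask[idx[b]] = False
            out[b, mask] = self.light(x[b, mask])  # cheap path for the rest
        return out

if __name__ == "__main__":
    block = RoutedAttentionBlock(dim=384)
    tokens = torch.randn(1, 4096, 384)       # e.g., 64x64 patch grid
    print(block(tokens).shape)               # torch.Size([1, 4096, 384])
```

Note that the hard top-k routing shown here is not differentiable; training a learned router typically requires soft routing probabilities or a straight-through estimator, which the sketch omits for brevity.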
Abstract (translated)
Transformer-based models have recently achieved excellent results in image matting. However, because global self-attention has quadratic complexity, applying these models to high-resolution images remains challenging. To address this problem, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte adds a router before each global attention block that directs informative tokens to the global attention, while the other tokens are routed to a Lightweight Token Refinement Module (LTRM). Specifically, the router adopts a local-global strategy to predict the routing probability of each token, and the LTRM uses efficient modules to approximate global attention. In addition, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens according to the image content and the stage of the attention block within the network. We also construct an ultra high-resolution image matting dataset, UHR-395, containing 35,500 training images and 1,000 test images with an average resolution of $4872\times6017$. The dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotations. Extensive experiments show that MEMatte outperforms existing methods on both high-resolution and real-world datasets, while reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
URL
https://arxiv.org/abs/2412.10702