Paper Reading AI Learner

Memory Efficient Matting with Adaptive Token Routing

2024-12-14 06:21:24
Yiheng Lin, Yihan Hu, Chenyi Zhang, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei

Abstract

Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing the remaining tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM uses efficient modules to approximate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stage of the attention block within the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of $4872\times6017$. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
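To make the routing idea concrete, here is a minimal NumPy sketch of the pattern the abstract describes: a router scores each token, a budgeted top-scoring subset goes through quadratic self-attention, and the rest pass through a cheap per-token refinement. All weights, the top-k selection rule, and the linear "refinement" are illustrative stand-ins, not the paper's trained router, BATR mechanism, or LTRM.

```python
import numpy as np

def route_tokens(tokens, router_w, keep_ratio=0.25):
    """Score tokens with a (stand-in) linear router and split them:
    the top keep_ratio fraction goes to global attention, the rest
    to the lightweight path. Returns the two index sets."""
    n, _ = tokens.shape
    scores = tokens @ router_w               # one routing logit per token
    k = max(1, int(n * keep_ratio))          # token budget for global attention
    order = np.argsort(-scores)              # descending by routing score
    return order[:k], order[k:]

def global_attention(x):
    """Plain softmax self-attention over the routed subset: O(k^2)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def light_refine(x, w):
    """Stand-in for the LTRM: a cheap per-token linear map, O(n)."""
    return x @ w

rng = np.random.default_rng(0)
n, d = 16, 8
tokens = rng.standard_normal((n, d))
attn_idx, light_idx = route_tokens(tokens, rng.standard_normal(d))

# Process each subset on its own path, then scatter back into place.
out = np.empty_like(tokens)
out[attn_idx] = global_attention(tokens[attn_idx])
out[light_idx] = light_refine(tokens[light_idx], rng.standard_normal((d, d)))
```

The memory saving comes from the quadratic term: attention cost drops from $O(n^2)$ to $O(k^2)$ with $k \ll n$, while the routed-away tokens are handled in linear time.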


URL

https://arxiv.org/abs/2412.10702

PDF

https://arxiv.org/pdf/2412.10702.pdf

