In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page: this https URL
https://arxiv.org/abs/2311.13535
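The abstract describes a sequential refinement loop: the trimap is perturbed with noise and a pretrained diffusion model denoises it step by step, while a correction module realigns each intermediate estimate with the input image. Below is a minimal PyTorch-style sketch of such a loop under my own assumptions (a DDIM-like deterministic update, and `denoiser`/`corrector` as placeholder callables); it is not the authors' implementation.

```python
import torch

@torch.no_grad()
def refine_trimap_to_alpha(image, trimap, denoiser, corrector, alphas_cumprod, steps):
    """Sketch of coarse-to-fine alpha refinement with a diffusion prior.

    image:   (B, 3, H, W) input RGB
    trimap:  (B, 1, H, W) coarse guidance in [0, 1]
    denoiser(x_t, t, image) -> predicted noise at step t
    corrector(x0_hat, image) -> image-aligned correction of the current estimate
    alphas_cumprod: (T,) cumulative noise schedule of the pretrained model
    steps: descending list of timesteps, e.g. [999, 900, ..., 0]
    """
    t_start = steps[0]
    a_start = alphas_cumprod[t_start]
    # Start from the trimap perturbed to the noise level of the first step.
    x_t = a_start.sqrt() * trimap + (1 - a_start).sqrt() * torch.randn_like(trimap)

    for t, t_prev in zip(steps, steps[1:] + [-1]):
        a_t = alphas_cumprod[t]
        eps = denoiser(x_t, t, image)                          # predict noise at step t
        x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # current clean estimate
        x0_hat = corrector(x0_hat, image)                      # align with image structures
        if t_prev < 0:
            return x0_hat.clamp(0, 1)                          # final alpha matte
        a_prev = alphas_cumprod[t_prev]
        # Deterministic DDIM-style move to the next, less noisy step.
        x_t = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
```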
We introduce the notion of point affiliation into feature upsampling. By abstracting a feature map into non-overlapping semantic clusters formed by points of identical semantic meaning, feature upsampling can be viewed as point affiliation -- designating a semantic cluster for each upsampled point. In the framework of kernel-based dynamic upsampling, we show that an upsampled point can resort to its low-res decoder neighbors and high-res encoder point to reason the affiliation, conditioned on the mutual similarity between them. We therefore present a generic formulation for generating similarity-aware upsampling kernels and prove that such kernels encourage not only semantic smoothness but also boundary sharpness. This formulation constitutes a novel, lightweight, and universal upsampling solution, Similarity-Aware Point Affiliation (SAPA). We show its working mechanism via our preliminary designs with a window-shaped kernel. After probing the limitations of the designs on object detection, we reveal additional insights for upsampling, leading to SAPA with the dynamic kernel shape. Extensive experiments demonstrate that SAPA outperforms prior upsamplers and invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation. Code is made available at: this https URL
https://arxiv.org/abs/2307.08198
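The core mechanism is kernel generation from encoder-decoder similarity: each high-res encoder point is compared with its low-res decoder neighbors, and the normalized similarities become the upsampling kernel. The sketch below assumes a plain dot-product similarity, a k x k decoder window, and softmax normalization; these are my illustrative choices, not SAPA's exact formulation.

```python
import torch
import torch.nn.functional as F

def similarity_aware_upsample(decoder, encoder, k=5):
    """Sketch: upsample `decoder` (B, C, h, w) to the encoder resolution (B, C, H, W)
    with kernels built from encoder-decoder similarity (dot product + softmax assumed)."""
    B, C, _, _ = decoder.shape
    H, W = encoder.shape[-2:]
    # Each high-res encoder point looks at a k x k window of (nearest-aligned) decoder points.
    dec_up = F.interpolate(decoder, size=(H, W), mode='nearest')       # align grids
    windows = F.unfold(dec_up, kernel_size=k, padding=k // 2)          # (B, C*k*k, H*W)
    windows = windows.view(B, C, k * k, H * W)
    query = encoder.view(B, C, 1, H * W)
    sim = (query * windows).sum(dim=1) / C ** 0.5                      # (B, k*k, H*W)
    kernel = sim.softmax(dim=1)                                        # similarity-aware kernel
    out = (windows * kernel.unsqueeze(1)).sum(dim=2)                   # weighted decoder content
    return out.view(B, C, H, W)
```

Because the kernel reweights decoder content rather than encoder content, the output stays semantically smooth inside a cluster while the encoder similarity sharpens it at boundaries, which is the property the abstract emphasizes.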
In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image with flexible and interactive visual or linguistic user prompt guidance. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM is capable of dealing with various types of image matting, including semantic, instance, and referring image matting, with only a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module, with only 2.7 million trainable parameters, to predict the alpha matte through iterative refinement; (iii) by incorporating SAM, MAM simplifies the user intervention required for the interactive use of image matting from the trimap to the box, point, or text prompt. We evaluate the performance of MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves comparable performance to the state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at this https URL.
https://arxiv.org/abs/2306.05399
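The Mask-to-Matte idea is a small head that turns a coarse segmentation mask plus frozen backbone features into an alpha matte over a few refinement iterations. The module below is a hypothetical stand-in for such a head (generic convolutions and a residual update in logit space, names and layer sizes are my assumptions); it is not MAM's actual M2M architecture.

```python
import torch
import torch.nn as nn

class MaskToMatteSketch(nn.Module):
    """Hypothetical lightweight head: refine a coarse mask into an alpha matte
    using frozen backbone features, over a few refinement iterations."""
    def __init__(self, feat_dim=256, hidden=64, iters=3):
        super().__init__()
        self.iters = iters
        self.refine = nn.Sequential(
            nn.Conv2d(feat_dim + 1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, features, coarse_mask):
        # features: (B, feat_dim, H, W); coarse_mask: (B, 1, H, W) in [0, 1]
        alpha = coarse_mask
        for _ in range(self.iters):
            residual = self.refine(torch.cat([features, alpha], dim=1))
            alpha = torch.sigmoid(torch.logit(alpha.clamp(1e-4, 1 - 1e-4)) + residual)
        return alpha
```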
Natural image matting algorithms aim to predict the transparency map (alpha matte) under trimap guidance. However, producing trimaps often requires significant labor, which limits the large-scale application of matting algorithms. To address this issue, we propose the Matte Anything model (MatAny), an interactive natural image matting model that can produce high-quality alpha mattes from various simple hints. The key insight of MatAny is to generate pseudo trimaps automatically from contour and transparency predictions. We leverage task-specific vision models to enhance the performance of natural image matting. Specifically, we use the Segment Anything Model (SAM) to predict high-quality contours with user interaction and an open-vocabulary (OV) detector to predict the transparency of any object. Subsequently, a pretrained image matting model generates alpha mattes from the pseudo trimaps. To date, MatAny is the interactive matting algorithm that supports the most interaction methods and achieves the best performance. It consists of orthogonal vision models without any additional training. We evaluate the performance of MatAny against several current image matting algorithms, and the results demonstrate the significant potential of our approach.
https://arxiv.org/abs/2306.04121
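The pipeline hinges on converting a predicted contour/mask into a pseudo trimap before the pretrained matting model runs. A common recipe for this is to erode and dilate the binary mask and mark the band in between as unknown; the sketch below uses that standard recipe (the kernel size and the way transparent objects are handled are my assumptions, not necessarily MatAny's).

```python
import numpy as np
import cv2

def pseudo_trimap(mask, unknown_width=20, transparent=False):
    """Build a pseudo trimap from a binary mask (H, W) in {0, 1}.
    0 = background, 128 = unknown, 255 = foreground.
    If the object is predicted transparent, keep its whole interior as unknown."""
    mask = (mask > 0).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (unknown_width, unknown_width))
    fg = cv2.erode(mask, kernel)          # confident foreground core
    dilated = cv2.dilate(mask, kernel)    # anything possibly touched by the object
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[dilated > 0] = 128             # unknown band (and interior, for now)
    if not transparent:
        trimap[fg > 0] = 255              # opaque objects keep a solid foreground core
    return trimap
```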
Cutting out an object and estimating its opacity mask, known as image matting, is a key task in image and video editing. Because the problem is highly ill-posed, additional inputs, typically user-defined trimaps or scribbles, are usually needed to reduce the uncertainty. Although effective, this is either time-consuming or suitable only for experienced users who know where to place the strokes. In this work, we propose a decomposed-uncertainty-guided matting (dugMatting) algorithm, which explores explicitly decomposed uncertainties to efficiently and effectively improve the results. Based on the characteristics of these uncertainties, the epistemic uncertainty is reduced in the process of guiding interaction (which introduces prior knowledge), while the aleatoric uncertainty is reduced in modeling the data distribution (which introduces statistics for both data and possible noise). The proposed matting framework relieves users of the need to determine the interaction areas, requiring only simple and efficient labeling. Extensive quantitative and qualitative results validate that the proposed method significantly improves the original matting algorithms in terms of both efficiency and efficacy.
https://arxiv.org/abs/2306.01452
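The key distinction is between epistemic uncertainty (reducible by user interaction) and aleatoric uncertainty (inherent data/noise ambiguity). A standard way to obtain both, Monte Carlo dropout for the epistemic part and a predicted-variance head for the aleatoric part, is sketched below; this is the generic recipe, not dugMatting's exact decomposition.

```python
import torch

@torch.no_grad()
def decomposed_uncertainty(model, image, n_samples=8):
    """Generic sketch: `model` returns (alpha_mean, alpha_log_var) and is assumed to
    keep its dropout layers active at inference time.
    Epistemic ~ variance of the predicted means across stochastic passes.
    Aleatoric ~ average of the predicted per-pixel variances."""
    means, variances = [], []
    for _ in range(n_samples):
        alpha_mean, alpha_log_var = model(image)      # stochastic forward pass
        means.append(alpha_mean)
        variances.append(alpha_log_var.exp())
    means = torch.stack(means)                        # (n, B, 1, H, W)
    epistemic = means.var(dim=0)                      # disagreement between passes
    aleatoric = torch.stack(variances).mean(dim=0)    # predicted data noise
    return means.mean(dim=0), epistemic, aleatoric
```

Pixels with high epistemic uncertainty are the natural places to request user hints, which is what makes the interaction in this line of work cheap and targeted.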
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) a detail capture module that consists of simple, lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs on image matting with concise adaptation. It inherits many superior properties from ViTs, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting; our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
https://arxiv.org/abs/2305.15272
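The idea of a detail capture module is that a plain ViT backbone loses fine structure at 1/16 resolution, so a cheap convolutional stream on the full-resolution input supplies the missing detail. The sketch below shows one plausible fusion scheme (concatenating upsampled ViT features with a small conv stream before a matting head); layer counts and the fusion choice are my assumptions, not ViTMatte's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailCaptureSketch(nn.Module):
    """Hypothetical detail branch: cheap convs on the full-resolution input,
    fused with low-resolution ViT features to recover fine matting detail."""
    def __init__(self, vit_dim=384, detail_dim=32):
        super().__init__()
        self.detail = nn.Sequential(
            nn.Conv2d(4, detail_dim, 3, padding=1), nn.ReLU(inplace=True),   # RGB + trimap
            nn.Conv2d(detail_dim, detail_dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(vit_dim + detail_dim, 1, 3, padding=1)

    def forward(self, image_trimap, vit_features):
        # image_trimap: (B, 4, H, W); vit_features: (B, vit_dim, H/16, W/16)
        details = self.detail(image_trimap)
        coarse = F.interpolate(vit_features, size=details.shape[-2:],
                               mode='bilinear', align_corners=False)
        return torch.sigmoid(self.head(torch.cat([coarse, details], dim=1)))
```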
Image matting refers to extracting a precise alpha matte from natural images, and it plays a critical role in various downstream applications, such as image editing. Although the problem is ill-posed, traditional methods have been trying to solve it for decades. The emergence of deep learning has revolutionized the field of image matting and given birth to multiple new techniques, including automatic, interactive, and referring image matting. This paper presents a comprehensive review of recent advancements in image matting in the era of deep learning. We focus on two fundamental sub-tasks: auxiliary input-based image matting, which involves user-defined input to predict the alpha matte, and automatic image matting, which generates results without any manual intervention. We systematically review the existing methods for these two tasks according to their task settings and network structures and provide a summary of their advantages and disadvantages. Furthermore, we introduce the commonly used image matting datasets and evaluate the performance of representative matting methods both quantitatively and qualitatively. Finally, we discuss relevant applications of image matting and highlight existing challenges and potential opportunities for future research. We also maintain a public repository to track the rapid development of deep image matting at this https URL.
https://arxiv.org/abs/2304.04672
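For reference, every method collected here, whether trimap-based, interactive, or automatic, ultimately solves the same inverse problem: recovering the per-pixel opacity in the standard compositing equation,

```latex
I_p = \alpha_p F_p + (1 - \alpha_p) B_p, \qquad \alpha_p \in [0, 1],
```

where, at each pixel p, the observed color I_p mixes an unknown foreground color F_p and an unknown background color B_p. Each pixel provides three equations (RGB channels) but seven unknowns, which is why the problem is ill-posed and why auxiliary input such as a trimap, scribble, or prompt is usually needed.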
For natural image matting, context information plays a crucial role in estimating alpha mattes, especially when it is challenging to distinguish the foreground from its background. Existing deep learning-based methods exploit specifically designed context aggregation modules to refine encoder features. However, the effectiveness of these modules has not been thoroughly explored. In this paper, we conduct extensive experiments to reveal that the context aggregation modules are actually not as effective as expected. We also demonstrate that when learned on large image patches, basic encoder-decoder networks with a larger receptive field can effectively aggregate context to achieve better performance. Building on these findings, we propose a simple yet effective matting network, named AEMatter, which enlarges the receptive field by incorporating an appearance-enhanced axis-wise learning block into the encoder and adopting a hybrid-transformer decoder. Experimental results on four datasets demonstrate that our AEMatter significantly outperforms state-of-the-art matting methods (e.g., on the Adobe Composition-1K dataset, 25% and 40% reductions in SAD and MSE, respectively, compared with MatteFormer). The code and model are available at: this https URL.
https://arxiv.org/abs/2304.01171
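The takeaway is that receptive field size, not bespoke context modules, drives matting accuracy. A common low-cost way to obtain a global receptive field is axis-wise (row-then-column) self-attention; the block below shows that generic pattern only, not AEMatter's appearance-enhanced axis-wise learning block.

```python
import torch
import torch.nn as nn

class AxisWiseAttentionSketch(nn.Module):
    """Generic axis-wise attention: attend along rows, then along columns,
    so every position can reach the whole image in two cheap steps."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, C, H, W = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(B * H, W, C)     # attend along width
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)     # attend along height
        cols, _ = self.col_attn(cols, cols, cols)
        x = cols.reshape(B, W, H, C).permute(0, 3, 2, 1)      # back to (B, C, H, W)
        return x
```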
Image matting requires high-quality pixel-level human annotations to support the training of a deep model in the recent literature. However, such annotation is costly and hard to scale, significantly holding back the development of the field. In this work, we make the first attempt towards addressing this problem by proposing a self-supervised pre-training approach that can leverage virtually unlimited amounts of data to boost matting performance. The pre-training task is designed in a similar manner to image matting, where a random trimap and alpha matte are generated to achieve an image disentanglement objective. The pre-trained model is then used as an initialisation of the downstream matting task for fine-tuning. Extensive experimental evaluations show that the proposed approach outperforms both the state-of-the-art matting methods and other alternative self-supervised initialisation approaches by a large margin. We also show the robustness of the proposed approach over different backbone architectures. The code and models will be publicly available.
https://arxiv.org/abs/2304.00784
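The pretext task mimics matting on unlabeled data: two images are composited through a synthetic alpha matte, and the network learns to disentangle them. The sketch below generates one such sample under my own assumptions (smoothed random noise as the pseudo alpha, erosion for the confident trimap regions); the paper's actual generation procedure may differ.

```python
import numpy as np
import cv2

def pretext_sample(img_a, img_b, unknown_width=25):
    """Composite two unlabeled images through a random pseudo alpha matte.
    Returns (composite, pseudo_alpha, pseudo_trimap); the disentanglement
    objective is to recover pseudo_alpha (and/or img_a) from the composite."""
    h, w = img_a.shape[:2]
    noise = np.random.rand(h // 8, w // 8).astype(np.float32)
    alpha = cv2.resize(noise, (w, h), interpolation=cv2.INTER_CUBIC)
    alpha = cv2.GaussianBlur(alpha, (0, 0), sigmaX=7)
    alpha = np.clip((alpha - alpha.min()) / (alpha.max() - alpha.min() + 1e-6), 0, 1)
    composite = alpha[..., None] * img_a + (1 - alpha[..., None]) * img_b
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (unknown_width, unknown_width))
    fg = cv2.erode((alpha > 0.95).astype(np.uint8), k)   # confident foreground
    bg = cv2.erode((alpha < 0.05).astype(np.uint8), k)   # confident background
    trimap = np.full((h, w), 128, np.uint8)
    trimap[fg > 0], trimap[bg > 0] = 255, 0
    return composite.astype(img_a.dtype), alpha, trimap
```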
Image matting aims to predict the alpha values of elaborate uncertainty areas of natural images, like hair, smoke, and spider webs. However, existing methods perform poorly when faced with highly transparent foreground objects due to the large area of uncertainty to predict and the small receptive field of convolutional networks. To address this issue, we propose a Transformer-based network (TransMatting) to model transparent objects with long-range features and collect a high-resolution matting dataset of transparent objects (Transparent-460) for performance evaluation. Specifically, to utilize the semantic information in the trimap flexibly and effectively, we also redesign the trimap as three learnable tokens, named tri-token. Both Transformer and convolutional matting models can benefit from our proposed tri-token design. By replacing the traditional trimap concatenation strategy with our tri-token, existing matting methods can achieve about a 10% improvement in SAD and 20% in MSE. Equipped with the new tri-token design, our proposed TransMatting outperforms current state-of-the-art methods on several popular matting benchmarks and our newly collected Transparent-460.
https://arxiv.org/abs/2303.06476
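One simple reading of the tri-token idea, a learnable embedding per trimap class (background / unknown / foreground) added to the matching patch embeddings instead of concatenating the trimap as an extra channel, is sketched below. The pooling choice and injection point are my assumptions; the paper injects the tokens into self-attention in a more involved way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriTokenSketch(nn.Module):
    """Hypothetical tri-token injection: one learnable vector per trimap class
    added to the corresponding patch embeddings."""
    def __init__(self, embed_dim=384, patch=16):
        super().__init__()
        self.patch = patch
        self.tri_tokens = nn.Embedding(3, embed_dim)   # 0=bg, 1=unknown, 2=fg

    def forward(self, patch_tokens, trimap):
        # patch_tokens: (B, N, D) with N = (H/patch) * (W/patch); trimap: (B, 1, H, W) in {0, 1, 2}
        labels = F.max_pool2d(trimap.float(), self.patch).long()   # one label per patch
        labels = labels.flatten(1)                                  # (B, N); max favors fg/unknown
        return patch_tokens + self.tri_tokens(labels)
```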
We study the composition style in deep image matting, a notion that characterizes a data generation flow on how to exploit limited foregrounds and random backgrounds to form a training dataset. Prior art executes this flow in a completely random manner by simply going through the foreground pool or by optionally combining two foregrounds before foreground-background composition. In this work, we first show that naive foreground combination can be problematic and therefore derive an alternative formulation to reasonably combine foregrounds. Our second contribution is an observation that matting performance can benefit from a certain occurrence frequency of combined foregrounds and their associated source foregrounds during training. Inspired by this, we introduce a novel composition style that binds the source and combined foregrounds in a definite triplet. In addition, we also find that different orders of foreground combination lead to different foreground patterns, which further inspires a quadruplet-based composition style. Results under controlled experiments on four matting baselines show that our composition styles outperform existing ones and invite consistent performance improvement on both composited and real-world datasets. Code is available at: this https URL
https://arxiv.org/abs/2212.13517
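The "composition flow" discussed here is the standard way matting training data is synthesized: a labeled foreground and its alpha are pasted over a random background, optionally after merging two foregrounds. The sketch below shows the basic composite step and the naive two-foreground merge that the paper identifies as problematic; the paper's corrected combination formula and triplet/quadruplet sampling are not reproduced.

```python
import numpy as np

def composite(fg, alpha, bg):
    """Standard matting composition: I = alpha * F + (1 - alpha) * B.
    fg, bg: (H, W, 3); alpha: (H, W) in [0, 1]."""
    a = alpha[..., None].astype(np.float32)
    return (a * fg + (1.0 - a) * bg).astype(fg.dtype)

def naive_combine(fg1, alpha1, fg2, alpha2):
    """Naive two-foreground merge used in prior data pipelines: place fg1 over fg2.
    The combined alpha is 1 - (1 - alpha1)(1 - alpha2)."""
    alpha = 1.0 - (1.0 - alpha1) * (1.0 - alpha2)
    a1 = alpha1[..., None].astype(np.float32)
    fg = a1 * fg1 + (1.0 - a1) * fg2
    return fg.astype(fg1.dtype), alpha
```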
The performance of trimap-free image matting methods is limited when they try to decouple the deterministic and undetermined regions, especially in scenes where foregrounds are semantically ambiguous, chromaless, or of high transmittance. In this paper, we propose a novel framework named Privileged Prior Information Distillation for Image Matting (PPID-IM) that can effectively transfer privileged prior environment-aware information to improve the performance of students in solving hard foregrounds. The prior information of the trimap regulates only the teacher model during the training stage and is not fed into the student network during actual inference. In order to achieve effective privileged cross-modality (i.e., trimap and RGB) information distillation, we introduce a Cross-Level Semantic Distillation (CLSD) module that reinforces the trimap-free students with more knowledgeable semantic representations and environment-aware information. We also propose an Attention-Guided Local Distillation module that efficiently transfers privileged local attributes from the trimap-based teacher to trimap-free students for the guidance of local-region optimization. Extensive experiments demonstrate the effectiveness and superiority of our PPID framework on the task of image matting. In addition, our trimap-free IndexNet-PPID surpasses the other competing state-of-the-art methods by a large margin, especially in scenarios with chromaless, weak-texture, or irregular objects.
https://arxiv.org/abs/2211.14036
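The framework is privileged knowledge distillation: a teacher that sees the trimap transfers its representations to a trimap-free student, which is all the student relies on at inference. The loss below is a generic sketch of that setup (projected feature MSE plus the usual alpha regression term, with an assumed weight); the CLSD and attention-guided modules themselves are not reproduced.

```python
import torch.nn as nn
import torch.nn.functional as F

def privileged_distillation_loss(student_feats, teacher_feats, student_alpha,
                                 gt_alpha, proj: nn.Module, distill_weight=0.5):
    """Generic sketch of privileged distillation for matting.
    student_feats: (B, C_s, H, W), teacher_feats: (B, C_t, H, W);
    `proj` maps student features to the teacher's channel dimension.
    The teacher saw the trimap during training; the student only sees RGB."""
    distill = F.mse_loss(proj(student_feats), teacher_feats.detach())  # privileged feature match
    regression = F.l1_loss(student_alpha, gt_alpha)                    # standard alpha supervision
    return regression + distill_weight * distill
```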
This paper reviews recent deep-learning-based matting research and presents our "wider and higher" motivation for image matting. Many approaches achieve alpha mattes with complex encoders that extract robust semantics and then resort to a U-Net-like decoder to concatenate or fuse encoder features. However, image matting is essentially a pixel-wise regression, and the ideal situation is to perceive the maximum opacity correspondence from the input image. In this paper, we argue that high-resolution feature representation, perception, and communication are more crucial for matting accuracy. Therefore, we propose an Intensive Integration and Global Foreground Perception network (I2GFP) to integrate wider and higher feature streams. "Wider" means we combine intensive features in each decoder stage, while "higher" means we retain high-resolution intermediate features and perceive large-scale foreground appearance. Our design sacrifices model depth for a significant performance gain. We perform extensive experiments to validate the proposed I2GFP model, and state-of-the-art results are achieved on different public datasets.
https://arxiv.org/abs/2210.06919
Most matting research resorts to advanced semantics to achieve high-quality alpha mattes, and a direct combination of low-level features is usually explored to complement alpha details. However, we argue that appearance-agnostic integration can only provide biased foreground details, and alpha mattes require different-level feature aggregation for better pixel-wise opacity perception. In this paper, we propose an end-to-end Hierarchical and Progressive Attention Matting Network (HAttMatting++), which can better predict the opacity of the foreground from single RGB images without additional input. Specifically, we utilize channel-wise attention to distill pyramidal features and employ spatial attention at different levels to filter appearance cues. This progressive attention mechanism can estimate alpha mattes from adaptive semantics and semantics-indicated boundaries. We also introduce a hybrid loss function fusing Structural SIMilarity (SSIM), Mean Square Error (MSE), adversarial loss, and sentry supervision to guide the network to further improve the overall foreground structure. In addition, we construct a large-scale and challenging image matting dataset comprised of 59,600 training images and 1,000 test images (a total of 646 distinct foreground alpha mattes), which can further improve the robustness of our hierarchical and progressive aggregation model. Extensive experiments demonstrate that the proposed HAttMatting++ can capture sophisticated foreground structures and achieve state-of-the-art performance with single RGB images as input.
https://arxiv.org/abs/2210.06906
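The hybrid loss combines structural (SSIM) and pixel-wise (MSE) terms with adversarial and sentry supervision. The sketch below implements only the SSIM + MSE part with an assumed weighting (the adversarial and sentry terms are omitted), using a compact SSIM with a uniform averaging window rather than the usual Gaussian one.

```python
import torch.nn.functional as F

def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Compact SSIM with a uniform (average-pooling) window; inputs in [0, 1], shape (B, 1, H, W)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    var_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
        ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return s.mean()

def hybrid_matting_loss(pred_alpha, gt_alpha, ssim_weight=0.5):
    """Assumed weighting of the SSIM and MSE terms; adversarial/sentry terms omitted."""
    return F.mse_loss(pred_alpha, gt_alpha) + ssim_weight * (1.0 - ssim(pred_alpha, gt_alpha))
```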
Usually, lesions are not isolated but are associated with the surrounding tissues. For example, the growth of a tumour can depend on or infiltrate into the surrounding tissues. Due to the pathological nature of the lesions, it is challenging to distinguish their boundaries in medical imaging. However, these uncertain regions may contain diagnostic information. Therefore, the simple binarization of lesions by traditional binary segmentation can result in the loss of diagnostic information. In this work, we introduce image matting into 3D scenes and use the alpha matte, i.e., a soft mask, to describe lesions in a 3D medical image. Traditionally, the soft mask has acted as a training trick to compensate for easily mislabelled or under-labelled ambiguous regions. In contrast, 3D matting uses soft segmentation to characterize the uncertain regions more finely, which means that it retains more structural information for subsequent diagnosis and treatment. The study of image matting methods in 3D is currently limited. To address this issue, we conduct a comprehensive study of 3D matting, including both traditional and deep-learning-based methods. We adapt four state-of-the-art 2D image matting algorithms to 3D scenes and further customize the methods for CT images to calibrate the alpha matte with the radiodensity. Moreover, we propose the first end-to-end deep 3D matting network and implement a solid 3D medical image matting benchmark. Efficient counterparts are also proposed to achieve a good performance-computation balance. Furthermore, there is no high-quality annotated dataset related to 3D matting, which slows down the development of data-driven deep-learning-based methods. To address this issue, we construct the first 3D medical matting dataset. The validity of the dataset was verified through clinicians' assessments and downstream experiments.
https://arxiv.org/abs/2210.05104
We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features. In this way, the encoder feature point can function as a cue to inform the semantic cluster of upsampled feature points. To embody the formulation, we further instantiate a lightweight upsampling operator, termed Similarity-Aware Point Affiliation (SAPA), and investigate its variants. SAPA invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, depth estimation, and image matting. Code is available at: this https URL
https://arxiv.org/abs/2209.12866
Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis. Semantic ambiguity is a typical feature of many medical image labels. It can be caused by many factors, such as the imaging properties, pathological anatomy, and the weak representation of the binary masks, which brings challenges to accurate 3D segmentation. In 2D medical images, using soft masks instead of binary masks generated by image matting to characterize lesions can provide rich semantic information, describe the structural characteristics of lesions more comprehensively, and thus benefit the subsequent diagnoses and analyses. In this work, we introduce image matting into the 3D scenes to describe the lesions in 3D medical images. The study of image matting in 3D modality is limited, and there is no high-quality annotated dataset related to 3D matting, therefore slowing down the development of data-driven deep-learning-based methods. To address this issue, we constructed the first 3D medical matting dataset and convincingly verified the validity of the dataset through quality control and downstream experiments in lung nodules classification. We then adapt the four selected state-of-the-art 2D image matting algorithms to 3D scenes and further customize the methods for CT images. Also, we propose the first end-to-end deep 3D matting network and implement a solid 3D medical image matting benchmark, which will be released to encourage further research.
https://arxiv.org/abs/2209.07843
Image matting refers to predicting the alpha values of unknown foreground areas from natural images. Prior methods have focused on propagating alpha values from known to unknown regions. However, not all natural images have a specifically known foreground. Images of transparent objects, like glass, smoke, webs, etc., have little or no known foreground. In this paper, we propose a Transformer-based network, TransMatting, to model transparent objects with a large receptive field. Specifically, we redesign the trimap as three learnable tri-tokens for introducing advanced semantic features into the self-attention mechanism. A small convolutional network is proposed to utilize the global feature and non-background mask to guide the multi-scale feature propagation from encoder to decoder, maintaining the context of transparent objects. In addition, we create a high-resolution matting dataset of transparent objects with small known foreground areas. Experiments on several matting benchmarks demonstrate the superiority of our proposed method over the current state-of-the-art methods.
https://arxiv.org/abs/2208.03007
Recent studies have made great progress in video matting by extending the success of trimap-based image matting to the video domain. In this paper, we push this task toward a more practical setting and propose the One-Trimap Video Matting network (OTVM), which performs video matting robustly using only one user-annotated trimap. The key to OTVM is the joint modeling of trimap propagation and alpha prediction. Starting from baseline trimap propagation and alpha prediction networks, our OTVM combines the two networks with an alpha-trimap refinement module to facilitate information flow. We also present an end-to-end training strategy to take full advantage of the joint model. Our joint modeling greatly improves the temporal stability of trimap propagation compared to previous decoupled methods. We evaluate our model on two latest video matting benchmarks, Deep Video Matting and VideoMatting108, and outperform the state of the art by significant margins (MSE improvements of 56.4% and 56.7%, respectively). The source code and model are available online: this https URL.
https://arxiv.org/abs/2207.13353
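The essence of the one-trimap setting is a per-frame loop that carries the single annotated trimap forward while predicting and refining alpha jointly. The sketch below is a high-level version of that loop; `propagate`, `predict_alpha`, and `refine` are placeholder callables standing in for the paper's sub-networks, not their actual interfaces.

```python
def one_trimap_video_matting(frames, first_trimap, propagate, predict_alpha, refine):
    """Sketch of joint trimap propagation and alpha prediction over a clip.
    frames: list of images; first_trimap: user annotation for frames[0].
    propagate(prev_frame, prev_trimap, frame) -> trimap for `frame`
    predict_alpha(frame, trimap)              -> alpha matte
    refine(frame, trimap, alpha)              -> (refined_trimap, refined_alpha)
    """
    trimap = first_trimap
    alphas = []
    for i, frame in enumerate(frames):
        if i > 0:
            trimap = propagate(frames[i - 1], trimap, frame)   # carry guidance forward
        alpha = predict_alpha(frame, trimap)
        trimap, alpha = refine(frame, trimap, alpha)           # joint refinement couples the two
        alphas.append(alpha)
    return alphas
```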
We consider the problem of task-agnostic feature upsampling in dense prediction, where an upsampling operator is required to facilitate both region-sensitive tasks like semantic segmentation and detail-sensitive tasks such as image matting. Existing upsampling operators can often work well on one type of task, but not both. In this work, we present FADE, a novel, plug-and-play, and task-agnostic upsampling operator. FADE benefits from three design choices: i) considering encoder and decoder features jointly in upsampling kernel generation; ii) an efficient semi-shift convolutional operator that enables granular control over how each feature point contributes to upsampling kernels; iii) a decoder-dependent gating mechanism for enhanced detail delineation. We first study the upsampling properties of FADE on toy data and then evaluate it on large-scale semantic segmentation and image matting. In particular, FADE reveals its effectiveness and task-agnostic characteristic by consistently outperforming recent dynamic upsampling operators on different tasks. It also generalizes well across convolutional and transformer architectures with little computational overhead. Our work additionally provides thoughtful insights on what makes for task-agnostic upsampling. Code is available at: this http URL
https://arxiv.org/abs/2207.10392
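The third design choice, a decoder-dependent gate, decides per position how much detail-rich encoder content versus semantically smooth upsampled decoder content should pass through. A minimal sketch of such a gate is below (the semi-shift convolution and joint kernel generation are not reproduced, and the gate design is my own simplification).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionSketch(nn.Module):
    """Hypothetical decoder-dependent gating: blend upsampled decoder features
    with encoder features using a gate predicted from the decoder."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(dim, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, decoder, encoder):
        # decoder: (B, C, h, w); encoder: (B, C, 2h, 2w)
        up = F.interpolate(decoder, size=encoder.shape[-2:], mode='bilinear', align_corners=False)
        g = self.gate(up)                       # decoder-dependent gate in [0, 1]
        return g * encoder + (1 - g) * up       # detail where gated high, semantics elsewhere
```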