Labelling difficulty has been a longstanding problem in deep image matting. To avoid fine labels, this work explores using rough annotations, such as trimaps coarsely indicating the foreground/background, as supervision. We show that the cooperation between semantics learned from the indicated known regions and suitably assumed matting rules can help infer alpha values in transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) over each pixel neighborhood to constrain the alpha values conditioned on the input image. The DDC loss forces the distances of similar pairs on the alpha matte and on its corresponding image to be consistent. In this way, alpha values can be propagated from the learned known regions to the unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a known-region loss and the proposed DDC loss. Experiments on the AM-2K and P3M-10K datasets show that our paradigm achieves performance comparable to the fine-label-supervised baseline, while sometimes offering even more satisfying results than human-labelled ground truth. Code is available at \url{this https URL}.
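To make the neighbourhood-consistency idea above concrete, here is a minimal sketch of a distance-consistency penalty. It is an illustration under stated assumptions (an L1 gap between image-space and alpha-space pairwise distances inside a k×k window, weighted by image-space similarity); the paper's actual DDC loss, including its directional terms and pair selection, may differ.

```python
import torch
import torch.nn.functional as F

def ddc_loss_sketch(image, alpha, window=3, sigma=0.1):
    """Toy distance-consistency loss over local pixel neighbourhoods.

    image: (B, 3, H, W) in [0, 1]; alpha: (B, 1, H, W) in [0, 1].
    `window` and `sigma` are illustrative hyper-parameters, not the paper's.
    """
    b, _, h, w = image.shape
    pad = window // 2
    # Gather k x k neighbourhoods around every pixel.
    img_patches = F.unfold(image, window, padding=pad).reshape(b, 3, window * window, h * w)
    alp_patches = F.unfold(alpha, window, padding=pad).reshape(b, 1, window * window, h * w)
    img_center = image.reshape(b, 3, 1, h * w)
    alp_center = alpha.reshape(b, 1, 1, h * w)

    # Distance between the centre pixel and each neighbour, in image space
    # and in alpha space.
    img_dist = (img_patches - img_center).abs().mean(dim=1)   # (B, k*k, H*W)
    alp_dist = (alp_patches - alp_center).abs().mean(dim=1)   # (B, k*k, H*W)

    # Constrain mainly the "similar" pairs: down-weight pairs that are far
    # apart in image space, then penalise the distance mismatch.
    weight = torch.exp(-img_dist / sigma)
    return (weight * (alp_dist - img_dist).abs()).mean()
```

Combined with a loss on the trimap's known pixels, such a term lets alpha values propagate from the known regions into the unknown band.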
https://arxiv.org/abs/2408.10539
This paper introduces an innovative approach for image matting that redefines the traditional regression-based task as a generative modeling challenge. Our method harnesses the capabilities of latent diffusion models, enriched with extensive pre-trained knowledge, to regularize the matting process. We present novel architectural innovations that empower our model to produce mattes with superior resolution and detail. The proposed method is versatile and can perform both guidance-free and guidance-based image matting, accommodating a variety of additional cues. Our comprehensive evaluation across three benchmark datasets demonstrates the superior performance of our approach, both quantitatively and qualitatively. The results not only reflect our method's robust effectiveness but also highlight its ability to generate visually compelling mattes that approach photorealistic quality. The project page for this paper is available at this https URL
https://arxiv.org/abs/2407.21017
The goal of this work is to develop a task-agnostic feature upsampling operator for dense prediction, where the operator must facilitate not only region-sensitive tasks like semantic segmentation but also detail-sensitive tasks such as image matting. Prior upsampling operators can often work well on one type of task, but not both. We argue that task-agnostic upsampling should dynamically trade off between semantic preservation and detail delineation, instead of being biased toward either property. In this paper, we present FADE, a novel, plug-and-play, lightweight, and task-agnostic upsampling operator that fuses the assets of decoder and encoder features at three levels: i) considering both the encoder and decoder features in upsampling kernel generation; ii) controlling the per-point contribution of the encoder/decoder features in upsampling kernels with an efficient semi-shift convolutional operator; and iii) enabling the selective pass of encoder features with a decoder-dependent gating mechanism for compensating details. To improve the practicality of FADE, we additionally study parameter- and memory-efficient implementations of semi-shift convolution. We analyze the upsampling behavior of FADE on toy data and show through large-scale experiments that FADE is task-agnostic, with consistent performance improvements on a number of dense prediction tasks at little extra cost. For the first time, we demonstrate robust feature upsampling on both region- and detail-sensitive tasks. Code is made available at: this https URL
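As a rough illustration of point iii), the toy module below gates high-resolution encoder features with a decoder-dependent gate after a naive upsampling step. The layer choices and the 1x1 projection are assumptions for illustration; FADE's learned kernels, semi-shift convolution, and exact gating head are not reproduced.

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedSkipSketch(nn.Module):
    """Toy decoder-dependent gate for selectively passing high-res encoder
    features after upsampling (illustrating point iii of the abstract).
    Channel counts and the 1x1 projection are illustrative assumptions.
    """
    def __init__(self, dec_channels, enc_channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(dec_channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(dec_channels, enc_channels, kernel_size=1)

    def forward(self, decoder_feat, encoder_feat):
        # Bilinear upsampling stands in for FADE's learned upsampling kernels.
        up = F.interpolate(decoder_feat, size=encoder_feat.shape[-2:],
                           mode='bilinear', align_corners=False)
        g = self.gate(up)                          # (B, 1, H, W), in [0, 1]
        # Pass encoder detail where the gate opens, decoder semantics elsewhere.
        return (1 - g) * self.proj(up) + g * encoder_feat
```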
https://arxiv.org/abs/2407.13500
Image matting aims to obtain an alpha matte that accurately separates foreground objects from the background. Recently, trimap-free matting has been well studied because it requires only the original image without any extra input. Such methods usually extract a rough foreground on their own to take the place of a trimap as further guidance. However, the definition of 'foreground' lacks a unified standard, and thus ambiguities arise. Besides, the extracted foreground is sometimes incomplete due to inadequate network design. Most importantly, there is no large-scale real-world matting dataset, and current trimap-free methods trained with synthetic images suffer from large domain-shift problems in practice. In this paper, we define the salient object as the foreground, which is consistent with human cognition and with the annotations of current matting datasets. Meanwhile, data and techniques from salient object detection can be readily transferred to matting. To obtain a more accurate and complete alpha matte, we propose the \textbf{M}ulti-\textbf{F}eature fusion-based \textbf{C}oarse-to-fine Network \textbf{(MFC-Net)}, which fully integrates multiple features for an accurate and complete alpha matte. Furthermore, we introduce image harmonization in data composition to bridge the gap between synthetic and real images. More importantly, we establish the largest general real-world matting dataset \textbf{(Real-19k)} to date. Experiments show that our method is significantly effective on both synthetic and real-world images, and its performance on the real-world dataset is far better than that of existing trimap-free methods. Our code and data will be released soon.
https://arxiv.org/abs/2405.17916
Despite significant advancements in image matting, existing models heavily depend on manually drawn trimaps for accurate results in natural image scenarios. However, the process of obtaining trimaps is time-consuming and lacks user-friendliness and device compatibility. This reliance greatly limits the practical application of all trimap-based matting methods. To address this issue, we introduce Click2Trimap, an interactive model capable of predicting high-quality trimaps and alpha mattes with minimal user click input. By analyzing real users' behavioral logic and the characteristics of trimaps, we propose a powerful iterative three-class training strategy and a dedicated simulation function, which make Click2Trimap versatile across various scenarios. Quantitative and qualitative assessments on synthetic and real-world matting datasets demonstrate Click2Trimap's superior performance compared to all existing trimap-free matting methods. Notably, in the user study, Click2Trimap achieves high-quality trimap and matting predictions in just 5 seconds per image on average, demonstrating its substantial practical value in real-world applications.
https://arxiv.org/abs/2404.00335
We introduce in-context matting, a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points, scribbles, and masks, in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category, without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting, which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching, we introduce IconMatting, an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching, IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task, we also introduce a novel testing dataset, ICM-57, covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at this https URL
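A minimal sketch of the reference-to-target ("inter") similarity matching idea follows, assuming frozen backbone features and a single mask-pooled prototype per reference image; IconMatting's actual matching, and its intra-similarity term, are more involved.

```python
import torch
import torch.nn.functional as F

def reference_similarity_map(ref_feat, ref_mask, tgt_feat):
    """Score every target location by cosine similarity to a mask-pooled
    reference prototype, giving a coarse foreground prior for the target.

    ref_feat, tgt_feat: (B, C, H, W) features from a frozen backbone (an
    assumption); ref_mask: (B, 1, H, W) reference foreground mask in [0, 1].
    """
    # Mask-weighted average of reference features -> one prototype per image.
    proto = (ref_feat * ref_mask).sum(dim=(2, 3)) / ref_mask.sum(dim=(2, 3)).clamp(min=1e-6)
    proto = F.normalize(proto, dim=1)                    # (B, C)
    tgt = F.normalize(tgt_feat, dim=1)                   # (B, C, H, W)
    # Cosine similarity between the prototype and every target pixel.
    sim = torch.einsum('bc,bchw->bhw', proto, tgt)
    return sim.unsqueeze(1)                              # (B, 1, H, W)
```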
https://arxiv.org/abs/2403.15789
Natural image matting aims to estimate the alpha matte of the foreground from a given image. Various approaches have been explored to address this problem, such as interactive matting methods that use guidance such as click or trimap, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for specific objects or guidance, neglecting the common requirement of aggregating global and local contexts in image matting. As a result, these methods often encounter challenges in accurately identifying the foreground and generating precise boundaries, which limits their effectiveness in unforeseen scenarios. In this paper, we propose a simple and universal matting framework, named Dual-Context Aggregation Matting (DCAM), which enables robust image matting with arbitrary guidance or without guidance. Specifically, DCAM first adopts a semantic backbone network to extract low-level features and context features from the input image and guidance. Then, we introduce a dual-context aggregation network that incorporates global object aggregators and local appearance aggregators to iteratively refine the extracted context features. By performing both global contour segmentation and local boundary refinement, DCAM exhibits robustness to diverse types of guidance and objects. Finally, we adopt a matting decoder network to fuse the low-level features and the refined context features for alpha matte estimation. Experimental results on five matting datasets demonstrate that the proposed DCAM outperforms state-of-the-art matting methods in both automatic matting and interactive matting tasks, which highlights the strong universality and high performance of DCAM. The source code is available at \url{this https URL}.
https://arxiv.org/abs/2402.18109
We aim to leverage diffusion models to address the challenging image matting task. However, high computational overhead and the inconsistency of noise sampling between the training and inference processes pose significant obstacles to achieving this goal. In this paper, we present DiffMatte, a solution designed to effectively overcome these challenges. First, DiffMatte decouples the decoder from the intricately coupled matting network design, involving only one lightweight decoder in the iterations of the diffusion process. With this strategy, DiffMatte mitigates the growth of computational overhead as the number of samples increases. Second, we employ a self-aligned training strategy with uniform time intervals, ensuring consistent noise sampling between training and inference across the entire time domain. DiffMatte is designed with flexibility in mind and can seamlessly integrate into various modern matting architectures. Extensive experimental results demonstrate that DiffMatte not only reaches the state-of-the-art level on the Composition-1k test set, surpassing the previous best methods by 5% in SAD and 15% in MSE, but also shows stronger generalization ability on other benchmarks.
https://arxiv.org/abs/2312.05915
In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce the Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page at \url{this https URL}.
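The sequential refinement described above can be pictured as the loop below. Everything here is a hypothetical stand-in (the `denoiser`, the `corrector`, the DDIM-style update, and the noise schedule `alphas_cumprod`); it only illustrates "noise the trimap, then denoise and correct at every step", not DiffusionMat's actual components.

```python
import torch

@torch.no_grad()
def refine_trimap_sketch(trimap, image, denoiser, corrector, alphas_cumprod, steps):
    """Noise the trimap, then alternate denoising and per-step correction.

    trimap: (B, 1, H, W) in [-1, 1]; steps: increasing list of timesteps;
    alphas_cumprod: 1-D tensor of cumulative noise-schedule products.
    All callables and the schedule are hypothetical placeholders.
    """
    t_start = steps[-1]
    a_start = alphas_cumprod[t_start]
    x = a_start.sqrt() * trimap + (1 - a_start).sqrt() * torch.randn_like(trimap)

    rev = list(reversed(steps))
    for i, t in enumerate(rev):
        eps = denoiser(x, image, t)                        # predicted noise
        a_t = alphas_cumprod[t]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean matte
        x0 = corrector(x0, image)                          # per-step correction module
        a_prev = alphas_cumprod[rev[i + 1]] if i + 1 < len(rev) else torch.tensor(1.0)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM-style step
    return x.clamp(-1, 1)
```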
https://arxiv.org/abs/2311.13535
We introduce the notion of point affiliation into feature upsampling. By abstracting a feature map into non-overlapped semantic clusters formed by points of identical semantic meaning, feature upsampling can be viewed as point affiliation -- designating a semantic cluster for each upsampled point. In the framework of kernel-based dynamic upsampling, we show that an upsampled point can resort to its low-res decoder neighbors and high-res encoder point to reason the affiliation, conditioned on the mutual similarity between them. We therefore present a generic formulation for generating similarity-aware upsampling kernels and prove that such kernels encourage not only semantic smoothness but also boundary sharpness. This formulation constitutes a novel, lightweight, and universal upsampling solution, Similarity-Aware Point Affiliation (SAPA). We show its working mechanism via our preliminary designs with window-shape kernel. After probing the limitations of the designs on object detection, we reveal additional insights for upsampling, leading to SAPA with the dynamic kernel shape. Extensive experiments demonstrate that SAPA outperforms prior upsamplers and invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation. Code is made available at: this https URL
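A minimal sketch of the similarity-aware kernel idea: each high-resolution point builds a softmax kernel over its low-resolution decoder neighbours from feature similarity and takes the weighted sum of their values. The shared embedding dimension, window size, and scaling are assumptions; SAPA's gated similarity and dynamic kernel shape are omitted.

```python
import torch.nn.functional as F

def similarity_aware_upsample_sketch(encoder_hr, decoder_lr, window=3):
    """Upsample decoder features with kernels built from encoder-decoder
    similarity. Both inputs are assumed to share the channel dimension
    (e.g. after a linear projection).

    encoder_hr: (B, C, 2H, 2W); decoder_lr: (B, C, H, W).
    """
    b, c, H, W = encoder_hr.shape
    pad = window // 2
    # Align low-res neighbourhoods to the high-res grid (nearest neighbour).
    dec_up = F.interpolate(decoder_lr, size=(H, W), mode='nearest')
    neigh = F.unfold(dec_up, window, padding=pad).reshape(b, c, window * window, H * W)

    q = encoder_hr.reshape(b, c, 1, H * W)
    # Scaled dot-product similarity between each point and its neighbours.
    sim = (q * neigh).sum(dim=1) / c ** 0.5        # (B, k*k, H*W)
    kernel = sim.softmax(dim=1)                    # similarity-aware kernel

    out = (neigh * kernel.unsqueeze(1)).sum(dim=2) # weighted sum of neighbour values
    return out.reshape(b, c, H, W)
```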
https://arxiv.org/abs/2307.08198
In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image with flexible and interactive visual or linguistic user prompt guidance. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM is capable of dealing with various types of image matting, including semantic, instance, and referring image matting with only a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha matte through iterative refinement, which has only 2.7 million trainable parameters; and (iii) by incorporating SAM, MAM simplifies the user intervention required for the interactive use of image matting from the trimap to the box, point, or text prompt. We evaluate the performance of MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves comparable performance to the state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at this https URL.
https://arxiv.org/abs/2306.05399
Natural image matting algorithms aim to predict the transparency map (alpha matte) under trimap guidance. However, producing trimaps often requires significant labor, which limits the widespread, large-scale application of matting algorithms. To address this issue, we propose the Matte Anything model (MatAny), an interactive natural image matting model that can produce high-quality alpha mattes from various simple hints. The key insight of MatAny is to generate pseudo trimaps automatically from contour and transparency predictions. We leverage task-specific vision models to enhance the performance of natural image matting. Specifically, we use the Segment Anything Model (SAM) to predict high-quality contours with user interaction and an open-vocabulary (OV) detector to predict the transparency of any object. Subsequently, a pretrained image matting model generates alpha mattes from the pseudo trimaps. MatAny is the interactive matting algorithm with the most supported interaction methods and the best performance to date. It consists of orthogonal vision models without any additional training. We evaluate the performance of MatAny against several current image matting algorithms, and the results demonstrate the significant potential of our approach.
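The pseudo-trimap generation can be sketched as mask erosion/dilation plus a transparency override, as below. The band width and the rule for transparent objects are illustrative assumptions, not MatAny's exact procedure.

```python
import torch
import torch.nn.functional as F

def pseudo_trimap_sketch(mask, band=10, transparent=False):
    """Build a pseudo trimap from a binary segmentation mask (e.g. from SAM):
    erode for sure-foreground, dilate for sure-background, and mark the band
    in between as unknown. If an open-vocabulary detector flags the object
    as transparent, the whole object region becomes unknown.

    mask: (B, 1, H, W) float tensor with values in {0, 1}.
    Returns a trimap with 0 = background, 0.5 = unknown, 1 = foreground.
    """
    k = 2 * band + 1
    dilated = F.max_pool2d(mask, k, stride=1, padding=band)         # morphological dilation
    eroded = 1 - F.max_pool2d(1 - mask, k, stride=1, padding=band)  # morphological erosion
    trimap = torch.full_like(mask, 0.5)
    trimap[dilated < 0.5] = 0.0     # confidently background
    trimap[eroded > 0.5] = 1.0      # confidently foreground
    if transparent:
        # Transparent object: keep only the far background certain and let
        # the matting model predict opacity everywhere inside the object.
        trimap = torch.where(dilated > 0.5,
                             torch.full_like(mask, 0.5),
                             torch.zeros_like(mask))
    return trimap
```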
https://arxiv.org/abs/2306.04121
Cutting out an object and estimating its opacity mask, known as image matting, is a key task in image and video editing. Because the problem is highly ill-posed, additional inputs, typically user-defined trimaps or scribbles, are usually needed to reduce the uncertainty. Although effective, such inputs are either time-consuming to produce or only suitable for experienced users who know where to place the strokes. In this work, we propose a decomposed-uncertainty-guided matting (dugMatting) algorithm, which explores explicitly decomposed uncertainties to efficiently and effectively improve the results. Based on the characteristics of these uncertainties, the epistemic uncertainty is reduced in the process of guiding interaction (which introduces prior knowledge), while the aleatoric uncertainty is reduced in modeling the data distribution (which introduces statistics for both the data and possible noise). The proposed matting framework relieves users of the requirement to determine the interaction areas, relying instead on simple and efficient labeling. Extensive quantitative and qualitative results validate that the proposed method significantly improves the original matting algorithms in terms of both efficiency and efficacy.
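A hedged sketch of the uncertainty decomposition follows, assuming the model returns a mean alpha and a log-variance map and that epistemic uncertainty is estimated by the variance across stochastic forward passes (e.g., MC dropout); dugMatting's concrete estimators may differ.

```python
import torch

def decomposed_uncertainty_sketch(model, image, trimap, n_samples=8):
    """Split predictive uncertainty into an epistemic part (variance across
    stochastic forward passes) and an aleatoric part (a variance map
    predicted by the model). `model` is a hypothetical network assumed to
    return (alpha_mean, alpha_log_var) with dropout layers that stay active
    in train() mode.
    """
    model.train()                                  # keep dropout stochastic
    means, log_vars = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            mu, log_var = model(image, trimap)
            means.append(mu)
            log_vars.append(log_var)
    means = torch.stack(means)                     # (S, B, 1, H, W)
    epistemic = means.var(dim=0)                   # model disagreement
    aleatoric = torch.stack(log_vars).exp().mean(dim=0)  # data / noise variance
    return means.mean(dim=0), epistemic, aleatoric
```

High-epistemic regions would then be natural candidates for user labeling, while the aleatoric variance captures data and noise statistics, mirroring the two reduction routes described in the abstract.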
https://arxiv.org/abs/2306.01452
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) a detail capture module that consists only of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs on image matting with concise adaptation. It brings many superior properties of ViTs to matting, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting, where our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
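As an illustration of the detail capture idea, the sketch below pairs a few cheap strided convolutions with coarse ViT features; the channel widths, fusion scheme, and prediction head are assumptions, not ViTMatte's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailCaptureSketch(nn.Module):
    """Lightweight convolutional detail stream fused with coarse ViT features.
    Channel widths, strides, and the fusion/prediction heads are assumptions
    for illustration only.
    """
    def __init__(self, in_channels=4, vit_channels=384, mid=32):
        super().__init__()
        self.convs = nn.Sequential(                # a few cheap strided convs
            nn.Conv2d(in_channels, mid, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(mid + vit_channels, mid, 3, padding=1)
        self.head = nn.Conv2d(mid, 1, 3, padding=1)

    def forward(self, image_trimap, vit_feat):
        # image_trimap: (B, 4, H, W) image + trimap; vit_feat: (B, C, H/16, W/16).
        detail = self.convs(image_trimap)                          # (B, mid, H/4, W/4)
        coarse = F.interpolate(vit_feat, size=detail.shape[-2:],
                               mode='bilinear', align_corners=False)
        x = F.relu(self.fuse(torch.cat([detail, coarse], dim=1)))
        x = F.interpolate(x, scale_factor=4, mode='bilinear', align_corners=False)
        return torch.sigmoid(self.head(x))                         # alpha in [0, 1]
```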
https://arxiv.org/abs/2305.15272
Image matting refers to extracting precise alpha matte from natural images, and it plays a critical role in various downstream applications, such as image editing. Despite being an ill-posed problem, traditional methods have been trying to solve it for decades. The emergence of deep learning has revolutionized the field of image matting and given birth to multiple new techniques, including automatic, interactive, and referring image matting. This paper presents a comprehensive review of recent advancements in image matting in the era of deep learning. We focus on two fundamental sub-tasks: auxiliary input-based image matting, which involves user-defined input to predict the alpha matte, and automatic image matting, which generates results without any manual intervention. We systematically review the existing methods for these two tasks according to their task settings and network structures and provide a summary of their advantages and disadvantages. Furthermore, we introduce the commonly used image matting datasets and evaluate the performance of representative matting methods both quantitatively and qualitatively. Finally, we discuss relevant applications of image matting and highlight existing challenges and potential opportunities for future research. We also maintain a public repository to track the rapid development of deep image matting at this https URL.
https://arxiv.org/abs/2304.04672
For natural image matting, context information plays a crucial role in estimating alpha mattes, especially when it is challenging to distinguish the foreground from its background. Existing deep learning-based methods exploit specially designed context aggregation modules to refine encoder features. However, the effectiveness of these modules has not been thoroughly explored. In this paper, we conduct extensive experiments to reveal that context aggregation modules are actually not as effective as expected. We also demonstrate that, when learned on large image patches, basic encoder-decoder networks with a larger receptive field can effectively aggregate context to achieve better performance. Building on these findings, we propose a simple yet effective matting network, named AEMatter, which enlarges the receptive field by incorporating an appearance-enhanced axis-wise learning block into the encoder and adopting a hybrid-transformer decoder. Experimental results on four datasets demonstrate that AEMatter significantly outperforms state-of-the-art matting methods (e.g., on the Adobe Composition-1K dataset, \textbf{25\%} and \textbf{40\%} reductions in SAD and MSE, respectively, compared against MatteFormer). The code and model are available at \url{this https URL}.
https://arxiv.org/abs/2304.01171
Image matting requires high-quality pixel-level human annotations to support the training of a deep model in recent literature. However, such annotation is costly and hard to scale, significantly holding back the development of the research. In this work, we make the first attempt to address this problem by proposing a self-supervised pre-training approach that can leverage a virtually unlimited amount of data to boost matting performance. The pre-training task is designed in a manner similar to image matting, where random trimaps and alpha mattes are generated to achieve an image disentanglement objective. The pre-trained model is then used as an initialisation of the downstream matting task for fine-tuning. Extensive experimental evaluations show that the proposed approach outperforms both state-of-the-art matting methods and other alternative self-supervised initialisation approaches by a large margin. We also show the robustness of the proposed approach across different backbone architectures. The code and models will be publicly available.
https://arxiv.org/abs/2304.00784
Image matting aims to predict alpha values in elaborate uncertainty areas of natural images, such as hair, smoke, and spider webs. However, existing methods perform poorly when faced with highly transparent foreground objects, due to the large uncertain area to predict and the small receptive field of convolutional networks. To address this issue, we propose a Transformer-based network (TransMatting) to model transparent objects with long-range features and collect a high-resolution matting dataset of transparent objects (Transparent-460) for performance evaluation. Specifically, to utilize the semantic information in the trimap flexibly and effectively, we also redesign the trimap as three learnable tokens, named the tri-token. Both Transformer and convolutional matting models can benefit from the proposed tri-token design. By replacing the traditional trimap concatenation strategy with our tri-token, existing matting methods achieve about a 10% improvement in SAD and 20% in MSE. Equipped with the new tri-token design, our proposed TransMatting outperforms current state-of-the-art methods on several popular matting benchmarks and on our newly collected Transparent-460.
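The tri-token idea can be sketched as a three-entry embedding table indexed by the trimap class and added to the feature map, instead of concatenating the trimap with the input. The injection point and dimensions below are assumptions for illustration; TransMatting's actual design may differ.

```python
import torch.nn as nn

class TriTokenSketch(nn.Module):
    """Map each trimap class (bg / unknown / fg) to a learnable token and add
    it to a feature map whose channel count equals `embed_dim`.
    """
    def __init__(self, embed_dim):
        super().__init__()
        self.tri_token = nn.Embedding(3, embed_dim)   # one token per trimap class

    def forward(self, feat, trimap):
        # feat: (B, C, H, W); trimap: (B, 1, h, w) float with values in {0, 0.5, 1}.
        tri = nn.functional.interpolate(trimap, size=feat.shape[-2:], mode='nearest')
        cls = (tri * 2).round().long().squeeze(1)     # -> class ids in {0, 1, 2}
        tok = self.tri_token(cls)                     # (B, H, W, C)
        return feat + tok.permute(0, 3, 1, 2)
```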
https://arxiv.org/abs/2303.06476
We study the composition style in deep image matting, a notion that characterizes a data generation flow on how to exploit limited foregrounds and random backgrounds to form a training dataset. Prior art executes this flow in a completely random manner by simply going through the foreground pool or by optionally combining two foregrounds before foreground-background composition. In this work, we first show that naive foreground combination can be problematic and therefore derive an alternative formulation to reasonably combine foregrounds. Our second contribution is an observation that matting performance can benefit from a certain occurrence frequency of combined foregrounds and their associated source foregrounds during training. Inspired by this, we introduce a novel composition style that binds the source and combined foregrounds in a definite triplet. In addition, we also find that different orders of foreground combination lead to different foreground patterns, which further inspires a quadruplet-based composition style. Results under controlled experiments on four matting baselines show that our composition styles outperform existing ones and invite consistent performance improvement on both composited and real-world datasets. Code is available at: this https URL
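For reference, the sketch below shows standard foreground-background composition and the usual "over" rule for naively combining two foreground/alpha pairs; the paper argues this naive combination can be problematic and derives its own formulation, which (together with the triplet/quadruplet composition styles) is not reproduced here.

```python
import numpy as np

def composite(fg, alpha, bg):
    """Standard composition of one foreground onto a background:
    I = alpha * F + (1 - alpha) * B."""
    return alpha * fg + (1.0 - alpha) * bg

def combine_foregrounds(fg1, a1, fg2, a2):
    """Naive 'over' combination of two foreground/alpha pairs, shown only as
    the baseline that the paper improves upon.

    fg1, fg2: (H, W, 3) float images; a1, a2: (H, W, 1) alphas in [0, 1].
    """
    a = a1 + a2 * (1.0 - a1)                       # combined alpha
    # Combine pre-multiplied colours, then un-premultiply where alpha > 0.
    fg = fg1 * a1 + fg2 * a2 * (1.0 - a1)
    fg = np.where(a > 1e-6, fg / np.clip(a, 1e-6, 1.0), 0.0)
    return fg, a
```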
https://arxiv.org/abs/2212.13517
Performance of trimap-free image matting methods is limited when trying to decouple the deterministic and undetermined regions, especially in the scenes where foregrounds are semantically ambiguous, chromaless, or high transmittance. In this paper, we propose a novel framework named Privileged Prior Information Distillation for Image Matting (PPID-IM) that can effectively transfer privileged prior environment-aware information to improve the performance of students in solving hard foregrounds. The prior information of trimap regulates only the teacher model during the training stage, while not being fed into the student network during actual inference. In order to achieve effective privileged cross-modality (i.e. trimap and RGB) information distillation, we introduce a Cross-Level Semantic Distillation (CLSD) module that reinforces the trimap-free students with more knowledgeable semantic representations and environment-aware information. We also propose an Attention-Guided Local Distillation module that efficiently transfers privileged local attributes from the trimap-based teacher to trimap-free students for the guidance of local-region optimization. Extensive experiments demonstrate the effectiveness and superiority of our PPID framework on the task of image matting. In addition, our trimap-free IndexNet-PPID surpasses the other competing state-of-the-art methods by a large margin, especially in scenarios with chromaless, weak texture, or irregular objects.
https://arxiv.org/abs/2211.14036