Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two tasks, we propose a new architectural approach for DIS called Confidence-Guided Matting (CGM). We created the first CGM model, the Background Erase Network (BEN). BEN comprises two components: BEN Base for initial segmentation and BEN Refiner for confidence refinement. Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset, demonstrating that matting-based refinement can significantly enhance segmentation quality. This work opens new possibilities for cross-pollination between matting and segmentation techniques in computer vision.
https://arxiv.org/abs/2501.06230
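The abstract does not spell out how the refiner uses confidence; below is a minimal PyTorch sketch of the confidence-guided idea it describes. All module names, layer sizes, and the threshold `tau` are hypothetical stand-ins, not BEN's actual design.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Stand-in for a base network: predicts a coarse foreground probability map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, image):
        return torch.sigmoid(self.body(image))          # (B, 1, H, W) in [0, 1]

class TinyRefiner(nn.Module):
    """Stand-in for a refiner: re-estimates alpha where the base net is unsure."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, image, coarse):
        return torch.sigmoid(self.body(torch.cat([image, coarse], dim=1)))

def confidence_guided_matting(image, base, refiner, tau=0.4):
    coarse = base(image)
    # Confidence: distance of the base prediction from the ambiguous value 0.5.
    confidence = (coarse - 0.5).abs() * 2.0              # 0 = unsure, 1 = certain
    refined = refiner(image, coarse)
    # Keep confident base predictions, trust the refiner in uncertain regions.
    return torch.where(confidence > tau, coarse, refined)

image = torch.rand(1, 3, 64, 64)
alpha = confidence_guided_matting(image, TinySegNet(), TinyRefiner())
print(alpha.shape)  # torch.Size([1, 1, 64, 64])
```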
Transparent objects are ubiquitous in daily life, making their perception and robotic manipulation important. However, their distinct refractive and reflective properties make accurately estimating their 6D pose a major challenge. To solve this, we present ReFlow6D, a novel method for transparent object 6D pose estimation that harnesses a refractive-intermediate representation. Unlike conventional approaches, our method leverages a feature space impervious to changes in RGB image space and independent of depth information. Drawing inspiration from image matting, we model the deformation of the light path through transparent objects, yielding a unique object-specific intermediate representation guided by light refraction that is independent of the environment in which objects are observed. By integrating these intermediate features into the pose estimation network, we show that ReFlow6D achieves precise 6D pose estimation of transparent objects using only RGB images as input. Our method further introduces a novel transparent object compositing loss, fostering the generation of superior refractive-intermediate features. Empirical evaluations show that our approach significantly outperforms state-of-the-art methods on the TOD and Trans32K-6D datasets. Robot grasping experiments further demonstrate that ReFlow6D's pose estimation accuracy translates effectively to real-world robotic tasks. The source code is available at: this https URL and this https URL.
https://arxiv.org/abs/2412.20830
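The paper's exact intermediate representation is not reproduced here; a rough sketch of the compositing idea implied by the abstract — rendering a transparent object by warping the background through a per-pixel refractive flow field and comparing with the photo — with all function names and the zero-flow test case invented for illustration:

```python
import torch
import torch.nn.functional as F

def composite_through_refraction(background, flow, mask):
    """Warp the background through a per-pixel refractive flow and paste the
    result inside the object mask. `flow` holds offsets in normalized [-1, 1]."""
    b, _, h, w = background.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = base_grid + flow.permute(0, 2, 3, 1)          # displace sampling positions
    refracted = F.grid_sample(background, grid, align_corners=True)
    return mask * refracted + (1 - mask) * background

def compositing_loss(pred_flow, mask, background, observed):
    """Hypothetical stand-in for a transparent-object compositing loss: the image
    re-rendered from the predicted flow should match the observed photo."""
    recomposed = composite_through_refraction(background, pred_flow, mask)
    return F.l1_loss(mask * recomposed, mask * observed)

bg = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)       # predicted refractive flow (here: none)
mask = torch.ones(1, 1, 64, 64)
print(compositing_loss(flow, mask, bg, bg).item())  # ~0 for zero flow
```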
Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stages of attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of $4872\times6017$. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
https://arxiv.org/abs/2412.10702
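A minimal sketch of the token-routing idea described above: a router scores tokens, only the top-k go through full self-attention, and the rest take a cheap path standing in for the LTRM. The keep ratio, layer sizes, and the per-sample loop are illustrative assumptions, not MEMatte's BATR mechanism.

```python
import torch
import torch.nn as nn

class RoutedAttentionBlock(nn.Module):
    """Sketch: route informative tokens to global attention, others to a cheap layer."""
    def __init__(self, dim=64, keep_ratio=0.25, heads=4):
        super().__init__()
        self.router = nn.Linear(dim, 1)                   # routing score per token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ltrm = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                            # tokens: (B, N, C)
        b, n, c = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.router(tokens).squeeze(-1)          # (B, N)
        top_idx = scores.topk(k, dim=1).indices           # indices of routed tokens
        out = self.ltrm(tokens)                           # cheap path for all tokens
        for i in range(b):                                # expensive path for top-k only
            sel = tokens[i, top_idx[i]].unsqueeze(0)
            attended, _ = self.attn(sel, sel, sel)
            out[i, top_idx[i]] = attended.squeeze(0)
        return out

x = torch.rand(2, 1024, 64)                               # 1024 tokens per image
print(RoutedAttentionBlock()(x).shape)                    # torch.Size([2, 1024, 64])
```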
Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) face inherent challenges in image matting, particularly in preserving fine structural details. ViTs, with their global receptive field enabled by the self-attention mechanism, often lose local details such as hair strands. Conversely, CNNs, constrained by their local receptive field, rely on deeper layers to approximate global context but struggle to retain fine structures at greater depths. To overcome these limitations, we propose a novel Morpho-Aware Global Attention (MAGA) mechanism, designed to effectively capture the morphology of fine structures. MAGA employs Tetris-like convolutional patterns to align the local shapes of fine structures, ensuring optimal local correspondence while maintaining sensitivity to morphological details. The extracted local morphology information is used as query embeddings, which are projected onto global key embeddings to emphasize local details in a broader context. Subsequently, by projecting onto value embeddings, MAGA seamlessly integrates these emphasized morphological details into a unified global structure. This approach enables MAGA to simultaneously focus on local morphology and unify these details into a coherent whole, effectively preserving fine structures. Extensive experiments show that our MAGA-based ViT achieves significant performance gains, outperforming state-of-the-art methods across two benchmarks with average improvements of 4.3% in SAD and 39.5% in MSE.
https://arxiv.org/abs/2411.10251
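A loose sketch of the mechanism summarized above: shape-sensitive local convolutions produce morphology-aware queries, which are projected against global keys and values. The elongated kernel shapes and dimensions are illustrative, not the paper's Tetris-like patterns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphoAwareAttention(nn.Module):
    """Queries come from local shape-sensitive convs; keys/values from the full map."""
    def __init__(self, dim=64):
        super().__init__()
        # Elongated kernels respond to thin structures in different orientations.
        self.shape_convs = nn.ModuleList([
            nn.Conv2d(dim, dim, (1, 5), padding=(0, 2)),
            nn.Conv2d(dim, dim, (5, 1), padding=(2, 0)),
            nn.Conv2d(dim, dim, 3, padding=1)])
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feat):                              # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        local = sum(conv(feat) for conv in self.shape_convs)   # local morphology cues
        q = self.to_q(local.flatten(2).transpose(1, 2))        # (B, HW, C)
        tokens = feat.flatten(2).transpose(1, 2)
        k, v = self.to_k(tokens), self.to_v(tokens)
        attn = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = attn @ v                                         # (B, HW, C)
        return out.transpose(1, 2).reshape(b, c, h, w)

print(MorphoAwareAttention()(torch.rand(1, 64, 16, 16)).shape)
```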
The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at \url{this https URL}.
https://arxiv.org/abs/2411.00626
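A minimal sketch of the prompt-aware masked attention idea: attention logits toward tokens inside the visual-prompt region receive an additive bonus so the decoder focuses on the prompted object. The bias value and function signature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def prompt_aware_masked_attention(q, k, v, prompt_mask, bias=4.0):
    """q, k, v: (B, N, C); prompt_mask: (B, N) with 1 inside the prompted region."""
    logits = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5   # (B, N, N)
    logits = logits + bias * prompt_mask.unsqueeze(1)     # bias every query row
    return F.softmax(logits, dim=-1) @ v

b, n, c = 1, 64, 32
q = k = v = torch.rand(b, n, c)
prompt_mask = torch.zeros(b, n)
prompt_mask[:, :16] = 1.0        # pretend the first 16 tokens lie inside a box prompt
print(prompt_aware_masked_attention(q, k, v, prompt_mask).shape)  # (1, 64, 32)
```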
Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occluded scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. Specifically, the construction of our COCO-Matting includes accessory fusion and mask-to-matte, which selects real-world complex images from COCO and converts semantic segmentation masks to matting labels. The resulting COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and only train a lightweight matting decoder with end-to-end matting losses, which does not fully exploit the potential of the pre-trained SAM. Thus, we propose SEMat, which revamps the network architecture and training objectives. For the network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features, and the proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For the training objectives, the proposed regularization and trimap loss aim to retain the prior from the pre-trained model and push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method, proving its efficacy in interactive natural image matting. We open-source our code, models, and dataset at this https URL.
https://arxiv.org/abs/2410.06593
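The mask-to-matte conversion and trimap loss are only named in the abstract; here is a hedged sketch of one common way to realize them — deriving a three-class trimap from a binary mask by erosion/dilation and supervising three-channel logits with cross-entropy. Band width and the loss form are assumptions, not SEMat's actual recipe.

```python
import torch
import torch.nn.functional as F

def mask_to_trimap(mask, band=5):
    """Turn a binary mask (B,1,H,W) into a 3-class trimap (0=bg, 1=unknown, 2=fg)
    by eroding/dilating with max-pooling. The band width is an arbitrary choice."""
    dilated = F.max_pool2d(mask, 2 * band + 1, stride=1, padding=band)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, 2 * band + 1, stride=1, padding=band)
    trimap = torch.ones_like(mask)            # unknown by default
    trimap[dilated < 0.5] = 0.0               # definite background
    trimap[eroded > 0.5] = 2.0                # definite foreground
    return trimap.long().squeeze(1)           # (B, H, W)

def trimap_loss(matting_logits, mask):
    """Hedged stand-in for a trimap loss: 3-channel logits from the decoder are
    pushed to carry background / unknown / foreground semantics."""
    return F.cross_entropy(matting_logits, mask_to_trimap(mask))

mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
logits = torch.randn(2, 3, 64, 64, requires_grad=True)
print(trimap_loss(logits, mask).item())
```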
The labelling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations, such as trimaps coarsely indicating the foreground/background, as supervision. We show that the cooperation between semantics learned from the indicated known regions and properly assumed matting rules can help infer alpha values in transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) in each pixel neighborhood to constrain the alpha values conditioned on the input image. DDC loss forces the distance of similar pairs on the alpha matte and on its corresponding image to be consistent. In this way, alpha values can be propagated from the learned known regions to unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a known-region loss and the proposed DDC loss. Experiments on the AM-2K and P3M-10K datasets show that our paradigm achieves performance comparable to the fine-label-supervised baseline, while sometimes offering even more satisfying results than the human-labelled ground truth. Code is available at \url{this https URL}.
https://arxiv.org/abs/2408.10539
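The abstract states the DDC constraint quite concretely; below is a small sketch of it: for pixel pairs along a few directions, if the pair looks similar in the image, the corresponding alpha distance should match the image distance. The shift set and similarity threshold are illustrative, not the paper's settings.

```python
import torch

def ddc_loss(alpha, image, shifts=((0, 1), (1, 0), (1, 1), (1, -1)), sim_thresh=0.05):
    """Directional distance-consistency sketch over a pixel neighborhood."""
    loss = 0.0
    for dy, dx in shifts:
        shifted_img = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
        shifted_alpha = torch.roll(alpha, shifts=(dy, dx), dims=(2, 3))
        img_dist = (image - shifted_img).abs().mean(dim=1, keepdim=True)
        alpha_dist = (alpha - shifted_alpha).abs()
        similar = (img_dist < sim_thresh).float()        # pairs deemed similar
        loss = loss + (similar * (alpha_dist - img_dist).abs()).mean()
    return loss / len(shifts)

image = torch.rand(1, 3, 64, 64)
alpha = torch.rand(1, 1, 64, 64, requires_grad=True)
print(ddc_loss(alpha, image).item())
```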
This paper introduces an innovative approach for image matting that redefines the traditional regression-based task as a generative modeling challenge. Our method harnesses the capabilities of latent diffusion models, enriched with extensive pre-trained knowledge, to regularize the matting process. We present novel architectural innovations that empower our model to produce mattes with superior resolution and detail. The proposed method is versatile and can perform both guidance-free and guidance-based image matting, accommodating a variety of additional cues. Our comprehensive evaluation across three benchmark datasets demonstrates the superior performance of our approach, both quantitatively and qualitatively. The results not only reflect our method's robust effectiveness but also highlight its ability to generate visually compelling mattes that approach photorealistic quality. The project page for this paper is available at this https URL
https://arxiv.org/abs/2407.21017
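The architectural details are not given in the abstract; as a hedged illustration of how a single model can serve both guidance-free and guidance-based matting, one simple pattern is to concatenate an optional cue channel that is zeroed when no guidance is supplied. Everything below (names, latent size, the plain conv net standing in for the diffusion backbone) is an assumption.

```python
import torch
import torch.nn as nn

class GuidanceOptionalDenoiser(nn.Module):
    """Toy denoiser: an optional cue (e.g. a trimap) enters as an extra channel."""
    def __init__(self, latent_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_dim, 3, padding=1))

    def forward(self, noisy_latent, guidance=None):
        if guidance is None:                               # guidance-free mode
            guidance = torch.zeros_like(noisy_latent[:, :1])
        return self.net(torch.cat([noisy_latent, guidance], dim=1))

latent = torch.randn(1, 4, 32, 32)
trimap = torch.rand(1, 1, 32, 32)
model = GuidanceOptionalDenoiser()
print(model(latent).shape, model(latent, trimap).shape)
```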
The goal of this work is to develop a task-agnostic feature upsampling operator for dense prediction, where the operator is required to facilitate not only region-sensitive tasks like semantic segmentation but also detail-sensitive tasks such as image matting. Prior upsampling operators can often work well on one type of task, but not both. We argue that task-agnostic upsampling should dynamically trade off between semantic preservation and detail delineation, instead of being biased toward either property. In this paper, we present FADE, a novel, plug-and-play, lightweight, and task-agnostic upsampling operator that fuses the assets of decoder and encoder features at three levels: i) considering both the encoder and decoder features in upsampling kernel generation; ii) controlling the per-point contribution of the encoder/decoder features in upsampling kernels with an efficient semi-shift convolutional operator; and iii) enabling the selective pass of encoder features with a decoder-dependent gating mechanism for compensating details. To improve the practicality of FADE, we additionally study parameter- and memory-efficient implementations of semi-shift convolution. We analyze the upsampling behavior of FADE on toy data and show through large-scale experiments that FADE is task-agnostic, with consistent performance improvements on a number of dense prediction tasks at little extra cost. For the first time, we demonstrate robust feature upsampling on both region- and detail-sensitive tasks. Code is made available at: this https URL
https://arxiv.org/abs/2407.13500
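A rough sketch of FADE's ingredients as listed in the abstract (kernel generation from both feature streams, plus a decoder-dependent gate on encoder details); the semi-shift convolution itself is not reproduced, and all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusedUpsampler(nn.Module):
    """Kernels predicted from encoder+decoder features; gated encoder skip."""
    def __init__(self, dim=32, k=3):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Conv2d(2 * dim, k * k, 3, padding=1)
        self.gate_head = nn.Conv2d(dim, 1, 3, padding=1)

    def forward(self, decoder_feat, encoder_feat):        # encoder is 2x resolution
        b, c, h, w = encoder_feat.shape
        up = F.interpolate(decoder_feat, size=(h, w), mode="nearest")
        kernels = F.softmax(self.kernel_head(torch.cat([up, encoder_feat], 1)), dim=1)
        # Reassemble each output pixel from a kxk neighbourhood of the decoder feature.
        patches = F.unfold(up, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h * w)
        out = (patches * kernels.view(b, 1, self.k * self.k, h * w)).sum(2).view(b, c, h, w)
        gate = torch.sigmoid(self.gate_head(up))          # decoder-dependent gating
        return out + gate * encoder_feat

dec = torch.rand(1, 32, 16, 16)
enc = torch.rand(1, 32, 32, 32)
print(SimpleFusedUpsampler()(dec, enc).shape)             # torch.Size([1, 32, 32, 32])
```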
Image matting aims to obtain an alpha matte that separates foreground objects from the background accurately. Recently, trimap-free matting has been well studied because it requires only the original image without any extra input. Such methods usually extract a rough foreground on their own to take the place of a trimap as further guidance. However, the definition of 'foreground' lacks a unified standard, and thus ambiguities arise. Besides, the extracted foreground is sometimes incomplete due to inadequate network design. Most importantly, there is no large-scale real-world matting dataset, and current trimap-free methods trained with synthetic images suffer from large domain-shift problems in practice. In this paper, we define the salient object as foreground, which is consistent with human cognition and the annotations of current matting datasets. Meanwhile, data and technologies from salient object detection can be transferred to matting in a breeze. To obtain a more accurate and complete alpha matte, we propose a network called \textbf{M}ulti-\textbf{F}eature fusion-based \textbf{C}oarse-to-fine Network \textbf{(MFC-Net)}, which fully integrates multiple features for an accurate and complete alpha matte. Furthermore, we introduce image harmony in data composition to bridge the gap between synthetic and real images. More importantly, we establish the largest general real-world matting dataset \textbf{(Real-19k)} to date. Experiments show that our method is significantly effective on both synthetic and real-world images, and its performance on the real-world dataset is far better than that of existing trimap-free methods. Our code and data will be released soon.
https://arxiv.org/abs/2405.17916
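For context on the data-composition step, here is the standard matting composition equation I = alpha*F + (1-alpha)*B, with a crude color-statistics match of the foreground to the background standing in for the paper's image-harmony step (the actual harmonization method is not specified in the abstract).

```python
import torch

def composite_with_harmony(fg, alpha, bg):
    """Alpha-composite a foreground onto a background after a rough color match."""
    # Shift foreground channel means/stds toward the background's statistics.
    fg_mean, fg_std = fg.mean(dim=(2, 3), keepdim=True), fg.std(dim=(2, 3), keepdim=True)
    bg_mean, bg_std = bg.mean(dim=(2, 3), keepdim=True), bg.std(dim=(2, 3), keepdim=True)
    harmonized = ((fg - fg_mean) / (fg_std + 1e-6)) * bg_std + bg_mean
    harmonized = 0.5 * fg + 0.5 * harmonized           # only partially adjust
    return alpha * harmonized + (1.0 - alpha) * bg

fg = torch.rand(1, 3, 64, 64)
bg = torch.rand(1, 3, 64, 64)
alpha = torch.rand(1, 1, 64, 64)
print(composite_with_harmony(fg, alpha, bg).shape)
```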
Despite significant advancements in image matting, existing models heavily depend on manually-drawn trimaps for accurate results in natural image scenarios. However, the process of obtaining trimaps is time-consuming, lacking user-friendliness and device compatibility. This reliance greatly limits the practical application of all trimap-based matting methods. To address this issue, we introduce Click2Trimap, an interactive model capable of predicting high-quality trimaps and alpha mattes with minimal user click inputs. Through analyzing real users' behavioral logic and characteristics of trimaps, we successfully propose a powerful iterative three-class training strategy and a dedicated simulation function, making Click2Trimap exhibit versatility across various scenarios. Quantitative and qualitative assessments on synthetic and real-world matting datasets demonstrate Click2Trimap's superior performance compared to all existing trimap-free matting methods. Especially, in the user study, Click2Trimap achieves high-quality trimap and matting predictions in just an average of 5 seconds per image, demonstrating its substantial practical value in real-world applications.
https://arxiv.org/abs/2404.00335
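The abstract mentions a dedicated click-simulation function without detailing it; a toy stand-in is sketched below: find the most confidently wrong region (by blurring the error map) and place the next click at its peak, labelled with the ground-truth trimap class. The blur size and selection rule are assumptions.

```python
import torch
import torch.nn.functional as F

def simulate_next_click(pred_trimap, gt_trimap):
    """Return ((y, x), class) for a simulated user click (0=bg, 1=unknown, 2=fg)."""
    error = (pred_trimap != gt_trimap).float().unsqueeze(0).unsqueeze(0)   # (1,1,H,W)
    # Blurring favours pixels deep inside large error regions over stray pixels.
    blurred = F.avg_pool2d(error, 15, stride=1, padding=7)
    flat_idx = blurred.flatten().argmax()
    h, w = gt_trimap.shape
    y, x = divmod(flat_idx.item(), w)
    return (y, x), int(gt_trimap[y, x].item())

gt = torch.randint(0, 3, (64, 64))
pred = torch.randint(0, 3, (64, 64))
(click_y, click_x), cls = simulate_next_click(pred, gt)
print(click_y, click_x, cls)
```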
We introduce in-context matting, a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points, scribbles, and masks, in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category, without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting, which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching, we introduce IconMatting, an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching, IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task, we also introduce a novel testing dataset ICM-$57$, covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at this https URL
https://arxiv.org/abs/2403.15789
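A small sketch of the inter-/intra-similarity matching idea: build a foreground prototype from the reference features under the reference mask, score target pixels by similarity to it (inter), then smooth the scores with the target's own feature self-similarity (intra). The prototype-based formulation and shapes are simplifying assumptions, not IconMatting's diffusion-feature pipeline.

```python
import torch
import torch.nn.functional as F

def in_context_prior(ref_feat, ref_mask, tgt_feat):
    """Shapes: feats (C, H, W), mask (H, W). Returns a (H, W) foreground prior."""
    c, h, w = tgt_feat.shape
    proto = (ref_feat * ref_mask).sum(dim=(1, 2)) / (ref_mask.sum() + 1e-6)   # (C,)
    tgt = F.normalize(tgt_feat.reshape(c, -1), dim=0)                         # (C, HW)
    inter = (F.normalize(proto, dim=0) @ tgt).reshape(1, h * w)               # (1, HW)
    intra = F.softmax(tgt.T @ tgt / c ** 0.5, dim=-1)                         # (HW, HW)
    prior = (inter @ intra).reshape(h, w)            # propagate scores within target
    return prior

ref_feat, tgt_feat = torch.rand(32, 16, 16), torch.rand(32, 16, 16)
ref_mask = (torch.rand(16, 16) > 0.5).float()
print(in_context_prior(ref_feat, ref_mask, tgt_feat).shape)   # torch.Size([16, 16])
```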
Natural image matting aims to estimate the alpha matte of the foreground from a given image. Various approaches have been explored to address this problem, such as interactive matting methods that use guidance such as click or trimap, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for specific objects or guidance, neglecting the common requirement of aggregating global and local contexts in image matting. As a result, these methods often encounter challenges in accurately identifying the foreground and generating precise boundaries, which limits their effectiveness in unforeseen scenarios. In this paper, we propose a simple and universal matting framework, named Dual-Context Aggregation Matting (DCAM), which enables robust image matting with arbitrary guidance or without guidance. Specifically, DCAM first adopts a semantic backbone network to extract low-level features and context features from the input image and guidance. Then, we introduce a dual-context aggregation network that incorporates global object aggregators and local appearance aggregators to iteratively refine the extracted context features. By performing both global contour segmentation and local boundary refinement, DCAM exhibits robustness to diverse types of guidance and objects. Finally, we adopt a matting decoder network to fuse the low-level features and the refined context features for alpha matte estimation. Experimental results on five matting datasets demonstrate that the proposed DCAM outperforms state-of-the-art matting methods in both automatic matting and interactive matting tasks, which highlights the strong universality and high performance of DCAM. The source code is available at \url{this https URL}.
https://arxiv.org/abs/2402.18109
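A loose sketch of the dual-context aggregation idea: a global aggregator summarizes the scene with a few learned queries and redistributes the summary, while a local aggregator refines appearance with small convolutions. Query count, layer sizes, and the fusion rule are illustrative, not DCAM's actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualContextAggregation(nn.Module):
    """Global object-level summary plus local appearance refinement."""
    def __init__(self, dim=64, num_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # local appearance
            nn.Conv2d(dim, dim, 1), nn.ReLU())

    def forward(self, ctx):                                  # ctx: (B, C, H, W)
        b, c, h, w = ctx.shape
        tokens = ctx.flatten(2).transpose(1, 2)              # (B, HW, C)
        attn = F.softmax(tokens @ self.queries.T / c ** 0.5, dim=1)   # (B, HW, Q)
        summary = attn.transpose(1, 2) @ tokens              # (B, Q, C) global context
        redistributed = attn @ summary                       # (B, HW, C)
        global_ctx = redistributed.transpose(1, 2).reshape(b, c, h, w)
        return ctx + global_ctx + self.local(ctx)            # fuse both contexts

print(DualContextAggregation()(torch.rand(1, 64, 16, 16)).shape)
```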
We aim to leverage diffusion to address the challenging image matting task. However, the presence of high computational overhead and the inconsistency of noise sampling between the training and inference processes pose significant obstacles to achieving this goal. In this paper, we present DiffMatte, a solution designed to effectively overcome these challenges. First, DiffMatte decouples the decoder from the intricately coupled matting network design, involving only one lightweight decoder in the iterations of the diffusion process. With such a strategy, DiffMatte mitigates the growth of computational overhead as the number of samples increases. Second, we employ a self-aligned training strategy with uniform time intervals, ensuring a consistent noise sampling between training and inference across the entire time domain. Our DiffMatte is designed with flexibility in mind and can seamlessly integrate into various modern matting architectures. Extensive experimental results demonstrate that DiffMatte not only reaches the state-of-the-art level on the Composition-1k test set, surpassing the best methods in the past by 5% and 15% in the SAD metric and MSE metric respectively, but also show stronger generalization ability in other benchmarks.
https://arxiv.org/abs/2312.05915
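A minimal sketch of the two points the abstract emphasizes: the image is encoded once, only a lightweight decoder runs inside the diffusion loop, and sampling steps are spaced at uniform time intervals. The blending schedule and module sizes below are toys, not DiffMatte's actual sampler.

```python
import torch
import torch.nn as nn

class LightAlphaDecoder(nn.Module):
    """Toy stand-in for the single lightweight decoder reused at every step."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, noisy_alpha, image_feat):
        return self.net(torch.cat([noisy_alpha, image_feat], dim=1))  # predicts alpha

def sample_alpha(image_feat, decoder, num_steps=4):
    alpha = torch.randn(image_feat.shape[0], 1, *image_feat.shape[2:])  # pure noise
    for step in range(num_steps):
        t = 1.0 - step / num_steps                    # uniformly spaced time points
        pred = decoder(alpha, image_feat)
        alpha = t * alpha + (1.0 - t) * pred          # move toward the prediction
    return torch.sigmoid(alpha)

feat = torch.rand(1, 32, 64, 64)                      # extracted once by a backbone
print(sample_alpha(feat, LightAlphaDecoder()).shape)  # torch.Size([1, 1, 64, 64])
```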
In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce the Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page at \url{this https URL}.
https://arxiv.org/abs/2311.13535
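A hedged sketch of the Alpha Reliability Propagation idea described above: pixels in the unknown band whose current alpha estimate is already confident are promoted to known foreground/background, shrinking the region the correction module must fix. The trimap coding and confidence threshold are illustrative assumptions.

```python
import torch

def alpha_reliability_propagation(trimap, alpha, conf_thresh=0.9):
    """Trimap coding here: 0 = bg, 0.5 = unknown, 1 = fg."""
    updated = trimap.clone()
    unknown = trimap == 0.5
    updated[unknown & (alpha > conf_thresh)] = 1.0          # confidently foreground
    updated[unknown & (alpha < 1.0 - conf_thresh)] = 0.0    # confidently background
    return updated

trimap = torch.full((1, 1, 64, 64), 0.5)
trimap[..., :16, :] = 0.0
alpha = torch.rand(1, 1, 64, 64)
refined_trimap = alpha_reliability_propagation(trimap, alpha)
print(refined_trimap.unique())      # tensor([0.0000, 0.5000, 1.0000])
```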
We introduce the notion of point affiliation into feature upsampling. By abstracting a feature map into non-overlapped semantic clusters formed by points of identical semantic meaning, feature upsampling can be viewed as point affiliation -- designating a semantic cluster for each upsampled point. In the framework of kernel-based dynamic upsampling, we show that an upsampled point can resort to its low-res decoder neighbors and high-res encoder point to reason the affiliation, conditioned on the mutual similarity between them. We therefore present a generic formulation for generating similarity-aware upsampling kernels and prove that such kernels encourage not only semantic smoothness but also boundary sharpness. This formulation constitutes a novel, lightweight, and universal upsampling solution, Similarity-Aware Point Affiliation (SAPA). We show its working mechanism via our preliminary designs with window-shape kernel. After probing the limitations of the designs on object detection, we reveal additional insights for upsampling, leading to SAPA with the dynamic kernel shape. Extensive experiments demonstrate that SAPA outperforms prior upsamplers and invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation. Code is made available at: this https URL
https://arxiv.org/abs/2307.08198
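A compact sketch of the similarity-aware kernel idea: each high-resolution point builds an upsampling kernel from its similarity to the kxk low-resolution decoder neighbourhood (queried with the high-resolution encoder feature) and then mixes those neighbours. Equal channel dimensions and a fixed window are simplifying assumptions; SAPA's dynamic kernel shape is not reproduced.

```python
import torch
import torch.nn.functional as F

def sapa_upsample(decoder_feat, encoder_feat, k=3):
    b, c, hi_h, hi_w = encoder_feat.shape
    lo_h, lo_w = decoder_feat.shape[2:]
    # k x k decoder neighbourhoods, replicated to high resolution.
    patches = F.unfold(decoder_feat, k, padding=k // 2)            # (B, C*k*k, lo_h*lo_w)
    patches = patches.view(b, c * k * k, lo_h, lo_w)
    patches = F.interpolate(patches, size=(hi_h, hi_w), mode="nearest")
    patches = patches.view(b, c, k * k, hi_h, hi_w)
    # Similarity of the encoder point to each candidate neighbour -> kernel weights.
    sim = (patches * encoder_feat.unsqueeze(2)).sum(1) / c ** 0.5  # (B, k*k, H, W)
    kernel = F.softmax(sim, dim=1)
    return (patches * kernel.unsqueeze(1)).sum(2)                  # (B, C, H, W)

dec, enc = torch.rand(1, 32, 16, 16), torch.rand(1, 32, 32, 32)
print(sapa_upsample(dec, enc).shape)      # torch.Size([1, 32, 32, 32])
```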
In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image with flexible and interactive visual or linguistic user prompt guidance. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM is capable of dealing with various types of image matting, including semantic, instance, and referring image matting with only a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha matte through iterative refinement, which has only 2.7 million trainable parameters. (iii) By incorporating SAM, MAM simplifies the user intervention required for the interactive use of image matting from the trimap to the box, point, or text prompt. We evaluate the performance of MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves comparable performance to the state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at this https URL.
https://arxiv.org/abs/2306.05399
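A hedged sketch of a lightweight Mask-to-Matte-style head: it takes frozen SAM feature maps plus SAM's coarse mask and iteratively refines an alpha matte. Layer sizes, the number of refinement rounds, and the residual update rule are illustrative, not MAM's actual M2M module.

```python
import torch
import torch.nn as nn

class MaskToMatte(nn.Module):
    """Small head refining a coarse mask into an alpha matte over a few rounds."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, sam_feat, coarse_mask, iters=3):
        alpha = coarse_mask
        for _ in range(iters):                              # iterative refinement
            delta = self.refine(torch.cat([sam_feat, alpha], dim=1))
            alpha = torch.sigmoid(torch.logit(alpha.clamp(1e-4, 1 - 1e-4)) + delta)
        return alpha

sam_feat = torch.rand(1, 256, 64, 64)                        # frozen SAM features
coarse = torch.rand(1, 1, 64, 64)                            # SAM's coarse mask
print(MaskToMatte()(sam_feat, coarse).shape)                 # torch.Size([1, 1, 64, 64])
```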
Natural image matting algorithms aim to predict the transparency map (alpha-matte) with the trimap guidance. However, the production of trimaps often requires significant labor, which limits the widespread application of matting algorithms on a large scale. To address the issue, we propose Matte Anything model (MatAny), an interactive natural image matting model which could produce high-quality alpha-matte with various simple hints. The key insight of MatAny is to generate pseudo trimap automatically with contour and transparency prediction. We leverage task-specific vision models to enhance the performance of natural image matting. Specifically, we use the segment anything model (SAM) to predict high-quality contour with user interaction and an open-vocabulary (OV) detector to predict the transparency of any object. Subsequently, a pretrained image matting model generates alpha mattes with pseudo trimaps. MatAny is the interactive matting algorithm with the most supported interaction methods and the best performance to date. It consists of orthogonal vision models without any additional training. We evaluate the performance of MatAny against several current image matting algorithms, and the results demonstrate the significant potential of our approach.
https://arxiv.org/abs/2306.04121
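A sketch of the pseudo-trimap generation step described above: erode/dilate a SAM mask to obtain fg/bg/unknown bands, and if the open-vocabulary detector flags the object as transparent, leave the object interior unknown as well. The band width and the transparency rule are illustrative assumptions, not MatAny's exact procedure.

```python
import torch
import torch.nn.functional as F

def pseudo_trimap(mask, is_transparent, band=10):
    """mask: (B,1,H,W) binary SAM mask. Returns trimap with 0=bg, 0.5=unknown, 1=fg."""
    dilated = F.max_pool2d(mask, 2 * band + 1, stride=1, padding=band)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, 2 * band + 1, stride=1, padding=band)
    trimap = torch.full_like(mask, 0.5)            # unknown band by default
    trimap[dilated < 0.5] = 0.0                    # background
    if is_transparent:
        trimap[mask > 0.5] = 0.5                   # let the matting net decide inside
    else:
        trimap[eroded > 0.5] = 1.0                 # solid foreground core
    return trimap

mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0
print(pseudo_trimap(mask, is_transparent=True).unique())   # tensor([0.0000, 0.5000])
```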
Cutting out an object and estimating its opacity mask, known as image matting, is a key task in image and video editing. Because the problem is highly ill-posed, additional inputs, typically user-defined trimaps or scribbles, are usually needed to reduce the uncertainty. Although effective, this is either time-consuming or only suitable for experienced users who know where to place the strokes. In this work, we propose a decomposed-uncertainty-guided matting (dugMatting) algorithm, which explores explicitly decomposed uncertainties to efficiently and effectively improve the results. Based on the characteristics of these uncertainties, the epistemic uncertainty is reduced in the process of guiding interaction (which introduces prior knowledge), while the aleatoric uncertainty is reduced in modeling the data distribution (which introduces statistics for both the data and possible noise). The proposed matting framework relieves users of the requirement to determine the interaction areas, using simple and efficient labeling instead. Extensive quantitative and qualitative results validate that the proposed method significantly improves the original matting algorithms in terms of both efficiency and efficacy.
https://arxiv.org/abs/2306.01452
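A minimal sketch of one standard way to decompose the two uncertainties named above: epistemic uncertainty as the spread of the mean prediction across stochastic (dropout) passes, aleatoric uncertainty as the average predicted noise variance. The tiny head and the MC-dropout recipe are assumptions for illustration, not dugMatting's exact formulation.

```python
import torch
import torch.nn as nn

class ProbMattingHead(nn.Module):
    """Toy head predicting an alpha mean and a per-pixel log-variance (aleatoric),
    with dropout so repeated stochastic passes expose epistemic uncertainty."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.2),
            nn.Conv2d(16, 2, 3, padding=1))                 # channels: mean, log-var

    def forward(self, x):
        mean, log_var = self.body(x).chunk(2, dim=1)
        return torch.sigmoid(mean), log_var

def decomposed_uncertainty(model, image, passes=8):
    model.train()                                          # keep dropout active
    with torch.no_grad():
        means, ale = zip(*[model(image) for _ in range(passes)])
    epistemic = torch.stack(means).var(dim=0)              # where to ask the user
    aleatoric = torch.stack(ale).exp().mean(dim=0)         # data/noise uncertainty
    return epistemic, aleatoric

ep, al = decomposed_uncertainty(ProbMattingHead(), torch.rand(1, 3, 64, 64))
print(ep.shape, al.shape)
```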
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) a detail capture module, consisting only of simple lightweight convolutions, to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs on image matting with concise adaptation. It inherits many superior properties from ViTs, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting; our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
https://arxiv.org/abs/2305.15272
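The abstract says the detail capture module consists only of simple lightweight convolutions; below is a hedged sketch of that pattern — a small conv stream over the raw image fused with upsampled ViT features before alpha prediction. Channel sizes and the fusion layout are illustrative, not ViTMatte's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailCapture(nn.Module):
    """Lightweight detail stream fused with coarse ViT features for alpha prediction."""
    def __init__(self, vit_dim=384):
        super().__init__()
        self.detail = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.fuse = nn.Sequential(
            nn.Conv2d(vit_dim + 32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, image, vit_feat):                    # vit_feat: coarse ViT map
        detail = self.detail(image)                        # full-resolution detail cues
        vit_up = F.interpolate(vit_feat, size=image.shape[2:], mode="bilinear",
                               align_corners=False)
        return torch.sigmoid(self.fuse(torch.cat([vit_up, detail], dim=1)))

image = torch.rand(1, 3, 64, 64)
vit_feat = torch.rand(1, 384, 4, 4)                        # stand-in for ViT features
print(DetailCapture()(image, vit_feat).shape)              # torch.Size([1, 1, 64, 64])
```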