Visible watermarks pose significant challenges for image restoration techniques, especially when the target background is unknown. Toward this end, we present MorphoMod, a novel method for automated visible watermark removal that operates in a blind setting -- without requiring target images. Unlike existing methods, MorphoMod effectively removes opaque and transparent watermarks while preserving semantic content, making it well-suited for real-world applications. Evaluations on benchmark datasets, including the Colored Large-scale Watermark Dataset (CLWD), LOGO-series, and the newly introduced Alpha1 datasets, demonstrate that MorphoMod achieves up to a 50.8% improvement in watermark removal effectiveness compared to state-of-the-art methods. Ablation studies highlight the impact of prompts used for inpainting, pre-removal filling strategies, and inpainting model performance on watermark removal. Additionally, a case study on steganographic disorientation reveals broader applications for watermark removal in disrupting high-level hidden messages. MorphoMod offers a robust, adaptable solution for watermark removal and opens avenues for further advancements in image restoration and adversarial manipulation.
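As a rough illustration of the "pre-removal filling" idea examined in the ablations, the sketch below dilates a given watermark mask and pre-fills it with a classical inpainter before a prompt-guided generative model refines the region; the function name, dilation size, and choice of Telea filling are assumptions, not MorphoMod's actual implementation.

```python
import cv2
import numpy as np

def prefill_watermark(image_bgr: np.ndarray, wm_mask: np.ndarray, dilate_px: int = 5) -> np.ndarray:
    """Pre-fill a (uint8, BGR) image inside a dilated watermark mask.

    The result would then be handed to a prompt-guided inpainting model."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (dilate_px, dilate_px))
    mask = cv2.dilate((wm_mask > 0).astype(np.uint8) * 255, kernel)
    # Telea fast-marching fill gives a semantically neutral initialization.
    return cv2.inpaint(image_bgr, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```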
https://arxiv.org/abs/2502.02676
3D Gaussian Splatting (3DGS) has demonstrated superior quality in modeling 3D objects and scenes. However, generating 3DGS remains challenging due to their discrete, unstructured, and permutation-invariant nature. In this work, we present a simple yet effective method to overcome these challenges. We utilize spherical mapping to transform 3DGS into a structured 2D representation, termed UVGS. UVGS can be viewed as multi-channel images, with feature dimensions as a concatenation of Gaussian attributes such as position, scale, color, opacity, and rotation. We further find that these heterogeneous features can be compressed into a lower-dimensional (e.g., 3-channel) shared feature space using a carefully designed multi-branch network. The compressed UVGS can be treated as typical RGB images. Remarkably, we discover that typical VAEs trained with latent diffusion models can directly generalize to this new representation without additional training. Our novel representation makes it effortless to leverage foundational 2D models, such as diffusion models, to directly model 3DGS. Additionally, one can simply increase the 2D UV resolution to accommodate more Gaussians, making UVGS a scalable solution compared to typical 3D backbones. This approach immediately unlocks various novel generation applications of 3DGS by inherently utilizing the already developed superior 2D generation capabilities. In our experiments, we demonstrate various unconditional, conditional generation, and inpainting applications of 3DGS based on diffusion models, which were previously non-trivial.
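A minimal sketch of the mapping the abstract describes, assuming a per-Gaussian attribute layout (position, scale, rotation, color, opacity) and a simple "last write wins" rule for texels that receive several Gaussians; the real UVGS construction may differ.

```python
import numpy as np

def gaussians_to_uvgs(xyz, scale, rot, rgb, opacity, res=512):
    """Spherically map Gaussian centers to a res x res UV grid and store
    concatenated attributes per texel as a multi-channel 'image'."""
    centered = xyz - xyz.mean(axis=0)
    r = np.linalg.norm(centered, axis=1) + 1e-8
    theta = np.arctan2(centered[:, 1], centered[:, 0])        # azimuth in [-pi, pi]
    phi = np.arccos(np.clip(centered[:, 2] / r, -1.0, 1.0))   # polar angle in [0, pi]
    u = ((theta + np.pi) / (2 * np.pi) * (res - 1)).astype(int)
    v = (phi / np.pi * (res - 1)).astype(int)
    feats = np.concatenate([xyz, scale, rot, rgb, opacity[:, None]], axis=1)
    uvgs = np.zeros((res, res, feats.shape[1]), dtype=np.float32)
    uvgs[v, u] = feats                                        # collisions: last write wins
    return uvgs
```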
https://arxiv.org/abs/2502.01846
Video inpainting aims to fill in corrupted regions of the video with plausible contents. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the "how to inpaint". This reliance necessitates manual annotation of the corrupted regions using binary masks to indicate "where to inpaint". However, the annotation of these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we relax this assumption by defining a new blind video inpainting setting, enabling the network to learn the mapping from a corrupted video to its inpainted result directly, eliminating the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet can predict the masks of corrupted regions by detecting semantically discontinuous regions of the frame and utilizing the temporal-consistency prior of the video. On the other hand, the predicted masks are incorporated into BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. In addition, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Furthermore, we curate a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. This dataset serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
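A hedged sketch of how the "where" and "how" objectives can be coupled with a consistency term; the interfaces and loss weights below are illustrative placeholders, not BVINet's exact formulation.

```python
import torch
import torch.nn.functional as F

def blind_inpaint_loss(pred_mask, pred_video, gt_mask, gt_video,
                       w_mask=1.0, w_rec=1.0, w_cons=0.1):
    """pred_mask: sigmoid output in [0, 1]; pred_video/gt_video: same shape."""
    loss_mask = F.binary_cross_entropy(pred_mask, gt_mask)    # "where to inpaint"
    loss_rec = F.l1_loss(pred_video, gt_video)                # "how to inpaint"
    # Consistency: the completion branch should leave regions the mask head
    # considers valid (mask ~ 0) untouched.
    loss_cons = (torch.abs(pred_video - gt_video) * (1 - pred_mask)).mean()
    return w_mask * loss_mask + w_rec * loss_rec + w_cons * loss_cons
```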
https://arxiv.org/abs/2502.01181
Frequent, high-resolution remote sensing imagery is crucial for agricultural and environmental monitoring. Satellites from the Landsat collection offer detailed imagery at 30 m resolution but with lower temporal frequency, whereas missions like MODIS and VIIRS provide daily coverage at coarser resolutions. Clouds and cloud shadows contaminate about 55% of optical remote sensing observations, posing additional challenges. To address these challenges, we present SatFlow, a generative-model-based framework that fuses low-resolution MODIS imagery and Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Our model, trained via Conditional Flow Matching, demonstrates better performance in generating imagery with preserved structural and spectral integrity. Cloud imputation is treated as an image inpainting task, where the model reconstructs cloud-contaminated pixels and fills gaps caused by scan lines during inference by leveraging the learned generative processes. Experimental results demonstrate the capability of our approach to reliably impute cloud-covered regions. This capability is crucial for downstream applications such as crop phenology tracking and environmental change detection.
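For readers unfamiliar with Conditional Flow Matching, a minimal training step looks roughly like the following; the velocity network `v_theta` conditioned on coarse MODIS imagery is a placeholder, not SatFlow's actual architecture.

```python
import torch

def cfm_step(v_theta, landsat, modis_cond):
    """One Conditional Flow Matching loss evaluation on a batch."""
    x1 = landsat                               # clean, high-resolution target
    x0 = torch.randn_like(x1)                  # Gaussian source sample
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                # point on the straight-line path
    target_v = x1 - x0                         # constant velocity along that path
    pred_v = v_theta(x_t, t.flatten(), modis_cond)
    return ((pred_v - target_v) ** 2).mean()
```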
https://arxiv.org/abs/2502.01098
In recent years, Transformer-based models have made significant progress in the field of image restoration by leveraging their inherent ability to capture complex contextual features. More recently, Mamba models have attracted considerable attention in computer vision due to their ability to handle long-range dependencies and their significant computational efficiency compared to Transformers. However, Mamba currently lags behind Transformers in contextual learning capability. To overcome the limitations of these two models, we propose MatIR, a Mamba-Transformer hybrid image restoration model. Specifically, MatIR cross-cycles blocks of the Transformer layer and the Mamba layer to extract features, thereby taking full advantage of both architectures. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths to achieve efficient processing of long-sequence data. In the Transformer module, we combine triangular-window-based local attention with channel-based global attention to effectively activate the attention mechanism over a wider range of image pixels. Extensive experimental results and ablation studies demonstrate the effectiveness of our approach.
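A structural sketch of the "cross-cycling" of Transformer and Mamba layers; the state-space block here is a gated-MLP stand-in, since the actual IRSS module (with its four scan paths) is not reproduced.

```python
import torch
import torch.nn as nn

class HybridStage(nn.Module):
    """Alternate attention layers with a stand-in for the Mamba/IRSS layer."""
    def __init__(self, dim=64, depth=4, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(depth):
            if i % 2 == 0:    # Transformer layer (local + global attention in MatIR)
                self.blocks.append(nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True))
            else:             # placeholder for the IRSS state-space layer
                self.blocks.append(nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                                 nn.SiLU(), nn.Linear(dim, dim)))

    def forward(self, tokens):                 # tokens: (B, N, dim) flattened features
        for blk in self.blocks:
            if isinstance(blk, nn.TransformerEncoderLayer):
                tokens = blk(tokens)           # attention block has its own residuals
            else:
                tokens = tokens + blk(tokens)  # residual around the SSM stand-in
        return tokens
```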
https://arxiv.org/abs/2501.18401
Object removal is of great significance to 3D scene understanding, essential for applications in content filtering and scene editing. Current mainstream methods primarily focus on removing individual objects, with a few dedicated to eliminating an entire area or all objects of a certain category. However, they lack the granularity and flexibility required for real-world applications, where users demand tailored excision and preservation of objects within defined zones. In addition, most current methods require various priors when addressing multi-view inpainting, which is time-consuming. To address these limitations, we propose an efficient and user-friendly pipeline for 3D multi-object removal, enabling users to flexibly select areas and define objects for removal or preservation. Concretely, to ensure object consistency and correspondence across multiple views, we propose a novel mask matching and refinement module, which integrates homography-based warping with high-confidence anchor points for segmentation. By leveraging an IoU joint shape-context distance loss, we enhance the accuracy of warped masks and improve the subsequent inpainting process. Considering the current immaturity of 3D multi-object removal, we provide a new evaluation dataset to bridge the developmental void. Experimental results demonstrate that our method significantly reduces computational costs, achieving processing speeds more than 80% faster than state-of-the-art methods while maintaining equivalent or higher reconstruction quality.
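The mask matching step can be pictured as below: a homography estimated from high-confidence anchor correspondences warps a segmentation mask into a neighboring view. The matcher that produces the anchors, and the subsequent IoU / shape-context refinement, are assumed given and are not shown.

```python
import cv2
import numpy as np

def warp_mask_to_view(mask_src, anchors_src, anchors_dst, dst_hw):
    """mask_src: (H, W) binary mask; anchors_*: (N, 2) float32 pixel coords."""
    H, inliers = cv2.findHomography(anchors_src, anchors_dst, cv2.RANSAC, 3.0)
    warped = cv2.warpPerspective(mask_src.astype(np.uint8), H, (dst_hw[1], dst_hw[0]))
    return (warped > 0).astype(np.uint8), inliers   # inliers flag the surviving anchors
```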
https://arxiv.org/abs/2501.17636
We generate abstractions of buildings, reflecting the essential aspects of their geometry and structure, by learning to invert procedural models. We first build a dataset of abstract procedural building models paired with simulated point clouds and then learn the inverse mapping through a transformer. Given a point cloud, the trained transformer then infers the corresponding abstracted building in terms of a programmatic language description. This approach leverages expressive procedural models developed for gaming and animation, and thereby retains desirable properties such as efficient rendering of the inferred abstractions and strong priors for regularity and symmetry. Our approach achieves good reconstruction accuracy in terms of geometry and structure, as well as structurally consistent inpainting.
https://arxiv.org/abs/2501.17044
Recent advancements in virtual fitting for characters and clothing have leveraged diffusion models to improve the realism of garment fitting. However, challenges remain in handling complex scenes and poses, which can result in unnatural garment fitting and poorly rendered intricate patterns. In this work, we introduce ITVTON, a novel method that enhances clothing-character interactions by combining clothing and character images along spatial channels as inputs, thereby improving fitting accuracy for the inpainting model. Additionally, we incorporate integrated textual descriptions from multiple images to boost the realism of the generated visual effects. To optimize computational efficiency, we limit training to the attention parameters within a single diffusion transformer (Single-DiT) block. To more rigorously address the complexities of real-world scenarios, we curated training samples from the IGPair dataset, thereby enhancing ITVTON's performance across diverse environments. Extensive experiments demonstrate that ITVTON outperforms baseline methods both qualitatively and quantitatively, setting a new standard for virtual fitting tasks.
https://arxiv.org/abs/2501.16757
Digital Surface Models (DSMs) are essential for accurately representing Earth's topography in geospatial analyses. DSMs capture detailed elevations of natural and man-made features, crucial for applications like urban planning, vegetation studies, and 3D reconstruction. However, DSMs derived from stereo satellite imagery often contain voids or missing data due to occlusions, shadows, and low-signal areas. Previous studies have primarily focused on void filling for Digital Elevation Models (DEMs) and Digital Terrain Models (DTMs), employing methods such as inverse distance weighting (IDW), kriging, and spline interpolation. While effective for simpler terrains, these approaches often fail to handle the intricate structures present in DSMs. To overcome these limitations, we introduce Dfilled, a guided DSM void-filling method that leverages optical remote sensing images through edge-enhancing diffusion. Dfilled repurposes deep anisotropic diffusion models, originally designed for super-resolution tasks, to inpaint DSMs. Additionally, we utilize Perlin noise to create inpainting masks that mimic natural void patterns in DSMs. Experimental evaluations demonstrate that Dfilled surpasses traditional interpolation methods and deep learning approaches in DSM void-filling tasks. Both quantitative and qualitative assessments highlight the method's ability to manage complex features and deliver accurate, visually coherent results.
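To illustrate the mask-generation idea, the snippet below produces organic, Perlin-style void masks by thresholding a smoothed value-noise field; the grid size and coverage ratio are assumed hyperparameters, and true Perlin noise could be substituted.

```python
import cv2
import numpy as np

def perlin_like_mask(h, w, grid=8, coverage=0.3, seed=0):
    """Return an (h, w) binary mask where 1 marks a void region to inpaint."""
    rng = np.random.default_rng(seed)
    coarse = rng.random((grid, grid)).astype(np.float32)
    field = cv2.resize(coarse, (w, h), interpolation=cv2.INTER_CUBIC)  # smooth noise field
    thresh = np.quantile(field, coverage)      # mask roughly `coverage` of the pixels
    return (field < thresh).astype(np.uint8)
```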
https://arxiv.org/abs/2501.15440
Diffusion models have shown great promise in solving inverse problems in image processing. In this paper, we propose a novel, problem-agnostic diffusion method for inverse problems: a maximum a posteriori (MAP)-based guided-term estimation approach. We divide the conditional score function into two terms according to Bayes' rule: the unconditional score function and the guided term. The unconditional score function is approximated by an existing score network, while we design a MAP-based method to estimate the guided term. To estimate the guided term, we rely on the assumption that the space of clean natural images is inherently smooth and introduce a MAP estimate of the $t$-th latent variable. We then substitute this estimate into the expression of the inverse problem to obtain an approximation of the guided term. We evaluate our method extensively on super-resolution, inpainting, and denoising tasks, and demonstrate performance comparable to DDRM, DMPS, DPS, and $\Pi$GDM.
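The Bayes split described above can be written explicitly; the measurement model $y = A x_0 + n$ and the clean-image estimate $\hat{x}_0(x_t)$ below are generic notation for this family of guided-diffusion solvers, not necessarily the paper's exact symbols:

$$
\nabla_{x_t}\log p(x_t \mid y) \;=\; \underbrace{\nabla_{x_t}\log p(x_t)}_{\text{unconditional score network}} \;+\; \underbrace{\nabla_{x_t}\log p(y \mid x_t)}_{\text{guided term}}, \qquad \nabla_{x_t}\log p(y \mid x_t) \;\approx\; \nabla_{x_t}\log p\!\left(y \mid \hat{x}_0(x_t)\right).
$$

The first term is supplied by the pretrained score network, while the second is approximated by substituting a MAP-style estimate of the clean image into the inverse-problem expression, e.g. $p(y \mid x_0) \propto \exp\!\big(-\|y - A x_0\|^2 / 2\sigma^2\big)$ for a linear measurement with Gaussian noise.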
https://arxiv.org/abs/2501.15128
Single-view novel view synthesis (NVS) is a notorious problem due to its ill-posed nature, and often requires large, computationally expensive approaches to produce tangible results. In this paper, we propose CheapNVS: a fully end-to-end approach for narrow baseline single-view NVS based on a novel, efficient multiple encoder/decoder design trained in a multi-stage fashion. CheapNVS first approximates the laborious 3D image warping with lightweight learnable modules that are conditioned on the camera pose embeddings of the target view, and then performs inpainting on the occluded regions in parallel to achieve significant performance gains. Once trained on a subset of Open Images dataset, CheapNVS outperforms the state-of-the-art despite being 10 times faster and consuming 6% less memory. Furthermore, CheapNVS runs comfortably in real-time on mobile devices, reaching over 30 FPS on a Samsung Tab 9+.
https://arxiv.org/abs/2501.14533
In E-commerce platforms, a full advertising image is composed of a background image and marketing taglines. Automatic ad image design reduces human costs and plays a crucial role. For the convenience of users, a novel automatic framework named Product-Centric Advertising Image Design (PAID) is proposed in this work. PAID takes the product foreground image, required taglines, and target size as input and creates an ad image automatically. PAID consists of four sequential stages: prompt generation, layout generation, background image generation, and graphics rendering. Different expert models are trained to conduct these sub-tasks. A visual language model (VLM) based prompt generation model is leveraged to produce a product-matching background prompt. The layout generation model jointly predicts text and image layout according to the background prompt, product, and taglines to achieve the best harmony. An SDXL-based layout-controlled inpainting model is trained to generate an aesthetic background image. Previous ad image design methods take a background image as input and then predict the layout of taglines, which limits the spatial layout due to fixed image content. Innovatively, our PAID adjusts the stages to produce an unrestricted layout. To complete the PAID framework, we created two high-quality datasets, PITA and PIL. Extensive experimental results show that PAID creates more visually pleasing advertising images than previous methods.
https://arxiv.org/abs/2501.14316
We introduce the Binary Diffusion Probabilistic Model (BDPM), a novel generative model optimized for binary data representations. While denoising diffusion probabilistic models (DDPMs) have demonstrated notable success in tasks like image synthesis and restoration, traditional DDPMs rely on continuous data representations and mean squared error (MSE) loss for training, applying Gaussian noise models that may not be optimal for discrete or binary data structures. BDPM addresses this by decomposing images into bitplanes and employing XOR-based noise transformations, with a denoising model trained using binary cross-entropy loss. This approach enables precise noise control and computationally efficient inference, significantly lowering computational costs and improving model convergence. When evaluated on image restoration tasks such as image super-resolution, inpainting, and blind image restoration, BDPM outperforms state-of-the-art methods on the FFHQ, CelebA, and CelebA-HQ datasets. Notably, BDPM requires fewer inference steps than traditional DDPM models to reach optimal results, showcasing enhanced inference efficiency.
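The bitplane decomposition and XOR corruption can be sketched in a few lines; the flip-probability schedule and tensor layout are assumptions, not the exact BDPM recipe.

```python
import torch

def to_bitplanes(img_uint8: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) uint8 image -> (B, C*8, H, W) float bitplanes in {0, 1}."""
    bits = [(img_uint8 >> k) & 1 for k in range(8)]
    return torch.cat(bits, dim=1).float()

def xor_noise(bitplanes: torch.Tensor, flip_prob: float) -> torch.Tensor:
    """Corrupt binary data by XOR with Bernoulli(flip_prob) bit flips."""
    flips = (torch.rand_like(bitplanes) < flip_prob).float()
    return (bitplanes + flips) % 2.0           # XOR on {0, 1}-valued tensors

# The denoiser then predicts the clean bits and is trained with
# torch.nn.functional.binary_cross_entropy(pred_bits, clean_bits).
```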
https://arxiv.org/abs/2501.13915
Traditionally, autonomous reconnaissance applications have acted on explicit sets of historical observations. Aided by recent breakthroughs in generative technologies, this work enables robot teams to act beyond what is currently known about the environment by inferring a distribution of reasonable interpretations of the scene. We developed a map predictor that inpaints the unknown space in a multi-agent 2D occupancy map during an exploration mission. From a comparison of several inpainting methods, we found that a fine-tuned latent diffusion inpainting model could provide rich and coherent interpretations of simulated urban environments with relatively little computation time. By iteratively inferring interpretations of the scene throughout an exploration run, we are able to identify areas that exhibit high uncertainty in the prediction, which we formalize with the concept of generative entropy. We prioritize tasks in regions of high generative entropy, hypothesizing that this will expedite convergence on an accurate predicted map of the scene. In our study we juxtapose this new paradigm of task ranking with the state of the art, which ranks regions to explore by those which maximize expected information recovery. We compare both of these methods in a simulated urban environment with three vehicles. Our results demonstrate that by using our new task ranking method, we can predict a correct scene significantly faster than with a traditional information-guided method.
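The "generative entropy" measure can be approximated by sampling several inpainted completions of the unknown space and scoring per-cell disagreement; treating each sampled map as an occupancy probability in [0, 1] is an assumption about the formulation.

```python
import numpy as np

def generative_entropy(samples: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """samples: (N, H, W) inpainted occupancy maps with values in [0, 1].
    Returns an (H, W) map of per-cell binary entropy."""
    p = samples.mean(axis=0)                   # per-cell occupancy estimate
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
```

Cells with the highest entropy would then be ranked first as exploration tasks.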
https://arxiv.org/abs/2501.13189
Diffusion models are state-of-the-art for image generation. Trained on large datasets, they capture expressive image priors that have been used for tasks like inpainting, depth, and (surface) normal prediction. However, these models are typically trained for one specific task, e.g., a separate model for each of color, depth, and normal prediction. Such models do not leverage the intrinsic correlation between appearance and geometry, often leading to inconsistent predictions. In this paper, we propose using a novel image diffusion prior that jointly encodes appearance and geometry. We introduce a diffusion model Orchid, comprising a Variational Autoencoder (VAE) to encode color, depth, and surface normals to a latent space, and a Latent Diffusion Model (LDM) for generating these joint latents. Orchid directly generates photo-realistic color images, relative depth, and surface normals from user-provided text, and can be used to create image-aligned partial 3D scenes seamlessly. It can also perform image-conditioned tasks like joint monocular depth and normal prediction and is competitive in accuracy to state-of-the-art methods designed for those tasks alone. Lastly, our model learns a joint prior that can be used zero-shot as a regularizer for many inverse problems that entangle appearance and geometry. For example, we demonstrate its effectiveness in color-depth-normal inpainting, showcasing its applicability to problems in 3D generation from sparse views.
https://arxiv.org/abs/2501.13087
Rigged objects are commonly used in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating the rigs into realistic affordance-aware postures (e.g., following the context, respecting the physics and the personalities of the object) remains time-consuming and heavily relies on human labor from experienced artists. In this paper, we tackle this novel problem and design A3Syn. Given a context, such as the environment mesh and a text prompt of the desired posture, A3Syn synthesizes articulation parameters for arbitrary and open-domain rigged objects obtained from the Internet. The task is incredibly challenging due to the lack of training data, and we do not make any topological assumptions about the open-domain rigs. We propose using a 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. Then, we develop an efficient bone correspondence alignment using a combination of differentiable rendering and semantic correspondence. A3Syn has stable convergence, completes in minutes, and synthesizes plausible affordances for different combinations of in-the-wild object rigs and scenes.
https://arxiv.org/abs/2501.12393
Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames, either in the image space or the feature space. However, they produce severe artifacts at the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporally coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning a diffusion model on the reverse diffusion process to produce temporally coherent inpainting results without requiring any training data or fine-tuning of the pre-trained diffusion models. VipDiff takes optical flow as guidance to extract valid pixels from reference frames to serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results over different sampled noise. Experiments demonstrate that VipDiff largely outperforms state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.
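A rough sketch of the training-free conditioning: the initial Gaussian noise is optimized so that the frozen diffusion sampler reproduces the flow-propagated "known" pixels. `sample_with_frozen_model` is a hypothetical differentiable wrapper around a pretrained sampler, not an existing library call.

```python
import torch

def optimize_noise(sample_with_frozen_model, propagated_pixels, valid_mask,
                   steps=50, lr=0.05):
    """Only the noise is optimized; the pretrained diffusion weights stay frozen."""
    noise = torch.randn_like(propagated_pixels, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        frame = sample_with_frozen_model(noise)            # reverse diffusion pass
        loss = ((frame - propagated_pixels).abs() * valid_mask).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```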
https://arxiv.org/abs/2501.12267
Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use exposure fusion, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, it often fails due to incorrect alignment or inconsistent lighting between inputs, or tone-mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with up to 9 stops of exposure difference. The key idea is to model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing highlight information in the over-exposed region. Using the under-exposed image as soft guidance, instead of a hard constraint, makes our model robust to potential alignment issues or lighting variations. Moreover, by utilizing the image prior of the generative model, our model also produces natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra-high dynamic range scenes, we capture a new real-world exposure fusion benchmark, the UltraFusion Dataset, with exposure differences of up to 9 stops, and experiments show that UltraFusion can generate beautiful, high-quality fusion results under various scenarios. An online demo is provided at this https URL.
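A minimal sketch of casting exposure fusion as guided inpainting: clipped highlights in the over-exposed frame define the "missing" region, and the aligned under-exposed frame enters as soft guidance channels. The saturation threshold and conditioning layout are assumptions, not UltraFusion's actual interface.

```python
import numpy as np

def guided_inpainting_inputs(over_exp, under_exp, sat_thresh=0.98):
    """over_exp, under_exp: aligned float images in [0, 1], shape (H, W, 3)."""
    highlight_mask = (over_exp.max(axis=-1) >= sat_thresh).astype(np.float32)  # 1 = clipped
    # Stack image, mask, and guidance as conditioning channels for a generative inpainter.
    cond = np.concatenate([over_exp, highlight_mask[..., None], under_exp], axis=-1)
    return highlight_mask, cond
```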
https://arxiv.org/abs/2501.11515
Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. This work presents an audio restoration model tailored for high-resolution music at 44.1 kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end without the need for a vocoder to predict waveform outputs, is able to restore hour-long audio inputs, and is trained on permissively licensed music data. A2SB achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets. Our demo website is https://research.this http URL.
https://arxiv.org/abs/2501.11311
Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of video diffusion models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.
https://arxiv.org/abs/2501.10018