Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated; yet, current INRs face limitations in capturing high-frequency components, diverse signal types, and handling inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with learnable activations in its initial layers can represent fine details in the underlying signals. Specifically, we propose SL$^{2}$A-INR, a hybrid network for INR with a single-layer learnable activation function that boosts the effectiveness of traditional ReLU-based MLPs. Our method achieves superior performance across diverse tasks, including image representation, 3D shape reconstruction, inpainting, single-image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and convergence rates for INR.
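To make the idea concrete, here is a minimal PyTorch sketch (not the authors' code): a coordinate MLP whose first layer uses a learnable activation, followed by plain ReLU layers. The particular parameterization of the learnable activation (a trainable sum of sinusoids) and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnableActivation(nn.Module):
    """Illustrative learnable activation: a trainable sum of sinusoids
    applied element-wise (an assumption, not the paper's exact form)."""
    def __init__(self, num_terms: int = 8):
        super().__init__()
        self.freq = nn.Parameter(torch.linspace(1.0, 30.0, num_terms))
        self.amp = nn.Parameter(torch.ones(num_terms) / num_terms)

    def forward(self, x):
        # x: (..., features) -> same shape
        return (self.amp * torch.sin(x.unsqueeze(-1) * self.freq)).sum(-1)

class HybridINR(nn.Module):
    """Coordinate MLP: learnable activation in the first layer only,
    standard ReLU layers afterwards."""
    def __init__(self, in_dim=2, hidden=256, depth=4, out_dim=3):
        super().__init__()
        self.first = nn.Linear(in_dim, hidden)
        self.act0 = LearnableActivation()
        layers = []
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, coords):            # coords: (N, in_dim) in [-1, 1]
        h = self.act0(self.first(coords))
        return self.head(self.body(h))

# usage: fit an RGB image by mapping (x, y) -> (r, g, b)
model = HybridINR()
xy = torch.rand(1024, 2) * 2 - 1
rgb = model(xy)                            # (1024, 3)
```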
https://arxiv.org/abs/2409.10836
E-commerce image generation has always been one of the core demands in the e-commerce field. The goal is to restore a missing background that matches the given main product. In the post-AIGC era, diffusion models are primarily used to generate product images, achieving impressive results. This paper systematically analyzes and addresses a core pain point in diffusion model generation: overcompletion, which refers to the difficulty of maintaining product features. We propose two solutions: 1. using an inpainting model fine-tuned with instance masks to mitigate this phenomenon; 2. adopting a training-free mask guidance approach, which incorporates refined product masks as constraints when combining ControlNet and UNet to generate the main product, thereby avoiding overcompletion of the product. Our method has achieved promising results in practical applications, and we hope it can serve as an inspiring technical report in this field.
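One common way to impose such a product-mask constraint during sampling is to pin the masked latents to the original product at every denoising step; the sketch below illustrates that generic idea only and is not necessarily the paper's exact ControlNet/UNet mechanism. The function names and the interface of `denoise_fn` are assumptions.

```python
import torch

def masked_denoise_step(x_t, x_product_noised, product_mask, denoise_fn, t):
    """One denoising step with a hard product-mask constraint (illustrative).

    x_t              : current latents, (B, C, H, W)
    x_product_noised : original product latents noised to the matching level
    product_mask     : 1 inside the product, 0 in the background to generate
    denoise_fn       : any callable performing one reverse-diffusion step
    """
    x_prev = denoise_fn(x_t, t)                      # model proposes the full frame
    # keep the product region fixed; only the background is actually generated
    return product_mask * x_product_noised + (1 - product_mask) * x_prev
```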
https://arxiv.org/abs/2409.09681
This study addresses the challenge of accurately segmenting 3D Gaussian Splatting from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha-blending characteristic of the splatting process for single-step optimization. By incorporating the background bias in our objective function, our method shows superior robustness in 3D segmentation against noise. Remarkably, our optimization completes within 30 seconds, about 50$\times$ faster than the best existing methods. Extensive experiments demonstrate the efficiency and robustness of our method in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. Demos and code will be available at this https URL.
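Because the rendered mask is linear in the per-Gaussian labels, a relaxed linear objective decouples per Gaussian and its optimum lies at 0/1. The toy sketch below illustrates that flavour of closed-form assignment with aggregated alpha-blending weights; the specific score, threshold, and background-bias handling are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np
from scipy.sparse import coo_matrix

def assign_gaussian_labels(weights, mask, background_bias=0.5):
    """Toy closed-form label assignment (illustrative assumption).
    weights : sparse (num_pixels x num_gaussians) alpha-blending contributions
    mask    : flattened binary 2D mask over the pixels
    Each Gaussian takes the label favoured by its aggregated, mask-weighted
    contribution; the bias term penalises weakly supported foreground labels."""
    fg_score = weights.T @ mask                      # contribution inside the mask
    total = np.asarray(weights.sum(axis=0)).ravel()  # total contribution per Gaussian
    return (fg_score > background_bias * total).astype(np.uint8)

# toy usage: 4 pixels, 3 Gaussians
W = coo_matrix(np.array([[0.9, 0.1, 0.0],
                         [0.8, 0.2, 0.0],
                         [0.0, 0.3, 0.7],
                         [0.0, 0.1, 0.9]])).tocsr()
m = np.array([1, 1, 0, 0], dtype=float)
print(assign_gaussian_labels(W, m))   # [1 0 0]
```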
https://arxiv.org/abs/2409.08270
Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is non-trivial in two respects: 1) solely relying on a single U-Net to align the text prompt and the visual object across all the denoising timesteps is insufficient to generate desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of diffusion models. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting, which infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against state-of-the-art methods. Code is available at \url{this https URL}.
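A toy sketch of the first stage only (semantic pre-inpainting), under stated assumptions: masked patch features are replaced by a learnable token, concatenated with text-prompt embeddings, and a Transformer encoder regresses the target object's semantics at the masked positions. Dimensions, the [MASK]-token scheme, and the regression loss are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemanticInpainter(nn.Module):
    def __init__(self, dim=512, n_layers=6, n_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)

    def forward(self, patch_feats, mask, text_feats):
        # patch_feats: (B, N, D) visual features; mask: (B, N), 1 = masked patch
        # text_feats : (B, T, D) prompt embeddings
        x = torch.where(mask.unsqueeze(-1).bool(), self.mask_token, patch_feats)
        x = torch.cat([x, text_feats], dim=1)          # context + prompt tokens
        out = self.encoder(x)[:, : patch_feats.size(1)]
        return out                                      # predicted semantics per patch

# illustrative training signal: regress target-object features at masked positions only
def pre_inpaint_loss(pred, target_feats, mask):
    diff = (pred - target_feats) ** 2
    return (diff * mask.unsqueeze(-1)).sum() / mask.sum().clamp(min=1)
```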
https://arxiv.org/abs/2409.08260
Color image restoration methods typically represent images as vectors in Euclidean space or combinations of three monochrome channels. However, they often overlook the correlation between these channels, leading to color distortion and artifacts in the reconstructed image. To address this, we present Quaternion Nuclear Norm Minus Frobenius Norm Minimization (QNMF), a novel approach for color image reconstruction. QNMF utilizes quaternion algebra to capture the relationships among RGB channels comprehensively. By employing a regularization technique that involves nuclear norm minus Frobenius norm, QNMF approximates the underlying low-rank structure of quaternion-encoded color images. Theoretical proofs are provided to ensure the method's mathematical integrity. Demonstrating versatility and efficacy, the QNMF regularizer excels in various color low-level vision tasks, including denoising, deblurring, inpainting, and random impulse noise removal, achieving state-of-the-art results.
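For reference, a plausible form of the resulting reconstruction objective (the exact weighting and constraints in the paper may differ) pairs the quaternion nuclear norm with a subtracted Frobenius term plus a data-fidelity term:

$$\min_{\dot{\mathbf{X}}}\;\underbrace{\|\dot{\mathbf{X}}\|_{*}-\alpha\,\|\dot{\mathbf{X}}\|_{F}}_{\text{low-rank prior}}\;+\;\frac{\lambda}{2}\,\big\|\mathcal{A}(\dot{\mathbf{X}})-\dot{\mathbf{Y}}\big\|_{F}^{2},\qquad \|\dot{\mathbf{X}}\|_{*}=\sum_{i}\sigma_{i}(\dot{\mathbf{X}}),\quad \|\dot{\mathbf{X}}\|_{F}=\Big(\sum_{i}\sigma_{i}^{2}(\dot{\mathbf{X}})\Big)^{1/2},$$

where $\dot{\mathbf{X}}$ is the quaternion matrix encoding the RGB channels, $\sigma_{i}$ its singular values, $\mathcal{A}$ the degradation operator (masking, blurring, noise, etc.), $\dot{\mathbf{Y}}$ the observation, and $\alpha,\lambda>0$ trade-off weights.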
https://arxiv.org/abs/2409.07797
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts performance to ensure the high-fidelity generation required by display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting occlusion masks, and stereo video inpainting. We utilize a pre-trained stable video diffusion model as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos with varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale, high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices such as the Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
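A minimal NumPy sketch of the first step, assuming a per-pixel disparity map derived from depth (not the paper's implementation): each frame is forward-warped to synthesize the second view, and target pixels that receive no source pixel form the occlusion mask that the inpainting stage must fill.

```python
import numpy as np

def splat_to_right_view(frame, disparity):
    """Forward-warp a left view to the right view by per-pixel disparity.
    frame: (H, W, 3) array; disparity: (H, W) in pixels (larger = nearer).
    Returns the warped view and an occlusion mask (1 = hole to inpaint)."""
    H, W = disparity.shape
    warped = np.zeros_like(frame)
    hit = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        xt = np.round(xs - disparity[y]).astype(int)   # target column per source pixel
        valid = (xt >= 0) & (xt < W)
        order = np.argsort(disparity[y])               # far first, so near pixels win
        for i in order:
            if valid[i]:
                warped[y, xt[i]] = frame[y, i]
                hit[y, xt[i]] = True
    occlusion_mask = (~hit).astype(np.uint8)
    return warped, occlusion_mask
```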
https://arxiv.org/abs/2409.07447
Despite promising progress in the face swapping task, realistic swapped images remain elusive, often marred by artifacts, particularly in scenarios involving high pose variation, color differences, and occlusion. To address these issues, we propose a novel approach that better harnesses diffusion models for face swapping by making the following core contributions. (a) We propose to re-frame the face-swapping task as a self-supervised, train-time inpainting problem, enhancing identity transfer while blending with the target image. (b) We introduce multi-step Denoising Diffusion Implicit Model (DDIM) sampling during training, reinforcing identity and perceptual similarities. (c) We introduce CLIP feature disentanglement to extract pose, expression, and lighting information from the target image, improving fidelity. (d) Further, we introduce a mask shuffling technique during inpainting training, which allows us to create a so-called universal model for swapping, with the additional feature of head swapping. Our model can swap hair and even accessories, going beyond traditional face swapping. Unlike prior works reliant on multiple off-the-shelf models, ours is a relatively unified approach and is therefore resilient to errors in other off-the-shelf models. Extensive experiments on the FFHQ and CelebA datasets validate the efficacy and robustness of our approach, showcasing high-fidelity, realistic face swapping with minimal inference time. Our code is available at this https URL.
https://arxiv.org/abs/2409.07269
During the COVID-19 pandemic, face masks have become ubiquitous in our lives. Face masks can cause some face recognition models to fail since they cover a significant portion of the face. In addition, removing face masks from captured images or videos can be desirable, e.g., for better social interaction and for image/video editing and enhancement purposes. Hence, we propose a generative face inpainting method to effectively recover/reconstruct the masked part of a face. Face inpainting is more challenging than traditional inpainting, since it requires high fidelity while maintaining identity at the same time. Our proposed method includes a Multi-scale Channel-Spatial Attention Module (M-CSAM) to mitigate the spatial information loss and learn the inter- and intra-channel correlation. In addition, we introduce an approach that enforces the supervised signal to focus on masked regions instead of the whole image. We also synthesize our own Masked-Faces dataset from the CelebA dataset by incorporating five different types of face masks, including surgical masks, regular masks, and scarves, which also cover the neck area. The experimental results show that our proposed method outperforms different baselines in terms of structural similarity index measure, peak signal-to-noise ratio, and L1 loss, while also providing better outputs qualitatively. The code will be made publicly available on GitHub.
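A small sketch of the masked-region supervision idea (the M-CSAM module itself is omitted, and the exact loss mix in the paper may differ): the reconstruction loss is computed only over pixels covered by the face mask, so gradients concentrate on the region being inpainted rather than the whole image.

```python
import torch
import torch.nn.functional as F

def masked_region_l1(pred, target, mask):
    """L1 loss restricted to the masked (covered) facial region.
    pred, target: (B, 3, H, W); mask: (B, 1, H, W), 1 where the face mask is."""
    per_pixel = F.l1_loss(pred, target, reduction="none")   # (B, 3, H, W)
    masked = per_pixel * mask                                # zero out the background
    return masked.sum() / (mask.sum() * pred.size(1)).clamp(min=1)
```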
https://arxiv.org/abs/2409.06845
Image vectorization is a process that converts a raster image into a scalable vector graphics format. The objective is to effectively remove the pixelization effect while representing the boundaries of the image by scalable parameterized curves. We propose a new image vectorization method with depth, which considers depth ordering among shapes and uses curvature-based inpainting to convexify shapes during vectorization. From a given color-quantized raster image, we first define each connected component of the same color as a shape layer, and construct a depth ordering among the layers using a newly proposed depth-ordering energy. The global depth ordering among all shapes is described by a directed graph, and we propose an energy to remove cycles within the graph. After constructing the depth ordering of shapes, we convexify occluded regions by Euler's elastica curvature-based variational inpainting, and leverage the stability of the Modica-Mortola double-well potential energy to inpaint large regions. This follows human visual perception, in which shape boundaries extend smoothly, and we assume shapes are likely to be convex. Finally, we fit Bézier curves to the boundaries and save the vectorization as an SVG file, which allows superposition of the curvature-based inpainted shapes following the depth ordering. This is a new way to vectorize images, by decomposing an image into scalable shape layers with computed depth ordering. This approach makes editing shapes and images more natural and intuitive. We also consider grouping shape layers for semantic vectorization. We present various numerical results and comparisons against recent layer-based vectorization methods to validate the proposed model.
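For reference, the curvature-based prior behind this convexification is Euler's elastica; a standard form of the energy (the constants and the exact phase-field relaxation used with the Modica-Mortola potential may differ in the paper) is

$$E(\gamma)=\int_{\gamma}\left(a+b\,\kappa^{2}\right)\,ds,\qquad a,b>0,$$

where $\kappa$ is the curvature and $s$ the arc length along the completed boundary $\gamma$; minimizing $E$ favors short, smoothly continued completions of the occluded contour.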
https://arxiv.org/abs/2409.06648
The paper focuses on inpainting missing parts of an audio signal spectrogram. First, a recent successful approach based on an untrained neural network is revisited and several modifications are proposed, improving the signal-to-noise ratio of the restored audio. Second, the Janssen algorithm, the autoregression-based state of the art for time-domain audio inpainting, is adapted to the time-frequency setting. This novel method, coined Janssen-TF, is compared to the neural network approach using both objective metrics and a subjective listening test, proving Janssen-TF to be superior in all the considered measures.
https://arxiv.org/abs/2409.06392
Score-based diffusion methods provide a powerful strategy to solve image restoration tasks by flexibly combining a pre-trained foundational prior model with a likelihood function specified during test time. Such methods are predominantly derived from two stochastic processes: reversing Ornstein-Uhlenbeck, which underpins the celebrated denoising diffusion probabilistic models (DDPM) and denoising diffusion implicit models (DDIM), and the Langevin diffusion process. The solutions delivered by DDPM and DDIM are often remarkably realistic, but they are not always consistent with measurements because of likelihood intractability issues and the associated required approximations. Alternatively, using a Langevin process circumvents the intractable likelihood issue, but usually leads to restoration results of inferior quality and longer computing times. This paper presents a novel and highly computationally efficient image restoration method that carefully embeds a foundational DDPM denoiser within an empirical Bayesian Langevin algorithm, which jointly calibrates key model hyper-parameters as it estimates the model's posterior mean. Extensive experimental results on three canonical tasks (image deblurring, super-resolution, and inpainting) demonstrate that the proposed approach improves on state-of-the-art strategies both in image estimation accuracy and computing time.
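A generic sketch of how a pretrained denoiser can drive a Langevin sampler for restoration (a plug-and-play unadjusted Langevin iteration, with the prior score obtained from the denoiser via Tweedie's identity). The empirical-Bayes hyper-parameter calibration that the paper adds on top is not shown; the callables, step size, and noise level are illustrative assumptions.

```python
import torch

def langevin_restore(y, forward_op, adjoint_op, denoiser, sigma, noise_level,
                     n_iters=500, step=1e-4):
    """Unadjusted Langevin iterations for  y = A(x) + noise  (illustrative).
    denoiser(x, noise_level) is a pretrained Gaussian denoiser (e.g. a DDPM
    denoiser at a fixed noise level); its residual gives a prior score via
    Tweedie:  grad log p(x) ~ (denoiser(x) - x) / noise_level**2."""
    x = adjoint_op(y).clone()
    for _ in range(n_iters):
        data_grad = adjoint_op(y - forward_op(x)) / sigma**2           # likelihood score
        prior_grad = (denoiser(x, noise_level) - x) / noise_level**2   # prior score
        x = x + step * (data_grad + prior_grad) \
              + (2 * step) ** 0.5 * torch.randn_like(x)
        # x = x.clamp(0, 1)  # optional projection to the image range
    return x
```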
https://arxiv.org/abs/2409.04384
Traffic sign recognition systems play a crucial role in assisting drivers to make informed decisions while driving. However, due to the heavy reliance on deep learning technologies, particularly for future connected and autonomous driving, these systems are susceptible to adversarial attacks that pose significant safety risks to both personal and public transportation. Notably, researchers recently identified a new attack vector to deceive sign recognition systems: projecting well-designed adversarial light patches onto traffic signs. In comparison with traditional adversarial stickers or graffiti, these emerging light patches exhibit heightened aggression due to their ease of implementation and outstanding stealthiness. To effectively counter this security threat, we propose a universal image inpainting mechanism, namely SafeSign. It relies on attention-enabled multi-view image fusion to repair traffic signs contaminated by adversarial light patches, thereby ensuring accurate sign recognition. Here, we initially explore the fundamental impact of malicious light patches on the local and global feature spaces of authentic traffic signs. Then, we design a binary mask-based U-Net image generation pipeline that outputs diverse contaminated sign patterns, to provide our image inpainting model with the needed training data. Following this, we develop an attention-mechanism-enabled neural network to jointly utilize the complementary information from multi-view images to repair contaminated signs. Finally, extensive experiments are conducted to evaluate SafeSign's effectiveness in resisting potential light patch-based attacks, bringing an average accuracy improvement of 54.8% across three widely used sign recognition models.
https://arxiv.org/abs/2409.04133
We introduce a novel method for updating 3D geospatial models, specifically targeting occlusion removal in large-scale maritime environments. Traditional 3D reconstruction techniques often face problems with dynamic objects, like cars or vessels, that obscure the true environment, leading to inaccurate models or requiring extensive manual editing. Our approach leverages deep learning techniques, including instance segmentation and generative inpainting, to directly modify both the texture and geometry of 3D meshes without the need for costly reprocessing. By selectively targeting occluding objects and preserving static elements, the method enhances both geometric and visual accuracy. This approach not only preserves structural and textural details of map data but also maintains compatibility with current geospatial standards, ensuring robust performance across diverse datasets. The results demonstrate significant improvements in 3D model fidelity, making this method highly applicable for maritime situational awareness and the dynamic display of auxiliary information.
https://arxiv.org/abs/2409.03451
Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: this https URL
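A tiny sketch of the batch-consistent sampling trick described above (the spatio-temporal optimization / DDS steps are omitted): the video's frames are stacked along the batch axis of an image diffusion model, and the stochastic noise injected at each reverse step is shared across that axis so frames stay mutually consistent. The DDIM-style update is standard; the `eps_model` interface and the cumulative-alpha arguments are assumptions.

```python
import torch

def batch_consistent_reverse_step(x_t, t, eps_model, alpha, alpha_prev, eta=1.0):
    """One DDIM-style reverse step applied to a whole clip at once.
    x_t: (T, C, H, W) -- the T video frames treated as an image batch.
    alpha, alpha_prev: cumulative alpha-bar values at steps t and t-1 (floats)."""
    eps = eps_model(x_t, t)                                    # per-frame noise prediction
    x0 = (x_t - (1 - alpha) ** 0.5 * eps) / alpha ** 0.5        # denoised estimate
    sigma = eta * (((1 - alpha_prev) / (1 - alpha)) * (1 - alpha / alpha_prev)) ** 0.5
    # key point: draw ONE noise sample and share it across the time/batch axis
    shared_noise = torch.randn(1, *x_t.shape[1:], device=x_t.device).expand_as(x_t)
    dir_xt = max(1.0 - alpha_prev - sigma ** 2, 0.0) ** 0.5 * eps
    return alpha_prev ** 0.5 * x0 + dir_xt + sigma * shared_noise
```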
https://arxiv.org/abs/2409.02574
Generation of VLSI layout patterns is essential for a wide range of Design For Manufacturability (DFM) studies. In this study, we investigate the potential of generative machine learning models for creating design rule legal metal layout patterns. Our results demonstrate that the proposed model can generate legal patterns in complex design rule settings and achieves a high diversity score. The designed system, with its flexible settings, supports both pattern generation with localized changes, and design rule violation correction. Our methodology is validated on Intel 18A Process Design Kit (PDK) and can produce a wide range of DRC-compliant pattern libraries with only 20 starter patterns.
https://arxiv.org/abs/2409.01348
We introduce $\texttt{ReMOVE}$, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, $\texttt{ReMOVE}$ addresses the challenge of evaluating inpainting without a reference image, common in practical scenarios. It effectively distinguishes between object removal and replacement. This is a key issue in diffusion models due to stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within masked regions (2) while preserving the background continuity. $\texttt{ReMOVE}$ not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.
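The exact formulation of $\texttt{ReMOVE}$ is in the paper; the sketch below only illustrates the underlying intuition with assumed components: deep patch features inside the edited (masked) region are compared with features of the surrounding background, so an edit that truly removed the object scores high (the region now resembles background) while a replacement scores low.

```python
import torch
import torch.nn.functional as F

def removal_score(patch_feats, mask):
    """Illustrative reference-free removal score (not the paper's exact metric).
    patch_feats: (N, D) per-patch features of the *edited* image from any
    pretrained ViT; mask: (N,) with 1 for patches inside the edited region."""
    mask = mask.bool()
    inside = patch_feats[mask].mean(dim=0)     # mean feature of the edited region
    outside = patch_feats[~mask].mean(dim=0)   # mean feature of the background
    # cosine similarity in [-1, 1]; ~1 => region blends into the background (removed)
    return F.cosine_similarity(inside, outside, dim=0)
```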
我们提出了$\texttt{ReMOVE}$,一种新的无参考图像的指标,用于评估基于扩散图像编辑模型的修复效果。与现有的指标如LPIPS和CLIPScore不同,$\texttt{ReMOVE}$解决了没有参考图像的情况下评估修复效果的挑战,这是在实际场景中非常普遍的。它有效地区分了物体删除和替换。这是由于扩散模型的随机性导致的。传统的指标无法与修复效果的直觉定义对齐,该定义旨在实现:(1)在遮罩区域中无缝的物体删除;(2)同时保留背景的连续性。$\texttt{ReMOVE}$不仅与最先进的指标相关联,还与人类感知相一致,并捕捉了修复过程的细微方面,为生成输出提供了更细粒度的评估。
https://arxiv.org/abs/2409.00707
Estimating global human motion from moving cameras is challenging due to the entanglement of human and camera motions. To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss to alleviate the scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms the state-of-the-art methods in terms of global human motion estimation and camera motion estimation. As an illustrative example, COIN outperforms the state-of-the-art method by 33% in world joint position error (W-MPJPE) on the RICH dataset.
https://arxiv.org/abs/2408.16426
Histopathological image analysis is crucial for accurate cancer diagnosis and treatment planning. While deep learning models, especially convolutional neural networks, have advanced this field, their "black-box" nature raises concerns about interpretability and trustworthiness. Explainable Artificial Intelligence (XAI) techniques aim to address these concerns, but evaluating their effectiveness remains challenging. A significant issue with current occlusion-based XAI methods is that they often generate Out-of-Distribution (OoD) samples, leading to inaccurate evaluations. In this paper, we introduce Inpainting-Based Occlusion (IBO), a novel occlusion strategy that utilizes a Denoising Diffusion Probabilistic Model to inpaint occluded regions in histopathological images. By replacing cancerous areas with realistic, non-cancerous tissue, IBO minimizes OoD artifacts and preserves data integrity. We evaluate our method on the CAMELYON16 dataset through two phases: first, by assessing perceptual similarity using the Learned Perceptual Image Patch Similarity (LPIPS) metric, and second, by quantifying the impact on model predictions through Area Under the Curve (AUC) analysis. Our results demonstrate that IBO significantly improves perceptual fidelity, achieving nearly twice the improvement in LPIPS scores compared to the best existing occlusion strategy. Additionally, IBO increased the precision of XAI performance prediction from 42% to 71% compared to traditional methods. These results demonstrate IBO's potential to provide more reliable evaluations of XAI techniques, benefiting histopathology and other applications. The source code for this study is available at this https URL.
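A schematic loop for inpainting-based occlusion, with the diffusion inpainter abstracted behind a hypothetical `inpaint(image, mask)` call: instead of blanking a region with a gray patch, the region is replaced by model-generated, in-distribution tissue, and the change in the classifier's output measures how much that region mattered. All names and interfaces here are assumptions.

```python
import numpy as np

def ibo_importance(image, regions, classify, inpaint):
    """Occlusion-style importance scores using inpainted (in-distribution) fills.
    image   : (H, W, 3) array
    regions : list of binary masks, one per region to test
    classify: callable returning the probability of the class of interest
    inpaint : callable (image, mask) -> image with the masked area filled by
              realistic tissue (e.g. a DDPM inpainter) -- assumed here."""
    base = classify(image)
    scores = []
    for mask in regions:
        filled = inpaint(image, mask)            # replace region, stay in-distribution
        scores.append(base - classify(filled))   # prediction drop = region importance
    return np.array(scores)
```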
https://arxiv.org/abs/2408.16395
Image restoration refers to the process of restoring a damaged low-quality image back to its corresponding high-quality image. Typically, convolutional neural networks are used to directly learn the mapping from low-quality images to high-quality images, achieving image restoration. Recently, a special type of diffusion bridge model has achieved more advanced results in image restoration. It transforms the direct mapping from low-quality to high-quality images into a diffusion process, restoring low-quality images through a reverse process. However, current diffusion bridge restoration models do not emphasize the idea of conditional control, which may affect performance. This paper introduces the ECDB model, which enhances the control of the diffusion bridge with low-quality images as conditions. Moreover, in response to the characteristic that diffusion models have a low denoising level at larger values of \(\bm t \), we also propose a Conditional Fusion Schedule, which handles the conditional feature information of the various modules more effectively. Experimental results prove that the ECDB model achieves state-of-the-art results in many image restoration tasks, including deraining, inpainting, and super-resolution. Code is available at this https URL.
https://arxiv.org/abs/2408.16303
Reducing the radiation dose in computed tomography (CT) is crucial, but it often results in sparse-view CT, where the number of available projections is significantly reduced. This reduction in projection data makes it challenging to accurately reconstruct high-quality CT images. In this condition, the sinogram, which is the collection of these projections, becomes incomplete. Sinogram inpainting then becomes essential because it enables accurate image reconstruction with limited projections. Existing models that perform well on conventional RGB images for inpainting mostly fail in the case of sinograms. Further, these models usually do not make full use of the sinogram's unique properties, e.g., its frequency features and absorption characteristics, and cannot handle large-area masks and complex real-world projections well. To address these limitations, we propose a novel model called the Frequency Convolution Diffusion Model (FCDM). It employs frequency-domain convolutions to extract frequency information from various angles and capture the intricate relationships between these angles, which is essential for high-quality CT reconstruction. We also design a specific loss function based on the unique properties of a sinogram to maintain consistency in physical properties, which allows the model to learn more effectively even with larger mask areas. We compare FCDM against nine inpainting models, two designed for sinograms and seven for RGB images, using both simulations and real data. The results indicate that our model significantly improves the quality of the inpainted sinograms both visually and quantitatively, with an SSIM of more than 0.95 and a PSNR of more than 30, achieving up to a 33% improvement in SSIM and a 29% improvement in PSNR compared to the baseline.
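A compact sketch of a frequency-domain convolution block of the kind the abstract describes; the actual FCDM layer design is not given in the abstract, so the layout below (FFT, complex-weighted mixing, inverse FFT, residual connection) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class FreqConvBlock(nn.Module):
    """Mixes sinogram features in the 2-D Fourier domain, then maps back."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution applied to the complex spectrum (real/imag stacked)
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, angles, detectors)
        spec = torch.fft.rfft2(x, norm="ortho")  # complex, (B, C, A, D//2+1)
        z = torch.cat([spec.real, spec.imag], dim=1)
        z = self.mix(z)
        re, im = torch.chunk(z, 2, dim=1)
        spec = torch.complex(re, im)
        out = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return x + out                           # residual connection

# usage
block = FreqConvBlock(channels=16)
sino = torch.randn(2, 16, 180, 256)              # 180 angles x 256 detector bins
print(block(sino).shape)                          # torch.Size([2, 16, 180, 256])
```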
https://arxiv.org/abs/2409.06714